The paper introduces LK Losses, a novel family of training objectives designed for speculative decoding draft models. Unlike standard approaches that minimize KL divergence as a proxy, LK Losses directly optimize the token acceptance rate, achieving significant speedups and up to 10% gains in average acceptance length across major LLM architectures like Llama-3.1, Qwen3, and DeepSeek-V3.
TL;DR
Speculative decoding's speed depends largely on how often the "draft" model guesses correctly. Traditionally, we train these drafters using KL Divergence, but that's just a proxy. LK Losses change the game by directly optimizing the Acceptance Rate. By introducing an adaptive hybrid loss that balances stability and direct target optimization, the authors achieve up to 10% higher acceptance lengths across models like Llama-3.3 and DeepSeek-V3.
The "Proxy" Problem: Why KL Isn't Enough
In speculative decoding, a tiny "draft" model (the student) predicts tokens that a massive "target" model (the teacher) verifies. The efficiency is measured by the Acceptance Rate ($\alpha$).
Historically, we've used Forward KL Divergence as the training objective. Mathematically, if KL is zero, $\alpha$ is 100%. But there's a catch: draft models are tiny. They lack the capacity to ever reach KL=0. In this "limited capacity" regime, minimizing KL might actually lead to a lower acceptance rate compared to other objectives.
As shown in the authors' motivating example, KL tends to be "mode-covering": it spreads probability mass too thin to avoid heavy penalties, whereas minimizing the Total Variation (TV) distance (which is exactly $1 - \alpha$) maximizes the overlap where it counts.
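To make the proxy gap concrete, here is a minimal sketch on a toy four-token vocabulary. It assumes the standard speculative sampling rule, under which the per-token acceptance rate is the overlap sum of min(p(x), q(x)) and therefore equals 1 minus the TV distance. The distributions and function names are made up for illustration and are not the paper's motivating example.

```python
import math

def acceptance_rate(p, q):
    """Per-token acceptance rate under standard speculative sampling:
    the overlap sum_x min(p(x), q(x)) between target p and draft q."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def tv_distance(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def forward_kl(p, q):
    """Forward KL divergence KL(p || q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative numbers only (assumption): one target, two candidate drafts.
target = [0.6, 0.3, 0.05, 0.05]
draft_cover = [0.4, 0.3, 0.15, 0.15]       # spreads mass over the tail
draft_peak = [0.699, 0.3, 0.0005, 0.0005]  # concentrates on the head

for name, q in [("cover", draft_cover), ("peak", draft_peak)]:
    print(name,
          "KL=%.3f" % forward_kl(target, q),
          "TV=%.3f" % tv_distance(target, q),
          "acceptance=%.3f" % acceptance_rate(target, q))
```

In this toy case the "cover" draft has the lower forward KL, yet the "peak" draft has the higher acceptance rate, so ranking drafts by KL and ranking them by acceptance can disagree.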
Figure 1: Traditional KL training (blue) vs LK losses (green/red). The gap widens as we try to predict more tokens.
Methodology: The Math of Direct Optimization
The authors identified that while TV distance is the "true" objective, it is a nightmare to optimize from scratch. Its gradients shrink as the vocabulary grows, so they all but vanish at the vocabulary sizes of modern LLMs.
To solve this, they introduced two strategies:
1. The Hybrid Adaptive Loss
This loss uses a curriculum. It starts as Forward KL to get the model into the right "neighborhood" (providing smooth, well-scaled gradients) and then adaptively shifts toward TV optimization as the acceptance rate improves. The weight is automatically tuned using a stop-gradient on the current acceptance rate.
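The curriculum idea can be sketched for a single token position as follows. Two things here are assumptions and not taken from the paper: the convex-combination form lam * KL + (1 - lam) * TV, and the schedule lam = 1 - acceptance rate, which is a hypothetical stand-in for whatever schedule the authors use. The stop-gradient is mimicked by treating lam as a constant when the gradient with respect to the draft logits is assembled by hand.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hybrid_loss_and_grad(logits, p):
    """Return (loss, lam, grad) for lam * KL(p || q) + (1 - lam) * TV(p, q),
    where q = softmax(logits) and grad is taken w.r.t. the draft logits."""
    q = softmax(logits)
    alpha = sum(min(pi, qi) for pi, qi in zip(p, q))  # acceptance rate
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    tv = 1.0 - alpha

    # ASSUMPTION: hypothetical schedule, not the paper's formula.
    # lam is used as a plain number below, i.e. no gradient flows through
    # it, which is what a stop-gradient would enforce in an autograd library.
    lam = 1.0 - alpha

    # d KL / d logits = q - p for a softmax-parameterized draft.
    grad_kl = [qi - pi for pi, qi in zip(p, q)]
    # d alpha / d logit_j = q_j * (m_j - sum_i m_i q_i), with m_i = 1[q_i < p_i];
    # TV = 1 - alpha, so its gradient is the negative of that.
    mask = [1.0 if qi < pi else 0.0 for pi, qi in zip(p, q)]
    mean_mask = sum(mi * qi for mi, qi in zip(mask, q))
    grad_tv = [-qi * (mi - mean_mask) for mi, qi in zip(mask, q)]

    grad = [lam * gk + (1.0 - lam) * gt for gk, gt in zip(grad_kl, grad_tv)]
    return lam * kl + (1.0 - lam) * tv, lam, grad

target = [0.6, 0.3, 0.05, 0.05]
# Untrained draft (uniform): low acceptance, so the KL term carries more weight.
loss_far, lam_far, grad_far = hybrid_loss_and_grad([0.0, 0.0, 0.0, 0.0], target)
# Nearly aligned draft: high acceptance, so the loss is almost pure TV.
near_logits = [math.log(v) for v in [0.55, 0.3, 0.075, 0.075]]
loss_near, lam_near, grad_near = hybrid_loss_and_grad(near_logits, target)
print("lam far: %.2f, lam near: %.2f" % (lam_far, lam_near))
```

With this stand-in schedule, the weight on KL falls as the draft aligns with the target, which matches the described shift from KL toward TV.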
2. The Log-Likelihood Approach
This loss minimizes the negative log of the marginal probability of acceptance, $-\log \alpha$, where $\alpha$ is the acceptance rate. By the chain rule, this objective naturally scales its own gradients by $1/\alpha$. When the model is doing poorly (low $\alpha$), the updates become massive, "kicking" the model toward better alignment.
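The gradient scaling can be written out in one line. The notation is an assumption for illustration: $p$ is the target distribution, $q_\theta$ the draft distribution with parameters $\theta$, and the acceptance rate is taken to be the overlap used in standard speculative sampling.

```latex
\alpha = \sum_{x} \min\bigl(p(x),\, q_\theta(x)\bigr) = 1 - \mathrm{TV}(p, q_\theta),
\qquad
\nabla_\theta \bigl(-\log \alpha\bigr)
  = -\frac{1}{\alpha}\,\nabla_\theta \alpha
  = \frac{1}{\alpha}\,\nabla_\theta \mathrm{TV}(p, q_\theta).
```

Under these assumptions the log-likelihood objective follows the TV gradient direction with a step amplified by the inverse acceptance rate: roughly 10x at an acceptance rate of 0.1, but only about 1.1x at 0.9.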
Figure 2: KL (middle) spreads out to cover the whole mixture, while TV (right) smartly overlaps with the highest mass areas to maximize acceptance.
Experiments: Dominating the Leaderboards
The authors tested LK Losses across an impressive array of models, from the "small" Llama-3.1-8B to the gargantuan DeepSeek-V3 (685B).
Key Findings:
- Architecture Agnostic: Whether using EAGLE-3 (recurrent), MEDUSA (parallel heads), or DeepSeek's MTP (Multi-token prediction), LK losses consistently beat KL.
- Massive MoE Gains: The biggest jumps were seen in large Mixture-of-Experts (MoE) models. On Qwen3-235B, the acceptance length jumped by 8.2%.
- Low Capacity, High Reward: The more "restricted" the draft model is (like MEDUSA heads), the more it benefits from this specialized loss compared to generic KL.
Table: Consistent improvements across different model families and sizes.
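To relate per-token acceptance rate to acceptance length, a common back-of-the-envelope model from the original speculative decoding analysis assumes every drafted token is accepted independently with the same probability. Under that simplifying assumption (which real drafters only approximate, and which may differ from how the paper measures acceptance length), a draft of `gamma` tokens yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens per verification step in expectation, counting the token the target model contributes itself. The function name below is illustrative.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per verification step, assuming each of the
    `gamma` drafted tokens is accepted i.i.d. with probability `alpha`."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# Illustrative values only: compare two acceptance rates at several draft lengths.
for gamma in (1, 3, 7):
    low = expected_tokens_per_step(0.70, gamma)
    high = expected_tokens_per_step(0.75, gamma)
    print("gamma=%d: %.3f -> %.3f tokens/step" % (gamma, low, high))
```

In this simplified model the same improvement in acceptance rate buys more tokens per step as the draft gets longer.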
Critical Insight: Why This Matters
The most profound takeaway is that training objectives must match the inference algorithm. For years, we've treated speculator training as a standard distillation task. This paper shows that by incorporating the "physics" of the rejection sampling algorithm into the loss function itself, we can squeeze more performance out of the same number of parameters.
Limitations & Future Work
The authors note that while they optimize per-token acceptance, they haven't yet optimized for system-level efficiency (the exact ratio of tokens accepted vs drafted). Future versions might include "top-k" constraints directly in the loss to match how LLMs are actually deployed in production.
Conclusion
LK Losses represent a "drop-in" replacement for standard speculator training. They require no extra compute during training and no changes at inference time, yet they provide a systematic boost to LLM throughput. If you are training a speculator for your custom LLM stack, standard KL is no longer the SOTA choice.
