[ICML 2025 Prediction] LK Losses: Bridging the Proxy Gap in Speculative Decoding
Abstract

The paper introduces LK Losses, a novel family of training objectives designed for speculative decoding draft models. Unlike standard approaches that minimize KL divergence as a proxy, LK Losses directly optimize the token acceptance rate, achieving significant speedups and up to 10% gains in average acceptance length across major LLM architectures like Llama-3.1, Qwen3, and DeepSeek-V3.

TL;DR

Speculative decoding's speed depends largely on how often the "draft" model guesses correctly. Traditionally, these drafters are trained with KL divergence, but that is only a proxy. LK Losses instead optimize the acceptance rate directly. By introducing an adaptive hybrid loss that balances stability and direct target optimization, the authors achieve up to 10% higher acceptance lengths across models like Llama-3.3 and DeepSeek-V3.

The "Proxy" Problem: Why KL Isn't Enough

In speculative decoding, a tiny "draft" model (the student) predicts tokens that a massive "target" model (the teacher) verifies. The efficiency is measured by the Acceptance Rate ($\alpha$).
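As a rough illustration of what the acceptance rate measures, the sketch below simulates the standard speculative sampling accept/reject step: a token drawn from the draft is kept with probability min(1, p/q), where p is the target probability and q the draft probability. The toy distributions `target` and `draft` are made-up values for illustration, not numbers from the paper.

```python
import random

def acceptance_rate(p, q):
    """Analytic acceptance rate: sum over tokens of min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def simulate_acceptance(p, q, trials=100_000, seed=0):
    """Sample tokens from the draft q and accept each with prob min(1, p/q)."""
    rng = random.Random(seed)
    tokens = range(len(q))
    accepted = 0
    for _ in range(trials):
        x = rng.choices(tokens, weights=q)[0]  # draft proposes token x
        if rng.random() < min(1.0, p[x] / q[x]):  # target verifies it
            accepted += 1
    return accepted / trials

# Hypothetical 3-token vocabulary (illustrative values only).
target = [0.6, 0.3, 0.1]
draft = [0.4, 0.4, 0.2]

analytic = acceptance_rate(target, draft)
empirical = simulate_acceptance(target, draft)
print(f"analytic={analytic:.3f} empirical={empirical:.3f}")
```

The empirical frequency should land close to the analytic sum of minimums, which is why that sum is the quantity a drafter ultimately wants to maximize.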

Historically, we've used Forward KL Divergence as the training objective. Mathematically, if KL is zero, $\alpha$ is 100%. But there's a catch: draft models are tiny. They lack the capacity to ever reach KL=0. In this "limited capacity" regime, minimizing KL might actually lead to a lower acceptance rate compared to other objectives.

As shown in the authors' motivating example, KL tends to be "mode-covering": it spreads probability mass too thin to avoid heavy penalties, whereas the Total Variation (TV) distance (which is directly $1 - \alpha$) focuses on maximizing the overlap where it counts.
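The mode-covering effect can be reproduced in a toy setting. The sketch below is not the authors' motivating example: it assumes a hypothetical low-capacity draft family (a one-parameter geometric distribution over four tokens) and a made-up bimodal target, then grid-searches for the parameter that minimizes forward KL and the one that maximizes the acceptance rate.

```python
import math

def geometric_draft(r, k=4):
    """Hypothetical one-parameter draft family: q(i) proportional to r**i."""
    weights = [r ** i for i in range(k)]
    z = sum(weights)
    return [w / z for w in weights]

def forward_kl(p, q):
    """KL(p || q); terms with p(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def acceptance_rate(p, q):
    """Sum of min(p, q), which equals 1 minus the TV distance."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

# Made-up bimodal target that the geometric family cannot represent.
target = [0.7, 0.0, 0.0, 0.3]
grid = [i / 1000 for i in range(1, 1001)]

r_kl = min(grid, key=lambda r: forward_kl(target, geometric_draft(r)))
r_tv = max(grid, key=lambda r: acceptance_rate(target, geometric_draft(r)))

acc_kl = acceptance_rate(target, geometric_draft(r_kl))
acc_tv = acceptance_rate(target, geometric_draft(r_tv))
print(f"KL-optimal r={r_kl:.3f} acceptance={acc_kl:.3f}")
print(f"TV-optimal r={r_tv:.3f} acceptance={acc_tv:.3f}")
```

In this toy case the KL-optimal draft stretches toward the far mode and wastes mass on the two tokens the target never emits, so its acceptance rate ends up noticeably lower than that of the draft chosen by overlap.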

Figure 1 (Acceptance Length vs Draft Length): Traditional KL training (blue) vs LK losses (green/red). The gap widens as we try to predict more tokens.
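One way to build intuition for why the gap widens with draft length is the textbook simplification in which every draft token is accepted independently with the same rate. Under that assumption (which the summary above does not state, and which real models only approximate), the expected number of tokens produced per verification step is a geometric sum. The two acceptance rates below are hypothetical.

```python
def expected_tokens_per_step(alpha, draft_len):
    """Geometric sum 1 + alpha + ... + alpha**draft_len.

    Assumes i.i.d. acceptance with rate alpha and counts the extra
    token the target model contributes at the end of each step.
    """
    return sum(alpha ** k for k in range(draft_len + 1))

# Hypothetical acceptance rates for a baseline and an improved drafter.
alpha_base, alpha_better = 0.70, 0.75

gaps = []
for draft_len in range(1, 9):
    base = expected_tokens_per_step(alpha_base, draft_len)
    better = expected_tokens_per_step(alpha_better, draft_len)
    gaps.append(better - base)
    print(f"draft_len={draft_len} base={base:.3f} better={better:.3f}")
```

Each extra draft position adds a positive term to the difference, so a fixed per-token improvement compounds as the draft gets longer.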

Methodology: The Math of Direct Optimization

The authors identified that while TV distance is the "true" objective, it is a nightmare to optimize from scratch. Its gradient magnitude shrinks as the vocabulary grows, meaning the gradients vanish in the massive vocabularies of modern LLMs.

To solve this, they introduced two strategies:

1. The Hybrid Adaptive Loss ($L^\lambda_{LK}$)

This loss uses a curriculum. It starts as Forward KL to get the model into the right "neighborhood" (providing smooth, well-scaled gradients) and then adaptively shifts toward TV optimization as the acceptance rate improves. The weight is automatically tuned using a stop-gradient on the current acceptance rate.
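A minimal sketch of the idea follows. The exact weighting schedule is not given in this summary, so the choice of weight = 1 - acceptance rate is purely an illustrative assumption, as are the function names and toy distributions. In an autograd framework the weight would be wrapped in a stop-gradient (for example `detach()` in PyTorch); in plain Python it is simply a constant.

```python
import math

def forward_kl(p, q):
    """KL(p || q); terms with p(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def acceptance_rate(p, q):
    """Sum of min(p, q), which equals 1 minus the TV distance."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def hybrid_loss(p, q):
    """Blend of forward KL and TV with an acceptance-dependent weight.

    ASSUMPTION: kl_weight = 1 - alpha is an illustrative schedule, not
    the authors' formula. It is treated as a constant (stop-gradient).
    """
    alpha = acceptance_rate(p, q)
    kl_weight = 1.0 - alpha          # high when the draft is still poor
    tv = 1.0 - alpha
    loss = kl_weight * forward_kl(p, q) + (1.0 - kl_weight) * tv
    return loss, kl_weight

# Hypothetical target and two drafts of different quality.
target = [0.6, 0.3, 0.1]
poor_draft = [0.1, 0.1, 0.8]
good_draft = [0.55, 0.32, 0.13]

loss_poor, w_poor = hybrid_loss(target, poor_draft)
loss_good, w_good = hybrid_loss(target, good_draft)
print(f"poor draft: loss={loss_poor:.3f} kl_weight={w_poor:.2f}")
print(f"good draft: loss={loss_good:.3f} kl_weight={w_good:.2f}")
```

With this schedule the poor draft is trained almost entirely on KL, while the good draft is trained almost entirely on TV, which mirrors the curriculum described above.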

2. The Log-Likelihood Approach ($L^\alpha_{LK}$)

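This loss minimizes the negative log of the marginal probability of acceptance: $L^\alpha_{LK} = -\log \alpha$. Because $\nabla(-\log \alpha) = -\frac{1}{\alpha}\nabla\alpha$, this objective naturally scales its own gradients by $1/\alpha$. When the model is doing poorly (low $\alpha$), the updates become massive, "kicking" the model toward better alignment.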
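The gradient scaling can be checked numerically. Since the acceptance rate equals one minus the TV distance, the chain rule says the gradient of the negative log acceptance is the TV gradient divided by the acceptance rate. The sketch below verifies this with finite differences on draft logits; the target distribution and logits are made-up values, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def acceptance_rate(p, logits):
    """Sum of min(p, q) where q is the softmax of the draft logits."""
    q = softmax(logits)
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

# Hypothetical target and a deliberately poor draft.
target = [0.6, 0.3, 0.1]
logits = [-1.0, 0.0, 2.0]

alpha = acceptance_rate(target, logits)
grad_tv = numeric_grad(lambda z: 1.0 - acceptance_rate(target, z), logits)
grad_log = numeric_grad(lambda z: -math.log(acceptance_rate(target, z)), logits)

for g_tv, g_log in zip(grad_tv, grad_log):
    print(f"tv_grad={g_tv:+.4f} log_grad={g_log:+.4f} ratio={g_log / g_tv:.3f}")
print(f"1/alpha={1 / alpha:.3f}")
```

The printed ratio should match 1/alpha on every coordinate, so a draft with a low acceptance rate receives proportionally larger updates than it would from the raw TV objective.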

Figure 2 (Geometric Intuition of TV vs KL): KL (middle) spreads out to cover the whole mixture, while TV (right) smartly overlaps with the highest mass areas to maximize acceptance.

Experiments: Dominating the Leaderboards

The authors tested LK Losses across an impressive array of models, from the "small" Llama-3.1-8B to the gargantuan DeepSeek-V3 (685B).

Key Findings:

  • Architecture Agnostic: Whether using EAGLE-3 (recurrent), MEDUSA (parallel heads), or DeepSeek's MTP (Multi-token prediction), LK losses consistently beat KL.
  • Massive MoE Gains: The biggest jumps were seen in large Mixture-of-Experts (MoE) models. On Qwen3-235B, the acceptance length jumped by 8.2%.
  • Low Capacity, High Reward: The more "restricted" the draft model is (like MEDUSA heads), the more it benefits from this specialized loss compared to generic KL.

Table (Performance across different models): Consistent improvements across different model families and sizes.

Critical Insight: Why This Matters

The central takeaway is that training objectives must match the inference algorithm. For years, we've treated speculator training as a standard distillation task. This paper shows that by incorporating the "physics" of the rejection sampling algorithm into the loss function itself, we can squeeze more performance out of the same number of parameters.

Limitations & Future Work

The authors note that while they optimize per-token acceptance, they haven't yet optimized for system-level efficiency (the exact ratio of tokens accepted vs drafted). Future versions might include "top-k" constraints directly in the loss to match how LLMs are actually deployed in production.

Conclusion

LK Losses represent a "drop-in" replacement for standard speculator training. They require no extra compute during training and no changes at inference time, yet they provide a systematic boost to LLM throughput. If you are training a speculator for your custom LLM stack, standard KL is no longer the SOTA choice.
