[ICLR 2025] Learning to Draft: Breakthrough Speculative Decoding via Co-Adaptive RL
Abstract

This paper introduces Learning to Draft (LTD), an RL-based framework for adaptive speculative decoding that optimizes the inference throughput of Large Language Models (LLMs). By training co-adaptive depth and size policies, LTD achieves SOTA speedup ratios of 2.24x to 4.32x, outperforming the previous best-in-class Eagle3 by up to 36.4%.

TL;DR

Learning to Draft (LTD) is a reinforcement learning framework that solves the "over-drafting" problem in LLM inference. By optimizing for wall-clock throughput instead of just token acceptance length, LTD achieves up to 4.32x speedup on modern LLMs, surpassing the previous state-of-the-art method, Eagle3.

The Motivation: Acceptance Length is a False Idol

Speculative decoding typically works by using a small "draft" model to guess tokens and a "target" model to verify them. For years, the community has obsessed over maximizing Acceptance Length ($\tau$): the number of tokens the target model accepts per cycle.

However, the authors of LTD make a critical observation: A longer draft isn't always a faster draft. If a draft model spends 50ms building a complex tree and the target model spends 100ms verifying it, but you only gain 4 tokens, you might have been faster just doing standard auto-regressive decoding. Prior works like Eagle3 used static heuristics that failed to account for this time-cost trade-off.
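The arithmetic behind that observation is worth making explicit. The sketch below uses the 50ms/100ms/4-token numbers from the paragraph above, plus a made-up 30ms per-token cost for plain auto-regressive decoding, to show how a "successful" speculative cycle can still lose on throughput:

```python
# Illustrating the over-drafting trade-off. The 30 ms auto-regressive cost is
# a hypothetical number for illustration; it does not come from the paper.

def throughput(tokens_gained: float, time_spent_ms: float) -> float:
    """Tokens produced per millisecond of wall-clock time."""
    return tokens_gained / time_spent_ms

# Speculative cycle: 50 ms drafting + 100 ms verification, 4 tokens accepted.
spec = throughput(4, 50 + 100)

# Plain auto-regressive decoding: assume one 30 ms target forward pass per token.
auto = throughput(1, 30)

print(f"speculative: {spec:.4f} tok/ms, auto-regressive: {auto:.4f} tok/ms")
```

With these numbers, auto-regressive decoding wins (1/30 ≈ 0.033 tok/ms vs 4/150 ≈ 0.027 tok/ms), even though the speculative cycle "accepted" four tokens, which is exactly the failure mode a pure acceptance-length objective cannot see.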

Methodology: Co-Adaptive Reinforcement Learning

LTD refines the "Draft-and-Verify" cycle by treating it as an RL environment. It introduces two lightweight MLP policies that act as "traffic controllers" for the inference engine:

  1. Depth Policy ($\pi_D$): Decides after every step of the draft model whether to CONTINUE drafting or STOP. It senses the "uncertainty" in the current path.
  2. Size Policy ($\pi_V$): Once drafting is done, it picks the optimal number of candidate tokens ($V$) to send for verification, balancing the target model's compute cost.
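The control loop implied by the two policies above can be sketched as follows. This is a minimal illustration, not the paper's architecture: the feature dimension, hidden width, candidate sizes in `V_CHOICES`, the depth cap, and the random (untrained) weights are all assumptions; real policies would be trained with RL.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    """A minimal two-layer MLP head with a ReLU hidden layer."""
    h = np.maximum(0.0, x @ w1 + b1)
    return h @ w2 + b2

# Toy, untrained parameters; shapes are illustrative only.
d_feat, d_hid = 8, 16
depth_params = (rng.normal(size=(d_feat, d_hid)), np.zeros(d_hid),
                rng.normal(size=(d_hid, 2)), np.zeros(2))   # logits: [STOP, CONTINUE]
size_params = (rng.normal(size=(d_feat, d_hid)), np.zeros(d_hid),
               rng.normal(size=(d_hid, 4)), np.zeros(4))    # logits over V choices

V_CHOICES = [4, 8, 16, 32]  # hypothetical candidate verification sizes

def draft_and_verify_step(features):
    """One draft cycle controlled by the two policies (greedy action selection)."""
    depth = 0
    while depth < 16:  # safety cap on draft depth
        if mlp(features, *depth_params).argmax() == 0:  # action 0 = STOP
            break
        depth += 1
        # (a real system would refresh `features` from the newly drafted token)
    V = V_CHOICES[mlp(features, *size_params).argmax()]
    return depth, V

depth, V = draft_and_verify_step(rng.normal(size=d_feat))
```

The key design point is that both heads are cheap MLPs: they must run on every draft step, so any overhead they add is paid inside the very latency budget they are trying to shrink.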

The Secret Sauce: Real-Time Throughput Reward

Instead of rewarding the model for being "correct," LTD rewards the model for being fast. The reward signal is: $$R_t = \frac{L_A}{T_{draft} + T_{verify}}$$ where $L_A$ is the number of accepted tokens and $T_{draft}$, $T_{verify}$ are the measured wall-clock times of the two stages. This forces the policies to learn when the draft model is "yapping" pointlessly and when the target model should invest more time in verification.
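The reward is a one-liner to compute; the timing values below are made-up examples showing how it penalizes over-drafting:

```python
def throughput_reward(accepted_len: int, t_draft: float, t_verify: float) -> float:
    """R_t = L_A / (T_draft + T_verify): accepted tokens per second of wall-clock time."""
    return accepted_len / (t_draft + t_verify)

# Long draft, modest acceptance: the wasted draft time drags the reward down.
r_overdraft = throughput_reward(accepted_len=2, t_draft=0.050, t_verify=0.100)

# Same acceptance from a shorter draft and lighter verification: higher reward.
r_short = throughput_reward(accepted_len=2, t_draft=0.010, t_verify=0.060)

assert r_short > r_overdraft
```

Because time sits in the denominator, the policies cannot inflate the reward by drafting longer unless the extra tokens are actually accepted, which is what aligns training with wall-clock throughput rather than acceptance length.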

Figure: Overview of the LTD method.

Experiments: Crushing the Baselines

The authors tested LTD on a range of models, including the new Qwen3-32B and DeepSeek-R1-Distill-Llama-8B.

  • Massive Gains on Large Models: On Qwen3-32B, LTD achieved a 36.4% improvement over the previous SOTA.
  • Zero-Shot Robustness: Although trained on code (HumanEval), the model generalized to 54/57 MMLU tasks, showing it learned universal "difficulty patterns" in language rather than just memorizing text.
  • High Temperature Stability: Most dynamic methods break at $T=1.0$ because randomness makes draft models unreliable. LTD remains robust, outperforming others by maintaining a conservative drafting strategy when entropy is high.

Figure: Analysis of draft and verification time.

Deep Insight: Why Iterative Training Matters

The paper highlights that drafting and verification are interdependent. If you optimize the draft model in isolation, it might generate long sequences that the verification stage (if fixed) cannot handle efficiently.

LTD uses Iterative Optimization:

  1. Freeze the Size Policy, train the Depth Policy.
  2. Freeze the Depth Policy, train the Size Policy.

This "dance" allows the two policies to co-adapt, leading to a synergistic system where the draft model only works as hard as the verification stage needs it to.
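The alternating schedule can be sketched with a deliberately tiny stand-in: here each "policy" is a single scalar knob (draft depth D, verification size V), the throughput surface is a made-up function peaking at (D=5, V=8), and training is a trivial hill-climb. None of this is the paper's algorithm; it only shows why freezing one knob while tuning the other still reaches the joint optimum:

```python
import random

def toy_throughput(D: int, V: int) -> float:
    """Hypothetical throughput surface with its peak at D=5, V=8."""
    return 1.0 / (1.0 + (D - 5) ** 2 + 0.1 * (V - 8) ** 2)

def train_policy(value, objective, lo, hi, steps=200, rng=random.Random(0)):
    """Trivial hill-climb: keep random +/-1 perturbations that don't hurt the objective."""
    best = value
    for _ in range(steps):
        cand = min(hi, max(lo, best + rng.choice([-1, 1])))
        if objective(cand) >= objective(best):
            best = cand
    return best

D, V = 1, 32  # deliberately poor initial "policies"
for _ in range(3):  # a few alternating rounds
    D = train_policy(D, lambda d: toy_throughput(d, V), 1, 16)  # freeze V, train D
    V = train_policy(V, lambda v: toy_throughput(D, v), 1, 64)  # freeze D, train V

print(D, V)  # converges toward the joint optimum (5, 8)
```

The same coordinate-ascent logic is what makes the real schedule work: each policy is optimized against the other's current behavior, so neither learns a strategy (e.g. very deep drafts) that its counterpart cannot exploit.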

Conclusion and Future Outlook

LTD represents a shift from "AI for Language" to "AI for Systems." By making the inference engine aware of its own clock, we move closer to truly efficient real-time reasoning.

Limitations: The current policies are MLPs. While fast, they rely on token probabilities as a proxy for "difficulty." Future work might involve deeper integration with the LLM's internal states to predict "verification difficulty" even more accurately.


Senior Editor's Note: This paper is a masterclass in hardware-aware AI. By focusing on the denominator of the speedup equation (time), the authors achieved what others couldn't by only looking at the numerator (tokens).
