This paper introduces Learning to Draft (LTD), an RL-based framework for adaptive speculative decoding that optimizes the inference throughput of Large Language Models (LLMs). By training co-adaptive depth and size policies, LTD achieves SOTA speedup ratios of 2.24x to 4.32x, outperforming the previous best-in-class Eagle3 by up to 36.4%.
TL;DR
Learning to Draft (LTD) is a reinforcement learning framework that solves the "over-drafting" problem in LLM inference. By optimizing for wall-clock throughput instead of just token acceptance length, LTD achieves up to 4.32x speedup on modern LLMs, surpassing previous SOTA methods like Eagle3.
The Motivation: Acceptance Length is a False Idol
Speculative decoding typically works by using a small "draft" model to guess tokens and a "target" model to verify them. For years, the community has obsessed over maximizing Acceptance Length ($\tau$)—the number of tokens the target model accepts per cycle.
However, the authors of LTD make a critical observation: A longer draft isn't always a faster draft. If a draft model spends 50ms building a complex tree and the target model spends 100ms verifying it, but you only gain 4 tokens, you might have been faster just doing standard auto-regressive decoding. Prior works like Eagle3 used static heuristics that failed to account for this time-cost trade-off.
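The arithmetic behind this observation is easy to work through. A minimal sketch (all timings hypothetical, loosely following the 50ms/100ms example above): speculative decoding only beats plain auto-regressive decoding when enough tokens are accepted per cycle to cover the combined draft-and-verify cost.

```python
def speculative_tps(accepted: int, t_draft: float, t_verify: float) -> float:
    """Tokens per second for one draft-and-verify cycle."""
    return accepted / (t_draft + t_verify)

# Hypothetical baseline: one auto-regressive target step = 100 ms per token.
ar_tps = 1 / 0.100  # 10 tok/s

# 50 ms draft + 100 ms verify, 4 tokens accepted: drafting pays off.
print(speculative_tps(4, 0.050, 0.100))  # ~26.7 tok/s > 10 tok/s

# Same time budget, only 1 token accepted: over-drafting loses to AR.
print(speculative_tps(1, 0.050, 0.100))  # ~6.7 tok/s < 10 tok/s
```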
Methodology: Co-Adaptive Reinforcement Learning
LTD refines the "Draft-and-Verify" cycle by treating it as an RL environment. It introduces two lightweight MLP policies that act as "traffic controllers" for the inference engine:
- Depth Policy ($\pi_D$): Decides after every step of the draft model whether to `CONTINUE` drafting or `STOP`. It senses the "uncertainty" in the current path.
- Size Policy ($\pi_V$): Once drafting is done, it picks the optimal number of candidate tokens ($V$) to send for verification, balancing the target model's compute cost.
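A toy sketch of how the two policies could gate one draft cycle. The threshold rule and the random "uncertainty" proxy here are hypothetical stand-ins for the paper's learned MLPs, just to make the control flow concrete:

```python
import random

def depth_policy(uncertainty: float) -> str:
    # Hypothetical stand-in for pi_D: stop when the current path
    # looks too uncertain to be worth extending.
    return "STOP" if uncertainty > 0.6 else "CONTINUE"

def size_policy(depth: int) -> int:
    # Hypothetical stand-in for pi_V: cap the candidate tokens
    # sent to the target model for verification.
    return min(2 ** depth, 16)

def draft_cycle(max_depth: int = 8, seed: int = 0) -> tuple[int, int]:
    """Run one draft phase, returning (draft depth, verification size V)."""
    rng = random.Random(seed)
    depth = 0
    while depth < max_depth:
        uncertainty = rng.random()  # proxy for entropy of the draft logits
        if depth_policy(uncertainty) == "STOP":
            break
        depth += 1
    return depth, size_policy(depth)

print(draft_cycle(seed=1))  # (1, 2): one confident step, then an early stop
```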
The Secret Sauce: Real-Time Throughput Reward
Instead of rewarding the model for being "correct," LTD rewards the model for being fast. The reward signal is: $$R_t = L_A / (T_{draft} + T_{verify})$$ This forces the policies to learn when the draft model is "yapping" pointlessly and when the target model should invest more time in verification.
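The reward is cheap to compute per cycle. This sketch (hypothetical timings) shows the key property: a draft with a longer accepted length can still earn a lower reward once its time cost lands in the denominator.

```python
def throughput_reward(accepted_len: int, t_draft: float, t_verify: float) -> float:
    """R_t = L_A / (T_draft + T_verify), in accepted tokens per second."""
    return accepted_len / (t_draft + t_verify)

# A shorter, cheaper draft can out-score a longer, slower one:
r_short = throughput_reward(4, 0.030, 0.100)  # 4 tokens in 130 ms
r_long = throughput_reward(5, 0.120, 0.150)   # 5 tokens, but a 270 ms cycle
print(r_short > r_long)  # True: over-drafting is penalized
```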

Experiments: Crushing the Baselines
The authors tested LTD on a range of models, including the new Qwen3-32B and DeepSeek-R1-Distill-Llama-8B.
- Massive Gains on Large Models: On Qwen3-32B, LTD achieved a 36.4% improvement over the previous SOTA.
- Zero-Shot Robustness: Although trained on code (HumanEval), the model generalized to 54/57 MMLU tasks, showing it learned universal "difficulty patterns" in language rather than just memorizing text.
- High Temperature Stability: Most dynamic methods break at $T=1.0$ because randomness makes draft models unreliable. LTD remains robust, outperforming others by maintaining a conservative drafting strategy when entropy is high.

Deep Insight: Why Iterative Training Matters
The paper highlights that drafting and verification are interdependent. If you optimize the draft model in isolation, it might generate long sequences that the verification stage (if fixed) cannot handle efficiently.
LTD uses Iterative Optimization:
- Freeze Size Policy, train Depth Policy.
- Freeze Depth Policy, train Size Policy.

This "dance" allows the two policies to co-adapt, leading to a synergistic system where the draft model only works as hard as the verification stage needs it to.
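The alternating schedule above can be sketched as a simple loop. The `update_fn` here is a hypothetical placeholder for one RL optimization phase; the point is only the freeze-and-swap structure:

```python
def train_iteratively(update_fn, rounds: int = 2) -> list[str]:
    """Alternate phases: train pi_D with pi_V frozen, then pi_V with pi_D frozen."""
    schedule = []
    for _ in range(rounds):
        update_fn("depth")       # Size Policy frozen, Depth Policy learns
        schedule.append("depth")
        update_fn("size")        # Depth Policy frozen, Size Policy learns
        schedule.append("size")
    return schedule

print(train_iteratively(lambda phase: None, rounds=2))
# ['depth', 'size', 'depth', 'size']
```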
Conclusion and Future Outlook
LTD represents a shift from "AI for Language" to "AI for Systems." By making the inference engine aware of its own clock, we move closer to truly efficient real-time reasoning.
Limitations: The current policies are MLPs. While fast, they rely on token probabilities as a proxy for "difficulty." Future work might involve deeper integration with the LLM's internal states to predict "verification difficulty" even more accurately.
Senior Editor's Note: This paper is a masterclass in hardware-aware AI. By focusing on the denominator of the speedup equation (time), the authors achieved what others couldn't by only looking at the numerator (tokens).
