This paper addresses the instability of On-Policy Distillation (OPD) for Large Language Models (LLMs) by replacing the traditional "sampled-token" objective with a "Teacher Top-K Local Support Matching" method. This approach achieves SOTA-level improvements in long-horizon reasoning, boosting math benchmarks (e.g., AIME24) while maintaining agentic performance.
TL;DR
On-Policy Distillation (OPD) is the "gold standard" for teaching LLMs to reason by letting them learn from their own mistakes. However, the standard implementation—which only rewards/punishes the specific token the student sampled—is notoriously brittle. This paper revisits OPD's foundations, identifies why "one-token signals" fail in long sequences, and introduces Teacher Top-K Local Support Matching. The result? More stable training, higher math reasoning scores, and a robust defense against "reward hacking" loops.
The Core Conflict: Bias vs. Variance
The authors start by analyzing the estimator behind OPD. Theoretically, we want to minimize the reverse KL at the sequence level. However, a sequence-level estimator couples every token update to every future reward, so its variance grows as $O(T^4)$.
To keep training stable, most practitioners use token-level OPD, which drops future rewards and brings the variance down to $O(T^2)$. While this introduces bias, it is a necessary evil for long-horizon tasks like multi-step math or agentic workflows. The problem isn't the locality; it's the sparsity of the signal.
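In symbols (notation mine, reconstructing the two estimators from the description above): with student $\pi_\theta$, teacher $\pi_T$, and rollouts $y \sim \pi_\theta$ of length $T$,

$$
\mathcal{L}_{\text{seq}}(\theta)=\mathbb{E}_{y\sim\pi_\theta}\Big[\sum_{t=1}^{T}\log\frac{\pi_\theta(y_t\mid y_{<t})}{\pi_T(y_t\mid y_{<t})}\Big],
\qquad
\mathcal{L}_{\text{tok}}(\theta)=\mathbb{E}_{y\sim\pi_\theta}\Big[\sum_{t=1}^{T}\mathrm{KL}\big(\pi_\theta(\cdot\mid y_{<t})\,\big\|\,\pi_T(\cdot\mid y_{<t})\big)\Big].
$$

The score-function gradient of $\mathcal{L}_{\text{seq}}$ ties every token's update to the sum of all future log-ratios, which is where the $O(T^4)$ variance comes from; dropping those future terms yields $\mathcal{L}_{\text{tok}}$ and the $O(T^2)$ rate, at the cost of bias. The "sampled-token" implementation goes one step further and estimates each per-prefix KL with the single term $\log\frac{\pi_\theta(y_t\mid y_{<t})}{\pi_T(y_t\mid y_{<t})}$ for the one token actually drawn, and that one-token estimate is exactly the sparse signal the paper criticizes.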
Three Reasons Why Sampled-Token OPD Fails
- Imbalanced Signal: Most sampled tokens receive negative rewards, concentrating the positive learning signal on a tiny subset of tokens, making optimization hypersensitive.
- Teacher Unreliability: When a student drifts into a "weird" prefix (e.g., a repetition loop), the teacher might still assign high probability to plausible-looking tokens, failing to provide the "anchor" needed to pull the student back.
- Tokenizer Incompatibility: If the teacher and student use different tokenizers (e.g., Qwen vs. Llama), a "correct" semantic token might be heavily penalized simply because it's segmented differently.
Figure: The log-probability gap widens as sequences get longer, showing how the teacher signal degrades on student-generated rollouts.
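One way to reproduce this diagnostic is to score student rollouts under the teacher and average the per-position log-probability. A minimal PyTorch sketch (the function and argument names are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_logprob_by_position(teacher_logits: torch.Tensor,
                                rollout_ids: torch.Tensor) -> torch.Tensor:
    """Mean teacher log-probability of student-sampled tokens at each position.

    teacher_logits: [batch, seq_len, vocab] teacher logits on student rollouts.
    rollout_ids:    [batch, seq_len] token ids the student actually sampled.
    Returns a [seq_len] curve; a downward trend over positions reproduces
    the widening gap shown in the figure.
    """
    log_probs = F.log_softmax(teacher_logits, dim=-1)         # [B, T, V]
    picked = log_probs.gather(-1, rollout_ids.unsqueeze(-1))  # [B, T, 1]
    return picked.squeeze(-1).mean(dim=0)                     # [T]
```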
The Solution: Local Support Matching (LSM)
Instead of checking if the teacher likes the one token the student picked, the method evaluates the student's distribution against the teacher's Top-K most likely tokens at that prefix.
The Recipe (sketched in code after the list):
- Truncated KL: Calculate the KL-divergence only over the Top-K tokens defined by the teacher.
- Renormalization: Re-normalize the probabilities within this Top-K set so they sum to 1, ensuring the distributions are comparable.
- Top-p Rollouts: Use nucleus sampling to keep the student from wandering into "garbage" states early on.
- Special-Token Masking: Ignore disagreements on tokens like <think> or <|endoftext|> to avoid tokenizer artifacts.
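Putting the truncation, renormalization, and masking steps together, here is a minimal PyTorch sketch of the loss. This is my reconstruction from the description above, not the authors' code; the KL direction and the value of K are assumptions, and top-p sampling is assumed to happen upstream at rollout time.

```python
import torch
import torch.nn.functional as F

def local_support_matching_loss(student_logits: torch.Tensor,
                                teacher_logits: torch.Tensor,
                                special_token_ids: list[int],
                                k: int = 20) -> torch.Tensor:
    """Truncated, renormalized Top-K KL against the teacher's local support.

    student_logits, teacher_logits: [batch, seq_len, vocab], both computed
    on the same student-generated rollout.
    """
    # Special-token masking: remove markers like <think> or <|endoftext|>
    # from the teacher's support so tokenizer artifacts are never penalized.
    teacher_logits = teacher_logits.clone()
    teacher_logits[..., special_token_ids] = float("-inf")

    # Local support: the teacher's Top-K token ids at every prefix.
    topk_vals, topk_ids = teacher_logits.topk(k, dim=-1)   # [B, T, K]

    # Student scores gathered on the same K tokens.
    student_vals = student_logits.gather(-1, topk_ids)     # [B, T, K]

    # Renormalization: softmax within the support, so both sides sum to 1.
    log_q = F.log_softmax(topk_vals, dim=-1)     # teacher, renormalized
    log_p = F.log_softmax(student_vals, dim=-1)  # student, renormalized

    # Truncated KL(teacher || student) over the support; swap log_q and
    # log_p for the reverse direction if that matches the paper.
    kl = (log_q.exp() * (log_q - log_p)).sum(dim=-1)       # [B, T]
    return kl.mean()
```

Each position now contributes a dense K-way signal rather than a single sampled-token reward, which is the source of the smoother optimization shown in the figure below.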
Figure: Compared to the baseline, LSM (ours) shows much lower gradient norms and fewer instances of gradient clipping, indicating a significantly "smoother" optimization landscape.
Experimental Victories
The authors tested this on Qwen2.5-7B for math and agentic tasks (ALFWorld).
- Math Reasoning: The average score across math benchmarks climbed from 36.4 to 41.5. On the challenging AIME24 benchmark, the improvement was even more pronounced.
- Agentic Performance: In multi-task settings, the model hit a 97.7% success rate on ALFWorld, proving that stabilizing math reasoning doesn't have to come at the cost of general task performance.
- Stability: The "Repetition Loop" failure—where a model gets stuck saying "Wait, Wait, Wait"—was significantly reduced because the distribution-level loss forces the model to match the teacher's entropy rather than just one-off token choices.
Critical Insight: The Middle Ground
The genius of this work lies in finding the "Goldilocks zone." Fully sequence-level RL is too noisy; single-token distillation is too "dumb." By matching distributions over a local teacher support, the authors provide enough signal to guide the student without the chaotic variance of long-term reward dependencies.
Limitations & Future Work
While LSM is a powerful "simple fix," the authors admit a gap still exists between the student and the teacher. Future research will likely focus on outcome-verifiable rewards (like code execution) to complement this teacher-matching approach, ensuring that models don't just "talk like the teacher" but actually "solve the problem."
Summary Takeaway
If you are training reasoning models using OPD, stop using point-estimate rewards. Moving to a truncated, renormalized Top-K KL objective is a low-cost, high-impact upgrade that solves the most common stability issues in LLM post-training.
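Concretely, in an existing token-level OPD loop the swap is local to the loss computation. A usage example, continuing from the hypothetical local_support_matching_loss sketched earlier:

```python
# Before (sampled-token OPD): the per-step loss scores only the one token
# the student sampled, yielding the sparse signal criticized above.
# After (Local Support Matching): a dense Top-K distribution match.
loss = local_support_matching_loss(
    student_logits,                        # [B, T, V] student forward pass
    teacher_logits,                        # [B, T, V] teacher forward pass
    special_token_ids=[think_id, eot_id],  # hypothetical ids for <think>, <|endoftext|>
    k=20,                                  # assumed support size
)
loss.backward()
```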
