The paper introduces OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches LLMs to be concise by distilling their own "be concise"-conditioned behavior. Evaluating on Qwen3-8B and Qwen3-14B, it achieves 57–59% token reduction on MATH-500 while simultaneously increasing accuracy by 9–16 points.
TL;DR
Researchers have discovered a surprising "Less is More" effect in AI reasoning. By simply teaching a model to follow its own concise instincts, a method called OPSDC, they reduced token usage on MATH-500 by up to 59% while improving accuracy by 9 to 16 points. This suggests that much of the "thinking" in current models is harmful noise.
The Problem: The High Cost of "Overthinking"
Current SOTA reasoning models (like o1 or DeepSeek-R1) are notoriously chatty. Ask one a trivial arithmetic question, and it might spend 500 tokens verifying the axioms of arithmetic. This isn't just a cost issue; it's a performance bottleneck.
Existing solutions usually fail in one of two ways:
- RL with Length Penalties: Forces the model to be short, often "breaking" its internal logic and causing entropy collapse (the model stops exploring alternative paths).
- Supervised Fine-Tuning (SFT): Teaches the model to mimic someone else's short answers, which often causes it to "forget" its own unique reasoning style.
Methodology: The "Be Concise" Teacher
The genius of OPSDC (On-Policy Self-Distillation for Reasoning Compression) lies in its simplicity. It doesn't use rewards, ground-truth answers, or complex difficulty estimators.
The Setup:
- The Student: The base model, acting naturally.
- The Teacher: The same model weights, but given a "be concise" prompt.
- The Objective: Minimize the Reverse KL Divergence between the student and teacher on student-generated rollouts.
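The objective above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the loss is a mean per-token reverse KL computed over the full vocabulary at each position of a student-generated rollout, and the function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) = sum_x s(x) * (log s(x) - log t(x))."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def sequence_reverse_kl(student_logits, teacher_logits):
    """Mean per-token reverse KL over one student-generated rollout.

    Assumption: both lists hold one logit vector per generated position,
    scored on the SAME student-sampled tokens. The teacher logits come
    from the same weights with a "be concise" prompt prepended.
    """
    per_token = [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy rollout: 2 positions, vocabulary of 3 tokens (made-up logits).
toy_student_logits = [[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]]
toy_teacher_logits = [[2.0, 1.0, 0.1], [3.0, 0.0, 0.0]]
toy_loss = sequence_reverse_kl(toy_student_logits, toy_teacher_logits)
print(f"mean per-token reverse KL: {toy_loss:.4f}")
```

In the toy rollout, the first position contributes nothing because student and teacher agree. All of the loss comes from the second position, where the teacher is confident and the student is not.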

Why Reverse KL?
The authors found that Forward KL (the standard in SFT) causes accuracy to collapse in a "saw-tooth" pattern. Reverse KL is "mode-seeking": it pushes the student toward the concise paths the teacher prefers while keeping training stable. Combined with a periodic teacher update (refreshing the teacher's weights at a fixed interval of training steps), the model progressively "squeezes" its reasoning traces.
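The "mode-seeking" behavior of reverse KL is a general property of the divergence, and a toy example makes it concrete. The sketch below uses made-up three-outcome distributions, not anything from the paper: a bimodal teacher, a student that commits to one mode, and a student that spreads its mass evenly.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical distributions over three outcomes.
teacher = [0.495, 0.01, 0.495]   # two strong modes, one unlikely outcome
peaked = [0.98, 0.01, 0.01]      # student commits to a single teacher mode
spread = [1 / 3, 1 / 3, 1 / 3]   # student covers everything evenly

# Reverse KL: the expectation is taken under the student.
reverse_peaked = kl(peaked, teacher)
reverse_spread = kl(spread, teacher)

# Forward KL: the expectation is taken under the teacher.
forward_peaked = kl(teacher, peaked)
forward_spread = kl(teacher, spread)

print(f"reverse KL  peaked={reverse_peaked:.3f}  spread={reverse_spread:.3f}")
print(f"forward KL  peaked={forward_peaked:.3f}  spread={forward_spread:.3f}")
```

Reverse KL scores the peaked student as closer to the teacher, because it penalizes placing mass where the teacher has little. Forward KL prefers the spread student, because it penalizes missing any region the teacher covers.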
Key Results: Accuracy through Brevity
The results on Qwen3 models are striking. On the MATH-500 benchmark, the models didn't just maintain performance; their accuracy rose sharply.
- Qwen3-14B: Accuracy jumped from 70.0% to 86.1% while the response length shrank by 56.5%.
- AIME 2024: A gain of 10.5 percentage points with a 41% reduction in tokens.

The "Difficulty-Adaptive" Feature
Unlike RL methods that need a separate classifier to know when a problem is "hard," OPSDC is naturally adaptive. For easy problems, the "concise" teacher provides a very short target (strong signal). For hard problems, even a concise teacher needs to deliberate (weak signal). Thus, the model automatically knows when to think hard and when to shut up.
Deep Insight: Why Does Less Token Usage Help?
The authors propose a "Compounding Error" theory. In autoregressive reasoning, every token is a point of failure. If the model talks too much, it increases the probability of "hallucinating" a wrong intermediate step that cascades into a wrong final answer.
"Verbosity is not caution—it is a source of error."
By stripping away the "noise" (the "Wait, let me double check..." or redundant re-derivations), the model stays on the golden path of logic.
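A back-of-the-envelope calculation shows why the compounding-error argument is plausible. This is a simplification, not the authors' analysis: it assumes every token independently derails the answer with the same small probability, and the error rate and trace lengths are hypothetical.

```python
def chain_success_probability(per_token_error, num_tokens):
    """Probability that no token in the trace introduces a fatal error,
    assuming independent and identical per-token error rates."""
    return (1.0 - per_token_error) ** num_tokens

# Hypothetical numbers chosen for illustration only.
per_token_error = 0.001   # 0.1% chance that any given token derails the chain
long_trace_tokens = 2000  # a verbose reasoning trace
short_trace_tokens = 860  # the same trace after a 57% token reduction

long_trace = chain_success_probability(per_token_error, long_trace_tokens)
short_trace = chain_success_probability(per_token_error, short_trace_tokens)

print(f"verbose trace survives: {long_trace:.3f}")   # roughly 0.135
print(f"concise trace survives: {short_trace:.3f}")  # roughly 0.423
```

Under these assumptions the shorter trace is about three times more likely to finish without a fatal slip. Real errors are neither independent nor uniform, and some extra tokens do catch mistakes, so this only shows the direction of the effect.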
Note: Unlike RL with length penalties, OPSDC preserves model entropy (exploratory capacity) throughout training.
Conclusion & Future Work
OPSDC suggests that reasoning models already know how to be efficient; they just need to be taught to trust their concise modes. By moving away from ground-truth-dependent RL and toward behavioral self-distillation, we can build models that are not only 2-3x cheaper to run but also more accurate.
The next frontier? Applying this to non-math domains like coding and legal reasoning, where "correctness" is harder to define but conciseness is just as valuable.
