[Research Frontier] OPSDC: Solving the "Verbosity Paradox" via On-Policy Self-Distillation
Abstract

The paper introduces OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches LLMs to be concise by distilling their own "be concise"-conditioned behavior. Evaluated on Qwen3-8B and Qwen3-14B, it achieves a 57–59% token reduction on MATH-500 while simultaneously increasing accuracy by 9–16 points.

TL;DR

Researchers report a surprising "Less is More" effect in AI reasoning. By teaching a model to follow its own concise instincts (a method called OPSDC), they reduced token usage on MATH-500 by up to 59% while improving accuracy by 9–16 points. This suggests that much of the "thinking" in current reasoning models is not merely wasteful but actively harmful.

The Problem: The High Cost of "Overthinking"

Current SOTA reasoning models (like o1 or DeepSeek-R1) are notoriously chatty. Ask a model a trivial arithmetic question, and it might spend 500 tokens verifying the axioms of arithmetic. This isn't just a cost issue; the extra tokens also hurt accuracy.

Existing solutions usually fail in one of two ways:

  1. RL with Length Penalties: Forces the model to be short, often "breaking" its internal logic and causing entropy collapse (the model stops exploring alternative paths).
  2. Supervised Fine-Tuning (SFT): Teaches the model to mimic someone else's short answers, which often causes it to "forget" its own reasoning style.

Methodology: The "Be Concise" Teacher

The genius of OPSDC (On-Policy Self-Distillation for Reasoning Compression) lies in its simplicity. It doesn't use rewards, ground-truth answers, or complex difficulty estimators.

The Setup:

  • The Student: The base model, acting naturally.
  • The Teacher: The same model weights, but given a "be concise" prompt.
  • The Objective: Minimize the Reverse KL Divergence between the student and teacher on student-generated rollouts.
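
The setup above can be written as a single objective. The original notation did not survive on this page, so every symbol below is an assumption chosen for illustration: $\pi_\theta$ is the student, $\pi_{\bar\theta}$ is the teacher (the same weights, held fixed between refreshes), $x$ is a problem, $c$ is the "be concise" instruction, and $y$ is a rollout sampled from the student. A plausible token-level form, not necessarily the paper's exact formulation, is:

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      D_{\mathrm{KL}}\!\left(
        \pi_\theta(\cdot \mid x, y_{<t})
        \,\middle\|\,
        \pi_{\bar\theta}(\cdot \mid c, x, y_{<t})
      \right)
    \right]
```

The student distribution sits in the first argument of the KL term, which is what makes this the reverse direction, and the expectation is over the student's own samples, which is what makes it on-policy. No reward or ground-truth answer appears anywhere in the expression.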

[Figure: Overall Comparison]

Why Reverse KL?

The authors found that Forward KL (the standard in SFT) causes accuracy to collapse in a "saw-tooth" pattern. Reverse KL is "mode-seeking": it encourages the student to concentrate on the concise paths the teacher prefers, and it trains stably. Combined with a periodic teacher update (refreshing the teacher's weights at a fixed interval of training steps), the model progressively "squeezes" its reasoning traces.
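
To make "mode-seeking" concrete, here is a toy sketch in Python. The distributions are made up for illustration and have nothing to do with the paper's experiments: a hypothetical teacher with two preferred outcomes, one student that commits to a single outcome, and one that spreads its probability evenly.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions with strictly positive entries.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical teacher with two preferred "modes" (outcomes 0 and 3).
teacher = [0.49, 0.01, 0.01, 0.49]

# Two candidate students: one commits to a single mode, one spreads mass evenly.
student_peaked = [0.97, 0.01, 0.01, 0.01]
student_spread = [0.25, 0.25, 0.25, 0.25]

for name, student in [("peaked", student_peaked), ("spread", student_spread)]:
    reverse = kl(student, teacher)  # KL(student || teacher)
    forward = kl(teacher, student)  # KL(teacher || student)
    print(f"{name}: reverse={reverse:.3f} forward={forward:.3f}")
```

In this toy case Reverse KL scores the peaked student as closer to the teacher, while Forward KL prefers the spread-out student. That asymmetry is the standard intuition behind calling Reverse KL mode-seeking and Forward KL mass-covering.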

Key Results: Accuracy through Brevity

The results on Qwen3 models are striking. The models didn't just maintain performance as their responses got shorter; accuracy rose sharply.

  • Qwen3-14B (MATH-500): Accuracy jumped from 70.0% to 86.1% while the response length shrank by 56.5%.
  • AIME 2024: A gain of 10.5 percentage points with a 41% reduction in tokens.
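
For a sense of scale, the percentages above can be turned into length ratios. The short Python sketch below uses only the figures quoted in this section; the helper name `length_ratio` is hypothetical.

```python
def length_ratio(token_reduction):
    # How many times shorter the output is after a fractional token reduction.
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    return 1.0 / (1.0 - token_reduction)

# Token reductions as quoted in the results above.
reported = {
    "MATH-500 (Qwen3-14B)": 0.565,
    "AIME 2024": 0.41,
}

for name, reduction in reported.items():
    print(f"{name}: {reduction:.1%} fewer tokens -> {length_ratio(reduction):.2f}x shorter")

# Accuracy gain on MATH-500 for Qwen3-14B, in percentage points.
accuracy_gain = 86.1 - 70.0
print(f"accuracy gain: {accuracy_gain:.1f} points")
```

A 56.5% reduction means responses are a bit more than twice as short, which is different from a "56.5x" or even a "3x" saving; the ratio grows quickly only as the reduction approaches 100%.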

[Figure: Training Accuracy Growth]

The "Difficulty-Adaptive" Feature

Unlike RL methods that need a separate classifier to know when a problem is "hard," OPSDC is naturally adaptive. For easy problems, the "concise" teacher provides a very short target (strong signal). For hard problems, even a concise teacher needs to deliberate (weak signal). Thus, the model automatically knows when to think hard and when to shut up.

Deep Insight: Why Does Less Token Usage Help?

The authors propose a "Compounding Error" theory. In autoregressive reasoning, every token is a point of failure. If the model talks too much, it increases the probability of "hallucinating" a wrong intermediate step that cascades into a wrong final answer.
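
A back-of-the-envelope version of this argument, sketched in Python. It rests on a strong simplifying assumption that is not from the paper: every token independently introduces a fatal error with the same small probability. The error rate and trace lengths are hypothetical numbers picked for illustration.

```python
def prob_error_free(num_tokens, per_token_error):
    # Probability that no token in the trace introduces an error,
    # assuming each token fails independently with the same probability.
    return (1.0 - per_token_error) ** num_tokens

EPS = 1e-4                                      # hypothetical per-token error rate
verbose_len = 10_000                            # hypothetical verbose trace length
concise_len = round(verbose_len * (1 - 0.58))   # about 58% fewer tokens

for label, n in [("verbose", verbose_len), ("concise", concise_len)]:
    print(f"{label}: {n} tokens, P(no error) = {prob_error_free(n, EPS):.3f}")
```

Under this model the chance of an error-free trace decays geometrically with length, so cutting tokens raises it. Real reasoning traces are messier, since some extra tokens catch and repair earlier mistakes, so this is only an illustration of the intuition rather than a prediction.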

"Verbosity is not caution—it is a source of error."

By stripping away the "noise" (the "Wait, let me double check..." or redundant re-derivations), the model stays on the golden path of logic.

Entropy Preservation Note: Unlike RL, OPSDC preserves model entropy (exploratory capacity) throughout training.

Conclusion & Future Work

OPSDC suggests that reasoning models already know how to be efficient; they just need to be taught to trust their concise modes. By moving away from ground-truth-dependent RL and toward behavioral self-distillation, we can build models that are not only 2–3x cheaper to run but also more accurate.

The next frontier? Applying this to non-math domains like coding and legal reasoning, where "correctness" is harder to define but conciseness is just as valuable.
