On-Policy Self-Distillation for Reasoning Compression

[Research Frontier] OPSDC: Solving the "Verbosity Paradox" via On-Policy Self-Distillation

Abstract

The paper introduces OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches LLMs to be concise by distilling their own "be concise"-conditioned behavior. Evaluating on Qwen3-8B and Qwen3-14B, it achieves 57–59% token reduction on MATH-500 while simultaneously increasing accuracy by 9–16 points.

TL;DR

Researchers report a surprising "less is more" effect in LLM reasoning. By teaching a model to follow its own concise instincts, a method called OPSDC, they cut token usage on MATH-500 by up to 59% while improving accuracy by 9–16 points. This suggests that much of the "thinking" in current reasoning models is not merely wasteful but actively harmful noise.

The Problem: The High Cost of "Overthinking"

Current SOTA reasoning models (like o1 or DeepSeek-R1) are notoriously chatty. Ask a model for $2 + 2$, and it might spend 500 tokens verifying the axioms of arithmetic. This isn't just a cost issue; the extra tokens can also drag down accuracy.

Existing solutions usually fail in one of two ways:

RL with Length Penalties: Forces the model to be short, often "breaking" its internal logic and causing entropy collapse (the model stops exploring alternative paths).
Supervised Fine-Tuning (SFT): Teaches the model to mimic someone else's short answers, which often causes it to "forget" its own reasoning style.

Methodology: The "Be Concise" Teacher

The genius of OPSDC (On-Policy Self-Distillation for Reasoning Compression) lies in its simplicity. It doesn't use rewards, ground-truth answers, or complex difficulty estimators.

The Setup:

The Student: The base model $\pi_\theta(\cdot \mid x)$, acting naturally.
The Teacher: The same model weights, but given a "be concise" prompt $c$.
The Objective: Minimize the Reverse KL Divergence between the student and teacher on student-generated rollouts.
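The objective above can be sketched at the token level. This is a minimal illustration, not the authors' implementation: it assumes the loss is the full-vocabulary KL from the student distribution $\pi_\theta(\cdot \mid x, y_{<t})$ to the teacher distribution $\pi_\theta(\cdot \mid x, c, y_{<t})$, averaged over the positions of a student-generated rollout, with the teacher treated as a fixed target. The function names and the random logits are placeholders.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reverse_kl_per_token(student_logits, teacher_logits):
    """KL(student || teacher) at each position; inputs have shape (T, V)."""
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    p_student = np.exp(log_p_student)
    # The expectation is taken under the student, which is what makes it "reverse" KL.
    return (p_student * (log_p_student - log_p_teacher)).sum(axis=-1)

# Toy stand-in for one rollout: 3 positions, vocabulary of 4 tokens.
# In practice both sets of logits would come from scoring the SAME
# student-sampled tokens, once without and once with the concise prompt c.
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(3, 4))  # placeholder for the student's logits
teacher_logits = rng.normal(size=(3, 4))  # placeholder for the teacher's logits
per_token = reverse_kl_per_token(student_logits, teacher_logits)
loss = per_token.mean()
print(per_token, loss)
```

Because the rollouts are sampled from the student, the student is only ever corrected on text it would actually produce, which is what "on-policy" refers to here.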

[Figure: Overall Comparison]

Why Reverse KL?

The authors found that Forward KL (the standard in SFT) causes accuracy to collapse in a "saw-tooth" pattern. Reverse KL is "mode-seeking": it pushes the student toward the concise paths the teacher prefers, and training remains stable. Combined with a periodic teacher update (refreshing the teacher's weights from the current student every $M$ steps), the model progressively "squeezes" its reasoning traces.
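The effect of the periodic teacher update can be seen in a toy model. The sketch below is an illustration under strong assumptions, not the paper's training loop: the "model" is a two-way choice between continuing to think and stopping, the concise prompt is modeled as a hypothetical additive bias on the logits (`concise_bias`), and the student is updated by plain gradient descent on the reverse KL. The names `train` and `refresh_every` (playing the role of $M$) are invented for this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reverse_kl_grad(student_logits, teacher_probs):
    """Gradient of KL(student || teacher) w.r.t. the student logits."""
    p = softmax(student_logits)
    log_ratio = np.log(p) - np.log(teacher_probs)
    kl = float((p * log_ratio).sum())
    return p * (log_ratio - kl)

def train(refresh_every, steps=300, lr=0.5):
    """Toy loop over two actions: [keep thinking, stop]. Returns P(stop)."""
    concise_bias = np.array([0.0, 1.0])  # hypothetical stand-in for the prompt c
    student = np.array([2.0, 0.0])       # the student starts out verbose
    snapshot = student.copy()            # weights the teacher currently uses
    for step in range(steps):
        if refresh_every and step % refresh_every == 0:
            snapshot = student.copy()    # periodic teacher update
        teacher_probs = softmax(snapshot + concise_bias)
        student = student - lr * reverse_kl_grad(student, teacher_probs)
    return softmax(student)[1]

p_stop_initial = softmax(np.array([2.0, 0.0]))[1]
p_stop_frozen = train(refresh_every=0)      # teacher never refreshed
p_stop_refreshed = train(refresh_every=50)  # teacher refreshed every 50 steps
print(p_stop_initial, p_stop_frozen, p_stop_refreshed)
```

With a frozen teacher the student can only move as far as one application of the concise bias. Each refresh re-applies the bias on top of the already-compressed student, so the target keeps moving toward brevity. The toy has no notion of correctness, so it says nothing about where compression should stop.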

Key Results: Accuracy through Brevity

The results on Qwen3 models are striking. On the MATH-500 benchmark, accuracy did not merely hold steady as responses got shorter; it rose sharply.

Qwen3-14B: Accuracy jumped from 70.0% to 86.1% while the response length shrank by 56.5%.
AIME 2024: A gain of 10.5 percentage points with a 41% reduction in tokens.
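As a quick sanity check on what these percentages mean in practice, the snippet below converts the reported token reductions into "times fewer tokens" factors. It uses only the figures quoted above; the helper name is made up, and treating inference cost as proportional to generated tokens is an assumption.

```python
def tokens_factor(token_reduction):
    """Factor by which the token count shrinks, e.g. 0.5 -> 2.0x fewer tokens."""
    return 1.0 / (1.0 - token_reduction)

# Figures quoted in this post.
math500_gain = 86.1 - 70.0             # accuracy gain in percentage points
math500_factor = tokens_factor(0.565)  # 56.5% shorter responses
aime_factor = tokens_factor(0.41)      # 41% fewer tokens

print(f"MATH-500 (Qwen3-14B): +{math500_gain:.1f} points, {math500_factor:.2f}x fewer tokens")
print(f"AIME 2024: {aime_factor:.2f}x fewer tokens")
```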

[Figure: Training Accuracy Growth]

The "Difficulty-Adaptive" Feature

Unlike RL methods that need a separate classifier to know when a problem is "hard," OPSDC is naturally adaptive. For easy problems, the "concise" teacher provides a very short target (strong signal). For hard problems, even a concise teacher needs to deliberate (weak signal). Thus, the model automatically knows when to think hard and when to shut up.
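A toy calculation shows why the signal strength adapts on its own. The probabilities below are invented purely for illustration and do not come from the paper: at some decision point the student mostly wants to keep thinking, the concise teacher strongly prefers stopping on an easy problem, and barely differs from the student on a hard one.

```python
import numpy as np

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for two discrete distributions."""
    s = np.asarray(student_probs, dtype=float)
    t = np.asarray(teacher_probs, dtype=float)
    return float((s * (np.log(s) - np.log(t))).sum())

# Distributions over two actions: [keep thinking, stop]. All values are illustrative.
student = [0.90, 0.10]       # the student is verbose by default
teacher_easy = [0.20, 0.80]  # easy problem: the concise teacher wants to stop
teacher_hard = [0.85, 0.15]  # hard problem: even the concise teacher keeps thinking

signal_easy = reverse_kl(student, teacher_easy)
signal_hard = reverse_kl(student, teacher_hard)
print(signal_easy, signal_hard)
```

The loss, and hence the push toward brevity, is large only where the concise teacher actually disagrees with the student, so no separate difficulty estimate is needed.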

Deep Insight: Why Does Less Token Usage Help?

The authors propose a "Compounding Error" theory. In autoregressive reasoning, every token is a point of failure. If the model talks too much, it increases the probability of "hallucinating" a wrong intermediate step that cascades into a wrong final answer.

"Verbosity is not caution—it is a source of error."

By stripping away the "noise" (the "Wait, let me double check..." or redundant re-derivations), the model stays on the golden path of logic.
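The compounding-error argument can be made concrete with a back-of-the-envelope model. This is an assumed simplification, not a formula from the paper: suppose each token independently derails the reasoning with a small probability, and a single derailment is fatal. Then a trace of $n$ tokens survives with probability $(1 - \varepsilon)^n$. The error rate and trace lengths below are hypothetical.

```python
def p_no_error(n_tokens, per_token_error):
    """Probability that an n-token trace contains no fatal error,
    assuming independent and identical per-token error rates."""
    return (1.0 - per_token_error) ** n_tokens

eps = 1e-4            # hypothetical per-token error rate
verbose_len = 10_000  # hypothetical verbose trace length
concise_len = 4_200   # the same trace with 58% fewer tokens

p_verbose = p_no_error(verbose_len, eps)
p_concise = p_no_error(concise_len, eps)
print(f"verbose: {p_verbose:.3f}, concise: {p_concise:.3f}")
```

Real models can catch and repair their own mistakes, so the independence assumption overstates the effect. The point is only that survival probability decays geometrically with length, so removing tokens that contribute nothing raises it.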

Entropy Preservation Note: Unlike RL, OPSDC preserves model entropy (exploratory capacity) throughout training.

Conclusion & Future Work

OPSDC suggests that reasoning models already know how to be efficient; they just need to be taught to trust their concise modes. By moving away from ground-truth-dependent RL and toward behavioral self-distillation, we can build models that are not only 2-3x cheaper to run but also more accurate.

The next frontier? Applying this to non-math domains like coding and legal reasoning, where "correctness" is harder to define but conciseness is just as valuable.
