[Analysis] Why Self-Distillation Can Be a "Reasoning Killer" for LLMs
Abstract

This paper investigates why self-distillation, a popular LLM post-training paradigm, often degrades mathematical reasoning despite improving performance in other domains. It identifies the suppression of "epistemic verbalization"—the explicit expression of uncertainty (e.g., "Wait," "Hmm")—as the root cause of this failure, particularly in out-of-distribution (OOD) scenarios.

TL;DR

Self-distillation is often hailed as a "free lunch" for making LLMs faster and smarter. However, this paper reveals a dark side: in mathematical reasoning, self-distillation (SD) often forces models to be "overconfident" by suppressing epistemic verbalization—those tiny "Hmms" and "Waits" that allow a model to double-check its logic. This leads to a massive performance collapse (up to 40%) in challenging out-of-distribution (OOD) tasks like AIME.

The "Conciseness" Trap: Problem & Motivation

In domains like Chemistry or simple tool-use, self-distillation helps models reach the right answer with fewer steps. The logic seems sound: if a model can see the answer during training (as a teacher), it can teach itself the shortest path.

But the authors noticed a disturbing trend in Math. As the models were trained via SD, their response lengths dropped—but so did their accuracy. The researchers hypothesized that the teacher model, by having access to the ground truth, generates "too perfect" trajectories. It skips the "thinking" process where a model weighs different possibilities. When the student imitates this, it loses the ability to handle uncertainty when the ground truth is not available at inference time.

Methodology: The Cost of Information

The authors use an information-theoretic lens to explain this. They quantify the "informativeness" of a context $c$ by how much it reduces the teacher's uncertainty: when the teacher is conditioned on the full solution, its entropy over the reasoning trajectory drops to nearly zero.
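A minimal reconstruction of that framing (the notation here is ours, not necessarily the paper's): informativeness can be read as the mutual information between the context and the reasoning trajectory, i.e., how much conditioning on $c$ shrinks the teacher's conditional entropy:

$$
\mathrm{Info}(c) \;=\; I(y;\, c \mid x) \;=\; H(y \mid x) \;-\; H(y \mid x, c),
$$

where $x$ is the problem, $y$ the reasoning trajectory, and $c$ the auxiliary context shown to the teacher. In the extreme case $c = s$ (the full ground-truth solution), $H(y \mid x, s)$ approaches zero, so the teacher's trajectories verbalize essentially no uncertainty for the student to imitate.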

[Figure: Teacher-Guided Reasoning Behavior]

As shown in the graph above, as the information given to the teacher increases:

  1. Response Length decreases.
  2. Epistemic Token Count (markers like "perhaps", "actually", "check") vanishes.

This "epistemic verbalization" acts as a self-correction mechanism. By removing it, the model commits to a path too early. If that path is wrong, it has no linguistic "handles" to pivot back to a correct solution.

Experimental Evidence: GRPO vs SDPO

The study compared GRPO (which uses rewards but no self-distillation) with SDPO (Reinforcement Learning via Self-Distillation).
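As a rough sketch of that distillation signal (an assumed formulation, not the paper's released code): the student is trained to match, token by token, a teacher whose prompt additionally contains the ground-truth solution.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Forward KL(teacher || student), averaged over trajectory tokens.

    Assumed setup: both sets of logits come from the same base model over the
    same sampled trajectory, but the teacher's prompt includes the full
    ground-truth solution (c = s), so its distribution is sharply peaked.
    Shapes: [seq_len, vocab_size].
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # kl_div(input, target) computes KL(target || input); here the target is the teacher.
    kl = F.kl_div(student_logp, teacher_logp, log_target=True, reduction="none").sum(dim=-1)
    return kl.mean()
```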

[Figure: Training Score vs Length]

On the DeepSeek-R1-Distill-Qwen-7B model:

  • GRPO slightly increased length and improved performance.
  • SDPO ($c=s$, full solution) saw a "death spiral" where the model became incredibly concise but lost nearly 40% of its accuracy on the AIME24 benchmark.

The authors also found that Task Coverage matters. If you only train on a few types of problems, self-distillation looks great because the model can just memorize the "shortcuts." But as soon as you move to a diverse set of 14,000+ problems, the lack of uncertainty markers leaves the model fragile and unable to generalize.

Deep Insights: The Fixed vs. Moving Teacher

A fascinating technical nuance found in the ablation studies: using a fixed teacher (the initial model) is actually better than a moving target. In a moving-teacher setup (where the teacher is an exponential moving average, or EMA, of the student), a feedback loop occurs: the model becomes more confident, the teacher becomes even more confident, and eventually all the "reasoning" is distilled away into just a guess.
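For intuition, here is a minimal sketch of the two teacher configurations (the helper names are ours, not the paper's): a fixed teacher is a frozen snapshot of the initial policy, whereas an EMA teacher keeps drifting toward the student, which is exactly the feedback loop the ablation warns about.

```python
import copy
import torch

def make_fixed_teacher(initial_model: torch.nn.Module) -> torch.nn.Module:
    """Fixed teacher: a frozen copy of the initial policy that never changes."""
    teacher = copy.deepcopy(initial_model)
    for param in teacher.parameters():
        param.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.99) -> None:
    """Moving teacher: exponential moving average of the student's weights.

    As the student grows more confident, the teacher follows, so each
    distillation step targets an ever-more-peaked distribution.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)
```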

[Figure: Ablation on Learning Trends]

Conclusion & Future Outlook

The core takeaway is a warning to the AI community: Conciseness does not equal Intelligence.

While we want our models to be efficient, "compressed" reasoning is only safe when the task is repetitive. For frontier reasoning, we must protect the model's right to be "uncertain." Future post-training recipes should focus on:

  • Preserving epistemic markers (the "Wait... actually" moments).
  • Lowering the "Information Richness" of the teacher context to keep the training task difficult enough to require real thinking.
  • Prioritizing OOD generalization over in-domain training scores.

This work serves as a vital course correction for the next generation of reasoning models, ensuring they remain robust thinkers rather than just confident guessers.
