[ICLR 2026] Demystifying When Pruning Works via Representation Hierarchies
Abstract

This paper investigates the inconsistent performance of pruned Large Language Models (LLMs) across different tasks. It identifies that while pruned models remain robust on non-generative tasks (e.g., retrieval, classification), they often collapse in generative settings, and it offers diagnostic insight into why representation hierarchies determine pruning success.

Executive Summary

TL;DR: Why does a pruned LLM still excel at multiple-choice questions yet fail miserably at writing a simple story? This paper reveals a "Representation Hierarchy": pruning-induced errors are largely ignored by latent embeddings and logits but are catastrophically amplified by the nonlinear Softmax function and the "butterfly effect" of autoregressive generation.

The study positions itself as a critical diagnostic work, moving beyond "how much we can prune" to "where exactly the signal is lost" in the LLM pipeline.

The "Generative Collapse" Mystery

In the landscape of model compression, pruning—the removal of redundant parameters—is often treated as a universal gain in efficiency. However, the authors highlight a jarring inconsistency:

  • The Success Case: On non-generative tasks like MMLU or retrieval (E5-Mistral), the model behaves as if nothing happened, even after losing a significant fraction of its layers.
  • The Failure Case: In generative tasks like GSM8K (math reasoning) or HumanEval (coding), the performance often drops to near zero.

The motivation is simple: if we can understand why this gap exists, we can design models that are "pruning-aware" or apply pruning only where it won't hurt.

Methodology: The Three Spaces of Inference

The authors analyze the propagation of pruning noise across three distinct stages:

  1. Embedding Space ($h$): Hidden states across Transformer layers.
  2. Logit Space ($z$): The output of the final linear head ($W h$).
  3. Probability Space ($p$): The post-softmax distribution.
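
To make the three spaces concrete, here is a minimal PyTorch sketch (toy dimensions and random weights standing in for a trained model, not the paper's code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab = 512, 1000        # toy sizes; real models are far larger

h = torch.randn(d_model)                          # 1. embedding space: hidden state h
W = torch.randn(vocab, d_model) / d_model ** 0.5  # stand-in for the linear LM head
z = W @ h                                         # 2. logit space: z = W h
p = F.softmax(z, dim=-1)                          # 3. probability space: p = softmax(z)

# Pruning perturbs h by some noise Delta h; the paper's question is how
# large that perturbation looks once it reaches each of the three spaces.
```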

1. The Robustness of Similarity

The authors use a second-order Taylor expansion to show that the linear LM head largely preserves similarity: because the projection is linear, it tends to attenuate the orthogonal components of the noise.
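
This preservation is easy to check numerically. Below is a toy sketch (random weights standing in for a trained LM head, noise at 10% of the hidden state's norm) showing that cosine similarity survives the linear projection essentially intact:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab = 512, 1000

h = torch.randn(d_model)
W = torch.randn(vocab, d_model) / d_model ** 0.5   # toy LM head

# Pruning-style noise at 10% of the hidden state's norm
delta_h = 0.1 * h.norm() * F.normalize(torch.randn(d_model), dim=0)

cos_h = F.cosine_similarity(h, h + delta_h, dim=0)
cos_z = F.cosine_similarity(W @ h, W @ (h + delta_h), dim=0)
print(f"cosine in embedding space: {cos_h:.4f}")   # close to 1
print(f"cosine in logit space:     {cos_z:.4f}")   # remains close to 1
```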

Figure 1: Propagation of perturbations. Pruning noise $\Delta h$ is stable in the logit space but explodes in the probability space.

2. The Softmax "Magnifying Glass"

The core insight lies in Theorems 2 and 3. The Softmax function is highly sensitive to the variance of logit perturbations. Even if the logits remain 99% similar to the original, a small shift in the weighted variance $\text{Var}_r(\Delta z)$ produces a massive shift in the probability distribution. This is quantified via KL divergence, which accurately predicts the generative collapse.
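
The asymmetry is straightforward to reproduce. In the sketch below (toy logits, not the paper's setup), a perturbation that leaves the logits highly cosine-similar still induces a sizable KL divergence after Softmax:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab = 1000

z = 5.0 * torch.randn(vocab)             # sharp logits, as in trained LMs
z_pruned = z + 1.5 * torch.randn(vocab)  # perturbation small relative to z

cos = F.cosine_similarity(z, z_pruned, dim=0)
log_p = F.log_softmax(z, dim=-1)
log_q = F.log_softmax(z_pruned, dim=-1)
kl = F.kl_div(log_q, log_p, log_target=True, reduction="sum")  # KL(p || q)

print(f"logit cosine similarity: {cos:.4f}")  # stays high (~0.95+)
print(f"KL(p || q) in nats:      {kl:.3f}")   # typically large for sharp logits
```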

Multi-Scale Effects: Why Non-Generative Tasks Survive

The paper introduces two brilliant explanations for the survival of non-generative tasks:

  • Subspace Stability: Multiple-choice tasks only care about the relative order of a few tokens (A, B, C, D). The authors show that while the global distribution is ruined, the "categorical-token probability subspace" remains remarkably stable (see the sketch after this list).
  • Temporal Compounding: In generation, an error at Step 1 is fed back into the model as part of the context for Step 2. This feedback loop (Self-Attention) acts as an accumulator for pruning noise. Non-generative tasks are "one-shot" and skip this recursive catastrophe.
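
The subspace-stability claim can be illustrated with a toy experiment (hypothetical token ids for "A".."D", random logits standing in for a real model): the full distribution shifts, yet the relative order within the answer subspace holds.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab = 1000
answer_ids = torch.tensor([10, 11, 12, 13])         # hypothetical ids for "A".."D"

z = 3.0 * torch.randn(vocab)                        # background logits
z[answer_ids] = torch.tensor([1.0, 0.5, 5.0, 0.2])  # "C" is the clear winner
z_pruned = z + 1.0 * torch.randn(vocab)             # pruning-style perturbation

p, q = F.softmax(z, dim=-1), F.softmax(z_pruned, dim=-1)
kl = (p * (p.log() - q.log())).sum()
print(f"global KL(p || q): {kl:.3f}")                     # full distribution shifts noticeably
print("answer before:", "ABCD"[p[answer_ids].argmax()])   # 'C'
print("answer after: ", "ABCD"[q[answer_ids].argmax()])   # typically still 'C'
```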

Experimental Validation

Using Mistral-7B and Qwen-2.5, the authors demonstrate that both dropping layers (inter-layer pruning) and applying Wanda (intra-layer sparsification) follow the same rule.
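
For context, inter-layer pruning is mechanically trivial. A hedged sketch, assuming the Hugging Face transformers implementation of Mistral-style models (decoder blocks live in model.model.layers; the dropped index range is illustrative, not the paper's choice):

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Drop a contiguous block of decoder layers (indices are illustrative).
drop = set(range(20, 28))
kept = [layer for i, layer in enumerate(model.model.layers) if i not in drop]
model.model.layers = nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

# Re-index attention layers so KV-cache bookkeeping stays contiguous.
for i, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = i
```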

Table 1: Performance discrepancy. Note the contrast: average MMLU (non-generative) drops marginally (-5%), while GSM8K (generative) collapses entirely (-100%).

The authors also compare pruning with quantization. Interestingly, quantization (e.g., AWQ) shows much higher similarity and lower KL divergence in the early decoding steps, explaining why 4-bit models generate far better than 50%-pruned models despite similar memory footprints.
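
Such a comparison can be scripted by measuring per-step KL against the dense reference. A minimal sketch, assuming HF-style causal LMs whose forward pass returns .logits:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def stepwise_kl(ref_model, test_model, input_ids):
    """Per-position KL(p_ref || p_test) over a shared prefix, so that early-step
    divergence can be compared across compression methods (e.g., a 50%-pruned
    model vs. a 4-bit quantized one, both against the dense reference)."""
    log_p = F.log_softmax(ref_model(input_ids).logits, dim=-1)
    log_q = F.log_softmax(test_model(input_ids).logits, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1).squeeze(0)
```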

Practical Insight & Conclusion

Takeaways for Engineers:

  • Evaluation Matters: Don't use MMLU/Classification scores as a proxy for how well your pruned model will chat or code.
  • Target the Task: If your application is a vector database (retrieval) or a classifier, you can prune aggressively with almost no loss.
  • Generation Needs Post-Training: Training-free pruning is almost a non-starter for generative tasks; fine-tuning is mandatory to recalibrate the Softmax variance.

In conclusion, the bottleneck of pruning isn't the "loss of knowledge" in the embeddings—it's the nonlinear sensitivity of the probability output and the accumulation of errors over time.
