This paper investigates the inconsistent performance of pruned Large Language Models (LLMs) across different tasks. It identifies that while pruned models remain robust on non-generative tasks (e.g., retrieval, classification), they often collapse in generative settings, and it offers diagnostic insights into why the representation hierarchy determines pruning success.
Executive Summary
TL;DR: Why does a pruned LLM still excel at Multiple-Choice questions but fail miserably at writing a simple story? This paper reveals a "Representation Hierarchy" where pruning-induced errors are largely ignored by latent embeddings and logits but are catastrophically amplified by the nonlinear Softmax function and the "butterfly effect" of autoregressive generation.
The study positions itself as a critical diagnostic work, moving beyond "how much we can prune" to "where exactly the signal is lost" in the LLM pipeline.
The "Generative Collapse" Mystery
In the landscape of model compression, pruning—the removal of redundant parameters—is often treated as a universal gain in efficiency. However, the authors highlight a jarring inconsistency:
- The Success Case: On non-generative tasks like MMLU or retrieval (E5-Mistral), the model performs almost unchanged even after losing a significant number of layers.
- The Failure Case: In generative tasks like GSM8K (math reasoning) or HumanEval (coding), the performance often drops to near zero.
The motivation is simple: if we can understand why this gap exists, we can design models that are "pruning-aware" or apply pruning only where it won't hurt.
Methodology: The Three Spaces of Inference
The authors analyze the propagation of pruning noise across three distinct stages:
- Embedding Space ($h$): Hidden states across Transformer layers.
- Logit Space ($z$): The output of the final linear head ($W h$).
- Probability Space ($p$): The post-softmax distribution.
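The three stages can be sketched end-to-end in a few lines. This is a minimal numerical sketch with toy dimensions; `W` is a random matrix standing in for the LM head, not the paper's actual model weights:

```python
import numpy as np

rng = np.random.default_rng(0)

d, vocab = 64, 1000                            # toy hidden size and vocabulary
W = rng.normal(size=(vocab, d)) / np.sqrt(d)   # stand-in for the final linear head

h = rng.normal(size=d)                         # embedding space: final hidden state
z = W @ h                                      # logit space: z = W h
p = np.exp(z - z.max())                        # probability space: softmax(z)
p /= p.sum()                                   # probabilities sum to 1 over the vocabulary
```

Pruning noise enters as a perturbation $\Delta h$ in the first space; the question is how it propagates through the second and third.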
1. The Robustness of Similarity
The authors use a second-order Taylor expansion to show that the linear LM head largely preserves similarity: because the projection is linear, it tends to attenuate the noise components orthogonal to the signal.
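This robustness is easy to reproduce numerically. The sketch below (toy sizes, a random Gaussian matrix as a hypothetical LM head) checks that cosine similarity survives the linear projection essentially intact:

```python
import numpy as np

rng = np.random.default_rng(1)

d, vocab = 64, 1000                            # toy hidden size and vocabulary
W = rng.normal(size=(vocab, d)) / np.sqrt(d)   # random stand-in for the LM head

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

h = rng.normal(size=d)                         # clean hidden state
dh = 0.1 * rng.normal(size=d)                  # pruning noise Delta h in embedding space

sim_h = cos(h, h + dh)                         # similarity before the head
sim_z = cos(W @ h, W @ (h + dh))               # similarity after the linear head
# both stay close to 1: the linear map does not amplify the perturbation
```

A wide random projection approximately preserves angles, so the similarity measured in logit space tracks the similarity in embedding space.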
Figure 1: Pruning noise $\Delta h$ is stable in logits but explodes in the probability space.
2. The Softmax "Magnifying Glass"
The core insight appears in Theorems 2 and 3: the Softmax function is highly sensitive to the variance of logit perturbations. Even if the logits remain 99% similar to the original, a small shift in the weighted variance $\text{Var}_r(\Delta z)$ produces a massive shift in the probability distribution. This is quantified via KL divergence, which accurately predicts the generative collapse.
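The variance effect can be isolated with two equal-magnitude perturbations of toy logits: a constant shift (zero variance, to which softmax is provably invariant) versus i.i.d. noise (nonzero variance). This is my own illustration of the mechanism, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

z = rng.normal(size=1000)                  # toy logits over a 1000-token vocabulary
p = softmax(z)

# Perturbation A: constant shift of every logit -- zero variance.
# Softmax is invariant to constant shifts, so the distribution is untouched.
q_shift = softmax(z + 1.0)

# Perturbation B: same average per-logit magnitude, but nonzero variance.
q_noise = softmax(z + rng.normal(size=1000))

print(kl(p, q_shift))   # ~0: zero-variance perturbation is harmless
print(kl(p, q_noise))   # roughly Var/2: the variance drives the KL blow-up
```

Only the variance of the perturbation matters here, which is exactly the quantity $\text{Var}_r(\Delta z)$ that the theorems single out.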
Multi-Scale Effects: Why Non-Generative Tasks Survive
The paper introduces two brilliant explanations for the survival of non-generative tasks:
- Subspace Stability: Multiple-choice tasks only care about the relative order of a few tokens (A, B, C, D). The authors show that while the global distribution is ruined, the "Categorical-token probability subspace" remains remarkably stable.
- Temporal Compounding: In generation, an error at Step 1 is fed back into the model as part of the context for Step 2. This feedback loop (Self-Attention) acts as an accumulator for pruning noise. Non-generative tasks are "one-shot" and skip this recursive catastrophe.
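Both effects can be illustrated with small simulations. The subspace part uses i.i.d. logit noise as a crude stand-in for pruning error; the compounding part is simple survival arithmetic with an assumed per-step error rate `eps`, not a number from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# --- Subspace stability: the A-D winner survives a distorted global distribution ---
z = rng.normal(size=1000)
z[:4] = np.array([4.0, 2.0, 1.0, 0.0])     # tokens 0..3 stand in for choices A-D
p = softmax(z)

flips, kls = 0, []
for _ in range(200):
    q = softmax(z + rng.normal(size=1000))  # fresh pruning-style logit noise
    kls.append(kl(p, q))
    flips += int(np.argmax(q[:4]) != np.argmax(p[:4]))

print(np.mean(kls))    # global KL is consistently large (roughly Var/2)
print(flips / 200)     # ...yet the preferred choice among A-D rarely flips

# --- Temporal compounding: per-step errors multiply across generation ---
eps = 0.02             # assumed chance that one decoding step emits a corrupted token
for T in (1, 50, 200):
    print(T, round((1 - eps) ** T, 3))      # one-shot tasks only ever face T = 1
```

Even a 2% per-token error rate leaves only about a 36% chance of an error-free 50-token generation, while a one-shot classification faces that 2% risk exactly once.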
Experimental Validation
Using Mistral-7B and Qwen-2.5, the authors demonstrate that both dropping whole layers (inter-layer pruning) and Wanda-style sparsification (intra-layer pruning) follow the same rule.
Table 1: Note the contrast—Average MMLU (Non-generative) drops marginally (-5%), while GSM8K (Generative) collapses entirely (-100%).
The authors also compared pruning with quantization. Interestingly, quantization (e.g., AWQ) shows much higher similarity and lower KL divergence in the early steps, explaining why 4-bit models generate far better text than 50%-pruned models despite similar memory footprints.
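A back-of-the-envelope comparison on a random weight vector hints at why: per-weight rounding error from 4-bit quantization is smaller than the error from zeroing half the weights. This toy uses uniform rounding as a stand-in for AWQ and magnitude pruning as a stand-in for Wanda; neither is the actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(size=4096)                  # toy weight vector

# 4-bit uniform quantization (toy stand-in for AWQ-style schemes)
scale = np.abs(w).max() / 7
w_q = np.clip(np.round(w / scale), -8, 7) * scale

# 50% magnitude pruning (weights only, ignoring the activation term Wanda uses)
w_p = w.copy()
w_p[np.abs(w) < np.median(np.abs(w))] = 0.0

err_q = np.linalg.norm(w - w_q) / np.linalg.norm(w)   # relative quantization error
err_p = np.linalg.norm(w - w_p) / np.linalg.norm(w)   # relative pruning error
# err_q comes out smaller than err_p at a comparable memory footprint
```

Smaller weight perturbation means smaller $\text{Var}_r(\Delta z)$ at the logits, and hence a smaller KL divergence after softmax.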
Practical Insight & Conclusion
Takeaways for Engineers:
- Evaluation Matters: Don't use MMLU/Classification scores as a proxy for how well your pruned model will chat or code.
- Target the Task: If your application is a vector database (retrieval) or a classifier, you can prune aggressively with almost no loss.
- Generation needs Post-Training: Training-free pruning is effectively a non-starter for generative tasks. Fine-tuning is mandatory to recalibrate the Softmax variance.
In conclusion, the bottleneck of pruning isn't the "loss of knowledge" in the embeddings—it's the nonlinear sensitivity of the probability output and the accumulation of errors over time.
