From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

WisPaper

Scholar Search

Scholar QA

Pricing

TrueCite

Workspace

Home

Blog

From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

From Signal Degradation to Computation Collapse: Why 2-Bit Quantization Fails

Summary

Problem

Method

Results

Takeaways

Abstract

This paper identifies two distinct failure modes in LLM Post-Training Quantization (PTQ): Signal Degradation and Computation Collapse. It reveals that while 4-bit quantization typically suffers from cumulative noise, 2-bit quantization triggers a "performance cliff" due to the total breakdown of internal computational mechanisms.

TL;DR

Why does 4-bit quantization work so well while 2-bit often leads to a total model meltdown? This paper uncovers that these aren't just different degrees of the same problem. They are two different beast: Signal Degradation (noise in the system) and Computation Collapse (the system itself breaks). While we can "denoise" the former with clever tricks, the latter requires a complete structural rebuild.

The Problem: The Catastrophic "Performance Cliff"

In the world of Large Language Models (LLMs), Post-Training Quantization (PTQ) is the magic that lets us run massive models on consumer hardware. 4-bit is the "sweet spot," but moving down to 2-bit usually results in the model babbling incoherent nonsense or repeating stop words.

Prior research historically viewed this as a numerical problem—just "too much error." However, this paper argues that at 2-bit, the model's internal logic—how it recalls facts and processes tokens—actually physically breaks down.

The Two Failure Modes Hypothesis

The researchers propose a diagnostic framework to categorize these failures:

Signal Degradation (Failure Mode I): Seen in 4-bit models. The "circuitry" is fine, but the data flowing through it is noisy. The correct answer is still there, buried under layers of cumulative precision loss.
Computation Collapse (Failure Mode II): Typical of 2-bit models. The "gates" and "switches" (Attention heads and FFN layers) stop working entirely. The signal isn't just noisy; it's deleted in the very first layers of the model.

Methodology: Peering into the "Ghost in the Machine"

To prove this, the authors didn't just look at accuracy scores; they looked at the Logit Lens (mapping hidden states back to words) and Causal Tracing (patching "clean" data into a "broken" model).

1. Internal Representation Breakdown

Using Centered Kernel Alignment (CKA), they visualized how similar a quantized model's internal "thought process" is to the original FP16 version.

CKA Topology Analysis In the image above, the 4-bit model maintains a clear diagonal (meaning it's thinking similarly to the original), while the 2-bit model is a "pitch black" void, indicating total structural collapse.

2. Component Malfunction

The study found that in 2-bit models:

Attention Mechanisms lose their focus, with entropy skyrocketing. They stop "looking" at the right tokens.
FFN (Feed-Forward Networks), which act as the model's memory, suffer from a high "Sign Flip Rate." Essentially, the "on/off" switches for neurons start flipping randomly, causing the model to retrieve entirely wrong concepts.

Experimental Proof: Repairability

The authors designed a "Two-Stage Repair" for the Signal Degradation mode:

Source Protection: Keep the first two layers (the most sensitive ones) at a slightly higher precision.
Peak Signal Amplification: Boosting the "confidence" of the internal signals at the layers where they are strongest.

Logit Trajectory Results As seen above, the orange line shows how these interventions can "rescue" a 4-bit model's logic, bringing it close to the original FP16 baseline.

The Catch: These same interventions did nothing for 2-bit models. Even when the researchers gave 2-bit layers "perfect" input from an FP16 model, the 2-bit layers immediately destroyed the information. This proves that 2-bit components are fundamentally non-functional.

Critical Insight & Future Outlook

This work shifts the quantization conversation from "how much error is okay?" to "is the model's logic still intact?"

Key Takeaways for Engineers:

4-bit PTQ is highly salvageable. If your 4-bit model is failing, look at protecting the early layers or amplifying logits.
2-bit PTQ currently cannot be fixed by simple "calibration" or numerical tricks. If you need 2-bit, you likely need Quantization-Aware Training (QAT) or structural reconstruction to teach the model how to function with such low precision.

This diagnostic framework provides a roadmap for the next generation of "principled" quantization, moving away from trial-and-error toward a mechanistically grounded approach to model compression.

Find Similar Papers

Try Our Examples

Find recent papers investigating the "performance cliff" in LLM quantization, specifically focusing on the transition between 4-bit and 2-bit precision.
Which studies first identified the "RMSNorm Reversal" effect in quantized models, and how does it relate to the Computation Collapse mode described here?
Explore research that applies low-rank adaptation (LoRA) or fine-tuning specifically to repair structural breakdown in 2-bit quantized Large Language Models.

Contents

From Signal Degradation to Computation Collapse: Why 2-Bit Quantization Fails

1. TL;DR

2. The Problem: The Catastrophic "Performance Cliff"

3. The Two Failure Modes Hypothesis

4. Methodology: Peering into the "Ghost in the Machine"

4.1. 1. Internal Representation Breakdown

4.2. 2. Component Malfunction

5. Experimental Proof: Repairability

6. Critical Insight & Future Outlook