The paper investigates the mechanistic link between "massive activations" (extreme outliers in hidden channels) and "attention sinks" (disproportionate attention mass on specific tokens) in Transformer LLMs. It demonstrates that these phenomena are decoupled architectural artifacts of Pre-Norm designs that can be independently suppressed without degrading performance.
TL;DR
Why do LLMs exhibit extreme numerical outliers and "sink" their attention into seemingly meaningless tokens like the first position? A new study reveals that these two phenomena are not "bugs" but "architectural artifacts." By tracing the life cycle of these spikes, the authors show that they are generated by a specific interaction between the SwiGLU FFN and RMSNorm. Crucially, these can be decoupled: you can kill the spikes (to help quantization) without breaking the model's logic.
Executive Summary
In the world of Large Language Models (LLMs), two ghosts have haunted researchers:
- Massive Activations: A few channels in a few tokens that suddenly spike to values 1000x larger than others.
- Attention Sinks: The weird tendency for models to dump huge amounts of attention mass onto the initial token (BOS), regardless of its relevance.
This paper provides a "mechanistic anatomy" of these phenomena. It argues that massive activations act as global implicit parameters, while attention sinks act as local routing modulators. Most importantly, the paper presents evidence that both are products of the Pre-Norm architecture and short-context training, not inherent requirements of intelligence.
The Life Cycle of a Spike
The authors identify a clear "Rise–Plateau–Fall" trajectory for massive activations across the layers of models like Llama and Qwen.
1. The Step-Up Blocks
Early in the network (usually around block 4 or 8), specific Feed-Forward blocks act as directional quadratic amplifiers. When a token's representation aligns with a specific "trigger direction" (most often the first token's, due to its unique attention profile), both branches of the SwiGLU unit respond to that direction, so their product grows roughly with the square of the signal and amplifies it into a massive outlier.
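To make the "quadratic amplifier" idea concrete, here is a minimal sketch of a single SwiGLU hidden unit. It assumes, purely for illustration, that the gate and up projections share one hypothetical trigger direction; the vectors and sizes are made up and are not taken from the paper.

```python
import math

def silu(z: float) -> float:
    """SiLU / swish activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu_unit(x, w_gate, w_up):
    """One hidden unit of a SwiGLU FFN: silu(w_gate . x) * (w_up . x)."""
    gate = sum(g * xi for g, xi in zip(w_gate, x))
    up = sum(u * xi for u, xi in zip(w_up, x))
    return silu(gate) * up

# Hypothetical trigger direction (unit length), shared by both projections.
trigger = [0.6, 0.8, 0.0, 0.0]

aligned = [3.0, 4.0, 0.0, 0.0]      # projection onto the trigger is 5.0
orthogonal = [0.0, 0.0, 5.0, 0.0]   # same norm, projection is 0.0

h_small = swiglu_unit(aligned, trigger, trigger)
h_big = swiglu_unit([10 * v for v in aligned], trigger, trigger)
h_off = swiglu_unit(orthogonal, trigger, trigger)

# A 10x larger aligned input yields roughly a 100x larger hidden value,
# while an input orthogonal to the trigger produces nothing.
print(h_small, h_big, h_big / h_small, h_off)
```

For large positive inputs SiLU is close to the identity, so the unit behaves like the square of the projection onto the trigger. That is why only tokens aligned with the direction are blown up.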
2. Residual Accumulation
Because modern LLMs use a "Pre-Norm" residual connection, where normalization is applied to each block's input rather than to the residual stream itself, these massive values are added to the "stream" and persist across almost all intermediate layers.
3. The Step-Down Blocks
Near the very end of the network, "Step-Down" blocks produce an additive inverse (a spike of equal magnitude but opposite sign) to neutralize the outlier before the final prediction head.
Figure 1: The magnitude of activations (top) and the specific blocks that inject/neutralize them (bottom) across Llama 2 and Qwen3.
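The three stages above can be sketched as a toy simulation of a single residual-stream channel. The block indices, the spike size, and the size of the ordinary updates are all made-up values chosen for illustration, not measurements from Llama or Qwen.

```python
def spike_trajectory(n_blocks=12, step_up=2, step_down=10, spike=1000.0, small=0.5):
    """Track one channel of a Pre-Norm residual stream: x <- x + block_output."""
    x = 1.0
    trace = []
    for block in range(n_blocks):
        if block == step_up:
            update = spike      # step-up block injects the outlier
        elif block == step_down:
            update = -spike     # step-down block writes the additive inverse
        else:
            update = small      # ordinary blocks write small, bounded updates
        x = x + update          # residual add: the stream itself is never rescaled
        trace.append(x)
    return trace

trace = spike_trajectory()
for block, value in enumerate(trace):
    print(block, value)
```

Because every block only adds to the stream, the outlier stays in place until some block explicitly writes the opposite value. That gives the rise, the long plateau, and the late fall.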
From Spikes to Sinks: The Geometric Bridge
How does a numerical spike in a channel lead to an attention sink? The bridge is Normalization (RMSNorm).
When a token has massive outliers in a few channels, the RMSNorm operation does three things:
- Bounded Range: It squashes the massive values back to a stable, bounded scale.
- Sparsification: Because the outliers dominate the norm, every other "normal" channel gets suppressed toward zero.
- Near-Constancy: Different tokens with similar outliers become almost identical vectors after normalization (Cosine similarity near 1.0).
This creates a stable geometric anchor. Attention heads can easily learn to separate these "constant" sink keys from normal "semantic" keys.
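The three effects listed above can be checked on a small example. This is a minimal sketch that assumes RMSNorm without its learned gain and uses two made-up 8-channel tokens sharing one massive outlier; the numbers are illustrative only. Without the gain, each normalized component is at most sqrt(d) in magnitude, where d is the number of channels.

```python
import math

def rmsnorm(x, eps=1e-6):
    """RMSNorm without the learned gain: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

# Two made-up tokens: different ordinary channels, same outlier in channel 0.
token_a = [1000.0, 0.5, -1.2, 0.8, 0.3, -0.7, 1.1, -0.4]
token_b = [1000.0, -0.9, 0.4, -1.5, 1.0, 0.2, -0.6, 0.7]

norm_a, norm_b = rmsnorm(token_a), rmsnorm(token_b)

max_component = max(abs(v) for v in norm_a)        # bounded range
largest_normal = max(abs(v) for v in norm_a[1:])   # sparsification
similarity = cosine(norm_a, norm_b)                # near-constancy
baseline = cosine(token_a[1:], token_b[1:])        # same tokens, outlier removed

print(max_component, largest_normal, similarity, baseline)
```

With the outlier present the two tokens are nearly indistinguishable after normalization, even though their ordinary channels point in quite different directions.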
Figure 2: t-SNE visualization showing how sink keys (red) are geometrically isolated from ordinary keys, allowing queries to easily "dump" attention there.
Decoupling the Phenomena: The Ablation Evidence
The most striking finding is that massive activations can be eliminated without losing performance or the attention sink behavior.
- Normalization Ablation: By using Sandwich Norm (adding a norm after the block) or QKNorm, the researchers virtually eliminated spikes (reducing "Spike" values from 3818 to 92) while maintaining the model's accuracy.
- Sinks as Implicit Gating: When the authors added Conditional Gating, the attention sinks disappeared. This suggests that the model only uses "sinks" as a desperate workaround to "turn off" heads that it doesn't need for a specific token.
- The Context Length Factor: Sinks are a byproduct of training on short sequences. If a model is trained exclusively on long contexts, the "sink ratio" collapses, as the model no longer finds the first token a useful global anchor.
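The "implicit gating" reading of sinks follows from a property of softmax: attention weights always sum to 1, so a head can only produce a near-zero output by placing its mass on a token whose value is near zero. The sketch below contrasts that with an explicit gate. The sigmoid gate is a hypothetical stand-in and not necessarily the paper's Conditional Gating formulation; all scores and values are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Scalar-valued attention: weighted sum of values under softmax(scores)."""
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values)), weights

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

values = [0.0, 2.0, -1.0, 3.0]   # position 0 plays the sink, with a zero value

# Implicit gating: a high score on the sink drains mass from content tokens.
out_sink, w_sink = attend([8.0, 1.0, 0.5, 1.5], values)

# Explicit gating (hypothetical): no sink preference, output scaled by a gate.
out_raw, w_flat = attend([0.0, 1.0, 0.5, 1.5], values)
out_gated = sigmoid(-6.0) * out_raw

print(w_sink[0], out_sink)    # almost all mass on the sink, output near zero
print(w_flat[0], out_gated)   # little mass on position 0, output still near zero
```

Both routes silence the head. Only the first one needs a token that absorbs attention, which is consistent with sinks disappearing once an explicit gate is available.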
Critical Insight: Implications for the Future
The co-occurrence of spikes and sinks is incidental, not functional.
For engineers, this is great news:
- Quantization: We can use architectures like QKNorm to prevent outliers during training, making 4-bit or 8-bit quantization much easier later.
- Inference Robustness: Extreme activation values are hard to represent accurately in low-precision arithmetic. Eliminating them makes models more robust.
- Efficiency: Understanding that sinks are just "implicit gates" means we could replace them with explicit, more efficient gating mechanisms.
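The quantization point can be illustrated with a minimal sketch of symmetric per-tensor absmax quantization, one common scheme chosen here as an assumption; the paper may evaluate other schemes. The activation values are made up.

```python
def quantize_absmax(x, bits=8):
    """Symmetric per-tensor quantization: scale by max |x|, round, dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    return [round(v / scale) * scale for v in x]

def mean_abs_error(x, x_hat):
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

normal = [0.5, -1.2, 0.8, 0.3, -0.7, 1.1, -0.4]
with_spike = [1000.0] + normal

# Error on the ordinary channels when they set the scale themselves.
err_clean = mean_abs_error(normal, quantize_absmax(normal))

# Error on the same channels when a single spike sets the scale.
spiked_hat = quantize_absmax(with_spike)[1:]
err_spiked = mean_abs_error(normal, spiked_hat)

print(err_clean, err_spiked, spiked_hat)
```

With the spike present, the 8-bit step size is about 7.9, so every ordinary channel rounds to zero. Removing the outlier at training time avoids this loss without special handling at inference.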
Conclusion
This "Anatomy" suggests that the "weird" behaviors of Transformers aren't mysteries of emergent intelligence, but rather predictable results of how we build them (Pre-Norm) and how we train them (Short Context). By changing the normalization and gating, we can build cleaner, more efficient models that don't need to "scream" in a few channels just to stay focused.
