The paper introduces Attention Residuals (AttnRes), a novel architectural modification that replaces standard fixed-weight residual connections with a learned softmax attention mechanism over depth. By allowing each layer to selectively aggregate earlier representations using input-dependent weights, the method achieves SOTA performance on the Kimi Linear 48B model and consistently outperforms standard PreNorm baselines across various compute scales.
TL;DR
Standard LLMs aggregate information across depth using simple addition ($h_l = h_{l-1} + f_l(h_{l-1})$), which treats every layer's contribution equally and leads to signal dilution. Attention Residuals (AttnRes) by the Kimi Team replaces this with depth-wise softmax attention. By allowing layers to selectively "query" the outputs of any preceding layer, the model manages hidden-state growth better and achieves superior reasoning capabilities. At scale, Block AttnRes provides a 1.25x compute efficiency gain with negligible training and inference overhead.
The Problem: The "Dilution" of Depth
In the current PreNorm era (used by Llama, GPT-4, etc.), each layer adds its output to a running sum. This causes two major issues:
- Monotonic Magnitude Growth: The hidden state's norm grows monotonically with depth (roughly as $\sqrt{L}$ after $L$ layers if layer outputs are uncorrelated). To have any impact, deeper layers must produce increasingly large outputs, leading to numerical instability. A short simulation below illustrates this.
- No Selective Retrieval: A layer at depth 80 sees a "blurry" average of the previous 79 layers. It cannot specifically look back at the output of layer 5 to retrieve a low-level feature.
The situation is analogous to RNNs before the Transformer: information is crushed into a single hidden state vector. AttnRes aims to do for depth what Self-Attention did for sequences.
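As a concrete (if simplified) illustration, the toy simulation below adds one independent, unit-variance random vector per layer — an assumption of this sketch, not the paper's setup — and prints how the residual stream's norm drifts upward with depth under plain addition:

```python
import torch

torch.manual_seed(0)
d_model, depth = 1024, 80
h = torch.randn(d_model)                     # stand-in for the token embedding h_0
for layer in range(1, depth + 1):
    h = h + torch.randn(d_model)             # stand-in for a layer output f_l(.)
    if layer in (1, 10, 40, 80):
        print(f"after layer {layer:2d}: ||h|| = {h.norm():6.1f}")
# With roughly independent unit-variance contributions, ||h|| grows like sqrt(L),
# so each new layer's output is a progressively smaller fraction of the stream.
```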
Methodology: Attention Over Depth
1. From Recurrence to Retrieval
Instead of $h_l = h_{l-1} + f_l(h_{l-1})$ with fixed, equal weights, AttnRes uses:

$$\tilde{h}_{l-1} = \sum_{j < l} \alpha_{l,j}\, h_j, \qquad h_l = \tilde{h}_{l-1} + f_l(\tilde{h}_{l-1}),$$

where $\alpha_{l,j}$ is a softmax-normalized attention weight over the outputs of the preceding layers. Each layer has a learned pseudo-query vector that decides which previous layers are important for the current transformation.
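To make the update concrete, here is a minimal PyTorch sketch. The details are assumptions of this sketch rather than the paper's implementation: names like `AttnResBlock` and `pseudo_query` are made up, the pseudo-query scores a mean-pooled summary of each earlier layer's output, and a small MLP stands in for the full attention + MLP sub-layers.

```python
import torch
import torch.nn as nn

class AttnResBlock(nn.Module):
    """One layer with a depth-wise attention residual (illustrative sketch)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(              # stand-in for attention + MLP sub-layers
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model),
        )
        # Learned pseudo-query that scores the outputs of all preceding layers.
        self.pseudo_query = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history[j]: output of layer j, shape (batch, seq, d_model); history[0] = h_0.
        stacked = torch.stack(history)                                # (depth, B, T, D)
        pooled = stacked.mean(dim=(1, 2))                             # one summary per layer
        weights = torch.softmax(pooled @ self.pseudo_query, dim=0)    # softmax over depth
        mixed = torch.einsum("l,lbtd->btd", weights, stacked)         # selective aggregation
        return mixed + self.mlp(self.norm(mixed))                     # residual update f_l

# Usage: each layer reads the full history instead of a single running sum.
layers = nn.ModuleList(AttnResBlock(64) for _ in range(6))
history = [torch.randn(2, 16, 64)]                                    # token embeddings h_0
for layer in layers:
    history.append(layer(history))
```

In the actual model the aggregation is presumably computed per token rather than over a pooled summary; the sketch only conveys the depth-wise softmax structure.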
2. Block AttnRes: Scaling to Billions
While "Full AttnRes" requires storing all layer outputs (expensive for Pipeline Parallelism), Block AttnRes partitions the layers into blocks (typically ).
- Intra-block: Layers use standard residuals.
- Inter-block: Layers attend to the compressed block representations.
Figure 1: Comparison between Standard Residuals (Left) and Attention Residuals (Right).
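Here is a sketch of the block variant, again with assumed details: the paper's compression of block representations is not specified above, so this version scores finished blocks with a mean-pooled key and mixes their final hidden states, and names like `BlockAttnResModel` are hypothetical.

```python
import torch
import torch.nn as nn

class BlockAttnResModel(nn.Module):
    """Block AttnRes sketch: standard residuals inside a block, depth-wise
    softmax attention over the outputs of previously finished blocks."""
    def __init__(self, n_layers: int = 32, n_blocks: int = 8, d_model: int = 64):
        super().__init__()
        assert n_layers % n_blocks == 0
        self.block_size = n_layers // n_blocks
        self.layers = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model),
                          nn.GELU(), nn.Linear(d_model, d_model))
            for _ in range(n_layers)
        )
        # One pseudo-query per block for the inter-block attention.
        self.block_queries = nn.Parameter(torch.randn(n_blocks, d_model) / d_model ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:               # x: (B, T, d_model)
        h, block_outputs = x, []
        for b in range(len(self.block_queries)):
            if block_outputs:                                          # inter-block attention
                keys = torch.stack([o.mean(dim=(0, 1)) for o in block_outputs])   # (k, D)
                w = torch.softmax(keys @ self.block_queries[b], dim=0)             # (k,)
                h = h + torch.einsum("k,kbtd->btd", w, torch.stack(block_outputs))
            for layer in self.layers[b * self.block_size:(b + 1) * self.block_size]:
                h = h + layer(h)                                       # intra-block: standard residual
            block_outputs.append(h)                                    # remember this block's output
        return h

out = BlockAttnResModel()(torch.randn(2, 16, 64))                      # quick smoke test
```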
3. Infrastructure Optimization
To keep training fast, Kimi introduced Cross-Stage Caching, ensuring that in pipeline-parallel training, only new block representations are sent across GPUs. For inference, a Two-Phase Computation strategy batches inter-block attention queries, ensuring that the latency overhead is kept under 2%.
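The caching idea can be illustrated with a toy sketch (entirely hypothetical structure, not Kimi's infrastructure code): the downstream pipeline stage keeps the block representations it has already received, so each transfer only carries the new ones.

```python
# Toy model of Cross-Stage Caching (hypothetical; real systems send tensors over NCCL).
class StageLink:
    """Simulated GPU-to-GPU link that only transmits unseen block representations."""
    def __init__(self):
        self.already_sent: set[int] = set()
        self.transfers = 0

    def send_new(self, block_reps: dict[int, str]) -> dict[int, str]:
        new = {i: r for i, r in block_reps.items() if i not in self.already_sent}
        self.already_sent.update(new)
        self.transfers += len(new)
        return new

link, downstream_cache = StageLink(), {}
upstream_reps: dict[int, str] = {}
for step in range(4):                       # upstream stage finishes one block per step
    upstream_reps[step] = f"block_rep_{step}"
    downstream_cache.update(link.send_new(upstream_reps))

print(sorted(downstream_cache))             # [0, 1, 2, 3] -- downstream still sees every block
print(link.transfers)                       # 4 transfers instead of 1 + 2 + 3 + 4 = 10
```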
Experiments & Results: Consistent Gains
Scaling Laws
The authors validated AttnRes across 5 model sizes. The result is clear: AttnRes shifts the scaling curve downward. At a fixed compute budget, AttnRes achieves a lower loss, effectively providing a 1.25x compute advantage over standard Transformers (i.e., the baseline needs roughly 25% more FLOPs to match AttnRes's loss).
Figure 2: Scaling Law behavior showing AttnRes consistently outperforming the baseline.
Downstream Reasoning
On the 48B Kimi Linear model, the most significant jumps were seen in complex reasoning and coding (where long-term dependency in depth likely matters most):
- GPQA-Diamond: 36.9 → 44.4 (+7.5)
- HumanEval: 59.1 → 62.2 (+3.1)
- Minerva Math: 53.5 → 57.1 (+3.6)
Deep Insight: Visualizing Depth-Wise Attention
When visualizing what the layers "look at" (Fig 8), AttnRes reveals fascinating patterns (a small synthetic plotting sketch follows Figure 3):
- Locality: Most layers still rely on their immediate predecessor (diagonal dominance).
- The "Attention Sink": The token embedding () is persistently queried by almost all layers, serving as an "anchor."
- Specialization: Attention layers have a broader "receptive field" in depth compared to MLP layers.
Figure 3: Learned attention weights across depth for Full vs. Block AttnRes.
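For intuition about what such a map looks like, the sketch below hand-constructs a weight matrix that mimics the qualitative patterns above (diagonal dominance plus a sink on $h_0$); it is a synthetic illustration, not data extracted from the model:

```python
import torch
import matplotlib.pyplot as plt

depth = 12
logits = torch.full((depth, depth), float("-inf"))   # -inf masks "future" (deeper) layers
for l in range(1, depth):
    logits[l, :l] = 0.1 * torch.randn(l)             # weak background scores
    logits[l, l - 1] += 3.0                          # locality: immediate predecessor dominates
    logits[l, 0] += 1.5                              # "attention sink" on the token embedding
weights = torch.softmax(logits, dim=-1)
weights[0] = 0.0                                     # layer 0 attends to nothing

plt.imshow(weights.numpy(), origin="lower", cmap="viridis")
plt.xlabel("attended layer j")
plt.ylabel("querying layer l")
plt.title("Depth-wise attention weights (synthetic illustration)")
plt.colorbar()
plt.savefig("depth_attention.png")
```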
Conclusion and Future Outlook
Attention Residuals marks a significant shift in how we think about the "Residual Stream." By moving from fixed addition to learned selection, AttnRes addresses the layer-dilution problem and allows for deeper, more expressive models.
The fact that Block AttnRes (with only 8 blocks) recovers almost all the gain of Full AttnRes (dozens of layers) suggests that LLMs don't need to look at everything at once—they just need a few high-quality checkpoints from the past. As memory bandwidth increases, we can expect "depth-wise attention" to become a standard feature in high-performance LLM architectures.
