The paper introduces Attention Residuals (AttnRes), a novel architectural modification that replaces standard fixed-weight residual connections with a learned softmax attention mechanism over depth. By allowing each layer to selectively aggregate earlier representations using input-dependent weights, the method achieves SOTA performance on the Kimi Linear 48B model and consistently outperforms standard PreNorm baselines across various compute scales.
TL;DR
Standard LLMs aggregate information across depth using simple addition ($h_l = h_{l-1} + f_l(h_{l-1})$), which treats every layer's contribution equally and leads to signal dilution. Attention Residuals (AttnRes) by the Kimi Team replaces this with depth-wise softmax attention. By allowing layers to selectively "query" the outputs of any preceding layer, the model manages hidden-state growth better and achieves superior reasoning capabilities. At scale, Block AttnRes provides a 1.25x compute efficiency gain with negligible training and inference overhead.
The Problem: The "Dilution" of Depth
In the current PreNorm era (used by Llama, GPT-4, etc.), each layer adds its output to a running sum. This causes two major issues:
- Monotonic Magnitude Growth: The hidden state's norm grows monotonically with depth (roughly as $\sqrt{L}$ after $L$ layers if layer outputs are uncorrelated). To have any impact, deeper layers must produce increasingly large outputs, leading to numerical instability. A short simulation below illustrates this.
- No Selective Retrieval: A layer at depth 80 sees a "blurry" average of the previous 79 layers. It cannot specifically look back at the output of layer 5 to retrieve a low-level feature.
The situation is analogous to RNNs before the Transformer: information is crushed into a single hidden state vector. AttnRes aims to do for depth what Self-Attention did for sequences.
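As a concrete (if simplified) illustration, the toy simulation below adds one independent, unit-variance random vector per layer — an assumption of this sketch, not the paper's setup — and prints how the residual stream's norm drifts upward with depth under plain addition:

```python
import torch

torch.manual_seed(0)
d_model, depth = 1024, 80
h = torch.randn(d_model)                     # stand-in for the token embedding h_0
for layer in range(1, depth + 1):
    h = h + torch.randn(d_model)             # stand-in for a layer output f_l(.)
    if layer in (1, 10, 40, 80):
        print(f"after layer {layer:2d}: ||h|| = {h.norm():6.1f}")
# With roughly independent unit-variance contributions, ||h|| grows like sqrt(L),
# so each new layer's output is a progressively smaller fraction of the stream.
```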
Methodology: Attention Over Depth
1. From Recurrence to Retrieval
Instead of $h_l = h_{l-1} + f_l(h_{l-1})$ with fixed, equal weights, AttnRes uses:

$$\tilde{h}_{l-1} = \sum_{j < l} \alpha_{l,j}\, h_j, \qquad h_l = \tilde{h}_{l-1} + f_l(\tilde{h}_{l-1}),$$

where $\alpha_{l,j}$ is a softmax-normalized attention weight over the outputs of the preceding layers. Each layer has a learned pseudo-query vector that decides which previous layers are important for the current transformation.
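To make the update concrete, here is a minimal PyTorch sketch. The details are assumptions of this sketch rather than the paper's implementation: names like `AttnResBlock` and `pseudo_query` are made up, the pseudo-query scores a mean-pooled summary of each earlier layer's output, and a small MLP stands in for the full attention + MLP sub-layers.

```python
import torch
import torch.nn as nn

class AttnResBlock(nn.Module):
    """One layer with a depth-wise attention residual (illustrative sketch)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(              # stand-in for attention + MLP sub-layers
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model),
        )
        # Learned pseudo-query that scores the outputs of all preceding layers.
        self.pseudo_query = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history[j]: output of layer j, shape (batch, seq, d_model); history[0] = h_0.
        stacked = torch.stack(history)                                # (depth, B, T, D)
        pooled = stacked.mean(dim=(1, 2))                             # one summary per layer
        weights = torch.softmax(pooled @ self.pseudo_query, dim=0)    # softmax over depth
        mixed = torch.einsum("l,lbtd->btd", weights, stacked)         # selective aggregation
        return mixed + self.mlp(self.norm(mixed))                     # residual update f_l

# Usage: each layer reads the full history instead of a single running sum.
layers = nn.ModuleList(AttnResBlock(64) for _ in range(6))
history = [torch.randn(2, 16, 64)]                                    # token embeddings h_0
for layer in layers:
    history.append(layer(history))
```

In the actual model the aggregation is presumably computed per token rather than over a pooled summary; the sketch only conveys the depth-wise softmax structure.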
2. Block AttnRes: Scaling to Billions
While "Full AttnRes" requires storing all layer outputs (expensive for Pipeline Parallelism), Block AttnRes partitions the layers into blocks (typically ).
- Intra-block: Layers use standard residuals.
- Inter-block: Layers attend to the compressed block representations.
Figure 1: Comparison between Standard Residuals (Left) and Attention Residuals (Right).
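Here is a sketch of the block variant, again with assumed details: the paper's compression of block representations is not specified above, so this version scores finished blocks with a mean-pooled key and mixes their final hidden states, and names like `BlockAttnResModel` are hypothetical.

```python
import torch
import torch.nn as nn

class BlockAttnResModel(nn.Module):
    """Block AttnRes sketch: standard residuals inside a block, depth-wise
    softmax attention over the outputs of previously finished blocks."""
    def __init__(self, n_layers: int = 32, n_blocks: int = 8, d_model: int = 64):
        super().__init__()
        assert n_layers % n_blocks == 0
        self.block_size = n_layers // n_blocks
        self.layers = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model),
                          nn.GELU(), nn.Linear(d_model, d_model))
            for _ in range(n_layers)
        )
        # One pseudo-query per block for the inter-block attention.
        self.block_queries = nn.Parameter(torch.randn(n_blocks, d_model) / d_model ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:               # x: (B, T, d_model)
        h, block_outputs = x, []
        for b in range(len(self.block_queries)):
            if block_outputs:                                          # inter-block attention
                keys = torch.stack([o.mean(dim=(0, 1)) for o in block_outputs])   # (k, D)
                w = torch.softmax(keys @ self.block_queries[b], dim=0)             # (k,)
                h = h + torch.einsum("k,kbtd->btd", w, torch.stack(block_outputs))
            for layer in self.layers[b * self.block_size:(b + 1) * self.block_size]:
                h = h + layer(h)                                       # intra-block: standard residual
            block_outputs.append(h)                                    # remember this block's output
        return h

out = BlockAttnResModel()(torch.randn(2, 16, 64))                      # quick smoke test
```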
3. Infrastructure Optimization
To keep training fast, Kimi introduced Cross-Stage Caching, ensuring that in pipeline-parallel training, only new block representations are sent across GPUs. For inference, a Two-Phase Computation strategy batches inter-block attention queries, ensuring that the latency overhead is kept under 2%.
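The caching idea can be illustrated with a toy sketch (entirely hypothetical structure, not Kimi's infrastructure code): the downstream pipeline stage keeps the block representations it has already received, so each transfer only carries the new ones.

```python
# Toy model of Cross-Stage Caching (hypothetical; real systems send tensors over NCCL).
class StageLink:
    """Simulated GPU-to-GPU link that only transmits unseen block representations."""
    def __init__(self):
        self.already_sent: set[int] = set()
        self.transfers = 0

    def send_new(self, block_reps: dict[int, str]) -> dict[int, str]:
        new = {i: r for i, r in block_reps.items() if i not in self.already_sent}
        self.already_sent.update(new)
        self.transfers += len(new)
        return new

link, downstream_cache = StageLink(), {}
upstream_reps: dict[int, str] = {}
for step in range(4):                       # upstream stage finishes one block per step
    upstream_reps[step] = f"block_rep_{step}"
    downstream_cache.update(link.send_new(upstream_reps))

print(sorted(downstream_cache))             # [0, 1, 2, 3] -- downstream still sees every block
print(link.transfers)                       # 4 transfers instead of 1 + 2 + 3 + 4 = 10
```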
Experiments & Results: Consistent Gains
Scaling Laws
The authors validated AttnRes across 5 model sizes. The result is clear: AttnRes shifts the scaling curve downward. At a fixed compute budget, AttnRes achieves a lower loss, effectively providing a 1.25x compute advantage over standard Transformers (i.e., the baseline needs roughly 25% more FLOPs to match AttnRes's loss).
Figure 2: Scaling Law behavior showing AttnRes consistently outperforming the baseline.
Downstream Reasoning
On the 48B Kimi Linear model, the most significant jumps were seen in complex reasoning and coding (where long-term dependency in depth likely matters most):
- GPQA-Diamond: 36.9 → 44.4 (+7.5)
- HumanEval: 59.1 → 62.2 (+3.1)
- Minerva Math: 53.5 → 57.1 (+3.6)
Deep Insight: Visualizing Depth-Wise Attention
When visualizing what the layers "look at" (Fig 8), AttnRes reveals fascinating patterns (a small synthetic plotting sketch follows Figure 3):
- Locality: Most layers still rely on their immediate predecessor (diagonal dominance).
- The "Attention Sink": The token embedding () is persistently queried by almost all layers, serving as an "anchor."
- Specialization: Attention layers have a broader "receptive field" in depth compared to MLP layers.
Figure 3: Learned attention weights across depth for Full vs. Block AttnRes.
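For intuition about what such a map looks like, the sketch below hand-constructs a weight matrix that mimics the qualitative patterns above (diagonal dominance plus a sink on $h_0$); it is a synthetic illustration, not data extracted from the model:

```python
import torch
import matplotlib.pyplot as plt

depth = 12
logits = torch.full((depth, depth), float("-inf"))   # -inf masks "future" (deeper) layers
for l in range(1, depth):
    logits[l, :l] = 0.1 * torch.randn(l)             # weak background scores
    logits[l, l - 1] += 3.0                          # locality: immediate predecessor dominates
    logits[l, 0] += 1.5                              # "attention sink" on the token embedding
weights = torch.softmax(logits, dim=-1)
weights[0] = 0.0                                     # layer 0 attends to nothing

plt.imshow(weights.numpy(), origin="lower", cmap="viridis")
plt.xlabel("attended layer j")
plt.ylabel("querying layer l")
plt.title("Depth-wise attention weights (synthetic illustration)")
plt.colorbar()
plt.savefig("depth_attention.png")
```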
Conclusion and Future Outlook
Attention Residuals marks a significant shift in how we think about the "Residual Stream." By moving from fixed addition to learned selection, AttnRes addresses the layer-dilution problem and allows for deeper, more expressive models.
The fact that Block AttnRes (with only 8 blocks) recovers almost all the gain of Full AttnRes (dozens of layers) suggests that LLMs don't need to look at everything at once—they just need a few high-quality checkpoints from the past. As memory bandwidth increases, we can expect "depth-wise attention" to become a standard feature in high-performance LLM architectures.
