Stem: Rethinking Causal Information Flow in Sparse Attention


[Stem] Rethinking Causal Information Flow: The Initial Tokens Are Your Model's "Stem"

Abstract

Stem is a novel, training-free sparse attention framework designed to accelerate the pre-filling phase of Large Language Models (LLMs). It introduces a Token Position-Decay (TPD) strategy and an Output-Aware Metric (OAM) to align sparsity with causal information flow, achieving superior accuracy while significantly reducing computational overhead.

TL;DR

The computational bottleneck of Large Language Models (LLMs) often lies in the quadratic complexity of self-attention during the pre-filling phase. Stem is a plug-and-play sparsity module that rethinks pruning through the lens of Information Flow. By recognizing that early tokens are "recursive anchors" that impact all subsequent representations, Stem uses a Position-Decay strategy and a Value-Magnitude metric to achieve 3.7x speedups with near-zero accuracy loss.

The "Uniform Pruning" Fallacy

Most current sparse attention methods (like H2O or MInference) treat all token positions equally, applying a fixed top-k budget across an entire layer. However, the authors of Stem argue this is fundamentally flawed in causal architectures.

In a decoder-only Transformer, the $n$-th token aggregates information from tokens $1$ to $n$. For a sequence of $N$ tokens, this means:

Token 1 participates in the calculation of every subsequent token.
Token N only participates in the calculation of the final output.

Pruning an early token creates a "global distortion" that propagates and amplifies recursively through every subsequent layer. Conversely, pruning a later token only causes local errors.
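The asymmetry described above can be counted directly from a causal mask. The following is a minimal sketch, not code from the paper; the function name `downstream_counts` is made up for illustration. It counts, for each key position, how many query positions are allowed to read it.

```python
import numpy as np

def downstream_counts(n_tokens: int) -> np.ndarray:
    """For each key position j, count the query positions that can attend to it."""
    # Lower-triangular causal mask: row i (query) may see columns 0..i (keys).
    causal_mask = np.tril(np.ones((n_tokens, n_tokens), dtype=int))
    # Column sums give the number of queries that read each key.
    return causal_mask.sum(axis=0)

counts = downstream_counts(8)
print(counts)  # [8 7 6 5 4 3 2 1]
```

The first token feeds all 8 positions while the last feeds only itself, and this is within a single layer; stacking layers compounds the effect, which is the recursive propagation the authors point to.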

Figure 1: Recursive error propagation. Visualizing how pruning initial tokens (red) vs. late tokens affects the global dependency chain.

Methodology: Safeguarding the Information Flow

1. Token Position-Decay (TPD)

Instead of a uniform budget, Stem implements a linear decay schedule. It starts with a high budget ($k_{start}$) for initial tokens, so that the "stem" of the information flow remains intact, and decays toward a much smaller budget ($k_{end}$) for later tokens, where redundancy is higher.
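A linear schedule of this kind might look like the sketch below. This is an illustration under assumptions, not the paper's implementation: the function name `linear_decay_budget`, the rounding rule, and the floor of one token are all choices made here, and the summary does not specify whether the real schedule is applied per token or per block.

```python
import numpy as np

def linear_decay_budget(n_positions: int, k_start: int, k_end: int) -> np.ndarray:
    """Per-position top-k budget that decays linearly from k_start to k_end."""
    if n_positions < 1:
        raise ValueError("n_positions must be positive")
    # np.linspace includes both endpoints, so position 0 gets k_start
    # and the last position gets k_end.
    budgets = np.linspace(k_start, k_end, num=n_positions)
    # Round to whole tokens and keep at least one token per position (assumption).
    return np.maximum(np.rint(budgets).astype(int), 1)

budgets = linear_decay_budget(n_positions=5, k_start=64, k_end=16)
print(budgets)  # [64 52 40 28 16]
```

The values 64 and 16 are placeholders; the summary does not report the budgets used in the experiments.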

2. Output-Aware Metric (OAM)

Standard methods select tokens based on Score-Aware Metrics (SAM), which essentially measure how well the Query matches the Key. Stem introduces OAM, built on the observation that even a token with a high match score contributes little to the output if its Value vector magnitude ($\|V\|_{2}$) is near zero.

The selection metric is derived as: $M_{i,j} = Q_{i} K_{j}^{\top} + \beta \cdot \max(0, \log(\|V_{j}\|_{2}))$. This formula ensures the model retains "high-energy signals" that actually impact the residual stream.
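To make the selection metric above concrete, here is a small dense NumPy sketch. It follows the formula as written (raw $Q K^{\top}$, no $1/\sqrt{d}$ scaling) and is not the paper's kernel: the function names, the `eps` guard against $\log 0$, the value of `beta`, and the per-query top-k selection are all assumptions made for illustration.

```python
import numpy as np

def output_aware_scores(Q, K, V, beta=1.0, eps=1e-12):
    """M[i, j] = Q_i . K_j + beta * max(0, log ||V_j||_2), with a causal mask."""
    routing = Q @ K.T                        # (n, n) raw query-key scores
    v_norm = np.linalg.norm(V, axis=-1)      # (n,) L2 norm of each value vector
    # Clipped log-magnitude; eps only guards against log(0).
    bonus = np.maximum(0.0, np.log(np.maximum(v_norm, eps)))
    scores = routing + beta * bonus[None, :]  # same bonus for every query row
    # Causal mask: query i may only select keys j <= i.
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return np.where(future, -np.inf, scores)

def select_top_k(scores, k):
    """Boolean keep-mask with up to k highest-scoring visible keys per query."""
    order = np.argsort(-scores, axis=1)[:, :k]
    keep = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(keep, order, True, axis=1)
    # Early queries see fewer than k keys; drop any masked positions picked up.
    return keep & np.isfinite(scores)

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
V[3] *= 1e-6  # a near-zero value vector: no bonus, only its routing score
M = output_aware_scores(Q, K, V, beta=0.5)
keep = select_top_k(M, k=2)
```

Because of the $\max(0, \cdot)$ clip, value vectors with norm below 1 are never penalized; they simply receive no bonus, so the metric reduces to the plain routing score for them.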

Figure 2: The Stem Pipeline showing the transition from coarse block-wise downsampling to fine-grained sparse aggregation.

Performance & Experiments

Stem was evaluated on heavyweight backbones including Llama-3.1-8B and Qwen3-8B across LongBench and RULER benchmarks.

Efficiency: At 128K context, Stem reduces pre-filling latency from 1540ms to 420ms (3.7x speedup).
Accuracy: On LongBench, Stem achieved an average score of 41.48% (Llama-3.1), nearly matching the Dense baseline (42.02%) while using only ~30% of the computation.
Versatility: It acts as a "booster" for training-based sparse models. When integrated into DeepSeek-V3.2, it further compressed the sparsity budget by 15% with zero performance degradation.
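As a quick sanity check on the figures above, the reported numbers can be plugged into a few lines of arithmetic. Only the four input values come from the text; the derived quantities are computed here.

```python
dense_ms, stem_ms = 1540.0, 420.0        # reported pre-filling latency at 128K context
dense_score, stem_score = 42.02, 41.48   # reported LongBench averages (Llama-3.1)

speedup = dense_ms / stem_ms
score_gap = dense_score - stem_score
retained = stem_score / dense_score

print(f"speedup: {speedup:.2f}x")       # speedup: 3.67x
print(f"gap: {score_gap:.2f} points")   # gap: 0.54 points
print(f"retained: {retained:.1%}")      # retained: 98.7%
```

The latency ratio rounds to the quoted 3.7x, and the accuracy gap to the Dense baseline is about half a point.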

Figure 3: Latency comparison. Speedup of Stem compared to Dense and other sparse methods like MInference.

Critical Insight: Why it Works

The genius of Stem lies in moving away from pure "similarity" (QK scores) and looking at "contribution" (Output magnitude). The theoretical derivation in the Appendix proves that minimizing the reconstruction error of the attention output naturally leads to a metric that weights the routing score by the Value norm. By combining this mathematical insight with the physical reality of causal position asymmetry, Stem finds the "Pareto optimal" point for LLM inference.
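A rough sketch of the intuition follows. It is not a reproduction of the Appendix derivation, and the symbols $O_i$, $A_{i,j}$, $Z_i$ and $S$ are introduced here only for illustration. Write the attention output for query $i$ as a softmax-weighted sum of value vectors:

$$
O_i = \sum_{j \le i} A_{i,j} V_j, \qquad A_{i,j} = \frac{\exp(Q_i K_j^{\top})}{Z_i}
$$

If a set $S$ of keys is dropped (ignoring renormalization of the remaining weights), the triangle inequality bounds the change in the output by the sum of the dropped contributions:

$$
\Big\| \sum_{j \in S} A_{i,j} V_j \Big\|_2 \le \sum_{j \in S} A_{i,j} \, \|V_j\|_2
$$

Taking the log of a single contribution gives

$$
\log\big(A_{i,j} \, \|V_j\|_2\big) = Q_i K_j^{\top} + \log \|V_j\|_2 - \log Z_i
$$

Since $\log Z_i$ is the same for every key of a given query, ranking keys by $A_{i,j} \|V_j\|_2$ is equivalent to ranking them by $Q_i K_j^{\top} + \log \|V_j\|_2$, which has the same shape as the OAM formula. The weight $\beta$ and the $\max(0, \cdot)$ clip are not explained by this sketch.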

Conclusion & Future Outlook

Stem demonstrates that we don't necessarily need smarter training to get faster LLMs; we need a better understanding of how information flows through existing ones. While it currently focuses on the pre-filling phase, the principles of Position-Decay could likely be extended to KV-cache eviction strategies during decoding, which may help ease the long-context memory wall.
