Stem is a novel, training-free sparse attention framework designed to accelerate the pre-filling phase of Large Language Models (LLMs). It introduces a Token Position-Decay (TPD) strategy and an Output-Aware Metric (OAM) to align sparsity with causal information flow, achieving superior accuracy while significantly reducing computational overhead.
TL;DR
The computational bottleneck of Large Language Models (LLMs) often lies in the quadratic complexity of self-attention during the pre-filling phase. Stem is a plug-and-play sparsity module that rethinks pruning through the lens of information flow. By recognizing that early tokens are "recursive anchors" that affect all subsequent representations, Stem combines a Token Position-Decay (TPD) strategy with an Output-Aware Metric (OAM) based on Value-vector magnitude, reaching a 3.7x pre-filling speedup at 128K context with near-zero accuracy loss.
The "Uniform Pruning" Fallacy
Most current sparse attention methods (like H2O or MInference) treat all token positions equally, applying a fixed top-k budget across an entire layer. However, the authors of Stem argue this is fundamentally flawed in causal architectures.
In a decoder-only Transformer, the $i$-th token aggregates information from tokens $1$ to $i$. This means:
- Token 1 participates in the calculation of every subsequent token.
- Token N only participates in the calculation of the final output.
Pruning an early token creates a "global distortion" that propagates and amplifies recursively through every subsequent layer. Conversely, pruning a later token only causes local errors.
Figure 1: Visualizing how pruning initial tokens (red) vs. late tokens affects the global dependency chain.
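The asymmetry described above can be made concrete with a small sketch. It builds a causal mask for a toy sequence and counts, for each token, how many query positions can read it directly within a single layer (the token itself included). The function names are illustrative and do not come from the paper, and the count covers only direct reads in one layer, not the recursive propagation across layers.

```python
def causal_mask(n_tokens: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n_tokens)] for i in range(n_tokens)]


def readers_per_token(mask: list[list[bool]]) -> list[int]:
    """Column sums of the mask: how many query positions read each key token."""
    n = len(mask)
    return [sum(mask[i][j] for i in range(n)) for j in range(n)]


mask = causal_mask(6)
print(readers_per_token(mask))  # [6, 5, 4, 3, 2, 1]
```

The first token is read by every position, while the last token is read only by itself, which is why a uniform budget spends the same effort on tokens with very different downstream reach.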
Methodology: Safeguarding the Information Flow
1. Token Position-Decay (TPD)
Instead of a uniform budget, Stem implements a linear decay schedule. It starts with a high budget for initial tokens, so that the "stem" of the information flow remains intact, and aggressively reduces the budget for later tokens, where redundancy is higher.
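A linear decay schedule of this kind might look like the sketch below. It is a minimal illustration under stated assumptions: the budget is expressed as a keep ratio per position, and the names `start_ratio` and `end_ratio` as well as the example values are hypothetical. The text above does not give the paper's actual budget symbols, units, or granularity (per token or per block).

```python
def linear_decay_budget(n_positions: int, start_ratio: float, end_ratio: float) -> list[float]:
    """Linearly interpolate a keep ratio from start_ratio (first position)
    down to end_ratio (last position). All parameter names are hypothetical."""
    if n_positions < 1:
        raise ValueError("n_positions must be at least 1")
    if n_positions == 1:
        return [start_ratio]
    step = (end_ratio - start_ratio) / (n_positions - 1)
    return [start_ratio + step * p for p in range(n_positions)]


# Assumed example values: keep 90% at the start, 10% at the end.
ratios = linear_decay_budget(5, start_ratio=0.9, end_ratio=0.1)
for position, ratio in enumerate(ratios):
    print(f"position {position}: keep ratio {ratio:.2f}")
```

With these example values the average keep ratio is 0.5, the same total budget a uniform 50% schedule would spend, but weighted toward the early positions.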
2. Output-Aware Metric (OAM)
Standard methods select tokens based on Score-Aware Metrics (SAM), essentially how well the Query matches the Key. Stem introduces OAM, which rests on the observation that even if a match score is high, the token contributes little to the output if the magnitude of its Value vector is near zero.
The selection metric weights the routing (Query-Key) score by the norm of the corresponding Value vector. This ensures the model retains "high-energy signals" that actually impact the residual stream.
Figure 2: The Stem Pipeline showing the transition from coarse block-wise downsampling to fine-grained sparse aggregation.
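The difference between score-aware and output-aware selection can be sketched on a toy example. The code below assumes the output-aware score is the softmax attention probability multiplied by the L2 norm of the Value vector. That is one plausible reading of the description above, not the paper's exact formula, and it works per token rather than on the block-wise pipeline shown in Figure 2. All function names and toy vectors are illustrative.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_norm(vec: list[float]) -> float:
    return math.sqrt(sum(x * x for x in vec))


def select_tokens(query, keys, values, k, output_aware=True):
    """Return the sorted indices of the k key tokens kept for one query.

    output_aware=False ranks by attention probability alone (score-aware).
    output_aware=True ranks by probability times Value norm (assumed form).
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    probs = softmax(logits)
    if output_aware:
        scores = [p * l2_norm(v) for p, v in zip(probs, values)]
    else:
        scores = probs
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


query = [1.0, 0.0]
keys = [[4.0, 0.0], [3.0, 0.0], [1.0, 0.0]]
values = [[0.001, 0.0], [1.0, 1.0], [2.0, 0.0]]  # token 0 has a near-zero Value

print(select_tokens(query, keys, values, k=2, output_aware=False))  # [0, 1]
print(select_tokens(query, keys, values, k=2, output_aware=True))   # [1, 2]
```

Token 0 has the best Query-Key match but an almost empty Value vector, so the score-aware ranking keeps it while the output-aware ranking drops it in favor of token 2.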
Performance & Experiments
Stem was evaluated on heavyweight backbones including Llama-3.1-8B and Qwen3-8B across LongBench and RULER benchmarks.
- Efficiency: At 128K context, Stem reduces pre-filling latency from 1540ms to 420ms (3.7x speedup).
- Accuracy: On LongBench, Stem achieved an average score of 41.48% (Llama-3.1), nearly matching the Dense baseline (42.02%) while using only ~30% of the computation.
- Versatility: It acts as a "booster" for training-based sparse models. When integrated into DeepSeek-V3.2, it further compressed the sparsity budget by 15% with zero performance degradation.
Figure 3: Latency speedup of Stem compared to Dense and other sparse methods like MInference.
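As a quick sanity check on the reported figures, the arithmetic below recomputes the speedup and the accuracy gap from the numbers quoted above. Only the latencies and LongBench scores come from the text; the variable names are illustrative.

```python
dense_latency_ms = 1540.0   # Dense pre-filling latency at 128K context
stem_latency_ms = 420.0     # Stem pre-filling latency at 128K context
speedup = dense_latency_ms / stem_latency_ms

dense_score = 42.02         # LongBench average, Dense baseline (Llama-3.1)
stem_score = 41.48          # LongBench average, Stem (Llama-3.1)
gap = dense_score - stem_score
retained = stem_score / dense_score

print(f"speedup: {speedup:.2f}x")            # speedup: 3.67x
print(f"accuracy gap: {gap:.2f} points")     # accuracy gap: 0.54 points
print(f"accuracy retained: {retained:.1%}")  # accuracy retained: 98.7%
```

The quoted 3.7x is the rounded value of 1540 / 420, and the 0.54-point gap corresponds to retaining about 98.7% of the Dense score.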
Critical Insight: Why it Works
Stem's key move is to shift from pure "similarity" (QK scores) to "contribution" (output magnitude). The theoretical derivation in the Appendix shows that minimizing the reconstruction error of the attention output naturally leads to a metric that weights the routing score by the Value norm. By combining this mathematical insight with the structural asymmetry of causal positions, Stem targets a favorable accuracy-efficiency trade-off for LLM inference.
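One simplified way to see why the Value norm enters, offered as a sketch rather than a reproduction of the paper's appendix: let $a_{ij}$ be the attention weight from query $i$ to token $j$, let $v_j$ be the Value vector, and let $S$ be the set of pruned tokens. The sketch assumes the remaining weights are not renormalized after pruning. All of this notation is introduced here for illustration.

```latex
o_i = \sum_{j \le i} a_{ij}\, v_j,
\qquad
\hat{o}_i = \sum_{\substack{j \le i \\ j \notin S}} a_{ij}\, v_j

\lVert o_i - \hat{o}_i \rVert
  = \Bigl\lVert \sum_{j \in S} a_{ij}\, v_j \Bigr\rVert
  \le \sum_{j \in S} a_{ij}\, \lVert v_j \rVert
```

The bound follows from the triangle inequality. For a fixed budget, pruning the tokens with the smallest $a_{ij} \lVert v_j \rVert$ minimizes this upper bound on the output error, which is the intuition behind weighting the routing score by the Value norm.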
Conclusion & Future Outlook
Stem demonstrates that we don't necessarily need smarter training to get faster LLMs; we need a better understanding of how information flows through existing ones. While it currently focuses on the pre-filling phase, the principles of Position-Decay could likely be extended to KV-cache eviction strategies during decoding, which may help ease the long-context memory wall.
