IndexCache is an acceleration framework for DeepSeek Sparse Attention (DSA) that exploits cross-layer redundancy in token selection. By partitioning layers into "Full" layers (which compute indices) and "Shared" layers (which reuse them), it eliminates up to 75% of indexer computations, achieving 1.82x prefill and 1.48x decode speedups on a 30B model while closely matching the quality of the baseline DSA model.
TL;DR
DeepSeek Sparse Attention (DSA) was a breakthrough for long-context models, but its "lightning indexer" remained an expensive O(L²) tax at every layer. IndexCache removes most of this bottleneck by showing that most layers don't need their own indexers: they can simply "borrow" the token selections from an earlier layer. With a 75% reduction in indexer workload, it delivers speedups of up to 1.82x with negligible quality loss, with results reported on 30B and 700B+ scale models.
The Motivation: The "Hidden" Cost of Sparsity
Modern agentic workflows demand massive context windows (200K+ tokens). While sparse attention (like DSA) reduces the core attention math to linear complexity, it introduces a "Lightning Indexer" to decide which tokens to attend to.
The industry's dirty secret? This indexer still runs an O(L²) dot-product against the entire history. As context grows, this "lightweight" module comes to dominate the compute budget. The authors observed a key empirical regularity: adjacent layers are remarkably consistent in which tokens they select. Pairwise overlap of the selected token sets reaches 70-100% between neighboring layers, which suggests that recomputing these indices at every layer wastes a large share of FLOPs.
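To make the overlap measurement concrete, here is a minimal sketch of how top-k agreement between two layers could be computed. The helper names (`top_k_indices`, `overlap_ratio`) and the toy scores are illustrative assumptions, not the authors' code or data.

```python
import random

def top_k_indices(scores, k):
    """Return the set of positions holding the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def overlap_ratio(scores_a, scores_b, k):
    """Fraction of layer A's top-k tokens that layer B also selects."""
    shared = top_k_indices(scores_a, k) & top_k_indices(scores_b, k)
    return len(shared) / k

# Toy data (an assumption): layer B scores are layer A scores plus small noise,
# mimicking two neighboring layers that mostly agree on what matters.
rng = random.Random(0)
layer_a = [rng.random() for _ in range(1000)]
layer_b = [s + rng.gauss(0, 0.01) for s in layer_a]
unrelated = [rng.random() for _ in range(1000)]

print("neighbor-like overlap:", overlap_ratio(layer_a, layer_b, k=100))
print("unrelated overlap:    ", overlap_ratio(layer_a, unrelated, k=100))
```

A high ratio for a pair of layers is the signal that one of them could plausibly reuse the other's selection.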
Methodology: Two Paths to Index Reuse
IndexCache implements a "Full (F) & Shared (S)" architecture. F layers compute and cache fresh top-k indices; S layers skip the indexer entirely and pull from the cache.
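The F/S control flow can be sketched in a few lines. This is a toy illustration under stated assumptions: `run_indexer` is a hypothetical stand-in for the learned lightning indexer, the pattern string is made up, and the sparse attention step itself is omitted.

```python
def run_indexer(layer_id, hidden, k):
    """Hypothetical stand-in for the lightning indexer: score every token
    with a toy rule and keep the k highest-scoring positions."""
    scores = [(h * (layer_id + 3)) % 11 for h in hidden]
    order = sorted(range(len(hidden)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def forward(hidden, pattern, k):
    """Walk the layers; F layers refresh the index cache, S layers reuse it."""
    cached = None
    indexer_calls = 0
    used = []
    for layer_id, kind in enumerate(pattern):
        if kind == "F":
            cached = run_indexer(layer_id, hidden, k)  # compute fresh top-k
            indexer_calls += 1
        elif cached is None:
            raise ValueError("a Shared layer needs an earlier Full layer")
        used.append(cached)  # sparse attention over `cached` would run here
    return used, indexer_calls

used, calls = forward(list(range(20)), "FSSSFSSS", k=4)
print(calls, "indexer calls for", len(used), "layers")
```

With the pattern `FSSSFSSS`, only 2 of 8 layers run the indexer, which corresponds to the 75% reduction described above.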
1. Training-Free: The Greedy Search
You can't just skip indexers randomly. The authors found that uniform skipping (e.g., every 4th layer) hurts accuracy because certain "anchor" layers are semantically critical. They developed a Greedy Layer Selection algorithm:
- Start with all layers as "Full."
- Iteratively flip to "Shared" whichever remaining Full layer raises the Language Modeling (LM) loss on a small calibration set the least.
- This creates a custom sharing pattern that respects the model's internal hierarchy.
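The steps above can be sketched as a short greedy loop. This is an illustration under assumptions, not the authors' implementation: `eval_loss` is a hypothetical callback that would run the model on the calibration set, the stopping rule (a target number of Full layers) is assumed, and the first layer is kept Full because it has nothing earlier to reuse.

```python
def greedy_layer_selection(num_layers, num_full, eval_loss):
    """Flip Full layers to Shared one at a time, always choosing the flip
    that yields the lowest calibration loss, until num_full remain."""
    if not 1 <= num_full <= num_layers:
        raise ValueError("num_full must be between 1 and num_layers")
    pattern = ["F"] * num_layers
    while pattern.count("F") > num_full:
        best_layer, best_loss = None, float("inf")
        for layer in range(1, num_layers):  # layer 0 always stays Full
            if pattern[layer] != "F":
                continue
            trial = pattern.copy()
            trial[layer] = "S"
            loss = eval_loss(trial)  # one calibration pass per candidate
            if loss < best_loss:
                best_layer, best_loss = layer, loss
        pattern[best_layer] = "S"
    return "".join(pattern)

# Toy loss (an assumption): each layer adds a fixed penalty when Shared.
# Layer 4 plays the role of an "anchor" layer that is costly to skip.
penalty = [9.0, 0.1, 0.5, 0.2, 3.0, 0.1, 0.3, 0.4]

def toy_loss(pattern):
    return 2.0 + sum(p for p, kind in zip(penalty, pattern) if kind == "S")

print(greedy_layer_selection(8, 2, toy_loss))
```

In this toy setup the search keeps the expensive anchor layer Full rather than following a fixed stride. A real loss is not additive across layers, which is why each candidate is re-evaluated end to end in every round.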
2. Training-Aware: Multi-Layer Distillation
To maximize efficiency, the authors proposed a new distillation objective. Instead of an indexer learning only its own layer's attention distribution, it is trained against the centroid (average) of the attention distributions of all the layers it will serve.
This forces the indexer to produce a "consensus" top-k set that is robust enough for multiple downstream layers.
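A minimal sketch of such a centroid target follows. The use of KL divergence as the distillation loss is an assumption made for illustration, as are the function names and the toy distributions; the source only states that the target is the average over served layers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def centroid(distributions):
    """Element-wise average of several attention distributions."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), skipping zero-probability entries of p."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def multi_layer_distill_loss(indexer_logits, served_attention):
    """Loss of one indexer against the average of the layers it serves."""
    target = centroid(served_attention)
    return kl_divergence(target, softmax(indexer_logits))

# Toy example: one indexer serving three layers over four tokens.
served = [
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.50, 0.10, 0.35, 0.05],
]
print(centroid(served))
print(multi_layer_distill_loss([2.0, 0.5, 0.0, -1.0], served))
```

Under the KL assumption, averaging the per-layer losses and taking the loss against the centroid differ only by a term that does not depend on the indexer, so both give the same gradient.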
Figure: The IndexCache inference loop adds a simple conditional branch to reuse cached indices, effectively bypassing the O(L²) indexer forward pass in S layers.
Experimental Results: Near 2x Speedup
The results on the 30B DSA model are striking. At 200K context:
- Prefill Speedup: 1.82x (Latency dropped from 19.5s to 10.7s).
- Decode Throughput: 1.51x increase.
- Accuracy: On long-context benchmarks like RULER and LongBench v2, the greedy-searched 1/4 retention pattern matched the baseline DSA performance almost perfectly.
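The prefill figures can be sanity-checked with a few lines of arithmetic. The last step is a back-of-envelope estimate, not a number from the paper: it assumes that all of the saved time comes from the skipped indexers and that cache reuse adds no overhead.

```python
baseline_s = 19.5     # reported prefill latency for standard DSA
indexcache_s = 10.7   # reported prefill latency with IndexCache

speedup = baseline_s / indexcache_s
saved_fraction = 1 - indexcache_s / baseline_s

# Assumption: 75% of indexer work is removed and nothing else changes,
# so the indexer's share of baseline prefill time is saved_fraction / 0.75.
removed_indexer_work = 0.75
implied_indexer_share = saved_fraction / removed_indexer_work

print(f"speedup: {speedup:.2f}x")
print(f"time saved: {saved_fraction:.1%}")
print(f"implied indexer share of baseline prefill: {implied_indexer_share:.1%}")
```

Under those assumptions, the indexer would account for roughly 60% of baseline prefill time at this context length, which is consistent with the claim that it dominates the budget as context grows.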
Figure: Relative speedup of IndexCache over the standard DSA baseline across different context lengths and retention ratios.
Critical Analysis & Future Outlook
The "Negative Result" included in the appendix is perhaps the most insightful part of the paper: the authors tried using cosine similarity of attention outputs to pick layers, but it failed. This suggests that local similarity is a poor proxy for global model health, because small errors in earlier layers cascade through later ones. Only the end-to-end LM loss (as used in their greedy search) captures these ripple effects.
Takeaway: IndexCache turns cross-layer redundancy in token selection into a structural form of sparsity that can be exploited at inference time. As models like DeepSeek-V3 and GLM-5 become the backbone of AI agents, techniques of this kind are likely to become important for keeping inference costs from spiraling out of control.
Limitations: The training-aware version adds distillation overhead during training or fine-tuning. The training-free version is "plug-and-play" at inference time, but it still requires a one-time greedy search on a calibration set, which adds complexity to the deployment pipeline.
