IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

[ArXiv 2026] IndexCache: Killing the O(L²) Bottleneck in Production-Grade Sparse Attention

Abstract

IndexCache is a novel acceleration framework for DeepSeek Sparse Attention (DSA) that exploits cross-layer redundancy in token selection. By partitioning layers into "Full" layers (which compute indices) and "Shared" layers (which reuse them), it eliminates up to 75% of indexer computations, achieving 1.82x prefill and 1.48x decode speedups on a 30B model while maintaining SOTA performance.

TL;DR

DeepSeek Sparse Attention (DSA) was a breakthrough for long-context models, but its "lightning indexer" remained an expensive tax at every layer: O(L²) in the sequence length L. IndexCache attacks this bottleneck by showing that most layers don't need their own indexers; they can simply "borrow" the token selections from previous layers. With a 75% reduction in indexer workload, it delivers near 2x prefill speedups with negligible performance loss on 30B and 700B+ scale models.

The Motivation: The "Hidden" Cost of Sparsity

Modern agentic workflows demand massive context windows (200K+ tokens). While sparse attention (like DSA) reduces the core attention math to linear complexity in sequence length by attending only to a fixed top-k subset of tokens, it introduces a "Lightning Indexer" to decide which tokens to attend to.

The industry's dirty secret? This indexer still runs an O(L²) dot-product against the entire history, where L is the sequence length. As context grows, this "lightweight" module comes to dominate the compute budget. The authors made a key empirical observation: adjacent layers are remarkably consistent in what they find important. Pairwise overlap of selected tokens reaches 70-100% between neighbors, suggesting that recalculating these indices at every layer wastes a large share of FLOPs.
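To make the overlap statistic concrete, here is a minimal sketch of how the pairwise top-k overlap between two layers could be measured. The function names and the toy scores are illustrative assumptions, not taken from the paper.

```python
def topk_indices(scores, k):
    """Return the set of positions holding the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def overlap_ratio(scores_a, scores_b, k):
    """Fraction of layer A's top-k positions that layer B also selects."""
    selected_a = topk_indices(scores_a, k)
    selected_b = topk_indices(scores_b, k)
    return len(selected_a & selected_b) / k


# Toy indexer scores for one query over 8 context positions (made-up numbers).
layer_3 = [0.90, 0.10, 0.80, 0.20, 0.70, 0.05, 0.60, 0.30]
layer_4 = [0.85, 0.15, 0.75, 0.65, 0.10, 0.05, 0.70, 0.20]

ratio = overlap_ratio(layer_3, layer_4, k=4)
print(ratio)  # 0.75: three of the four selected positions coincide
```

An overlap this high means the second layer's indexer would mostly rediscover what the first one already found, which is the redundancy IndexCache exploits.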

Methodology: Two Paths to Index Reuse

IndexCache implements a "Full (F) & Shared (S)" architecture. F layers compute and cache fresh top-k indices; S layers skip the indexer entirely and pull from the cache.
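The F/S control flow can be sketched in a few lines. This is a toy illustration under stated assumptions: `run_layers`, `toy_indexer`, and `toy_sparse_attention` are hypothetical names, and the toy functions stand in for the real indexer and attention kernels.

```python
def run_layers(pattern, indexer, sparse_attention, hidden, k):
    """Run a stack of sparse-attention layers with cross-layer index reuse.

    pattern is a string such as "FSSS": 'F' layers run the indexer and
    refresh the cache, 'S' layers reuse the most recently cached indices.
    """
    cached_indices = None
    indexer_calls = 0
    for layer_id, kind in enumerate(pattern):
        if kind == "F":
            cached_indices = indexer(layer_id, hidden, k)
            indexer_calls += 1
        elif kind != "S":
            raise ValueError(f"unknown layer kind {kind!r}")
        elif cached_indices is None:
            raise ValueError("an 'S' layer needs an earlier 'F' layer")
        hidden = sparse_attention(layer_id, hidden, cached_indices)
    return hidden, indexer_calls


def toy_indexer(layer_id, hidden, k):
    # Stand-in for the indexer: pick the k largest activations.
    order = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)
    return order[:k]


def toy_sparse_attention(layer_id, hidden, indices):
    # Stand-in for sparse attention: mix each token with the selected tokens' mean.
    context = sum(hidden[i] for i in indices) / len(indices)
    return [0.5 * h + 0.5 * context for h in hidden]


hidden0 = [0.2, 0.9, 0.1, 0.7, 0.4, 0.8]
out, calls = run_layers("FSSSFSSS", toy_indexer, toy_sparse_attention, hidden0, k=3)
print(calls)  # 2 indexer calls for 8 layers, i.e. 75% of them skipped
```

The only state carried between layers is the cached index set, so the branch adds essentially no bookkeeping to the forward pass.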

1. Training-Free: The Greedy Search

You can't just skip indexers at random. The authors found that uniform patterns (e.g., keeping an indexer only at every 4th layer) hurt accuracy because certain "anchor" layers are semantically critical. They developed a Greedy Layer Selection algorithm:

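Start with all layers as "Full."
Iteratively flip the "least impactful" layer to "Shared," i.e., the layer whose flip raises the Language Modeling (LM) loss on a small calibration set the least.
Repeat until the target retention ratio (e.g., 1/4 of the indexers kept) is reached.
This creates a custom sharing pattern that respects the model's internal hierarchy.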
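The greedy procedure above can be sketched as follows. This is a hedged illustration, not the authors' code: `greedy_layer_selection` and `toy_eval_loss` are hypothetical names, the additive toy loss replaces a real calibration-set evaluation, and keeping layer 0 Full is an assumption made because the first layer has no earlier indices to reuse.

```python
def greedy_layer_selection(num_layers, num_shared, eval_loss):
    """Greedily choose which layers become Shared ('S').

    eval_loss(pattern) must return the LM loss of the model run with the
    given list of 'F'/'S' flags; here it is any callable (an assumption).
    """
    pattern = ["F"] * num_layers
    for _ in range(num_shared):
        best_layer, best_loss = None, float("inf")
        # Layer 0 stays Full: it has nothing earlier to borrow from.
        for layer in range(1, num_layers):
            if pattern[layer] == "S":
                continue
            trial = pattern.copy()
            trial[layer] = "S"
            loss = eval_loss(trial)
            if loss < best_loss:
                best_layer, best_loss = layer, loss
        if best_layer is None:
            break  # nothing left to flip
        pattern[best_layer] = "S"
    return "".join(pattern)


# Toy stand-in for the calibration loss: each layer adds a fixed penalty
# when shared. Real losses are not additive; this is only for illustration.
SENSITIVITY = [8.0, 0.25, 0.75, 0.5, 4.0, 0.125, 1.0, 0.375]


def toy_eval_loss(pattern):
    return 2.0 + sum(s for s, kind in zip(SENSITIVITY, pattern) if kind == "S")


pattern = greedy_layer_selection(num_layers=8, num_shared=6, eval_loss=toy_eval_loss)
print(pattern)  # FSSSFSSS: the sensitive "anchor" layer 4 keeps its indexer
```

Each round evaluates every remaining Full layer once, so the search costs on the order of num_layers × num_shared loss evaluations, which is why a small calibration set matters.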

2. Training-Aware: Multi-Layer Distillation

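To maximize efficiency, the authors proposed a new distillation objective. Instead of an indexer learning only its own layer's attention distribution ($p^{(\ell)}$), it is trained against the centroid (average) of the attention distributions of all the layers it will serve. With $q_{t}^{(\ell)}$ denoting the indexer's output distribution at layer $\ell$ for query token $t$, and $m$ the number of Shared layers that reuse its indices, the loss is:

$\mathcal{L}_{\text{multi}}^{I} = \sum_{j=0}^{m} \frac{1}{m+1} \sum_{t} D_{\mathrm{KL}}\left(p_{t}^{(\ell+j)} \,\|\, q_{t}^{(\ell)}\right)$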

This forces the indexer to produce a "consensus" top-k set that is robust enough for multiple downstream layers.
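A small numeric sketch can make the "centroid" reading tangible. It assumes toy three-token distributions and hypothetical helper names; it only evaluates the loss formula and does not reproduce the paper's training setup.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def multi_layer_loss(targets, q):
    """Average over served layers of the per-token KL to the indexer output.

    targets[j][t] plays the role of p_t^(l+j); q[t] plays the role of q_t^(l).
    """
    total = 0.0
    for p_layer in targets:
        total += sum(kl_divergence(p_t, q_t) for p_t, q_t in zip(p_layer, q))
    return total / len(targets)


# One query token, two served layers (m = 1), made-up attention distributions.
p0 = [0.7, 0.2, 0.1]
p1 = [0.5, 0.4, 0.1]
centroid = [(a + b) / 2 for a, b in zip(p0, p1)]

targets = [[p0], [p1]]
loss_centroid = multi_layer_loss(targets, [centroid])
loss_p0 = multi_layer_loss(targets, [p0])
loss_p1 = multi_layer_loss(targets, [p1])
print(loss_centroid < loss_p0 and loss_centroid < loss_p1)  # True
```

Matching either layer alone leaves a larger loss than matching their average, which is why minimizing the summed KL pulls the indexer toward the centroid of the target distributions.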

Core Architecture and Loop Figure: The IndexCache inference loop adds a simple conditional branch to reuse cached indices, effectively bypassing the O(L²) indexer forward pass in S layers.

Experimental Results: Near 2x Speedup

The results on the 30B DSA model are striking. At 200K context:

Prefill Speedup: 1.82x (Latency dropped from 19.5s to 10.7s).
Decode Throughput: 1.51x increase.
Accuracy: On long-context benchmarks like RULER and LongBench v2, the greedy-searched 1/4 retention pattern matched the baseline DSA performance almost perfectly.
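As a quick sanity check on the prefill numbers, the short calculation below recomputes the speedup from the reported latencies. The second figure is a back-of-the-envelope inference, not a number from the paper: it assumes the entire latency saving comes from skipping 75% of indexer work and ignores any other overheads.

```python
baseline_s = 19.5     # reported DSA prefill latency at 200K context
indexcache_s = 10.7   # reported IndexCache prefill latency at 200K context

speedup = baseline_s / indexcache_s
print(round(speedup, 2))  # 1.82

# Assumption: only indexer time changes, and 75% of it is removed.
skipped_fraction = 0.75
implied_indexer_share = (1 - indexcache_s / baseline_s) / skipped_fraction
print(round(implied_indexer_share, 2))  # 0.6
```

Under that simplifying assumption, the indexer would account for roughly 60% of baseline prefill time at 200K tokens, consistent with the claim that it dominates the budget at long context.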

Throughput and Prefill Speedup Figure: Relative speedup of IndexCache over the standard DSA baseline across different context lengths and retention ratios.

Critical Analysis & Future Outlook

The "Negative Result" included in the appendix is perhaps the most insightful part of the paper: the authors tried using cosine similarity of attention outputs to pick layers, but it failed. This suggests that local similarity is a poor proxy for global model health, because small errors in earlier layers cascade. Only the end-to-end LM loss (as used in their greedy search) captures these ripple effects.

Takeaway: IndexCache effectively standardizes "Structural Sparsity." As models like DeepSeek-V3 and GLM-5 become the backbone of AI agents, techniques that exploit cross-layer redundancy will be mandatory to keep inference costs from spiraling out of control.

Limitations: The training-aware version incurs distillation overhead during training or fine-tuning. While the training-free version is "plug-and-play," it still requires a one-time greedy search on a calibration set, which adds complexity to the deployment pipeline.
