The paper introduces Memory Caching (MC), a framework that strengthens Recurrent Neural Networks (RNNs) by caching intermediate hidden states as checkpoints. Applied to models such as Titans and Linear Attention, MC achieves performance competitive with Transformers on recall-intensive tasks while maintaining sub-quadratic complexity.
TL;DR
The dominance of Transformers relies on their "infinite" memory, which grows with sequence length at a quadratic compute cost. Modern RNNs (SSMs, Linear Attention) offer efficiency but fail at retrieval because their fixed-size memory "forgets." Memory Caching (MC) is a simple, plug-and-play architectural upgrade that caches snapshots of RNN hidden states. This allows the model's memory to grow over time, enabling Transformer-competitive recall with sub-quadratic efficiency.
Background: The Compression-Retrieval Trade-off
In the landscape of sequence modeling, we have two extremes:
- Transformers: Perfect recall, but the KV-cache acts as a computational "tax" that grows with every token until memory or compute budgets are exhausted.
- RNNs/SSMs: Highly efficient, but they force the entire history into a fixed-size vector. This "lossy compression" is the Achilles' heel of models like Mamba or RWKV when faced with "Needle-in-a-Haystack" tasks.
Memory Caching (MC) asks: What if we didn't have to choose? By periodically caching the state of the RNN, we create a searchable history of compressed "zip files" of the past.
Methodology: How Memory Caching Works
The core intuition is to treat the recurrent hidden state not as a single bucket, but as a series of checkpoints. The sequence is divided into segments, and the memory state at the end of each segment is stored in a cache.
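The checkpointing idea can be sketched with a toy linear-attention memory. This is a minimal illustration under stated assumptions, not the authors' implementation: the update rule `M_t = M_{t-1} + v_t k_t^T`, the function name `run_with_cache`, and the `reset_each_segment` flag are all hypothetical choices made here. In particular, the summary does not say whether the memory is reset at segment boundaries, so the sketch exposes both options.

```python
from typing import List, Tuple

Matrix = List[List[float]]

def outer(v: List[float], k: List[float]) -> Matrix:
    """Outer product v k^T with shape (len(v), len(k))."""
    return [[vi * kj for kj in k] for vi in v]

def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices (returns a new matrix)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def run_with_cache(
    keys: List[List[float]],
    values: List[List[float]],
    segment_len: int,
    reset_each_segment: bool = False,  # assumption: not specified in the summary
) -> Tuple[Matrix, List[Matrix]]:
    """Run a toy linear-attention recurrence and snapshot the memory
    at the end of every full segment of `segment_len` tokens."""
    d_v, d_k = len(values[0]), len(keys[0])

    def zeros() -> Matrix:
        return [[0.0] * d_k for _ in range(d_v)]

    memory = zeros()
    cache: List[Matrix] = []
    for t, (k, v) in enumerate(zip(keys, values), start=1):
        memory = add(memory, outer(v, k))  # recurrent update
        if t % segment_len == 0:
            # add() built a fresh matrix, so this snapshot is never mutated later
            cache.append(memory)
            if reset_each_segment:
                memory = zeros()
    return memory, cache
```

The function returns the live memory plus one cached matrix per completed segment. The cache grows with sequence length, while each entry stays the same fixed size as the base recurrent state.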
The Architecture
Instead of reading the output from the current recurrent state alone, MC applies an aggregation function over the current state and the cached segment states.
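The equations are not reproduced in this summary. One way to write the idea, using notation assumed here for illustration rather than taken from the paper, is the following: $M_t$ is the memory at step $t$, $q_t$ the query, $M^{(i)}$ the state cached at the end of segment $i$, $s$ the number of segments cached so far, and $\mathrm{Agg}$ the aggregation function.

```latex
% Standard recurrent read: only the current memory is queried.
y_t = M_t(q_t)

% Memory Caching read (assumed form): the cached states are queried as well.
y_t = \mathrm{Agg}\big( M^{(1)}(q_t),\, \dots,\, M^{(s)}(q_t),\, M_t(q_t) \big)
```

The variants listed next differ mainly in how $\mathrm{Agg}$ is chosen.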

Key Variants:
- Gated Residual Memory (GRM): Uses a context-aware gating mechanism to decide which past segment is relevant to the current query.
- Memory Soup: Instead of aggregating outputs, it averages the parameters of cached memory modules to create a specialized test-time retrieval function.
- Sparse Selective Caching (SSC): A "best of both worlds" approach using a router to pick the Top-K most relevant segments, keeping the overhead minimal even for ultra-long sequences.
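The three variants can be contrasted in a small sketch. The scoring rule is an assumption made for illustration: each cached memory is paired with a hypothetical summary vector, and its relevance is the dot product between that summary and the query. The names `gated_read` and `soup_read` are likewise invented here, and the paper's actual gate and router may differ.

```python
import math
from typing import List, Optional, Tuple

Matrix = List[List[float]]

def matvec(m: Matrix, q: List[float]) -> List[float]:
    """Read from a linear memory: y = M q."""
    return [sum(a * b for a, b in zip(row, q)) for row in m]

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def softmax(xs: List[float]) -> List[float]:
    mx = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - mx) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def gated_read(
    query: List[float],
    memories: List[Matrix],
    summaries: List[List[float]],  # hypothetical per-segment routing keys
    top_k: Optional[int] = None,
) -> Tuple[List[float], List[int]]:
    """Mix per-memory outputs with query-dependent softmax gates.
    top_k=None uses every memory (GRM-like);
    top_k=k keeps only the k highest-scoring memories (SSC-like)."""
    scores = [dot(query, s) for s in summaries]
    idx = list(range(len(memories)))
    if top_k is not None:
        idx = sorted(idx, key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in idx])
    out = [0.0] * len(memories[0])
    for g, i in zip(gates, idx):
        y = matvec(memories[i], query)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, idx

def soup_read(query: List[float], memories: List[Matrix]) -> List[float]:
    """Memory-Soup-like read: average the memory parameters, then query once."""
    n = len(memories)
    rows, cols = len(memories[0]), len(memories[0][0])
    avg = [[sum(m[r][c] for m in memories) / n for c in range(cols)]
           for r in range(rows)]
    return matvec(avg, query)
```

For a purely linear memory such as the matrices above, averaging parameters and averaging outputs give the same result. The two only diverge when the memory module is nonlinear, for example a deep memory.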

Experiments: Breaking the RNN Bottleneck
The authors tested MC across Titans, Deep Linear Attention (DLA), and SWLA. The results (Tables 1 and 2) show that MC-enhanced variants consistently outperform state-of-the-art recurrent models like RWKV-7 and RetNet.
Highlights:
- Needle-in-a-Haystack: In 16K context retrieval, standard DLA often fails (scores near 4.0), whereas DLA + GRM jumps to 82.4.
- Efficiency: While adding a cache sounds expensive, the training throughput remains significantly higher than Transformers as sequence length increases. SSC, in particular, tracks closely with base RNN efficiency while delivering Transformer-class recall.

Deep Insight: Is this just Attention in Disguise?
One might argue that caching every token's state makes this a Transformer. However, the authors show that if you cache segments (e.g., every 256 tokens), you are performing a hierarchical retrieval. You aren't attending to every token; you are attending to a compressed summary (the RNN state) of each block. This provides a "Middle Path" in the complexity-performance trade-off.
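A rough count makes the "Middle Path" concrete. The sketch below counts how many memory reads each scheme performs over a whole sequence, assuming that a token reads every state cached so far plus the live state. The function names are hypothetical, and the count is simplified: a read of a cached state is a matrix-vector product, not a single dot product as in attention, so the numbers compare lookups, not FLOPs.

```python
def attention_reads(seq_len: int) -> int:
    """Causal attention: token t attends to t keys (itself included)."""
    return seq_len * (seq_len + 1) // 2

def cached_reads(seq_len: int, segment_len: int) -> int:
    """Segment caching: token t reads the t // segment_len cached states
    plus the one live recurrent state."""
    return sum(t // segment_len + 1 for t in range(1, seq_len + 1))

if __name__ == "__main__":
    n = 16_384
    print("attention:", attention_reads(n))
    for s in (1, 64, 256, 1024):
        print(f"segment={s}:", cached_reads(n, s))
```

With a segment length of 1 the count grows quadratically, just as in attention. Larger segments cut the count roughly in proportion to the segment length, at the price of coarser retrieval.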
Critical Analysis & Conclusion
Takeaway: MC suggests that we don't need the "brute force" of full Attention. If the underlying RNN is good at local compression (like Titans or DLA), then caching its states periodically can be enough to address the "forgetting" problem on the reported benchmarks.
Limitations:
- Segment Tuning: The performance is sensitive to segment size. Too large, and you lose resolution; too small, and complexity approaches quadratic.
- Training Complexity: Implementing efficient parallelized kernels for gated aggregation across segments (essentially a sparse cross-attention over states) remains non-trivial.
Future Outlook: Memory Caching simplifies the path toward "Infinite Context" models by defining a clear hierarchy: the RNN handles the micro-context through recurrence, while the Cache handles the macro-context through retrieval.
