[ICLR 2026] Memory Caching: RNNs with Growing Memory - Bridging the Gap to Transformers
Abstract

The paper introduces Memory Caching (MC), a framework that boosts Recurrent Neural Networks (RNNs) by caching intermediate hidden states as checkpoints. Applied to models like Titans and Linear Attention, MC achieves performance competitive with Transformers on recall-intensive tasks while maintaining sub-quadratic complexity.

TL;DR

The dominance of Transformers relies on their "infinite" memory, which grows with sequence length at a quadratic cost. Modern RNNs (SSMs, Linear Attention) offer efficiency but fail at retrieval because their fixed-size memory "forgets." Memory Caching (MC) is a simple, plug-and-play architectural upgrade that caches snapshots of RNN hidden states. This allows the model's memory to grow over time, enabling Transformer-level performance with RNN-like efficiency.

Background: The Compression-Retrieval Trade-off

In the landscape of sequence modeling, we have two extremes:

  1. Transformers: Perfect recall, but the KV-cache grows linearly with sequence length and attention compute grows quadratically, a "tax" that eventually exhausts memory on long inputs.
  2. RNNs/SSMs: Highly efficient, but they force the entire history into a fixed-size vector. This "lossy compression" is the Achilles' heel of models like Mamba or RWKV when faced with "Needle-in-a-Haystack" tasks.

Memory Caching (MC) asks: What if we didn't have to choose? By periodically caching the state of the RNN, we create a searchable history of compressed "zip files" of the past.

Methodology: How Memory Caching Works

The core intuition is to treat the recurrent hidden state not as a single bucket, but as a series of checkpoints. The sequence is divided into segments, and the memory state at the end of each segment is stored in a cache.
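The checkpointing idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes a linear-attention-style memory (a matrix updated with an outer product), it assumes the state is carried across segment boundaries without being reset, and the names `outer_add`, `read`, and `run_with_cache` are hypothetical.

```python
from copy import deepcopy


def outer_add(S, k, v):
    """In-place linear-attention-style update: S += k v^T."""
    for i, ki in enumerate(k):
        for j, vj in enumerate(v):
            S[i][j] += ki * vj


def read(S, q):
    """Read from a memory state: y = S^T q."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]


def run_with_cache(keys, values, segment_len):
    """Run the recurrence and snapshot the state at every segment boundary.

    Assumption: the state keeps accumulating across segments (no reset).
    """
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    cache = []
    for t, (k, v) in enumerate(zip(keys, values), start=1):
        outer_add(S, k, v)
        if t % segment_len == 0:
            cache.append(deepcopy(S))  # checkpoint at the end of the segment
    return S, cache


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
final_state, cache = run_with_cache(keys, values, segment_len=2)
```

With four tokens and a segment length of two, the cache holds two snapshots: one that only knows about the first two tokens and one that has seen all four.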

The Architecture

Instead of the standard RNN output, MC computes the output with an aggregation function over the cached memory states.

Figure: Overall Architecture

Key Variants:

  • Gated Residual Memory (GRM): Uses a context-aware gating mechanism to decide which past segment is relevant to the current query.
  • Memory Soup: Instead of aggregating outputs, it averages the parameters of cached memory modules to create a specialized test-time retrieval function.
  • Sparse Selective Caching (SSC): A "best of both worlds" approach using a router to pick the Top-K most relevant segments, keeping the overhead minimal even for ultra-long sequences.
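The three variants above can be contrasted with a small sketch. Everything here is an assumption made for illustration: each cached state is a plain matrix, each segment has a hypothetical "summary" vector that the gate or router scores against the query, and the gates are a softmax over those scores. The paper's actual gating and routing functions may differ.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def read(S, q):
    """Read from one memory state: y = S^T q."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]


def gated_read(states, summaries, q):
    """GRM-style: read every cached state, then mix the outputs with query-dependent gates."""
    gates = softmax([dot(q, s) for s in summaries])
    outs = [read(S, q) for S in states]
    return [sum(g * o[j] for g, o in zip(gates, outs)) for j in range(len(outs[0]))]


def soup_read(states, summaries, q):
    """Soup-style: mix the cached parameters first, then read once from the mixture."""
    gates = softmax([dot(q, s) for s in summaries])
    rows, cols = len(states[0]), len(states[0][0])
    mixed = [[sum(g * S[i][j] for g, S in zip(gates, states)) for j in range(cols)]
             for i in range(rows)]
    return read(mixed, q)


def sparse_read(states, summaries, q, top_k):
    """SSC-style: route to the top_k highest-scoring segments, then gate among them."""
    scores = [dot(q, s) for s in summaries]
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return gated_read([states[i] for i in keep], [summaries[i] for i in keep], q)


states = [[[1.0], [0.0]], [[0.0], [5.0]], [[2.0], [2.0]]]
summaries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
q = [0.0, 2.0]
```

For a linear memory like this one, mixing parameters and mixing outputs with the same weights give the same result, because the read is linear in the state. The two only diverge when the memory module is nonlinear, as with deep memories.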

Figure: Sparse Selective Caching

Experiments: Breaking the RNN Bottleneck

The authors tested MC across Titans, Deep Linear Attention (DLA), and SWLA. The results (Tables 1 and 2) show that MC-enhanced variants consistently outperform state-of-the-art recurrent models like RWKV-7 and RetNet.

Highlights:

  • Needle-in-a-Haystack: In 16K context retrieval, standard DLA often fails (scores near 4.0), whereas DLA + GRM jumps to 82.4.
  • Efficiency: While adding a cache sounds expensive, the training throughput remains significantly higher than that of Transformers as sequence length increases. SSC, in particular, tracks closely with base RNN efficiency while delivering Transformer-class recall.

Figure: Throughput Comparison

Deep Insight: Is this just Attention in Disguise?

One might argue that caching every token's state makes this a Transformer. However, the authors show that if you cache segments (e.g., every 256 tokens), you are performing a hierarchical retrieval. You aren't attending to every token; you are attending to a compressed summary (the RNN state) of each block. This provides a "Middle Path" in the complexity-performance trade-off.
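A rough count makes the difference concrete. The sketch below is a simplification, assuming a query reads one entry per past token under full causal attention, and one entry per completed segment plus the live recurrent state under segment caching. The function name `entries_read` is hypothetical.

```python
def entries_read(position, segment_len=None):
    """Number of memory entries a query at `position` (1-indexed) reads from.

    segment_len=None models full causal attention (one entry per token).
    Otherwise: one cached state per completed segment, plus the live state.
    """
    if segment_len is None:
        return position
    return position // segment_len + 1


full_attention = entries_read(16384)
segment_cache = entries_read(16384, segment_len=256)
```

At a 16K position with 256-token segments, the query reads from 65 compressed states rather than 16,384 individual tokens.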

Critical Analysis & Conclusion

Takeaway: MC suggests that the "brute force" of full Attention is not always necessary. If the underlying RNN is good at local compression (like Titans or DLA), then caching the states periodically can be enough to address the "forgetting" problem.

Limitations:

  • Segment Tuning: The performance is sensitive to segment size. Too large, and you lose resolution; too small, and complexity approaches quadratic.
  • Training Complexity: Implementing efficient parallelized kernels for gated aggregation across segments (essentially a sparse cross-attention over states) remains non-trivial.
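The segment-size trade-off can be illustrated by counting cache reads over a whole sequence. This is a back-of-the-envelope sketch, assuming dense aggregation (every token reads every cached state completed so far, with no Top-K routing). `total_cache_reads` is a hypothetical helper, not something from the paper.

```python
def total_cache_reads(seq_len, segment_len):
    """Total cached-state reads over a sequence under dense aggregation.

    The token at position t (1-indexed) reads t // segment_len cached states.
    """
    return sum(t // segment_len for t in range(1, seq_len + 1))


# Sweep segment sizes for a short sequence to see the trend.
sweep = {c: total_cache_reads(8, c) for c in (1, 4, 8)}
```

With a segment length of 1 the count equals the sum 1 + 2 + ... + T, which is quadratic in the sequence length. Larger segments cut the count roughly in proportion to the segment length, at the price of each cached state summarizing more tokens.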

Future Outlook: Memory Caching simplifies the path toward "Infinite Context" models by defining a clear hierarchy: the RNN handles the micro-context through recurrence, while the Cache handles the macro-context through retrieval.
