M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling


[ArXiv 2025] M2RNN: Breaking the Complexity Ceiling of Linear RNNs with Matrix-Valued States

Abstract

M2RNN (Matrix-to-Matrix RNN) is a novel non-linear RNN architecture designed for scalable language modeling, featuring matrix-valued hidden states and expressive non-linear transitions. It achieves perfect state-tracking generalization and outperforms state-of-the-art hybrid models like Gated DeltaNet and Mamba-2 by 0.4–0.5 perplexity points in hybrid settings.

TL;DR

M2RNN resurrects non-linear RNNs for the LLM era by replacing the classic vector hidden state with a high-capacity matrix. By combining non-linear state transitions (to solve complex logic) with outer-product expansion (to store massive context), M2RNN outperforms Mamba-2 and Gated DeltaNet in both efficiency and accuracy, particularly in "hard" tasks like entity tracking and code execution.

The Motivation: Why Linear RNNs and Transformers Fail at Logic

The AI community has recently flocked to linear RNNs (SSMs) such as Mamba for their $O(L)$ efficiency in sequence length $L$. However, there is a hidden cost: expressivity.

Theoretical analysis shows that Transformers and linear RNNs with diagonal transitions reside in the TC0 complexity class. Under the standard conjecture that TC0 is strictly smaller than NC1, this means they cannot solve certain state-tracking problems, such as evaluating complex code or tracking nested entities, which require NC1 complexity.

Non-linear RNNs (like LSTMs) have the required complexity but were abandoned because:

Tiny States: Their vector states $h_{t} \in \mathbb{R}^{d}$ are too small to compete with the KV-caches of Transformers.
Hardware Unfriendly: Non-linearities prevent the use of "Parallel Scan" algorithms, making them slow to train on GPUs.
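
To illustrate why the non-linearity blocks a parallel scan, consider a scalar recurrence. A linear step h -> a*h + b composes into another step of the same form, so steps can be merged pairwise in any grouping. A tanh step has no such closed form and must be evaluated in order. This is a minimal sketch: the scalar setting and all function names are assumptions made for illustration, not taken from the paper.

```python
import math

def combine(first, second):
    """Compose two affine steps h -> a*h + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)

def run_linear(steps, h0):
    """Reference: apply the affine steps one after another."""
    h = h0
    for a, b in steps:
        h = a * h + b
    return h

def run_linear_tree(steps, h0):
    """Merge neighbouring steps pairwise (the grouping a parallel scan relies on)."""
    while len(steps) > 1:
        merged = [combine(steps[i], steps[i + 1]) for i in range(0, len(steps) - 1, 2)]
        if len(steps) % 2:
            merged.append(steps[-1])  # odd step out is carried to the next round
        steps = merged
    a, b = steps[0]
    return a * h0 + b

def run_tanh(steps, h0):
    """Non-linear recurrence h -> tanh(a*h + b): evaluated strictly step by step."""
    h = h0
    for a, b in steps:
        h = math.tanh(a * h + b)
    return h

steps = [(0.5, 1.0), (0.9, -0.3), (0.2, 0.7), (1.1, 0.05), (0.8, -0.4)]
sequential = run_linear(steps, h0=0.25)
tree = run_linear_tree(steps, h0=0.25)
nonlinear = run_tanh(steps, h0=0.25)
```

The tree-style merge matches the sequential result for the linear case because `combine` is associative. No analogous `combine` exists for `run_tanh`, which is the training-speed obstacle described above.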

The Core Innovation: Matrix-to-Matrix Recurrence

M2RNN solves this by upgrading the hidden state to a matrix $H_{t} \in \mathbb{R}^{K \times V}$.

1. The Recurrence Equation

Instead of a simple vector addition, M2RNN uses a $\tanh$ non-linearity wrapped around a matrix-valued update:

$Z_{t} = \tanh(H_{t-1} W + k_{t} v_{t}^{\top})$

$H_{t} = f_{t} H_{t-1} + (1 - f_{t}) Z_{t}$

Here, $k_{t} v_{t}^{\top}$ is an outer product that allows the model to "write" new information into the matrix state efficiently, mimicking the high capacity of Linear Attention while maintaining a non-linear transition.
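
The two update equations can be sketched as a dependency-free reference in plain Python. Several details are assumptions, since the summary does not state them: $W$ is taken to be $V \times V$, $f_{t}$ is treated as a scalar gate in $[0, 1]$, the state starts at zero, and the keys, values, and gates are given as inputs rather than computed from token embeddings. A practical implementation would use a tensor library and the fused kernels discussed in the next section.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def m2rnn_step(H_prev, W, k, v, f):
    """One update: Z = tanh(H_prev W + k v^T), then H = f * H_prev + (1 - f) * Z.

    H_prev: K x V state; W: V x V transition (assumed shape);
    k: length-K key; v: length-V value; f: scalar forget gate (assumed).
    """
    HW = matmul(H_prev, W)
    K, V = len(k), len(v)
    Z = [[math.tanh(HW[i][j] + k[i] * v[j]) for j in range(V)] for i in range(K)]
    return [[f * H_prev[i][j] + (1.0 - f) * Z[i][j] for j in range(V)] for i in range(K)]

def m2rnn_scan(keys, values, gates, W, K, V):
    """Run the recurrence over a sequence, starting from a zero state (assumed)."""
    H = [[0.0] * V for _ in range(K)]
    for k, v, f in zip(keys, values, gates):
        H = m2rnn_step(H, W, k, v, f)
    return H

# Toy example with K = 3 and V = 2; sizes and values are chosen only for illustration.
W = [[0.5, -0.2], [0.1, 0.3]]
keys = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
values = [[0.2, -0.4], [1.0, 0.3]]
gates = [0.9, 0.5]
H_final = m2rnn_scan(keys, values, gates, W, K=3, V=2)
```

Because each new state is a convex combination of the previous state and a $\tanh$ output, every entry of the state stays within $(-1, 1)$ when starting from zero.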

2. Specialized Hardware Kernels

The authors address training inefficiency by noticing that since the state is a matrix, the recurrence update becomes a GEMM (General Matrix Multiply). Unlike vector RNNs that waste 75% of FLOPs on padding to keep Tensor Cores busy, M2RNN's matrix dimensions naturally fit the hardware (e.g., $64 \times 16$), allowing for peak GPU utilization even with small batch sizes.
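
One way to read the padding argument, sketched under assumptions: if a matrix-multiply unit consumes operands in tiles with a fixed number of rows, an operand with fewer real rows is zero-padded and the padded rows do no useful work. The tile height of 16 and the 4-row operand below are hypothetical values chosen because they reproduce a 75% figure. The summary does not say how that number was derived.

```python
import math

def useful_fraction(rows, tile_rows=16):
    """Fraction of tile rows holding real data after padding `rows` up to a multiple of `tile_rows`."""
    padded = math.ceil(rows / tile_rows) * tile_rows
    return rows / padded

small_operand = useful_fraction(4)   # hypothetical 4-row operand for a vector-state RNN
matrix_state = useful_fraction(64)   # 64-row matrix state, as in the 64 x 16 example
wasted_small = 1.0 - small_operand
```

Under these assumed sizes, the 4-row operand leaves three quarters of each tile idle, while the 64-row state fills its tiles completely.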

[Figure: M2RNN architecture]

Experiments: Scaling to 7B MoE

The researchers tested M2RNN at the 410M (Dense) and 7B (MoE) scales.

Logic Master: In the S3 permutation task (a proxy for hard logic), M2RNN achieved perfect generalization to sequence lengths 4x longer than seen in training, whereas linear models failed completely.
The Power of Hybrids: While a pure M2RNN is strong, it shines brightest as a Hybrid. Interleaving M2RNN layers with Attention and Linear RNNs yields the best results.
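
For readers unfamiliar with the S3 task, the sketch below generates one plausible form of the data: a sequence of permutations of three items, where the target at each position is the composition of all permutations seen so far. The exact data format used in the paper is not given here, so the encoding, the function names, and the lengths 8 and 32 (chosen to mirror the 4x length gap) are assumptions.

```python
import itertools
import random

# The six elements of S3, each a tuple p where p[i] is the image of position i.
S3 = list(itertools.permutations(range(3)))
IDENTITY = (0, 1, 2)

def compose(p, q):
    """Return the permutation that applies p first and then q."""
    return tuple(q[p[i]] for i in range(3))

def make_example(length, rng):
    """Sample a sequence of S3 elements and the running product after each step."""
    tokens = [rng.choice(S3) for _ in range(length)]
    state = IDENTITY
    targets = []
    for p in tokens:
        state = compose(state, p)
        targets.append(state)
    return tokens, targets

rng = random.Random(0)
train_tokens, train_targets = make_example(8, rng)
test_tokens, test_targets = make_example(32, rng)  # 4x longer than the training example
```

Since S3 is non-commutative, the final target depends on the order of every token, so a model has to carry an exact running state rather than an order-insensitive summary of the inputs.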

Long-Context Generalization

On LongBench, adding even a single M2RNN layer to a Gated DeltaNet hybrid resulted in a massive 8-point jump in accuracy. This suggests that M2RNN layers act as "logic anchors" in a sea of linear retrieval layers.

[Figure: S3 task performance]

Conclusion: A New Building Block for LLMs

For years, we believed we had to choose between the expressivity of non-linear RNNs and the scalability of linear models. M2RNN proves this is a false dichotomy. By simply expanding the state from a vector to a matrix and optimizing the GPU kernels, we can build models that are both hardware-efficient and logically rigorous.

Key Limitation: Non-linear layers are still more expensive than linear ones. However, as the authors show, you don't need many of them: one M2RNN layer every 8 to 16 layers might be the "secret sauce" for the next generation of reasoning-heavy models.
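
A sparse placement like this can be expressed as a simple layer schedule. The sketch below is hypothetical: the layer-type strings, the function name, and the choice to put the M2RNN layer last in each group are assumptions. It also omits attention layers for brevity, even though the hybrids described above include them.

```python
def hybrid_schedule(num_layers, m2rnn_every=8, default="linear_rnn"):
    """Return a list of layer types with one 'm2rnn' layer per `m2rnn_every` layers.

    Layer-type names and the position of the M2RNN layer within each group
    are illustrative choices, not taken from the source.
    """
    if m2rnn_every < 1:
        raise ValueError("m2rnn_every must be at least 1")
    return [
        "m2rnn" if (i + 1) % m2rnn_every == 0 else default
        for i in range(num_layers)
    ]

layers = hybrid_schedule(24, m2rnn_every=8)
m2rnn_positions = [i for i, kind in enumerate(layers) if kind == "m2rnn"]
```

With 24 layers and a spacing of 8, only three layers pay the cost of the non-linear recurrence.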

[Figure: Throughput comparison]
