To Use or Not to Use Muon: The Hidden Cost of Optimization Speed
Abstract

This paper presents a theoretical critique of the Muon optimizer, a high-performance method that uses Newton-Schulz orthogonalization to accelerate training. By analyzing "Spectral GD" in deep linear networks, the authors demonstrate that Muon's speed comes at the cost of the "simplicity bias" inherent in SGD, which typically learns low-rank features before complex ones.

Executive Summary

TL;DR

Muon has recently taken the AI world by storm, powering record-breaking training runs in the "NanoGPT Speedrun." However, this paper by Dragutinović and Ranganath pulls back the curtain on its "divine benevolence." By analyzing the spectral dynamics of optimization, the authors reveal that Muon’s speed is gained by sacrificing the Simplicity Bias—the natural tendency of SGD to learn simple, low-rank structures before complex ones.

Academic Context

This work shifts the narrative from "how fast can we train?" to "what are we actually learning?" It positions Muon as a "greedy" optimizer that, while efficient, may be more prone to memorization and more susceptible to spurious correlations than the more "patient" SGD.


The Core Intuition: The "Neural Race"

In standard Gradient Descent (GD), learning is hierarchical: the model first escapes the saddle points associated with the most dominant features, and only then moves on to more subtle patterns. This creates an implicit curriculum.

Muon changes the rules of the race. By orthogonalizing the gradient update (setting all singular values to 1), it forces the model to learn all features—both the signal and the noise—at the same rate.
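The orthogonalization step can be sketched in a few lines of numpy. This is a simplified cubic Newton-Schulz iteration applied to a raw matrix; the actual Muon implementation uses a tuned quintic polynomial and operates on the momentum buffer, so treat this as an illustrative approximation:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Drive all singular values of G toward 1, approximating U @ V^T
    from the SVD G = U @ S @ V^T without ever computing an SVD.
    Simplified cubic iteration; Muon uses a tuned quintic variant."""
    X = G / (np.linalg.norm(G) + 1e-7)   # scale so every singular value <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X  # cubic Newton-Schulz step
    return X

np.random.seed(0)
G = np.random.randn(4, 6)
O = newton_schulz_orthogonalize(G)
print(np.linalg.svd(O, compute_uv=False))  # all close to 1
```

Because every singular value of the update is pushed to (approximately) 1, the update no longer distinguishes dominant gradient modes from weak ones, which is exactly the mechanism that erases the simplicity bias.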

Figure 1: SGD (left) learns singular vectors sequentially (Mode 1, then Mode 2). Spectral GD/Muon (right) learns them simultaneously, bypassing the protective plateaus of the loss landscape.


Methodology: Spectral GD & The Death of Simplicity Bias

The authors analyze Spectral GD, a tractable idealization of Muon in which every update is replaced by its exact orthogonalization (all singular values set to 1). In a 2-layer linear network (W = W₂W₁), they compare the evolution of the singular values σᵢ of the learned map:

  • GD Dynamics: The i-th mode converges at a time proportional to 1/sᵢ (the inverse of the i-th singular value). Large modes are learned first; small modes later. This is the Simplicity Bias.
  • Spectral GD Dynamics: All principal components are learned simultaneously; the convergence time is roughly the same for every mode, independent of sᵢ. Spectral GD rushes to fit everything, including the "tail" of the data distribution.
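The two dynamics can be contrasted in a small numerical experiment. The sketch below fits a two-layer linear network W₂W₁ to a rank-3 target with plain GD and with idealized Spectral GD (exact UVᵀ via SVD in place of Newton-Schulz), recording the first step at which each target mode is 90% learned. The target spectrum, learning rate, and small-init scale are illustrative choices, not the paper's settings:

```python
import numpy as np

def orthogonalize(G):
    """Exact U @ V^T of the SVD -- the idealized Spectral GD update."""
    u, _, vt = np.linalg.svd(G, full_matrices=False)
    return u @ vt

def train(spectral, d=6, s_target=(3.0, 1.0, 0.3), lr=0.05, steps=4000):
    np.random.seed(0)
    s = np.array(s_target)
    U, _ = np.linalg.qr(np.random.randn(d, d))
    V, _ = np.linalg.qr(np.random.randn(d, d))
    W_star = U[:, :3] @ np.diag(s) @ V[:, :3].T   # rank-3 target map
    W1 = 0.01 * np.random.randn(d, d)             # small init -> saddle dynamics
    W2 = 0.01 * np.random.randn(d, d)
    first_hit = {}  # mode index -> first step reaching 90% of its target value
    for t in range(steps):
        E = W2 @ W1 - W_star                      # residual
        g2, g1 = E @ W1.T, W2.T @ E               # grads of 0.5 * ||E||_F^2
        if spectral:
            g2, g1 = orthogonalize(g2), orthogonalize(g1)
        W2 -= lr * g2
        W1 -= lr * g1
        learned = np.diag(U[:, :3].T @ (W2 @ W1) @ V[:, :3])
        for i in range(3):
            if i not in first_hit and learned[i] > 0.9 * s[i]:
                first_hit[i] = t
    return first_hit

print("GD:         ", train(spectral=False))  # modes learned one after another
print("Spectral GD:", train(spectral=True))   # modes learned close together
```

With plain GD the hit times spread out roughly as 1/sᵢ, while the orthogonalized run compresses them into a narrow window, reproducing the sequential-vs-simultaneous picture described above.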

The Problem with Zero-Bias

While learning everything at once sounds great for imbalanced data (where rare classes are usually ignored), it is dangerous for tasks requiring Shared Representations.


Experiment 1: Memorization vs. Generalization

In a "Routing" task—where a model must learn to map different input domains to common outputs—the authors found a striking divergence:

  1. SGD: Discovered the shared underlying structure, generalizing perfectly even to input-output pairs it never saw during training.
  2. Muon/Spectral GD: Achieved zero training loss but failed the OOD test. It chose to memorize each pathway independently rather than finding the common rule.

Figure 2: The hidden-layer spectrum. SGD finds a low-rank, structured solution (rank 4), while Spectral GD yields a heavy-tailed, high-rank solution indicative of memorization.
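The low-rank vs heavy-tailed contrast can be quantified with an entropy-based effective rank (the measure below follows Roy and Vetterli's definition; the paper may report a different rank statistic):

```python
import numpy as np

def effective_rank(W, eps=1e-12):
    """exp of the entropy of the normalized singular-value distribution:
    close to the matrix rank for a flat spectrum, and it grows as the
    spectrum spreads into a heavy tail."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))

print(effective_rank(np.eye(8)))                         # flat spectrum: ~8
print(effective_rank(np.outer(np.ones(8), np.ones(8))))  # rank-1: ~1
```

Applied to the hidden-layer weights of the routing task, a value near 4 would match the structured SGD solution, while a much larger value would flag the memorizing spectrum.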


Experiment 2: The Spurious Feature Trap

Does Muon's speed make it more likely to "cheat"? The authors tested this on MNIST by adding a single high-intensity pixel as a spurious clue for the class label.

  • SGD focuses on the "dominant" feature (the digit shape) first. You can "early stop" SGD to get a model that ignores the spurious pixel.
  • Muon fits the spurious pixel and the digit shape at the same time. There is no window where the model is "clean" but accurate.
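The spurious-cue construction is easy to reproduce on any image batch. The helper below is a hypothetical sketch (the paper's exact pixel position and intensity are not given here): it plants one bright pixel whose column index encodes the class label.

```python
import numpy as np

def add_spurious_pixel(images, labels, row=0, intensity=3.0):
    """Return a copy of `images` with one class-correlated bright pixel:
    for an image with label c, pixel (row, c) is set to `intensity`.
    A model can reach zero training loss by reading this pixel alone.
    Hypothetical helper; position and intensity are illustrative."""
    out = images.copy()
    out[np.arange(len(labels)), row, labels] = intensity
    return out

rng = np.random.default_rng(0)
images = rng.random((16, 28, 28))           # stand-in for MNIST images
labels = rng.integers(0, 10, size=16)
poisoned = add_spurious_pixel(images, labels)
```

An early-stopping sweep then compares accuracy on poisoned and clean copies of a held-out set: under SGD there is a window where clean accuracy is high before the pixel is exploited, while under Muon no such window appears.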

Figure 3: Accuracy on test sets with (dashed) and without (solid) the spurious feature. Muon's performance on clean data drops significantly earlier than SGD's.


Critical Insights & Conclusion

Takeaways for Researchers

  • Muon is great for: Imbalanced datasets, tail-end associative memory, and scenarios where data is "clean" and requires maximum training efficiency.
  • Muon is risky for: Tasks requiring causal/hierarchical reasoning (like math or code), and datasets with strong spurious correlations (like many medical or real-world CV datasets).

The Future of Optimizers

The authors suggest that the next generation of optimizers should aim for the "best of both worlds": the speed of Muon in breaking saddles, but without losing the sequential complexity increase that makes SGD so robust. As the industry moves toward adopting Muon as the new default for LLMs, we must stay vigilant about the functional properties of the weights we are training, not just the speed of the loss curve.

Final Verdict: Muon is a powerful tool, but it lacks the "patience" of SGD. Use it wisely.
