This paper introduces a theoretical critique of the Muon optimizer, a high-performance first-order method that uses Newton-Schulz orthogonalization of its updates to accelerate training. By analyzing "Spectral GD" in deep linear networks, the authors demonstrate that Muon's speed comes at the cost of losing the "simplicity bias" inherent in SGD, which typically learns dominant, low-rank features before complex ones.
Executive Summary
TL;DR
Muon has recently taken the AI world by storm, powering record-breaking training runs in the "NanoGPT Speedrun." However, this paper by Dragutinović and Ranganath pulls back the curtain on this apparent free lunch. By analyzing the spectral dynamics of optimization, the authors reveal that Muon's speed is gained by sacrificing the Simplicity Bias: the natural tendency of SGD to learn simple, low-rank structures before complex ones.
Academic Context
This work shifts the narrative from "how fast can we train?" to "what are we actually learning?" It positions Muon as a "greedy" optimizer that, while efficient, may be more prone to memorization and more susceptible to spurious correlations than the more "patient" SGD.
The Core Intuition: The "Neural Race"
In standard Gradient Descent (GD), learning is hierarchical. The model first escapes the saddle points associated with the most dominant features and only then moves on to more subtle patterns. This creates an implicit curriculum.
Muon changes the rules of the race. By orthogonalizing the update (setting all singular values of the update matrix to 1), it forces the model to learn all features, signal and noise alike, at the same rate.
Figure 1: SGD (Left) learns singular vectors sequentially (Mode 1 then Mode 2). Spectral GD/Muon (Right) learns them simultaneously, bypassing the protective plateaus of the loss landscape.
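In practice, Muon implements this orthogonalization with a Newton-Schulz iteration rather than an exact SVD. Below is a minimal sketch using the classical cubic iteration; Muon's actual implementation applies a tuned quintic polynomial to the momentum buffer, so this function name and form are an illustrative simplification:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Push all singular values of G toward 1 while preserving its singular vectors.

    Classical cubic Newton-Schulz iteration: X <- 1.5*X - 0.5*X X^T X.
    It converges when the initial spectral norm is below sqrt(3), which the
    Frobenius normalization below guarantees.
    """
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm upper-bounds the spectral norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

For G = U diag(s) V^T this converges to U V^T, which is exactly the "all singular values set to 1" update described above, applied by Muon after scaling by the learning rate.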
Methodology: Spectral GD & The Death of Simplicity Bias
The authors analyze Spectral GD, a tractable idealization of Muon. In a 2-layer linear network (W = W2 W1), they compare the evolution of the singular values s_i of the end-to-end map W:
- GD Dynamics: The i-th mode converges at a time proportional to 1/s_i (the inverse of its singular value). Large modes are learned first; small modes later. This is the Simplicity Bias.
- Spectral GD Dynamics: All principal components are learned simultaneously; the convergence time is essentially the same for every mode, independent of s_i. It rushes to fit everything, including the "tail" of the data distribution.
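The contrast between the two regimes can be reproduced in a few lines. The toy simulation below is our own sketch (not the paper's code, and all names are ours): it fits a 2-layer linear network to a target map with singular values 3, 1, and 0.3, using either plain GD or Spectral GD (gradients orthogonalized by exact SVD), and records the top three singular values of the product W2 W1 at every step:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
# Target map with well-separated singular values: 3.0, 1.0, 0.3 (rest zero)
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
s_true = np.array([3.0, 1.0, 0.3] + [0.0] * (d - 3))
T = U @ np.diag(s_true) @ V.T

def orth(M):
    """Replace all singular values of M by 1 (exact-SVD version of Muon's update)."""
    u, _, vt = np.linalg.svd(M, full_matrices=False)
    return u @ vt

def train(spectral, lr, steps):
    init = np.random.default_rng(1)
    W1 = 1e-3 * init.normal(size=(d, d))  # small init: training starts near the saddle at 0
    W2 = 1e-3 * init.normal(size=(d, d))
    history = []
    for _ in range(steps):
        E = W2 @ W1 - T                    # residual of the end-to-end map
        g2, g1 = E @ W1.T, W2.T @ E        # gradients of 0.5 * ||W2 W1 - T||_F^2
        if spectral:
            g2, g1 = orth(g2), orth(g1)    # Spectral GD: every mode pushed at the same rate
        W2 -= lr * g2
        W1 -= lr * g1
        history.append(np.linalg.svd(W2 @ W1, compute_uv=False)[:3])
    return np.array(history)
```

Plotting the three columns of `history` for each regime reproduces the qualitative contrast in Figure 1: sequential growth of the modes under GD, simultaneous growth under Spectral GD.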
The Problem with Zero-Bias
While learning everything at once sounds great for imbalanced data (where rare classes are usually ignored), it is dangerous for tasks requiring Shared Representations.
Experiment 1: Memorization vs. Generalization
In a "Routing" task—where a model must learn to map different input domains to common outputs—the authors found a striking divergence:
- SGD: Discovered the shared underlying structure. Even for input-output pairs it never saw during training, it generalized perfectly.
- Muon/Spectral GD: Achieved zero training loss but failed the OOD test. It chose to memorize each pathway independently rather than finding the common rule.
Figure 2: The hidden layer spectrum. SGD results in a low-rank, structured solution (rank 4), while Spectral GD yields a heavy-tailed, high-rank solution indicative of memorization.
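A cheap way to quantify this difference is the stable rank of a weight matrix (squared Frobenius norm over squared spectral norm), which is low for structured, low-rank solutions and high for heavy-tailed spectra. This helper is our own illustration, not from the paper:

```python
import numpy as np

def stable_rank(W):
    """||W||_F^2 / ||W||_2^2: a smooth, noise-robust proxy for matrix rank."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)
```

A rank-1 matrix has stable rank 1, while the d x d identity has stable rank d; applied to the hidden layer above, the SGD solution would score near 4 and the Spectral GD solution much higher.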
Experiment 2: The Spurious Feature Trap
Does Muon's speed make it more likely to "cheat"? The authors tested this on MNIST by adding a single high-intensity pixel as a spurious clue for the class label.
- SGD focuses on the "dominant" feature (the digit shape) first. You can "early stop" SGD to get a model that ignores the spurious pixel.
- Muon fits the spurious pixel and the digit shape at the same time. There is no window where the model is "clean" but accurate.
Figure 3: Accuracy on sets with (dashed) and without (solid) spurious features. Muon's performance on clean data drops significantly earlier than SGD's.
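The spurious-feature setup is easy to recreate. The sketch below is a hypothetical version of the construction: the paper adds a single high-intensity pixel correlated with the label, and here we assume the pixel's column position encodes the class (the exact position and intensity used in the paper may differ):

```python
import numpy as np

def add_spurious_pixel(images, labels, intensity=3.0):
    """Plant a perfectly label-predictive pixel: top row, column = class index.

    `images` is (n, 784) flattened MNIST-style data; `labels` holds class ids 0-9.
    Returns a modified copy; the input array is left untouched.
    """
    x = images.reshape(len(images), 28, 28).copy()
    for i, y in enumerate(labels):
        x[i, 0, y] = intensity  # the "shortcut" feature the optimizer can latch onto
    return x.reshape(len(images), -1)
```

Training on this data and evaluating on a clean copy (without the pixel) yields the dashed-vs-solid comparison of Figure 3.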
Critical Insights & Conclusion
Takeaways for Researchers
- Muon is great for: Imbalanced datasets, tail-end associative memory, and scenarios where data is "clean" and requires maximum training efficiency.
- Muon is risky for: Tasks requiring causal/hierarchical reasoning (like math or code), and datasets with strong spurious correlations (like many medical or real-world CV datasets).
The Future of Optimizers
The authors suggest that the next generation of optimizers should aim for the "best of both worlds": the speed of Muon in breaking saddles, but without losing the sequential complexity increase that makes SGD so robust. As the industry moves toward adopting Muon as the new default for LLMs, we must stay vigilant about the functional properties of the weights we are training, not just the speed of the loss curve.
Final Verdict: Muon is a powerful tool, but it lacks the "patience" of SGD. Use it wisely.
