This paper introduces a unified framework to study the "Edge of Stability" (EoS) phenomenon across non-Euclidean optimization algorithms. By defining a generalized sharpness measure based on arbitrary norms, the authors demonstrate that diverse optimizers like SignGD, Muon, and Block Coordinate Descent all exhibit progressive sharpening followed by stabilization at the 2/η threshold.
## TL;DR
The "Edge of Stability" (EoS), the phenomenon where training sharpness hovers at the limit of convergence ($2/\eta$), is not just for standard Gradient Descent. This paper shows, through a new metric called Generalized Sharpness, that EoS also governs a broad family of non-Euclidean first-order optimizers, including SignGD, Muon, and Block Coordinate Descent.
## Background: The Mystery of the Ceiling
In classical optimization theory, we assume a function is $L$-smooth and keep the step size $\eta < 2/L$ to ensure a monotonic decrease in loss. However, deep learning famously breaks this rule. In 2021, Cohen et al. observed that neural networks undergo Progressive Sharpening, where the top eigenvalue of the Hessian ($\lambda_{max}(H)$) grows until it hits the "Edge of Stability" at $2/\eta$. At this point, the loss begins to oscillate but training continues to progress.
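To make the classical threshold concrete, here is a minimal sketch of plain gradient descent on a toy two-dimensional quadratic. The function name `run_gd`, the eigenvalues, the starting point, and the step counts are illustrative assumptions, not taken from the paper. On a quadratic the sharpness is fixed, so this only shows the $2/\eta$ boundary itself, not progressive sharpening.

```python
import numpy as np

def run_gd(eigenvalues, eta, steps=200):
    """Run GD on L(w) = 0.5 * sum_i eigenvalues[i] * w[i]**2; return the final loss."""
    lam = np.asarray(eigenvalues, dtype=float)
    w = np.ones_like(lam)  # arbitrary starting point (assumption)
    for _ in range(steps):
        w = w - eta * lam * w  # the gradient of this quadratic is lam * w
    return 0.5 * np.sum(lam * w**2)

eigenvalues = [1.0, 10.0]      # toy Hessian spectrum (assumption)
sharpness = max(eigenvalues)   # lambda_max(H) for a diagonal Hessian

stable_loss = run_gd(eigenvalues, eta=1.9 / sharpness)    # eta just below 2 / sharpness
unstable_loss = run_gd(eigenvalues, eta=2.1 / sharpness)  # eta just above 2 / sharpness

print(f"eta < 2/sharpness: final loss = {stable_loss:.3e}")
print(f"eta > 2/sharpness: final loss = {unstable_loss:.3e}")
```

Along the sharpest direction each step multiplies the coordinate by $1 - \eta \lambda_{max}$, so the iterate shrinks when $\eta < 2/\lambda_{max}$ and grows with alternating sign when $\eta > 2/\lambda_{max}$.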
Until now, this was largely a Euclidean story. But modern LLMs use optimizers like Muon (spectral norm) or SignGD ($\ell_\infty$ norm). Do these follow the same rules? This paper says: Yes, but you’ve been measuring sharpness with the wrong ruler.
## The Core Concept: Generalized Sharpness
The authors argue that "sharpness" must be defined relative to the geometry of the optimizer. If your optimizer moves based on the $\ell_\infty$ norm (like SignGD), measuring the largest Hessian eigenvalue tells you little about its stability.
They define Generalized Sharpness as:

$$S(\mathbf{w}) = \max_{\|\mathbf{d}\| \le 1} \mathbf{d}^\top \nabla^2 \mathcal{L}(\mathbf{w}) \mathbf{d}$$

This definition changes based on the norm $\|\cdot\|$:

* **$\ell_2$ Norm**: Recovers standard sharpness ($\lambda_{max}(H)$).
* **$\ell_\infty$ Norm**: Equivalent to a Max-Cut type problem (Ising spin glass), which is NP-hard but describes SignGD stability.
* **Spectral Norm**: Describes the stability limit for the high-performance Muon optimizer.

### Architecture-Aware Analysis

The authors utilize the **Frank-Wolfe algorithm** to tackle this maximization problem during training, providing the first real-time look at stability for "exotic" geometries.

*(Figure 1 showing vanilla GD behavior as a baseline: sharpness accurately tracks the $2/\eta$ threshold.)*

## Experimental Evidence: EoS is Everywhere

The paper provides striking evidence that EoS shows up consistently across different architectures (MLPs, CNNs, Transformers) and optimizers.

### 1. SignGD and Muon

When training with **SignGD** or **Muon**, the standard $\ell_2$ sharpness often remains flat and well below $2/\eta$. However, when the authors plot their **Generalized Sharpness**, it follows the classic EoS trajectory: it sharpens until it hits the $2/\eta$ threshold and then plateaus.

*(Figure 5: In SignGD and Muon, normalized generalized sharpness hits the $2/\eta$ line precisely, even when standard metrics fail.)*

### 2. Spectral and Block CD

For optimizers using the spectral norm, the stability dynamics are even more complex. The authors observed an **oscillatory regime before EoS**, where the network's predictions start to wobble before the global sharpness actually reaches the theoretical limit.

*(Figure 4: Spectral GD demonstrates the same EoS phenomenon on Transformers and MLPs.)*

## Theoretical Insight: Why 2/η?

The authors provide a theoretical bridge in **Theorem 5.2**.
They prove that for any quadratic objective, there exists a specific direction (the "maximizing direction" of the norm) where the update rule becomes exactly:

$$\mathbf{w}_{t} = (1 - \eta S)^t \mathbf{w}_0$$

When $\eta > 2/S$, the factor $1 - \eta S$ drops below $-1$, leading to divergent oscillations. This explains why the optimizer "self-stabilizes": as soon as it pushes sharpness past $2/\eta$, the resulting instability forces the weights back into a flatter region of the landscape.

## Critical Analysis & Conclusion

**Takeaway**: This work provides a unifying "spectral ruler." Whether you are using a standard Adam optimizer or a sophisticated block-diagonal spectral method, you can now predict the stability limit using the same mathematical framework.

**Limitations**:

* **Computational Cost**: Measuring generalized sharpness requires Frank-Wolfe restarts, which is expensive for massive LLMs.
* **The Gap**: In some cases (like ResNet20), the sharpness hovers *slightly above* $2/\eta$. The authors attribute this to "chaotic oscillatory dynamics" in multi-dimensional subspaces, but a closed-form explanation for the exact gap remains elusive.

**Future Work**: This framework opens the door to designing **geometry-aware schedulers** that automatically adjust the learning rate $\eta$ based on the generalized sharpness, potentially preventing training collapses in large-scale frontier models.
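As a concrete companion to the generalized sharpness idea and the Frank-Wolfe restarts mentioned under Limitations, the sketch below estimates an $\ell_\infty$ sharpness on a small toy Hessian. It reads generalized sharpness as the maximum of $\mathbf{d}^\top H \mathbf{d}$ over the unit ball of the chosen norm, and it assumes $H$ is positive semidefinite so that the maximum over the $\ell_\infty$ ball is attained at a sign vector. The function names, the random matrix, and the restart and iteration counts are hypothetical, and this is not necessarily the authors' exact procedure.

```python
import itertools
import numpy as np

def linf_sharpness_bruteforce(H):
    """Exact max of d^T H d over sign vectors d in {-1, +1}^n (feasible for small n only)."""
    n = H.shape[0]
    best = -np.inf
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        d = np.array(signs)
        best = max(best, d @ H @ d)
    return best

def linf_sharpness_frank_wolfe(H, restarts=20, iters=50, seed=0):
    """Heuristic lower bound: full Frank-Wolfe steps d <- sign(H d) from random sign starts."""
    rng = np.random.default_rng(seed)
    n = H.shape[0]
    best = -np.inf
    for _ in range(restarts):
        d = rng.choice([-1.0, 1.0], size=n)
        for _ in range(iters):
            g = H @ d                            # half the gradient of d^T H d
            d_new = np.where(g >= 0, 1.0, -1.0)  # maximizer of <g, s> over the l_inf unit ball
            if np.array_equal(d_new, d):         # fixed point reached
                break
            d = d_new
        best = max(best, d @ H @ d)
    return best

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
H = A @ A.T                                      # toy PSD "Hessian" (assumption)

l2_sharpness = np.linalg.eigvalsh(H)[-1]         # standard sharpness, lambda_max(H)
exact = linf_sharpness_bruteforce(H)
approx = linf_sharpness_frank_wolfe(H)

print(f"l2 sharpness:              {l2_sharpness:.4f}")
print(f"l_inf sharpness (exact):   {exact:.4f}")
print(f"l_inf sharpness (FW est.): {approx:.4f}")
```

Because the $\ell_2$ unit ball sits inside the $\ell_\infty$ unit ball, the $\ell_\infty$ value is never smaller than $\lambda_{max}(H)$ under this reading. The brute-force search scales as $2^n$, which is why a heuristic with restarts is the practical option in higher dimensions.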