Non-Euclidean Gradient Descent Operates at the Edge of Stability


[NeurIPS 2024] Non-Euclidean Gradient Descent: Unifying the Edge of Stability Across Optimizers


## Abstract

This paper introduces a unified framework to study the "Edge of Stability" (EoS) phenomenon across non-Euclidean optimization algorithms. By defining a generalized sharpness measure based on arbitrary norms, the authors demonstrate that diverse optimizers like SignGD, Muon, and Block Coordinate Descent all exhibit progressive sharpening followed by stabilization at the 2/η threshold.

## TL;DR

The "Edge of Stability" (EoS), the phenomenon where training sharpness hovers at the limit of convergence ($2/\eta$), is not specific to standard Gradient Descent. Using a new metric called Generalized Sharpness, this paper shows that a broad family of non-Euclidean first-order optimizers, including SignGD, Muon, and Block Coordinate Descent, also train at the EoS.

## Background: The Mystery of the $2/\eta$ Ceiling

In classical optimization theory, we assume a function is $L$-smooth and keep the step size $\eta < 2/L$ to ensure a monotonic decrease in loss. However, deep learning famously breaks this rule. In 2021, Cohen et al. observed that neural networks undergo Progressive Sharpening, where the top eigenvalue of the Hessian ($\lambda_{\max}$) grows until it hits the "Edge of Stability" at $2/\eta$. At this point, the loss begins to oscillate but training continues to progress.
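The classical $2/L$ threshold is easy to check numerically. The following is a minimal sketch, assuming a toy two-dimensional quadratic $f(\mathbf{w}) = \tfrac{1}{2}\mathbf{w}^\top H \mathbf{w}$ with a hand-picked diagonal Hessian; the helper `run_gd` and the step sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def run_gd(hessian, w0, eta, steps):
    """Run plain GD on f(w) = 0.5 * w^T H w and return the loss at every step."""
    w = w0.copy()
    losses = []
    for _ in range(steps):
        losses.append(0.5 * w @ hessian @ w)
        w = w - eta * (hessian @ w)  # the gradient of f is H w
    return np.array(losses)

# Toy Hessian (an assumption for illustration): L = lambda_max = 10, so 2/L = 0.2.
H = np.diag([10.0, 1.0])
w0 = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(H).max()

stable = run_gd(H, w0, eta=0.19, steps=50)    # just below 2/L
unstable = run_gd(H, w0, eta=0.21, steps=50)  # just above 2/L

print("2/L =", 2.0 / L)
print("final loss, eta=0.19:", stable[-1])
print("final loss, eta=0.21:", unstable[-1])
```

Below the threshold the loss shrinks at every step; just above it, the component along the sharpest direction flips sign and grows each iteration, so the loss blows up on this quadratic.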

Until now, this was largely a Euclidean story. But modern LLMs use optimizers like Muon (spectral norm) or SignGD ($\ell_\infty$ norm). Do these follow the same rules? This paper says: yes, but you’ve been measuring sharpness with the wrong ruler.

## The Core Concept: Generalized Sharpness

The authors argue that "sharpness" must be defined relative to the geometry of the optimizer. If your optimizer takes steps measured in the $\ell_\infty$ norm (like SignGD), the standard $\ell_2$ sharpness, i.e. the largest Hessian eigenvalue $\lambda_{\max}$, is the wrong quantity to compare against $2/\eta$.
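To make "the geometry of the optimizer" concrete, the sketch below computes the steepest-descent direction under three norms, that is, the $\mathbf{d}$ with $\|\mathbf{d}\| \le 1$ that maximizes $\langle \mathbf{g}, \mathbf{d} \rangle$ for a gradient $\mathbf{g}$. It relies on standard facts (the maximizer is $\mathbf{g}/\|\mathbf{g}\|_2$ for $\ell_2$, $\mathrm{sign}(\mathbf{g})$ for $\ell_\infty$, and $UV^\top$ from the SVD for the spectral norm); the function name `steepest_direction` and the random gradients are assumptions for illustration, not the paper's code.

```python
import numpy as np

def steepest_direction(grad, norm):
    """Return a maximizer of <grad, d> over the unit ball of the chosen norm."""
    if norm == "l2":
        return grad / np.linalg.norm(grad)   # normalized gradient
    if norm == "linf":
        return np.sign(grad)                 # SignGD-style direction
    if norm == "spectral":
        # For a matrix gradient G = U diag(s) V^T, the maximizer is U V^T.
        u, _, vt = np.linalg.svd(grad, full_matrices=False)
        return u @ vt
    raise ValueError(f"unknown norm: {norm}")

rng = np.random.default_rng(0)
g_vec = rng.standard_normal(5)       # stand-in vector gradient
g_mat = rng.standard_normal((4, 3))  # stand-in matrix gradient

d_l2 = steepest_direction(g_vec, "l2")
d_linf = steepest_direction(g_vec, "linf")
d_spec = steepest_direction(g_mat, "spectral")

# The attained inner product equals the dual norm of the gradient.
gain_l2 = g_vec @ d_l2              # equals ||g||_2
gain_linf = g_vec @ d_linf          # equals ||g||_1
gain_spec = np.sum(g_mat * d_spec)  # equals the nuclear norm of G
print(gain_l2, gain_linf, gain_spec)
```

The same gradient yields very different update directions depending on the norm, which is why a stability measure tied to one norm need not say anything about an optimizer built on another.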

They define Generalized Sharpness as:

$$S(\mathbf{w}) = \max_{\|\mathbf{d}\| \le 1} \mathbf{d}^\top \nabla^2 \mathcal{L}(\mathbf{w}) \mathbf{d}$$

This definition changes based on the norm $\|\cdot\|$:

* **$\ell_2$ Norm**: Recovers standard sharpness ($\lambda_{\max}(H)$, with $H = \nabla^2 \mathcal{L}(\mathbf{w})$).
* **$\ell_\infty$ Norm**: Equivalent to a Max-Cut type problem (Ising spin glass), which is NP-hard but describes SignGD stability.
* **Spectral Norm**: Describes the stability limit for the high-performance Muon optimizer.

### Architecture-Aware Analysis

The authors use the **Frank-Wolfe algorithm** to solve this maximization problem during training, providing the first real-time look at stability for "exotic" geometries.

![Model Architecture and Evolution](https://cdn.atominnolab.com/wisdoc/jobs/20260309-0f1fd5b5-ec7e-445a-8da0-a8b90fcceed4/page_003_block_004.png)
*(Figure 1 shows vanilla GD behavior as a baseline: sharpness accurately tracks the $2/\eta$ threshold.)*

## Experimental Evidence: EoS is Everywhere

The paper reports that across different architectures (MLPs, CNNs, Transformers) and optimizers, the EoS behavior persists.

### 1. SignGD and Muon

When training with **SignGD** or **Muon**, the standard $\ell_2$ sharpness often remains flat and well below $2/\eta$. However, when the authors plot their **Generalized Sharpness**, it follows the classic EoS trajectory: it sharpens until it hits the $2/\eta$ threshold and then plateaus.

![Normalized Non-Euclidean GD](https://cdn.atominnolab.com/wisdoc/jobs/20260309-0f1fd5b5-ec7e-445a-8da0-a8b90fcceed4/page_008_block_011.png)
*(Figure 5: In SignGD and Muon, normalized generalized sharpness hits the $2/\eta$ line precisely, even when standard metrics fail.)*

### 2. Spectral and Block CD

For optimizers using the spectral norm, the stability dynamics are even more complex. The authors observed an **oscillatory regime before EoS**, where the network's predictions start to wobble before the global sharpness actually reaches the theoretical limit.
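To make the contrast between $\ell_2$ sharpness and generalized sharpness concrete, here is a minimal sketch that evaluates $\max_{\|\mathbf{d}\| \le 1} \mathbf{d}^\top H \mathbf{d}$ on a small random matrix. Several things are assumptions for illustration: `H` is a random positive semidefinite stand-in for a Hessian (real network Hessians are generally indefinite), the $\ell_\infty$ value is brute-forced over sign vectors (exact only because `H` is positive semidefinite, so the maximum sits at a vertex of the box), and the Frank-Wolfe routine is a plain full-step variant with random restarts, not necessarily the procedure the authors use.

```python
import itertools
import numpy as np

def sharpness_l2(H):
    """l2 case: the largest eigenvalue of the symmetric matrix H."""
    return np.linalg.eigvalsh(H)[-1]  # eigvalsh returns eigenvalues in ascending order

def sharpness_linf_bruteforce(H):
    """l_inf case for PSD H: enumerate all sign vectors (exponential cost, tiny n only)."""
    best = -np.inf
    for signs in itertools.product([-1.0, 1.0], repeat=H.shape[0]):
        d = np.array(signs)
        best = max(best, d @ H @ d)
    return best

def sharpness_linf_frank_wolfe(H, restarts=20, iters=50, seed=0):
    """Full-step Frank-Wolfe with random restarts; returns a lower bound on the maximum."""
    rng = np.random.default_rng(seed)
    best = -np.inf
    for _ in range(restarts):
        d = rng.choice([-1.0, 1.0], size=H.shape[0])
        for _ in range(iters):
            # Linear maximization over the l_inf ball for the gradient 2 H d.
            new_d = np.sign(H @ d)
            new_d[new_d == 0] = 1.0
            if np.array_equal(new_d, d):
                break  # fixed point: a local maximizer among sign vectors
            d = new_d
        best = max(best, d @ H @ d)
    return best

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
H = A @ A.T  # PSD stand-in for a Hessian (assumption)

s_l2 = sharpness_l2(H)
s_linf_exact = sharpness_linf_bruteforce(H)
s_linf_fw = sharpness_linf_frank_wolfe(H)
print(s_l2, s_linf_exact, s_linf_fw)
```

On the same matrix the two norms give different numbers: the $\ell_\infty$ value is at least the $\ell_2$ value because the $\ell_2$ unit ball sits inside the $\ell_\infty$ unit ball. The Frank-Wolfe estimate can only undershoot the exact value, which is one reason restarts are used.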
![Spectral GD Results](https://cdn.atominnolab.com/wisdoc/jobs/20260309-0f1fd5b5-ec7e-445a-8da0-a8b90fcceed4/page_007_block_000.png)
*(Figure 4: Spectral GD demonstrates the same EoS phenomenon on Transformers and MLPs.)*

## Theoretical Insight: Why 2/η?

The authors provide a theoretical bridge in **Theorem 5.2**. They prove that for any quadratic objective, there exists a specific direction (the "maximizing direction" of the norm) along which the iterates satisfy exactly:

$$\mathbf{w}_{t} = (1 - \eta S)^t \mathbf{w}_0$$

When $\eta > 2/S$, the factor $1 - \eta S$ becomes less than $-1$, so its powers alternate in sign and grow in magnitude, leading to divergent oscillations. This explains why the optimizer "self-stabilizes": as soon as it pushes sharpness past $2/\eta$, the resulting instability forces the weights back into a flatter region of the landscape.

## Critical Analysis & Conclusion

**Takeaway**: This work provides a unifying "spectral ruler." Whether you are using vanilla GD, SignGD, or a more sophisticated block-wise spectral method, you can now predict the stability limit using the same mathematical framework.

**Limitations**:

* **Computational Cost**: Measuring generalized sharpness requires Frank-Wolfe restarts, which is expensive for massive LLMs.
* **The Gap**: In some cases (like ResNet20), the sharpness hovers *slightly above* $2/\eta$. The authors attribute this to "chaotic oscillatory dynamics" in multi-dimensional subspaces, but a closed-form explanation for the exact gap remains elusive.

**Future Work**: This framework opens the door to designing **geometry-aware schedulers** that automatically adjust the learning rate $\eta$ based on the generalized sharpness, potentially preventing training collapses in large-scale frontier models.
