Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

WisPaper

学术搜索

学术问答

价格

TrueCite

工作空间

Home

Blog

Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

[arXiv 2026] Nexus: Achieving Better Generalization via the Geometry of Common Minima

总结

问题

方法

结果

要点

摘要

The paper introduces Nexus, a novel second-order gradient-based optimization framework designed for Large Language Model (LLM) pretraining. It demonstrates that by maximizing gradient similarity across diverse data sources to reach a "common minima," models can achieve significantly better downstream generalization and out-of-distribution (OOD) performance while maintaining the same pretraining loss.

TL;DR

LLM pretraining usually treats all data as one big average, but this hides a geometric inefficiency. Nexus is a new optimizer wrapper that forces the model to find the "consensus" point between different data sources (code, math, text). The result? A model that achieves the same pretraining loss as AdamW but executes logic and reasoning tasks with up to 15% higher accuracy. It proves that where you land in the loss landscape matters as much as how low you go.

The Geometric Insight: Sum vs. Intersection

When training on multiple domains $L_{1}, L_{2} ... L_{K}$ , we usually minimize $L_{t r ain} = \frac{1}{K} \sum L_{k}$ . Mathematically, there are two ways to achieve a low total loss:

Sum of Minima: The model finds a "middle ground" that is actually far from the ideal point of any single task. (High variance, poor OOD generalization).
Intersection of Minima (Nexus): The model finds a point that is geometrically close to the specific minimizers of all tasks. (Low variance, superior generalization).

The authors hypothesize that Closeness—the distance between these task-specific minima—is a key second-order property for generalization, alongside the well-known property of Flatness.

Sum vs Intersection Architecture Figure 1: (a) Distant minima lead to poor OOD performance; (b) Close/Intersecting minima lead to a common low-loss region for downstream tasks.

Methodology: How Nexus Finds the Sweet Spot

Directly calculating the distance to every task's ideal minimum is computationally impossible. Nexus circumvents this by using Gradient Similarity as a proxy.

The Intuition: If the gradients of Task A and Task B always point in the same direction, their minima must be in the same place.

Nexus implements this via a Dual-Loop Mechanism:

Inner Loop: Performs a few steps of Normalized SGD to see where the gradient is "heading" across different mini-batches.
Outer Loop: Uses the "displacement" (the difference between the start and end of the inner loop) as a pseudo-gradient. This displacement naturally contains the Hessian-gradient product, which effectively maximizes the cosine similarity between task gradients.

Nexus Intuition Figure 2: The inner loop displacement (red) acts as a regularizer that pulls the model toward the axis of consensus.

Experimental Results: Scaling the Generalization Gap

The most striking aspect of Nexus is that it offers a "free lunch" in terms of pretraining loss. Across various model sizes (130M to 3B), the pretraining loss curve for Nexus is virtually indistinguishable from AdamW, yet the downstream benchmarks tell a different story.

Key Performance Metrics (3B Model):

GSM8k (Math Reasoning): +15.0% Accuracy.
MATH500: +8.0% Accuracy.
HumanEval (Coding): +4.0% Accuracy.
OOD Validation Loss: -0.012.

Main Results Table

Scaling Findings

The advantage of Nexus amplifies with scale. As models get larger and are trained on more tokens (up to 4x Chinchilla), the gap between Nexus and standard AdamW widens. This suggests that as models become more over-parameterized, they have more "geometric freedom" to choose a consensus-based path without sacrificing training efficiency.

Critical Analysis & Conclusion

The success of Nexus challenges the industry's reliance on "Validation Loss" as the definitive metric for model progress during pretraining. It reveals that two models with the exact same loss can have vastly different capabilities depending on the implicit bias of the optimizer.

Limitations:

Optimizer Incompatibility: Nexus currently doesn't play well with Muon (a recent orthogonal optimizer), possibly due to complex interactions with Newton-Schulz iterations.
Memory Overhead: It requires maintaining an auxiliary inner_model, though the authors suggest this can be mitigated with CPU offloading.

Future Outlook: As LLM pretraining moves from being compute-bound (waiting for GPUs) to data-bound (finding high-quality data), optimizing the path the model takes through that data becomes paramount. Nexus is a significant step toward a generation of optimizers that don't just memorize data, but find the shared logic between domains.

发现相似论文

试试这些示例

Search for recent papers that investigate the "Same Pretraining Loss, Better Downstream" phenomenon in Transformers beyond the scope of optimizer implicit bias.
Which study first formally defined the 'Intersection of Minima' in multi-task learning, and how does Nexus's second-order approximation differ from previous Gradient Agreement methods like PCGrad?
Explore research that applies gradient similarity optimization or consensus-seeking algorithms to multimodal pretraining (e.g., Vision-Language models) to improve cross-domain alignment.

[arXiv 2026] Nexus: Achieving Better Generalization via the Geometry of Common Minima

1. TL;DR

2. The Geometric Insight: Sum vs. Intersection

3. Methodology: How Nexus Finds the Sweet Spot

4. Experimental Results: Scaling the Generalization Gap

4.1. Scaling Findings

5. Critical Analysis & Conclusion