Scalable Training of Mixture-of-Experts Models with Megatron Core

[NVIDIA 2026] Megatron-Core MoE: Shattering the Three Walls of Trillion-Parameter Sparse Training

Abstract

NVIDIA introduces Megatron-Core MoE, a comprehensive open-source training stack designed to scale Mixture-of-Experts (MoE) models to trillions of parameters. The framework achieves state-of-the-art performance, reaching 1,233 TFLOPS/GPU on NVIDIA GB300 systems for models like DeepSeek-V3, by integrating novel multi-dimensional parallelism and system-level fusions.

TL;DR

NVIDIA has released a technical report on Megatron-Core MoE, a production-ready stack that scales Mixture-of-Experts (MoE) models to the trillion-parameter frontier. By introducing Parallel Folding (decoupled parallelism for attention and MoE layers) and attacking the Memory, Communication, and Compute Walls, the stack reaches over 1,200 TFLOPS/GPU on the Blackwell (GB300) platform, making the training of models like DeepSeek-V3 and Qwen3 significantly more efficient.

The "Sparsity Paradox": Why MoE is a Systems Nightmare

In a dense model, parameters and compute scale in lockstep. In MoE, they diverge. A model like DeepSeek-V3 has 685B total parameters but only 37B active parameters per token.
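
To make the divergence concrete, here is a back-of-the-envelope sketch using the two figures quoted above. The bytes-per-parameter value (BF16 weights only) and the 80 GB GPU size are assumptions added for illustration; they are not taken from the report.

```python
import math

TOTAL_PARAMS = 685e9      # total parameters, as quoted above
ACTIVE_PARAMS = 37e9      # parameters active per token, as quoted above
BYTES_PER_PARAM = 2       # assumption: BF16 weights only
GPU_MEMORY_BYTES = 80e9   # assumption: a single 80 GB GPU

# Fraction of the model that does work for any one token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Memory for the weights alone (no gradients, optimizer state, or activations).
weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
min_gpus_for_weights = math.ceil(weight_bytes / GPU_MEMORY_BYTES)

print(f"active fraction per token: {active_fraction:.1%}")
print(f"weights alone: {weight_bytes / 1e12:.2f} TB")
print(f"GPUs needed just to hold the weights: {min_gpus_for_weights}")
```

This lower bound counts weights only. Gradients, optimizer state, and activations multiply the footprint several times over, which is how a training run ends up spread across far more GPUs than the per-token compute alone would justify.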

This creates a "Sparsity Paradox": you need hundreds of GPUs to hold the experts in memory, but because the per-token compute is so low, the communication overhead of moving tokens between those experts (All-to-All) starts to dominate the execution time. NVIDIA breaks the problem down into three walls:

The Memory Wall: Experts must stay in memory, even when idle.
The Communication Wall: Expert Parallelism (EP) saturates inter-node bandwidth.
The Compute Wall: Small expert GEMMs underutilize the GPU, and host overhead creates "bubbles."

Methodology: The Architecture of Efficiency

1. Parallel Folding: Breaking the Dense-Sparse Mismatch

The most significant architectural shift is Parallel Folding. Traditionally, frameworks forced Attention and MoE layers to share the same Tensor Parallel (TP) and Data Parallel (DP) configuration. Megatron-Core MoE decouples them.

Attention uses high TP to shard large query/key matrices.
MoE uses EP with TP=1 to maintain full-width expert GEMMs for peak efficiency.
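
As a rough illustration of the decoupling, the sketch below factors one pool of GPU ranks in two different ways: once for attention (TP groups) and once for MoE (EP groups with TP=1). The helper names, the contiguous rank ordering, and the omission of pipeline and context parallelism are all simplifying assumptions; this is not Megatron-Core's actual group-construction code.

```python
def contiguous_groups(world_size, group_size):
    """Split ranks 0..world_size-1 into consecutive groups of group_size."""
    if group_size <= 0 or world_size % group_size != 0:
        raise ValueError("group_size must divide world_size")
    return [list(range(start, start + group_size))
            for start in range(0, world_size, group_size)]


def folded_layout(world_size, attn_tp, moe_ep):
    """Hypothetical helper: two independent factorizations of the same ranks."""
    return {
        # Attention view: world_size = attn_tp * attn_dp
        "attn_tp_groups": contiguous_groups(world_size, attn_tp),
        "attn_dp": world_size // attn_tp,
        # MoE view (expert TP fixed at 1): world_size = moe_ep * moe_dp
        "moe_ep_groups": contiguous_groups(world_size, moe_ep),
        "moe_dp": world_size // moe_ep,
    }


layout = folded_layout(world_size=16, attn_tp=4, moe_ep=8)
print(layout["attn_tp_groups"][0], layout["attn_dp"])
print(layout["moe_ep_groups"][0], layout["moe_dp"])
```

The point of the sketch is that rank 0 belongs to a 4-way TP group when it runs attention and to an 8-way EP group when it runs the MoE layer, with neither choice constraining the other.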

[Figure: Parallel Folding concept]

2. Solving for Communication (DeepEP & HybridEP)

To scale MoE, NVIDIA optimized the Token Dispatcher. HybridEP (for GB200 NVL72) and DeepEP (for H100) use RDMA and hardware-accelerated kernels to maximize bandwidth. Furthermore, they implement a Merged FWD-BWD overlap strategy (similar to DualPipe), hiding All-to-All latency behind expert computation.
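
For readers unfamiliar with what a token dispatcher has to compute before the All-to-All, here is a minimal pure-Python sketch. It assumes experts are assigned to EP ranks in contiguous blocks and that routing decisions are already available as a list of expert ids per token. The function name is hypothetical, and the sketch does not reflect how DeepEP or HybridEP are implemented.

```python
def plan_dispatch(topk_experts, num_experts, ep_size):
    """Return (send_counts, token_order) for an All-to-All token dispatch.

    topk_experts: one list of chosen expert ids per local token.
    send_counts[r]: number of token copies that must be sent to EP rank r.
    token_order: local token indices, permuted so copies are grouped by expert.
    """
    if num_experts % ep_size != 0:
        raise ValueError("ep_size must divide num_experts")
    experts_per_rank = num_experts // ep_size

    # One (expert, token) pair per routed copy, sorted so that copies bound
    # for the same expert (and therefore the same rank) are adjacent.
    pairs = sorted((expert, token)
                   for token, experts in enumerate(topk_experts)
                   for expert in experts)

    send_counts = [0] * ep_size
    for expert, _ in pairs:
        send_counts[expert // experts_per_rank] += 1
    token_order = [token for _, token in pairs]
    return send_counts, token_order


# Three tokens, top-2 routing over 4 experts spread across 2 EP ranks.
counts, order = plan_dispatch([[0, 3], [1, 2], [3, 0]], num_experts=4, ep_size=2)
print(counts, order)
```

The send counts are exactly the data-dependent quantity that makes MoE communication irregular: they change every step, and every rank needs them before it can size its receive buffers.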

3. Sync-Free MoE & CUDA Graphs

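A defining challenge of MoE is that expert input shapes are dynamic: the host does not know how many tokens will be routed to each expert until the routing result exists on the GPU. Launching correctly sized kernels therefore normally requires a GPU-to-CPU sync, which stalls the pipeline and kills performance. Megatron-Core introduces: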

Device-Initiated Kernels: GEMMs that read shapes directly from GPU memory.
Paged Stashing: A memory management system that allows CUDA Graphs to handle dynamic workloads by reusing static "temp" buffers.
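
The general idea of backing variable-sized workloads with a fixed pool can be sketched as follows. This is a toy model under stated assumptions: the class name, the page granularity in tokens, and the allocation policy are all hypothetical, and the report's actual Paged Stashing mechanism may differ substantially.

```python
class PagePool:
    """Toy fixed-size page pool: total capacity never changes after creation."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size           # tokens per page (assumed unit)
        self._free = list(range(num_pages))  # ids of currently unused pages

    @property
    def free_pages(self):
        return len(self._free)

    def stash(self, num_tokens):
        """Reserve enough pages for num_tokens and return their ids."""
        needed = -(-num_tokens // self.page_size)  # ceiling division
        if needed > len(self._free):
            raise MemoryError("page pool exhausted")
        return [self._free.pop() for _ in range(needed)]

    def release(self, pages):
        """Return pages to the pool once the stashed data is consumed."""
        self._free.extend(pages)


pool = PagePool(num_pages=8, page_size=4)
pages = pool.stash(10)       # 10 tokens occupy 3 pages of 4 tokens each
print(len(pages), pool.free_pages)
pool.release(pages)
print(pool.free_pages)
```

Because the pool's size is fixed up front, the memory addresses involved stay stable across steps, which is the property a captured CUDA Graph needs; only the number of pages in use varies with the routing outcome.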

[Figure: Sync-Free logic]

Experiments: Dominating the Benchmarks

NVIDIA tested the stack on DeepSeek-V3-685B and Qwen3-235B.

GB300 (Blackwell): Achieved 1,233 TFLOPS/GPU (MXFP8).
H100 (Hopper): Achieved 368 TFLOPS/GPU using Blockwise FP8.

Even in long-context (128K) scenarios, the system sustained over 1,100 TFLOPS by using Dynamic Context Parallelism, which adapts the context-parallel degree to the sequence lengths present in each batch.
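
One simple way such a policy could work is sketched below: pick the smallest power-of-two context-parallel degree that keeps each GPU's share of the longest sequence under a token budget. The function name, the power-of-two restriction, and the budget parameter are assumptions for illustration; the report's actual selection rule is not described in this summary.

```python
def pick_cp_size(max_seq_len, tokens_per_gpu_budget, max_cp):
    """Smallest power-of-two CP degree whose per-GPU shard fits the budget.

    Falls back to max_cp if even the largest allowed degree is not enough.
    """
    cp = 1
    while max_seq_len / cp > tokens_per_gpu_budget and cp < max_cp:
        cp *= 2
    return cp


# A batch dominated by short sequences needs no context parallelism,
# while a 128K-token sequence is split across several GPUs.
for seq_len in (4096, 32768, 131072):
    print(seq_len, pick_cp_size(seq_len, tokens_per_gpu_budget=16384, max_cp=64))
```

Choosing the degree per batch avoids paying context-parallel communication costs on batches of short sequences that would fit on a single GPU anyway.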

| Model | System | Precision | TFLOPS/GPU |
| :--- | :--- | :--- | :--- |
| DeepSeek-V3 | GB300 | MXFP8 | 1233 |
| DeepSeek-V3 | GB200 | MXFP8 | 1048 |
| Qwen3-235B | GB200 | MXFP8 | 919 |

[Figure: Throughput comparison]

Critical Insights: Future-Proofing MoE

The transition to NVFP4 (4-bit) on the Blackwell platform is a game-changer. NVIDIA's recipe (using Random Hadamard Transforms and Stochastic Rounding) shows that we can now train at 4-bit precision with minimal loss in accuracy.
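
To show why stochastic rounding matters at such low precision, here is a scalar sketch that rounds a value to the FP4 (E2M1) magnitude grid with probability proportional to its distance from each neighbor, so the result is unbiased in expectation. It deliberately omits block scaling and the Random Hadamard Transform, and it should be read as a generic illustration of the technique rather than NVIDIA's recipe.

```python
import bisect
import random

# Magnitudes representable in FP4 E2M1 (sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def stochastic_round_e2m1(x, rng):
    """Round x to the E2M1 grid; the expected value equals x when |x| <= 6."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_GRID[-1])        # saturate out-of-range values
    hi_idx = bisect.bisect_left(E2M1_GRID, mag)
    hi = E2M1_GRID[hi_idx]
    if hi == mag:                            # already representable
        return sign * mag
    lo = E2M1_GRID[hi_idx - 1]
    p_up = (mag - lo) / (hi - lo)            # closer to hi means likelier to round up
    return sign * (hi if rng.random() < p_up else lo)


rng = random.Random(0)
samples = [stochastic_round_e2m1(2.4, rng) for _ in range(20000)]
print(sorted(set(samples)), sum(samples) / len(samples))
```

Round-to-nearest would map 2.4 to 2.0 every time, a systematic error that accumulates across gradient updates. Stochastic rounding trades that bias for zero-mean noise, which training tends to tolerate better.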

However, the report also acknowledges that as GPUs get faster, training is becoming host-bound: the CPU cannot launch and size kernels quickly enough to keep the GPU busy. The future of LLM systems isn't just bigger GEMMs; it's about eliminating the CPU from the critical path via the "Sync-Free" technologies pioneered here.

Conclusion

Megatron-Core MoE isn't just a library; it's a blueprint for the next generation of AI infrastructure. By treating memory, communication, and compute as a unified system, NVIDIA has provided the industry with the tools to train the next generation of "Reasoning" and "Super-Intelligence" models efficiently.
