NVIDIA introduces Megatron-Core MoE, a comprehensive open-source training stack designed to scale Mixture-of-Experts (MoE) models to trillions of parameters. The framework achieves state-of-the-art performance, reaching 1,233 TFLOPS/GPU on NVIDIA GB300 systems for models like DeepSeek-V3, by integrating novel multi-dimensional parallelism and system-level fusions.
TL;DR
NVIDIA has released a technical report on Megatron-Core MoE, a production-ready stack that scales Mixture-of-Experts (MoE) models to the trillion-parameter frontier. By introducing Parallel Folding (decoupled parallelism for attention and MoE layers) and targeting the Memory, Communication, and Compute Walls, the stack reaches over 1,200 TFLOPS/GPU on the Blackwell (GB300) platform, making the training of models like DeepSeek-V3 and Qwen3 significantly more efficient.
The "Sparsity Paradox": Why MoE is a Systems Nightmare
In a dense model, parameters and compute scale in lockstep. In MoE, they diverge. A model like DeepSeek-V3 has 685B total parameters but only 37B active parameters per token.
This creates a "Sparsity Paradox": you need hundreds of GPUs just to hold the experts in memory, but because the per-token compute is so low, the communication overhead of moving tokens between those experts (All-to-All) starts to dominate execution time. NVIDIA frames the resulting bottlenecks as three walls:
- The Memory Wall: Experts must stay in memory, even when idle.
- The Communication Wall: Expert Parallelism (EP) saturates inter-node bandwidth.
- The Compute Wall: Small expert GEMMs underutilize the GPU, and host overhead creates "bubbles."
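The gap between stored and active parameters can be made concrete with a few lines of arithmetic. This is a minimal sketch that uses only the two DeepSeek-V3 figures quoted above; the 2-bytes-per-parameter (BF16) weight format and the helper names are assumptions for illustration, and optimizer state, gradients, and activations are ignored.

```python
# Figures quoted in the text for DeepSeek-V3 (in billions of parameters).
TOTAL_PARAMS_B = 685
ACTIVE_PARAMS_B = 37

# Assumption for illustration: weights stored in BF16 (2 bytes per parameter).
BYTES_PER_PARAM = 2


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that a single token actually uses."""
    return active_b / total_b


def weight_memory_gb(params_b: float, bytes_per_param: int) -> float:
    """Weight-only footprint in GB (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_b * bytes_per_param


frac = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
resident_gb = weight_memory_gb(TOTAL_PARAMS_B, BYTES_PER_PARAM)
touched_gb = weight_memory_gb(ACTIVE_PARAMS_B, BYTES_PER_PARAM)

print(f"active fraction per token: {frac:.1%}")
print(f"weights that must stay resident: {resident_gb:.0f} GB")
print(f"weights touched per token: {touched_gb:.0f} GB")
```

Roughly one parameter in eighteen does work for any given token, yet all of them occupy memory, which is the Memory Wall in miniature.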
Methodology: The Architecture of Efficiency
1. Parallel Folding: Breaking the Dense-Sparse Mismatch
The most significant architectural shift is Parallel Folding. Traditionally, frameworks forced Attention and MoE layers to share the same Tensor Parallel (TP) and Data Parallel (DP) configuration. Megatron-Core MoE decouples them.
- Attention uses high TP to shard large query/key matrices.
- MoE uses EP with TP=1 to maintain full-width expert GEMMs for peak efficiency.
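One way to picture the decoupling is as two different factorizations of the same set of GPUs. The sketch below is illustrative only: the class and field names are hypothetical, and it assumes that the pipeline-parallel degree is shared while the attention layers and the MoE layers each split the remaining GPUs in their own way. It is not the Megatron-Core configuration API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoldedLayout:
    """Hypothetical description of a folded attention/MoE layout."""

    world_size: int  # total number of GPUs
    pp: int          # pipeline-parallel degree (assumed shared by both layer types)
    attn_tp: int     # tensor parallelism used by attention layers
    attn_cp: int     # context parallelism used by attention layers
    moe_etp: int     # tensor parallelism inside each expert (1 = full-width GEMMs)
    moe_ep: int      # expert parallelism

    def _remaining(self, *degrees: int) -> int:
        """GPUs left for data parallelism after the given degrees are applied."""
        used = self.pp
        for degree in degrees:
            used *= degree
        if self.world_size % used != 0:
            raise ValueError(
                f"world_size={self.world_size} is not divisible by {used}"
            )
        return self.world_size // used

    @property
    def attn_dp(self) -> int:
        return self._remaining(self.attn_tp, self.attn_cp)

    @property
    def moe_edp(self) -> int:
        return self._remaining(self.moe_etp, self.moe_ep)


layout = FoldedLayout(
    world_size=256, pp=8, attn_tp=4, attn_cp=1, moe_etp=1, moe_ep=32
)
print("attention data-parallel degree:", layout.attn_dp)
print("MoE expert-data-parallel degree:", layout.moe_edp)
```

In this made-up 256-GPU example, attention shards each layer four ways, while the MoE layers keep every expert GEMM whole and spread experts across 32 ranks instead.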

2. Solving for Communication (DeepEP & HybridEP)
To scale MoE, NVIDIA optimized the Token Dispatcher. HybridEP (for GB200 NVL72) and DeepEP (for H100) use RDMA and hardware-accelerated kernels to maximize bandwidth. Furthermore, they implement a Merged FWD-BWD overlap strategy (similar to DualPipe), hiding All-to-All latency behind expert computation.
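To see what the dispatcher has to move, the toy router below counts how many token copies each expert-parallel rank would receive in one All-to-All. It is a plain-Python sketch under stated assumptions: random router scores, top-k routing, and experts assigned to ranks in contiguous blocks. The function names are hypothetical and do not correspond to the Megatron-Core, DeepEP, or HybridEP APIs.

```python
import random
from collections import Counter


def route_top_k(scores: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]


def dispatch_counts(
    assignments: list[list[int]], num_experts: int, ep_size: int
) -> tuple[list[int], list[int]]:
    """Count token copies per expert and per expert-parallel rank.

    Assumes experts are placed on ranks in contiguous blocks of equal size.
    """
    experts_per_rank = num_experts // ep_size
    tokens_per_expert = Counter()
    for experts in assignments:
        tokens_per_expert.update(experts)

    per_expert = [tokens_per_expert.get(e, 0) for e in range(num_experts)]
    per_rank = [0] * ep_size
    for expert, count in enumerate(per_expert):
        per_rank[expert // experts_per_rank] += count
    return per_expert, per_rank


random.seed(0)
NUM_TOKENS, NUM_EXPERTS, TOP_K, EP_SIZE = 16, 8, 2, 4

assignments = [
    route_top_k([random.random() for _ in range(NUM_EXPERTS)], TOP_K)
    for _ in range(NUM_TOKENS)
]
per_expert, per_rank = dispatch_counts(assignments, NUM_EXPERTS, EP_SIZE)

print("token copies per expert:", per_expert)
print("token copies per EP rank:", per_rank)
```

The per-rank counts change with every batch, so the All-to-All send and receive sizes are only known after routing. That is the traffic the overlap strategy tries to hide behind expert computation.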
3. Sync-Free MoE & CUDA Graphs
A defining challenge of MoE is that expert input shapes are dynamic: the host does not know how many tokens go to which expert until routing has run on the GPU. Launching correctly sized kernels therefore normally requires a GPU-to-CPU sync, which stalls the launch pipeline and kills performance. Megatron-Core introduces:
- Device-Initiated Kernels: GEMMs that read shapes directly from GPU memory.
- Paged Stashing: A memory management system that allows CUDA Graphs to handle dynamic workloads by reusing static "temp" buffers.
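The general idea of serving variable-sized requests from a fixed, preallocated buffer can be sketched with a simple page pool. This is a toy illustration, not NVIDIA's Paged Stashing implementation: the `PagePool` class, its methods, and the page size are all hypothetical, and the "pages" are just integer IDs rather than GPU memory.

```python
class PagePool:
    """Toy fixed-size page pool: capacity is set once and never grows."""

    def __init__(self, num_pages: int, page_tokens: int) -> None:
        self.page_tokens = page_tokens
        self.free = list(range(num_pages))       # page IDs available for reuse
        self.owned: dict[str, list[int]] = {}    # key -> pages currently held

    def stash(self, key: str, num_tokens: int) -> list[int]:
        """Reserve enough pages to hold num_tokens tokens under the given key."""
        needed = -(-num_tokens // self.page_tokens)  # ceiling division
        if needed > len(self.free):
            raise MemoryError(f"need {needed} pages, only {len(self.free)} free")
        pages = [self.free.pop() for _ in range(needed)]
        self.owned[key] = pages
        return pages

    def release(self, key: str) -> None:
        """Return a key's pages to the pool so later requests can reuse them."""
        self.free.extend(self.owned.pop(key))


pool = PagePool(num_pages=8, page_tokens=4)

pages_a = pool.stash("expert0", num_tokens=5)   # 5 tokens need 2 pages
pages_b = pool.stash("expert1", num_tokens=9)   # 9 tokens need 3 pages
free_after_stash = len(pool.free)

pool.release("expert0")
free_after_release = len(pool.free)

print("pages for expert0:", len(pages_a))
print("pages for expert1:", len(pages_b))
print("free pages after stash / release:", free_after_stash, free_after_release)
```

Because the pool's total size never changes, the addresses it hands out stay within a region that is known up front, even though the number of pages each expert needs varies from step to step. That property is what makes a fixed-buffer scheme compatible in principle with replaying a captured graph.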

Experiments: Dominating the Benchmarks
NVIDIA tested the stack on DeepSeek-V3-685B and Qwen3-235B.
- GB300 (Blackwell): Achieved 1,233 TFLOPS/GPU using MXFP8.
- H100 (Hopper): Achieved 368 TFLOPS/GPU using Blockwise FP8.
Even in Long-Context (128K) scenarios, the system sustained over 1,100 TFLOPS by using Dynamic Context Parallelism, which adaptively resizes the parallel degree based on batch sequence lengths.
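As a rough illustration of resizing the context-parallel degree per batch, the heuristic below picks the smallest power-of-two degree that keeps each GPU's share of the longest sequence within a token budget. The function name, the power-of-two restriction, and the budget value are assumptions made for this sketch; the report's actual selection policy is not described here.

```python
def pick_cp_degree(max_seq_len: int, tokens_per_gpu_budget: int, max_cp: int) -> int:
    """Smallest power-of-two CP degree (capped at max_cp) that fits the budget."""
    cp = 1
    while cp < max_cp and max_seq_len > cp * tokens_per_gpu_budget:
        cp *= 2
    return cp


BUDGET = 16_384  # hypothetical per-GPU token budget
MAX_CP = 8

cp_short = pick_cp_degree(4_096, BUDGET, MAX_CP)
cp_medium = pick_cp_degree(40_000, BUDGET, MAX_CP)
cp_long = pick_cp_degree(131_072, BUDGET, MAX_CP)

for seq_len, cp in [(4_096, cp_short), (40_000, cp_medium), (131_072, cp_long)]:
    print(f"longest sequence {seq_len:>7} tokens -> CP degree {cp}")
```

Short batches stay on a single GPU per sequence and avoid context-parallel communication entirely, while a 128K-token batch is split across the maximum degree.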
| Model | System | Precision | TFLOPS/GPU |
| :--- | :--- | :--- | :--- |
| DeepSeek-V3 | GB300 | MXFP8 | 1233 |
| DeepSeek-V3 | GB200 | MXFP8 | 1048 |
| Qwen3-235B | GB200 | MXFP8 | 919 |

Critical Insights: Future-Proofing MoE
The transition to NVFP4 (4-bit) on the Blackwell platform is the most forward-looking result. NVIDIA's recipe, which uses Random Hadamard Transforms and Stochastic Rounding, shows that training at 4-bit precision is now possible with minimal loss in accuracy.
However, the report also acknowledges that as GPUs get faster, training is becoming host-bound: CPU-side launch and synchronization overhead, rather than GPU math, increasingly limits throughput. The future of LLM systems isn't just bigger GEMMs; it's about eliminating the CPU from the critical path via the "Sync-Free" technologies pioneered here.
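Stochastic rounding, one ingredient of the 4-bit recipe mentioned above, is easy to demonstrate in isolation. The sketch below rounds to a uniform grid with a chosen step size, which is a simplification and not the NVFP4 number format; the function names are illustrative. It shows the property that matters for low-precision training: the rounding error averages out instead of accumulating in one direction.

```python
import math
import random


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of step, rounding up with probability equal to
    the fractional distance from the lower grid point."""
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    round_up = rng.random() < frac
    return (lower + (1 if round_up else 0)) * step


def round_nearest(x: float, step: float) -> float:
    """Deterministic round-to-nearest on the same grid, for comparison."""
    return round(x / step) * step


rng = random.Random(0)
x, step = 0.3, 1.0

samples = [stochastic_round(x, step, rng) for _ in range(20_000)]
mean_sr = sum(samples) / len(samples)
nearest = round_nearest(x, step)

print(f"true value:                 {x}")
print(f"round-to-nearest result:    {nearest}")
print(f"mean of stochastic results: {mean_sr:.3f}")
```

Round-to-nearest always maps 0.3 to 0 on this grid, so a small gradient update of that size would be lost every time. Stochastic rounding returns 1 about 30% of the time, so the update survives on average.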
Conclusion
Megatron-Core MoE isn't just a library; it's a blueprint for the next generation of AI infrastructure. By treating memory, communication, and compute as a unified system, NVIDIA has provided the industry with the tools to train the next generation of "Reasoning" and "Super-Intelligence" models efficiently.
