veScale-FSDP is a redesigned Fully Sharded Data Parallel system that introduces the RaggedShard abstraction and a structure-aware planning algorithm. It enables high-performance training of LLMs on up to 10,000 GPUs while natively supporting block-wise quantization and non-element-wise optimizers (e.g., Muon, Shampoo).
Executive Summary
TL;DR: veScale-FSDP is a next-generation training framework from ByteDance that solves the "sharding misalignment" problem in FSDP. By replacing rigid row-wise sharding with a novel RaggedShard format and a structure-aware planning algorithm, it achieves a 5-66% throughput boost and 30% memory savings, and supports advanced optimizers like Muon and block-wise quantization out of the box.
Background: In the landscape of Large Language Model (LLM) training, FSDP (Fully Sharded Data Parallel) is the industry standard. However, as models move toward complex structures (MoE) and advanced numerical methods (8-bit quantization), the "one-size-fits-all" sharding of existing systems (DeepSpeed, PyTorch FSDP2) has become a bottleneck for both developer velocity and hardware efficiency.
The Core Conflict: Math vs. Memory Layout
Modern SOTA models are moving away from simple element-wise operations. Optimizers such as 8-bit Adam and Muon operate on specific 2D blocks (e.g., 32x32 or 128x128 tiles) rather than on individual elements.
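To ground the idea, here is a minimal sketch of one such block-wise operation: per-tile absmax scaling for 8-bit quantization on 32x32 tiles. This is plain PyTorch with hypothetical sizes, not veScale-FSDP code; the point is that the computation needs whole tiles, so a shard boundary that cuts a tile forces the owner to fetch the missing half.

```python
# Block-wise (non-element-wise) operation: per-tile absmax scaling for
# 8-bit quantization on 32x32 tiles. Hypothetical sizes, plain PyTorch.
import torch

w = torch.randn(128, 96)
tiles = w.unfold(0, 32, 32).unfold(1, 32, 32)    # (4, 3, 32, 32) tile view
scales = tiles.abs().amax(dim=(-1, -2))          # one absmax per 32x32 tile

# Broadcast each tile's scale back to its 32x32 region, then quantize.
s = scales.repeat_interleave(32, dim=0).repeat_interleave(32, dim=1)
q = torch.round(127 * w / s).to(torch.int8)      # values in [-127, 127]
```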
Conventional FSDP systems shard tensors based on the number of GPUs. If a tensor's dimension doesn't divide evenly by the GPU count, the system either:
- Shards mid-block: This forces extra communication to "re-assemble" the block for the optimizer.
- Pads excessively: This wastes memory and inflates collective communication volume.
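A quick back-of-the-envelope calculation (hypothetical sizes, not veScale-FSDP code) shows how stark this trade-off can be:

```python
# Hypothetical numbers illustrating the mid-block vs. padding dilemma.
BLOCK = 128                    # side of a quantization block (128x128 tiles)
rows, world_size = 8256, 96    # weight rows and GPUs in the sharding group

rows_per_rank = -(-rows // world_size)     # ceil division -> 86 rows per rank
print(rows_per_rank % BLOCK != 0)          # True: shards cut through blocks

# Alternative: pad so every rank owns a whole number of blocks.
padded = -(-rows // (BLOCK * world_size)) * BLOCK * world_size
print(f"padding: {padded - rows} rows ({100 * (padded - rows) / rows:.1f}% "
      "extra memory and communication volume)")   # 4032 rows, ~48.8%
```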
Existing systems like PyTorch FSDP2 also suffer from "interleaved copy" overhead, where tensors must be copied out of a contiguous communication buffer into fragmented per-parameter memory addresses, wasting up to 14% of training time.
Methodology: Flexibility Meets Performance
1. RaggedShard: The Flexible Foundation
veScale-FSDP introduces RaggedShard, a new placement for PyTorch DTensors. Unlike standard sharding, RaggedShard allows:
- Custom Granularity: Define a "non-shardable" atomic unit (e.g., a quantization block).
- Uneven Distribution: Different GPUs may own different numbers of atomic units, accommodating prime-sized dimensions or specific load-balancing needs.
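As a flavor of what such a placement could compute, here is a minimal sketch of a block-aligned, uneven split plan. The function name and signature are illustrative assumptions, not the actual veScale-FSDP API.

```python
# Sketch of a RaggedShard-style split plan (illustrative, not the real API):
# never split a `block`-sized atomic unit; let ranks own unequal unit counts.

def ragged_split(dim_size: int, block: int, world_size: int) -> list[int]:
    """Return per-rank row counts: block-aligned, as balanced as possible."""
    units, rem = divmod(dim_size, block)
    assert rem == 0, "dim must hold a whole number of atomic units"
    base, extra = divmod(units, world_size)
    # The first `extra` ranks own one additional unit; no unit is ever cut.
    return [(base + (r < extra)) * block for r in range(world_size)]

print(ragged_split(dim_size=8320, block=128, world_size=6))
# -> [1408, 1408, 1408, 1408, 1408, 1280]  (five ranks own 11 units, one owns 10)
```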

2. Structure-Aware Planning
To maximize network utilization, veScale-FSDP treats the arrangement of these "ragged" tensors in the communication buffer as an optimization problem. The goal is to minimize padding while ensuring that every shard boundary aligns with a block boundary. The authors formulate this as an NP-hard problem and solve it using a highly efficient polynomial-time heuristic based on Dynamic Programming (DP).
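The paper's exact formulation and heuristic are not reproduced here. To give a flavor of structure-aware planning, below is a simplified stand-in: a classic linear-partition dynamic program that assigns a sequence of block-aligned tensor sizes to contiguous per-rank shards, never splitting a tensor, while minimizing the largest shard (which determines the padded buffer size). This is an illustrative simplification, not the authors' algorithm.

```python
# Simplified stand-in for structure-aware planning (NOT the paper's
# algorithm): partition block-aligned tensor sizes into `world_size`
# contiguous shards without splitting any tensor, minimizing the largest
# shard. O(n^2 * world_size) dynamic program.

def plan_shards(sizes: list[int], world_size: int):
    n = len(sizes)
    prefix = [0]
    for s in sizes:
        prefix.append(prefix[-1] + s)
    INF = float("inf")
    # best[i][k]: minimal max-shard over the first i tensors using k shards
    best = [[INF] * (world_size + 1) for _ in range(n + 1)]
    cut = [[0] * (world_size + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(1, n + 1):
        for k in range(1, world_size + 1):
            for j in range(k - 1, i):      # last shard holds tensors j..i-1
                cand = max(best[j][k - 1], prefix[i] - prefix[j])
                if cand < best[i][k]:
                    best[i][k], cut[i][k] = cand, j
    groups, i = [], n                      # walk the cut table backwards
    for k in range(world_size, 0, -1):
        j = cut[i][k]
        groups.append(sizes[j:i])
        i = j
    groups.reverse()
    return best[n][world_size], groups

max_shard, layout = plan_shards([1408, 1408, 1280, 384, 384, 256, 256], 3)
print(max_shard, layout)   # per-shard padding = max_shard - sum(shard)
```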
3. Distributed Buffer (DBuffer)
To eliminate the "interleaved copy" tax, the system introduces DBuffer. It provides a persistent address mapping between the global collective buffer and the individual parameters. This enables zero-copy access: the data stays in the buffer for computation, and kernels are fused to reduce CUDA overhead.
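The zero-copy idea can be shown in a few lines of plain PyTorch. This is a sketch of the concept, not the actual DBuffer implementation: every parameter is a view into one persistent flat buffer, so collectives and compute touch the same memory and nothing needs copying out.

```python
# Zero-copy concept sketch (not the DBuffer implementation): parameters
# are views into one persistent flat communication buffer.
import torch

shapes = [(1024, 1024), (1024,), (4096, 1024)]       # hypothetical params
numels = [torch.Size(s).numel() for s in shapes]
flat = torch.empty(sum(numels))                      # the collective buffer

params, offset = [], 0
for shape, n in zip(shapes, numels):
    params.append(flat[offset:offset + n].view(shape))  # view shares storage
    offset += n

# A collective (all-gather / reduce-scatter) can run directly on `flat`;
# afterwards each params[i] already sees the new values -- no copy-out.
assert params[0].data_ptr() == flat.data_ptr()
```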

Experimental Validation
The system was evaluated against DeepSpeed and Megatron-FSDP on large-scale workloads, including LLaMA-3-70B and MoE models.
- Throughput: veScale-FSDP achieves significantly higher tokens/sec, particularly on MoE models where it outpaces baselines by up to 66% due to better handling of sparse expert communication.
- Memory Usage: Peak memory is reduced by up to 30%, which is critical for training trillion-parameter models on shared clusters where "Out of Memory" (OOM) errors are the primary cause of job failure.
- Scalability: The system demonstrated linear scaling up to 10,000 GPUs, maintaining over 47% Model FLOPS Utilization (MFU) even with complex optimizers like Muon.

Deep Insight: Why it Matters
The real value of veScale-FSDP isn't just the raw speed; it's the decoupling of model code from system code.
In previous high-performance setups (like Megatron-LM), researchers had to "hack" the model architecture to fit the parallelization strategy. veScale-FSDP allows a researcher to write standard PyTorch code and use advanced optimizers like Muon with just a few lines of code, while the underlying system handles the complex NP-hard layout logic and zero-copy communication.
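What that workflow could look like is sketched below. The entry-point names are assumptions for illustration, not the published veScale-FSDP API; the point is the shape of the workflow: unmodified model code, one wrapping call, a block-structured optimizer.

```python
# Hypothetical usage sketch; `vescale_fsdp.fully_shard` and `wrap_optimizer`
# are assumed names, not the published API.
import torch.nn as nn

model = nn.Transformer(d_model=1024, nhead=16)   # plain PyTorch, no hacks

# Assumed calls, shown as comments because the real API may differ:
# sharded = vescale_fsdp.fully_shard(model, atomic_block=128)
# opt = vescale_fsdp.wrap_optimizer(Muon(sharded.parameters(), lr=2e-2))
# Training then proceeds as ordinary PyTorch:
# loss = sharded(src, tgt).mean(); loss.backward(); opt.step()
```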
Critical Analysis & Future Outlook
Takeaway: veScale-FSDP proves that we don't need to choose between the flexibility of PyTorch-style FSDP and the performance of specialized Megatron-style setups.
Limitations: The current implementation relies on a heuristic for the NP-hard planning problem. While it works for standard Transformers, extremely heterogeneous architectures might still see sub-optimal padding spikes.
Future Work: As the industry moves toward mixed-precision (FP8/INT8) and post-training quantization, the ability of the training system to "understand" block structures will be the defining factor of SOTA efficiency. veScale-FSDP has set the blueprint for this transition.
