veScale-FSDP is a redesigned Fully Sharded Data Parallel system that introduces the RaggedShard abstraction and a structure-aware planning algorithm. It enables high-performance training of LLMs on up to 10,000 GPUs while natively supporting block-wise quantization and non-element-wise optimizers (e.g., Muon, Shampoo).
Executive Summary
TL;DR: veScale-FSDP is a next-generation training framework from ByteDance that solves the "sharding misalignment" problem in FSDP. By replacing rigid row-wise sharding with a novel RaggedShard format and a structure-aware planning algorithm, it achieves a 5-66% throughput boost and 30% memory savings, and supports advanced optimizers like Muon and block-wise quantization out of the box.
Background: In the landscape of Large Language Model (LLM) training, FSDP (Fully Sharded Data Parallel) is the industry standard. However, as models move toward complex structures (MoE) and advanced numerical methods (8-bit quantization), the "one-size-fits-all" sharding of existing systems (DeepSpeed, PyTorch FSDP2) has become a bottleneck for both developer velocity and hardware efficiency.
The Core Conflict: Math vs. Memory Layout
Modern SOTA models are moving away from simple element-wise operations. Optimizers such as 8-bit Adam and Muon operate on specific 2D blocks (e.g., 32x32 or 128x128 tiles) rather than on individual elements.
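To ground the idea, here is a minimal sketch of one such block-wise operation: per-tile absmax scaling for 8-bit quantization on 32x32 tiles. This is plain PyTorch with hypothetical sizes, not veScale-FSDP code; the point is that the computation needs whole tiles, so a shard boundary that cuts a tile forces the owner to fetch the missing half.

```python
# Block-wise (non-element-wise) operation: per-tile absmax scaling for
# 8-bit quantization on 32x32 tiles. Hypothetical sizes, plain PyTorch.
import torch

w = torch.randn(128, 96)
tiles = w.unfold(0, 32, 32).unfold(1, 32, 32)    # (4, 3, 32, 32) tile view
scales = tiles.abs().amax(dim=(-1, -2))          # one absmax per 32x32 tile

# Broadcast each tile's scale back to its 32x32 region, then quantize.
s = scales.repeat_interleave(32, dim=0).repeat_interleave(32, dim=1)
q = torch.round(127 * w / s).to(torch.int8)      # values in [-127, 127]
```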
Conventional FSDP systems shard tensors based on the number of GPUs. If a tensor's dimension doesn't divide evenly by the GPU count, the system either:
- Shards mid-block: This forces extra communication to "re-assemble" the block for the optimizer.
- Pads excessively: This wastes memory and inflates collective communication volume.
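A quick back-of-the-envelope calculation (hypothetical sizes, not veScale-FSDP code) shows how stark this trade-off can be:

```python
# Hypothetical numbers illustrating the mid-block vs. padding dilemma.
BLOCK = 128                    # side of a quantization block (128x128 tiles)
rows, world_size = 8256, 96    # weight rows and GPUs in the sharding group

rows_per_rank = -(-rows // world_size)     # ceil division -> 86 rows per rank
print(rows_per_rank % BLOCK != 0)          # True: shards cut through blocks

# Alternative: pad so every rank owns a whole number of blocks.
padded = -(-rows // (BLOCK * world_size)) * BLOCK * world_size
print(f"padding: {padded - rows} rows ({100 * (padded - rows) / rows:.1f}% "
      "extra memory and communication volume)")   # 4032 rows, ~48.8%
```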
Existing systems like PyTorch FSDP2 also suffer from "interleaved copy" overhead, where tensors must be copied out of a contiguous communication buffer into fragmented per-parameter memory addresses, wasting up to 14% of training time.
Methodology: Flexibility Meets Performance
1. RaggedShard: The Flexible Foundation
veScale-FSDP introduces RaggedShard, a new placement for PyTorch DTensors. Unlike standard sharding, RaggedShard allows:
- Custom Granularity: Define a "non-shardable" atomic unit (e.g., a quantization block).
- Uneven Distribution: Different GPUs may own different numbers of atomic units, accommodating prime-sized dimensions or specific load-balancing needs.
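As a flavor of what such a placement could compute, here is a minimal sketch of a block-aligned, uneven split plan. The function name and signature are illustrative assumptions, not the actual veScale-FSDP API.

```python
# Sketch of a RaggedShard-style split plan (illustrative, not the real API):
# never split a `block`-sized atomic unit; let ranks own unequal unit counts.

def ragged_split(dim_size: int, block: int, world_size: int) -> list[int]:
    """Return per-rank row counts: block-aligned, as balanced as possible."""
    units, rem = divmod(dim_size, block)
    assert rem == 0, "dim must hold a whole number of atomic units"
    base, extra = divmod(units, world_size)
    # The first `extra` ranks own one additional unit; no unit is ever cut.
    return [(base + (r < extra)) * block for r in range(world_size)]

print(ragged_split(dim_size=8320, block=128, world_size=6))
# -> [1408, 1408, 1408, 1408, 1408, 1280]  (five ranks own 11 units, one owns 10)
```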

2. Structure-Aware Planning
To maximize network utilization, veScale-FSDP treats the arrangement of these "ragged" tensors in the communication buffer as an optimization problem. The goal is to minimize padding while ensuring that every shard boundary aligns with a block boundary. The authors formulate this as an NP-hard problem and solve it using a highly efficient polynomial-time heuristic based on Dynamic Programming (DP).
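The paper's exact formulation and heuristic are not reproduced here. To give a flavor of structure-aware planning, below is a simplified stand-in: a classic linear-partition dynamic program that assigns a sequence of block-aligned tensor sizes to contiguous per-rank shards, never splitting a tensor, while minimizing the largest shard (which determines the padded buffer size). This is an illustrative simplification, not the authors' algorithm.

```python
# Simplified stand-in for structure-aware planning (NOT the paper's
# algorithm): partition block-aligned tensor sizes into `world_size`
# contiguous shards without splitting any tensor, minimizing the largest
# shard. O(n^2 * world_size) dynamic program.

def plan_shards(sizes: list[int], world_size: int):
    n = len(sizes)
    prefix = [0]
    for s in sizes:
        prefix.append(prefix[-1] + s)
    INF = float("inf")
    # best[i][k]: minimal max-shard over the first i tensors using k shards
    best = [[INF] * (world_size + 1) for _ in range(n + 1)]
    cut = [[0] * (world_size + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(1, n + 1):
        for k in range(1, world_size + 1):
            for j in range(k - 1, i):      # last shard holds tensors j..i-1
                cand = max(best[j][k - 1], prefix[i] - prefix[j])
                if cand < best[i][k]:
                    best[i][k], cut[i][k] = cand, j
    groups, i = [], n                      # walk the cut table backwards
    for k in range(world_size, 0, -1):
        j = cut[i][k]
        groups.append(sizes[j:i])
        i = j
    groups.reverse()
    return best[n][world_size], groups

max_shard, layout = plan_shards([1408, 1408, 1280, 384, 384, 256, 256], 3)
print(max_shard, layout)   # per-shard padding = max_shard - sum(shard)
```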
3. Distributed Buffer (DBuffer)
To eliminate the "interleaved copy" tax, the system introduces DBuffer. It provides a persistent address mapping between the global collective buffer and the individual parameters. This enables zero-copy access: the data stays in the buffer for computation, and kernels are fused to reduce CUDA overhead.
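The zero-copy idea can be shown in a few lines of plain PyTorch. This is a sketch of the concept, not the actual DBuffer implementation: every parameter is a view into one persistent flat buffer, so collectives and compute touch the same memory and nothing needs copying out.

```python
# Zero-copy concept sketch (not the DBuffer implementation): parameters
# are views into one persistent flat communication buffer.
import torch

shapes = [(1024, 1024), (1024,), (4096, 1024)]       # hypothetical params
numels = [torch.Size(s).numel() for s in shapes]
flat = torch.empty(sum(numels))                      # the collective buffer

params, offset = [], 0
for shape, n in zip(shapes, numels):
    params.append(flat[offset:offset + n].view(shape))  # view shares storage
    offset += n

# A collective (all-gather / reduce-scatter) can run directly on `flat`;
# afterwards each params[i] already sees the new values -- no copy-out.
assert params[0].data_ptr() == flat.data_ptr()
```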

Experimental Validation
The system was evaluated against DeepSpeed and Megatron-FSDP on large-scale workloads, including LLaMA-3-70B and MoE models.
- Throughput: veScale-FSDP achieves significantly higher tokens/sec, particularly on MoE models where it outpaces baselines by up to 66% due to better handling of sparse expert communication.
- Memory Usage: Peak memory is reduced by up to 30%, which is critical for training trillion-parameter models on shared clusters where "Out of Memory" (OOM) errors are the primary cause of job failure.
- Scalability: The system demonstrated linear scaling up to 10,000 GPUs, maintaining over 47% Model FLOPS Utilization (MFU) even with complex optimizers like Muon.

Deep Insight: Why it Matters
The real value of veScale-FSDP isn't just the raw speed; it's the decoupling of model code from system code.
In previous high-performance setups (like Megatron-LM), researchers had to "hack" the model architecture to fit the parallelization strategy. veScale-FSDP allows a researcher to write standard PyTorch code and use advanced optimizers like Muon with just a few lines of code, while the underlying system handles the complex NP-hard layout logic and zero-copy communication.
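What that workflow could look like is sketched below. The entry-point names are assumptions for illustration, not the published veScale-FSDP API; the point is the shape of the workflow: unmodified model code, one wrapping call, a block-structured optimizer.

```python
# Hypothetical usage sketch; `vescale_fsdp.fully_shard` and `wrap_optimizer`
# are assumed names, not the published API.
import torch.nn as nn

model = nn.Transformer(d_model=1024, nhead=16)   # plain PyTorch, no hacks

# Assumed calls, shown as comments because the real API may differ:
# sharded = vescale_fsdp.fully_shard(model, atomic_block=128)
# opt = vescale_fsdp.wrap_optimizer(Muon(sharded.parameters(), lr=2e-2))
# Training then proceeds as ordinary PyTorch:
# loss = sharded(src, tgt).mean(); loss.backward(); opt.step()
```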
Critical Analysis & Future Outlook
Takeaway: veScale-FSDP proves that we don't need to choose between the flexibility of PyTorch-style FSDP and the performance of specialized Megatron-style setups.
Limitations: The current implementation relies on a heuristic for the NP-hard planning problem. While it works for standard Transformers, extremely heterogeneous architectures might still see sub-optimal padding spikes.
Future Work: As the industry moves toward mixed-precision (FP8/INT8) and post-training quantization, the ability of the training system to "understand" block structures will be the defining factor of SOTA efficiency. veScale-FSDP has set the blueprint for this transition.
