MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

[CVPR 2025] MPDiT: Scaling Diffusion Transformers with Global-to-Local Hierarchies

Abstract

The paper introduces MPDiT, a Multi-Patch Global-to-Local Diffusion Transformer that optimizes DiT architectures using a hierarchical design. It processes large patches in early blocks to capture global context and small patches in later blocks for local refinement, achieving SOTA results with significantly reduced computation.

Executive Summary

TL;DR: MPDiT (Multi-Patch Diffusion Transformer) breaks the isotropic tradition of DiT architectures by introducing a hierarchical global-to-local design. By processing coarse global structures with large patches and refining details with smaller patches, the model reduces GFLOPs by up to 50% and accelerates convergence by over 11x, achieving superior FID scores on ImageNet-256 and 512.

Background: While Transformers have largely superseded UNets in generative tasks, the quadratic cost of self-attention remains a bottleneck for high-resolution synthesis. MPDiT is a structural modification, targeting SOTA quality, that rethinks how spatial information should flow through a Diffusion Transformer.

Problem & Motivation: The Isotropic Tax

Existing Diffusion Transformers (e.g., DiT, SiT) are isotropic: every block, from the first to the last, operates on the same token grid at the same resolution. This is computationally wasteful because:

Early layers primarily focus on global layout and coarse structures, which don't require high-resolution token grids.
Late layers refine textures and local edges, which do require density.
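To make the cost argument concrete, here is a minimal sketch of how token count and pairwise attention interactions scale with patch size. The 32x32 latent grid and the helper names are illustrative assumptions rather than values taken from the authors' code, and the count ignores the MLP and projection terms that also contribute to GFLOPs.

```python
def num_tokens(grid_size: int, patch_size: int) -> int:
    """Tokens produced by non-overlapping square patches on a square grid."""
    if grid_size % patch_size != 0:
        raise ValueError("grid_size must be divisible by patch_size")
    side = grid_size // patch_size
    return side * side


def attention_pairs(tokens: int) -> int:
    """Query-key pairs scored by one full self-attention layer."""
    return tokens * tokens


GRID = 32  # assumed latent side length (hypothetical, for illustration)

for p in (2, 4):
    n = num_tokens(GRID, p)
    print(f"patch={p}: tokens={n}, attention pairs={attention_pairs(n)}")
```

Under these assumptions, doubling the patch size cuts the token count by 4x and the number of attention pairs per layer by 16x.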

Furthermore, the authors identified that standard time embeddings (linear MLPs) and class embeddings (single tokens) are too "shallow" to capture the complex ODE/SDE dynamics of flow matching.

Methodology: The Global-to-Local Architecture

The core innovation is the Multi-Patch Transformer pipeline. Instead of fixed tokenization, the model employs a two-stage strategy:

Global Stage (N-k blocks): Uses a large patch size (p=4), which yields only 64 tokens for a 32x32 latent grid. Because self-attention cost grows with the square of the token count, this sharply reduces the attention cost of these blocks.
Upsample & Refine (k blocks): An "Upsample Block" expands the tokens to a finer resolution (p=2, 256 tokens). A skip connection from the original latent helps preserve fine-grained detail. Only the last few blocks (k=6) operate at this higher token count.
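The token flow through the two stages can be traced with a small shape-level sketch. This is not the authors' implementation: the 4-channel 32x32 latent, the hidden width, the random projection matrices, and the nearest-neighbour upsampling are all assumptions standing in for learned layers, and the transformer blocks themselves are omitted.

```python
import numpy as np


def patchify(latent: np.ndarray, p: int) -> np.ndarray:
    """Split a (C, H, W) latent into (H/p * W/p, C*p*p) patch tokens."""
    c, h, w = latent.shape
    x = latent.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // p) * (w // p), c * p * p)


def upsample_tokens(tokens: np.ndarray, side: int, factor: int) -> np.ndarray:
    """Nearest-neighbour upsample of a (side*side, D) token grid."""
    d = tokens.shape[1]
    grid = tokens.reshape(side, side, d)
    grid = grid.repeat(factor, axis=0).repeat(factor, axis=1)
    return grid.reshape(side * factor * side * factor, d)


rng = np.random.default_rng(0)
C, H, D = 4, 32, 8  # assumed latent channels, latent side, hidden width
latent = rng.standard_normal((C, H, H))

# Global stage: coarse tokens (p=4), projected to width D.
coarse = patchify(latent, 4) @ rng.standard_normal((C * 16, D))
# ... the N-k global blocks would run here on 64 tokens ...

# Upsample to the fine grid and add a skip path from the original latent (p=2).
fine = upsample_tokens(coarse, side=H // 4, factor=2)
skip = patchify(latent, 2) @ rng.standard_normal((C * 4, D))
refined_input = fine + skip
# ... the last k blocks would run here on 256 tokens ...

print(coarse.shape, refined_input.shape)
```

The printed shapes show the sequence growing from 64 to 256 tokens at the upsampling step, so only the blocks after that point pay the full-resolution attention cost.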

[Figure: Model Architecture]

Advanced Conditioning

FNO Time Embedding: Inspired by Neural Operators, the authors use a 1D spectral convolution (MixedFNO) to learn smooth transitions across timesteps, providing a more continuous representation of the flow field.
Multi-token Class Embedding: Prepending multiple learnable tokens (m=16) instead of one allows the model to capture richer semantic relationships between labels and spatial features.
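For readers unfamiliar with FNO layers, the following is a minimal sketch of a generic FNO-style 1D spectral convolution: transform to the Fourier domain, mix channels on the lowest few modes with complex weights, and transform back. It is not the paper's MixedFNO; the function name, the sizes, and the random weights are hypothetical, and how MPDiT builds the input signal from the timestep is not stated in this summary.

```python
import numpy as np


def spectral_conv1d(x: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """FNO-style 1D spectral convolution.

    x:       real array of shape (c_in, length)
    weights: complex array of shape (c_in, c_out, modes); only the lowest
             `modes` Fourier modes are kept and mixed across channels.
    """
    c_in, length = x.shape
    _, c_out, modes = weights.shape
    x_hat = np.fft.rfft(x, axis=-1)  # (c_in, length // 2 + 1)
    out_hat = np.zeros((c_out, x_hat.shape[-1]), dtype=complex)
    out_hat[:, :modes] = np.einsum("im,iom->om", x_hat[:, :modes], weights)
    return np.fft.irfft(out_hat, n=length, axis=-1)


rng = np.random.default_rng(0)
c_in, c_out, length, modes = 2, 3, 16, 4  # hypothetical sizes
features = rng.standard_normal((c_in, length))
w = (rng.standard_normal((c_in, c_out, modes))
     + 1j * rng.standard_normal((c_in, c_out, modes)))
print(spectral_conv1d(features, w).shape)
```

Because the higher modes are zeroed, the output is a smooth, band-limited function of position along the signal, which is the property the summary attributes to the FNO time embedding.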

Experiments & Results: 11x Faster Convergence

MPDiT was benchmarked extensively on ImageNet. The results are striking because they improve both quality and speed simultaneously.

ImageNet-256: MPDiT-XL achieves a cfg-FID of 2.05 in just 240 epochs. For comparison, SiT needs 1,400 epochs to reach similar quality.
ImageNet-512: The efficiency gains are even more pronounced. MPDiT-XL reaches an FID of 2.47 using only ~43.5% of the GFLOPs of DiT-XL/2.
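The two headline figures above can be restated as ratios with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones quoted in this summary.

```python
mpdit_epochs = 240       # MPDiT-XL epochs on ImageNet-256 (from the text)
sit_epochs = 1400        # SiT epochs for similar quality (from the text)
gflops_fraction = 0.435  # MPDiT-XL GFLOPs relative to DiT-XL/2 on ImageNet-512

epoch_ratio = sit_epochs / mpdit_epochs
gflops_saving = 1.0 - gflops_fraction

print(f"epoch ratio: {epoch_ratio:.2f}x")
print(f"GFLOPs saving: {gflops_saving:.1%}")
```

This epoch ratio compares only the two ImageNet-256 runs quoted here. It is a different quantity from the headline convergence speedup, whose baseline this summary does not state.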

[Table: SOTA Comparison]

Ablation Insights

The ablation study shows that the multi-patch strategy (which reduces GFLOPs) and the FNO/multi-token conditioning (which improves FID) are complementary. Switching to multiple class tokens alone lowered FID by roughly 7 points, and the FNO time embedding lowered it by roughly another 4 points.

Critical Analysis & Conclusion

Takeaway: MPDiT proves that we don't need "all tokens all the time." By aligning the architectural resolution with the natural generative process (coarse-to-fine), we can slash training budgets without sacrificing SOTA performance.

Limitations: While the efficiency gains are clear on ImageNet, application to large-scale text-to-image and text-to-video models (such as Flux or Sora) is mentioned as future work. These "ultra-long" sequence tasks will test whether the Upsample Block can maintain coherence.

Future Outlook: We expect to see this hierarchical approach become the standard for high-resolution latent models (1K and beyond), where full-token attention is otherwise prohibitive.
