Omni-Diffusion is presented by its authors as the first any-to-any multimodal foundation model built entirely on a mask-based discrete diffusion backbone. It unifies text, image, and speech processing by modeling their joint distribution over discrete tokens, and reports state-of-the-art results on both understanding and generation tasks.
TL;DR
Omni-Diffusion is a foundation model that replaces the traditional autoregressive (AR) paradigm with a mask-based discrete diffusion architecture. It treats text, images, and speech as "first-class citizens" in a single discrete token space, enabling true any-to-any capability (e.g., speech-to-image generation or spoken VQA) with parallel decoding and tighter cross-modal alignment than adapter-based AR pipelines.
The Motivation: Why Move Beyond Autoregression?
For years, the Multimodal Large Language Model (MLLM) recipe has been: LLM Backbone + Adapters + External Generators. While successful, this approach has two fatal flaws:
- Inefficiency: Generating long token sequences (like audio or images) via AR is painfully slow because every token must wait for the previous one.
- Semantic Disconnect: Using an LLM to "steer" an external diffusion model (like Stable Diffusion) means the LLM doesn't intrinsically understand the pixels being generated; it only predicts hidden states that condition the external generator.
Omni-Diffusion asks: What if the backbone itself were a diffusion model? By using a mask-based diffusion process, the model can predict many tokens in parallel and learn a truly joint distribution across all modalities.
Methodology: The Unified Token Universe
The core of Omni-Diffusion is its ability to map different physical signals into a unified discrete vocabulary:
- Images: Tokenized via MAGViT-v2 into 8,192 discrete codes.
- Speech: Quantized via GLM-4-Voice into 16,384 discrete tokens.
- Text: Standard BPE tokens.
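One common way to build such a shared vocabulary is to give each modality its own contiguous ID range. The sketch below illustrates that idea only; it is not taken from the paper. The image and speech codebook sizes come from the list above, while `TEXT_VOCAB_SIZE`, the ordering of the ranges, and the placement of the `[MASK]` ID are assumptions made for illustration.

```python
# Codebook sizes for image and speech are quoted in the text above.
# TEXT_VOCAB_SIZE, the range ordering, and MASK_ID are hypothetical.
TEXT_VOCAB_SIZE = 32_000        # assumption: placeholder BPE vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192     # MAGViT-v2 codes
SPEECH_CODEBOOK_SIZE = 16_384   # GLM-4-Voice tokens

SIZES = {
    "text": TEXT_VOCAB_SIZE,
    "image": IMAGE_CODEBOOK_SIZE,
    "speech": SPEECH_CODEBOOK_SIZE,
}

# Each modality gets a contiguous slice of the unified ID space.
OFFSETS = {}
_next_free = 0
for _name, _size in SIZES.items():
    OFFSETS[_name] = _next_free
    _next_free += _size

MASK_ID = _next_free                # one extra ID reserved for [MASK]
UNIFIED_VOCAB_SIZE = MASK_ID + 1


def to_unified(modality: str, local_id: int) -> int:
    """Map a modality-local token ID into the shared vocabulary."""
    if not 0 <= local_id < SIZES[modality]:
        raise ValueError(f"{local_id} is out of range for {modality}")
    return OFFSETS[modality] + local_id


def from_unified(unified_id: int) -> tuple[str, int]:
    """Recover (modality, local ID) from a shared-vocabulary ID."""
    for modality, offset in OFFSETS.items():
        if offset <= unified_id < offset + SIZES[modality]:
            return modality, unified_id - offset
    raise ValueError(f"{unified_id} does not belong to any modality")
```

With a layout like this, a single output head over `UNIFIED_VOCAB_SIZE` entries can emit text, image, or speech tokens, and `from_unified` tells the decoder which detokenizer to route each token to.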

The Diffusion "Secret Sauce"
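Unlike continuous diffusion, which corrupts data by adding noise, mask-based discrete diffusion corrupts a sequence by replacing tokens with a special [MASK] token. During training, the model learns to "fill in the blanks" by predicting the original tokens at the masked positions.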
- Attenuated Tail-Pad Masking: A clever training trick that prevents the model from obsessing over "padding" tokens, allowing it to generate responses of variable lengths naturally.
- Position Penalty: During image generation, the model often creates repetitive patterns. The authors introduced a penalty to discourage simultaneous decoding of the start and end of a sequence, forcing a more coherent global structure.
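To make the masking step concrete, here is a toy version of the corruption process used to build training inputs. The basic idea (mask each token with some probability, then train the model to recover it) follows the description above. The `pad_scale` parameter is a guess at what "attenuated" could mean for tail padding: pad tokens are masked with a scaled-down probability. The summary does not spell out the paper's actual rule, so treat that part as an assumption.

```python
import random

MASK, PAD = "[MASK]", "[PAD]"


def corrupt(tokens, mask_ratio, pad_scale=0.25, rng=None):
    """Replace tokens with [MASK] at random.

    mask_ratio: probability of masking an ordinary token.
    pad_scale:  hypothetical attenuation factor; padding tokens are masked
                with probability mask_ratio * pad_scale instead.
    Returns the corrupted sequence and the indices that were masked
    (the positions a training loss would be computed on).
    """
    rng = rng or random.Random()
    corrupted, masked_positions = [], []
    for i, tok in enumerate(tokens):
        p = mask_ratio * pad_scale if tok == PAD else mask_ratio
        if rng.random() < p:
            corrupted.append(MASK)
            masked_positions.append(i)
        else:
            corrupted.append(tok)
    return corrupted, masked_positions


# Example: sample a fresh mask ratio per sequence, as is typical for
# masked diffusion training, then corrupt a padded response.
rng = random.Random(0)
sequence = ["a", "cat", "on", "a", "mat", PAD, PAD, PAD]
noisy, targets = corrupt(sequence, mask_ratio=rng.random(), rng=rng)
```

Setting `pad_scale=1.0` recovers plain uniform masking, which makes it easy to compare the two behaviors on the same batch.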
Training Pipeline: A Three-Stage Evolution
Scaling a diffusion model to handle three modalities isn't easy. The team used a progressive approach:
- Stage 1, Visual-Language Pre-Alignment: Getting the model to understand the relationship between pixels and words, using image captioning and text-to-image (T2I) generation.
- Stage 2, Joint Alignment: Bringing speech into the fold with automatic speech recognition (ASR) and text-to-speech (TTS) tasks.
- Stage 3, SDVI Enhancement: Using a custom Speech-Driven Visual Interaction (SDVI) dataset to enable complex tasks like "Look at this image and tell me (via speech) what is happening."

Experimental Results: Faster and Better
The results are striking. In speech tasks, Omni-Diffusion achieved a word error rate (WER, lower is better) of 3.07 on LibriTTS, well below the 8.50 of the AR-based AnyGPT.
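For readers unfamiliar with the metric: WER (word error rate) is the word-level edit distance between a reference transcript and a hypothesis (substitutions + deletions + insertions), divided by the number of reference words. The function below is a generic textbook implementation, not the paper's evaluation script, and the example sentences are made up. Scores like 3.07 are presumably reported as percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words: WER = 1/6, about 16.7%.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```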
Efficiency Gains
One of the most impressive "party tricks" of Omni-Diffusion is its inference speed. Because it decodes many tokens in parallel at each step, the number of diffusion steps can be cut sharply without destroying quality. It can generate high-quality images in just 10 steps, whereas an AR model needs one sequential forward pass per generated token, which can mean hundreds of passes for a single image.
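The step-count argument can be seen in a toy decoding loop. This is a sketch of generic confidence-based parallel unmasking, not the paper's sampler: `toy_model` is a random stand-in for the network, the 256-token sequence length is an arbitrary choice, and the "reveal an equal share per step" schedule is an assumption. The point is only that the number of forward passes equals the number of steps, not the number of tokens.

```python
import math
import random

MASK = -1  # sentinel for a still-masked position


def toy_model(tokens, rng):
    """Stand-in for the network: one (token, confidence) guess per position."""
    return [(rng.randrange(100), rng.random()) for _ in tokens]


def parallel_decode(length, steps, seed=0):
    """Start fully masked; commit the most confident guesses at each step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    forward_passes = 0
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = toy_model(tokens, rng)  # a single pass scores every position
        forward_passes += 1
        # Reveal an equal share of the remaining masks at each step,
        # so the final step clears whatever is left.
        k = math.ceil(len(masked) / (steps - step))
        most_confident = sorted(
            masked, key=lambda i: guesses[i][1], reverse=True
        )[:k]
        for i in most_confident:
            tokens[i] = guesses[i][0]
    return tokens, forward_passes


tokens, passes = parallel_decode(length=256, steps=10)
```

Here `passes` is 10 for a 256-token sequence, while a token-by-token AR decoder would need 256 sequential passes for the same length.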

Critical Insight & Conclusion
Omni-Diffusion isn't just "another multimodal model." It is a proof of concept for a non-autoregressive future. By modeling the joint distribution of tokens directly, it achieves a level of cross-modal fluidity (like Speech-to-Image) that is difficult for modular systems to replicate.
Limitations: While powerful, the model still relies on a frozen vocabulary. Future work might explore dynamic tokenization or hybrid AR-Diffusion architectures to capture the best of both worlds: the long-range reasoning of AR and the parallel generation of Diffusion.
Takeaway
For the CV and NLP community, this paper is a signal: the era of "Everything is a Sequence" might be evolving into "Everything is a Diffusion Process."
Reference: Li et al., "Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion", 2025.
