Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Omni-Diffusion: Breaking the Autoregressive Monopoly in Multimodal AI

Abstract

Omni-Diffusion is the first any-to-any multimodal foundation model built entirely on a mask-based discrete diffusion backbone. It unifies text, image, and speech processing by modeling their joint distribution over discrete tokens, achieving state-of-the-art results in both understanding and generation tasks.

TL;DR

Omni-Diffusion is a breakthrough foundation model that abandons the traditional Autoregressive (AR) paradigm in favor of a Mask-based Discrete Diffusion architecture. It treats text, images, and speech as "first-class citizens" in a single discrete token space, enabling true any-to-any capability (e.g., speech-to-image or spoken VQA) with superior inference efficiency and semantic alignment.

The Motivation: Why Move Beyond Autoregression?

For years, the Multimodal Large Language Model (MLLM) recipe has been: LLM Backbone + Adapters + External Generators. While successful, this approach has two fatal flaws:

Inefficiency: Generating high-resolution data (like audio or images) via AR is painfully slow because every token must wait for the previous one.
Semantic Disconnect: Using an LLM to "steer" an external diffusion model (like Stable Diffusion) means the model doesn't intrinsically understand the pixels it generates; it just predicts the hidden states to trigger them.

Omni-Diffusion asks: What if the backbone itself were a diffusion model? By using a mask-based diffusion process, the model can predict many tokens in parallel and learn a truly joint distribution across all modalities.

Methodology: The Unified Token Universe

The core of Omni-Diffusion is its ability to map different physical signals into a unified discrete vocabulary:

Images: Tokenized via MAGViT-v2 into 8,192 discrete codes.
Speech: Quantized via GLM-4-Voice into 16,384 discrete tokens.
Text: Standard BPE tokens.
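
To make the "unified vocabulary" idea concrete, here is a minimal sketch of one way three codebooks could share a single id space: each modality gets its own contiguous range of ids. Only the image and speech codebook sizes come from the text above; the text vocabulary size, the ordering of the ranges, and the helper names `to_unified` and `from_unified` are assumptions made for illustration, not the authors' layout.

```python
# Sketch of a shared token id space. Only IMAGE_CODES and SPEECH_CODES
# come from the article; everything else here is an assumption.
TEXT_VOCAB_SIZE = 32_000   # hypothetical placeholder for the BPE vocabulary
IMAGE_CODES = 8_192        # MAGViT-v2 codebook size stated in the article
SPEECH_CODES = 16_384      # GLM-4-Voice codebook size stated in the article

SIZES = {"text": TEXT_VOCAB_SIZE, "image": IMAGE_CODES, "speech": SPEECH_CODES}

# Each modality occupies a contiguous, non-overlapping range of ids.
OFFSETS = {
    "text": 0,
    "image": TEXT_VOCAB_SIZE,
    "speech": TEXT_VOCAB_SIZE + IMAGE_CODES,
}

# One extra id after all ranges is reserved for the [MASK] token.
MASK_ID = TEXT_VOCAB_SIZE + IMAGE_CODES + SPEECH_CODES


def to_unified(modality: str, local_id: int) -> int:
    """Map a tokenizer-local id to its id in the shared vocabulary."""
    if not 0 <= local_id < SIZES[modality]:
        raise ValueError(f"{local_id} is out of range for {modality}")
    return OFFSETS[modality] + local_id


def from_unified(unified_id: int):
    """Recover (modality, local_id) from a shared-vocabulary id."""
    for modality, offset in OFFSETS.items():
        if offset <= unified_id < offset + SIZES[modality]:
            return modality, unified_id - offset
    raise ValueError(f"{unified_id} does not belong to any modality range")


print(to_unified("image", 0))
print(from_unified(to_unified("speech", 5)))
```

With a layout like this, a single embedding table and a single output head can cover all three modalities, which is what lets one backbone model their joint distribution.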

Figure: Omni-Diffusion Architecture

The Diffusion "Secret Sauce"

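Unlike continuous diffusion, which corrupts data by adding noise, discrete diffusion corrupts a token sequence by replacing tokens with a special [MASK] token. During training, the model learns to "fill in the blanks" by predicting the original tokens at the masked positions.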
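
The masking step can be sketched in a few lines. This is a generic illustration of mask-based corruption, not the paper's training code: the function name `mask_tokens`, the per-example uniform mask ratio, and the toy sentence are all assumptions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_ratio, rng):
    """Replace each token with [MASK] independently with probability mask_ratio.

    Returns the corrupted sequence and the indices that were masked; in a
    typical masked-diffusion setup the loss is computed only at those indices.
    """
    corrupted, masked_positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            corrupted.append(MASK)
            masked_positions.append(i)
        else:
            corrupted.append(tok)
    return corrupted, masked_positions


rng = random.Random(0)
tokens = ["a", "cat", "sits", "on", "the", "mat"]

# Assumption: one mask ratio is drawn per training example, so the model
# sees everything from lightly to almost fully masked sequences.
ratio = rng.random()
corrupted, masked_positions = mask_tokens(tokens, ratio, rng)
print(corrupted, masked_positions)
```

The model's training target is the original token at each position in `masked_positions`; unmasked positions serve only as context.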

Attenuated Tail-Pad Masking: A clever training trick that prevents the model from obsessing over "padding" tokens, allowing it to generate responses of variable lengths naturally.
Position Penalty: During image generation, the model often creates repetitive patterns. The authors introduced a penalty to discourage simultaneous decoding of the start and end of a sequence, forcing a more coherent global structure.
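
The article does not spell out the exact formulation of Attenuated Tail-Pad Masking. One plausible reading, sketched below, is that padding tokens at the end of a sequence are masked less often than real tokens, so they contribute less to the training signal. The function name `mask_probabilities`, the `pad_attenuation` factor, and the `<pad>` token string are hypothetical.

```python
def mask_probabilities(tokens, pad_token, mask_ratio, pad_attenuation):
    """Per-position mask probability with the trailing pad run scaled down.

    Assumption: real tokens are masked with probability mask_ratio, while
    tokens in the trailing run of padding use mask_ratio * pad_attenuation.
    """
    # Walk back from the end to find where the trailing padding starts.
    tail_start = len(tokens)
    while tail_start > 0 and tokens[tail_start - 1] == pad_token:
        tail_start -= 1

    return [
        mask_ratio if i < tail_start else mask_ratio * pad_attenuation
        for i in range(len(tokens))
    ]


probs = mask_probabilities(
    ["hi", "there", "<pad>", "<pad>"],
    pad_token="<pad>",
    mask_ratio=0.5,
    pad_attenuation=0.5,
)
print(probs)
```

Under this reading, the model still learns where a response ends, because some pad tokens are masked and predicted, but long runs of padding no longer dominate the loss.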

Training Pipeline: A Three-Stage Evolution

Scaling a diffusion model to handle three modalities isn't easy. The team used a progressive approach:

Visual-Language Pre-Alignment: Getting the model to understand the relationship between pixels and words (Captioning & T2I).
Joint Alignment: Bringing speech into the fold (ASR & TTS).
SDVI Enhancement: Using a custom Speech-Driven Visual Interaction (SDVI) dataset to enable complex tasks like "Look at this image and tell me (via speech) what is happening."

Figure: Three-Stage Training Pipeline

Experimental Results: Faster and Better

The results are striking. In speech tasks, Omni-Diffusion achieved a word error rate (WER, lower is better) of 3.07 on LibriTTS, well below the 8.50 of the AR-based AnyGPT.
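
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words. The sketch below shows the standard calculation; it is not the paper's evaluation script, and the example sentences are made up. The figures quoted above are presumably percentages, while this function returns a fraction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```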

Efficiency Gains

One of the most impressive "party tricks" of Omni-Diffusion is its inference speed. Because it can decode tokens in parallel, reducing the number of diffusion steps doesn't destroy quality. It can generate high-quality images in just 10 steps, whereas some AR models would require hundreds of sequential forward passes.
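
The step-count argument can be illustrated with a toy decoder. The sketch below starts from a fully masked sequence and, at each step, commits the most confident predictions, a common decoding strategy for masked models. The `toy_model` is a random stand-in for the real denoiser, and the 256-token sequence length and equal-share unmasking schedule are assumptions, not details from the paper.

```python
import math
import random

MASK = None  # sentinel for a still-masked position


def toy_model(sequence, rng):
    """Random stand-in for the denoiser (hypothetical).

    Returns {position: (predicted_token, confidence)} for every masked slot.
    """
    return {
        i: (rng.randrange(8192), rng.random())
        for i, tok in enumerate(sequence)
        if tok is MASK
    }


def parallel_decode(length, num_steps, rng):
    """Unmask a fully masked sequence in at most num_steps forward passes."""
    sequence = [MASK] * length
    forward_passes = 0
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(sequence) if tok is MASK]
        if not masked:
            break
        predictions = toy_model(sequence, rng)  # one pass scores all slots
        forward_passes += 1

        # Commit an equal share of the remaining masked slots, keeping the
        # most confident predictions first.
        k = math.ceil(len(masked) / (num_steps - step))
        by_confidence = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        for i in by_confidence[:k]:
            sequence[i] = predictions[i][0]
    return sequence, forward_passes


sequence, passes = parallel_decode(length=256, num_steps=10, rng=random.Random(0))
print(passes)  # forward passes used by the parallel decoder
print(len(sequence))  # an AR decoder would need one pass per token instead
```

In this toy setup the parallel decoder fills all 256 positions in 10 forward passes, whereas token-by-token decoding would need 256. Whether quality holds up at so few steps depends on the real model, which is the empirical claim the article reports.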

Figure: Sampling Efficiency Comparison

Critical Insight & Conclusion

Omni-Diffusion isn't just "another multimodal model." It is a proof of concept for a non-autoregressive future. By modeling the joint distribution of tokens directly, it achieves a level of cross-modal fluidity (like Speech-to-Image) that is difficult for modular systems to replicate.

Limitations: While powerful, the model still relies on a frozen vocabulary. Future work might explore dynamic tokenization or hybrid AR-Diffusion architectures to capture the best of both worlds: the long-range reasoning of AR and the parallel generation of diffusion.

Takeaway

For the CV and NLP community, this paper is a signal: the era of "Everything is a Sequence" might be evolving into "Everything is a Diffusion Process."

Reference: Li et al., "Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion", 2025.
