[CVPR 2025] Wan-Weaver: Teaching LLMs to "Plan" Multi-modal Narratives without Interleaved Data
Abstract

Wan-Weaver is a unified multi-modal model employing a Mixture-of-Transformers (MoT) architecture specifically designed for interleaved text-image generation. By decoupling training into a "Planning Expert" (for layout and reasoning) and a "Visualization Expert" (for reference-based image synthesis), it achieves state-of-the-art results on the WeaverBench and OpenING benchmarks, even surpassing several commercial pipelines.

TL;DR

Interleaved text-image generation (think of a travel blog or an illustrated story) is the "holy grail" of multi-modal AI, yet high-quality data of this type is incredibly rare. Wan-Weaver breaks this bottleneck by decomposing the task into Textual Planning and Visual Consistency: a Planner trained on synthetic text-proxy data and a Visualizer trained on reference-driven images. The result is SOTA performance against both open-source models and commercial pipelines.

The "Broken Link" in Multi-modal Models

Current Unified Multi-modal Models (UMMs) usually excel at understanding both text and images, but when asked to generate them as an interleaved sequence, they fall short. The reasons are two-fold:

  1. Data Scarcity: Real-world documents with perfectly aligned, high-quality text-image sequences are hard to find at scale.
  2. Long-range Coherence: Keeping the protagonist's face or the art style consistent over a 10-page illustrated article is technically grueling.

Previous works tried "joint training," but the semantic gap between pixel-level and token-level objectives often causes "gradient interference," leading to degraded image quality or collapsed narrative logic.

Methodology: The Power of Decoupling

Wan-Weaver introduces a Mixture-of-Transformers (MoT) framework. Instead of asking one model to do everything, it splits the brain into two experts.
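As a mental model, the MoT idea can be written as modality-routed experts sharing one attention stream: text and image tokens attend to each other globally, while each modality has its own feed-forward weights. The sketch below is a minimal illustration of that routing; the class name, shapes, and routing details are my assumptions, not the authors' implementation.

```python
# Minimal sketch of modality-routed experts (hypothetical names, not the paper's code).
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # Attention is shared so text and image tokens can see each other.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Each modality gets its own feed-forward "expert".
        self.experts = nn.ModuleDict({
            "text": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "image": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality: (batch, seq) with 0 = text, 1 = image.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        out = torch.zeros_like(x)
        for idx, name in enumerate(("text", "image")):
            route = (modality == idx).unsqueeze(-1)     # which tokens belong to this expert
            out = out + self.experts[name](h) * route   # expert output applied only to its own tokens
        return x + out
```

The point of the split is that shared attention preserves cross-modal context while per-modality experts keep text and image gradients from competing over the same parameters.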

1. The Planner Expert (The "Director")

The Planner decides where an image should go and what it should look like. To train this without interleaved data, the authors used a brilliant trick: Textual Proxies. They replaced images in long articles with dense, descriptive tags like <imagine>...detailed prompt...</imagine>. This allows the model to learn the "rhythm" of interleaved content using pure text processing.
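In data terms, the trick is a one-pass rewrite of each document: every image is swapped for a dense caption wrapped in <imagine> tags, yielding a text-only corpus the Planner can be trained on. Below is a rough sketch of that preprocessing step, assuming some dense captioner `caption_image` (the helper name is hypothetical, not something specified in the paper).

```python
# Sketch: convert an interleaved document into a text-only "planning" sample.
from typing import Callable, List, Union

def to_textual_proxy(doc: List[Union[str, object]],
                     caption_image: Callable[[object], str]) -> str:
    """Replace each image with a dense <imagine>...</imagine> proxy so a
    text-only LLM can learn where images belong and what they should depict."""
    parts = []
    for item in doc:
        if isinstance(item, str):
            parts.append(item)                         # keep body text unchanged
        else:
            dense_prompt = caption_image(item)         # detailed description of the image
            parts.append(f"<imagine>{dense_prompt}</imagine>")
    return "\n\n".join(parts)
```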

2. The Visualizer Expert (The "Artist")

The Visualizer is a Diffusion Transformer (DiT) that listens to the Planner. It doesn't just generate an image from a prompt; it looks back at previous images through Reference-guided training. This ensures that if the first image had a green-shelled tortoise, the fifth image doesn't suddenly show a brown one.
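A simple way to picture reference-guided training: the denoiser's conditioning contains not just the current dense prompt but also latents of earlier images, so identity and style can be copied forward. The wrapper below is an illustrative sketch under that reading; the module, shapes, and the inner `dit` interface are assumptions, not the paper's exact design.

```python
# Sketch: condition a DiT-style denoiser on earlier image latents as references.
import torch
import torch.nn as nn

class ReferenceGuidedDenoiser(nn.Module):
    """Illustrative wrapper: the conditioning sequence contains latents of
    previously generated images in addition to the dense prompt."""
    def __init__(self, dit: nn.Module, dim: int):
        super().__init__()
        self.dit = dit                        # any denoiser taking (noisy_latent, timestep, context)
        self.ref_proj = nn.Linear(dim, dim)   # map reference latents into the context space

    def forward(self, noisy_latent, timestep, prompt_tokens, reference_latents):
        # prompt_tokens:     (batch, n_text, dim)  the Planner's dense prompt
        # reference_latents: (batch, n_refs, dim)  tokens from earlier images in the sequence
        refs = self.ref_proj(reference_latents)
        context = torch.cat([prompt_tokens, refs], dim=1)   # prompt + visual history
        return self.dit(noisy_latent, timestep, context)
```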

[Figure: Model Architecture]

3. Dense Prompt Context Window (DPCW)

To fix information loss when converting visual context to text, the authors introduced DPCW. This mechanism defines a specific attention window around the "dense prompt," allowing the visualizer to "peer back" into the raw features of the preceding context, ensuring tighter alignment.
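Read literally, DPCW behaves like a local attention window: tokens in the dense-prompt region may attend to a bounded span of preceding context features instead of being cut off at the text proxy. The toy mask below encodes that interpretation; the exact window semantics are my assumption from this summary, not the paper's definition.

```python
# Sketch: boolean attention mask letting dense-prompt tokens "peer back"
# into a window of preceding context features.
import torch

def dpcw_mask(seq_len: int, prompt_start: int, prompt_end: int, window: int) -> torch.Tensor:
    """Return a (seq_len, seq_len) mask where True means 'may attend'.
    Dense-prompt tokens keep causal access to themselves plus a fixed window
    of context tokens immediately preceding the prompt."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # plain causal attention
    ctx_lo = max(0, prompt_start - window)
    for i in range(prompt_start, prompt_end):
        mask[i, :ctx_lo] = False   # context older than the window is cut off for prompt rows
    return mask
```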

Experiments: Surpassing the Benchmarks

The authors introduced WeaverBench, a comprehensive test suite covering 14 everyday scenarios like "Travel Guides," "Food Cooking," and "Academic Research."

Key Results:

  • Instruction Accuracy: Wan-Weaver follows image-count constraints with 93.44% accuracy, far ahead of commercial models (~66%).
  • Visual Fidelity: It outperforms previous SOTA models like Emu3 and Anole across all metrics (Quality, Richness, Coherency).

[Figure: Experimental Results]

One interesting insight from the ablation study is that Decoupled Training (training the planner and visualizer separately) yields a much smoother, lower loss curve than joint training, suggesting that "separation of concerns" is key in multi-modal architectures.
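That ablation is easy to mirror in a training loop: decoupled training updates each expert only on its own objective, while joint training backpropagates one mixed loss through shared weights. Below is a schematic comparison in which all models, losses, and optimizers are placeholders, not the authors' code.

```python
# Sketch: decoupled updates vs. a joint update (all names are placeholders).
def decoupled_step(planner, visualizer, text_batch, image_batch, opt_p, opt_v):
    # Planner learns layout/reasoning from textual-proxy data only.
    loss_p = planner.loss(text_batch)
    opt_p.zero_grad(); loss_p.backward(); opt_p.step()

    # Visualizer learns reference-guided synthesis from image data only.
    loss_v = visualizer.loss(image_batch)
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()
    return loss_p.item(), loss_v.item()

def joint_step(model, interleaved_batch, opt):
    # One mixed loss: text and image gradients flow through shared weights,
    # which is where the "gradient interference" mentioned earlier comes from.
    loss = model.loss(interleaved_batch)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```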

Critical Analysis & Conclusion

Wan-Weaver proves that we don't necessarily need perfect "Nature-made" datasets to train complex multi-modal behaviors. Strategic decomposition and synthetic "proxy" data are sufficient to trigger emergent interleaved capabilities.

Limitations:

  • Sequential Bottleneck: Generation speed decreases as the sequence grows because the model must attend to an ever-expanding history.
  • Structural Layout: In complex cases, the model still occasionally collapses multiple images into a "grid" rather than distributing them through the text.

The Takeaway: Wan-Weaver sets a new bar for open-source multi-modal generation, demonstrating that a "Planning + Execution" architecture is currently the most robust way to handle long-range multi-modal consistency.


For more details, check out the project page: Wan-Weaver Project
