UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

[UniG2U-Bench] Does Drawing Help Thinking? Unveiling the "Alignment Tax" in Unified AI

Abstract

UniG2U-Bench is a large-scale diagnostic benchmark designed to evaluate "Generation-to-Understanding" (G2U) capabilities in over 30 unified multimodal models. By pairing models with their base VLMs across 3,000 samples spanning 30 tasks, it systematically identifies where generative capacity aids or hinders visual reasoning.

Executive Summary

TL;DR: The promise of unified multimodal models (like Janus, Show-o, and OmniGen) is that "creating" an image implies "understanding" it. However, UniG2U-Bench—the most comprehensive diagnostic testbed to date—reveals a more nuanced reality. While joint training often imposes an "Alignment Tax" that degrades general perception, it provides a unique structural advantage for spatial intelligence. Crucially, the Generate-then-Answer (GtA) paradigm only works when the model's "Visual Chain-of-Thought" is high-fidelity; otherwise, it propagates errors that lead to reasoning collapse.

Background: This work is a critical "reality check" for the multimodal community, moving beyond SOTA-chasing to provide a mechanistic look at the trade-offs of architectural unification.

The Core Conflict: Motivation vs. Reality

The research is driven by Richard Feynman’s adage: "What I cannot create, I do not understand." If a model can generate a complex geometric construction, it should theoretically be better at solving the geometry problem.

The Pain Point: Prior benchmarks (like MME-Unify) asked "can it answer?" and "can it draw?" separately. They didn't ask: "Does the drawing make the answering better?" UniG2U-Bench fills this gap with 3,000 samples across 30 tasks, specifically choosing areas where visual externalization should help (e.g., puzzles, spatial tracking, geometry).

Methodology: Isolating the G2U Gain

To isolate the effect of generation, the authors used a "Base-Unified Pairing" strategy. They compared unified models (end-to-end or decoupled) against their exact non-generative VLM counterparts (e.g., comparing Bagel to its Qwen2.5-VL base).
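The pairing idea boils down to a per-task difference between a unified model and its base. Below is a minimal sketch of that bookkeeping; the `PairedResult` type, the `g2u_gain` helper, and all numbers are made up for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PairedResult:
    """One task scored for a unified model and its base VLM (hypothetical type)."""
    task: str
    base_acc: float     # accuracy of the non-generative base VLM
    unified_acc: float  # accuracy of the unified model built on that base


def g2u_gain(results: List[PairedResult]) -> Tuple[Dict[str, float], float]:
    """Return the per-task gain (unified minus base) and the mean gain."""
    per_task = {r.task: r.unified_acc - r.base_acc for r in results}
    mean_gain = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return per_task, mean_gain


# Illustrative numbers only; they are NOT results from the benchmark.
demo = [
    PairedResult("general_perception", base_acc=0.70, unified_acc=0.65),
    PairedResult("mental_rotation", base_acc=0.40, unified_acc=0.46),
]
per_task, mean_gain = g2u_gain(demo)
```

A negative entry in `per_task` means the unified model scored below its base on that task, and a positive entry means it scored above.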

The Two Protocols

Direct: Input Image $\to$ Answer.
GtA (Generate-then-Answer): Input Image $\to$ Generated Intermediate Rationale $\to$ Final Answer.
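The two protocols can be sketched as two small functions. The `UnifiedModel` interface, the prompt wording, and the choice to pass both the original and the generated image to the final answer step are assumptions made for this sketch, not details confirmed by the article.

```python
from typing import List, Protocol


class UnifiedModel(Protocol):
    """Hypothetical interface; not the API of any model named in this article."""

    def generate_image(self, image: str, prompt: str) -> str: ...

    def answer(self, images: List[str], question: str) -> str: ...


def run_direct(model: UnifiedModel, image: str, question: str) -> str:
    """Direct: input image -> answer."""
    return model.answer([image], question)


def run_gta(model: UnifiedModel, image: str, question: str) -> str:
    """GtA: input image -> generated intermediate image -> final answer."""
    intermediate = model.generate_image(image, f"Draw a visual aid for: {question}")
    # Assumption: the answer step sees the original image plus the generated one.
    return model.answer([image, intermediate], question)


class EchoModel:
    """Toy stand-in that only reports how many images it received."""

    def generate_image(self, image: str, prompt: str) -> str:
        return f"aid({image})"

    def answer(self, images: List[str], question: str) -> str:
        return f"answer from {len(images)} image(s)"
```

With the toy `EchoModel`, `run_direct` answers from one image and `run_gta` answers from two, which is the only structural difference between the protocols in this sketch.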

Figure 1 (Model Taxonomy and Inference Protocols): Taxonomy of Unified Multimodal Models (UMMs) evaluated in the benchmark.

Key Finding 1: The "Alignment Tax"

The most striking result is that unified models often perform worse than their base models on standard tasks. Integrating generation isn't free. The joint training objective introduces "noise" or interference that dilutes the model’s fine-grained discriminative power.

However, there is a silver lining: Spatial Intelligence. In tasks involving mental rotation, object motion, and visual illusions, the generative training acts as a structural regularizer, helping the model maintain a more coherent internal map of 3D space than pure VLMs.

Figure 2 (Performance Radar Chart): Radar chart showing that while overall performance often dips, specific spatial gains are visible.

Key Finding 2: The Double-Edged Sword of Visual CoT

Does explicit drawing (GtA) help?

In Perception tasks: No. It's redundant. The model draws well but doesn't need the drawing.
In Logic/Math tasks: No. The models aren't precise enough. A slightly skewed "auxiliary line" in a geometry problem leads to a totally wrong answer (Error Propagation).
In Transformation-Intensive tasks: Yes! For Mazes and Sliding Puzzles, GtA acts as an external cognitive workspace, reducing the memory load needed to track states.
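One way to check a breakdown like the one above is to compare GtA and Direct accuracy per task category from per-sample outcomes. The record layout and the `gta_delta_by_category` helper are hypothetical, and the demo records are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def gta_delta_by_category(
    records: Iterable[Tuple[str, bool, bool]]
) -> Dict[str, float]:
    """records holds (category, direct_correct, gta_correct) per sample.

    Returns {category: GtA accuracy minus Direct accuracy}.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # samples, Direct hits, GtA hits
    for category, direct_ok, gta_ok in records:
        entry = counts[category]
        entry[0] += 1
        entry[1] += int(direct_ok)
        entry[2] += int(gta_ok)
    return {cat: (g - d) / n for cat, (n, d, g) in counts.items()}


# Invented outcomes, for illustration only.
demo_records = [
    ("maze", False, True),
    ("maze", True, True),
    ("geometry", True, False),
    ("geometry", True, True),
]
deltas = gta_delta_by_category(demo_records)
```

A positive delta means drawing first helped in that category, and a negative delta means it hurt.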

Figure 3 (GtA Failure Cases): Examples of "Capability Failure" where poor generation destroys the reasoning path.

Critical Insight: RA vs. AL

The authors introduce two vital metrics to diagnose GtA:

Reasoning-to-Visual Alignment (RA): Did the model draw what it was told to? (Fidelity)
Answer-to-Visual Alignment (AL): Is the final answer consistent with the new drawing? (Consistency)

The takeaway: High alignment is necessary but not sufficient. For G2U to work, the model needs to operate in a "sweet spot" where the task requires a visual scaffold and the model is capable of generating it accurately.
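To make the "necessary but not sufficient" point concrete, here is a small sketch that sorts a single GtA sample into a diagnostic bucket. It assumes RA and AL have already been reduced to pass/fail judgments; how the paper actually scores them is not described here, and the bucket names are my own labels.

```python
def diagnose_gta_sample(ra_ok: bool, al_ok: bool, correct: bool) -> str:
    """Classify one GtA sample from three yes/no judgments (hypothetical scheme).

    ra_ok:   the drawing follows the stated reasoning plan (RA)
    al_ok:   the final answer is consistent with the drawing (AL)
    correct: the final answer matches the ground truth
    """
    if not ra_ok:
        return "drawing_off_plan"
    if not al_ok:
        return "answer_ignores_drawing"
    if not correct:
        # Both alignment checks pass, yet the answer is still wrong.
        return "aligned_but_wrong"
    return "aligned_and_correct"
```

The `aligned_but_wrong` bucket is the case the takeaway warns about: a faithful drawing and a consistent answer do not by themselves guarantee a correct one.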

Conclusion: The Path Forward

UniG2U-Bench shows that "unification" is not a magic bullet. To unlock true G2U synergy, the field needs:

Reliability-aware generation: Models must know when their drawing is too low-quality to trust.
Structural objectives: Training needs to prioritize geometric and physical constraints over "pretty" pixels.
Closed-loop refinement: Allowing the model to "erase and redraw" if the reasoning doesn't add up.
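The first and third items can be combined into one control loop: draw, score the drawing, redraw if it looks unreliable, and fall back to a direct answer when no drawing is good enough. This is a speculative sketch of that idea, not a method from the paper; every callable, the threshold, and the attempt limit are placeholders.

```python
from typing import Callable


def answer_with_refinement(
    draw: Callable[[int], str],
    score: Callable[[str], float],
    answer_with_drawing: Callable[[str], str],
    answer_directly: Callable[[], str],
    threshold: float = 0.8,   # placeholder reliability cutoff
    max_attempts: int = 3,    # placeholder redraw budget
) -> str:
    """Use a drawing only if its reliability score clears the threshold."""
    for attempt in range(max_attempts):
        drawing = draw(attempt)            # "erase and redraw" on each pass
        if score(drawing) >= threshold:    # reliability-aware gate
            return answer_with_drawing(drawing)
    return answer_directly()               # no trustworthy drawing was produced


# Toy stand-ins: the first draft scores low, the second clears the gate.
scores = {"draft-0": 0.2, "draft-1": 0.9}
result = answer_with_refinement(
    draw=lambda attempt: f"draft-{attempt}",
    score=lambda drawing: scores.get(drawing, 0.0),
    answer_with_drawing=lambda drawing: f"answered using {drawing}",
    answer_directly=lambda: "answered directly",
)
```

In the toy run the first draft is rejected and the second is accepted; if every draft scored below the threshold, the function would return the direct answer instead.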

This benchmark provides the compass for the next generation of truly unified multimodal agents.
