UniG2U-Bench is a large-scale diagnostic benchmark designed to evaluate "Generation-to-Understanding" (G2U) capabilities in over 30 unified multimodal models. By pairing models with their base VLMs across 3,000 samples spanning 30 tasks, it systematically identifies where generative capacity aids or hinders visual reasoning.
Executive Summary
TL;DR: The promise of unified multimodal models (like Janus, Show-o, and OmniGen) is that "creating" an image implies "understanding" it. However, UniG2U-Bench, the most comprehensive diagnostic testbed to date, reveals a more nuanced reality. While joint training often imposes an "Alignment Tax" that degrades general perception, it provides a unique structural advantage for spatial intelligence. Crucially, the Generate-then-Answer (GtA) paradigm only works when the model's "Visual Chain-of-Thought" is high-fidelity; otherwise, it propagates errors that lead to reasoning collapse.
Background: This work is a critical "reality check" for the multimodal community, moving beyond SOTA-chasing to provide a mechanistic look at the trade-offs of architectural unification.
The Core Conflict: Motivation vs. Reality
The research is driven by Richard Feynman’s adage: "What I cannot create, I do not understand." If a model can generate a complex geometric construction, it should theoretically be better at solving the geometry problem.
The Pain Point: Prior benchmarks (like MME-Unify) asked "can it answer?" and "can it draw?" separately. They didn't ask: "Does the drawing make the answering better?" UniG2U-Bench fills this gap with 3,000 samples across 30 tasks, specifically choosing areas where visual externalization should help (e.g., puzzles, spatial tracking, geometry).
Methodology: Isolating the G2U Gain
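To isolate the effect of generative training, the authors used a "Base-Unified Pairing" strategy. They compared unified models (end-to-end or decoupled) against their exact non-generative VLM counterparts (e.g., comparing Bagel to its Qwen2.5-VL base), so that any difference in accuracy can be attributed to unification rather than to a stronger backbone.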
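The pairing idea can be sketched as a per-category accuracy difference between a unified model and its base VLM on the same samples. This is a minimal illustration, not the authors' evaluation code: the function name `paired_gain`, the record layout, and the toy records are all assumptions made up for the example, and the numbers are not benchmark results.

```python
from collections import defaultdict


def paired_gain(records):
    """Return {category: unified_accuracy - base_accuracy}.

    `records` is an iterable of (category, base_correct, unified_correct)
    tuples, one per sample. This layout is hypothetical.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [samples, base hits, unified hits]
    for category, base_ok, unified_ok in records:
        entry = totals[category]
        entry[0] += 1
        entry[1] += int(base_ok)
        entry[2] += int(unified_ok)
    # Same samples for both models, so the difference of hit counts
    # divided by the sample count is the accuracy delta.
    return {c: (u - b) / n for c, (n, b, u) in totals.items()}


# Toy records for illustration only (not real benchmark data).
toy_records = [
    ("perception", True, False),
    ("perception", True, True),
    ("spatial", False, True),
    ("spatial", True, True),
]
gains = paired_gain(toy_records)
```

A negative value for a category means the unified model trails its base on that category; a positive value means it leads.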
The Two Protocols
- Direct: Input Image → Answer.
- GtA (Generate-then-Answer): Input Image → Generated Intermediate Rationale → Final Answer.
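The two protocols differ only in whether a generated intermediate is produced before answering. The sketch below makes that concrete with a stand-in model. The `UnifiedModel` wrapper and its `answer` and `generate` callables are hypothetical, and conditioning the final answer on both the original image and the generated one is an assumption, not something the summary specifies.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class UnifiedModel:
    # Hypothetical interface; a real model's API will differ.
    answer: Callable[[str, List[Any]], str]  # (question, images) -> answer text
    generate: Callable[[str, Any], Any]      # (question, image) -> intermediate image


def run_direct(model: UnifiedModel, image: Any, question: str) -> str:
    """Direct protocol: answer from the input image alone."""
    return model.answer(question, [image])


def run_gta(model: UnifiedModel, image: Any, question: str) -> str:
    """GtA protocol: generate a visual intermediate, then answer with it."""
    intermediate = model.generate(question, image)
    # Assumption: the answer step sees the original and the generated image.
    return model.answer(question, [image, intermediate])


# Stub model that only reports how many images the answer step received.
stub = UnifiedModel(
    answer=lambda question, images: f"{len(images)} image(s)",
    generate=lambda question, image: "sketch",
)
direct_out = run_direct(stub, "input.png", "Which path exits the maze?")
gta_out = run_gta(stub, "input.png", "Which path exits the maze?")
```

With the stub, the Direct call answers from one image and the GtA call from two, which is the only structural difference between the protocols.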
Figure 1: Taxonomy of Unified Multimodal Models (UMMs) evaluated in the benchmark.
Key Finding 1: The "Alignment Tax"
The most striking result is that unified models often perform worse than their base models on standard tasks. Integrating generation isn't free. The joint training objective introduces "noise" or interference that dilutes the model’s fine-grained discriminative power.
However, there is a silver lining: Spatial Intelligence. In tasks involving mental rotation, object motion, and visual illusions, the generative training acts as a structural regularizer, helping the model maintain a more coherent internal map of 3D space than pure VLMs.
Figure 2: Radar chart showing that while overall performance often dips, specific spatial gains are visible.
Key Finding 2: The Double-Edged Sword of Visual CoT
Does explicit drawing (GtA) help?
- In Perception tasks: No. It's redundant. The model draws well but doesn't need the drawing.
- In Logic/Math tasks: No. The models aren't precise enough. A slightly skewed "auxiliary line" in a geometry problem leads to a totally wrong answer (Error Propagation).
- In Transformation-Intensive tasks: Yes! For Mazes and Sliding Puzzles, GtA acts as an external cognitive workspace, reducing the memory load needed to track states.
Figure 3: Examples of "Capability Failure" where poor generation destroys the reasoning path.
Critical Insight: RA vs. AL
The authors introduce two vital metrics to diagnose GtA:
- Reasoning-to-Visual Alignment (RA): Did the model draw what it was told to? (Fidelity)
- Answer-to-Visual Alignment (AL): Is the final answer consistent with the new drawing? (Consistency)
The takeaway: High alignment is necessary but not sufficient. For G2U to work, the model needs to operate in a "sweet spot" where the task requires a visual scaffold and the model is capable of generating it accurately.
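One way to use the two metrics together is to sort each GtA sample into a diagnostic bucket. The sketch below assumes RA and AL are scores in [0, 1] and uses an arbitrary threshold of 0.5. The score range, the threshold, and the bucket labels are illustrative choices, not definitions from the benchmark.

```python
def diagnose_gta(ra: float, al: float, correct: bool, threshold: float = 0.5) -> str:
    """Label one GtA sample from its RA score, AL score, and correctness.

    Assumes scores in [0, 1]; the threshold and labels are hypothetical.
    """
    if ra < threshold:
        # The drawing does not match what the reasoning asked for.
        return "generation failure"
    if al < threshold:
        # The drawing is faithful, but the answer does not follow from it.
        return "answer ignores drawing"
    # Both alignments hold; correctness is still a separate question.
    return "aligned and correct" if correct else "aligned but wrong"


labels = [
    diagnose_gta(0.2, 0.9, False),
    diagnose_gta(0.9, 0.3, False),
    diagnose_gta(0.9, 0.9, False),
    diagnose_gta(0.9, 0.9, True),
]
```

The "aligned but wrong" bucket is the case the takeaway describes: both alignment checks pass and the answer is still incorrect, so alignment alone does not guarantee a G2U gain.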
Conclusion: The Path Forward
UniG2U-Bench shows that "unification" is not a magic bullet. To unlock true G2U synergy, the field needs:
- Reliability-aware generation: Models must know when their drawing is too low-quality to trust.
- Structural objectives: Training needs to prioritize geometric and physical constraints over "pretty" pixels.
- Closed-loop refinement: Allowing the model to "erase and redraw" if the reasoning doesn't add up.
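The first and third items can be combined into a simple control loop: redraw until a quality check passes, and fall back to answering directly if it never does. This is a sketch of the idea only. The four callables, the retry budget, and the 0.7 quality bar are hypothetical placeholders, and the source does not describe a specific implementation.

```python
from typing import Any, Callable, Tuple


def answer_with_refinement(
    generate: Callable[[], Any],
    score: Callable[[Any], float],
    answer_with: Callable[[Any], str],
    answer_direct: Callable[[], str],
    max_tries: int = 3,
    min_score: float = 0.7,
) -> Tuple[str, str]:
    """Return (answer, route), where route is "gta" or "direct".

    All callables are hypothetical stand-ins for model components.
    """
    for _ in range(max_tries):
        drawing = generate()  # "erase and redraw" on each attempt
        if score(drawing) >= min_score:
            return answer_with(drawing), "gta"
    # No drawing was good enough to trust: skip the visual step entirely.
    return answer_direct(), "direct"


# Stub run: the first draft scores 0.2, the second 0.9.
draft_scores = iter([0.2, 0.9])
refined = answer_with_refinement(
    generate=lambda: next(draft_scores),
    score=lambda draft: draft,
    answer_with=lambda draft: f"answer from draft scored {draft}",
    answer_direct=lambda: "direct answer",
)

# Stub run: every draft is poor, so the loop falls back to Direct.
fallback = answer_with_refinement(
    generate=lambda: 0.1,
    score=lambda draft: draft,
    answer_with=lambda draft: "unused",
    answer_direct=lambda: "direct answer",
)
```

How the quality score is obtained is the hard part in practice; the loop only shows where such a signal would plug in.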
This benchmark provides the compass for the next generation of truly unified multimodal agents.
