[CVPR 2025] TIGON: Resolving 3D Generation Ambiguity via Dual-Branch Text-Image Fusion
Abstract

The paper introduces TIGON, a minimalist dual-branch 3D generative baseline that formalizes the task of Text–Image Conditioned 3D Generation. By combining a modality-specific Diffusion Transformer (DiT) architecture with lightweight cross-modal fusion, it achieves state-of-the-art performance in generating high-fidelity 3D assets that are both semantically aligned with text and visually faithful to image exemplars.

TL;DR

While image-conditioned 3D generation excels at texture but fails on hidden geometry, and text-to-3D provides structure but lacks detail, TIGON bridges the gap. By introducing a dual-branch Diffusion Transformer (DiT) with cross-modal bridges, TIGON lets users provide both a visual reference and a textual description, yielding 3D assets that are semantically precise and visually faithful to the reference.

Background: The "Half-Seen" Problem in 3D

Most SOTA 3D generators belong to one of two camps:

  1. Image-to-3D: Captures great local detail but "hallucinates" the back of the object because it lacks semantic context.
  2. Text-to-3D: Understands what a "tiger" looks like globally but cannot match a specific artistic style or color palette provided by a user.

The authors identify a significant performance drop (up to a 3x increase in error) when an image-only model is given a "low-information" view (e.g., looking at a trophy from the top). TIGON was designed to solve this by using text to "fill in the blanks" of the image.

Methodology: The TIGON Architecture

TIGON avoids the common pitfall of mixing image and text tokens into a single transformer, which often leads to "semantic dilution." Instead, it maintains two specialized backbones:

1. Dual-Branch DiT

The model features an Image Branch (processing dense, pixel-aligned tokens) and a Text Branch (processing sparse, semantic CLIP tokens). This ensures that neither modality compromises the other's specialized feature space.
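
To make the separation concrete, here is a minimal sketch of one such layer pair, assuming standard PyTorch transformer blocks stand in for the paper's DiT blocks; the class name, token dimensions, and head counts are illustrative assumptions, not taken from the authors' code.

```python
import torch
import torch.nn as nn

class DualBranchLayer(nn.Module):
    """One layer of a dual-branch backbone: each modality keeps its own
    block, so neither feature space is compromised by the other."""
    def __init__(self, img_dim: int = 1024, txt_dim: int = 768):
        super().__init__()
        # Image branch: dense, pixel-aligned tokens.
        self.img_block = nn.TransformerEncoderLayer(
            d_model=img_dim, nhead=16, batch_first=True)
        # Text branch: sparse, semantic CLIP tokens.
        self.txt_block = nn.TransformerEncoderLayer(
            d_model=txt_dim, nhead=12, batch_first=True)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # The two streams are processed independently here; they interact
        # only through the cross-modal bridges described in the next section.
        return self.img_block(img_tokens), self.txt_block(txt_tokens)
```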

2. Early & Late Fusion

  • Early Fusion: "Cross-modal bridges" (linear layers) are placed between every block of the two transformers. These are zero-initialized to ensure training stability, gradually allowing information to flow between the visual and semantic branches.
  • Late Fusion: During the denoising process, TIGON averages the predicted velocity fields from both branches at every timestep. Both mechanisms are sketched below.
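
A minimal sketch of the two fusion mechanisms, assuming the bridges are plain linear layers as stated above; the mean-pooling scheme, feature dimensions, and function names are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

def zero_init_linear(in_dim: int, out_dim: int) -> nn.Linear:
    """A cross-modal bridge whose weights and bias start at zero, so no
    information flows between branches at initialization."""
    layer = nn.Linear(in_dim, out_dim)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer

# Early fusion: one bridge pair per block, exchanging residual signals.
bridge_t2i = zero_init_linear(768, 1024)   # text features -> image branch
bridge_i2t = zero_init_linear(1024, 768)   # image features -> text branch

def bridge(img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
    # Mean-pool the other branch's tokens and add them as a residual term;
    # at initialization these terms are exactly zero, so each branch starts
    # out behaving like a single-modality model.
    img_out = img_tokens + bridge_t2i(txt_tokens.mean(dim=1, keepdim=True))
    txt_out = txt_tokens + bridge_i2t(img_tokens.mean(dim=1, keepdim=True))
    return img_out, txt_out

# Late fusion: average the two branches' velocity predictions per timestep.
def fused_velocity(v_img: torch.Tensor, v_txt: torch.Tensor) -> torch.Tensor:
    return 0.5 * (v_img + v_txt)
```

Because the bridges start at zero, cross-modal influence grows only as training demands it, which is what makes it safe to insert them between every block.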

TIGON Architecture Figure: The dual-branch pipeline showing feature exchange via cross-modal bridges and final prediction averaging.

Experiments: Quantitative Superiority

The results on Toys4K and UniLat1K demonstrate that joint conditioning is more than a convenience: it is a substantial performance boost. In scenarios where a single-view image is ambiguous, adding a text prompt allows TIGON to achieve a significantly lower Fréchet Distance (FD) than previous SOTA models such as TRELLIS and UniLat3D.
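
For readers unfamiliar with the metric, and assuming the paper follows the standard Fréchet formulation (the exact feature extractor is not specified here), features of real and generated assets are each fit with a Gaussian and compared as

$$
\mathrm{FD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the real and generated feature distributions; lower is better.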

Performance Comparison Table: TIGON (I+T) achieves the highest CLIP scores and lowest FD metrics across both benchmarks.

Qualitative Insights: Semantic Disambiguation

The true power of TIGON is visible in "controllable generation." For instance, a top-down view of a generic "console" can be transformed into a specific Nintendo Switch-like device or a vintage handheld simply by altering the text prompt. This level of control is impossible with image-only models, which would simply guess the hidden sides.

Visual Results Figure: TIGON successfully generates consistent 3D geometry even when the reference image (dashed boxes) provides limited viewpoint information.

Critical Analysis & Conclusion

TIGON serves as a vital baseline for the newly formalized Text-Image Conditioned 3D Generation task.

  • Key Insight: The paper shows that simple prediction averaging is surprisingly effective when the internal features are allowed to "talk" to each other early in the network.
  • Limitations: When text and image conditions explicitly conflict (e.g., an image of a cat paired with the word "dog"), the model tends to favor the image modality, indicating a potential area for future "modality weighting" research.

In summary, TIGON is a robust, flexible framework that moves 3D generation closer to professional production requirements, where precise control over both semantics and visual identity is non-negotiable.
