The paper introduces TIGON, a minimalist dual-branch 3D generative baseline that formalizes the task of Text–Image Conditioned 3D Generation. By combining a modality-specific Diffusion Transformer (DiT) architecture with lightweight cross-modal fusion, it achieves state-of-the-art performance in generating high-fidelity 3D assets that are both semantically aligned with text and visually faithful to image exemplars.
TL;DR
While image-conditioned 3D generation excels at texture but fails on hidden geometry, and text-to-3D provides global structure but lacks fine detail, TIGON bridges the gap. By introducing a dual-branch Diffusion Transformer (DiT) with cross-modal bridges, TIGON lets users provide both a visual reference and a textual description, producing 3D assets that are both semantically precise and visually faithful.
Background: The "Half-Seen" Problem in 3D
Most SOTA 3D generators belong to one of two camps:
- Image-to-3D: Captures great local detail but "hallucinates" the back of the object because it lacks semantic context.
- Text-to-3D: Understands what a "tiger" looks like globally but cannot match a specific artistic style or color palette provided by a user.
The authors identify a significant performance drop (up to 3x error increase) when an image-only model is given a "low-information" view (e.g., looking at a trophy from the top). TIGON was designed to solve this by using text to "fill in the blanks" of the image.
Methodology: The TIGON Architecture
TIGON avoids the common pitfall of mixing image and text tokens into a single transformer, which often leads to "semantic dilution." Instead, it maintains two specialized backbones:
1. Dual-Branch DiT
The model features an Image Branch (processing dense, pixel-aligned tokens) and a Text Branch (processing sparse, semantic CLIP tokens). This ensures that neither modality compromises the other's specialized feature space.
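As a rough sketch of what "dense, pixel-aligned" versus "sparse, semantic" tokens means in practice, the snippet below keeps one weight matrix per branch so neither modality's statistics leak into the other's feature space. All dimensions are illustrative assumptions (not the paper's actual sizes), and each branch's "block" is a single linear stand-in for a full transformer layer:

```python
import numpy as np

rng = np.random.default_rng(42)
d_model = 64  # shared hidden width (illustrative assumption)

# Image branch: dense, pixel-aligned tokens, e.g. one per image patch.
img_tokens = rng.standard_normal((32 * 32, d_model))

# Text branch: a short sequence of sparse semantic tokens, as produced
# by a CLIP-style text encoder (77 is CLIP's usual context length).
txt_tokens = rng.standard_normal((77, d_model))

# Each branch has its own parameters, so the visual and semantic
# feature spaces stay specialized rather than being forced to share.
W_img = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_txt = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

img_out = img_tokens @ W_img  # image-branch "block" (linear stand-in)
txt_out = txt_tokens @ W_txt  # text-branch "block" (linear stand-in)
```

The point of the separation is visible in the shapes alone: the image stream carries ~1,000 spatial tokens while the text stream carries a few dozen semantic ones, and neither backbone has to compromise to accommodate the other.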
2. Early & Late Fusion
- Early Fusion: "Cross-modal bridges" (linear layers) are placed between every block of the two transformers. These are zero-initialized to ensure training stability, gradually allowing information to flow between the visual and semantic branches.
- Late Fusion: During the denoising process, TIGON averages the predicted velocity fields from both branches at every timestep.
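The two fusion mechanisms above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the bridge here is a single zero-initialized linear map (the paper places such bridges between every pair of blocks), and the pooled-text injection is one possible bridging scheme assumed for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # shared feature width (illustrative assumption)

# Early fusion: a zero-initialized linear bridge between branches.
# At initialization the bridge contributes nothing, so each branch
# starts from its specialized behavior; training gradually "opens"
# the bridge and lets cross-modal information flow.
W_bridge = np.zeros((d, d))

def bridge(tokens):
    return tokens @ W_bridge  # all-zero output at init

img_tokens = rng.standard_normal((256, d))  # dense, pixel-aligned
txt_tokens = rng.standard_normal((16, d))   # sparse, semantic

# Inject a pooled text summary into the image branch via the bridge.
img_tokens = img_tokens + bridge(txt_tokens.mean(axis=0, keepdims=True))

# Late fusion: average the velocity fields predicted by the two
# branches at every denoising timestep.
v_img = rng.standard_normal((256, d))  # image-branch prediction
v_txt = rng.standard_normal((256, d))  # text-branch prediction
v_fused = 0.5 * (v_img + v_txt)
```

Zero-initializing the bridge is the key stability trick: the combined model behaves exactly like two independent branches at step zero, so early training cannot be destabilized by noisy cross-modal signals.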
Figure: The dual-branch pipeline showing feature exchange via cross-modal bridges and final prediction averaging.
Experiments: Quantitative Superiority
The results on Toys4K and UniLat1K show that joint conditioning is more than a convenience: in scenarios where a single-view image is ambiguous, adding a text prompt lets TIGON achieve a substantially lower Fréchet Distance (FD) than previous SOTA models such as TRELLIS or UniLat3D.
Table: TIGON (I+T) achieves the highest CLIP scores and lowest FD metrics across both benchmarks.
Qualitative Insights: Semantic Disambiguation
The true power of TIGON is visible in "controllable generation." For instance, a top-down view of a generic "console" can be transformed into a specific Nintendo Switch-like device or a vintage handheld simply by altering the text prompt. This level of control is impossible with image-only models, which would simply guess the hidden sides.
Figure: TIGON successfully generates consistent 3D geometry even when the reference image (dashed boxes) provides limited viewpoint information.
Critical Analysis & Conclusion
TIGON serves as a vital baseline for the newly formalized Text-Image Conditioned 3D Generation task.
- Key Insight: The paper shows that simple prediction averaging is surprisingly effective when the internal features are allowed to "talk" to each other early in the network.
- Limitations: When text and image conditions explicitly conflict (e.g., an image of a cat paired with the word "dog"), the model tends to favor the image modality, indicating a potential area for future "modality weighting" research.
In summary, TIGON is a robust, flexible framework that moves 3D generation closer to professional production requirements, where precise control over both semantics and visual identity is non-negotiable.
