Repurposing Geometric Foundation Models for Multi-view Diffusion

[CVPR 2026] Geometric Latent Diffusion: Repurposing Foundation Models for 3D-Consistent NVS

Abstract

The paper introduces Geometric Latent Diffusion (GLD), a framework that repurposes the feature space of geometric foundation models (e.g., Depth Anything 3) as the latent space for multi-view novel view synthesis (NVS). GLD achieves state-of-the-art 3D consistency and image quality, outperforming VAE-based diffusion models while accelerating training convergence by 4.4x.

Executive Summary

TL;DR: Geometric Latent Diffusion (GLD) moves the diffusion process for Novel View Synthesis (NVS) out of a standard VAE latent space and into the feature space of a geometric foundation model such as Depth Anything 3. Because these latents already encode 3D structure, GLD reports better geometric consistency, 4.4x faster training convergence, and zero-shot 3D reconstruction, all without massive text-to-image pretraining.

Context: Within generative AI, this work sits at the intersection of Visual Foundation Models and Diffusion Probabilistic Models. It is a "structural optimization" work: it argues that the latent space itself is the bottleneck for geometry-aware generation.

The Motivation: Why 2D Latents Fail 3D Tasks

Current state-of-the-art NVS methods usually fine-tune Stable Diffusion. However, Stable Diffusion lives in a VAE latent space optimized for texture and 2D semantics, not 3D spatial relationships.

The authors argue that this forces the diffusion model to do "double duty": it must learn to generate pixels while simultaneously trying to rediscover the laws of epipolar geometry. This leads to common failure modes like "flickering" textures or warped structures when moving between viewpoints.

The Insight: If we use a latent space from a model already trained for depth estimation and point matching, the diffusion process "inherits" a coordinate system that is naturally 3D-aware.

Methodology: The GLD Framework

1. Repurposing the Geometric Backbone

GLD uses Depth Anything 3 (DA3) as its backbone. Instead of a VAE, it uses the DA3 encoder to map images to a multi-level feature hierarchy. The key finding is Boundary Layer Selection, the choice of which level the diffusion model should operate on:

Shallow layers (Level 0): High texture/color detail, low geometric consistency.
Deep layers (Level 2-3): High geometric abstraction, loss of fine-grained photometric detail.
The Sweet Spot (Level 1): The optimal boundary that provides enough spatial grounding for 3D consistency while retaining enough appearance info for high-fidelity decoding.

Figure: Geometric Latent Diffusion pipeline.

2. Cascaded Feature Synthesis

Synthesizing all four levels of a foundation model is expensive. GLD uses a "Propagate-and-Cascade" strategy:

Direct Synthesis: A diffusion model generates Level 1 features.
Propagation: Deeper features (Lv 2-3) are derived by simply passing Lv 1 through the frozen DA3 blocks.
Cascaded Generation: A smaller diffusion model ($M_{1 \to 0}$) generates the high-res Lv 0 features, conditioned on Lv 1 to ensure they don't drift apart.
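The three steps above can be sketched as a small orchestration function. This is only an illustrative sketch, not the authors' implementation: the names `propagate_and_cascade`, `sample_level1`, `frozen_blocks`, and `sample_level0` are hypothetical, features are toy lists of floats, and the assumption that Lv 3 is computed from Lv 2 (rather than directly from Lv 1) is ours.

```python
from typing import Callable, Dict, List

Feature = List[float]  # toy stand-in for a feature map


def propagate_and_cascade(
    sample_level1: Callable[[], Feature],
    frozen_blocks: Dict[int, Callable[[Feature], Feature]],
    sample_level0: Callable[[Feature], Feature],
) -> Dict[int, Feature]:
    """Assemble all feature levels in the order described in the text."""
    # 1) Direct synthesis: the main diffusion model produces Lv 1.
    level1 = sample_level1()
    levels: Dict[int, Feature] = {1: level1}

    # 2) Propagation: deeper levels come from frozen blocks, with no sampling.
    #    Assumption: blocks are applied sequentially (Lv 1 -> Lv 2 -> Lv 3).
    current = level1
    for idx in sorted(frozen_blocks):
        current = frozen_blocks[idx](current)
        levels[idx] = current

    # 3) Cascaded generation: a smaller model produces Lv 0 conditioned on Lv 1.
    levels[0] = sample_level0(level1)
    return levels


# Toy stand-ins (hypothetical; not real DA3 blocks or diffusion samplers).
toy_levels = propagate_and_cascade(
    sample_level1=lambda: [1.0, 2.0],
    frozen_blocks={
        2: lambda f: [x * 2 for x in f],
        3: lambda f: [x + 1 for x in f],
    },
    sample_level0=lambda cond: [x / 2 for x in cond],
)
```

The point of the structure is that only two of the four levels require sampling; the deeper levels cost a forward pass through frozen weights.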

Experiments: Breaking the "Pretraining" Dependency

One of the most striking results is that GLD, trained from scratch on relatively small datasets (RealEstate10K, DL3DV), outperforms or matches models like MVGenMaster and CAT3D which rely on the massive priors of Stable Diffusion.

Quantitative Superiority

| Metric | VAE (Scratch) | GLD (Ours) | Improvement |
| :--- | :--- | :--- | :--- |
| PSNR (higher is better) | 15.65 | 16.36 | +4.5% |
| ATE (lower is better) | 0.278 | 0.211 | -24% error |
| Training speed | 1.0x | 4.4x | Much faster |
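The Improvement column can be reproduced from the two score columns as a plain relative change. The helper name `relative_change` is ours. Note that PSNR is a logarithmic measure, so the +4.5% figure corresponds to an absolute gain of 0.71 and is best read as a loose summary.

```python
def relative_change(baseline: float, value: float) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Scores taken from the table: VAE (Scratch) vs. GLD.
psnr_change = relative_change(15.65, 16.36)
ate_change = relative_change(0.278, 0.211)

print(f"PSNR: {psnr_change:+.1f}%")  # PSNR: +4.5%
print(f"ATE:  {ate_change:+.1f}%")   # ATE:  -24.1%
```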

The 3D metrics (ATE, RPE) show that GLD's camera adherence is significantly more precise, meaning the generated views actually "stay in place" relative to the requested camera movement.
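For readers unfamiliar with ATE (absolute trajectory error), a minimal sketch follows. It computes the RMSE between estimated and reference camera positions and assumes the two trajectories are already expressed in the same coordinate frame; standard ATE evaluation usually aligns them first, and the exact protocol used in the paper is not stated in this summary. The function name `ate_rmse` and the toy trajectories are ours.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def ate_rmse(estimated: Sequence[Vec3], reference: Sequence[Vec3]) -> float:
    """RMSE of per-frame camera position error (no alignment step)."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - r) ** 2 for e, r in zip(est, ref))
        for est, ref in zip(estimated, reference)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: the requested camera path moves along x; the path recovered
# from generated views drifts slightly in y and z.
requested = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
recovered = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.0), (2.0, 0.0, 0.4)]
toy_ate = ate_rmse(recovered, requested)
```

A lower value means the cameras recovered from the generated views sit closer to the cameras that were requested.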

Zero-Shot 3D Reconstruction

Because GLD generates features in the DA3 space, the generated latents can be fed directly into the original DA3 depth head. Every generated image therefore comes with a depth map aligned to it at no extra training cost, which can be unprojected straight into a 3D point cloud.
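The unprojection step is standard pinhole-camera geometry and can be sketched as below. This is a generic illustration, not code from the paper: it assumes the depth map stores z-depth along the optical axis, that intrinsics `fx`, `fy`, `cx`, `cy` are known, and that non-positive depths mark invalid pixels. The function name `unproject_depth` is ours, and the output format of the DA3 depth head is not specified in this summary.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift a depth map to camera-frame 3D points with a pinhole model."""
    points: List[Point] = []
    for v, row in enumerate(depth):        # v: pixel row
        for u, z in enumerate(row):        # u: pixel column
            if z <= 0:                     # skip invalid depth (assumption)
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel.
toy_points = unproject_depth(
    [[2.0, 2.0], [0.0, 4.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5
)
```

The resulting points are in the camera frame; placing several views in a shared scene would additionally require each view's camera pose.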

Qualitative Comparison: Visual evidence shows GLD maintaining sharp edges and correct perspective even where VAE-based models hallucinate or blur.

Critical Insights & Limitations

The Power of Inductive Bias: This paper suggests that choosing the right "language" (latent space) for a task can be more powerful than just throwing more data or larger T2I priors at the problem.
Efficiency: The 4.4x training speedup suggests that the model spends less time learning "how 3D works" and more time learning "what the scene looks like."
Limitations: The model still struggles with extreme occlusions, that is, regions where the foundation model itself hasn't seen enough data to provide a reliable prior. Furthermore, the two-stage sampling (Lv 1 then Lv 0) adds some inference latency.

Conclusion

Geometric Latent Diffusion is a compelling argument for Task-Specific Foundation Latents. As we move toward more specialized AI applications (Robotics, AR/VR, Med-Tech), the era of using a one-size-fits-all 2D VAE may be coming to an end. GLD paves the way for generative models that aren't just "painting" 3D scenes, but truly "constructing" them.

