OVIE is a groundbreaking monocular novel-view synthesis framework that eliminates the need for multi-view image pairs during training. By leveraging a metric depth estimator to create "pseudo-target" views from 30 million unpaired internet images, it achieves state-of-the-art zero-shot generalization and operates at over 100 FPS.
TL;DR
Novel view synthesis (NVS) from a single image has finally broken free from the "multi-view pair" bottleneck. OVIE (One View Is Enough) proposes a training paradigm that uses a metric depth estimator to turn any single image into a training pair. Trained on 30 million "in-the-wild" images, OVIE achieves SOTA out-of-domain generalization while being 600x faster (100+ FPS) than existing high-performance baselines.
Background: The Scarcity of Multi-View Data
For years, monocular NVS models were "trust-fund babies"—they inherited their geometric understanding from expensive, specialized datasets like RealEstate10K or DL3DV. These datasets require calibrated, multi-view setups that simply don't exist for the vast majority of the internet's visual content.
Previous attempts to solve this involved:
- Synthetic data: High diversity but a massive "domain gap" from real photos.
- Generative Priors (Diffusion): Excellent quality but painfully slow (often taking seconds to generate one frame) and prone to "forgetting" the target pose.
- Warp-and-Inpaint: Highly dependent on the depth estimator at inference time; if the depth fails, the whole scene falls apart.
OVIE changes the game by proving that we can use monocular depth estimation as a training-time scaffold and then throw it away at inference.
Methodology: Mining 3D Knowledge from 2D Images
The core intuition behind OVIE is that monocular depth estimation has reached a tipping point of accuracy. While not perfect, it is "good enough" to act as a teacher.
1. The Pseudo-Pair Factory
At training time, the authors take an unlabeled image, lift it into a 3D point cloud using MoGe-2, and sample a new camera pose. Reprojecting the points into that pose yields a pseudo-target. This target is sparse (it has holes wherever the new camera sees regions that were occluded in the source view), but it provides a ground-truth signal for every pixel that is visible.
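To make the recipe concrete, here is a minimal NumPy sketch of the reprojection step. The pinhole intrinsics `K`, the pose convention, and the z-buffered splatting are my assumptions for illustration; the paper lifts the image with MoGe-2 rather than a raw depth map plus intrinsics.

```python
import numpy as np

def make_pseudo_pair(image, depth, K, R, t):
    """Reproject a single image into a sampled camera pose.

    image: (H, W, 3) uint8 source view
    depth: (H, W) metric depth (stand-in for a MoGe-2-style estimate)
    K:     (3, 3) pinhole intrinsics (illustrative assumption)
    R, t:  rotation (3, 3) and translation (3,) of the sampled target pose
    Returns (pseudo_target, valid_mask); disocclusion holes stay empty.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)

    # Lift every pixel to a 3D point in the source camera frame.
    pts = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)

    # Move the points into the sampled target camera frame.
    pts = pts @ R.T + t

    # Project back to pixels; keep only points in front of the camera.
    proj = (K @ pts.T).T
    z = proj[:, 2]
    front = z > 1e-6
    uv = np.round(proj[front, :2] / z[front, None]).astype(int)

    target = np.zeros_like(image)
    mask = np.zeros((H, W), dtype=bool)
    inb = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, zf = uv[inb], z[front][inb]
    colors = image.reshape(-1, 3)[front][inb]

    # Z-buffer: sort far-to-near so nearer points overwrite farther ones.
    order = np.argsort(-zf)
    target[uv[order, 1], uv[order, 0]] = colors[order]
    mask[uv[order, 1], uv[order, 0]] = True
    return target, mask
```

The returned `mask` is exactly what the masked losses in the next section consume: it marks which pixels of the pseudo-target carry a real supervision signal.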

2. Geometry-Free Inference
Unlike warping-based methods, OVIE's model (a Transformer-based architecture) takes only the source image and a target pose (a 7D vector). It doesn't look at depth maps or point clouds during inference. This "geometry-free" design pushes the model to learn the geometry of parallax and occlusion implicitly.
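A PyTorch sketch of what such an interface could look like. The layer sizes, the patchify/unpatchify scheme, treating the pose as a single extra token, and reading the 7D pose as translation-plus-quaternion are all assumptions on my part; the paper only specifies a Transformer that consumes a source image and a 7D pose.

```python
import torch
import torch.nn as nn

class GeometryFreeNVS(nn.Module):
    """Pose-conditioned, geometry-free synthesizer (illustrative sketch)."""

    def __init__(self, dim=512, patch=16, img=256):
        super().__init__()
        self.patch, self.img = patch, img
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # patchify
        self.pos = nn.Parameter(torch.zeros(1, n, dim))      # learned pos. emb.
        self.pose = nn.Linear(7, dim)                        # pose -> one token
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)
        self.decode = nn.Linear(dim, 3 * patch * patch)      # unpatchify

    def forward(self, src, pose7d):
        # src: (B, 3, img, img); pose7d: (B, 7), e.g. translation + quaternion
        B = src.shape[0]
        tokens = self.embed(src).flatten(2).transpose(1, 2) + self.pos
        tokens = torch.cat([self.pose(pose7d)[:, None], tokens], dim=1)
        feats = self.backbone(tokens)[:, 1:]                 # drop pose token
        g = self.img // self.patch
        out = self.decode(feats).view(B, g, g, 3, self.patch, self.patch)
        return out.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, self.img, self.img)
```

A forward pass is just `model(src, pose)` with `src` of shape (B, 3, 256, 256) and `pose` of shape (B, 7); no depth map or point cloud ever enters the graph.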
3. The Loss Trinity
To make the model robust to the "noisy" pseudo-targets, the authors combine three losses (a minimal sketch follows the list):
- Masked Reconstruction Loss: Only penalizes the model on pixels that were successfully reprojected.
- Masked Perceptual Loss (P-DINO/LPIPS): Ensures semantic consistency.
- Adversarial PatchGAN: Forces the model to hallucinate realistic textures in the "holes" (disocclusions) that the point cloud couldn't fill.
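A minimal sketch of the gating idea, reusing the `mask` from the pseudo-pair step above. The perceptual term (P-DINO/LPIPS) is masked the same way in spirit, and the discriminator's depth and channel widths here are my assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_l1(pred, target, mask):
    """L1 only on pixels the reprojected point cloud actually covered.

    pred/target: (B, 3, H, W); mask: (B, 1, H, W) bool. Holes carry no
    reconstruction penalty; filling them is left to the adversarial term.
    """
    m = mask.float()
    per_pixel = (F.l1_loss(pred, target, reduction="none") * m).sum()
    return per_pixel / (3.0 * m.sum().clamp(min=1.0))  # avoid divide-by-zero

class PatchGAN(nn.Sequential):
    """Tiny PatchGAN discriminator emitting per-patch real/fake logits."""

    def __init__(self):
        super().__init__(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),
        )
```

The division of labor is the point: the masked terms anchor the visible pixels to the pseudo-target, while the unmasked adversarial term is the only pressure on the disoccluded holes, so the model learns to inpaint them plausibly rather than copy noise.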
Experiments: Speed Meets Generalization
The results are clear: OVIE isn't just a research curiosity; it's a practical powerhouse.
SOTA Performance
On the DL3DV benchmark—a fair "blind test" for all models because none were trained on it—OVIE outperformed VIVID (a strong diffusion baseline) on every metric, including a 24% improvement in FID (realism).

Breakthrough Speed
The most impressive feat is the throughput. While diffusion models crawl at 0.02 to 0.19 FPS, OVIE blazes at 116 FPS. This makes real-time, interactive exploration of a single photograph possible for the first time.

Qualitative Analysis
OVIE demonstrates a remarkable "metric scale awareness." Since it was trained using a metric depth estimator, it understands that nearby objects should move faster than distant ones when the camera translates—a property often lost in models that treat scale as ambiguous. It even generalizes to paintings and historical archives where no 3D data could ever exist.
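As a back-of-the-envelope check using the standard pinhole model (my illustration, not the paper's): a lateral camera shift t_x moves a point at depth Z by roughly Δu = f · t_x / Z pixels. With focal length f = 500 px and t_x = 0.1 m, a point at 1 m slides 50 px across the frame while a point at 10 m slides only 5 px. A model trained on scale-ambiguous depth can get the 10:1 ratio right, but only a metric-scale model can map a translation specified in meters to the correct absolute pixel motion.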

Critical Insight & Conclusion
OVIE proves a fundamental thesis in modern AI: Scaling data is more important than specialized architectures. By converting the "abundance" of single images into "synthetic" multi-view pairs, the authors have tapped into a virtually infinite training signal.
Limitations: The model's quality is still tied to the quality of the offline depth estimator used for training. If the depth teacher is consistently wrong about a certain type of geometry, the model inherits that bias.
Future Work: Imagine this model combined with a world model for long-term consistency. OVIE provides the local 3D "common sense" that could empower the next generation of generative AI agents and interactive digital twins.
