GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator

WisPaper

学术搜索

学术问答

价格

TrueCite

工作空间

Home

Blog

GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator

[CVPR 2026] GaussFusion: Cleaning Up the 3D Wild with Geometry-Informed Video Generation

总结

问题

方法

结果

要点

摘要

GaussFusion is a geometry-informed video-to-video diffusion framework designed to refine 3D Gaussian Splatting (3DGS) reconstructions. By conditioning a video generator on a multi-modal "GP-Buffer," it achieves state-of-the-art novel view synthesis and enables real-time refinement at 16 FPS.

TL;DR

3D Gaussian Splatting (3DGS) has revolutionized real-time 3D rendering, but "in-the-wild" captures often result in messy reconstructions filled with "floaters" and blurry artifacts. GaussFusion fixes this by treating 3D refinement as a video-to-video translation task. By feeding a video generator not just color, but also depth, normals, and geometric uncertainty (the GP-Buffer), it transforms noisy 3DGS renders into photorealistic videos at a staggering 16 FPS.

The Problem: Why 3D Splats Still Look "Noisy"

Whether using traditional optimization or modern "feed-forward" models that predict Gaussians in one pass, 3D reconstruction often fails when data is sparse. We see:

Floaters: High-frequency noise floating in mid-air.
Needles: Elongated Gaussians caused by poor camera coverage.
Blur: Low-frequency oversmoothing in feed-forward models.

Previous attempts to use AI to "clean up" these renders mostly used RGB images. But color alone is ambiguous—is that blur a texture or a reconstruction error? Without geometric context, these models struggle to maintain 3D consistency.

Methodology: The Power of the GP-Buffer

The core innovation of GaussFusion is the Gaussian Primitives Buffer (GP-Buffer). Instead of just looking at the final image, the model looks at the "bones" of the reconstruction:

Color & Alpha: The standard appearance.
Depth & Normals: Explicit surface structure.
Inverse Covariance (Uncertainty): A map showing where the splats are "unstable" or over-stretched.

Gauss Fusion Architecture Figure 1: The GaussFusion pipeline. The GP-Buffer is encoded and injected into a Video Transformer (DiT) via a specialized Geometry Adapter.

The Geometry Adapter (GA)

Rather than simply concatenating pixels, GaussFusion uses a Geometry Adapter. This module acts as a "side-car" to a pretrained video model (Wan-2.1), injecting geometric features into the transformer blocks. This ensures the output images are strictly aligned with the underlying 3D math of the scene.

Experiments: Real-World Performance

The authors didn't just train on clean data; they built an Artifact Simulation Pipeline that mimics restricted views and pose errors.

SOTA Comparison

On benchmarks like DL3DV and RE10K, GaussFusion delivers a massive jump in quality. It doesn't just "filter" the image; it contextually inpaints missing regions and sharpens textures.

Experimental Results Comparison Figure 2: Qualitative comparison. Notice how GaussFusion removes spiky artifacts and "floaters" that other methods (like MVSplat360 or ExploreGS) miss.

Efficiency: Real-Time Refinement

One of the biggest hurdles for diffusion models is speed. By using Distribution Matching Distillation (DMD), the team compressed the model from 30-50 steps down to just 4 steps. This allows GaussFusion to run at 16 FPS, making it viable for interactive applications where a user could see a "cleaned" version of their 3D scan almost instantly.

Critical Insight: Paradigm Agnosticism

What makes GaussFusion stand out is that it is paradigm-agnostic. It doesn't care if your 3DGS came from a 10-hour optimization or a 10-millisecond feed-forward prediction. By exposing the model to "Cross-Paradigm" artifacts during training, it becomes a universal "3D Post-Processor."

Conclusion & Future Look

GaussFusion proves that the secret to better 3D isn't just better radiance fields, but smarter generative priors that understand geometry.

Takeaway: The GP-Buffer is a superior conditioning signal for 3D tasks.
Limitation: Very rapid motion or severe blur still poses a challenge for the underlying VAE.
Future: We might soon see video models that don't just refine 3DGS, but directly generate and update the 3D parameters in real-time.

Paper: GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator (CVPR 2026)

发现相似论文

试试这些示例

Search for recent papers that utilize multi-modal buffers beyond RGB-D, such as surface normals or covariance maps, for 3D scene refinement or denoising.
Which studies first introduced the use of video diffusion models for 3D Gaussian Splatting artifact removal, and how does their conditioning strategy compare to GaussFusion's Geometry Adapter?
Identify research exploring the application of real-time video-to-video distillation techniques, like Distribution Matching Distillation (DMD), in the context of interactive 3D reconstruction and rendering.

[CVPR 2026] GaussFusion: Cleaning Up the 3D Wild with Geometry-Informed Video Generation

1. TL;DR

2. The Problem: Why 3D Splats Still Look "Noisy"

3. Methodology: The Power of the GP-Buffer

3.1. The Geometry Adapter (GA)

4. Experiments: Real-World Performance

4.1. SOTA Comparison

4.2. Efficiency: Real-Time Refinement

5. Critical Insight: Paradigm Agnosticism

6. Conclusion & Future Look