SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting


[CVPR 2025] SR3R: Rethinking 3D Super-Resolution via Feed-Forward Gaussian Splatting

Abstract

This paper introduces SR3R, a feed-forward framework for 3D Super-Resolution (3DSR) built on 3D Gaussian Splatting (3DGS). It reformulates 3DSR as a direct mapping from sparse low-resolution (LR) views to high-resolution (HR) 3DGS representations, achieving state-of-the-art (SOTA) performance and strong zero-shot generalization without per-scene optimization.

Executive Summary

TL;DR: SR3R (Super-Resolution 3D Reconstruction) recasts 3D super-resolution from a slow, per-scene optimization process into a single feed-forward mapping. By training on large-scale datasets, it learns "3D-native" high-frequency textures that generic 2D super-resolution models miss, enabling high-fidelity 3D reconstruction from as few as two low-resolution (LR) views.

Background: Most 3D Gaussian Splatting (3DGS) models require dense, high-resolution inputs. Existing 3DSR methods work around this by using 2D upsamplers to generate pseudo-HR labels for supervision, but this introduces view inconsistency and does not scale. SR3R instead moves toward generalized, data-driven 3D refinement.

Problem & Motivation: The "2D Prior" Ceiling

Why can't we just use a 2D Super-Resolution (2DSR) model and then run 3DGS?

View Inconsistency: 2D models process each image independently; textures "flicker" or shift when projected into 3D.
Optimization Bottleneck: Current 3DSR requires 5-10 minutes of optimization per scene.
Pre-defined Priors: 2DSR models are trained on natural images, not multi-view 3D layouts, leading to hallucinated artifacts that don't satisfy geometric constraints.

Methodology: The Core Architecture

SR3R's innovation lies in its three-stage pipeline that is entirely feed-forward.

1. The Gaussian Shuffle Split (Scaffold)

Instead of starting from scratch, SR3R takes a coarse LR 3DGS (from a backbone like DepthSplat) and "densifies" it. Through a Shuffle Split operation, each LR Gaussian is split into six sub-Gaussians. This provides a structural "scaffold" for the network to refine.
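The densification step can be sketched in NumPy. This is only an illustration under stated assumptions: the text above says each LR Gaussian becomes six sub-Gaussians but does not say where they are placed, so the rule used here (one child along each of the ±x, ±y, ±z directions, with halved scale) is hypothetical, as are the names `split_gaussians`, `DIRECTIONS`, and `spread`.

```python
import numpy as np

# Six unit directions (+/- x, y, z). ASSUMPTION: the summary does not specify
# how the six children are placed; this rule is purely illustrative.
DIRECTIONS = np.array(
    [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]],
    dtype=np.float32,
)

def split_gaussians(means, scales, opacities, colors, spread=0.5):
    """Split each coarse Gaussian into six children (hypothetical scaffold rule).

    means, scales, colors: arrays of shape (N, 3); opacities: shape (N,).
    Returns arrays for 6 * N child Gaussians, grouped by parent.
    """
    n = means.shape[0]
    k = DIRECTIONS.shape[0]
    # Move each child along one axis by a fraction of the parent's scale.
    offsets = DIRECTIONS[None, :, :] * scales[:, None, :] * spread  # (N, 6, 3)
    child_means = (means[:, None, :] + offsets).reshape(n * k, 3)
    # Children inherit the parent's appearance; halving the scale is an assumption.
    child_scales = np.repeat(scales * 0.5, k, axis=0)
    child_opacities = np.repeat(opacities, k, axis=0)
    child_colors = np.repeat(colors, k, axis=0)
    return child_means, child_scales, child_opacities, child_colors
```

Because the children are placed symmetrically, their centroid coincides with the parent's center, so the scaffold preserves the coarse geometry while giving the refinement network more primitives to adjust.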

2. ViT-based Feature Refinement

The model extracts features from the LR views using a ViT encoder. Crucially, it uses bidirectional cross-attention to align these 2D features with the 3D-aware tokens from the reconstruction backbone. This suppresses 2D artifacts and ensures the features are "geometry-ready."
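A minimal sketch of bidirectional cross-attention, assuming single-head scaled dot-product attention with no learned projections. The real model presumably uses learned query/key/value weights, multiple heads, and normalization layers; the function names below are hypothetical and the code only shows the two-way information flow between image tokens and 3D-aware tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Queries read from context via scaled dot-product attention.

    queries: (Nq, D), context: (Nc, D). Returns the (Nq, D) update and
    the (Nq, Nc) attention weights.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context, weights

def bidirectional_cross_attention(img_tokens, geo_tokens):
    """Each token set attends to the other; updates are added residually."""
    img_update, _ = cross_attend(img_tokens, geo_tokens)  # 2D reads from 3D
    geo_update, _ = cross_attend(geo_tokens, img_tokens)  # 3D reads from 2D
    return img_tokens + img_update, geo_tokens + geo_update
```

Both directions are computed from the original inputs, so neither token set sees the other's update within the same layer; whether SR3R does this or applies the two directions sequentially is not stated in this summary.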

3. Gaussian Offset Learning

This is the "secret sauce." Instead of predicting the absolute position or color of Gaussians (a high-variance, multi-modal problem), the network predicts residual offsets ($\Delta G$) to the scaffold using PointTransformerV3.
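The residual formulation amounts to adding a predicted offset to every scaffold attribute. The sketch below assumes the Gaussians are stored as a dictionary of arrays and that offsets are purely additive; `apply_offsets` and the attribute names are hypothetical, and the offset predictor itself (PointTransformerV3 in the paper) is out of scope here.

```python
import numpy as np

def apply_offsets(scaffold, offsets):
    """Refine the scaffold by adding a residual to each attribute.

    scaffold, offsets: dicts mapping attribute name -> array of equal shape.
    ASSUMPTION: offsets are additive for every attribute.
    """
    return {name: scaffold[name] + offsets[name] for name in scaffold}

# Toy scaffold with 4 Gaussians (attribute names are illustrative).
scaffold = {"means": np.zeros((4, 3)), "colors": np.full((4, 3), 0.5)}

# With all-zero offsets the output equals the scaffold, so an untrained
# predictor already starts from a valid coarse reconstruction.
zero_offsets = {name: np.zeros_like(value) for name, value in scaffold.items()}
refined = apply_offsets(scaffold, zero_offsets)
```

This is the usual argument for residual learning: the network only has to model small corrections around a reasonable initialization rather than the full attribute distribution.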

Figure 1: The SR3R framework, showing the transition from LR views to a refined HR 3DGS through offset learning.

Experiments & SOTA Results

SR3R was benchmarked against feed-forward baselines such as NoPoSplat and DepthSplat (with upsampling) and against per-scene optimizers such as SRGS.

Quantitative Dominance: On the ACID dataset, SR3R achieved a PSNR of 27.018 dB, about 1.7 dB higher than the upsampled baseline (25.315 dB).
Zero-Shot Generalization: Perhaps the most impressive feat is SR3R's performance on the DTU and ScanNet++ datasets. Even without seeing these scenes during training, it outperformed the specialized per-scene optimization method FSGS+SRGS.
Speed: SR3R takes ~1.69 seconds for inference, compared to 300+ seconds for optimization-based methods.
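To put the reported numbers in context, the sketch below defines PSNR in the standard way and derives two quantities from the figures quoted above: the PSNR gap on ACID and a lower bound on the speedup. The 300-second figure is the "300+" bound from the text, so the true speedup may be larger; `psnr` and the variable names are illustrative.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Figures quoted in the results above.
psnr_gain_db = 27.018 - 25.315          # SR3R vs. upsampled baseline on ACID
mse_ratio = 10 ** (-psnr_gain_db / 10)  # implied ratio of mean squared errors
speedup = 300.0 / 1.69                  # lower bound, since baselines take 300+ s

print(f"PSNR gain: {psnr_gain_db:.3f} dB")
print(f"Implied MSE ratio: {mse_ratio:.3f}")
print(f"Speedup: at least {speedup:.1f}x")
```

Because PSNR is logarithmic, a gain of about 1.7 dB corresponds to roughly a one-third reduction in mean squared error, which is why gaps of this size are treated as substantial.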

Figure 2: Qualitative comparison showing SR3R recovering significantly sharper textures and cleaner edges than existing feed-forward baselines.

Critical Analysis & Conclusion

Takeaway

SR3R's results indicate that residual learning in 3D space is more effective than image-space upsampling. By constraining the problem to "offsets" on a densified scaffold, the model gains the training stability needed to recover high-frequency detail.

Limitations

While fast and accurate, SR3R does add moderate computational overhead compared to "base" feed-forward models (e.g., higher memory usage due to densification). It also currently focuses on 4x upscaling; arbitrary scaling factors might require further architectural flexibility.

Future Outlook

This work paves the way for high-quality 3D content creation on mobile devices or via low-bandwidth streaming, where only sparse LR data can be transmitted, but high-fidelity 3D interaction is required at the edge.
