This paper introduces SR3R, a novel feed-forward framework for 3D Super-Resolution (3DSR) built on 3D Gaussian Splatting (3DGS). It reformulates 3DSR as a direct mapping from sparse low-resolution (LR) views to high-resolution (HR) 3DGS representations, achieving state-of-the-art (SOTA) performance and strong zero-shot generalization without per-scene optimization.
Executive Summary
TL;DR: SR3R (Super-Resolution 3D Reconstruction) turns 3D super-resolution from a slow, per-scene optimization process into a fast feed-forward mapping. By training on large-scale datasets, it learns "3D-native" high-frequency textures that generic 2D super-resolution models miss, enabling high-fidelity 3D reconstruction from as few as two LR views.
Background: Most 3D Gaussian Splatting (3DGS) models require dense, high-resolution inputs. Current 3DSR methods work around this by using 2D upsamplers to generate pseudo-HR labels for supervision, but this introduces view inconsistency and scales poorly because each scene must be optimized separately. SR3R instead moves toward generalized, data-driven 3D refinement.
Problem & Motivation: The "2D Prior" Ceiling
Why can't we just use a 2D Super-Resolution (2DSR) model and then run 3DGS?
- View Inconsistency: 2D models process each image independently; textures "flicker" or shift when projected into 3D.
- Optimization Bottleneck: Current 3DSR methods require 5-10 minutes of optimization per scene.
- Pre-defined Priors: 2DSR models are trained on natural images, not multi-view 3D layouts, leading to hallucinated artifacts that don't satisfy geometric constraints.
Methodology: The Core Architecture
SR3R's main innovation is a three-stage pipeline that is entirely feed-forward.
1. The Gaussian Shuffle Split (Scaffold)
Instead of starting from scratch, SR3R takes a coarse LR 3DGS (from a backbone like DepthSplat) and "densifies" it. Through a Shuffle Split operation, each LR Gaussian is split into six sub-Gaussians. This provides a structural "scaffold" for the network to refine.
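The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text says each LR Gaussian becomes six sub-Gaussians, but the placement rule used here (one child along the positive and negative direction of each axis), the `shift` and `shrink` factors, the axis-aligned scales, and the names `Gaussian`, `shuffle_split`, and `densify` are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Gaussian:
    mean: Vec3      # 3D center
    scale: Vec3     # per-axis extent (axis-aligned here; rotation is ignored)
    opacity: float
    color: Vec3


def shuffle_split(g: Gaussian, shift: float = 0.5, shrink: float = 0.5) -> List[Gaussian]:
    """Split one Gaussian into six children (assumed rule: +/- along each axis)."""
    children = []
    for axis in range(3):
        for sign in (-1.0, 1.0):
            mean = list(g.mean)
            # Move the child center along one axis, proportional to the parent's extent.
            mean[axis] += sign * shift * g.scale[axis]
            # Children are smaller than the parent and inherit its appearance.
            scale = tuple(s * shrink for s in g.scale)
            children.append(Gaussian(tuple(mean), scale, g.opacity, g.color))
    return children


def densify(gaussians: List[Gaussian]) -> List[Gaussian]:
    """Build the scaffold: every input Gaussian contributes six sub-Gaussians."""
    return [child for g in gaussians for child in shuffle_split(g)]
```

Whatever the exact rule, the scaffold has six times as many primitives as the coarse reconstruction, which is where the extra capacity for high-frequency detail comes from.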
2. ViT-based Feature Refinement
The model extracts features from the LR views using a ViT encoder. Crucially, it uses bidirectional cross-attention to align these 2D features with the 3D-aware tokens from the reconstruction backbone. This suppresses 2D artifacts and ensures the features are "geometry-ready."
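To make "bidirectional cross-attention" concrete, here is a minimal pure-Python sketch. It is a simplification under assumptions, not the paper's layer: learned query/key/value projections, multiple heads, and normalization are omitted, both directions read from the original inputs, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # one row per token


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, context: Matrix) -> Matrix:
    """Each query token is updated with an attention-weighted mix of context tokens."""
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product similarity between this query and every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        # Residual connection: keep the original token and add the attended context.
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def bidirectional_cross_attention(feats_2d: Matrix, tokens_3d: Matrix) -> Tuple[Matrix, Matrix]:
    """2D features attend to 3D-aware tokens, and 3D-aware tokens attend to 2D features."""
    new_2d = cross_attend(feats_2d, tokens_3d)
    new_3d = cross_attend(tokens_3d, feats_2d)
    return new_2d, new_3d
```

The point of running attention in both directions is that each stream is conditioned on the other: the image features pick up geometric context, and the geometry tokens pick up image detail.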
3. Gaussian Offset Learning
This is the "secret sauce." Instead of predicting the absolute position or color of Gaussians (a high-variance, multi-modal problem), the network uses PointTransformerV3 to predict residual offsets that are added to the scaffold.
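The residual formulation can be illustrated with a small sketch. The offsets are passed in as plain numbers here; in SR3R they would come from the network, which this example does not model. The parameter names (`mean`, `opacity`) and the helper `apply_offsets` are assumptions made for illustration.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flat list of values


def apply_offsets(scaffold: Params, offsets: Params) -> Params:
    """Refined parameter = scaffold parameter + predicted residual, element-wise."""
    return {
        name: [s + d for s, d in zip(values, offsets[name])]
        for name, values in scaffold.items()
    }


scaffold = {"mean": [1.0, 2.0, 3.0], "opacity": [0.5]}

# With zero offsets the output is exactly the scaffold, so an untrained or
# uncertain predictor degrades gracefully to the coarse reconstruction.
zero = {"mean": [0.0, 0.0, 0.0], "opacity": [0.0]}
unchanged = apply_offsets(scaffold, zero)

# Small residuals nudge the scaffold instead of replacing it.
refined = apply_offsets(scaffold, {"mean": [0.5, -1.0, 0.0], "opacity": [0.25]})
```

This is why the residual target is easier to learn than absolute values: the network only has to model a correction around an already plausible starting point.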
Figure 1: The SR3R framework, showing the transition from LR views to a refined HR 3DGS through offset learning.
Experiments & SOTA Results
SR3R was benchmarked against strong feed-forward baselines, including NoPoSplat and DepthSplat (with upsampling), and against per-scene optimizers such as SRGS.
- Quantitative Dominance: On the ACID dataset, SR3R achieved a PSNR of 27.018 dB, about 1.7 dB higher than the upsampled baseline (25.315 dB).
- Zero-Shot Generalization: Perhaps the most impressive feat is SR3R's performance on the DTU and ScanNet++ datasets. Even without seeing these scenes during training, it outperformed the specialized per-scene optimization method FSGS+SRGS.
- Speed: SR3R takes ~1.69 seconds for inference, compared to 300+ seconds for optimization-based methods.
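The numbers above are easier to interpret with a little arithmetic. The sketch below uses the standard PSNR definition and the figures quoted in this section. The conversion from a PSNR gap to an error ratio assumes both scores come from a single mean squared error, which is only approximately true when benchmarks average PSNR per image, and the 300-second figure is treated as a lower bound for the optimization-based methods.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


sr3r_psnr = 27.018      # ACID, from the text
baseline_psnr = 25.315  # upsampled baseline, from the text

psnr_gain = sr3r_psnr - baseline_psnr  # about 1.70 dB

# A gap of g dB corresponds to an MSE ratio of 10^(g/10).
error_ratio = 10 ** (psnr_gain / 10)   # about 1.48: baseline error is roughly 1.5x larger

# Inference time versus the quoted lower bound for per-scene optimization.
speedup = 300 / 1.69                   # about 177x, and higher for slower optimizers

print(f"PSNR gain: {psnr_gain:.3f} dB")
print(f"Approximate MSE ratio (baseline / SR3R): {error_ratio:.2f}")
print(f"Speedup over a 300 s optimizer: {speedup:.0f}x")
```

Read this way, the 1.7 dB gap corresponds to roughly a one-third reduction in squared error, and the runtime difference is about two orders of magnitude.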
Figure 2: Qualitative comparison showing SR3R recovering significantly sharper textures and cleaner edges than existing feed-forward baselines.
Critical Analysis & Conclusion
Takeaway
SR3R provides strong evidence that residual learning in 3D space is more effective than image-space upsampling. By constraining the problem to "offsets" on a densified scaffold, the model gains the training stability it needs to recover high-frequency detail.
Limitations
While fast and accurate, SR3R does add moderate computational overhead compared to "base" feed-forward models (e.g., higher memory usage due to densification). It also currently focuses on 4x upscaling; arbitrary scaling factors might require further architectural flexibility.
Future Outlook
This work paves the way for high-quality 3D content creation on mobile devices or via low-bandwidth streaming, where only sparse LR data can be transmitted but high-fidelity 3D interaction is still required at the edge.
