[CVPR 2026] 2K Retrofit: Scaling 3D Foundation Models to 2K Resolution via Entropy-Guided Sparsity
Abstract

The paper introduces 2K Retrofit, a model-agnostic sparse refinement framework that enables existing 3D foundation models (e.g., Depth Anything, VGGT) to perform 2K-resolution geometric prediction. It achieves state-of-the-art accuracy in depth and pointmap estimation while being significantly more efficient than dense inference, reaching 8.1 FPS for 2K depth.

TL;DR

High-resolution 3D reconstruction is crucial for autonomous driving and AR, but 2K-resolution dense inference is a memory nightmare. 2K Retrofit offers a "plug-and-play" solution: it freezes your favorite 3D foundation model (like Depth Anything v2 or VGGT), identifies high-error regions using an entropy-based selector, and refines only those sparse pixels. The result? 2K-level geometric precision at 8.1 FPS, with zero backbone retraining.

Background: The Resolution Wall

While 3D foundation models have revolutionized zero-shot depth estimation, they are often "trapped" in low-resolution training regimes (typically sub-1K). Moving to 2K resolution usually implies a quadratic explosion in FLOPs and memory. Previous attempts—such as tile-based processing—often result in inconsistent "seams" or redundant computation on flat, low-frequency surfaces (like walls) where high resolution adds little value.
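The quadratic blow-up is easy to see with a back-of-envelope count of ViT patch tokens. The resolutions and patch size below are illustrative assumptions (a DINOv2-style 518px crop vs. a 2K-class input), not figures from the paper:

```python
def vit_cost_units(h, w, patch=14):
    """Relative self-attention cost: quadratic in the number of patch tokens."""
    tokens = (h // patch) * (w // patch)
    return tokens, tokens ** 2

lo_tokens, lo_cost = vit_cost_units(518, 518)     # typical low-res training crop
hi_tokens, hi_cost = vit_cost_units(1536, 2048)   # 2K-class input

print(f"tokens: {lo_tokens} -> {hi_tokens} ({hi_tokens / lo_tokens:.1f}x)")
print(f"attention cost: {hi_cost / lo_cost:.0f}x")  # roughly two orders of magnitude
```

Roughly 11.6x more tokens means roughly 135x more attention compute, before memory for activations is even counted.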

The Core Insight: Error Sparsity

The authors observe that the delta between a low-resolution "coarse" prediction and a true 2K "fine" prediction isn't uniform. Errors are concentrated at semantic boundaries and thin structures (e.g., cables, handles). By focusing computation only on these "uncertain" regions, we can bypass the "Resolution Wall."
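A toy NumPy check makes the observation concrete: upsample a coarse prediction back to full resolution and the residual is near-zero on flat regions, concentrated at the discontinuity. The data here is synthetic, purely for illustration:

```python
import numpy as np

H = W = 64
depth = np.ones((H, W)); depth[:, 30:] = 2.0      # one sharp depth edge

coarse = depth[::4, ::4]                          # 4x downsampled "prediction"
up = np.repeat(np.repeat(coarse, 4, 0), 4, 1)     # nearest-neighbor upsample

residual = np.abs(depth - up)
print(f"{(residual > 0).mean():.1%} of pixels carry any residual")  # → 3.1%
```

Even in this crude setup, over 96% of pixels are already correct for free; only the band around the edge needs work.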

Methodology: The Sparse Pipeline

The 2K Retrofit architecture consists of a three-stage refinement loop:

  1. Coarse Initialization: The 2K image is downsampled, processed by a frozen foundation model (F), and upsampled via nearest-neighbor interpolation to create a base geometric map.
  2. Entropy-Based Selection: Instead of a complex learnable selector, the authors use the entropy of the backbone's head features. High entropy correlates strongly with geometric ambiguity (boundaries). This selector flags the top ~10% of pixels for refinement.
  3. Sparse Refinement & Gated Fusion: A MinkowskiUNet (designed for sparse data) processes these pixels. Finally, a gated mechanism decides whether to trust the global coarse prediction or the local refined one.
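The three stages above can be walked through end-to-end on toy data. Everything here is a stand-in: the "backbone" is just box-filter downsampling, the selection score is a local-variance proxy for feature entropy, and the "refiner" is an oracle — the point is the control flow, not the paper's actual modules:

```python
import numpy as np

H = W = 32
truth = np.ones((H, W)); truth[:, 18:] = 2.0            # scene with one sharp edge

# Stage 1: coarse initialization (frozen model at 1/4 res + nearest upsampling)
coarse_lr = truth.reshape(H // 4, 4, W // 4, 4).mean(axis=(1, 3))
coarse = np.repeat(np.repeat(coarse_lr, 4, 0), 4, 1)

# Stage 2: entropy-based selection. As a proxy for head-feature entropy we
# score each pixel by 3x3 local variance (high near the edge, zero on flats).
pad = np.pad(coarse, 1, mode="edge")
local = np.stack([pad[i:i+H, j:j+W] for i in range(3) for j in range(3)])
score = local.var(axis=0)
k = int(0.10 * H * W)                                   # refine top ~10% of pixels
sel = np.zeros(H * W, bool)
sel[np.argpartition(score.ravel(), -k)[-k:]] = True
sel = sel.reshape(H, W)

# Stage 3: sparse refinement + gated fusion. Here the refiner is an oracle
# (snaps selected pixels to truth) and the gate is the selection mask itself.
refined = np.where(sel, truth, coarse)

err_before = np.abs(coarse - truth).mean()
err_after = np.abs(refined - truth).mean()
print(f"mean error: {err_before:.4f} -> {err_after:.4f}, refined {sel.mean():.0%} of pixels")
```

Touching only ~10% of pixels cuts the mean error substantially, which is exactly the bet the sparse pipeline makes.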

Fig. 2: The 2K Retrofit pipeline, showing the transition from coarse global estimates to sparse, high-fidelity refinement.

Experimental Triumphs

The researchers tested 2K Retrofit across monocular depth (ARKitScenes, ScanNet++) and multi-view pointmap tasks (ETH3D).

  • Efficiency: Compared to retraining a backbone like VGGT at 2K, 2K Retrofit provides a 17x speedup while reducing GFLOPs from 495 to 172 (a ~2.9x cut in compute).
  • Accuracy: It consistently outperforms "Patch-based" SOTA like PatchRefiner and PRO, particularly in maintaining global consistency while capturing razor-sharp edges.

Fig. 3: Qualitative comparison on ETH3D. Note how 2K Retrofit (right) recovers fine structures like chair legs and thin poles that foundation models often blur (middle).

Critical Analysis: Why It Works

The "magic" lies in the Entropy Selector. By deriving uncertainty from the latent features before the final regression, the model taps into the backbone's internal doubt. Using a sparse Minkowski Engine for the refinement branch is a brilliant move—it treats high-resolution pixels like a sparse 3D point cloud, making it incredibly lightweight compared to standard dense CNNs.
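One plausible reading of "internal doubt" is channel-wise entropy: softmax the head features over channels and a peaked distribution (the backbone is sure) gives low entropy, a flat one (ambiguous geometry) gives high entropy. The shapes and the softmax-over-channels choice below are my assumptions, not the paper's exact formulation:

```python
import numpy as np

def channel_entropy(feats):
    """feats: (C, H, W) logits -> (H, W) per-pixel entropy in nats."""
    e = np.exp(feats - feats.max(axis=0, keepdims=True))   # numerically stable softmax
    p = e / e.sum(axis=0, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=0)

C = 8
feats = np.zeros((C, 1, 2))
feats[0, 0, 0] = 10.0      # pixel 0: one dominant channel -> confident, low entropy
# pixel 1: all-zero logits -> uniform distribution -> max entropy, log(C) ≈ 2.079

ent = channel_entropy(feats)
print(ent[0, 0], ent[0, 1])
```

Thresholding or top-k on such a map is cheap, requires no extra training signal, and naturally fires on the boundaries and thin structures where the coarse prediction breaks down.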

Limitations: The model still faces challenges in textureless or highly reflective regions (e.g., glass, mirrors) where neither the coarse nor the refined features provide sufficient geometric cues.

Conclusion & Future Impact

2K Retrofit is a significant step toward making high-resolution 3D vision practical for real-time edge deployment. By treating high resolution as a sparse correction problem rather than a dense reconstruction problem, it paves the way for 4K or even 8K geometric perception in the near future.


Keywords: High-Resolution Depth, Sparse Refinement, 3D Foundation Models, Minkowski Convolutions, Autonomous Driving.
