[arXiv 2026] CAM3R: Breaking the Pinhole Ceiling with Camera-Agnostic 3D Reconstruction
Abstract

CAM3R is a camera-agnostic feed-forward framework for dense 3D reconstruction from unposed and uncalibrated images. It outperforms existing foundation models like DUSt3R by explicitly decoupling camera-specific ray estimation from scene geometry, achieving state-of-the-art results across pinhole, fisheye, and panoramic modalities.

TL;DR

Existing 3D vision foundation models like DUSt3R are "trapped" in a pinhole world. When fed with fisheye or 360° panoramic images, they collapse because they assume light travels in a rigid perspective grid. CAM3R solves this by decoupling the camera's optical manifold from the scene's geometry. By learning to "see" along rays rather than fixed pixels, it provides robust, calibration-free 3D reconstruction for virtually any lens type.

The "Pinhole Bias" Problem

Most 3D datasets (like MegaDepth) are perspective-heavy. Consequently, state-of-the-art models learn an implicit inductive bias toward pinhole geometry. When these models encounter wide-angle imagery:

  1. Rectification Artefacts: Traditional "undistorting" stretches pixels aggressively at the edges, losing information.
  2. Geometric Collapse: Direct regression models entangle lens distortion with depth, leading to warped point clouds where straight walls appear bent.

CAM3R identifies that the fundamental objective isn't just predicting (X, Y, Z), but determining the direction and distance of light for every pixel.

Methodology: The Core of CAM3R

The architecture is elegantly split into two specialized modules that handle the "How" of the camera and the "What" of the scene separately.

1. The Ray Module (RM)

Instead of assuming a camera matrix, the RM learns a continuous ray field. It uses Spherical Harmonic (SH) expansion to output coefficients that define a per-pixel unit vector $d_i(u)$. This allows the model to map pixels to 3D directions for any lens—be it a standard iPhone lens or a 360° RICOH Theta.
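As a rough sketch of the idea (not the paper's exact head design), a per-pixel direction field can be written as predicted coefficients over a low-order spherical-harmonic-style basis evaluated on normalized image coordinates, then normalized to unit length. The basis and coefficient shapes below are illustrative assumptions:

```python
import numpy as np

def sh_basis(u, v):
    """Low-order real SH-style basis on normalized image coordinates
    (u, v) in [-1, 1].  Illustrative: the paper's basis may differ."""
    return np.stack([
        np.ones_like(u),        # l=0 constant term
        u, v,                   # l=1 linear terms
        u * v, u**2 - v**2,     # l=2 quadratic terms
    ], axis=-1)                 # (..., 5)

def ray_field(coeffs, H, W):
    """Map every pixel to a unit 3D direction d_i(u).

    coeffs: (5, 3) basis coefficients (one 3-vector per basis function),
            standing in for the Ray Module's per-image prediction.
    Returns: (H, W, 3) unit direction vectors."""
    v, u = np.meshgrid(
        np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    B = sh_basis(u, v)          # (H, W, 5)
    d = B @ coeffs              # (H, W, 3)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)

# Toy coefficients: roughly a forward-facing, pinhole-like ray bundle.
coeffs = np.zeros((5, 3))
coeffs[0] = [0.0, 0.0, 1.0]     # constant forward component
coeffs[1] = [0.5, 0.0, 0.0]     # x grows with u
coeffs[2] = [0.0, 0.5, 0.0]     # y grows with v
dirs = ray_field(coeffs, 4, 4)
print(dirs.shape)                                       # (4, 4, 3)
print(np.allclose(np.linalg.norm(dirs, axis=-1), 1.0))  # True
```

The point of the continuous basis is that nothing in `ray_field` assumes a pinhole: fisheye or panoramic ray bundles are just different coefficient sets over the same basis.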

2. The Cross-view Module (CVM)

The CVM doesn't waste capacity learning lens distortions. Instead, it focuses on Radial Distance ($r_i$) and relative pose between images. The final 3D point is simply the product: $X(u) = d_i(u) \cdot r_i(u)$.
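In code, that composition is a one-liner; the arrays below are hypothetical stand-ins for the two modules' outputs:

```python
import numpy as np

# Hypothetical outputs for a tiny 2x2 image (real values would come
# from the Ray Module and Cross-view Module heads).
d = np.array([[[0.0, 0.0, 1.0], [0.6, 0.0, 0.8]],
              [[0.0, 0.6, 0.8], [0.0, 0.0, 1.0]]])   # (2, 2, 3) unit rays
r = np.array([[2.0, 5.0],
              [5.0, 3.0]])                           # (2, 2) radial distances

X = d * r[..., None]   # X(u) = d_i(u) * r_i(u), shape (2, 2, 3)
print(X[0, 0])         # [0. 0. 2.] -> a point 2 units straight ahead
```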

Fig 1: The CAM3R pipeline showing the decoupled Ray and Cross-view modules.

3. Ray-Aware Global Alignment

In multi-view scenarios, standard Bundle Adjustment assumes a fixed, typically pinhole, pixel-to-ray mapping. CAM3R instead introduces a ray-aware optimization: during pose refinement it freezes the predicted ray directions and optimizes only the scale and the distance along those rays. This ray-consistency constraint prevents the optimizer from corrupting local geometry while minimizing global error.
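A toy illustration of the ray-consistency idea, on synthetic data and with a single global scale (the paper's optimizer handles full multi-view poses): projecting target points back onto frozen rays recovers per-pixel distances and the scale without ever bending a ray:

```python
import numpy as np

def ray_aware_refine(d, r, X_target):
    """Toy ray-aware refinement step: ray directions d stay frozen; we
    solve only for per-pixel distances along those rays and one global
    scale relating them to the initial prediction r.
    Illustrative least-squares, not the paper's full optimizer."""
    # Each target point is projected onto its frozen unit ray:
    r_refined = np.einsum("nc,nc->n", X_target, d)   # distance along ray
    # Single scale, closed-form least squares: argmin_s ||s*r - r_refined||
    s = float(r_refined @ r / (r @ r))
    return s, r_refined

rng = np.random.default_rng(0)
d = rng.normal(size=(100, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)        # frozen unit rays
r = rng.uniform(1.0, 5.0, size=100)                  # predicted distances
X_target = d * (2.0 * r)[:, None]                    # targets at 2x scale

s, r_ref = ray_aware_refine(d, r, X_target)
print(round(s, 3))   # 2.0 -> scale recovered, local geometry untouched
```

Because every refined point remains on its original ray by construction, global alignment cannot introduce the warped, bent-wall artifacts that free 3D optimization can.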

Performance: SOTA Across Modalities

The results are most striking when looking at Cross-Modal performance. While models like π³ or DUSt3R perform well on standard pinhole data (MegaDepth), their accuracy falls to near zero on panoramic datasets like 360Loc.

Fig 2: Qualitative comparison showing CAM3R's ability to preserve structural integrity (green) where baselines (red) produce curved, non-physical geometry.

| Dataset | Model | RRA@15 (Rotation Accuracy) | RTA@15 (Translation Accuracy) |
| :--- | :--- | :---: | :---: |
| 2D3DS (Panorama) | DUSt3R | 10.6% | 6.0% |
| | CAM3R | 97.7% | 94.3% |
| CO3Dv2 (Zero-Shot) | DUSt3R | 94.7% | 43.1% |
| | CAM3R | 97.5% | 88.2% |

Critical Insight & Future Outlook

The primary contribution of CAM3R is strong evidence that decoupling beats entanglement. By forcing the network to learn an explicit ray field, the authors have created a model that is genuinely sensor-agnostic.

Limitations:

  • The dual-ViT backbone setup is computationally expensive (high VRAM usage).
  • $O(N^2)$ pairwise checks for the scene graph limit scalability to thousands of images.

Future Directions: The authors suggest unifying the encoders to improve speed and exploring advanced positional encodings to capture higher-frequency details in wide-angle scenes. CAM3R effectively sets a new standard for 3D reconstruction in robotics and egocentric vision, where fisheye lenses are the norm, not the exception.

Conclusion

CAM3R marks the end of the "Pinhole Era" for 3D foundation models. By leveraging ray-based representations and decoupled supervision, it enables high-fidelity 3D reconstruction from any camera, anywhere, without the need for prior calibration or pose information.
