WisPaper
OccAny: Breaking the Chains of Sensor-Rig Calibration in Urban 3D Occupancy
## Abstract

OccAny is a generalized 3D occupancy framework for unconstrained urban environments, capable of zero-shot inference on out-of-domain, uncalibrated scenes. It outperforms visual geometry baselines and rivals in-domain self-supervised methods on benchmarks such as SemanticKITTI (up to 25.91% IoU) and Occ3D-NuScenes (34.15% IoU).

## Executive Summary
**TL;DR**: OccAny is a framework that untethers 3D occupancy prediction from the rigid requirements of sensor calibration and in-domain training. By leveraging visual geometry foundation models and introducing two strategies, **Segmentation Forcing** and **Novel View Rendering**, it achieves strong zero-shot generalization across diverse urban datasets.

**Background Positioning**: This work marks a shift from *specialized* occupancy networks toward *generalized* 3D perception foundation models, bridging the gap between general-purpose point-map predictors (such as DUSt3R) and the specific, metric-scaled needs of autonomous driving.

## The Problem: The "Calibration Trap"
Traditional SOTA occupancy models are "sensor-locked": they rely on fixed camera intrinsics and extrinsics to lift 2D features into 3D space. This creates a serious bottleneck: if the sensor rig changes even slightly, performance collapses. Furthermore, urban scenes are cluttered, and sparse LiDAR supervision often leaves "holes" in the predicted geometry, particularly in regions invisible to the sensors.
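To make the calibration dependence concrete, here is a minimal sketch of the standard 2D-to-3D lifting step that sensor-locked models bake into their architecture. The intrinsic matrix `K` and pose values below are illustrative placeholders, not the paper's setup:

```python
import numpy as np

def unproject(u, v, depth, K, cam_to_world):
    """Lift a pixel (u, v) with known depth into world coordinates.

    Calibration-locked models hard-wire this step: K (intrinsics) and
    cam_to_world (extrinsics) are assumed fixed, so any change to the
    sensor rig silently invalidates every lifted feature.
    """
    # Ray direction in camera coordinates: K^-1 @ [u, v, 1], scaled by depth.
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    p_cam = np.append(ray * depth, 1.0)  # homogeneous camera-frame point
    return (cam_to_world @ p_cam)[:3]

# Illustrative pinhole camera: f = 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
cam_to_world = np.eye(4)  # camera at the world origin

# The principal-point pixel at 10 m depth lands straight ahead on the optical axis.
p = unproject(320, 240, 10.0, K, cam_to_world)
```

Because every lifted point depends on `K` and `cam_to_world`, a model trained against one rig cannot be reused on another without recalibration; this is exactly the coupling OccAny removes.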

## Methodology: The "Secret Sauce" of OccAny
OccAny tackles these challenges through two primary innovations:

### 1. Segmentation Forcing
Geometric supervision alone is often too sparse to train dense occupancy. **Segmentation Forcing** distills high-fidelity semantic cues from foundation models (such as SAM2) into the geometry-focused backbone. Semantic consistency (e.g., "this is a continuous car surface") then regularizes the geometry prediction, filling in the gaps where LiDAR signals are absent.
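The distillation idea can be sketched as a simple feature-alignment loss. The cosine-similarity form, the shapes, and the function name below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def segmentation_forcing_loss(student_feat, teacher_feat, mask):
    """Toy distillation loss in the spirit of Segmentation Forcing.

    student_feat: (H, W, C) features from the geometry branch.
    teacher_feat: (H, W, C) frozen semantic features (e.g. from a
                  SAM-style teacher).
    mask:         (H, W) bool, pixels where distillation applies.
    Returns the mean (1 - cosine similarity) over masked pixels, so the
    student is pulled toward the teacher's semantic structure.
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    s, t = normalize(student_feat), normalize(teacher_feat)
    cos = (s * t).sum(axis=-1)            # per-pixel cosine similarity
    return float((1.0 - cos)[mask].mean())

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 4, 8))
mask = np.ones((4, 4), dtype=bool)
# A student that already matches the teacher incurs (near) zero loss.
loss = segmentation_forcing_loss(feat, feat, mask)
```

The key design point is that the semantic teacher is dense where LiDAR is sparse, so minimizing this kind of loss propagates supervision into unobserved regions.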

### 2. Novel View Rendering (NVR) & TTVA
To handle occlusion and invisible geometry, OccAny does not just predict what it sees. A **Novel View Rendering** pipeline synthesizes the scene from arbitrary new perspectives at test time (**Test-time View Augmentation**, TTVA). This lets the model "peek" around occluders and densify the voxel grid, yielding far more complete reconstructions.
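One way to picture TTVA is as multi-view fusion of per-view occupancy predictions. The voting scheme below is a simplified stand-in for the paper's fusion step, with invented grid sizes:

```python
import numpy as np

def fuse_views(view_predictions, min_votes=2):
    """Fuse per-view occupancy predictions into one voxel grid.

    view_predictions: list of (D, H, W) boolean grids, one per rendered
    virtual view, where each view observes a different subset of the
    scene. A voxel is kept occupied if at least `min_votes` views agree,
    a crude proxy for multi-view consistency.
    """
    votes = np.sum(view_predictions, axis=0)  # per-voxel agreement count
    return votes >= min_votes

# Three virtual views over a tiny 2x2x2 grid.
views = [np.zeros((2, 2, 2), dtype=bool) for _ in range(3)]
views[0][0, 0, 0] = views[1][0, 0, 0] = True  # two views agree here
views[2][1, 1, 1] = True                      # a single-view outlier
fused = fuse_views(views, min_votes=2)
```

Rendering extra views densifies `votes` in regions no single input view covers, which is why the fused grid is more complete than any one prediction.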

![Overall Architecture](https://cdn.atominnolab.com/wisdoc/images/20260325-de9efeb3-166e-4c63-9782-9e9ff2c8792a/page_002_block_000.png)
*Figure 1: The two-stage training process: 3D reconstruction with Segmentation Forcing, followed by novel-view rendering.*

## Experimental Prowess
The results are striking. In a **zero-shot** setting (trained on Waymo, ONCE, and other datasets, then tested on SemanticKITTI), OccAny outperforms existing specialized baselines.

*   **Metric accuracy**: Unlike scale-invariant models, OccAny predicts metric point maps natively.
*   **Generalization**: On the SemanticKITTI monocular task it reaches **24.03% IoU**, beating self-supervised models that were actually trained *on* that dataset.
*   **Versatility**: The *same* model handles monocular, sequential, and surround-view inputs without reconfiguration.
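For readers unfamiliar with the headline metric, occupancy IoU is just intersection-over-union computed over occupied voxels. A minimal version (my own sketch, not the benchmark's reference implementation):

```python
import numpy as np

def voxel_iou(pred, gt):
    """Occupancy IoU: |pred AND gt| / |pred OR gt| over occupied voxels."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Two empty grids agree perfectly by convention.
    return float(inter) / float(union) if union else 1.0

pred = np.array([1, 1, 0, 0])  # toy flattened grids
gt   = np.array([1, 0, 1, 0])
iou = voxel_iou(pred, gt)      # intersection = 1, union = 3
```

Benchmarks like SemanticKITTI report this over the full scene-completion volume, which is why sparse or hole-ridden predictions score poorly even when visible surfaces are accurate.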

![Experimental Results](https://cdn.atominnolab.com/wisdoc/tables/20260325-de9efeb3-166e-4c63-9782-9e9ff2c8792a/page_005_block_006.png)
*Table 1: OccAny's superior performance across benchmarks compared to geometric foundation model baselines.*

## Deep Insights & Critical Analysis
The most profound takeaway from OccAny is the utility of **"test-time imagination"**: by rendering novel views during inference, the authors effectively convert a 3D completion problem into a multi-view consistency problem.

**Limitations**: Despite its strengths, a gap remains compared to fully supervised, in-domain models. And while the NVR pipeline is efficient, performing 50+ view augmentations in real-time monocular settings remains a computational challenge for edge devices.

## Conclusion
OccAny suggests we are entering an era of "unconstrained perception." By moving away from rigid geometric priors and toward semantic-aware, rendering-capable foundation models, we can build autonomous systems that understand the 3D world with far greater flexibility.

![Qualitative Results](https://cdn.atominnolab.com/wisdoc/images/20260325-de9efeb3-166e-4c63-9782-9e9ff2c8792a/page_016_block_007.png)
*Figure 2: Qualitative comparison showing OccAny's superior density and accuracy in voxel prediction.*
