# OccAny: Breaking the Chains of Sensor-Rig Calibration in Urban 3D Occupancy
## Abstract
OccAny is a generalized 3D occupancy framework for unconstrained urban environments, capable of zero-shot inference on out-of-domain, uncalibrated scenes. It outperforms visual-geometry baselines and rivals in-domain self-supervised methods on benchmarks such as SemanticKITTI (up to 25.91% IoU) and Occ3D-NuScenes (34.15% IoU).
## Executive Summary
**TL;DR**: OccAny is a paradigm-shifting framework that untethers 3D occupancy prediction from the rigid requirements of sensor calibration and in-domain training. By leveraging visual geometry foundation models and introducing novel strategies like **Segmentation Forcing** and **Novel View Rendering**, it achieves remarkable zero-shot generalization across diverse urban datasets.
**Background Positioning**: This work represents a significant leap from *specialized* occupancy networks to *generalized* 3D perception foundation models. It bridges the gap between general-purpose point-map predictors (like Dust3r) and the specific, metric-scaled needs of autonomous driving.
## The Problem: The "Calibration Trap"
Traditional SOTA occupancy models are "sensor-locked": they rely on fixed camera intrinsics and extrinsics to lift 2D features into 3D space. This creates a serious bottleneck — if the sensor rig changes even slightly, performance collapses. Compounding this, urban scenes are cluttered, and sparse LiDAR supervision often leaves "holes" in the predicted geometry, particularly in non-visible regions.
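To see why calibration-locked lifting is brittle, consider the standard unprojection step such models bake in. The sketch below is illustrative (not OccAny's code): every lifted 3D point depends directly on the intrinsics `K` and extrinsics `T_cam2ego`, so any rig change silently moves every feature in 3D.

```python
import numpy as np

def lift_pixel(u, v, depth, K, T_cam2ego):
    """Unproject pixel (u, v) at a given depth into the ego frame.

    K: 3x3 camera intrinsics, T_cam2ego: 4x4 camera-to-ego extrinsics.
    Both are hard-coded assumptions of the sensor rig -- the core of the
    "calibration trap" described above.
    """
    # Back-project the pixel through the inverse intrinsics to get a ray
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Scale the ray to the observed depth, homogenize, and move to ego frame
    p_cam = np.append(ray * depth, 1.0)
    return (T_cam2ego @ p_cam)[:3]
```

Swap in a camera with a different focal length and the same pixel lands somewhere else entirely, which is exactly why a model trained against one rig fails on another.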
## Methodology: The "Secret Sauce" of OccAny
OccAny tackles these challenges through two primary innovations:
### 1. Segmentation Forcing
The authors realized that geometric data alone is often too sparse to supervise dense occupancy. By introducing **Segmentation Forcing**, they distill high-fidelity semantic cues from foundation models (like SAM2) into the geometry-focused backbone. This regularizes the geometry prediction, using semantic consistency (e.g., "this is a continuous car surface") to fill in the gaps where LiDAR signals are absent.
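The idea can be sketched as a two-term objective: a geometry loss applied only where LiDAR supervision exists, plus a semantic-consistency term distilled from a segmentation teacher that covers *all* voxels. This is a minimal numpy sketch under my own assumptions about the loss structure — the function name, shapes, and weighting are hypothetical, not the paper's actual formulation.

```python
import numpy as np

def segmentation_forcing_loss(pred_occ, lidar_occ, lidar_mask,
                              teacher_seg, pred_seg, lam=0.5):
    """Hypothetical sketch of a Segmentation-Forcing-style objective.

    pred_occ:    (N,) predicted occupancy probabilities per voxel
    lidar_occ:   (N,) sparse LiDAR occupancy labels (0/1)
    lidar_mask:  (N,) bool, True where LiDAR supervision exists
    teacher_seg: (N, C) per-voxel class probabilities distilled from a
                 2D foundation model (e.g. SAM2), lifted to voxels
    pred_seg:    (N, C) the student's per-voxel class probabilities
    lam:         assumed weight balancing the two terms
    """
    eps = 1e-7
    # Geometry term: binary cross-entropy only on LiDAR-observed voxels
    p = np.clip(pred_occ[lidar_mask], eps, 1 - eps)
    y = lidar_occ[lidar_mask]
    geo = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Semantic term: cross-entropy against the teacher on ALL voxels,
    # regularizing geometry where LiDAR signals are absent
    q = np.clip(pred_seg, eps, 1.0)
    sem = -np.mean(np.sum(teacher_seg * np.log(q), axis=-1))
    return geo + lam * sem
```

The key design point is the mask: the geometry term sees only sparse returns, while the semantic term supplies a dense training signal in the LiDAR "holes".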
### 2. Novel View Rendering (NVR) & TTVA
To solve the problem of occlusion and "invisible" geometry, OccAny doesn't just predict what it sees. It uses a **Novel View Rendering** pipeline to hallucinate the scene from arbitrary new perspectives at test-time (**Test-time View Augmentation**). This allows the model to "peek" around corners and densify the voxel grid, leading to much more complete reconstructions.
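A simple way to picture the fusion step: each rendered view votes on the voxels it can see, and votes are averaged per voxel. The sketch below is my own minimal interpretation of test-time view augmentation — the function name, the visibility-weighted voting, and the threshold are assumptions, not OccAny's actual pipeline.

```python
import numpy as np

def ttva_fuse(occ_views, vis_masks, thresh=0.5):
    """Hypothetical test-time view-augmentation fusion.

    occ_views: (V, N) occupancy probabilities predicted from V rendered views
    vis_masks: (V, N) bool, True where voxel n is visible from view v

    A voxel is marked occupied when the visibility-weighted mean of the
    per-view predictions crosses `thresh`; voxels seen from no view stay empty.
    """
    w = vis_masks.astype(float)
    votes = (occ_views * w).sum(axis=0)       # sum of predictions where visible
    seen = w.sum(axis=0)                      # number of views observing each voxel
    mean = np.divide(votes, np.maximum(seen, 1.0))
    return (mean >= thresh) & (seen > 0)
```

The payoff is exactly the "peeking around corners" described above: a voxel occluded in the input view can still accumulate votes from augmented views, densifying the grid.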

*Figure 1: The two-stage training process: 3D Reconstruction with Segmentation Forcing followed by Novel-View Rendering.*
## Experimental Prowess
The results are striking. In a **Zero-shot** setting (trained on Waymo/ONCE/etc. and tested on SemanticKITTI), OccAny outperforms existing specialized baselines.
* **Metric Accuracy**: Unlike scale-invariant models, OccAny predicts metric pointmaps natively.
* **Generalization**: On SemanticKITTI monocular tasks, it reached a **24.03% IoU**, beating self-supervised models that were actually trained *on* that dataset.
* **Versatility**: The SAME model handles monocular, sequential, and surround-view inputs without reconfiguration.
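For reference, the IoU figures quoted above are standard scene-completion IoU over occupied voxels. A minimal sketch of that metric (my own implementation, not the benchmark's evaluation code):

```python
import numpy as np

def voxel_iou(pred, gt):
    """Binary occupancy IoU, as reported on SemanticKITTI-style benchmarks.

    pred, gt: boolean arrays of identical shape, True = occupied.
    Free-space agreement does not count; only occupied voxels enter the
    intersection and union.
    """
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 1.0
```

So a reported 24.03% IoU means roughly one quarter of the union of predicted and ground-truth occupied voxels is shared — a demanding metric, since every hallucinated or missed voxel inflates the union.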

*Table 1: OccAny's superior performance across benchmarks compared to geometric foundation model baselines.*
## Deep Insights & Critical Analysis
The most profound takeaway from OccAny is the **utility of "test-time imagination."** By allowing the model to render novel views during inference, the authors effectively converted a 3D completion problem into a multi-view consistency problem.
**Limitations**: Despite its strengths, a gap still exists compared to fully-supervised, in-domain models. Furthermore, while the NVR pipeline is efficient, performing 50+ view augmentations in real-time monocular settings remains a computational challenge for edge devices.
## Conclusion
OccAny proves that we are entering the era of "Unconstrained Perception." By moving away from rigid geometric priors and toward semantic-aware, rendering-capable foundation models, we can finally build autonomous systems that understand the 3D world as flexibly as humans do.

*Figure 2: Qualitative comparison showing OccAny's superior density and accuracy in voxel prediction.*
