[CVPR 2025] UFO: Scaling 4D Driving Scene Reconstruction to Long-Range Sequences
Abstract

UFO is a novel recurrent 4D reconstruction framework for large-scale driving scenes that unifies feed-forward efficiency with optimization-based refinement. By maintaining a persistent set of scene tokens and decoding them into 3D Gaussians, it achieves SOTA performance on the Waymo Open Dataset, reconstructing 16-second sequences in under 0.5 seconds.

TL;DR

UFO (Unifying Feed-Forward and Optimization) breaks the duration barrier in autonomous driving simulation. By treating 4D reconstruction as a recurrent "token refinement" task rather than a one-shot prediction, it achieves near-linear complexity. It can reconstruct a 16-second high-fidelity 4D world in just 0.5 seconds, outperforming traditional optimization-based methods that take hours.

Background: The Scalability Wall

Reconstructing a dynamic driving environment from video is essential for "Closed-loop Simulation"—where an AI driver interacts with a reconstructed world. Currently, researchers are stuck between two sub-optimal paths:

  1. Per-scene Optimization: High quality, but requires hours of "training" for every 10 seconds of road.
  2. Feed-forward Transformers: Fast and generalizable, but they "choke" on long sequences because memory and compute costs grow quadratically ($O(N^2)$ in the number of input tokens $N$) as the driving log grows.

UFO bridges this gap by mimicking how humans perceive space: we don't remember every pixel of every second; we maintain a mental map and update it with new visual evidence.

Methodology: The Recurrent Scene Token Paradigm

1. Token-Based Memory

Instead of predicting a static cloud of Gaussians, UFO maintains Scene Tokens. Each token is a D-dimensional vector encoding geometry, appearance, and motion. This memory persists across time steps, allowing the model to "remember" a building it saw 10 seconds ago.
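The idea of a persistent memory that is refined frame by frame can be sketched in a few lines. This is a toy illustration only: the `SceneToken` class, the `refine` function, and the blending rule are hypothetical stand-ins for the learned Transformer update, which the summary does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneToken:
    position: List[float]  # assumed 3D anchor of the token
    feature: List[float]   # D-dimensional latent vector


def refine(memory: List[SceneToken], observation: List[float], rate: float = 0.5) -> List[SceneToken]:
    """Toy stand-in for the learned update: blend each feature toward the observation."""
    for token in memory:
        token.feature = [(1 - rate) * f + rate * o for f, o in zip(token.feature, observation)]
    return memory


def run_sequence(memory: List[SceneToken], observations: List[List[float]]) -> List[SceneToken]:
    """Carry the same token set through every frame instead of re-predicting it."""
    for obs in observations:
        memory = refine(memory, obs)
    return memory


memory = [SceneToken([0.0, 0.0, float(i)], [0.0, 0.0]) for i in range(4)]
frames = [[1.0, 0.0], [1.0, 2.0], [1.0, 2.0]]
memory = run_sequence(memory, frames)
print(len(memory), memory[0].feature)  # 4 [0.875, 1.5]
```

The point of the sketch is that the number of tokens does not grow with the number of frames; only their contents change.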

2. Visibility-Based Filtering (The Efficiency Secret)

To avoid this quadratic growth, UFO uses a Visibility-Based Filtering mechanism. At each frame, the model only "looks" at a bounded set of tokens (3600 by default) that are actually inside the current camera's view frustum.

  • Result: The per-frame compute cost stays roughly constant whether the total sequence is 2 seconds or 20 minutes long, so the total cost grows about linearly with sequence length.
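A minimal sketch of frustum-based token selection follows. The field-of-view angles, the near and far planes, and the nearest-first rule for staying within the budget are all assumptions made for illustration; only the default budget of 3600 comes from the text above.

```python
import math


def in_frustum(point_cam, h_fov_deg=90.0, v_fov_deg=60.0, near=0.1, far=100.0):
    """Check a point given in camera coordinates (x right, y down, z forward)."""
    x, y, z = point_cam
    if not (near <= z <= far):
        return False
    half_h = math.tan(math.radians(h_fov_deg) / 2)
    half_v = math.tan(math.radians(v_fov_deg) / 2)
    return abs(x) <= z * half_h and abs(y) <= z * half_v


def select_visible(token_positions_cam, budget=3600):
    """Return indices of visible tokens, nearest first, capped at the budget."""
    visible = [i for i, p in enumerate(token_positions_cam) if in_frustum(p)]
    visible.sort(key=lambda i: token_positions_cam[i][2])
    return visible[:budget]


# Tokens spread along the road; the longer log simply has more of them.
short_log = [(0.0, 0.0, float(z)) for z in range(-50, 500)]
long_log = [(0.0, 0.0, float(z)) for z in range(-50, 5000)]
print(len(select_visible(short_log)), len(select_visible(long_log)))
```

Both calls select the same number of tokens, which is why the work done per frame does not depend on how large the memory has become.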

Figure 1: The UFO framework. (A) Recurrent updates, (B) Visibility filtering for linear scaling, and (C) Pose-guided dynamic modeling.

3. Handling "Ghosts" and Dynamic Objects

Modeling a car turning at an intersection is hard. UFO uses 3D bounding boxes to guide Gaussian motion, and complements them with two mechanisms: Soft Assignment and a Temporal Lifespan.

  • If the model sees something transient (like lens flare or a pedestrian's limb moving unpredictably), it assigns a short "lifespan," letting those Gaussians fade away.
  • This prevents the "ghosting" artifacts common in earlier 4DGS works.
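The two mechanisms can be illustrated with a small sketch. The summary gives no formulas, so everything here is an assumption: `soft_assign` uses a softmax over per-box scores, and `temporal_opacity` uses a Gaussian-shaped time window, a form commonly seen in dynamic Gaussian splatting but not confirmed as UFO's.

```python
import math


def soft_assign(logits):
    """Softmax over per-box scores, so one Gaussian can follow several boxes partially."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blended_motion(weights, box_translations):
    """Weighted sum of the 3D translations of the candidate boxes."""
    return tuple(
        sum(w * t[axis] for w, t in zip(weights, box_translations))
        for axis in range(3)
    )


def temporal_opacity(base_opacity, t, t_center, lifespan):
    """Opacity fades as the query time moves away from the Gaussian's center time."""
    return base_opacity * math.exp(-0.5 * ((t - t_center) / lifespan) ** 2)


# Slot 0 is an assumed "static background" box with zero motion.
weights = soft_assign([0.0, 0.0])
print(blended_motion(weights, [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]))  # (1.0, 0.0, 0.0)

# One second after its center time, a short-lived Gaussian has almost vanished.
print(temporal_opacity(1.0, 2.0, 1.0, 0.2) < temporal_opacity(1.0, 2.0, 1.0, 5.0))  # True
```

Under these assumptions, a transient effect such as lens flare would receive a small `lifespan` and contribute almost nothing to frames far from the moment it was observed.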

Experimental Results: SOTA Performance

UFO was tested on the Waymo Open Dataset (WOD). It doesn't just beat other feed-forward models; it beats them while using significantly less memory.

| Method | Sequence Length | PSNR (Quality) ↑ | Inference Time ↓ |
| :--- | :--- | :--- | :--- |
| 3DGS (Opt.) | 16s | 17.18 | Hours |
| STORM (FF) | 16s | 22.02 | ~1.5s |
| UFO (Ours) | 16s | 27.04 | 0.48s |
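For readers less familiar with the metric, PSNR is defined as 10·log10(MAX² / MSE), measured in decibels. The helper below is a generic sketch of that definition over flat lists of pixel values in [0, 1]; it is not the paper's evaluation code, and the function name is illustrative.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


# A PSNR gap converts to a ratio of mean squared errors.
mse_ratio = 10 ** ((27.04 - 22.02) / 10)
print(round(mse_ratio, 2))
```

Because the scale is logarithmic, the gap of about 5 dB between UFO and STORM in the table corresponds to roughly a threefold reduction in mean squared error, assuming both numbers were measured under the same protocol.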

Figure 2: Qualitative comparison showing UFO's superior detail in distant structures and sharper dynamic objects compared to STORM.

Zero-Shot Generalization

Perhaps the most impressive feat is UFO's ability to handle 16-second sequences even when it was only trained on 8-second clips. This suggests the "recurrent update" logic has truly learned the underlying physics of scene persistence.

Deep Insight: Why It Works

Traditional feed-forward models (like STORM or GS-LRM) try to solve the entire scene as a single "translation" problem (images → 3D). UFO treats it as a state estimation problem. By transforming tokens into a local camera-centric coordinate system at each step, the authors stabilized the training of long-range transformers, which usually suffer from numerical instability as the ego-vehicle travels kilometers away from the origin.
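The benefit of a camera-centric frame is easy to see in a small example. The sketch below uses a simplified 2D bird's-eye rigid transform; the function name and the 2D simplification are assumptions for illustration, and the actual method presumably works with full 3D camera poses.

```python
import math


def world_to_ego(point_world, ego_position, ego_yaw):
    """Express a world-frame 2D point in the ego frame (x forward, y left)."""
    dx = point_world[0] - ego_position[0]
    dy = point_world[1] - ego_position[1]
    c, s = math.cos(ego_yaw), math.sin(ego_yaw)
    # Rotate the offset by -yaw so the ego heading becomes the +x axis.
    return (c * dx + s * dy, -s * dx + c * dy)


# The ego vehicle is kilometers from the origin, heading along world +y.
ego_position = (5000.0, 3000.0)
building_world = (5000.0, 3020.0)
local = world_to_ego(building_world, ego_position, math.pi / 2)
print(local)
```

In world coordinates the building sits thousands of meters from the origin, while in the ego frame it is about 20 m straight ahead. Keeping the values fed to the network in this bounded local range is the kind of normalization the paragraph above describes.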

Conclusion & Limitations

UFO is a major step toward real-time digital twins for autonomous driving. It provides a blueprint for how to use Transformers to "learn how to optimize."

Limitations: The method still heavily relies on off-the-shelf 3D object detectors for the initial "boxes." If the detector misses a vehicle, UFO might struggle to assign the correct motion. Future work likely involves a "detect-reconstruct" joint loop where the 4D reconstruction helps improve object detection in a virtuous cycle.

Figure 3: Visualization of motion assignment and lifespan maps. Blue tones indicate transient objects with short lifespans, filtered naturally by the loss function.
