ReconDrive: Fast Feed-Forward 4D Gaussian Splatting for Autonomous Driving Scene Reconstruction

[ArXiv 2024] ReconDrive: Breaking the Scalability Barrier in 4D Driving Scene Reconstruction

Abstract

ReconDrive is a fast feed-forward 4D Gaussian Splatting (4DGS) framework designed for large-scale autonomous driving scene reconstruction. By extending the VGGT foundation model with hybrid prediction heads and a static-dynamic composition strategy, it achieves high-fidelity novel-view synthesis and reconstruction in a single forward pass, significantly outperforming prior feed-forward methods and competing with per-scene optimization baselines.

TL;DR

ReconDrive is a feed-forward framework that generates high-fidelity 4D Gaussian Splatting (4DGS) representations of autonomous driving scenes in a single forward pass. By eliminating per-scene optimization, it cuts reconstruction time from roughly 30 minutes to 15 seconds while achieving state-of-the-art results in visual fidelity and downstream 3D perception tasks.

The Scalability Bottleneck: Beyond "One Scene at a Time"

The industry's move toward closed-loop simulation requires reconstructing thousands of real-world miles as digital twins. However, the current state of the art, 4D Gaussian Splatting, has a fatal flaw: optimization dependency.

Current methods like Street Gaussians require iterative refinement for every single new scene. This is a "non-data-driven" paradigm that ignores the common structural priors of urban environments (roads are flat, cars have four wheels). Conversely, early "feed-forward" (data-driven) models struggled with metric accuracy and blurry textures.

ReconDrive's Intuition: Can we use a 3D Foundation Model (like VGGT) as a "structural brain" but feed it specific "photometric eyes" and "calibration maps" to achieve both speed and precision?

Methodology: The Three Pillars of ReconDrive

1. Hybrid Gaussian Prediction Heads

ReconDrive's authors observe that transformer features from foundation models are great for geometry but poor for appearance. To fix this, they designed a dual-head system:

GCPH (Center Head): Explicitly incorporates camera calibration to project depths into metric 3D space, solving the "scale ambiguity" common in general models.
GPPH (Parameter Head): Uses a "shortcut connection" (residual-like) to inject raw high-resolution image details directly into the attribute regression, ensuring the resulting Gaussians have crisp colors and sharp opacities.
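The calibration step described for the center head can be illustrated with the standard pinhole camera model. The following is a minimal sketch, not the authors' implementation: the function names `unproject_pixel` and `camera_to_world`, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the example values are all assumptions introduced here for illustration.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Point3:
    """Lift pixel (u, v) to a camera-frame 3D point with the pinhole model.

    `depth` is the metric distance along the optical axis (metres);
    fx, fy, cx, cy are the camera intrinsics in pixels (hypothetical inputs).
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def camera_to_world(p: Point3, rotation: List[List[float]],
                    translation: Point3) -> Point3:
    """Apply a rigid camera-to-world transform: p_world = R * p_cam + t."""
    return tuple(
        sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


if __name__ == "__main__":
    # Illustrative values only: a pixel 500 px right of the principal point,
    # seen at 20 m depth by a camera with a 1000 px focal length.
    p_cam = unproject_pixel(1300, 450, 20.0, fx=1000, fy=1000, cx=800, cy=450)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(p_cam)                                        # (10.0, 0.0, 20.0)
    print(camera_to_world(p_cam, identity, (1.0, 2.0, 3.0)))
```

Because the intrinsics enter the computation directly, a depth expressed in metres yields a Gaussian center expressed in metres, which is the sense in which calibration removes scale ambiguity.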

[Figure: ReconDrive inference framework]

2. Static-Dynamic 4D Composition

Instead of treating the world as a rigid block, ReconDrive uses SAM2 (Segment Anything Model 2) to mask dynamic agents. It then estimates a linear velocity vector for these dynamic Gaussians:

$\mu_i(t) = \mu_{i,\text{init}} + v_i \cdot (t - T_s)$

Here $\mu_i(t)$ is the center of Gaussian $i$ at time $t$, $\mu_{i,\text{init}}$ is its center at the reference time $T_s$, and $v_i$ is its velocity. This explicit motion modeling allows the model to "predict" where a car will be in future frames without needing to re-process the entire scene.
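The linear motion model above is simple enough to sketch in a few lines. This is an illustration under stated assumptions rather than the paper's code: the `DynamicGaussian` class, its field names, and the `center_at` helper are hypothetical, and static Gaussians are modeled here simply as having zero velocity.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class DynamicGaussian:
    """Hypothetical container for the two quantities in the motion model."""
    mu_init: Vec3   # center at the reference time t_s (metres)
    velocity: Vec3  # linear velocity (metres per second); zero if static


def center_at(g: DynamicGaussian, t: float, t_s: float) -> Vec3:
    """Evaluate mu(t) = mu_init + v * (t - t_s) component-wise."""
    dt = t - t_s
    return tuple(m + v * dt for m, v in zip(g.mu_init, g.velocity))


if __name__ == "__main__":
    # Illustrative values: a car 10 m ahead, moving at 5 m/s along x.
    car = DynamicGaussian(mu_init=(10.0, 0.0, 0.0), velocity=(5.0, 0.0, 0.0))
    print(center_at(car, t=0.5, t_s=0.0))  # (12.5, 0.0, 0.0)
```

Only the center moves in this sketch; the other Gaussian attributes (scale, rotation, color, opacity) would stay fixed, which is what makes rendering at a new timestamp cheap.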

3. Segment-wise Temporal Fusion

To handle long driving sequences (20s+), the system breaks the scene into temporal segments. It fuses Gaussian clusters from adjacent context frames into a unified 4D representation, ensuring smooth transitions and temporal consistency.
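One way to picture the segmentation step is as a lookup from a query timestamp to the pair of adjacent context frames that bracket it. The sketch below covers only that lookup and makes no claim about how the actual fusion of Gaussians is performed; `split_into_segments`, `segment_for`, and the example timestamps are assumptions introduced for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

Segment = Tuple[float, float]


def split_into_segments(context_times: List[float]) -> List[Segment]:
    """Pair adjacent context timestamps into segments (t_k, t_{k+1})."""
    if len(context_times) < 2:
        raise ValueError("need at least two context timestamps")
    return list(zip(context_times[:-1], context_times[1:]))


def segment_for(t: float, context_times: List[float]) -> Segment:
    """Return the adjacent context timestamps that bracket query time t.

    `context_times` must be sorted in ascending order.
    """
    if len(context_times) < 2:
        raise ValueError("need at least two context timestamps")
    if not context_times[0] <= t <= context_times[-1]:
        raise ValueError("query time lies outside the sequence")
    # Index of the last context frame at or before t, clamped so that the
    # final timestamp still maps to the last segment.
    k = min(bisect_right(context_times, t) - 1, len(context_times) - 2)
    return (context_times[k], context_times[k + 1])


if __name__ == "__main__":
    times = [0.0, 0.5, 1.0, 1.5]           # illustrative context timestamps
    print(split_into_segments(times))
    print(segment_for(0.7, times))          # (0.5, 1.0)
```

In a full pipeline, the Gaussians predicted from the two bracketing frames would then be combined to render the query time; that combination is the part this sketch leaves out.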

Experimental Results: Faster and Sharper

The authors benchmarked ReconDrive on nuScenes, comparing it against both the "slow but steady" optimization methods and "fast but blurry" feed-forward models.

| Method | PSNR (reconstruction, dB) | Reconstruction time |
| :--- | :---: | :---: |
| Street Gaussians (Opt.) | 29.18 | 31 min |
| DrivingForward (FF) | 22.83 | 5 s |
| ReconDrive (Ours) | 32.66 | 15 s |
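For readers less familiar with the metric, PSNR is a logarithmic function of the mean squared error between a rendered image and its reference. The sketch below uses the standard definition on flat lists of pixel intensities; the `psnr` and `mse_ratio_from_db_gap` helpers and the assumption that intensities lie in [0, 1] are introduced here and may differ from the evaluation code behind the table.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_db_gap(db_gap: float) -> float:
    """Factor by which MSE shrinks for a PSNR gain of `db_gap` dB."""
    return 10.0 ** (db_gap / 10.0)


if __name__ == "__main__":
    # Every pixel off by 0.1 on a [0, 1] scale gives an MSE of about 0.01.
    print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
    print(mse_ratio_from_db_gap(32.66 - 29.18))
```

Read against the table, the 3.48 dB gap between ReconDrive and Street Gaussians corresponds, for a single image, to roughly 2.2 times lower mean squared error, and 31 minutes versus 15 seconds is a 124-fold reduction in reconstruction time (1860 s / 15 s).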

[Figure: Visual comparison]

As seen in the visual results, ReconDrive maintains sharp details on trees and vehicle boundaries even during lateral movement (novel-view synthesis), where other methods often introduce severe "floaters" or artifacts.

Critical Analysis & Future Outlook

ReconDrive marks a significant shift: it is the first time a feed-forward model has outperformed optimization-based models in both PSNR and perception metrics (mAP, AMOTA).

Limitations:

Rigid Motion Assumption: The model assumes linear motion, which might fail for sharp turns or non-rigid deformations (e.g., pedestrians waving hands).
Background Holes: When a vehicle moves, it leaves a "hole" in the background that requires better inpainting or multi-frame fusion to fill perfectly.
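The first limitation can be made concrete with a small geometric check: how far a constant-velocity prediction drifts from a point that is actually following a circular arc. This is a toy calculation, not an experiment from the paper; `linear_extrapolation_error` and the speed, radius, and time values are assumptions chosen for illustration.

```python
import math


def linear_extrapolation_error(speed: float, radius: float, dt: float) -> float:
    """Distance (m) between a constant-velocity prediction and the true
    position of a point moving at constant speed along a circular arc.

    The point starts at the origin heading along +x and turns toward +y.
    """
    angle = speed * dt / radius                 # radians swept after dt
    true_x = radius * math.sin(angle)
    true_y = radius * (1.0 - math.cos(angle))
    pred_x, pred_y = speed * dt, 0.0            # straight-line prediction
    return math.hypot(pred_x - true_x, pred_y - true_y)


if __name__ == "__main__":
    # Illustrative case: 10 m/s through a turn of 20 m radius.
    for dt in (0.1, 0.25, 0.5, 1.0):
        err = linear_extrapolation_error(speed=10.0, radius=20.0, dt=dt)
        print(f"dt = {dt:4.2f} s -> error = {err:.3f} m")
```

For small angles the error grows approximately as speed² · dt² / (2 · radius), so in this example it is about 0.62 m after half a second. The quadratic growth suggests why a linear model can be adequate over short segments yet degrade on sharp turns or long horizons.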

The Bigger Picture: By proving that we can "generate" a 4D scene representation in near real-time, ReconDrive paves the way for generative simulation environments that can be created on-the-fly, a holy grail for testing End-to-End autonomous driving stacks.

Senior Editor's Note: This work effectively navigates the Pareto front of speed vs. quality. The use of DINOv2 tokens for geometry and raw pixel shortcuts for appearance is a classic "engineering-meets-foundation-model" insight that works exceptionally well here.
