DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving


[CVPR 2025] DynamicVGGT: Unified Feed-Forward 4D Reconstruction for Autonomous Driving

Abstract

DynamicVGGT is a unified feed-forward framework for 4D scene reconstruction in autonomous driving, extending the Visual Geometry Grounding Transformer (VGGT) to handle temporal dynamics. It achieves state-of-the-art results in point map accuracy and novel view synthesis on real-world datasets like Waymo and KITTI without requiring camera extrinsics or per-scene optimization.

TL;DR

DynamicVGGT turns the static 3D perception of the Visual Geometry Grounding Transformer (VGGT) into a high-fidelity 4D dynamic reconstruction engine. By combining Motion-aware Temporal Attention (MTA) with a Dynamic 3D Gaussian Splatting Head, it recovers both geometry and motion trajectories from raw video. It achieves SOTA performance on Waymo and KITTI, showing that feed-forward models can handle complex urban dynamics without expensive per-scene optimization or camera extrinsics.

Problem & Motivation: The Dynamic Gap

While feed-forward 3D foundation models like DUSt3R and VGGT have revolutionized static scene understanding, they hit a wall in Autonomous Driving (AD). Real-world AD scenarios are not just static "frozen" moments; they are defined by moving vehicles, pedestrians, and ego-motion.

Existing methods face three primary hurdles:

Temporal Inconsistency: Frame-by-frame 3D predictions often "flicker" because they lack an understanding of motion continuity.
Sparsity of Real Data: Standard LiDAR supervision is too sparse and noisy for training dense, high-fidelity Gaussian reconstruction models.
Optimization Overhead: Most dynamic Gaussian splatting methods require minutes or hours of per-scene optimization, making them impractical for real-time feed-forward pipelines.

The authors' insight is to treat motion as a displacement field within a shared reference system, allowing the model to learn "where a point was" and "where it will be" through a unified Dynamic Point Map (DPM) representation.
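As a rough illustration of the displacement-field idea, the sketch below stores a per-pixel point map in one shared reference frame and obtains the point map at another timestamp by adding a per-pixel displacement. The function name `advect_point_map`, the array shapes, and the toy values are assumptions made for illustration; they are not the paper's interface.

```python
import numpy as np

def advect_point_map(points_t: np.ndarray, displacement: np.ndarray) -> np.ndarray:
    """Move every 3D point by its displacement, staying in the same reference frame."""
    if points_t.shape != displacement.shape or points_t.shape[-1] != 3:
        raise ValueError("expected two arrays of shape (H, W, 3)")
    return points_t + displacement

# Toy 2x2 point map: every pixel sees a point 10 m ahead on the z axis.
points_t = np.zeros((2, 2, 3))
points_t[..., 2] = 10.0

# Static background has zero displacement; one "vehicle" pixel moves 0.5 m along x.
displacement = np.zeros_like(points_t)
displacement[0, 1] = [0.5, 0.0, 0.0]

points_next = advect_point_map(points_t, displacement)
```

Because both point maps live in the same coordinate system, static pixels are unchanged and only the moving pixel shifts, which is what lets one representation answer both "where a point was" and "where it will be".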

Methodology: Encoding the 4th Dimension

DynamicVGGT builds on the VGGT backbone but introduces three critical innovations to handle time and motion.

1. Motion-aware Temporal Attention (MTA)

Instead of simply stacking temporal layers, MTA uses learnable motion tokens. These tokens act as "memory buffers" that encode motion priors across a sequence, guiding the spatial attention layers to focus on regions where movement is occurring. This maintains the stable geometric priors of the pretrained static model while adding temporal reasoning.
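This summary does not spell out the exact MTA design, so the following is only a generic sketch of one common way to use learnable tokens: append them to the keys and values so that frame tokens can read from them during attention. The single-head layout, the residual connection, and all names (`attention_with_motion_tokens`, `w_q`, `w_k`, `w_v`) are assumptions, and the "learnable" tokens are just a fixed random array here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_motion_tokens(frame_tokens, motion_tokens, w_q, w_k, w_v):
    """Single-head attention where frame tokens also attend to motion tokens.

    frame_tokens:  (N, D) tokens from the frames in a sequence
    motion_tokens: (M, D) stand-in for learnable motion tokens
    """
    context = np.concatenate([frame_tokens, motion_tokens], axis=0)  # (N + M, D)
    q = frame_tokens @ w_q   # queries come from frame tokens only
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)       # (N, N + M)
    # The residual path keeps the original (pretrained) features in the output.
    return frame_tokens + weights @ v, weights

rng = np.random.default_rng(0)
N, M, D = 6, 2, 8
frame_tokens = rng.normal(size=(N, D))
motion_tokens = rng.normal(size=(M, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

out, weights = attention_with_motion_tokens(frame_tokens, motion_tokens, w_q, w_k, w_v)
```

The last `M` columns of `weights` show how strongly each frame token reads from the motion tokens; in a trained model those columns would be one place to look for motion-related attention.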

2. The Dynamic 3D Gaussian Splatting Head (DGSHead)

To move beyond simple point clouds, the model predicts 3D Gaussian primitives. Unlike static Gaussians, each primitive here carries a velocity vector ($u_i$).

Implicit Learning: The Future Point Head predicts the next frame's geometry to ensure inter-frame consistency.
Explicit Learning: The DGSHead is supervised by Scene Flow, forcing the Gaussians to follow physically plausible trajectories.
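A minimal sketch of how a per-Gaussian velocity can be used, assuming the constant-velocity motion model noted in the limitations: each Gaussian mean is moved by `delta` times its velocity, and the velocity is compared with a ground-truth scene flow. The L1 form of the loss, the `valid` mask, and the function names are assumptions; the summary only states that the DGSHead is supervised by scene flow.

```python
import numpy as np

def propagate_means(means, velocities, delta):
    """Constant-velocity update: mu(t + delta) = mu(t) + delta * u."""
    return means + delta * velocities

def scene_flow_loss(velocities, flow_gt, valid):
    """Mean L1 error between predicted velocity and ground-truth flow on valid Gaussians."""
    per_gaussian = np.abs(velocities - flow_gt).sum(axis=-1)
    return float(per_gaussian[valid].mean())

# Two Gaussians: a static one and one moving along +x while approaching the camera.
means = np.array([[0.0, 0.0, 5.0], [2.0, 0.0, 8.0]])
velocities = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, -0.5]])  # displacement per frame
flow_gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
valid = np.array([True, True])

means_next = propagate_means(means, velocities, delta=2)
loss = scene_flow_loss(velocities, flow_gt, valid)
```

With these toy values the second Gaussian moves from (2, 0, 8) to (4, 0, 7) after two frames, and the loss is 0.25 because only its z-velocity disagrees with the ground-truth flow.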

Figure: Model Architecture

3. Stage-wise Training & Depth Distillation

Training on real-world AD data can degrade performance due to LiDAR sparsity. The authors solve this with a Curriculum Learning strategy:

Stage 1: Train on high-fidelity synthetic data (Virtual KITTI/MVS-Synth) to learn clean geometry.
Stage 2: Fine-tune on real data (Waymo) using a Depth Distillation loss, where the Stage 1 model acts as a "teacher" to provide dense geometric guidance to the Gaussian head.
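The distillation idea in Stage 2 can be sketched as a loss with two terms: a dense term that pulls the student's depth toward the frozen teacher's depth at every pixel, and a sparse term that is only evaluated where LiDAR returns exist. The L1 form, the `weight` parameter, and the function name are assumptions for illustration; the paper's actual loss may differ.

```python
import numpy as np

def distillation_loss(student_depth, teacher_depth, lidar_depth, lidar_mask, weight=1.0):
    """Sparse L1 term on LiDAR hits plus a weighted dense L1 term against the teacher."""
    dense = np.abs(student_depth - teacher_depth).mean()
    if lidar_mask.any():
        sparse = np.abs(student_depth - lidar_depth)[lidar_mask].mean()
    else:
        sparse = 0.0  # no LiDAR returns in this image
    return float(sparse + weight * dense)

student = np.array([[10.0, 12.0], [20.0, 22.0]])
teacher = np.array([[10.0, 11.0], [21.0, 22.0]])   # dense prediction from the Stage 1 model
lidar = np.array([[10.5, 0.0], [0.0, 0.0]])         # only one pixel has a LiDAR return
mask = np.array([[True, False], [False, False]])

loss = distillation_loss(student, teacher, lidar, mask, weight=0.5)
```

In this toy case the single LiDAR pixel contributes 0.5 and the dense teacher term contributes 0.5 × 0.5, so every pixel receives some gradient even though only one of the four has a LiDAR measurement.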

Experiments: Superior 4D Fidelity

DynamicVGGT was evaluated against top-tier baselines like StreamVGGT and STORM.

Point Cloud Accuracy: On KITTI Monocular, it reduced the Accuracy error from 1.489 (VGGT) to 0.901, a reduction of roughly 40%.
4D Synthesis: Even without camera parameters or dense annotations, it achieved a PSNR of 18.07 dB on dynamic-only regions of Waymo, outperforming general feed-forward large reconstruction models (LRMs).
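To make the two numbers above concrete, the sketch below computes the relative error reduction from the reported values and shows one way a PSNR restricted to dynamic regions could be evaluated. The helper names and the boolean `mask` convention are assumptions; how the paper defines its dynamic-region mask is not stated here.

```python
import numpy as np

def relative_reduction(baseline, new):
    """Fraction by which an error metric dropped relative to the baseline."""
    return (baseline - new) / baseline

def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR in dB over pixels where mask is True (e.g. a dynamic-object mask)."""
    mse = np.mean((pred[mask] - target[mask]) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Reported KITTI Accuracy values: VGGT 1.489, DynamicVGGT 0.901.
reduction = relative_reduction(1.489, 0.901)  # about 0.395

# Toy image: only the two masked pixels count toward the metric.
target = np.zeros((2, 2))
pred = np.array([[0.1, 0.5], [0.5, 0.1]])
mask = np.array([[True, False], [False, True]])
psnr_dynamic = masked_psnr(pred, target, mask)
```

The masked pixels have a squared error of 0.01 each, giving about 20 dB, while the large errors at the unmasked pixels are ignored. Restricting the metric this way keeps large static backgrounds from hiding errors on moving objects.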

Figure: Experimental Results Comparison

The qualitative results (Fig. 5 in the paper) show that while previous models struggle with "ghosting" artifacts on moving cars, DynamicVGGT maintains sharp, temporally consistent boundaries.

Critical Analysis & Future Outlook

Takeaway: DynamicVGGT successfully bridges the gap between static foundation models and dynamic 4D world modeling. Its reliance on "image-only" inputs makes it highly flexible for various sensor configurations.

Limitations: While the constant velocity assumption for Gaussians works for short clips (temporal offset $\delta = 1$ to $3$), it might struggle with rapid accelerations or long-term occlusions where linear motion breaks down.
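A back-of-the-envelope estimate shows why the offset matters. Assume, purely for illustration, that a point with position $x$ and velocity $u$ has a constant acceleration $a$ over a time interval $\tau$. The true position and the constant-velocity prediction are then

$$
x(t+\tau) = x(t) + \tau\,u + \tfrac{1}{2}\,\tau^{2} a,
\qquad
\hat{x}(t+\tau) = x(t) + \tau\,u,
$$

so the prediction error is

$$
\lVert x(t+\tau) - \hat{x}(t+\tau) \rVert = \tfrac{1}{2}\,\tau^{2}\,\lVert a \rVert .
$$

The error grows quadratically with the interval. As a hypothetical example, with a frame interval of 0.1 s (an assumed 10 Hz capture rate) and $\lVert a \rVert = 3\ \text{m/s}^2$, an offset of $\delta = 3$ frames gives $\tau = 0.3$ s and an error of $0.5 \times 0.09 \times 3 = 0.135$ m, while $\delta = 10$ frames would already give $1.5$ m.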

Future Work: Integrating this framework into an End-to-End Driving Model could provide the "world model" capability needed for safe motion planning, allowing the vehicle to "imagine" future 3D states of the environment accurately.

Title: [CVPR 2025] DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving
Status: SOTA on Waymo/KITTI point map reconstruction.

