DynamicVGGT is a unified feed-forward framework for 4D scene reconstruction in autonomous driving, extending the Visual Geometry Grounding Transformer (VGGT) to handle temporal dynamics. It achieves state-of-the-art results in point map accuracy and novel view synthesis on real-world datasets like Waymo and KITTI without requiring camera extrinsics or per-scene optimization.
TL;DR
DynamicVGGT extends the static 3D perception of the Visual Geometry Grounding Transformer (VGGT) into a feed-forward 4D dynamic reconstruction model. By combining Motion-aware Temporal Attention (MTA) with a Dynamic 3D Gaussian Splatting Head, it recovers both geometry and motion trajectories from raw video. It reports state-of-the-art (SOTA) results on Waymo and KITTI, showing that feed-forward models can handle complex urban dynamics without expensive per-scene optimization or camera extrinsics.
Problem & Motivation: The Dynamic Gap
While feed-forward 3D foundation models like DUSt3R and VGGT have revolutionized static scene understanding, they hit a wall in Autonomous Driving (AD). Real-world AD scenarios are not just static "frozen" moments; they are defined by moving vehicles, pedestrians, and ego-motion.
Existing methods face three primary hurdles:
- Temporal Inconsistency: Frame-by-frame 3D predictions often "flicker" because they lack an understanding of motion continuity.
- Sparsity of Real Data: Standard LiDAR supervision is too sparse and noisy for training dense, high-fidelity Gaussian reconstruction models.
- Optimization Overhead: Most dynamic Gaussian splatting methods require minutes or hours of per-scene optimization, which makes them impractical for real-time feed-forward pipelines.
The authors' insight is to treat motion as a displacement field within a shared reference system, allowing the model to learn "where a point was" and "where it will be" through a unified Dynamic Point Map (DPM) representation.
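The displacement-field idea can be illustrated with a toy example. This is a minimal sketch, not the paper's DPM implementation: the function names, the 0.05 threshold, and the use of meters are assumptions made for illustration. It only shows that once two point maps live in the same reference frame, motion reduces to a per-point difference, and static points get a zero displacement.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def displacement_field(points_t: List[Point], points_next: List[Point]) -> List[Point]:
    """Per-point displacement between two point maps in the same reference frame."""
    if len(points_t) != len(points_next):
        raise ValueError("point maps must have the same number of points")
    return [
        (q[0] - p[0], q[1] - p[1], q[2] - p[2])
        for p, q in zip(points_t, points_next)
    ]


def moving_mask(displacements: List[Point], threshold: float = 0.05) -> List[bool]:
    """Flag points whose displacement magnitude exceeds a hypothetical threshold."""
    return [
        (dx * dx + dy * dy + dz * dz) ** 0.5 > threshold
        for dx, dy, dz in displacements
    ]


# Two points: a building corner (static) and a point on a car moving 0.5 along x.
points_t = [(10.0, 2.0, 0.0), (5.0, -1.0, 0.0)]
points_next = [(10.0, 2.0, 0.0), (5.5, -1.0, 0.0)]

field = displacement_field(points_t, points_next)
mask = moving_mask(field)
print(field)  # [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
print(mask)   # [False, True]
```

Because both maps share one coordinate system, ego-motion does not leak into the displacement of static structure; only the car point is flagged as moving.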
Methodology: Encoding the 4th Dimension
DynamicVGGT builds on the VGGT backbone but introduces three critical innovations to handle time and motion.
1. Motion-aware Temporal Attention (MTA)
Instead of simply stacking temporal layers, MTA uses learnable motion tokens. These tokens act as "memory buffers" that encode motion priors across a sequence, guiding the spatial attention layers to focus on regions where movement is occurring. This maintains the stable geometric priors of the pretrained static model while adding temporal reasoning.
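One way to picture the role of the motion tokens is the two-pass sketch below. It is a hypothetical simplification, not the authors' architecture: the pass ordering, the single attention head, the missing learned projections and residual connections, and the name motion_aware_step are all assumptions. The motion tokens first read from the patch tokens of every frame, and each frame then attends to its own patches plus those updated tokens.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def motion_aware_step(frames: List[List[Vector]], motion_tokens: List[Vector]) -> List[List[Vector]]:
    # Temporal pass: motion tokens read from the patch tokens of every frame.
    all_patches = [tok for frame in frames for tok in frame]
    updated_motion = attend(motion_tokens, all_patches, all_patches)
    # Spatial pass: each frame attends to its own patches plus the motion tokens.
    refined = []
    for frame in frames:
        context = frame + updated_motion
        refined.append(attend(frame, context, context))
    return refined


frames = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0: two patch tokens of width 2
    [[1.0, 1.0], [0.5, 0.0]],  # frame 1
]
motion_tokens = [[0.1, 0.2]]   # one motion token; a learnable parameter in a real model
refined = motion_aware_step(frames, motion_tokens)
```

In this toy version the motion tokens are the only path by which information from other frames reaches a frame's spatial attention, which is the "memory buffer" intuition described above.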
2. The Dynamic 3D Gaussian Splatting Head (DGSHead)
To move beyond simple point clouds, the model predicts 3D Gaussian primitives. Unlike static Gaussians, each primitive here carries a velocity vector.
- Implicit Learning: The Future Point Head predicts the next frame's geometry to ensure inter-frame consistency.
- Explicit Learning: The DGSHead is supervised by Scene Flow, forcing the Gaussians to follow physically plausible trajectories.
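The velocity-carrying Gaussians and the scene-flow supervision can be sketched as follows. This is an illustrative sketch under stated assumptions: the DynamicGaussian class, the L1 form of the loss, and the unit time offset are hypothetical, and covariance, opacity, and color are omitted. It shows a constant-velocity update of a Gaussian center and a loss that compares the predicted motion with a reference scene flow.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class DynamicGaussian:
    center: Vec3    # 3D mean at the reference time
    velocity: Vec3  # displacement per unit time offset, assumed constant


def propagate(g: DynamicGaussian, dt: float) -> Vec3:
    """Move the Gaussian center along its velocity for a time offset dt."""
    return tuple(c + v * dt for c, v in zip(g.center, g.velocity))


def scene_flow_l1(gaussians: List[DynamicGaussian], flows: List[Vec3], dt: float = 1.0) -> float:
    """Mean L1 distance between predicted motion (velocity * dt) and reference scene flow."""
    total = 0.0
    for g, flow in zip(gaussians, flows):
        total += sum(abs(v * dt - f) for v, f in zip(g.velocity, flow))
    return total / (3 * len(gaussians))


car = DynamicGaussian(center=(5.0, -1.0, 0.0), velocity=(0.5, 0.0, 0.0))
wall = DynamicGaussian(center=(10.0, 2.0, 0.0), velocity=(0.0, 0.0, 0.0))

print(propagate(car, dt=2.0))  # (6.0, -1.0, 0.0)
print(scene_flow_l1([car, wall], flows=[(0.5, 0.0, 0.0), (0.0, 0.0, 0.0)]))  # 0.0
```

The linear update in propagate is also where a constant-velocity model becomes inaccurate: any acceleration between the reference time and the target time shows up directly as position error.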

3. Stage-wise Training & Depth Distillation
Training directly on real-world AD data can degrade geometric quality because LiDAR supervision is sparse. The authors address this with a Curriculum Learning strategy:
- Stage 1: Train on high-fidelity synthetic data (Virtual KITTI/MVS-Synth) to learn clean geometry.
- Stage 2: Fine-tune on real data (Waymo) using a Depth Distillation loss, where the Stage 1 model acts as a "teacher" to provide dense geometric guidance to the Gaussian head.
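The Stage 2 objective can be pictured with the sketch below. The source only states that the Stage 1 model supervises the student as a teacher; the L1 form of both terms, the way they are summed, and the distill_weight parameter are assumptions for illustration. The point is that the LiDAR term touches only pixels with a return, while the teacher term covers every pixel.

```python
from typing import List, Optional


def distillation_loss(
    student_depth: List[float],
    teacher_depth: List[float],
    lidar_depth: List[Optional[float]],
    distill_weight: float = 1.0,
) -> float:
    """Sparse LiDAR L1 term plus dense teacher L1 term over flattened pixels."""
    # Sparse term: only pixels that actually have a LiDAR return contribute.
    lidar_errors = [
        abs(s - l) for s, l in zip(student_depth, lidar_depth) if l is not None
    ]
    lidar_term = sum(lidar_errors) / len(lidar_errors) if lidar_errors else 0.0
    # Dense term: every pixel is supervised by the frozen Stage 1 teacher.
    dense_term = sum(
        abs(s - t) for s, t in zip(student_depth, teacher_depth)
    ) / len(student_depth)
    return lidar_term + distill_weight * dense_term


student = [10.0, 20.0, 30.0, 40.0]
teacher = [10.0, 22.0, 30.0, 38.0]
lidar = [None, None, 31.0, None]  # one LiDAR hit out of four pixels

loss = distillation_loss(student, teacher, lidar)
print(loss)  # 2.0
```

Here three of the four pixels would receive no gradient from LiDAR alone; the teacher term is what supplies dense guidance for them.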
Experiments: Superior 4D Fidelity
DynamicVGGT was evaluated against top-tier baselines like StreamVGGT and STORM.
- Point Cloud Accuracy: On KITTI Monocular, it reduced Accuracy error from 1.489 (VGGT) to 0.901, a ~40% improvement.
- 4D Synthesis: Even without camera parameters or dense annotations, it achieved a PSNR of 18.07 on dynamic-only regions of Waymo, outperforming general feed-forward LRM baselines.
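The two figures above can be checked or reproduced in spirit with small helpers. This is a sketch under assumptions: the paper's exact evaluation protocol is not given here, so masked_psnr (PSNR restricted to a boolean dynamic-object mask, with intensities in [0, 1]) is a common convention rather than the confirmed metric, and both function names are hypothetical.

```python
import math
from typing import List


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which an error metric dropped relative to the baseline."""
    return (baseline - improved) / baseline


def masked_psnr(pred: List[float], target: List[float], mask: List[bool],
                max_value: float = 1.0) -> float:
    """PSNR computed only over pixels where mask is True."""
    squared = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not squared:
        raise ValueError("mask selects no pixels")
    mse = sum(squared) / len(squared)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# KITTI Monocular Accuracy error: 1.489 (VGGT) versus 0.901 (DynamicVGGT).
print(f"{relative_reduction(1.489, 0.901):.1%}")  # 39.5%

# Toy image of three pixels; only the middle one belongs to a moving object.
pred = [0.5, 0.9, 0.2]
target = [0.5, 0.8, 0.9]
dynamic_mask = [False, True, False]
psnr = masked_psnr(pred, target, dynamic_mask)  # about 20 dB
```

Restricting the metric to dynamic pixels matters because static background usually dominates a driving image and would otherwise hide errors on moving objects; in the toy example the large error on the third pixel is ignored because it lies outside the mask.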

The qualitative results (Fig. 5 in the paper) show that while previous models struggle with "ghosting" artifacts on moving cars, DynamicVGGT maintains sharp, temporally consistent boundaries.
Critical Analysis & Future Outlook
Takeaway: DynamicVGGT successfully bridges the gap between static foundation models and dynamic 4D world modeling. Its reliance on "image-only" inputs makes it highly flexible for various sensor configurations.
Limitations: While the constant velocity assumption for Gaussians works for short clips (temporal offsets up to 3), it might struggle with rapid accelerations or long-term occlusions where linear motion breaks down.
Future Work: Integrating this framework into an End-to-End Driving Model could provide the "world model" capability needed for safe motion planning, allowing the vehicle to "imagine" future 3D states of the environment accurately.
Title: [CVPR 2025] DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving
Status: SOTA on Waymo/KITTI point map reconstruction.
