StreetForward is a pose-free and tracker-free feedforward framework for dynamic street reconstruction. It utilizes a novel Causal Masked Attention mechanism within a Transformer-based architecture to estimate 3D Gaussian Splatting (3DGS) representations and per-pixel velocities, achieving SOTA depth estimation and high-fidelity 4D novel view synthesis.
Executive Summary
TL;DR: StreetForward is a breakthrough in 4D autonomous driving simulation. It reconstructs complex, dynamic street scenes from unposed images in a single feedforward pass—eliminating the need for time-consuming per-scene optimization, camera poses, or external trackers. By introducing Causal Masked Attention, it captures precise motion vectors and high-fidelity geometry, setting new benchmarks on the Waymo Open Dataset.
Positioning: This work moves beyond the "static-first" bias of the prior Visual Geometry Grounded Transformer (VGGT). It bridges the gap between fast feedforward inference and the high-precision requirements of dynamic urban environments, serving as a foundational architecture for high-fidelity simulators and world models.
The "Unordered" Problem in 4D Reconstruction
Prior feedforward methods like VGGT were designed for Structure-from-Motion (SfM) tasks where image order doesn't necessarily matter. However, in a driving sequence, time is a vector. Existing "Alternating Attention" (AA) mechanisms treat tokens from all frames equally. This global symmetry is a weakness: it fails to represent the directed flow of motion from frame t to frame t+1, leading to ghosting artifacts and "geometry degradation" where moving objects appear as translucent smears.
Methodology: Causal Dynamics & Gaussian Splatting
The core innovation of StreetForward lies in its Causal Dynamics Modeling. Instead of letting the Transformer look at every frame simultaneously, the authors inject a frame-structured binary mask into the attention heads.
1. Causal Masked Attention
The model restricts query-key pairs to specific source-target relations. This forces the latent representations to focus purely on the temporal delta between adjacent frames, making the features inherently "motion-aware."
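To make this concrete, here is a minimal PyTorch sketch of one way such a frame-structured binary mask could be built and applied. The token layout (`num_frames` × `tokens_per_frame`), the specific "attend to the current and earlier frames" rule, and the function names are illustrative assumptions, not the paper's exact design.

```python
import torch

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Binary mask letting tokens of frame t attend only to tokens of frames <= t
    (one illustrative choice of source-target relation)."""
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)  # (T*P,)
    # allowed[i, j] is True when the query token's frame is not earlier than the key's
    return frame_ids.unsqueeze(1) >= frame_ids.unsqueeze(0)                   # (T*P, T*P)

def causal_masked_attention(q, k, v, num_frames, tokens_per_frame):
    """Scaled dot-product attention restricted by the frame-structured mask.
    q, k, v: (batch, heads, num_frames * tokens_per_frame, dim)."""
    mask = frame_causal_mask(num_frames, tokens_per_frame).to(q.device)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v
```

Restricting attention this way means a token can only aggregate evidence from its own and earlier frames, which pushes the learned features toward the temporal delta between adjacent frames rather than a time-symmetric average.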
Figure 1: The StreetForward pipeline. Note the Causal Dynamics module that decodes per-pixel velocities used to warp dynamic Gaussians across time.
2. Joint Static-Dynamic Optimization
StreetForward represents the world using 3D Gaussian Splatting (3DGS).
- Static Content: Primitives persist across the whole sequence.
- Dynamic Content: Primitives are transformed using the decoded per-pixel velocity field.
- No Supervision: Crucially, the model learns these velocities without 4D labels, relying instead on Spatio-temporal Consistency (rendering the current frame using warped Gaussians from adjacent frames), as sketched below.
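As a rough illustration of this warp-and-render idea, the sketch below advects dynamic Gaussian centers with a constant-velocity step and compares the resulting rendering against the observed frame. The dictionary layout of the Gaussians, the `render` rasterizer, and the L1 photometric loss are assumptions standing in for the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def warp_dynamic_gaussians(means, velocities, dynamic_mask, dt):
    """Advect dynamic Gaussian centers by their decoded velocity over a time step dt.
    means: (N, 3) centers; velocities: (N, 3); dynamic_mask: (N,) bool."""
    warped = means.clone()
    warped[dynamic_mask] = means[dynamic_mask] + velocities[dynamic_mask] * dt
    return warped

def consistency_loss(render, gaussians_adj, velocities, dynamic_mask, camera_t, image_t, dt):
    """Render frame t from the adjacent frame's Gaussians, warped forward by dt,
    and compare against the observed image -- no 4D labels required."""
    warped_means = warp_dynamic_gaussians(gaussians_adj["means"], velocities, dynamic_mask, dt)
    warped = {**gaussians_adj, "means": warped_means}   # static Gaussians stay put
    rendered_t = render(warped, camera_t)               # 'render' is a hypothetical rasterizer
    return F.l1_loss(rendered_t, image_t)
```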
Experimental Results: Precision and Generalization
The results on the Waymo Open Dataset are striking, particularly in geometric accuracy.
Superior Geometry
StreetForward achieves a Full Image Depth RMSE of 3.14m, significantly outperforming the previous state-of-the-art DGGT (4.08m) and STORM (5.48m). The improvement in dynamic regions is even more pronounced (RMSE 3.45m vs DGGT's 6.37m).
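For readers reproducing these numbers, depth RMSE here is the standard root-mean-square error over valid pixels; a minimal sketch follows, where the `valid_mask` / `dynamic_mask` conventions are assumptions rather than the paper's exact evaluation protocol.

```python
import torch

def depth_rmse(pred, gt, mask):
    """Root-mean-square depth error (metres) over pixels selected by `mask`."""
    err = (pred - gt)[mask]
    return torch.sqrt((err ** 2).mean())

# full_rmse    = depth_rmse(pred_depth, gt_depth, valid_mask)                  # full image
# dynamic_rmse = depth_rmse(pred_depth, gt_depth, valid_mask & dynamic_mask)   # moving objects
```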
Novel View Synthesis
When it comes to "seeing around corners" (spatial extrapolation) or predicting future states (temporal extrapolation), StreetForward produces much cleaner results than tracking-based methods.
Figure 2: Qualitative comparison. Notice how StreetForward reconstructs sharp, complete vehicle geometries and thin structures like traffic poles, whereas DGGT and STORM suffer from "sky-drifting" or blurred boundaries.
Zero-Shot Robustness
To test generalization, the authors ran the Waymo-trained model on CARLA data. StreetForward maintained its lead, proving that the causal motion priors it learns are physically grounded and not just overfitting to the Waymo sensor suite.
Critical Insight & Conclusion
The success of StreetForward highlights a pivotal shift: Inductive Bias Matters. While "General Transformers" are powerful, 4D reconstruction is a physical task governed by the arrow of time and the rigidity of objects. By baking these constraints into the attention mechanism (Causal Mask) and loss functions (Rigidity Regularization), StreetForward achieves what scaling alone could not.
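The rigidity term is not spelled out here; as a hedged illustration only, one generic form of such a prior penalizes velocity disagreement between neighboring Gaussians so that points on the same object move coherently. The function name and k-nearest-neighbor formulation below are assumptions, not the paper's loss.

```python
import torch

def rigidity_regularization(means, velocities, k=8):
    """A generic rigidity prior (assumed form): nearby Gaussians should share a velocity.
    means, velocities: (N, 3)."""
    dists = torch.cdist(means, means)                       # (N, N) pairwise distances
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]   # k nearest neighbours, skip self
    neighbour_vel = velocities[knn]                         # (N, k, 3)
    return (velocities.unsqueeze(1) - neighbour_vel).norm(dim=-1).mean()
```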
Limitations: While the model handles both rigid objects and deformable ones (like pedestrians), very fast motion across large frame gaps may still violate the constant-velocity assumption used in the temporal loss.
Future Outlook: StreetForward paves the way for "Instant Digital Twins" of any city street captured by a single pass of a camera-equipped vehicle, a holy grail for testing autonomous driving stacks in closed-loop simulations.
