VectorWorld: Efficient Streaming World Model via Diffusion Flow on Vector Graphs


[ICLR 2025] VectorWorld: Breaking the Latency and Stability Barriers in Driving Simulation

Abstract

VectorWorld is a streaming world model for autonomous driving simulation that generates ego-centric lane-agent vector-graph tiles on demand. By combining an edge-gated relational Diffusion Transformer (DiT) with a solver-free MeanFlow generator, it achieves state-of-the-art map-structure fidelity and supports stable, real-time closed-loop rollouts of 1 km and beyond.

TL;DR

VectorWorld is a state-of-the-art streaming world model that enables kilometer-scale, closed-loop autonomous driving simulation in real time. By representing the world as a dynamic vector graph and using a one-step MeanFlow generator, it avoids both the jerk spikes of history-free starts and the latency of multi-step diffusion. It generates a 64 m × 64 m map tile in just 6 ms, which lets it "outpaint" the world as the ego vehicle drives.
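To make the "outpainting" idea concrete, here is a minimal sketch of on-demand tile streaming. Only the 64 m tile size comes from the text; the one-tile neighbour radius, the grid indexing, and the `generate_tile` callback are hypothetical stand-ins for whatever the real system does.

```python
import math

TILE_SIZE_M = 64.0  # tile edge length quoted in the text


def tile_index(x_m, y_m, tile_size_m=TILE_SIZE_M):
    """Integer grid index of the tile containing the world point (x_m, y_m)."""
    return (math.floor(x_m / tile_size_m), math.floor(y_m / tile_size_m))


def tiles_needed(x_m, y_m, radius=1):
    """The ego's tile plus its neighbours within `radius` tiles (assumed policy)."""
    cx, cy = tile_index(x_m, y_m)
    return [(cx + dx, cy + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)]


def stream_tiles(ego_path, generate_tile, radius=1):
    """Walk the ego path and generate each missing tile exactly once.

    `generate_tile(index, cache)` is a hypothetical placeholder for the model
    call; it receives the tiles generated so far so it can condition on them.
    """
    cache = {}
    for x_m, y_m in ego_path:
        for idx in tiles_needed(x_m, y_m, radius):
            if idx not in cache:
                cache[idx] = generate_tile(idx, cache)
    return cache


# Usage: a straight 1 km drive sampled every 10 m, with a dummy generator.
path = [(10.0 * i, 0.0) for i in range(101)]
world = stream_tiles(path, lambda idx, cache: {"index": idx})
```

Because tiles are cached by index, the generator is called only when the ego approaches unexplored space, which is why per-tile latency matters more than total map size.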

Problem & Motivation: The Reality Gap in Closed-Loop Simulation

Most generative world models (such as GAIA-1 or DriveGAN) excel at producing visually convincing videos but fail as simulator backends. The authors identify three "deployment constraints" that standard metrics ignore:

Cold-Start Mismatch: Most models generate a static snapshot at $t = 0$. Real-world driving policies rely on history (velocity, curvature), so starting from zero history causes large jerk spikes.
The Diffusion Latency Wall: Traditional diffusion models require dozens of denoising steps. If the simulator takes 200 ms to generate the next stretch of road, it cannot support real-time interaction.
Compounding Infeasibility: Small kinematic errors (e.g., a car sliding slightly sideways) are invisible in short clips but accumulate into drift that breaks the simulation over distances of 1 km or more.
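The cold-start issue can be illustrated with a few lines of arithmetic. The sketch below estimates longitudinal jerk as the second finite difference of speed; the speed profiles, the 0.1 s step, and the assumption that a missing history is read as zero speed are all illustrative choices, not values from the paper.

```python
def jerk_from_speeds(speeds_mps, dt_s):
    """Second finite difference of speed: longitudinal jerk in m/s^3."""
    return [
        (speeds_mps[i + 1] - 2.0 * speeds_mps[i] + speeds_mps[i - 1]) / dt_s ** 2
        for i in range(1, len(speeds_mps) - 1)
    ]


def peak_abs_jerk(speeds_mps, dt_s):
    """Largest jerk magnitude over the profile."""
    return max(abs(j) for j in jerk_from_speeds(speeds_mps, dt_s))


DT_S = 0.1  # assumed simulation step

# Cold start: the agent appears at 10 m/s, but its (missing) history reads as 0 m/s.
cold_start = [0.0, 0.0, 10.0, 10.0, 10.0]
# Warm start: the generated history already agrees with the current speed.
warm_start = [10.0, 10.0, 10.0, 10.0, 10.0]

cold_peak = peak_abs_jerk(cold_start, DT_S)
warm_peak = peak_abs_jerk(warm_start, DT_S)
```

With a consistent history the estimate is zero, while the history-free profile produces a spike at the very first steps, which is the effect a history-aware initial state is meant to remove.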

[Figure: System Overview and Deployment Gaps]

Methodology: Vector Graphs and Fast Transport

1. The Interaction-State Interface (Motion-Aware VAE)

To solve the "cold-start" problem, VectorWorld doesn't just generate a car; it generates an interaction state. This includes a 7D static state and a compact "motion code" (recent history polyline). A gated VAE learns to suppress noise for parked cars while preserving the momentum of moving ones.
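One way to picture the interaction state is as a small record whose motion code is scaled by a gate. In the sketch below, the seven static components are left unnamed because the text does not list them, and the hand-written sigmoid gate on history displacement is only a stand-in for the gate that the VAE learns.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionState:
    static: Tuple[float, ...]       # 7 values; their meaning is not given in the text
    motion_code: Tuple[float, ...]  # compact code summarising the recent history

    def __post_init__(self):
        if len(self.static) != 7:
            raise ValueError("static state must have 7 components")


def motion_gate(history_xy, threshold_m=0.5, sharpness=10.0):
    """Hypothetical gate in (0, 1): near 0 for parked agents, near 1 for moving ones.

    Uses the start-to-end displacement of the history polyline; the threshold
    and sharpness are made-up constants standing in for learned parameters.
    """
    (x0, y0), (x1, y1) = history_xy[0], history_xy[-1]
    displacement = math.hypot(x1 - x0, y1 - y0)
    return 1.0 / (1.0 + math.exp(-sharpness * (displacement - threshold_m)))


def gate_motion_code(state, history_xy):
    """Scale the motion code by the gate, leaving the static state untouched."""
    g = motion_gate(history_xy)
    return InteractionState(state.static, tuple(g * c for c in state.motion_code))


# Usage: the same code is nearly erased for a parked car and kept for a moving one.
state = InteractionState(static=(0.0,) * 7, motion_code=(0.3, -0.2))
parked = gate_motion_code(state, [(0.0, 0.0), (0.01, 0.0)])
moving = gate_motion_code(state, [(0.0, 0.0), (5.0, 0.0)])
```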

2. Edge-Gated Relational DiT

Maps are graphs, not pixels. VectorWorld uses an Edge-Gated DiT that modulates attention based on lane connectivity (predecessor, successor, neighbor). This ensures that when the model generates a new tile, its lanes actually connect to those of the previous tiles, a task at which standard Vision Transformers often fail.
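The summary does not say exactly how the gate enters the attention computation. As one plausible reading, the sketch below multiplies each unnormalised attention weight by a per-edge-type gate before normalising; the `"self"` and `"none"` edge types, the gate values, and the single-head layout are assumptions made for illustration.

```python
import math


def edge_gated_attention(queries, keys, values, edge_type, gate):
    """Single-head attention over lane tokens with a multiplicative edge gate.

    queries, keys, values: one float vector per lane token.
    edge_type[i][j]: relation of lane j as seen from lane i, e.g. "self",
        "predecessor", "successor", "neighbor" or "none".
    gate: maps edge type to a non-negative multiplier (learned in a real model).
        The gate for "self" should be positive so every row can be normalised.
    Returns (outputs, weights).
    """
    scale = math.sqrt(len(queries[0]))
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        raw = [gate[edge_type[i][j]] * math.exp(s - top)
               for j, s in enumerate(scores)]
        total = sum(raw)
        weights = [w / total for w in raw]
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
        all_weights.append(weights)
    return outputs, all_weights


# Usage: lane 0 feeds into lane 1; lane 2 is unconnected to both.
gate = {"self": 1.0, "predecessor": 1.0, "successor": 1.0,
        "neighbor": 1.0, "none": 0.0}
edges = [["self", "successor", "none"],
         ["predecessor", "self", "none"],
         ["none", "none", "self"]]
tokens = [[0.0, 0.0]] * 3          # identical queries and keys
vals = [[1.0], [3.0], [100.0]]
out, attn = edge_gated_attention(tokens, tokens, vals, edges, gate)
```

With the `"none"` gate at zero, the unconnected lane contributes nothing to lanes 0 and 1, so information flows only along the lane graph.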

[Figure: Edge-Gated Relational DiT Architecture]

3. One-Step MeanFlow with JVP

This is the "secret sauce" for speed. Instead of standard multi-step diffusion sampling, the authors use MeanFlow trained with a Jacobian-vector product (JVP) objective, which supervises the model to stay accurate even when it makes a single large jump from noise to data. The result? A single forward pass generates a consistent map tile in 6 ms.
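For readers unfamiliar with the objective, here is a toy scalar sketch based on the MeanFlow formulation as generally published, not on VectorWorld's code. It assumes the interpolation $z_t = (1-t)\,x + t\,\epsilon$ with velocity $v = \epsilon - x$ and the regression target $u_{\text{tgt}} = v - (t-r)\,\frac{d}{dt}u(z_t, r, t)$, where $\frac{d}{dt}u = v\,\partial_z u + \partial_t u$ is a single JVP with tangents $(v, 0, 1)$. The small `Dual` class is a hypothetical stand-in for an autodiff library's JVP routine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Forward-mode dual number: a value and its tangent."""
    val: float
    tan: float = 0.0

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, o):
        o = Dual.lift(o)
        return Dual(self.val + o.val, self.tan + o.tan)
    __radd__ = __add__

    def __sub__(self, o):
        o = Dual.lift(o)
        return Dual(self.val - o.val, self.tan - o.tan)

    def __rsub__(self, o):
        return Dual.lift(o) - self

    def __mul__(self, o):
        o = Dual.lift(o)
        return Dual(self.val * o.val, self.tan * o.val + self.val * o.tan)
    __rmul__ = __mul__

    def __truediv__(self, o):
        o = Dual.lift(o)
        return Dual(self.val / o.val,
                    (self.tan * o.val - self.val * o.tan) / (o.val * o.val))


def total_time_derivative(u, z, r, t, v):
    """d/dt u(z_t, r, t) = v * du/dz + du/dt, as one JVP with tangents (v, 0, 1)."""
    return u(Dual(z, v), Dual(r, 0.0), Dual(t, 1.0)).tan


def meanflow_target(u, z, r, t, v):
    """Regression target for u; treated as a constant (stop-gradient) in training."""
    return v - (t - r) * total_time_derivative(u, z, r, t, v)


def one_step_sample(u, noise):
    """Jump from noise at t = 1 to data at t = 0 with a single evaluation of u."""
    return noise - u(Dual(noise), Dual(0.0), Dual(1.0)).val


# Toy check: if all data sits at X0, the exact average velocity is (z - X0) / t.
X0 = 2.0


def u_point_mass(z, r, t):
    return (z - X0) / t


eps, t, r = -1.0, 0.5, 0.1
z_t = (1 - t) * X0 + t * eps   # point on the noise-to-data path
v_t = eps - X0                 # instantaneous velocity on that path
target = meanflow_target(u_point_mass, z_t, r, t, v_t)
sample = one_step_sample(u_point_mass, eps)
```

In this toy case the exact average velocity already satisfies the identity, so the target equals `v_t` and the one-step sample lands on `X0`. A trained network is pushed toward the same self-consistency, which is what allows sampling without an ODE solver.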

Experiments: Validating the Long Horizon

The model was evaluated on the Waymo and nuPlan datasets. Unlike previous work that only looks at the first 10 seconds, VectorWorld was pushed to rollouts of 1 km and beyond.

Map Fidelity: Endpoint distance (a measure of lane gaps) dropped by 68% compared to SLEDGE and ScenDream.
Dynamic Stability: Using the history-aware "WarmStart" reduced ego vehicle jerk from 16.6 to 9.6.
Policy Training: A planning agent (PPO) trained in VectorWorld's simulated environment saw its success rate jump from 25.0% to 56.0% when tested in difficult scenarios.
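To put the three headline numbers on a common footing, the short calculation below converts them to relative changes. It uses only the figures quoted above; the units of jerk are not stated in this summary, so only ratios are computed.

```python
def relative_change(before, after):
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


# Endpoint distance: a 68% drop leaves 32% of the baseline gap.
endpoint_remaining = 1.0 - 0.68

# Ego jerk with the history-aware warm start: 16.6 -> 9.6, roughly a 42% reduction.
jerk_change = relative_change(16.6, 9.6)

# Policy success rate: 25.0% -> 56.0%, i.e. +31 percentage points or 2.24x.
success_points = 56.0 - 25.0
success_ratio = 56.0 / 25.0
```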

[Figure: Qualitative Results on Waymo and nuPlan]

Critical Insight & Conclusion

VectorWorld shifts the paradigm from "video generation" to "structured graph completion." By respecting the physics of motion (via ΔSim) and the topology of roads (via Edge-Gating), it proves that we don't need heavy, multi-step diffusion models to create immersive, high-fidelity driving environments.

Future Outlook: While it handles lane centerlines well, adding semantic layers such as curb heights or fine-grained texture for end-to-end sensor simulation (camera/LiDAR) would be the natural next step for this framework.
