VectorWorld is a streaming world model for autonomous driving simulation that generates ego-centric lane-agent vector-graph tiles on demand. By combining an edge-gated relational Diffusion Transformer (DiT) with a solver-free MeanFlow generator, it reports state-of-the-art map-structure fidelity and supports stable, real-time closed-loop rollouts of 1 km and beyond.
TL;DR
VectorWorld is a streaming world model that enables kilometer-scale, closed-loop autonomous driving simulation in real time. By representing the world as a dynamic vector graph and using a one-step MeanFlow generator, it avoids both the jerk spikes of history-free starts and the latency of multi-step diffusion. It generates a 64 m × 64 m map tile in just 6 ms, fast enough to "outpaint" the world as the ego vehicle drives.
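
To make the "outpainting" idea concrete, here is a minimal sketch of on-demand tile streaming. It is not the authors' code: the `TileStreamer` class, the 3x3 neighborhood policy, and the `generate` callback are all hypothetical, and the stub generator ignores the neighboring tiles that a real outpainting model would condition on. Only the 64 m tile size comes from the text.

```python
import math
from typing import Callable, Dict, Tuple

TILE_SIZE_M = 64.0  # tile edge length stated in the text

TileIndex = Tuple[int, int]


def tile_index(x: float, y: float, size: float = TILE_SIZE_M) -> TileIndex:
    """Map a world position in meters to the integer index of its tile."""
    return (math.floor(x / size), math.floor(y / size))


class TileStreamer:
    """Generates each tile once, the first time the ego comes near it."""

    def __init__(self, generate: Callable[[TileIndex], object]):
        self.generate = generate  # hypothetical stand-in for the tile generator
        self.tiles: Dict[TileIndex, object] = {}

    def update(self, x: float, y: float) -> int:
        """Ensure the ego's tile and its 8 neighbors exist; return how many were created."""
        cx, cy = tile_index(x, y)
        created = 0
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                idx = (cx + dx, cy + dy)
                if idx not in self.tiles:
                    self.tiles[idx] = self.generate(idx)
                    created += 1
        return created


streamer = TileStreamer(generate=lambda idx: {"index": idx})
# Drive 1 km straight along +x in 1 m steps and record tiles created per step.
new_tiles = [streamer.update(float(x), 0.0) for x in range(0, 1001)]
```

In this toy run, most steps create no tiles at all; new tiles appear only when the ego crosses a tile boundary, which is why a per-tile cost of a few milliseconds leaves plenty of headroom for real-time use.
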
Problem & Motivation: The Reality Gap in Closed-Loop Simulation
Most generative world models (like GAIA-1 or DriveGAN) excel at producing pretty videos but fail as simulator backends. The authors identify three "deployment constraints" that standard metrics ignore:
- Cold-Start Mismatch: Most models generate a static snapshot at t = 0. Real-world driving policies rely on history (velocity, curvature), so starting from "zero history" causes massive jerk spikes.
- The Diffusion Latency Wall: Traditional diffusion models require dozens of denoising steps. If the simulator takes 200ms to generate the next stretch of road, it cannot support real-time interaction.
- Compounding Infeasibility: Small kinematic errors (e.g., a car sliding slightly sideways) are invisible in short clips but cause "drifts" that break the simulation over 1km+ distances.
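
The cold-start problem is easy to reproduce numerically. The sketch below assumes jerk is the third finite difference of position; the source does not say how its jerk metric is defined, and the time step, speed, and trajectories here are made up for illustration.

```python
import numpy as np


def jerk_magnitude(positions: np.ndarray, dt: float) -> np.ndarray:
    """Finite-difference jerk magnitude for an (N, 2) array of x, y positions."""
    velocity = np.diff(positions, axis=0) / dt
    acceleration = np.diff(velocity, axis=0) / dt
    jerk = np.diff(acceleration, axis=0) / dt
    return np.linalg.norm(jerk, axis=1)


dt = 0.1      # assumed simulation step in seconds
speed = 10.0  # assumed cruising speed in m/s
steps = np.arange(8)

# Warm start: the agent's history already shows constant-speed motion.
warm = np.stack([speed * dt * steps, np.zeros(8)], axis=1)

# Cold start: the history says "stationary", then the agent is suddenly moving.
cold = warm.copy()
cold[:3] = cold[3]  # pin the first samples to one point (zero history)

warm_peak = jerk_magnitude(warm, dt).max()  # close to zero
cold_peak = jerk_magnitude(cold, dt).max()  # large spike at the transition
```

The only difference between the two trajectories is the first three history samples, yet the cold-start version produces a jerk spike that a history-dependent policy would have to absorb.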

Methodology: Vector Graphs and Fast Transport
1. The Interaction-State Interface (Motion-Aware VAE)
To solve the "cold-start" problem, VectorWorld doesn't just generate a car; it generates an interaction state. This includes a 7D static state and a compact "motion code" (recent history polyline). A gated VAE learns to suppress noise for parked cars while preserving the momentum of moving ones.
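
As a rough illustration of the interface, the sketch below pairs a 7D static vector with a gated motion code. Everything except the 7D size is an assumption: the `InteractionState` name, the use of flattened displacements as the "code", and the hand-written sigmoid gate, which stands in for a gate that the real VAE learns.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class InteractionState:
    static: np.ndarray       # shape (7,), the 7D static state from the text
    motion_code: np.ndarray  # compact summary of recent history (size is an assumption)


def motion_gate(history: np.ndarray, threshold_m: float = 0.5, sharpness: float = 50.0) -> float:
    """Hand-written stand-in for the learned gate: near 0 if parked, near 1 if moving."""
    path_length = np.linalg.norm(np.diff(history, axis=0), axis=1).sum()
    return float(1.0 / (1.0 + np.exp(-sharpness * (path_length - threshold_m))))


def encode(static: np.ndarray, history: np.ndarray) -> InteractionState:
    """Build an interaction state; the 'encoder' here is just gated, flattened displacements."""
    displacements = np.diff(history, axis=0).ravel()
    return InteractionState(static=static, motion_code=motion_gate(history) * displacements)


rng = np.random.default_rng(0)
static = np.zeros(7)

# Parked car: the history is pure jitter of about 1 cm.
parked = encode(static, 0.01 * rng.standard_normal((5, 2)))
# Moving car: 1 m per step along x.
moving = encode(static, np.stack([np.arange(5.0), np.zeros(5)], axis=1))
```

The parked car's motion code collapses to nearly zero, so jitter is not mistaken for momentum, while the moving car's displacements pass through unchanged.
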
2. Edge-Gated Relational DiT
Maps are graphs, not pixels. VectorWorld uses an Edge-Gated DiT that modulates attention based on lane connectivity (predecessor, successor, neighbor). This ensures that when the model generates a new tile, its lanes actually connect to those of the previous tiles, something standard Vision Transformers often fail to do.
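
One plausible form of edge gating is sketched below: a per-edge-type gate that scales attention weights between lane tokens. The source does not specify the exact mechanism, so the additive log-gate, the extra "none" and "self" edge types, and all shapes and values are assumptions for illustration.

```python
import numpy as np

EDGE_TYPES = ("none", "self", "predecessor", "successor", "neighbor")


def edge_gated_attention(x, edge_type, gate_logit, w_q, w_k, w_v):
    """Single-head attention whose weights are scaled by a per-edge-type gate.

    x:          (N, D) lane-token features
    edge_type:  (N, N) integer matrix indexing EDGE_TYPES
    gate_logit: (len(EDGE_TYPES),) one scalar per edge type (learned in a real model)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    gate = 1.0 / (1.0 + np.exp(-gate_logit[edge_type]))  # (N, N), values in (0, 1)
    # Adding log(gate) scales each unnormalized attention weight by its gate.
    scores = scores + np.log(gate + 1e-9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
dim = 4
x = rng.standard_normal((3, dim))
w_q, w_k, w_v = (0.1 * rng.standard_normal((dim, dim)) for _ in range(3))

# Three lanes in a chain 0 -> 1 -> 2; lanes 0 and 2 share no edge.
edge_type = np.array([
    [1, 3, 0],
    [2, 1, 3],
    [0, 2, 1],
])
gate_logit = np.array([-20.0, 5.0, 5.0, 5.0, 5.0])  # close the "none" gate, open the rest

out, weights = edge_gated_attention(x, edge_type, gate_logit, w_q, w_k, w_v)
```

With the "none" gate closed, lane 0 effectively attends only to itself and its successor, which is the kind of connectivity bias the text describes.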

3. One-Step MeanFlow with JVP
This is the "secret sauce" for speed. Instead of standard diffusion, the authors use MeanFlow trained with a Jacobian-vector product (JVP) objective. This supervises the model to be accurate even in a single large jump from noise to data. The result? A single forward pass generates a consistent map tile in 6ms.
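
The sketch below shows the mechanics on a toy problem where the answer is known in closed form. It follows the conventions of the original MeanFlow formulation (noise at t = 1, data at t = 0, average velocity u(z, r, t) over the interval [r, t]), which VectorWorld may or may not follow exactly. The "data" is a single point `MU`, the network is replaced by an analytic function, and the JVP is approximated with finite differences; all of these are assumptions made to keep the example self-contained.

```python
import numpy as np

MU = np.array([2.0, -1.0])  # toy "data": a single point, so the flow is known exactly


def velocity(z, t):
    """Instantaneous velocity of the path z_t = (1 - t) * MU + t * noise."""
    return (z - MU) / t


def mean_velocity(z, r, t):
    """Average velocity over [r, t]; stands in for the trained network u(z, r, t)."""
    return (z - MU) / t


def total_derivative(u, z, r, t, eps=1e-5):
    """Finite-difference stand-in for the JVP of u along the tangent (v, 0, 1)."""
    v = velocity(z, t)
    return (u(z + eps * v, r, t + eps) - u(z - eps * v, r, t - eps)) / (2 * eps)


def meanflow_target(u, z, r, t):
    """Regression target from the MeanFlow identity: v - (t - r) * du/dt."""
    return velocity(z, t) - (t - r) * total_derivative(u, z, r, t)


def sample_one_step(u, noise):
    """One large jump from noise (t = 1) to data (t = 0)."""
    return noise - u(noise, 0.0, 1.0)


rng = np.random.default_rng(0)
noise = rng.standard_normal(2)

# A point on the path at t = 0.7, checked against the identity with r = 0.2.
z = 0.3 * MU + 0.7 * noise
target = meanflow_target(mean_velocity, z, 0.2, 0.7)
sample = sample_one_step(mean_velocity, noise)
```

Here the analytic `mean_velocity` already satisfies the identity, so `target` matches its output and the one-step sample lands on `MU`. In training, the same target would instead be used as a regression signal for a network, with the derivative term computed by a JVP rather than finite differences.
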
Experiments: Validating the Long Horizon
The model was tested on the Waymo and nuPlan datasets. Unlike previous work that only looks at the first 10 seconds, VectorWorld was pushed to rollouts of 1 km and beyond.
- Map Fidelity: Endpoint distance (a measure of lane gaps) dropped by 68% compared to SLEDGE and ScenDream.
- Dynamic Stability: Using the history-aware "WarmStart" reduced ego vehicle jerk from 16.6 to 9.6.
- Policy Training: A planning agent (PPO) trained in VectorWorld's simulated environment saw its success rate jump from 25.0% to 56.0% when tested in difficult scenarios.
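
For readers who prefer relative numbers, the reported figures can be converted with a few lines of arithmetic. The helper name is hypothetical; the inputs are the values quoted in the list above.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change; for example, -0.25 means a 25% reduction."""
    return (after - before) / before


jerk_change = relative_change(16.6, 9.6)  # WarmStart effect on ego jerk
success_gain_points = 56.0 - 25.0         # absolute gain in percentage points
success_ratio = 56.0 / 25.0               # multiplicative gain in success rate

print(f"jerk change: {jerk_change:+.1%}")
print(f"success rate: +{success_gain_points:.1f} points ({success_ratio:.2f}x)")
```

In other words, WarmStart cuts jerk by roughly 42%, and the success rate more than doubles, a gain of 31 percentage points.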

Critical Insight & Conclusion
VectorWorld shifts the paradigm from "video generation" to "structured graph completion." By respecting the physics of motion (via ΔSim) and the topology of roads (via Edge-Gating), it makes the case that heavy, multi-step diffusion models are not required to create high-fidelity driving environments.
Future Outlook: While it handles centerlines well, adding semantic layers such as curb heights or fine-grained texture for end-to-end sensor simulation (camera/LiDAR) would be the natural next step for this framework.
