[CVPR 2025] Vega: Bridging the Gap Between Natural Language and Autonomous Action
Abstract

Vega is a unified vision-language-world-action model designed for instruction-based autonomous driving. By integrating an autoregressive transformer for multimodal understanding and a diffusion transformer for trajectory planning and scene generation, it achieves state-of-the-art results on the NAVSIM v2 benchmark (89.4 EPDMS).

TL;DR

Autonomous driving is evolving from "copying the driver" (Imitation Learning) to "listening to the passenger." Tsinghua University and GigaAI present Vega, a Vision-Language-Action (VLA) model that can perform complex tasks like "overtake the front car to catch the green light" based on natural language. By treating future video prediction as a "world modeling" task, Vega provides the dense supervision needed to map human instructions to precise steering and acceleration.

The "Instruction Following" Problem

Most current End-to-End driving models focus on imitating an "averaged expert." While this works for standard navigation, it lacks personalization and flexibility. If a user says "Pull up to the side and wait," a standard model might ignore the command because it wasn't in the expert's "average" behavior.

The technical bottleneck is the Information Disparity: instructions are high-dimensional and semantically rich, while actions (steering, throttle) are low-dimensional. Training with action labels alone (sparse supervision) gives the model little signal for why a given sentence should change its behavior.

Methodology: World Modeling as Dense Supervision

Vega's core insight is that to act correctly, a model must first imagine the outcome. By training the model to predict the next frame (World Modeling), the authors provide pixel-level feedback. If the instruction is "Turn left," the model must generate an image of a left turn, forcing it to learn the causal link between the text and the physical environment.
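The dense-supervision idea can be captured as a combined objective: a sparse loss over a few future waypoints plus a dense pixel-level loss over the predicted next frame. The sketch below is illustrative only, assuming simple L2 terms and a hypothetical weighting `lam`; the paper's actual losses may differ.

```python
import numpy as np

def vega_style_loss(pred_traj, gt_traj, pred_frame, gt_frame, lam=1.0):
    """Hypothetical combined objective (names illustrative): sparse action
    supervision plus dense world-model supervision."""
    # Sparse signal: error over a handful of future waypoints.
    action_loss = float(np.mean((pred_traj - gt_traj) ** 2))
    # Dense signal: error over every pixel of the predicted next frame,
    # which forces the model to "imagine" the instruction's outcome.
    frame_loss = float(np.mean((pred_frame - gt_frame) ** 2))
    return action_loss + lam * frame_loss
```

Even with a perfect trajectory, a wrong imagined frame keeps the loss high, which is exactly the extra feedback the sparse action labels cannot provide.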

The Integrated Transformer Architecture

Vega uses a Joint Autoregressive-Diffusion Architecture.

  • Autoregressive Pipeline: Handles the understanding of past images and text instructions.
  • Diffusion Pipeline: Generates future trajectories and future image frames by iteratively denoising from noise to signal.
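The two-stage flow can be sketched as: encode the past into a conditioning vector, then iteratively denoise a trajectory under that condition. This is a minimal numpy toy, not the authors' implementation; both functions are illustrative stand-ins for the real transformers.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_context(frames, instruction_tokens):
    # Autoregressive pipeline (toy stand-in): fuse past frames and the
    # tokenized instruction into one conditioning vector. The real model
    # runs a multimodal autoregressive transformer here.
    return np.concatenate([frames.mean(axis=0), instruction_tokens.mean(axis=0)])

def denoise_trajectory(cond, steps=20, horizon=8):
    # Diffusion pipeline (toy stand-in): start from pure noise and refine
    # toward a conditioned estimate. A real diffusion transformer predicts
    # each update from (noisy trajectory, timestep, condition).
    traj = rng.normal(size=(horizon, 2))
    target = np.tile(cond[:2], (horizon, 1))  # toy conditioning signal
    for _ in range(steps):
        traj = traj + 0.5 * (target - traj)   # one "denoising" step
    return traj
```

The point of the split is that understanding (discrete, autoregressive) and generation (continuous, iterative refinement) get the inference procedure best suited to each.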

Model Architecture

The model utilizes a Mixture-of-Transformers (MoT). Unlike MoE, which only swaps Feed-Forward Networks (FFN), MoT uses dedicated transformer blocks for each modality (Visual, Text, Action). This prevents "parameter interference," where learning to generate images might otherwise degrade the model's ability to plan trajectories.
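The routing difference from MoE can be made concrete: in MoT, each modality owns an entire block rather than just an FFN, and tokens are routed by their known modality rather than by a learned gate. Below is a toy numpy sketch of that idea; note the real MoT still shares global self-attention across modalities, which this sketch omits for brevity.

```python
import numpy as np

class TinyBlock:
    """Stand-in for a full transformer block (attention + FFN + norms)."""
    def __init__(self, dim, seed):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    def __call__(self, x):
        return x + np.tanh(x @ self.w)  # residual update

class MoTLayer:
    """Toy Mixture-of-Transformers layer: one dedicated block per modality,
    so gradients from image generation never touch the action parameters
    (avoiding the "parameter interference" described above)."""
    def __init__(self, dim):
        self.blocks = {"visual": TinyBlock(dim, 0),
                       "text":   TinyBlock(dim, 1),
                       "action": TinyBlock(dim, 2)}
    def __call__(self, tokens, modalities):
        # Route each token to its modality's dedicated parameters.
        out = np.empty_like(tokens)
        for i, m in enumerate(modalities):
            out[i] = self.blocks[m](tokens[i])
        return out
```

An MoE layer would instead share attention and norms and gate only among FFN experts; MoT's deterministic, modality-wide split is what isolates the three skills from one another.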

Experiments & Results: Beyond Imitation

The authors introduced InstructScene, a dataset of 100,000 scenes annotated with diverse instructions using GPT-4o. On the NAVSIM benchmarks, Vega outperformed leading VLA models:

  • NAVSIM v2: Achieved 89.4 EPDMS, showing superior adherence to traffic lights and lane keeping.
  • Ablation Study: Removing the future image prediction task caused performance to drop significantly (from 77.9 to 51.8 PDMS), proving that "World Modeling" is essential for instruction alignment.

Performance Comparison

Visual Evidence

As shown in the qualitative results, Vega can plan entirely different trajectories for the same intersection depending on whether the instruction is to "Follow the car" or "Stop at the crosswalk."

Instruction Visuals

Critical Analysis & Conclusion

Vega demonstrates that the future of autonomous driving lies in Unified Multimodal Models. By merging generative world modeling with action planning, we overcome the limitations of sparse data.

Limitations: Currently, the model relies on front-view cameras only. To reach L4 autonomy, extending the architecture to multi-view 360-degree vision and 3D occupancy prediction will be the next frontier.

Takeaway: If you want your agent to follow instructions, don't just teach it what to do; teach it what the world should look like after it performs the action.
