RAD-2 is a unified generator-discriminator framework for autonomous driving that combines a diffusion-based trajectory generator with a Reinforcement Learning (RL) optimized discriminator. It achieves a state-of-the-art reduction in collision rates (over 56%) by decoupling high-dimensional trajectory generation from low-dimensional preference evaluation in a shared closed-loop simulation.
TL;DR
RAD-2 scales Reinforcement Learning (RL) for motion planning by decoupling the problem into two parts: a Diffusion Generator that explores the trajectory manifold and an RL Discriminator that selects the safest path. Using a high-throughput BEV-Warp simulator and a novel TC-GRPO optimization strategy, RAD-2 cuts collision rates by over 56% while maintaining human-like driving efficiency.
Problem: The "Action Space" Curse in Driving RL
Reinforcement Learning in autonomous driving has long faced a fundamental contradiction. To be safe, a model needs a continuous, high-dimensional action space (trajectories). However, RL optimization thrives on low-dimensional signals. When you try to map a sparse reward (e.g., "you crashed") back to a complex 5-second trajectory, credit assignment becomes extremely hard: the reward does not say which part of the trajectory caused the crash.
Previous diffusion-based planners (like DiffusionDrive or ResAD) relied purely on Imitation Learning (IL). While great at mimicking human "average" behavior, they lack a mechanism to learn from mistakes or understand why a specific maneuver was safer than another in a closed-loop interactive environment.
Methodology: Divide, Rerank, and Conquer
RAD-2 introduces a "Reviewer-Author" architecture. Instead of asking RL to "write" the trajectory, RAD-2 asks RL to "grade" trajectories written by a pre-trained expert (the Diffusion model).
1. The Generator-Discriminator Framework
- Generator: A diffusion-based model (DiT) that outputs diverse candidate trajectories. It handles the multimodality of driving, such as deciding whether to go left or right around an obstacle.
- Discriminator: A Transformer-based head that takes the candidates and the BEV scene context and outputs a preference score. This is the component that RL optimizes.
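The generate-then-rerank loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the RAD-2 implementation: `toy_generator` and `toy_discriminator` are hypothetical stand-ins (random lateral offsets and a hand-written clearance score) for the diffusion model and the learned scoring head.

```python
import numpy as np

def toy_generator(num_candidates, horizon=10, seed=0):
    """Hypothetical stand-in for the diffusion generator:
    straight paths along x with random lateral (y) offsets."""
    rng = np.random.default_rng(seed)
    offsets = rng.uniform(-3.0, 3.0, size=num_candidates)
    x = np.linspace(0.0, 20.0, horizon)
    # Result shape: (num_candidates, horizon, 2)
    return np.stack([np.stack([x, np.full(horizon, o)], axis=-1) for o in offsets])

def toy_discriminator(candidates, obstacle_xy):
    """Hypothetical stand-in for the learned scorer:
    the score is the minimum distance to a single obstacle."""
    dists = np.linalg.norm(candidates - obstacle_xy, axis=-1)  # (N, horizon)
    return dists.min(axis=1)

def plan(num_candidates, obstacle_xy):
    """Generate candidates, score them, and return the top-ranked one."""
    candidates = toy_generator(num_candidates)
    scores = toy_discriminator(candidates, obstacle_xy)
    best = int(np.argmax(scores))
    return candidates[best], scores

trajectory, scores = plan(num_candidates=8, obstacle_xy=np.array([10.0, 0.0]))
```

The point of the split is visible in the shapes: the generator works in the full trajectory space, while the discriminator only has to produce one scalar per candidate.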

2. TC-GRPO: Stabilizing the Gradient
The authors adapt Group Relative Policy Optimization (GRPO), recently made famous by DeepSeek, to driving. They introduce Temporal Consistency (TC):
- Latched Execution: A chosen trajectory is reused for a short horizon to ensure the vehicle doesn't "jitter" between modes.
- Structured Advantage: Rewards are computed relative to a group of rollouts starting from the same state, effectively denoising the signal.
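The two ideas above can be sketched as follows. This is an illustration under assumptions: the advantage uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation), and the exact form used by TC-GRPO may differ. The helper names and the `latch_horizon` parameter are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize each rollout's reward
    against the other rollouts that started from the same state."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def selection_steps(num_steps, latch_horizon):
    """Steps at which a new trajectory is selected. Between these steps
    the previously chosen trajectory is reused (latched)."""
    return [t for t in range(num_steps) if t % latch_horizon == 0]

# Four rollouts from the same state: two succeed (reward 1), two crash (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
decision_points = selection_steps(num_steps=10, latch_horizon=4)
```

Because the baseline is the group mean, a rollout is only rewarded for doing better than its siblings from the same starting state, which removes the part of the reward that depends on how hard the scenario was.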
3. BEV-Warp: The High-Throughput Engine
To train RL at scale, you need speed. Traditional simulators such as CARLA are slow because they render images. RAD-2 uses BEV-Warp, which performs spatial transformations directly on the internal 2D feature maps of the perception system. This allows closed-loop simulation at high throughput without the overhead of image generation.
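To make "spatial transformations on 2D feature maps" concrete, here is a minimal sketch of a rigid warp applied to a BEV feature tensor. It is not the BEV-Warp implementation: the function name, the `(C, H, W)` layout, the axis conventions, and the nearest-neighbour sampling are all assumptions chosen for brevity.

```python
import numpy as np

def warp_bev(features, shift_rows, shift_cols, yaw_rad):
    """Resample a (C, H, W) BEV feature map under a rigid 2D motion.
    Uses inverse mapping with nearest-neighbour sampling; cells that
    would sample from outside the map are filled with zeros."""
    _, h, w = features.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r0, c0 = rows - cy, cols - cx  # coordinates relative to the map centre
    cos_y, sin_y = np.cos(yaw_rad), np.sin(yaw_rad)
    # For every output cell, compute the source cell it samples from.
    src_r = np.rint(cos_y * r0 - sin_y * c0 + cy + shift_rows).astype(int)
    src_c = np.rint(sin_y * r0 + cos_y * c0 + cx + shift_cols).astype(int)
    valid = (src_r >= 0) & (src_r < h) & (src_c >= 0) & (src_c < w)
    out = np.zeros_like(features)
    out[:, valid] = features[:, src_r[valid], src_c[valid]]
    return out

# One-channel map with a single active cell at the centre.
bev = np.zeros((1, 5, 5))
bev[0, 2, 2] = 1.0
shifted = warp_bev(bev, shift_rows=1, shift_cols=0, yaw_rad=0.0)
```

The cost is one gather over the feature grid per step, with no rendering and no re-run of the image backbone, which is the intuition behind the throughput claim.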

Experiments: Proving the Gains
RAD-2 was tested against heavyweights like VADv2 and TransFuser. The results in the BEV-Warp environment were striking:
- Safety: Collision Rate dropped from 0.533 (ResAD) to 0.234.
- Efficiency: In efficiency-oriented scenarios, route completion (EP@1.0) rose from 51.6% to 73.6%.
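As a quick check, the headline "over 56%" figure follows from the collision rates listed above, read as a relative reduction. The variable names are illustrative only.

```python
baseline_cr = 0.533  # ResAD collision rate, as reported above
rad2_cr = 0.234      # RAD-2 collision rate, as reported above

# Relative reduction in collision rate.
relative_reduction = (baseline_cr - rad2_cr) / baseline_cr  # about 0.561

# Route completion gain, in percentage points.
ep_gain_points = round(73.6 - 51.6, 1)

print(f"Collision rate reduction: {relative_reduction:.1%}")
print(f"EP@1.0 gain: {ep_gain_points} percentage points")
```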
In real-world qualitative tests, RAD-2 demonstrated "proactive deceleration." While baseline models often waited too late to react to merging vehicles, RAD-2's discriminator identified the low-reward (high-risk) outcome early and selected a smoother, safer alternative.

Critical Insight: Why Does This Work?
The brilliance of RAD-2 lies in its inference-time scaling. Because the discriminator is trained to rank, you can increase the number of candidate "guesses" at test time. As that number increases from 8 to 128, the model's performance continues to climb without any retraining, a property reminiscent of "Reasoning Models" (like OpenAI's o1) but applied to the physical domain of driving.
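A toy simulation shows why ranking more candidates can help even when the scorer is imperfect. This is a sketch under assumptions, not a reproduction of the paper's experiment: candidate quality is drawn from a standard normal distribution, and the scorer sees that quality plus Gaussian noise. The function name and the `scorer_noise` parameter are hypothetical.

```python
import numpy as np

def mean_selected_quality(num_candidates, scorer_noise=0.5, trials=2000, seed=0):
    """Average true quality of the candidate picked by a noisy scorer."""
    rng = np.random.default_rng(seed)
    # Hidden true quality of each candidate in each trial.
    quality = rng.normal(size=(trials, num_candidates))
    # The scorer only observes a noisy version of that quality.
    scores = quality + scorer_noise * rng.normal(size=quality.shape)
    picked = np.argmax(scores, axis=1)
    return float(quality[np.arange(trials), picked].mean())

results = {n: mean_selected_quality(n) for n in (8, 32, 128)}
for n, q in results.items():
    print(f"N={n:3d}  mean quality of selected candidate: {q:.2f}")
```

In this toy setting the quality of the selected candidate grows with the number of candidates, as long as the scorer's ranking correlates with true quality. A poorly calibrated scorer would flatten or reverse the trend.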
Conclusion
RAD-2 makes the case that the key to scalable RL in robotics isn't bigger models, but better architectural decoupling. By letting Diffusion do the "dreaming" and RL do the "judging," RAD-2 bridges the gap between imitation and intelligent, closed-loop interaction.
Limitations: The framework currently relies on BEV-centric features. Future work will need to adapt this spatial warping logic to unified latent-space world models to support more diverse sensor architectures.
