RAD-2: Scaling RL via Generator-Discriminator Synergy in Autonomous Driving
Abstract

RAD-2 is a unified generator-discriminator framework for autonomous driving that combines a diffusion-based trajectory generator with a Reinforcement Learning (RL) optimized discriminator. It achieves a state-of-the-art reduction in collision rates (over 56%) by decoupling high-dimensional trajectory generation from low-dimensional preference evaluation in a shared closed-loop simulation.

TL;DR

RAD-2 scales Reinforcement Learning (RL) for motion planning by splitting the problem in two: a Diffusion Generator that explores the trajectory manifold and an RL Discriminator that selects the safest path. Using a high-throughput BEV-Warp simulator and a TC-GRPO optimization strategy, RAD-2 cuts collision rates by 56% while maintaining human-like driving efficiency.

Problem: The "Action Space" Curse in Driving RL

Reinforcement Learning in autonomous driving has long faced a fundamental tension. A planner has to act in a continuous, high-dimensional action space (whole trajectories), but RL optimization works best on low-dimensional signals. When you try to map a sparse reward (e.g., "you crashed") back onto a complex 5-second trajectory, credit assignment becomes very hard: it is unclear which part of the trajectory caused the outcome.

Previous diffusion-based planners (like DiffusionDrive or ResAD) relied purely on Imitation Learning (IL). While great at mimicking human "average" behavior, they lack a mechanism to learn from mistakes or understand why a specific maneuver was safer than another in a closed-loop interactive environment.

Methodology: Divide, Rerank, and Conquer

RAD-2 introduces a "Reviewer-Author" architecture. Instead of asking RL to "write" the trajectory, RAD-2 asks RL to "grade" trajectories written by a pre-trained expert (the Diffusion model).

1. The Generator-Discriminator Framework

  • Generator: A Diffusion-based model (DiT) that outputs diverse candidate trajectories. It handles the "multimodality", for example deciding whether to go left or right around an obstacle.
  • Discriminator: A Transformer-based head that takes the candidates and the BEV scene context and outputs a preference score for each candidate. This is the component that RL optimizes.
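The generate-then-rerank loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_generator`, `toy_discriminator`, and `plan` are hypothetical stand-ins (random lateral drifts instead of a DiT, obstacle clearance instead of a learned Transformer scorer), and only the control flow (sample candidates, score each, pick the best) mirrors the text.

```python
import random
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints in the ego frame


def toy_generator(num_candidates: int, horizon: int, rng: random.Random) -> List[Trajectory]:
    """Hypothetical stand-in for the diffusion generator: paths with a random lateral drift."""
    candidates = []
    for _ in range(num_candidates):
        drift = rng.uniform(-0.5, 0.5)  # negative/positive drift = pass on one side or the other
        candidates.append([(float(t), drift * t) for t in range(1, horizon + 1)])
    return candidates


def toy_discriminator(traj: Trajectory, obstacle: Tuple[float, float]) -> float:
    """Hypothetical stand-in for the learned scorer: minimum clearance from one obstacle."""
    ox, oy = obstacle
    return min(((x - ox) ** 2 + (y - oy) ** 2) ** 0.5 for x, y in traj)


def plan(num_candidates: int, horizon: int, obstacle: Tuple[float, float],
         rng: random.Random) -> Tuple[Trajectory, List[float]]:
    """Sample candidates, score each one, and return the top-ranked trajectory."""
    candidates = toy_generator(num_candidates, horizon, rng)
    scores = [toy_discriminator(c, obstacle) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


best_traj, all_scores = plan(16, 5, (3.0, 0.0), random.Random(0))
```

The point of the split is that the generator never sees the reward; only the scoring function would need to change when preferences are learned from closed-loop rollouts.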

[Figure: RAD-2 Architecture]

2. TC-GRPO: Stabilizing the Gradient

The authors adapt Group Relative Policy Optimization (GRPO), recently popularized by DeepSeek, to driving. They introduce Temporal Consistency (TC):

  • Latched Execution: A chosen trajectory is reused for a short horizon so that the vehicle doesn't "jitter" between modes.
  • Structured Advantage: Rewards are computed relative to a group of rollouts starting from the same state, effectively denoising the signal.
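The two ideas above can be sketched in a few lines. The advantage function uses the standard GRPO normalization (reward minus the group mean, divided by the group standard deviation); whether RAD-2's structured advantage takes exactly this form is an assumption. `latch_horizon` is a hypothetical parameter name, since the source does not give the symbol or value for the latching horizon.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style advantage: normalize each reward against its own rollout group.

    All rewards are assumed to come from rollouts that start from the same state.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def latched_decisions(num_steps: int, latch_horizon: int) -> List[int]:
    """For each simulation step, return the step at which the active trajectory was selected.

    A new selection is made only every `latch_horizon` steps (hypothetical parameter);
    in between, the previously chosen trajectory keeps being executed.
    """
    return [step - (step % latch_horizon) for step in range(num_steps)]


# Four rollouts from the same state: two finish cleanly (1.0), two collide (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
decision_steps = latched_decisions(num_steps=6, latch_horizon=3)
```

Because advantages are centered within the group, a rollout is rewarded only for doing better than its siblings from the same state, which removes the part of the return that is due to the scenario itself being easy or hard.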

3. BEV-Warp: The High-Throughput Engine

To train RL at scale, you need speed. Traditional simulators such as CARLA are slow because they render sensor images. RAD-2 uses BEV-Warp, which performs spatial transformations directly on the internal 2D feature maps of the perception system. This allows closed-loop simulation at high throughput without the overhead of image generation.
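To make "spatial transformations directly on 2D feature maps" concrete, here is a minimal sketch of a rigid warp on a single-channel grid. It is an illustration under stated assumptions, not BEV-Warp itself: the function name `warp_bev`, the nearest-neighbour resampling, the zero fill for unseen cells, and the convention that the ego sits at the grid centre are all hypothetical choices.

```python
import math
from typing import List

Grid = List[List[float]]


def warp_bev(features: Grid, dx: float, dy: float, yaw: float) -> Grid:
    """Resample a BEV grid into the frame of an ego that moved by (dx, dy) cells and turned by yaw.

    Nearest-neighbour lookup; cells that fall outside the old grid are filled with 0.0.
    """
    h, w = len(features), len(features[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0  # assumed ego position: grid centre
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    out = [[0.0] * w for _ in range(h)]
    for row in range(h):
        for col in range(w):
            # Position of this output cell relative to the new ego pose.
            x_new, y_new = col - cx, row - cy
            # Same point expressed in the old ego frame: rotate, then translate.
            x_old = cos_y * x_new - sin_y * y_new + dx
            y_old = sin_y * x_new + cos_y * y_new + dy
            src_col, src_row = round(x_old + cx), round(y_old + cy)
            if 0 <= src_row < h and 0 <= src_col < w:
                out[row][col] = features[src_row][src_col]
    return out


# A 5x5 map with one active cell, one column to the right of the ego at the centre.
bev = [[0.0] * 5 for _ in range(5)]
bev[2][3] = 1.0
warped = warp_bev(bev, dx=1.0, dy=0.0, yaw=0.0)  # ego advances one cell along +x
```

After the ego advances one cell, the active cell lands on the grid centre, which is the next observation a closed-loop simulator would need, obtained without rendering any image.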

[Figure: BEV-Warp Mechanism]

Experiments: Proving the Gains

RAD-2 was tested against heavyweights like VADv2 and TransFuser. The results in the BEV-Warp environment were striking:

  • Safety: Collision Rate dropped from 0.533 (ResAD) to 0.234.
  • Efficiency: Efficiency-oriented scenarios saw a route completion (EP@1.0) jump from 51.6% to 73.6%.
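As a quick check, the figures above can be tied back to the headline "over 56%" claim. The snippet only does arithmetic on the numbers quoted in the two bullets; the helper name `relative_reduction` is ours.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Collision rate: 0.533 (ResAD) -> 0.234 (RAD-2), as quoted above.
collision_drop = relative_reduction(0.533, 0.234)

# EP@1.0: 51.6% -> 73.6%, expressed as a gain in percentage points.
ep_gain_points = 73.6 - 51.6

print(f"collision rate reduction: {collision_drop:.1%}")    # 56.1%
print(f"EP@1.0 gain: {ep_gain_points:.1f} percentage points")  # 22.0
```

So the 56% figure is a relative reduction in collision rate, while the efficiency result is a gain of 22 percentage points.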

In real-world qualitative tests, RAD-2 demonstrated "proactive deceleration." While baseline models often reacted too late to merging vehicles, RAD-2's discriminator identified the low-reward (high-risk) outcome early and selected a smoother, safer alternative.

[Figure: Performance Comparison]

Critical Insight: Why Does This Work?

The key property of RAD-2 is its Inference-time Scaling. Because the discriminator is trained to rank, you can increase the number of candidate trajectories ("guesses") sampled at test time. As that number increases from 8 to 128, the model's performance continues to climb without any retraining, a property reminiscent of "Reasoning Models" (like OpenAI's o1) but applied to the physical domain of driving.
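A toy simulation shows why a ranker makes this possible. It is not a reproduction of the paper's experiment: candidate "quality" is drawn uniformly at random, the ranker is modelled as quality plus Gaussian noise, and the function name and all parameter values are assumptions chosen for illustration.

```python
import random
from statistics import mean


def best_of_n_quality(n: int, trials: int, noise: float, seed: int = 0) -> float:
    """Average true quality of the candidate picked by a noisy ranker from n samples."""
    rng = random.Random(seed)
    picked = []
    for _ in range(trials):
        qualities = [rng.random() for _ in range(n)]              # hidden true quality in [0, 1)
        scores = [q + rng.gauss(0.0, noise) for q in qualities]  # imperfect ranker output
        best = max(range(n), key=scores.__getitem__)
        picked.append(qualities[best])
    return mean(picked)


# Sweep the candidate count with the "model" (sampler and ranker) held fixed.
results = {n: best_of_n_quality(n, trials=500, noise=0.1) for n in (8, 32, 128)}
for n, quality in results.items():
    print(n, round(quality, 3))
```

Even with an imperfect ranker, a larger candidate pool is more likely to contain a very good option and the ranker is more likely to land near it, so selected quality rises with the pool size while the weights stay untouched.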

Conclusion

RAD-2 proves that the key to scalable RL in robotics isn't bigger models, but better architectural decoupling. By letting Diffusion do the "dreaming" and RL do the "judging," RAD-2 bridges the gap between imitation and intelligent, closed-loop interaction.

Limitations: The framework currently relies on BEV-centric features. Future work will need to adapt this spatial warping logic to unified latent-space world models to support more diverse sensor architectures.
