This paper introduces Hyper Diffusion Planner (HDP), a large-scale end-to-end autonomous driving (E2E AD) framework that utilizes a diffusion-based decoder for trajectory planning. Evaluated via 200 km of real-world road testing, HDP achieves a 10x performance improvement over baseline diffusion planners by optimizing loss space, trajectory representation, and data scaling.
TL;DR
The Hyper Diffusion Planner (HDP) is a breakthrough in End-to-End (E2E) Autonomous Driving that transitions diffusion-based planning from "simulation-only" to "real-road-ready." By systematically optimizing the diffusion loss space and introducing a mathematically grounded Hybrid Loss (coupling velocity and waypoints), the researchers achieved a 10x performance boost in closed-loop real-world testing (200 km).
Problem & Motivation: The Gap Between Math and Asphalt
While Diffusion Models are the "SOTA" for image generation and robotic manipulation, applying them to autonomous driving (AD) reveals three critical "pain points":
- Jitter vs. Geometry: Models supervised on waypoints capture the path well but produce jerky, un-drivable velocity profiles.
- Mode Collapse: On small datasets (like 100k frames), diffusion planners often fail to show their famous multi-modality, behaving like simple regression models.
- Safety Gap: Imitation learning blindly copies human behavior, including mistakes, and lacks a mechanism to prioritize "not crashing" in long-tail scenarios.
Methodology: The Core Innovations
1. Re-thinking the Loss Space
Most diffusion models predict the noise ($\epsilon$). However, the authors find that in AD, trajectories live on a low-dimensional manifold. Predicting the clean data ($x_0$) directly leads to faster convergence and eliminates the high-frequency artifacts common in $\epsilon$-prediction.
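To make the two prediction targets concrete, here is a minimal NumPy sketch. It assumes the standard DDPM forward process (x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps), which this summary does not confirm is the exact schedule HDP uses. The function names, the toy trajectory, and the value of `alpha_bar` are all hypothetical. It shows that a noise prediction implies a clean-trajectory estimate, and that at high noise levels a small error in the noise prediction is scaled up when mapped back to trajectory space. This is one commonly cited intuition for noisy outputs under noise prediction, not necessarily the authors' argument.

```python
import numpy as np

def forward_noise(x0, eps, alpha_bar):
    """Assumed DDPM forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def eps_to_x0(x_t, eps_pred, alpha_bar):
    """Clean-sample estimate implied by a noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

def eps_prediction_loss(eps_pred, eps):
    """Noise-prediction objective: regress the injected noise."""
    return float(np.mean((eps_pred - eps) ** 2))

def x0_prediction_loss(x0_pred, x0):
    """Clean-data objective: regress the trajectory itself."""
    return float(np.mean((x0_pred - x0) ** 2))

rng = np.random.default_rng(0)
x0 = np.cumsum(rng.normal(size=(8, 2)), axis=0)  # toy trajectory: 8 (x, y) waypoints
eps = rng.normal(size=x0.shape)
alpha_bar = 0.05                                 # hypothetical high-noise step
x_t = forward_noise(x0, eps, alpha_bar)

# A noise prediction that is off by a constant 0.01 in every coordinate.
eps_pred = eps + 0.01
x0_from_eps = eps_to_x0(x_t, eps_pred, alpha_bar)

# The same error, seen in trajectory space, is scaled by sqrt(1 - a) / sqrt(a).
amplification = np.sqrt(1.0 - alpha_bar) / np.sqrt(alpha_bar)
traj_error = np.abs(x0_from_eps - x0)
```

With `alpha_bar = 0.05` the amplification factor is greater than 4, so a model trained on the noise target needs much tighter noise accuracy to produce a clean trajectory than a model that regresses the trajectory directly.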
2. The Hybrid Loss (Velocity + Waypoints)
To solve the "jitters," the authors have the model predict velocity but supervise it on both velocity and waypoints. They mathematically prove that this formulation, termed a P-norm Score Matching loss, is unbiased and preserves the data distribution. The result is both global geometric accuracy and local kinematic smoothness.
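The idea of supervising a velocity prediction in both velocity and waypoint space can be sketched as follows. This is an illustrative simplification, not the paper's P-norm Score Matching loss. The cumulative-sum integration, the time step `dt`, the weight `waypoint_weight`, and all function names are assumptions made for the example.

```python
import numpy as np

def integrate_velocity(vel, dt):
    """Integrate per-step velocities (T, 2) into waypoints (T, 2), starting at the origin."""
    return np.cumsum(vel * dt, axis=0)

def hybrid_loss(vel_pred, vel_gt, dt=0.5, waypoint_weight=1.0):
    """Return (total, velocity_term, waypoint_term) for a velocity-space prediction."""
    wp_pred = integrate_velocity(vel_pred, dt)
    wp_gt = integrate_velocity(vel_gt, dt)
    vel_term = float(np.mean((vel_pred - vel_gt) ** 2))  # local kinematic error
    wp_term = float(np.mean((wp_pred - wp_gt) ** 2))     # accumulated geometric error
    return vel_term + waypoint_weight * wp_term, vel_term, wp_term

# Ground truth: 6 steps of driving straight ahead at 5 m/s.
vel_gt = np.tile(np.array([5.0, 0.0]), (6, 1))

# Two hypothetical predictions with the same per-step speed error of 0.2 m/s:
# one alternates sign (errors cancel over time), one is a constant bias (errors accumulate).
alternating = vel_gt.copy()
alternating[:, 0] += 0.2 * np.array([1, -1, 1, -1, 1, -1])
biased = vel_gt.copy()
biased[:, 0] += 0.2

total_alt, vel_alt, wp_alt = hybrid_loss(alternating, vel_gt)
total_bias, vel_bias, wp_bias = hybrid_loss(biased, vel_gt)
```

The velocity term alone cannot tell the two predictions apart, while the waypoint term penalizes the biased one far more heavily because its error accumulates along the path. Coupling the two terms is what lets a single loss see both local smoothness and global geometry.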
Fig 1: The HDP Architecture featuring a Perception Backbone and a Transformer-based Diffusion Decoder.
3. Safety-Aware RL Post-Training
To refine the model without expensive online real-vehicle RL, HDP uses a "pseudo-closed-loop" simulation. It applies Reward-Weighted Regression (RWR), which up-weights safe trajectories in the training data. This aligns the model with safety constraints without backpropagating gradients through the denoising chain.
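A minimal sketch of reward-weighted regression is shown below, assuming the common exponential weighting exp(reward / temperature) normalized over the batch. The reward values, the temperature, and the function names are hypothetical, and the summary does not specify how HDP defines its safety reward or its weighting.

```python
import numpy as np

def rwr_weights(rewards, temperature=1.0):
    """Normalized exponential weights; higher reward gives higher weight."""
    rewards = np.asarray(rewards, dtype=float)
    logits = (rewards - rewards.max()) / temperature  # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def reward_weighted_loss(per_sample_loss, rewards, temperature=1.0):
    """Weighted sum of per-sample losses; the weights are treated as constants."""
    weights = rwr_weights(rewards, temperature)
    return float(np.sum(weights * np.asarray(per_sample_loss, dtype=float)))

# Hypothetical batch: two safe rollouts (reward +1) and one collision (reward -1).
rewards = np.array([1.0, 1.0, -1.0])
# Hypothetical per-sample denoising losses for the same three trajectories.
losses = np.array([0.2, 0.4, 0.9])

weights = rwr_weights(rewards, temperature=0.5)
weighted = reward_weighted_loss(losses, rewards, temperature=0.5)
unweighted = float(np.mean(losses))
```

Because the weights depend only on the rewards, gradients flow only through the ordinary per-sample training loss. The unsafe trajectory contributes almost nothing to the update, which is the sense in which safe samples are up-weighted.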
Experiments & Results: The Power of Scaling
The most striking result concerns data scaling. Benchmarks such as NAVSIM suggest that diffusion planners do not exhibit multi-modality; the authors attribute this to data volume rather than to the model class.
- Scaling Multi-modality: Diversified behaviors only emerge after crossing the ~10M frame threshold.
- Real-Vehicle Performance: Scaling from 10M to 70M frames improved success rates by over 20%.
Table 1: Step-by-step performance gains from Base Model to HDP-RL.
In 200 km of urban testing, HDP handled complex "Navigational Lane Changes" and "VRU Yielding" (yielding to vulnerable road users) with human-like smoothness. Such scenarios were previously a major weakness of E2E learned planners.
Fig 2: Snapshots of HDP successfully performing unprotected turns and yielding to cross traffic.
Critical Analysis & Conclusion
Takeaway: HDP demonstrates that successful E2E AD doesn't require complex, hand-crafted heuristics (like anchor trajectories). Instead, it requires a theoretically sound loss function and significant data scale.
Limitations:
- The current RL reward focuses primarily on safety, which can sometimes lead to overly "conservative" driving (e.g., waiting too long at intersections).
- Future work needs to balance safety with traffic efficiency to make the agent more assertive in dense traffic.
Future Outlook: HDP sets a new baseline for "Generalizable AD." By showing that diffusion models scale as well as LLMs, it opens the door for Large Foundation Models in the physical world.
