Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems

WisPaper

学术搜索

学术问答

价格

TrueCite

工作空间

Home

Blog

Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems

[CVPR 2024] MOSAIC: Boosting Autonomous Driving with Scaling-Aware Data Selection

总结

问题

方法

结果

要点

摘要

The paper introduces MOSAIC, a scaling-aware data selection framework designed for End-to-End (E2E) autonomous driving. It combines data clustering, importance ranking, and neural scaling laws to optimize training mixtures, achieving SOTA performance on NAVSIM and OpenScene benchmarks while using up to 80% less data.

Executive Summary

Large-scale deep learning for Physical AI—such as End-to-End (E2E) autonomous driving—is a data-hungry endeavor. However, blindly collecting millions of hours of driving footage is computationally prohibitive and often leads to diminishing returns.

The core challenge is "Metric Competition." An autonomous vehicle isn't just trying to minimize a single loss function; it must simultaneously comply with traffic laws, ensure passenger comfort, and maintain progress. A specific dataset might help the model handle roundabouts better while inadvertently making its braking behavior more erratic.

In this work, researchers from NVIDIA and NYU present MOSAIC (Mixture Optimization via Scaling-Aware Iterative Collection). MOSAIC treats data selection not as a simple search for "diversity," but as a mixture optimization problem guided by the mathematical "scaling laws" of different data domains.

Perspective: This paper shifts the paradigm from "Maximum Diversity" (Coreset) or "Maximum Uncertainty" to "Maximum Utility Gain per Sample."

The Motivation: Why Random Sampling Fails Physical AI

In Large Language Models (LLMs), "Scaling Laws" have successfully predicted how performance improves with more tokens. However, Physical AI introduces three complications:

Conflicting Metrics: Models are evaluated on an aggregate of many scores (e.g., the 9 metrics of EPDMS).
Heterogeneous Influence: A clip of a busy intersection impacts collision avoidance more than a clip of a desert highway.
Partition Ambiguity: Unlike "Code" vs "English" in LLMs, driving scenes aren't always naturally separated into clean domains.

Current SOTA methods like Chameleon or Coreset ignore these scaling dynamics, often selecting redundant data that doesn't move the needle on the metrics that actually need improvement.

Methodology: The MOSAIC Pipeline

MOSAIC solves the data selection problem in three distinct stages, as illustrated in the architecture below:

MOSAIC Framework Overview

1. Clustering & Ranking

The data pool ( $D_{p oo l}$ ) is first partitioned into domains (e.g., by city or semantic captions). Within each cluster, samples are ranked using an Importance Score ( $I (x)$ ), defined by how much the current model struggles with that specific sample.

2. Modeling Scaling Laws

This is the "brain" of the method. The authors fit a saturating exponential scaling law for each domain: $Δ \hat{U}_{i} (n) = a_{i} (1 - e^{- n / a u_{i}})$ By running small "pilot runs," they determine $a_{i}$ (the potential total improvement a domain can offer) and $a u_{i}$ (how quickly that domain's benefit saturates).

3. Iterative Collection

Instead of picking all data at once, MOSAIC adds samples one-by-one. It selects the next sample from the domain that currently has the highest marginal gain—the steepest part of its scaling curve. As a domain's curve flattens (saturates), MOSAIC automatically pivots to another domain that still offers high growth.

Experiments: Massive Efficiency Gains

The framework was tested using the Hydra-MDP model on the OpenScene and Navtrain datasets, evaluating it on the Extended Predictive Driver Model Score (EPDMS).

SOTA Comparison

MOSAIC outperformed all baselines (Random, Uncertainty, Coreset, Chameleon) across every budget. In the OpenScene experiment, it achieved superior driving performance with 82% less data than random sampling.

Experiment Results Comparison

Visualizing the Selection Strategy

One of the most insightful findings is how MOSAIC shifts its attention over time. In OpenScene, it initially mines heavily from Boston and Singapore because those domains offer high initial gains (steep curves). Once those are exhausted, it pivots to Pittsburgh, which has a steadier, long-term improvement rate.

Scaling Curves per City

Critical Analysis & Insights

Why does it work?

MOSAIC's success stems from its ability to disentangle domain local effects. Most active learning methods try to find "globally difficult" samples. MOSAIC realizes that difficulty is domain-specific and that some metrics saturate faster than others. By modeling this saturation explicitly, it avoids wasting the compute budget on "more of the same."

Limitations

Linear Separability Assumption: The method assumes that the total utility is roughly the sum of individual domain effects. If there are massive cross-domain interactions, the model might over-allocate data.
Pilot Run Overhead: You need to train several small proxy models to fit the scaling laws. However, the authors show this cost is amortized quickly in large-scale training.

Conclusion

MOSAIC provides a principled blueprint for the "Data Era" of Autonomous Driving. It proves that by understanding the physics of scaling, we can build safer, more competent AI systems while significantly reducing the environmental and financial cost of large-scale deep learning.

Key Takeaway: Don't just collect more data—collect data that maximizes your marginal utility along the scaling curve.

发现相似论文

试试这些示例

Search for recent papers in autonomous driving or physical AI that use Neural Scaling Laws (NSLs) to optimize multi-objective or multi-metric evaluation criteria.
Which study first introduced the concept of "Data Mixture Laws" for LLM pre-training, and how does MOSAIC adapt this theory specifically for non-textual, interactive environments?
Investigate how scaling-aware data selection methods have been applied to multi-modal generative models or robotics tasks involving competing constraints like safety versus efficiency.

[CVPR 2024] MOSAIC: Boosting Autonomous Driving with Scaling-Aware Data Selection

1. Executive Summary

2. The Motivation: Why Random Sampling Fails Physical AI

3. Methodology: The MOSAIC Pipeline

3.1. 1. Clustering & Ranking

3.2. 2. Modeling Scaling Laws

3.3. 3. Iterative Collection

4. Experiments: Massive Efficiency Gains

4.1. SOTA Comparison

4.2. Visualizing the Selection Strategy

5. Critical Analysis & Insights

5.1. Why does it work?

5.2. Limitations

6. Conclusion