RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots

[CoRL 2025] RoboCasa365: Scaling Robot Simulation to 365 Tasks and 2,500 Environments

Abstract

RoboCasa365 is a large-scale simulation benchmark for training and evaluating generalist robots on household mobile manipulation tasks. It introduces 365 diverse tasks across 2,500 kitchen environments and provides over 2,200 hours of demonstration data, setting a new state of the art for scale and diversity among robot simulation frameworks.

TL;DR

RoboCasa365 is a massive leap in robotic simulation, offering a benchmark of 365 everyday kitchen tasks and 2,500 unique environments. By providing over 2,000 hours of demonstration data, it addresses the "data hunger" of robotic foundation models. The results show that pretraining in this diverse virtual world makes real-world robots significantly more efficient and capable, boosting real-world success rates from 61.8% to 79.8%.

Examining the "Generalist" Bottleneck

To build a "Generalist Robot," we need robots that can handle the messiness of a real kitchen: opening a fridge, making coffee, and clearing a table. However, gathering 10,000 hours of real-world human demonstrations is prohibitively expensive. Simulation is the obvious answer, but until now most simulators were "toy-like," focusing on single tables or a handful of objects.

The authors identify three primary gaps in current research:

Task Breadth: Most benchmarks focus on atomic skills (picking a ball) rather than composite human activities (making a sandwich).
Visual Diversity: Models trained in one virtual kitchen fail when the floor color or cabinet handle changes.
Reproducibility: Real-world benchmarks are noisy and hard to compare across different labs.

Methodology: Building a Virtual Universe

RoboCasa365 scales the original RoboCasa framework through three pillars:

1. Asset & Scene Scaling

The researchers built "Digital Cousins" of 50 real-world homes sourced from Zillow, resulting in 2,500 unique kitchen configurations. They expanded the interactive furniture library to include 456 articulated instances, such as toasters, blenders, and dishwashers.
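As a rough illustration of how the two numbers above relate, the sketch below enumerates scene configurations as (home, variant) pairs. The even split of 50 variants per home is an assumption derived only from dividing 2,500 by 50; the source gives the totals, not how they are composed, and the identifier names are hypothetical.

```python
from itertools import product

NUM_HOMES = 50         # real-world homes (from the text)
TOTAL_CONFIGS = 2500   # unique kitchen configurations (from the text)

# Assumption: configurations are spread evenly across homes.
variants_per_home = TOTAL_CONFIGS // NUM_HOMES

# Hypothetical identifiers, used only to make the enumeration concrete.
homes = [f"home_{i:02d}" for i in range(NUM_HOMES)]
variants = [f"variant_{j:02d}" for j in range(variants_per_home)]

# Every (home, variant) pair is one kitchen configuration.
configs = list(product(homes, variants))

print(f"{variants_per_home} variants per home, {len(configs)} configurations")
```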

2. LLM-Guided Task Blueprints

Instead of manually coding 365 tasks, the authors used Large Language Models (LLMs) to generate "Task Blueprints." The LLM proposed 60 activities (e.g., "storing leftovers"), and for each, defined the sequence of atomic skills (navigation -> find bowl -> pick -> place in fridge) required to complete them.
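One way to picture a task blueprint is as an activity name paired with an ordered list of atomic skills. The sketch below is a minimal illustration under that assumption; the `TaskBlueprint` class and its fields are hypothetical and are not taken from the RoboCasa365 codebase.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TaskBlueprint:
    """Hypothetical container for one LLM-proposed activity."""

    activity: str
    skills: Tuple[str, ...]  # atomic skills, in execution order

    def describe(self) -> str:
        """Render the blueprint as 'activity: skill -> skill -> ...'."""
        return f"{self.activity}: " + " -> ".join(self.skills)


# The example sequence quoted in the text above.
storing_leftovers = TaskBlueprint(
    activity="storing leftovers",
    skills=("navigation", "find bowl", "pick", "place in fridge"),
)

print(storing_leftovers.describe())
```

Keeping the skill sequence as ordered data rather than hand-written task code is what would let a generator produce many composite tasks from a shared pool of atomic skills.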

3. The Data Engine

The framework provides:

612 hours of human teleoperation data.
1,615 hours of synthetic data generated via MimicGen, which takes a few human "seeds" and automatically creates thousands of variations.
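The two figures above account for the "over 2,200 hours" headline number. The short calculation below uses only the hours stated in the text; the variable names are illustrative.

```python
HUMAN_HOURS = 612        # human teleoperation data (from the text)
SYNTHETIC_HOURS = 1615   # MimicGen-generated data (from the text)

total_hours = HUMAN_HOURS + SYNTHETIC_HOURS
synthetic_share = SYNTHETIC_HOURS / total_hours
# Hours of synthetic data per hour of human teleoperation.
synthetic_per_human_hour = SYNTHETIC_HOURS / HUMAN_HOURS

print(f"total: {total_hours} h")
print(f"synthetic share: {synthetic_share:.1%}")
print(f"synthetic hours per human hour: {synthetic_per_human_hour:.2f}")
```

Roughly three quarters of the dataset is therefore synthetic, which is why the quality of the generated variations matters as much as the human seeds.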

Figure 1: The RoboCasa365 ecosystem: Tasks, Environments, and Scaling.

Experiments and Insights

The Foundation Model Advantage

The core experiment compared training a policy from scratch against pretraining on the RoboCasa365 dataset first. The findings were striking: pretraining provided roughly a 3x data efficiency gain. A model pretrained on RoboCasa365 reached the same performance with 150 demonstrations that a non-pretrained model needed 500 to achieve.
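The "3x" figure follows from the two demonstration counts quoted above. A quick check, using only those two numbers:

```python
DEMOS_SCRATCH = 500      # demonstrations needed without pretraining (from the text)
DEMOS_PRETRAINED = 150   # demonstrations needed with pretraining (from the text)

efficiency_gain = DEMOS_SCRATCH / DEMOS_PRETRAINED
demos_saved = DEMOS_SCRATCH - DEMOS_PRETRAINED
fraction_saved = demos_saved / DEMOS_SCRATCH

print(f"efficiency gain: {efficiency_gain:.2f}x")
print(f"demonstrations saved: {demos_saved} ({fraction_saved:.0%})")
```

The ratio works out to about 3.3x, so "3x" is a slightly conservative rounding.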

Multi-Task Performance (Zero-Shot)

The authors tested SOTA models like GR00T N1.5, π0, and Diffusion Policy. While models generally mastered "Atomic" tasks easily, "Composite-Unseen" tasks (tasks the model never saw during training) remained a significant challenge, showing that while we are moving toward generalists, long-horizon reasoning is still the frontier.

Figure 2: Success rate improvements. Pretraining consistently lifts the baseline across all task types.

The "Catastrophic Forgetting" in Lifelong Learning

One of the most sobering results came from the Lifelong Learning benchmark. As the robot learned new, more complex tasks (Phase 4), its performance on the simple tasks learned in Phase 1 plummeted from 41.5% to 10.6%. This shows that "remembering" old skills while learning new ones remains a major unsolved hurdle in robotics.
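To put the size of that drop in perspective, the calculation below separates the absolute loss (in percentage points) from the relative loss (as a share of the original success rate). It uses only the two success rates quoted above.

```python
PHASE1_BEFORE = 41.5   # Phase 1 task success rate before later phases, in % (from the text)
PHASE1_AFTER = 10.6    # Phase 1 task success rate after Phase 4, in % (from the text)

# Absolute drop in percentage points, rounded to avoid float noise.
absolute_drop = round(PHASE1_BEFORE - PHASE1_AFTER, 1)
# Share of the original success rate that was lost.
relative_drop = absolute_drop / PHASE1_BEFORE

print(f"absolute drop: {absolute_drop} points")
print(f"relative drop: {relative_drop:.1%}")
```

In other words, about three quarters of the original Phase 1 success rate was lost by the end of Phase 4.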

Critical Analysis & Conclusion

Takeaway: RoboCasa365 suggests that diversity of environments is just as important as quantity of data. Adding more scenes (from 5 to 2,500) significantly improved the robot's ability to handle new kitchens without extra training.

Limitations:

Physical Fidelity: While visually rich, MuJoCo is still a rigid-body simulator. It doesn't perfectly capture the fluid dynamics of pouring milk or the soft-body physics of sliced bread.
Domain Gap: Even with "Sim-and-Real" co-training, there is a performance drop when moving to physical hardware, though RoboCasa365 narrows this gap significantly compared to prior work.

RoboCasa365 provides the most rigorous "gym" yet for the next generation of humanoid and mobile robots. By open-sourcing the environments and the 2,000+ hours of data, the authors have essentially given the community a "Base Map" for household intelligence.
