RoboCasa365 is a large-scale simulation benchmark for training and evaluating generalist robots on household mobile manipulation tasks. It introduces 365 diverse tasks across 2,500 kitchen environments and provides over 2,200 hours of demonstration data, setting a new bar for scale and diversity among robot simulation frameworks.
TL;DR
RoboCasa365 is a major step up in scale for robotic simulation: a benchmark of 365 everyday kitchen tasks across 2,500 unique environments. With over 2,200 hours of demonstration data, it targets the "data hunger" of robotic foundation models. The reported results show that pretraining in this diverse virtual world makes real-world robots more data-efficient and more capable, raising real-world success rates from 61.8% to 79.8%.
Examining the "Generalist" Bottleneck
To build a "Generalist Robot," we need robots that can handle the messiness of a real kitchen: opening a fridge, making coffee, clearing a table. However, gathering 10,000 hours of real-world human demonstrations is prohibitively expensive. Simulation is the obvious answer, but until now most simulators were "toy-like," limited to a single table or a handful of objects.
The authors identify three primary gaps in current research:
- Task Breadth: Most benchmarks focus on atomic skills (picking up a ball) rather than composite human activities (making a sandwich).
- Visual Diversity: Models trained in one virtual kitchen fail when the floor color or cabinet handle changes.
- Reproducibility: Real-world benchmarks are noisy and hard to compare across different labs.
Methodology: Building a Virtual Universe
RoboCasa365 scales the original RoboCasa framework through three pillars:
1. Asset & Scene Scaling
The researchers built "Digital Cousins" of 50 real-world homes sourced from Zillow, resulting in 2,500 unique kitchen configurations. They expanded the interactive furniture library to include 456 articulated instances, such as toasters, blenders, and dishwashers.
2. LLM-Guided Task Blueprints
Instead of manually coding 365 tasks, the authors used Large Language Models (LLMs) to generate "Task Blueprints." The LLM proposed 60 activities (e.g., "storing leftovers") and, for each one, defined the sequence of atomic skills required to complete it (navigation -> find bowl -> pick -> place in fridge).
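To make the blueprint idea concrete, here is a minimal sketch of how one such entry could be represented. The `TaskBlueprint` class and its field names are hypothetical and chosen only for illustration; they are not the authors' schema or part of the RoboCasa API. The skill sequence is the "storing leftovers" example from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskBlueprint:
    """Hypothetical container for one LLM-proposed activity (not the authors' schema)."""

    activity: str
    skills: List[str] = field(default_factory=list)

    def horizon(self) -> int:
        """Number of atomic skills chained together in this task."""
        return len(self.skills)


# The example from the text: "storing leftovers".
storing_leftovers = TaskBlueprint(
    activity="storing leftovers",
    skills=["navigation", "find bowl", "pick", "place in fridge"],
)

print(storing_leftovers.activity, storing_leftovers.horizon())
```

Keeping the skill sequence as plain data is one plausible way to separate what the LLM proposes from the simulator code that checks whether each step succeeded.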
3. The Data Engine
The framework provides:
- 612 hours of human teleoperation data.
- 1,615 hours of synthetic data generated via MimicGen, which takes a few human "seeds" and automatically creates thousands of variations.
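The two figures above can be checked against the "over 2,200 hours" headline with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this summary; the variable names are illustrative.

```python
# Hours of demonstration data as quoted in the list above.
human_hours = 612        # human teleoperation
synthetic_hours = 1615   # generated via MimicGen

total_hours = human_hours + synthetic_hours
synthetic_share = synthetic_hours / total_hours

print(f"total: {total_hours} hours")
print(f"synthetic share: {synthetic_share:.1%}")
```

The total comes to 2,227 hours, of which roughly 72.5% is synthetic, so most of the dataset's volume comes from automated variation rather than direct teleoperation.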
Figure 1: The RoboCasa365 ecosystem: Tasks, Environments, and Scaling.
Experiments and Insights
The Foundation Model Advantage
The core experiment compared training a policy from scratch against pretraining on the RoboCasa365 dataset. The findings were striking: pretraining provided roughly a 3x gain in data efficiency. A model pretrained on RoboCasa365 reached the same performance with 150 demonstrations that a non-pretrained model needed 500 to achieve.
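The efficiency figure follows directly from the two demonstration counts. The sketch below defines data efficiency gain as the ratio of demonstrations needed without pretraining to demonstrations needed with it; that definition is an assumption made here for illustration, and the function name is hypothetical.

```python
def data_efficiency_gain(baseline_demos: int, pretrained_demos: int) -> float:
    """Ratio of demonstrations needed without vs. with pretraining
    to reach the same performance (assumed definition)."""
    if pretrained_demos <= 0:
        raise ValueError("pretrained_demos must be positive")
    return baseline_demos / pretrained_demos


# Counts quoted in the text: 500 demos from scratch vs. 150 after pretraining.
gain = data_efficiency_gain(baseline_demos=500, pretrained_demos=150)
print(f"data efficiency gain: {gain:.2f}x")
```

Under this definition the ratio is about 3.3, which is consistent with the rounded "3x" figure.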
Multi-Task Performance (Zero-Shot)
The authors tested SOTA models like GR00T N1.5, π0, and Diffusion Policy. While the models generally mastered "Atomic" tasks easily, "Composite-Unseen" tasks (tasks the model never saw during training) remained a significant challenge. The field is moving toward generalists, but long-horizon reasoning is still the frontier.
Figure 2: Success rate improvements. Pretraining consistently lifts the baseline across all task types.
The "Catastrophic Forgetting" in Lifelong Learning
One of the most sobering results came from the Lifelong Learning benchmark. By the time the robot had learned the newer, more complex tasks of Phase 4, its performance on the simple tasks learned in Phase 1 had plummeted from 41.5% to 10.6%. This indicates that "remembering" old skills while learning new ones remains a major unsolved hurdle in robotics.
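A short calculation puts the size of that drop in perspective. The helper below is a hypothetical illustration, not the benchmark's own metric. It reports the absolute drop in percentage points and the fraction of Phase 1 performance that survives.

```python
from typing import Tuple


def forgetting(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, fraction of performance retained)."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return before_pct - after_pct, after_pct / before_pct


# Phase 1 task success as quoted in the text: 41.5% initially, 10.6% after Phase 4.
drop_points, retained = forgetting(before_pct=41.5, after_pct=10.6)
print(f"absolute drop: {drop_points:.1f} points")
print(f"retained: {retained:.1%} of original performance")
```

On these numbers the policy loses about 30.9 percentage points and keeps only about a quarter of its original Phase 1 success rate.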
Critical Analysis & Conclusion
Takeaway: RoboCasa365 suggests that environment diversity matters as much as data quantity. Increasing the number of scenes (from 5 to 2,500) significantly improved the robot's ability to handle new kitchens without extra training.
Limitations:
- Physical Fidelity: While the scenes are visually rich, the underlying MuJoCo simulation is still primarily rigid-body. It does not faithfully capture the fluid dynamics of pouring milk or the soft-body physics of sliced bread.
- Domain Gap: Even with "Sim-and-Real" co-training, there is a performance drop when moving to physical hardware, though RoboCasa365 narrows this gap significantly compared to prior work.
RoboCasa365 provides one of the most rigorous "gyms" yet for the next generation of humanoid and mobile robots. By open-sourcing the environments and the 2,200+ hours of data, the authors have given the community a "Base Map" for household intelligence.
