Ψ0 (Psi-Zero) is an open foundation model for universal humanoid loco-manipulation. It employs a staged training paradigm that pairs a Vision-Language Model (VLM) backbone with a flow-based Multi-Modal Diffusion Transformer (MM-DiT) action expert, roughly 2.5B parameters in total. It achieves state-of-the-art (SOTA) performance on complex, long-horizon tasks, outperforming existing baselines by over 40% in success rate while using significantly less robot-specific data.
TL;DR
Researchers from the USC Physical Superintelligence (PSI) Lab, NVIDIA, and WorldEngine have unveiled Ψ0 (Psi-Zero), an open-source foundation model designed to master the "holy grail" of robotics: loco-manipulation. By decoupling semantic understanding from physical execution, Ψ0 achieves superior performance on long-horizon tasks (like filling bottles or pushing carts) using 10x less data than current industry leaders like GR00T-N1.6.
The "Co-training" Trap in Humanoid Robotics
In recent years, the trend in robotics has been "more is better": throw massive amounts of human video and robot teleoperation data into a single Transformer and hope general intelligence emerges. The authors of Ψ0 argue that this strategy is fundamentally flawed. Humans and humanoid robots have different joint limits, speeds, and "action frequencies," so forcing a single model to internalize both distributions simultaneously creates an "embodiment gap" that shows up as jittery, imprecise motion.
Methodology: The Power of Decoupling
Instead of a monolithic network, Ψ0 uses a Triple-System Architecture (a minimal code sketch follows Fig 1):
- System-2 (VLM Backbone): A Qwen3-VL-2B model pre-trained on 800+ hours of EgoDex (egocentric human video). Its job is "Next-Action Prediction"—understanding what needs to happen next at a high level (e.g., "now reach for the faucet").
- System-1 (Action Expert): A 500M parameter Multi-Modal Diffusion Transformer (MM-DiT). This is the "muscle." It takes the visual features from the VLM and maps them directly to 36-DoF joint-space actions.
- System-0 (Lower-Body Controller): A specialized RL-based tracking policy (AMO) that ensures the robot stays balanced while the upper body performs complex tasks.
Fig 1: The Ψ0 high-level architecture separates semantic pre-training from joint-level action expertise.
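To make the decoupling concrete, below is a minimal, toy-scale sketch of how the three systems could hand off to one another at inference time. The class names, tensor dimensions, and interfaces here are hypothetical stand-ins chosen for illustration, not the released Ψ0 code: the real System-2 is a Qwen3-VL-2B backbone, System-1 is a 500M MM-DiT, and System-0 is the AMO tracking policy.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three systems; the tiny dimensions and module
# names are purely illustrative, not the actual Ψ0 components.

class System2VLM(nn.Module):
    """Semantic backbone: turns images and the instruction into high-level features."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Linear(3 * 32 * 32, feat_dim)  # stand-in for a real VLM

    def forward(self, image, instruction_tokens=None):
        # A real VLM would fuse vision and language; this toy only encodes pixels.
        return self.encoder(image.flatten(1))

class System1ActionExpert(nn.Module):
    """Action expert: maps VLM features + proprioception to a chunk of joint actions."""
    def __init__(self, feat_dim=64, dof=36, chunk_len=16):
        super().__init__()
        self.dof, self.chunk_len = dof, chunk_len
        self.net = nn.Linear(feat_dim + dof, dof * chunk_len)  # stand-in for the MM-DiT

    def forward(self, features, proprio):
        out = self.net(torch.cat([features, proprio], dim=-1))
        return out.view(-1, self.chunk_len, self.dof)  # (B, T, 36) upper-body targets

def system0_lower_body(joint_targets):
    """Stand-in for the RL tracking policy (AMO) that keeps the robot balanced."""
    return joint_targets  # a real controller would emit whole-body motor commands

# One control step: System-2 reasons, System-1 emits a 36-DoF action chunk,
# and System-0 tracks it while maintaining balance.
vlm, expert = System2VLM(), System1ActionExpert()
image = torch.rand(1, 3, 32, 32)
proprio = torch.zeros(1, 36)
features = vlm(image)
chunk = expert(features, proprio)
commands = system0_lower_body(chunk)
print(commands.shape)  # torch.Size([1, 16, 36])
```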
Real-Time Action Chunking (RTC)
A common problem with large vision-language-action (VLA) models is "stop-and-think" behavior: the robot pauses for 200 ms while the model runs inference. Ψ0 addresses this with training-time RTC. During training, the first few steps of each action chunk are masked and treated as already being executed; the model learns to "inpaint" the remaining steps so they stay continuous with the motion currently in progress, resulting in buttery-smooth motion.
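As a rough illustration, here is what training-time RTC could look like under a flow-matching objective: the first few steps of each chunk are pinned to the actions already being executed, and the loss is applied only to the remaining, "inpainted" steps. The function name, model signature, chunk length, and masking scheme are assumptions made for this sketch; the paper's exact RTC formulation may differ.

```python
import torch

def rtc_flow_matching_loss(model, actions, mask_steps=4):
    """Training-time RTC sketch (hypothetical, not the paper's exact recipe).

    The first `mask_steps` actions of each chunk are treated as already being
    executed and are shown to the model clean; the model must predict a velocity
    field that reconstructs ("inpaints") the remaining steps so the chunk stays
    continuous with the ongoing motion.

    actions: (B, T, DoF) ground-truth action chunk.
    """
    B, T, D = actions.shape
    t = torch.rand(B, 1, 1)                   # flow-matching time in [0, 1]
    noise = torch.randn_like(actions)
    x_t = (1 - t) * noise + t * actions       # linear interpolation path
    target_velocity = actions - noise         # rectified-flow velocity target

    # Freeze the executed prefix: the model sees the true actions there.
    executed = torch.zeros(B, T, 1)
    executed[:, :mask_steps] = 1.0
    x_t = executed * actions + (1 - executed) * x_t

    pred_velocity = model(x_t, t)             # hypothetical model signature
    # Only the future ("inpainted") steps contribute to the loss.
    loss = (((pred_velocity - target_velocity) ** 2) * (1 - executed)).mean()
    return loss

# Toy usage with a stand-in model that ignores its inputs.
toy_model = lambda x, t: torch.zeros_like(x)
chunk = torch.randn(8, 16, 36)                # batch of 16-step, 36-DoF chunks
print(rtc_flow_matching_loss(toy_model, chunk))
```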
Experiments: Superior Data Efficiency
The team tested Ψ0 against heavyweights like π0.5, GR00T-N1.6, and H-RDT across eight diverse tasks, including:
- Removing a lid and filling a bottle at a faucet.
- Wiping a bowl and stacking it.
- Pushing a grocery cart and placing items.
Fig 2: Diverse evaluation scenarios including dual-arm coordination and locomotion.
Key Results:
- Success Rate: Ψ0 achieved an average success rate more than 40% higher than the nearest baseline.
- Data Efficiency: It reached this level using only 30 hours of real robot data, whereas other models require hundreds or thousands of hours.
- Ablation: The study confirmed that pre-training on human video (EgoDex) was the "secret sauce"—without it, success rates dropped from 8/10 to 4/10 on complex tasks.
Fig 3: Quantifiable performance gap between Ψ0 and existing SOTA baselines.
Critical Insight: Quality Over Quantity
The fundamental takeaway of Ψ0 is that for humanoid robots, scaling the right data in the right way matters more than raw volume. Using high-quality egocentric video for visual representation learning and a narrow but high-precision set of robot demonstrations for motor control bridges the embodiment gap more effectively than massive, noisy cross-embodiment datasets.
Conclusion & Future Work
Ψ0 represents a significant step toward "Visible Physical Superintelligence." By open-sourcing the entire pipeline, from teleoperation tools to the 2.5B parameter model, the PSI Lab is providing the community with a robust foundation for humanoid research. Future iterations will likely explore scaling to even larger video datasets and increasing the hardware's payload capacity.
