Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control

[CVPR 2025] CLAIMS: Breaking the Difficulty Ceiling in Humanoid Control via Iterative Closed-Loop Synthesis

Abstract

This paper introduces CLAIMS, a closed-loop automated framework for humanoid motion synthesis and control. By iteratively co-evolving a motion diffusion model (MDM) and an RL-based tracker through LLM-driven feedback, it achieves a 45% reduction in failure rates on high-difficulty benchmarks using only 1/10 of the standard AMASS dataset size.

TL;DR

Existing humanoid controllers often "collapse" when faced with gymnastics or martial arts because they are trained on "lazy" datasets like AMASS (mostly walking and daily tasks). CLAIMS (Closed-Loop Automated Iterative Motion Synthesis) solves this by using an LLM to play a "game" with the controller: the LLM generates harder and harder professional motion prompts, the controller tries to learn them, and the failures feed back into the next round of synthesis.

The result? A controller that masters acrobatic flips and kung-fu, cutting failure rates on high-difficulty benchmarks by 45% while training on roughly one tenth of the data in the standard AMASS dataset.

1. The "Static Dataset" Bottleneck

Why can't our simulated humanoids do a double backflip? The answer isn't just the RL algorithm—it's the data.

High Cost: Professional MoCap for acrobatics requires expensive suits and elite athletes.
Difficulty Imbalance: Over 90% of AMASS consists of low-dynamic motions.
Distribution Gap: Controllers trained on "stable walking" cannot generalize to "explosive jumping" because the physical transitions (high acceleration/torque) are absent from the training manifold.

2. Methodology: From Static Data to Competitive Co-evolution

The core innovation of CLAIMS is the Competitive Iterative Loop. Instead of a fixed dataset, the training distribution is a moving target that stays just ahead of the controller's current skill level.

A. The Professional Taxonomy

The authors don't just ask an LLM for "hard motions." They define a 4-axis Difficulty Space:

Base Action: Atomic skills (Kick, Leap).
Combo Action: Composition logic (Roll → Rise → Leap).
Detail: Technical nuance (Precise foot placement).
Speed & Rhythm: Burstiness and tempo.
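A minimal sketch of how the four axes might be combined into a text prompt for the motion generator. The class name `MotionPrompt`, its field names, the helper `enumerate_prompts`, and the template wording are all illustrative assumptions; the summary does not give the authors' actual prompt format.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MotionPrompt:
    """One point in the 4-axis difficulty space (hypothetical structure)."""
    base_action: str  # atomic skill, e.g. "kick"
    combo: tuple      # ordered composition, e.g. ("roll", "rise", "leap")
    detail: str       # technical nuance, e.g. "precise foot placement"
    tempo: str        # speed and rhythm, e.g. "fast"

    def to_text(self) -> str:
        # Join the combo steps in order, then append the remaining axes.
        sequence = ", then ".join(self.combo)
        return (f"A person performs a {self.base_action}: {sequence}, "
                f"with {self.detail}, at a {self.tempo} tempo.")


def enumerate_prompts(bases, combos, details, tempos):
    """Cartesian product over the four axes (assumed sampling strategy)."""
    return [MotionPrompt(b, c, d, t)
            for b, c, d, t in product(bases, combos, details, tempos)]
```

Raising difficulty along one axis then amounts to swapping a single field, for example a longer `combo` tuple or a faster `tempo`, while holding the others fixed.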

B. The Closed-Loop Architecture

[Figure: System Architecture]

The loop consists of four key stages:

Generation: Using a Motion Diffusion Model (MDM) to synthesize motions from expert-templated prompts.
Filtering: A Vision-Language Model (VLM) checks if the motion matches the text, while physics filters remove "glitchy" motions (sinking/floating).
Training: The tracker (e.g., PHC, Perpetual Humanoid Control) learns to imitate the new synthetic motions.
Feedback: An LLM (Gemini with chain-of-thought prompting) receives tracking metrics (MPJPE, mean per-joint position error) and VLM descriptions of the failures to generate the next, harder batch of prompts.
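The four stages can be sketched as a plain loop. This is a skeleton under stated assumptions, not the authors' implementation: `generate`, `semantic_ok`, `train_and_track`, and `propose_harder` are hypothetical callables standing in for the MDM, the VLM check, the tracker, and the LLM. The physics filter is a simple height heuristic that assumes a z-up coordinate system with the floor at z = 0, and its thresholds are made up for illustration.

```python
import math


def mpjpe(pred, ref):
    """Mean per-joint position error over all frames and joints.

    Motions are non-empty nested sequences shaped frames x joints x 3.
    """
    dists = [math.dist(p, r)
             for pred_frame, ref_frame in zip(pred, ref)
             for p, r in zip(pred_frame, ref_frame)]
    return sum(dists) / len(dists)


def passes_physics(motion, floor_tol=0.02, float_height=0.15):
    """Heuristic filter (assumed): reject sinking or floating clips."""
    # Height of the lowest joint in each frame (z is index 2).
    lowest = [min(joint[2] for joint in frame) for frame in motion]
    sinking = any(h < -floor_tol for h in lowest)      # body penetrates the floor
    floating = all(h > float_height for h in lowest)   # body never touches down
    return not (sinking or floating)


def closed_loop(prompts, generate, semantic_ok, train_and_track,
                propose_harder, n_loops):
    """Run generation, filtering, training, and feedback for n_loops rounds."""
    history = []
    for _ in range(n_loops):
        # 1. Generation: one synthetic motion per prompt.
        candidates = [(p, generate(p)) for p in prompts]
        # 2. Filtering: semantic check plus physics heuristic.
        kept = [(p, m) for p, m in candidates
                if semantic_ok(p, m) and passes_physics(m)]
        # 3. Training: the tracker returns its attempt at each kept motion.
        tracked = train_and_track([m for _, m in kept])
        # 4. Feedback: per-prompt tracking error drives the next prompt batch.
        errors = {p: mpjpe(t, m) for (p, m), t in zip(kept, tracked)}
        history.append(errors)
        prompts = propose_harder(errors)
    return history
```

In a real system `propose_harder` would also receive the VLM's textual failure descriptions; here it sees only the per-prompt error to keep the sketch short.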

3. Pushing the Manifold: Does it actually work?

One might ask: if the MDM was trained on AMASS, how can it generate motions harder than AMASS? The authors attribute this to Compositional Extrapolation: by combining learned primitives in novel ways through expert prompting, the MDM can produce motions that lie outside its original training manifold.

[Figure: t-SNE Visualization]

Table 1: The success rate climbs steadily from Loop 0 to Loop 6, eventually crushing the AMASS baseline.

Experimental "War Stories"

AIST++ (Dance): Success rate jumped from 67.6% (Baseline) to 88.1% (L6).
Kungfu: Success rate rose from 47.1% to 60.3%.
Efficiency: The model achieved these wins with roughly one tenth of the AMASS data volume, suggesting that curated difficulty can beat raw scale.
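The success rates above can be restated as failure rates, which is how the headline result is phrased. A small sketch of that conversion, using only the two benchmark figures quoted here. The function name is illustrative, and since this summary does not say how the paper aggregates benchmarks into its overall 45% figure, no aggregate is computed.

```python
def relative_failure_reduction(success_before: float, success_after: float) -> float:
    """Relative drop in failure rate, with both success rates given in percent."""
    fail_before = 100.0 - success_before
    fail_after = 100.0 - success_after
    return (fail_before - fail_after) / fail_before


for name, before, after in [("AIST++", 67.6, 88.1), ("Kungfu", 47.1, 60.3)]:
    reduction = relative_failure_reduction(before, after)
    print(f"{name}: {reduction:.1%} fewer failures")
```

On these two benchmarks the relative reduction works out to roughly 63% for AIST++ (failures fall from 32.4% to 11.9%) and roughly 25% for Kungfu (52.9% to 39.7%).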

4. Visual Evidence: Mastery vs. Collapse

The qualitative difference is striking. When faced with a "Jump Snap Kick," the baseline PHC model (trained on AMASS) loses balance almost immediately as the center of mass shifts too rapidly.

Figure (Qualitative Comparison): The L6 tracker (green) maintains stable air control during an acrobatic flip, while the baseline (red) collapses during the momentum shift.

5. Critical Analysis & Future Outlook

Why it works: CLAIMS acts as a "Physical Curriculum." By starting with simple motions and progressively raising difficulty along the "Speed & Rhythm" and "Combo Action" axes, the RL agent discovers stable recovery strategies for high-torque states that it would otherwise never explore.

Limitations:

Synthesis Ceiling: If the MDM can't synthesize a motion, the controller can't learn it. The framework is limited by the "imagination" of the generative model.
Manual Taxonomy: The variable library (5 domains) still requires human expertise to set up.

Conclusion: CLAIMS provides a blueprint for the future of humanoid robotics. We don't need million-dollar MoCap studios; we need smarter iteration. By letting LLMs, VLMs, and physics filters curate the training "syllabus," we can move closer to teaching robots the agility of human athletes.
