[CVPR 2025] CLAIMS: Breaking the Difficulty Ceiling in Humanoid Control via Iterative Closed-Loop Synthesis
Abstract

This paper introduces CLAIMS, a closed-loop automated framework for humanoid motion synthesis and control. By iteratively co-evolving a motion diffusion model (MDM) and an RL-based tracker through LLM-driven feedback, it achieves a 45% reduction in failure rates on high-difficulty benchmarks using only 1/10 of the standard AMASS dataset size.

TL;DR

Existing humanoid controllers often "collapse" when faced with gymnastics or martial arts because they are trained on "lazy" datasets like AMASS (mostly walking and daily tasks). CLAIMS (Closed-Loop Automated Iterative Motion Synthesis) solves this by using an LLM to play a "game" with the controller: the LLM generates harder and harder professional motion prompts, the controller tries to learn them, and the failures feed back into the next round of synthesis.

The result? A controller that masters acrobatic flips and kung-fu, achieving a 45% failure rate reduction while being trained on 90% less data than standard benchmarks.


1. The "Static Dataset" Bottleneck

Why can't our simulated humanoids do a double backflip? The answer isn't just the RL algorithm—it's the data.

  1. High Cost: Professional MoCap for acrobatics requires expensive suits and elite athletes.
  2. Difficulty Imbalance: Over 90% of AMASS consists of low-dynamic motions.
  3. Distribution Gap: Controllers trained on "stable walking" cannot generalize to "explosive jumping" because the physical transitions (high acceleration/torque) are absent from the training manifold.

2. Methodology: From Static Data to Competitive Co-evolution

The core innovation of CLAIMS is the Competitive Iterative Loop. Instead of a fixed dataset, the training distribution is a moving target that stays just ahead of the controller's current skill level.

A. The Professional Taxonomy

The authors don't just ask an LLM for "hard motions." They define a 4-axis Difficulty Space:

  • Base Action: Atomic skills (Kick, Leap).
  • Combo Action: Composition logic (Roll → Rise → Leap).
  • Detail: Technical nuance (Precise foot placement).
  • Speed & Rhythm: Burstiness and tempo.
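One way to picture the four axes is as fields of a small record that is rendered into a text prompt for the motion generator. The sketch below is illustrative only: the class name `MotionSpec`, the helper `enumerate_specs`, the prompt wording, and the example values are assumptions, and only the four axis names come from the text above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MotionSpec:
    """One point in the 4-axis difficulty space (field values are hypothetical)."""
    base_action: str    # atomic skill, e.g. "kick"
    combo_action: str   # composition logic, e.g. "roll -> rise -> leap"
    detail: str         # technical nuance, e.g. "precise foot placement"
    speed_rhythm: str   # burstiness and tempo, e.g. "explosive"

    def to_prompt(self) -> str:
        # Hypothetical template; the paper's expert templates are not shown here.
        return (f"A person performs a {self.base_action} within the sequence "
                f"{self.combo_action}, with {self.detail}, at a "
                f"{self.speed_rhythm} tempo.")


def enumerate_specs(base_actions, combos, details, tempos):
    """Cartesian product over the four axes."""
    return [MotionSpec(*values)
            for values in product(base_actions, combos, details, tempos)]


specs = enumerate_specs(
    ["kick", "leap"],
    ["roll -> rise -> leap"],
    ["precise foot placement"],
    ["slow", "explosive"],
)
for spec in specs:
    print(spec.to_prompt())
```

Keeping the axes as separate fields means difficulty can be raised along one axis (for example tempo) while the others stay fixed, which is what a structured taxonomy buys over free-form "make it harder" prompting.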

B. The Closed-Loop Architecture

[Figure: System Architecture]

The loop consists of four key stages:

  1. Generation: Using a Motion Diffusion Model (MDM) to synthesize motions from expert-templated prompts.
  2. Filtering: A Vision-Language Model (VLM) checks if the motion matches the text, while physics filters remove "glitchy" motions (sinking/floating).
  3. Training: The tracker (e.g., PHC) learns to imitate the new synthetic motions.
  4. Feedback: An LLM (Gemini CoT) receives tracking metrics (MPJPE, the mean per-joint position error) and VLM descriptions of the failures, then generates the next, harder batch of prompts.
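The four stages can be sketched as a plain Python loop. This is a minimal sketch under stated assumptions, not the authors' implementation: every callable passed to `run_closed_loop` (`generate`, `matches_text`, `train_tracker`, `track`, `propose_prompts`) is a hypothetical stand-in for the MDM, the VLM check, the RL tracker, and the LLM. The physics filter is an assumed heuristic with made-up thresholds, and a motion is assumed to be a list of frames of `(x, y, z)` joint positions with `z` as height above a ground plane at zero. Only the MPJPE definition is standard.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error: mean Euclidean distance over frames and joints."""
    dists = [math.dist(p, t)
             for pred_frame, target_frame in zip(pred, target)
             for p, t in zip(pred_frame, target_frame)]
    return sum(dists) / len(dists)


def passes_physics_filter(motion, max_sink=0.02, contact_height=0.05):
    """Assumed heuristic: reject ground penetration and clips that never touch the ground."""
    lowest = [min(z for _, _, z in frame) for frame in motion]
    sinks = any(z < -max_sink for z in lowest)
    floats = all(z > contact_height for z in lowest)
    return not sinks and not floats


def run_closed_loop(prompts, num_loops, generate, matches_text,
                    train_tracker, track, propose_prompts):
    """Repeat the four stages; all callables are stand-ins supplied by the caller."""
    history = []
    for _ in range(num_loops):
        candidates = [(p, generate(p)) for p in prompts]             # 1. Generation
        kept = [(p, m) for p, m in candidates
                if matches_text(p, m) and passes_physics_filter(m)]  # 2. Filtering
        train_tracker([m for _, m in kept])                          # 3. Training
        report = [(p, mpjpe(track(m), m)) for p, m in kept]          # 4. Feedback
        history.append(report)
        prompts = propose_prompts(report)                            # next batch
    return history


# Toy stand-ins so the sketch runs end to end.
def toy_generate(prompt):
    # Two frames, two joints; the lower joint stays on the ground.
    return [[(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
            [(0.1, 0.0, 0.0), (0.1, 0.0, 1.0)]]


def toy_track(motion):
    # Pretend the tracker is offset from the reference by 0.1 units along x.
    return [[(x + 0.1, y, z) for x, y, z in frame] for frame in motion]


history = run_closed_loop(
    prompts=["jump snap kick"],
    num_loops=2,
    generate=toy_generate,
    matches_text=lambda prompt, motion: True,
    train_tracker=lambda motions: None,
    track=toy_track,
    propose_prompts=lambda report: [p + ", faster" for p, _ in report],
)
print(history)
```

The point of the structure is that `propose_prompts` sees the per-prompt tracking error from the current round, so the next batch depends on where the tracker actually failed rather than on a fixed schedule.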

3. Pushing the Manifold: Does it actually work?

One might ask: if the MDM was trained on AMASS, how can it generate motions harder than anything in AMASS? The authors' answer is Compositional Extrapolation: expert prompts combine learned primitives in novel ways, so the MDM can produce motions that fall outside the distribution of its original training data.

[Figure: t-SNE Visualization]

Table 1: The success rate climbs steadily from Loop 0 to Loop 6, eventually crushing the AMASS baseline.

Experimental "War Stories"

  • AIST++ (Dance): Success rate jumped from 67.6% (baseline) to 88.1% at Loop 6 (L6).
  • Kungfu: Success rate rose from 47.1% to 60.3%.
  • Efficiency: The model achieved these gains with roughly one-tenth of the data used by the AMASS baseline, suggesting that curated difficulty can matter more than raw scale.
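The success rates above can be converted into relative failure-rate reductions, which is the form the headline 45% figure takes. The snippet below is a worked calculation on the two benchmarks listed here; the function name is an assumption, and how the headline number is aggregated across benchmarks is not stated in this summary.

```python
def relative_failure_reduction(success_before, success_after):
    """Relative drop in failure rate, where failure = 1 - success."""
    fail_before = 1.0 - success_before
    fail_after = 1.0 - success_after
    return (fail_before - fail_after) / fail_before


results = {
    "AIST++ (Dance)": (0.676, 0.881),
    "Kungfu": (0.471, 0.603),
}
for name, (before, after) in results.items():
    reduction = relative_failure_reduction(before, after)
    print(f"{name}: {reduction:.1%}")
# AIST++ (Dance): 63.3%
# Kungfu: 25.0%
```

On these two benchmarks the failure rate falls from 32.4% to 11.9% and from 52.9% to 39.7%, so the per-benchmark reductions sit on either side of the 45% headline figure.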

4. Visual Evidence: Mastery vs. Collapse

The qualitative difference is striking. When faced with a "Jump Snap Kick," the baseline PHC model (trained on AMASS) loses balance almost immediately as the center of mass shifts too rapidly.

[Figure: Qualitative Comparison] The L6 tracker (green) maintains stable air-control during an acrobatic flip, while the baseline (red) collapses during the momentum shift.


5. Critical Analysis & Future Outlook

Why it works: CLAIMS acts as a "Physical Curriculum." By starting with simple motions and progressively increasing the "Speed & Rhythm" and "Combo Complexity," the RL agent discovers stable recovery strategies for high-torque states that it would otherwise never explore.
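The curriculum idea can be illustrated with a simple rule-based scheduler. This is a deliberately simplified stand-in: in CLAIMS the next batch is proposed by an LLM from tracking metrics and failure descriptions, not by a fixed rule. The class name `Curriculum`, the integer difficulty levels, and the `promote_at` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Curriculum:
    """Rule-based stand-in for the LLM feedback step (all values hypothetical)."""
    speed_level: int = 0      # "Speed & Rhythm" difficulty
    combo_level: int = 0      # "Combo Complexity" difficulty
    max_level: int = 5
    promote_at: float = 0.8   # success rate required before raising difficulty

    def update(self, success_rate: float) -> None:
        """Raise difficulty only once the tracker copes with the current level."""
        if success_rate < self.promote_at:
            return  # keep practising at the current difficulty
        # Alternate between axes so neither runs far ahead of the other.
        if self.speed_level <= self.combo_level and self.speed_level < self.max_level:
            self.speed_level += 1
        elif self.combo_level < self.max_level:
            self.combo_level += 1


curriculum = Curriculum()
for success_rate in [0.5, 0.85, 0.9, 0.6, 0.95]:
    curriculum.update(success_rate)
print(curriculum.speed_level, curriculum.combo_level)  # 2 1
```

Gating promotion on the success rate keeps the training distribution close to the tracker's current ability, which is the property the curriculum argument relies on.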

Limitations:

  • Synthesis Ceiling: If the MDM can't synthesize a motion, the controller can't learn it. The framework is limited by the "imagination" of the generative model.
  • Manual Taxonomy: The variable library (5 domains) still requires human expertise to set up.

Conclusion: CLAIMS offers a blueprint for training agile humanoid controllers. We don't need million-dollar MoCap studios; we need smarter iteration. By letting LLMs, VLMs, and physics filters curate the training "syllabus," we can move closer to teaching robots the agility of human athletes.
