This paper introduces CLAIMS, a closed-loop automated framework for humanoid motion synthesis and control. By iteratively co-evolving a motion diffusion model (MDM) and an RL-based tracker through LLM-driven feedback, it achieves a 45% reduction in failure rates on high-difficulty benchmarks using only 1/10 of the standard AMASS dataset size.
TL;DR
Existing humanoid controllers often "collapse" when faced with gymnastics or martial arts because they are trained on datasets like AMASS, which consist mostly of walking and everyday activities. CLAIMS (Closed-Loop Automated Iterative Motion Synthesis) addresses this by having an LLM play a "game" with the controller: the LLM writes progressively harder professional motion prompts, the controller tries to learn the resulting motions, and its failures feed back into the next round of synthesis.
The result? A controller that handles acrobatic flips and kung-fu, with a 45% reduction in failure rate on high-difficulty benchmarks while training on roughly one-tenth as much data as the standard AMASS dataset.
1. The "Static Dataset" Bottleneck
Why can't our simulated humanoids do a double backflip? The answer isn't just the RL algorithm; it's the data.
- High Cost: Professional MoCap for acrobatics requires expensive suits and elite athletes.
- Difficulty Imbalance: Over 90% of AMASS consists of low-dynamic motions.
- Distribution Gap: Controllers trained on "stable walking" cannot generalize to "explosive jumping" because the physical transitions (high acceleration/torque) are absent from the training manifold.
2. Methodology: From Static Data to Competitive Co-evolution
The core innovation of CLAIMS is the Competitive Iterative Loop. Instead of a fixed dataset, the training distribution is a moving target that stays just ahead of the controller's current skill level.
A. The Professional Taxonomy
The authors don't just ask an LLM for "hard motions." They define a 4-axis Difficulty Space:
- Base Action: Atomic skills (Kick, Leap).
- Combo Action: Composition logic (Roll → Rise → Leap).
- Detail: Technical nuance (Precise foot placement).
- Speed & Rhythm: Burstiness and tempo.
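One way to picture the four axes is as fields of a small record that gets rendered into a text prompt for the MDM. The sketch below is a minimal illustration under that assumption: the class name `MotionSpec`, its field names, and the sentence template are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MotionSpec:
    """Hypothetical container for one point in the 4-axis difficulty space."""
    base_action: str          # atomic skill, e.g. "kick" or "leap"
    combo: Tuple[str, ...]    # ordered composition, e.g. ("roll", "rise", "leap")
    detail: str               # technical nuance, e.g. "precise foot placement"
    speed_rhythm: str         # tempo descriptor, e.g. "fast"

    def to_prompt(self) -> str:
        # Use the combo sequence when one is given, else the single base action.
        sequence = " then ".join(self.combo) if self.combo else self.base_action
        return (
            f"A person performs {sequence} with {self.detail}, "
            f"at a {self.speed_rhythm} tempo."
        )


spec = MotionSpec(
    base_action="leap",
    combo=("roll", "rise", "leap"),
    detail="precise foot placement",
    speed_rhythm="fast",
)
print(spec.to_prompt())
```

Keeping the axes as separate fields means difficulty can be raised along one axis at a time (for example, a longer combo at the same tempo), which is the kind of control the taxonomy is meant to give the prompt generator.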
B. The Closed-Loop Architecture

The loop consists of four key stages:
- Generation: Using a Motion Diffusion Model (MDM) to synthesize motions from expert-templated prompts.
- Filtering: A Vision-Language Model (VLM) checks if the motion matches the text, while physics filters remove "glitchy" motions (sinking/floating).
- Training: The tracker (e.g., PHC) learns to imitate the new synthetic motions.
- Feedback: An LLM (Gemini with chain-of-thought prompting) receives tracking metrics (MPJPE, mean per-joint position error) and VLM descriptions of the failures, then generates the next, harder batch of prompts.
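The four stages above can be sketched as a plain loop over pluggable components. This is a structural illustration only, not the authors' implementation: `run_closed_loop`, `passes_physics_filter`, the callable parameters standing in for the MDM, the filters, the tracker, and the LLM, and all threshold values are assumptions introduced here.

```python
from typing import Callable, Dict, List, Sequence


def passes_physics_filter(
    lowest_height_per_frame: Sequence[float],
    sink_tol: float = 0.02,     # metres below ground tolerated (assumed value)
    float_tol: float = 0.05,    # height above which a frame counts as airborne (assumed)
    max_air_frames: int = 45,   # longest airborne run accepted (assumed value)
) -> bool:
    """Reject motions whose lowest body point sinks into or hovers above the ground."""
    air_run = 0
    for h in lowest_height_per_frame:
        if h < -sink_tol:
            return False                      # sinking: body penetrates the ground
        air_run = air_run + 1 if h > float_tol else 0
        if air_run > max_air_frames:
            return False                      # floating: airborne for too long
    return True


def run_closed_loop(
    prompts: List[str],
    generate: Callable[[str], object],                          # stand-in for the MDM
    is_valid: Callable[[str, object], bool],                    # stand-in for VLM + physics filters
    train_and_eval: Callable[[Dict[str, object]], Dict[str, float]],  # tracker; returns per-prompt error
    propose: Callable[[Dict[str, float]], List[str]],           # stand-in for the LLM feedback step
    num_loops: int,
) -> List[Dict[str, float]]:
    """Run generation, filtering, training, and feedback for num_loops rounds."""
    history: List[Dict[str, float]] = []
    for _ in range(num_loops):
        motions = {p: generate(p) for p in prompts}                    # 1. Generation
        kept = {p: m for p, m in motions.items() if is_valid(p, m)}    # 2. Filtering
        metrics = train_and_eval(kept)                                 # 3. Training
        history.append(metrics)
        prompts = propose(metrics)                                     # 4. Feedback
    return history
```

Passing the components in as callables keeps the loop itself independent of any particular diffusion model, tracker, or LLM API. Note that a fixed airborne limit like `max_air_frames` would need tuning for acrobatic motions, where long flight phases are legitimate.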
3. Pushing the Manifold: Does it actually work?
One might ask: if the MDM was trained on AMASS, how can it generate motions harder than AMASS? The authors attribute this to Compositional Extrapolation: by combining learned primitives in novel ways through expert prompting, the MDM can produce motions that lie outside its original training manifold.
Table 1: The success rate climbs steadily from Loop 0 to Loop 6, eventually crushing the AMASS baseline.
Experimental "War Stories"
- AIST++ (Dance): Success rate jumped from 67.6% (Baseline) to 88.1% (Loop 6, abbreviated L6).
- Kungfu: Success rate rose from 47.1% to 60.3%.
- Efficiency: The model achieved these gains with roughly one-tenth of the data, suggesting that curated difficulty can beat raw scale.
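To relate these success rates to a failure-rate reduction, the short calculation below converts each pair of numbers. It assumes that failure rate is simply 100% minus success rate, which this summary does not state explicitly; the function name is illustrative.

```python
def relative_failure_reduction(success_before: float, success_after: float) -> float:
    """Fractional drop in failure rate, assuming failure = 100 - success (in percent)."""
    fail_before = 100.0 - success_before
    fail_after = 100.0 - success_after
    return (fail_before - fail_after) / fail_before


# Success rates quoted above (baseline vs. Loop 6).
results = {"AIST++": (67.6, 88.1), "Kungfu": (47.1, 60.3)}
for name, (before, after) in results.items():
    reduction = relative_failure_reduction(before, after)
    print(f"{name}: {reduction:.1%} fewer failures")
```

Under that assumption the reduction is about 63% on AIST++ and about 25% on Kungfu. How the 45% headline figure aggregates across benchmarks is not specified in this summary.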
4. Visual Evidence: Mastery vs. Collapse
The qualitative difference is striking. When faced with a "Jump Snap Kick," the baseline PHC model (trained on AMASS) loses balance almost immediately as the center of mass shifts too rapidly.
Figure: The Loop 6 (L6) tracker (Green) maintains stable air-control during an acrobatic flip, while the baseline (Red) collapses during the momentum shift.
5. Critical Analysis & Future Outlook
Why it works: CLAIMS acts as a "Physical Curriculum." By starting with simple motions and progressively increasing the "Speed & Rhythm" and "Combo Complexity," the RL agent discovers stable recovery strategies for high-torque states that it would otherwise never explore.
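The curriculum idea can be made concrete with a tracking-error metric and a simple promotion rule. In CLAIMS the decision about what to generate next is made by an LLM; the hand-written rule below is only a stand-in to show the shape of the feedback, and `next_level`, `promote_at`, and `max_level` are hypothetical names and values. The `mpjpe` function follows the standard definition of the metric (mean Euclidean distance between predicted and reference joint positions).

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]
Frame = Sequence[Joint]


def mpjpe(pred: Sequence[Frame], ref: Sequence[Frame]) -> float:
    """Mean per-joint position error, averaged over all frames and joints."""
    dists = [
        math.dist(p, r)                       # Euclidean distance for one joint
        for pred_frame, ref_frame in zip(pred, ref)
        for p, r in zip(pred_frame, ref_frame)
    ]
    return sum(dists) / len(dists)


def next_level(level: int, success_rate: float,
               promote_at: float = 0.8, max_level: int = 10) -> int:
    """Raise difficulty by one step only once the tracker clears promote_at."""
    if success_rate >= promote_at:
        return min(level + 1, max_level)
    return level                              # stay at this level and keep practising


# One frame, two joints: the second joint is off by 2 units along z.
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
ref = [[(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]]
print(mpjpe(pred, ref))
```

Holding the level until the tracker succeeds often enough is what keeps the training distribution just ahead of the controller rather than far beyond it.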
Limitations:
- Synthesis Ceiling: If the MDM can't visualize a motion, the controller can't learn it. The framework is limited by the "imagination" of the generative model.
- Manual Taxonomy: The variable library (5 domains) still requires human expertise to set up.
Conclusion: CLAIMS provides a blueprint for the future of humanoid robotics. We don't need million-dollar MoCap studios; we need smarter iteration. By letting LLMs, VLMs, and physics filters curate the training "syllabus," we can move closer to teaching robots the agility of human athletes.
