HumanScore is a systematic benchmarking framework developed by Stanford researchers to evaluate human motion realism in AI-generated videos. It introduces six interpretable metrics grounded in biomechanics—spanning anatomy, kinematics, and kinetics—and ranks 13 SOTA models, revealing that Seedance 1.0 Pro and HunyuanVideo 1.5 currently lead the field in physical plausibility.
TL;DR
While flagship AI models like Sora and Kling can generate breathtaking visuals, they often fail the "physicality test." HumanScore, a new benchmark from Stanford University, moves past pixel-realism to evaluate whether AI-generated humans actually move like biological beings. By extracting 3D skeletons and applying biomechanical laws, it provides a rigorous ranking of current SOTA models, showing that even the best generators still struggle with basic anatomical consistency.
The "Glitch" in the Matrix: Why Visual Realism Isn't Enough
If you look at a modern AI video of a ballerina, the lighting and textures are stunning. However, look closer at the limbs: do the bones stay the same length during a turn? Does the knee bend at an angle that would snap a human tendon?
Traditional benchmarks like VBench or FVD (Fréchet Video Distance) are "appearance-centric." They check if the pixels look right, but they don't understand that a human has a rigid skeleton and mass. HumanScore addresses this by shifting the evaluation from 2D pixels to 3D structural dynamics.
Methodology: The Three-Tier Biomechanical Hierarchy
The authors propose a structured evaluation that mimics the hierarchy of human movement science:
- Anatomical Correctness (The Foundation): Uses artifact detectors (like HADM) to find "ghost" limbs and checks whether bone segments keep a constant length over time.
- Kinematic Correctness (The Geometry): Maps video motion to an OpenSim skeleton to check if joint angles stay within biological limits (Range of Motion) and uses 3D mesh intersection tests to detect self-collision (e.g., an arm passing through a torso).
- Kinetic Correctness (The Physics): Since muscle forces can't be measured directly from video, the authors lean on Newton's Second Law (F = ma): force is proportional to acceleration, so abrupt changes in acceleration imply implausible changes in force. They track velocity and "jerk" (the rate of change of acceleration) to penalize unnatural jitters and physics-defying speed spikes.
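To make the three tiers concrete, here is a minimal sketch of one check per tier, assuming you already have per-frame 3D joint positions and joint angles from a pose estimator. This is not the authors' implementation: the function names, the finite-difference scheme, and the knee range used in the usage note are illustrative assumptions, and the real pipeline relies on tools such as HADM and OpenSim that are not reproduced here.

```python
import math
from statistics import mean, pstdev
from typing import List, Tuple

Vec3 = Tuple[float, float, float]  # one 3D joint position (hypothetical format)


def bone_length_cv(parent: List[Vec3], child: List[Vec3]) -> float:
    """Anatomy: coefficient of variation of one bone's length across frames.

    0.0 means the segment is perfectly rigid; larger values mean it stretches.
    """
    lengths = [math.dist(p, c) for p, c in zip(parent, child)]
    return pstdev(lengths) / mean(lengths)


def rom_violation_rate(angles_deg: List[float], lo: float, hi: float) -> float:
    """Kinematics: fraction of frames whose joint angle leaves [lo, hi] degrees."""
    outside = sum(1 for a in angles_deg if not lo <= a <= hi)
    return outside / len(angles_deg)


def mean_jerk(track: List[Vec3], fps: float) -> float:
    """Kinetics: mean jerk magnitude of a joint track via third finite differences."""
    dt = 1.0 / fps
    jerks = []
    for i in range(len(track) - 3):
        p0, p1, p2, p3 = track[i:i + 4]
        # Third forward difference per axis: (p3 - 3*p2 + 3*p1 - p0) / dt^3
        d3 = [(d - 3 * c + 3 * b - a) / dt ** 3
              for a, b, c, d in zip(p0, p1, p2, p3)]
        jerks.append(math.hypot(*d3))
    return mean(jerks)
```

As a usage example, a thigh whose hip stays at the origin while the knee alternates between 0.4 m and 0.5 m away gives `bone_length_cv` of about 0.11, and `rom_violation_rate([10, 60, 150, -20], 0, 140)` flags half the frames under an assumed 0 to 140 degree knee range. How such raw values would be turned into a 0 to 100 score is not described in this summary, so the sketch stops at the raw measurements.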
Fig 1: The HumanScore pipeline: From motion curation and prompt design to multifaceted biomechanical evaluation.
Exploring the Leaderboard: Who Moves Best?
The study benchmarked 13 models, including proprietary giants and open-source contenders.
- The Leaders: Seedance 1.0 Pro and HunyuanVideo 1.5 co-lead the board with an overall score of 91.1.
- The Gap: Real human videos score 94.3, about 3.2 points above the best generators, so a measurable gap remains.
- Specific Strengths: KlingAI showed the highest Kinetic and Kinematic scores, meaning its motions are physically "smoother," while HunyuanVideo 1.5 excelled in Anatomy, generating fewer "spurious limbs."
Table 1: The HumanScore Leaderboard. Note how real videos still lead, but the top AI models are closing the gap.
Key Insights & Failure Modes
The study identified several recurring "AI-isms":
- Temporal Drift: Parts of the body (like feet) tend to "slide" or change scale when the model focuses on complex movements.
- The "Intensity" Penalty: As the motion becomes more intense (e.g., from walking to sprinting), all models experience a significant drop in score. Harder motions like ballet or parkour act as the ultimate "stress test" for generative physics.
- Prompting isn't a silver bullet: Even with highly detailed "prompt engineering" to specify full-body, static camera shots, the underlying generative engines still struggle to maintain physical constraints.
Conclusion and Future Outlook
HumanScore is a wake-up call for the generative AI community. Its results make the case that moving from "fun creative tools" to "reliable simulation tools" for industries like sports science, medicine, or high-end filmmaking will require physics-informed architectures.
The modular design of HumanScore allows it to evolve as better 3D pose estimators (like PromptHMR) emerge. For researchers, the goal is now clear: don't just generate a human—generate a human that obeys the laws of biology.
Note: This benchmark provides the first technical "compass" for navigating the complex intersection of human motion reconstruction and video generation.
