[CVPR 2025] PAM: A Decoupled Engine for Sim-to-Real Hand-Object Interaction Video Generation
Abstract

PAM (Pose–Appearance–Motion) is a unified engine for high-fidelity hand-object interaction (HOI) video generation. It uses a three-stage diffusion pipeline to convert sparse pose keyframes and object geometry into photorealistic videos, achieving SOTA performance on the DexYCB (FVD 29.13) and OAKINK2 benchmarks.

TL;DR

PAM (Pose–Appearance–Motion) is a state-of-the-art framework designed to generate realistic Hand-Object Interaction (HOI) videos starting from nothing more than initial/target poses and object geometry. By decoupling the generation into pose trajectories, a synthesized reference frame, and temporal motion, PAM eliminates the need for ground-truth real-world images, enabling a seamless transition from simulation to photorealistic data.

The "First-Frame" Bottleneck in HOI

The current landscape of HOI synthesis is caught between three suboptimal paradigms: skeletal animation with no pixels (raw MANO trajectories), static image hallucinations that fall apart the moment things move, or video models that "cheat" by requiring a real ground-truth first frame as input.

The authors argue that for Sim-to-Real to work, we need an engine that can imagine the appearance from scratch based only on geometric constraints. The goal is to take a robotic simulator's output and turn it into a video that looks like it was shot in a real kitchen.

Methodology: The Trimodal Diffusion Pipeline

The core innovation of PAM lies in its three-stage decoupled architecture. Instead of trying to learn pose and appearance simultaneously, which often leads to "motion blur" or "mutated hands," PAM breaks the task down (a minimal code sketch follows the list):

  1. Pose Generation: Uses a pretrained model (GraspXL) to interpolate a physically plausible trajectory between two keyframes.
  2. Appearance Generation: This is the "Vision" stage. Based on the skeletal pose, the model uses a controllable image diffusion model (Flux) to generate a high-quality reference frame. To ensure the hand looks human, they use Trimodal Conditioning: Depth, Segmentation, and Hand Keypoints.
  3. Motion Generation: Acts as the "Animator". A video diffusion model (CogVideoX) takes the first frame and the rest of the generated poses to synthesize the full video clip.
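To make the data flow concrete, here is a minimal, runnable Python sketch of the decoupling. Every function is a hypothetical stand-in (random tensors in place of GraspXL, Flux, and CogVideoX), not the authors' released code:

```python
import numpy as np

# Hypothetical stand-ins for the three pretrained stages (GraspXL,
# Flux, CogVideoX in the paper); random tensors keep this sketch
# self-contained and runnable -- swap in the real models.

def pose_stage(start_pose, target_pose, n_frames=40):
    """Stage 1 -- Pose: interpolate a trajectory between two keyframes."""
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1 - t) * start_pose + t * target_pose      # (n_frames, 63)

def appearance_stage(depth, seg, keypoints):
    """Stage 2 -- Appearance: trimodal-conditioned image diffusion."""
    return np.random.rand(512, 512, 3)                 # reference frame

def motion_stage(first_frame, trajectory):
    """Stage 3 -- Motion: pose-conditioned video diffusion."""
    return np.stack([first_frame] * len(trajectory))   # placeholder clip

def generate_hoi_video(start_pose, target_pose):
    trajectory = pose_stage(start_pose, target_pose)
    depth, seg = np.zeros((512, 512)), np.zeros((512, 512))  # rendered maps
    keypoints = trajectory[0].reshape(21, 3)           # skeleton of frame 0
    first_frame = appearance_stage(depth, seg, keypoints)
    return motion_stage(first_frame, trajectory)

video = generate_hoi_video(np.zeros(63), np.ones(63))
print(video.shape)                                     # (40, 512, 512, 3)
```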

PAM Overall Architecture

Why it Works: The Physics of Intuition

Why bother with three different conditions? The authors' ablation study reveals a crucial insight:

  • Segmentation masks give the global shape but can miss the correct number of fingers.
  • Depth maps provide geometry but often fail at fine-grained joint articulation.
  • Hand keypoints ensure the internal skeleton is correct but lack surface context.

By fusing these, PAM achieves a level of "Geometric Fidelity" that previous models like CosHand or InterDyn lacked.
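As an illustration, the simplest way to fuse the three conditions is channel-wise concatenation of a depth map, a segmentation mask, and per-joint keypoint heatmaps. How PAM actually injects this tensor into its diffusion backbone is not something this sketch claims to reproduce; concatenation is just the most basic fusion:

```python
import numpy as np

def keypoint_heatmaps(kpts_2d, h=512, w=512, sigma=4.0):
    """Render 2D hand keypoints as Gaussian heatmaps (one per joint)."""
    ys, xs = np.mgrid[0:h, 0:w]
    maps = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            for x, y in kpts_2d]
    return np.stack(maps)                       # (21, H, W)

def fuse_trimodal(depth, seg, kpts_2d):
    """Stack the three condition modalities into one conditioning tensor.

    Each modality covers a different failure mode: seg gives the global
    silhouette, depth gives surface geometry, and keypoints pin down
    the internal skeleton.
    """
    return np.concatenate([depth[None], seg[None],
                           keypoint_heatmaps(kpts_2d)], axis=0)  # (23, H, W)

cond = fuse_trimodal(np.zeros((512, 512)), np.zeros((512, 512)),
                     np.random.rand(21, 2) * 512)
print(cond.shape)                               # (23, 512, 512)
```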

Performance & Downstream Utility

PAM doesn't just produce "pretty" videos; it produces accurate ones. On the DexYCB benchmark, PAM achieves an FVD (Fréchet Video Distance) of 29.13, smashing the previous SOTA of 38.83. More importantly, the MPJPE (Mean Per-Joint Position Error) is reduced to 19.37mm, meaning the pixels actually land where the joints are supposed to be.

Quantitative Performance Comparison
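For reference, MPJPE is simply the average Euclidean distance between predicted and ground-truth joints. A minimal implementation, with synthetic arrays standing in for joints lifted from the generated frames by a pose estimator:

```python
import numpy as np

def mpjpe(pred_joints, gt_joints):
    """Mean Per-Joint Position Error in millimetres.

    pred_joints, gt_joints: (T, 21, 3) arrays of 3D hand joints in mm.
    """
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

gt = np.random.rand(40, 21, 3) * 100                      # toy ground truth
pred = gt + np.random.normal(scale=12.1, size=gt.shape)   # ~19 mm mean error
print(f"MPJPE: {mpjpe(pred, gt):.2f} mm")
```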

The Cold Start Solution

The ultimate test for any generative model is: "Does this data help train other models?" The authors used PAM to generate 3,400 synthetic videos to train a hand pose estimator (SimpleHand). The result was startling: a model trained on 50% real data + PAM synthetic data performed as well as a model trained on 100% real data, effectively cutting the requirement for expensive human labeling in half.
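A toy sketch of that mixing recipe in PyTorch, with random tensors standing in for the actual DexYCB clips and SimpleHand inputs (the real-set size here is illustrative, not from the paper):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset, TensorDataset

# `real` stands in for the labelled benchmark; `synthetic` holds
# PAM-generated clips, whose conditioning poses double as free labels.
real = TensorDataset(torch.randn(1000, 3), torch.randn(1000, 21, 3))
synthetic = TensorDataset(torch.randn(3400, 3), torch.randn(3400, 21, 3))

half_real = Subset(real, range(len(real) // 2))   # keep 50% of real labels
train_set = ConcatDataset([half_real, synthetic]) # top up with synthetic
loader = DataLoader(train_set, batch_size=32, shuffle=True)
print(len(train_set))                             # 500 real + 3400 synthetic
```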

Critical Analysis & Future Outlook

While PAM is a breakthrough, it isn't perfect. The researchers noted that error propagation remains a challenge. If the Stage-I pose generation creates an "interpenetrating" grasp (fingers passing through the object), the video model will faithfully render that physical impossibility with photorealistic textures.
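One plausible mitigation (purely an assumption; the paper does not describe such a filter) is a Stage-I sanity check that rejects grasps whose joints penetrate the object's signed distance field before they ever reach the video model:

```python
import numpy as np

def interpenetration_depth(joints, sdf):
    """Max penetration of hand joints into an object, given its signed
    distance function (negative inside). Returns 0 for clean grasps."""
    d = np.array([sdf(p) for p in joints])
    return max(0.0, -d.min())

# Toy object: a sphere of radius 5 cm centred at the origin.
sphere_sdf = lambda p: np.linalg.norm(p) - 0.05

grasp = np.random.rand(21, 3) * 0.2 - 0.1   # random joints in a 20 cm cube
print(f"penetration: {interpenetration_depth(grasp, sphere_sdf) * 100:.1f} cm")
```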

Furthermore, the pipeline is computationally heavy: generating a 40-frame video takes roughly 300 seconds on an H20 GPU, about 7.5 seconds per frame. Future iterations will likely look toward unifying these stages into a single, faster end-to-end transformer.

Conclusion

PAM proves that the future of HOI research isn't just about better datasets—it's about better engines. By decoupling motion and appearance, PAM provides a scalable way to bridge the gap between robotic simulators and the real world.

Qualitative Comparison
