[arXiv 2026] AnyHand: Scaling Synthetic Data to Master 3D Hand Pose Estimation
Abstract

AnyHand is a large-scale synthetic RGB-D dataset containing 2.5M single-hand and 4.1M hand-object interaction images designed for 3D hand pose estimation. By co-training with this data, existing SOTA models like HaMeR and WiLoR achieve significant performance gains, including a 7.6% error reduction on FreiHAND and a 41.7% reduction in RGB-D tracking error on HO-3D.

TL;DR

AnyHand is a massive new synthetic dataset (6.6M+ images) that addresses the critical data bottleneck in 3D hand pose estimation. By bridging the gap between "perfect" synthetic labels and "messy" real-world visuals, it allows standard models like HaMeR to outperform specialized SOTA architectures. It also introduces a robust RGB-D fusion method that slashes depth-based hand tracking errors by over 40%.

Background: The Diversity Bottleneck

In the race to build "Foundation Models" for human-centric AI, hand pose estimation has hit a wall. While Transformers have proven they can handle the complexity, they are data-hungry. Real-world datasets like FreiHAND or HO-3D are small and often have noisy 3D labels due to the difficulty of annotating small, self-occluding joints.

Previous "Sim-to-Real" efforts largely failed because synthetic hands looked like plastic, lacked arms, or didn't interact realistically with objects. AnyHand changes this by treating simulation as a first-class citizen for data generation.

Methodology: The Simulation-Native Advantage

The authors didn't just render random hands; they built a pipeline focused on geometric grounding and visual diversity.

1. High-Fidelity Synthesis

Instead of simple heuristic sampling, AnyHand uses DPoser-Hand, a diffusion-based prior, to ensure poses are anatomically plausible yet diverse. It incorporates 10,240 high-frequency textures and, crucially, adds forearm context (skin and clothing), which helps the model localize the wrist more accurately.
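The post does not spell out how the prior is sampled, so the following is only a minimal sketch of how a diffusion-based pose prior in the spirit of DPoser-Hand could be drawn from: a stand-in MLP denoiser runs a short reverse-diffusion loop over MANO-style joint-angle parameters. The denoiser architecture, the 45-dimensional pose vector, and the step count are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: the real DPoser-Hand network and noise schedule are
# not reproduced here. The denoiser below is a stand-in MLP, and the 45-dim
# MANO joint-angle parameterization (15 joints x 3 axis-angle) is an assumption.
POSE_DIM = 45
NUM_STEPS = 50  # reverse-diffusion steps (assumed, not from the paper)

class PoseDenoiser(nn.Module):
    """Toy stand-in for the learned denoising network."""
    def __init__(self, dim=POSE_DIM, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t):
        # Condition on the (normalized) timestep by simple concatenation.
        t_feat = t.float().view(-1, 1) / NUM_STEPS
        return self.net(torch.cat([x_t, t_feat], dim=-1))

@torch.no_grad()
def sample_poses(denoiser, n=16):
    """Draw n plausible-yet-diverse hand poses by reverse diffusion from noise."""
    x = torch.randn(n, POSE_DIM)                 # start from Gaussian noise
    for step in reversed(range(NUM_STEPS)):
        t = torch.full((n,), step)
        pred_noise = denoiser(x, t)
        x = x - pred_noise / NUM_STEPS           # crude Euler-style update
        if step > 0:
            x = x + 0.01 * torch.randn_like(x)   # small stochasticity per step
    return x                                     # MANO-style pose parameters

poses = sample_poses(PoseDenoiser())
print(poses.shape)  # torch.Size([16, 45])
```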

2. RGB-D Fusion Architecture

To leverage depth, the authors proposed a lightweight bidirectional cross-attention module. This allows a standard Vision Transformer (ViT) to "attend" to geometric cues (like finger depth differences) while processing RGB features.
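Since the module is only described at a high level, here is a hedged PyTorch sketch of a bidirectional cross-attention block between RGB and depth token streams. The embedding dimension, head count, and normalization placement are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Minimal sketch of RGB<->depth token fusion via two cross-attention passes.

    Dimensions, head count, and the residual/LayerNorm arrangement are
    assumptions; only the overall bidirectional pattern follows the paper's
    description.
    """
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, depth_tokens):
        # RGB tokens query geometric cues (e.g. finger depth differences)
        # from the depth stream...
        rgb_upd, _ = self.rgb_from_depth(
            query=self.norm_rgb(rgb_tokens), key=depth_tokens, value=depth_tokens)
        # ...and depth tokens query appearance cues from the RGB stream.
        depth_upd, _ = self.depth_from_rgb(
            query=self.norm_depth(depth_tokens), key=rgb_tokens, value=rgb_tokens)
        # Residual updates keep both streams intact for the backbone.
        return rgb_tokens + rgb_upd, depth_tokens + depth_upd

# Example: fuse 196 patch tokens from each modality.
fusion = BidirectionalCrossAttention()
rgb = torch.randn(2, 196, 768)
dep = torch.randn(2, 196, 768)
rgb_out, dep_out = fusion(rgb, dep)
print(rgb_out.shape, dep_out.shape)
```

In a design like this, the updated RGB tokens can be passed straight back into the unchanged ViT blocks, which is one way such a fusion module stays lightweight.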

Figure 1: Examples from AnyHand-Single and AnyHand-Interact showing diverse backgrounds, arm styles, and complex object interactions.

Experiments: Data Over Architecture

One of the most striking findings is that adding AnyHand data to an existing model (HaMeR) yields better results than switching to a more "advanced" architecture (WiLoR) trained on standard data.

Key Results:

  • FreiHAND (RGB): PA-MPJPE dropped by 7.6% just by adding the synthetic data.
  • HO-3D (RGB-D): The model reached an STA-MPJPE of 1.09 cm, a massive improvement over prior specialized RGB-D works.
  • Zero-Shot Generalization: Models trained with AnyHand generalized significantly better to the HO-Cap dataset—an entirely unseen domain—proving that synthetic diversity builds robust features.

Table 1: Performance on HO-3D v2. Note the large gap between AnyHandNet-D and prior methods like Keypoint-Fusion.

Why It Works: The "Implicit Depth" Perspective

A significant insight from the paper is that purely RGB pipelines are ill-posed in depth: they can align 2D keypoints perfectly yet still get the 3D scale wrong. AnyHand provides a consistent bridge. By including aligned depth in training, even the RGB-only variants learn to infer the underlying 3D structure more accurately. A toy illustration of the scale ambiguity is sketched below.
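In the NumPy example below (all numbers invented for illustration), a hand that is twice as large and twice as far away projects to exactly the same 2D keypoints under a pinhole camera, so RGB keypoints alone cannot disambiguate metric scale.

```python
import numpy as np

# Toy pinhole camera: identical 2D projections for a hand that is twice as
# large and twice as far away. Focal length and point coordinates are
# illustrative only.
f = 1000.0  # focal length in pixels (assumed)

def project(points_3d):
    """Perspective projection of (N, 3) camera-space points to pixel coordinates."""
    return f * points_3d[:, :2] / points_3d[:, 2:3]

hand_near = np.array([[0.00, 0.00, 0.40],
                      [0.02, 0.01, 0.42],
                      [0.05, 0.03, 0.45]])   # metres, ~40 cm from the camera
hand_far = 2.0 * hand_near                   # twice the size, twice the distance

print(np.allclose(project(hand_near), project(hand_far)))  # True
```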

The researchers even found that using estimated depth (from MoGe-2) at inference sometimes outperformed ground-truth depth maps, because the estimated depth was smoother and more consistent with the synthetic training distribution.

Critical Analysis & Conclusion

While AnyHand shows the power of scaling, it isn't a silver bullet. The authors admit that training on synthetic data alone is insufficient; the best results always come from a "Sim+Real" co-training recipe.
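As a rough illustration of such a co-training recipe, the sketch below mixes a synthetic and a real dataset with a tunable sampling ratio using a standard PyTorch weighted sampler. The dataset handles and the 70/30 split are placeholders, not the ratio reported in the paper.

```python
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

# Hedged sketch: mix synthetic (AnyHand-style) and real (e.g. FreiHAND-style)
# samples in each batch with a tunable ratio. `synthetic_ds`, `real_ds`, and
# the default 70/30 split are placeholders, not the paper's recipe.
def make_cotraining_loader(synthetic_ds, real_ds, synth_fraction=0.7,
                           batch_size=64, num_samples=100_000):
    combined = ConcatDataset([synthetic_ds, real_ds])
    # Per-sample weights so that draws are ~synth_fraction synthetic,
    # regardless of the raw dataset sizes.
    w_synth = synth_fraction / len(synthetic_ds)
    w_real = (1.0 - synth_fraction) / len(real_ds)
    weights = [w_synth] * len(synthetic_ds) + [w_real] * len(real_ds)
    sampler = WeightedRandomSampler(weights, num_samples=num_samples,
                                    replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```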

Takeaway: For practitioners in VR/AR and Robotics, this paper signals that the focus should shift from tweaking Transformer blocks to building sophisticated, physics-aware simulation pipelines. AnyHand sets a new bar for what a "foundation" dataset for hands should look like.

Limitations

  • The background depth for synthetic scenes is estimated via MoGe-2 rather than true GT, leading to slight inaccuracies at boundaries.
  • The optimal mixing ratio between sim and real data remains an empirical "black art" that varies by task.

For more details, visit the project page: https://chen-si-cs.github.io/projects/AnyHand/
