[CVPR 2025] HGGT: Breaking the Calibration Barrier in 3D Hand Mesh Reconstruction
Abstract

HGGT (Hand Geometry Grounding Transformer) is a novel feed-forward framework for 3D hand mesh reconstruction from uncalibrated multi-view images. It achieves state-of-the-art performance across multiple benchmarks (e.g., HO3D, DexYCB) by jointly estimating camera poses and hand geometry without requiring prior calibration.

TL;DR

Hand Geometry Grounding Transformer (HGGT) is the first feed-forward framework that tackles 3D hand mesh reconstruction from uncalibrated multi-view images. By reformulating the problem as a visual-geometry grounding task, it eliminates the need for fixed camera rigs. It achieves SOTA results, often outperforming methods that rely on ground-truth camera parameters, while maintaining robustness against motion blur and occlusion.

Background Positioning: This work sits at the intersection of 3D Foundation Models (like DUSt3R/VGGT) and Articulated Mesh Recovery. It transitions hand reconstruction from a "constrained laboratory task" to a "flexible, in-the-wild utility."


The "Calibration Dilemma" in Hand Reconstruction

Why is 3D hand reconstruction so hard?

  1. Single-View (Standard): Easy to deploy but geometrically limited. One camera cannot see the parts of the hand hidden behind the palm, and it cannot resolve the absolute depth of the wrist.
  2. Multi-View (Classical): Resolves ambiguity but requires camera calibration: you need to know exactly where every camera is in 3D space. This makes it impractical to capture a gesture with, say, two arbitrary smartphones.
  3. Foundation Models (Scene-level): Models like VGGT look at the entire scene. Because hands are small and often move differently than the background, these models tend to ignore the hand's fine geometry in favor of the static room environment.
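
To make the calibration requirement in item 2 concrete, here is a minimal sketch of classical two-view triangulation. It is not part of HGGT; the function names, camera positions, and joint coordinates are illustrative assumptions (units taken to be metres). It intersects two viewing rays with the midpoint method and shows that the same rays, paired with a slightly wrong camera position, recover the joint in the wrong place.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    w = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth cannot be recovered")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]

# Toy setup (assumed values): one hand joint seen by two cameras 30 cm apart.
joint = [0.10, 0.05, 0.50]
cam1, cam2 = [0.0, 0.0, 0.0], [0.30, 0.0, 0.0]
ray1, ray2 = sub(joint, cam1), sub(joint, cam2)

# With the true camera centers, the rays meet at the joint.
good = triangulate_midpoint(cam1, ray1, cam2, ray2)

# With camera 2 misplaced by 5 cm, the same rays meet somewhere else.
wrong_cam2 = [0.35, 0.0, 0.0]
bad = triangulate_midpoint(cam1, ray1, wrong_cam2, ray2)

error_good = math.dist(good, joint)
error_bad = math.dist(bad, joint)
print(f"error with true calibration:  {error_good:.6f} m")
print(f"error with wrong calibration: {error_bad:.6f} m")
```

The point of the sketch is only that the reconstruction inherits any error in the assumed camera poses, which is why classical pipelines depend on a careful calibration step.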

Methodology: The Unified Grounding Architecture

The core innovation of HGGT is its Unified Cross-Attention Refinement Module. Instead of treating camera estimation and hand reconstruction as separate sequential steps, HGGT solves them jointly.

1. Unified Tokens

The model uses two sets of tokens:

  • Camera Tokens: To learn the relative extrinsics and intrinsics.
  • Hand Tokens: To represent the global 3D geometry of the MANO hand model.

2. Multi-View Feature Aggregation

Based on a VGGT backbone, the model uses "Alternating Attention" to extract features across patches and frames. The refined tokens then query these features via cross-attention to "ground" the geometry in actual pixel data.

Fig. 2: The HGGT pipeline. Note how hand and camera tokens are refined iteratively through multi-view context.
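
The two attention patterns described in this section can be sketched in a few lines of plain Python. This is a toy illustration under strong assumptions, not the authors' implementation: there are no learned projections, heads, or normalization layers, every view is assumed to have the same number of patches, and all names (`attend`, `frame_attention`, `global_attention`, `cross_attend_tokens`) are hypothetical. Frame attention mixes patches within one view, global attention mixes patches across all views, and the camera and hand tokens then act as queries over the resulting features.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def frame_attention(feats):
    # feats[view][patch] is a feature vector; patches attend within their own view.
    return [attend(view, view, view) for view in feats]

def global_attention(feats):
    # All patches of all views attend to each other; the view layout is restored after.
    flat = [p for view in feats for p in view]
    mixed = attend(flat, flat, flat)
    n = len(feats[0])
    return [mixed[i * n:(i + 1) * n] for i in range(len(feats))]

def cross_attend_tokens(tokens, feats):
    # Tokens are the queries; every patch feature serves as key and value.
    flat = [p for view in feats for p in view]
    return attend(tokens, flat, flat)

# Toy input (assumed values): 2 views, 3 patches each, 4-dimensional features.
feats = [[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
         [[0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]]

# Alternate frame-wise and global attention for two rounds.
for _ in range(2):
    feats = global_attention(frame_attention(feats))

# One "camera" token and one "hand" token query the multi-view features.
tokens = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
refined = cross_attend_tokens(tokens, feats)
print(refined)
```

Because each output is a softmax-weighted average of patch features, the refined tokens stay within the range of the image features they attend to, which is one way to read the "grounding" idea.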


The Power of Mixed Data

To train a model this flexible, the authors created a massive hybrid dataset:

  • In-the-wild Monocular: For lighting and background variety.
  • Real Multi-view: For high-quality 3D hand annotations.
  • Synthetic Multi-view (New!): Using GraspXL and Objaverse, they rendered 85k sequences with randomized camera views. This teaches the model geometric invariance: the hand is the same hand regardless of where the camera sits.
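
As a rough illustration of what "randomized camera views" and "geometric invariance" mean, the sketch below samples cameras on a sphere around a toy hand and checks that a bone length is unchanged in every camera's coordinate frame. The sampling scheme, the 0.6 m radius, and the function names are assumptions made for this example; the summary does not describe how the authors actually place their synthetic cameras.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def random_look_at_camera(rng, radius=0.6):
    """Sample a camera on a sphere (poles excluded) that looks at the origin."""
    z = rng.uniform(-0.9, 0.9)            # stay away from the poles so 'up' is usable
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    center = [radius * r * math.cos(phi), radius * r * math.sin(phi), radius * z]
    forward = normalize([-c for c in center])          # camera z-axis
    right = normalize(cross(forward, [0.0, 0.0, 1.0])) # camera x-axis
    down = cross(forward, right)                       # camera y-axis
    return [right, down, forward], center              # rows of world-to-camera rotation

def world_to_camera(R, center, p):
    d = [p[i] - center[i] for i in range(3)]
    return [sum(R[row][i] * d[i] for i in range(3)) for row in range(3)]

rng = random.Random(0)
hand = [[0.0, 0.0, 0.0], [0.03, 0.08, 0.01]]  # wrist and fingertip, toy values in metres
true_len = math.dist(hand[0], hand[1])

lengths = []
for _ in range(5):
    R, center = random_look_at_camera(rng)
    cam_pts = [world_to_camera(R, center, p) for p in hand]
    lengths.append(math.dist(cam_pts[0], cam_pts[1]))

print(true_len, lengths)
```

Every sampled view reports the same wrist-to-fingertip distance, even though the camera-frame coordinates of the two points differ from view to view.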

Experimental Excellence

HGGT significantly outperforms existing baselines when the cameras are not pre-calibrated.

SOTA Comparison

As Table 1 shows, when POEM (the previous SOTA) is forced to use predicted cameras, its performance drops significantly. HGGT, however, maintains high precision.

Table 1: Quantitative results across DexYCB, OakInk, and HO3D datasets.

Robustness under Pressure

One of the most impressive feats of HGGT is its stability. Even when 6 out of 8 views are occluded by an object, or when the hand is blurred by fast motion, the learned priors allow the model to infer a plausible 3D mesh.

Fig. 7: Performance under motion blur and heavy occlusion.


Critical Insight & Future Work

Takeaway: The shift toward "Uncalibrated Multi-view" is the next frontier for AR/VR. HGGT suggests that with enough synthetic data and a "grounded" transformer architecture, the "calibration checkerboard" is no longer a prerequisite.

Limitations:

  • Detection Dependency: It still needs an external 2D detector (like ViTPose) to crop the hand first.
  • Metric Scale: While the relative 3D structure is accurate, the "absolute" size/distance (metric depth) remains a challenge without at least one reference measurement.
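
The metric-scale limitation follows from basic projective geometry and is not specific to HGGT. The sketch below uses a simplified pinhole camera (identity rotation, an assumed focal length of 500 pixels, and made-up joint and camera positions) to show that scaling the hand and the camera positions by the same factor leaves every image unchanged, so images alone cannot fix the absolute size.

```python
def project(point, cam_center, focal=500.0):
    """Pinhole projection for a camera with identity rotation looking down +z."""
    x, y, z = (p - c for p, c in zip(point, cam_center))
    return (focal * x / z, focal * y / z)

def scale(v, s):
    return [s * x for x in v]

# Toy scene (assumed values, metres): two hand joints and two cameras.
joints = [[0.10, 0.05, 0.50], [0.12, -0.02, 0.55]]
cams = [[0.0, 0.0, 0.0], [0.30, 0.0, 0.0]]

s = 2.5  # make the hand 2.5x larger and move the cameras 2.5x further apart
original = [project(j, c) for c in cams for j in joints]
scaled = [project(scale(j, s), scale(c, s)) for c in cams for j in joints]

max_diff = max(abs(a - b)
               for p, q in zip(original, scaled)
               for a, b in zip(p, q))
print(f"largest pixel difference: {max_diff:.2e}")
```

A single known length, such as a measured bone or camera baseline, is enough to pick one scale out of this family of equivalent reconstructions.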

Future Outlook: Integrating the hand reconstruction with object reconstruction in a single feed-forward pass will be the logical next step for autonomous robotic grasping.


Note: For more details, code, and the synthetic dataset, visit the Project Page.
