ScaleHP: Estimating Hand Pose in Metric Space
*Equal contribution †Corresponding author
Abstract
Accurate metric-space hand pose estimation (HPE) is essential for immersive human-computer interaction and robotics. However, most existing methods predict poses in a root-relative coordinate system and cannot recover the hand at absolute metric scale. We observe that intrinsic proportional relationships among human hand bones encode stable anthropometric priors that correlate with overall hand size. Leveraging this insight, we present ScaleHP, an end-to-end one-stage framework that bypasses fragile extrinsic depth modules to recover the hand in metric space.
ScaleHP employs a transformer-based decoder with a novel scale token to fuse multi-scale morphological and appearance features. By solving for metric coordinates through a perspective-constrained least-squares approach, we achieve high-precision pose estimation in the camera coordinate system. ScaleHP delivers state-of-the-art performance, including 35.8 mm CS-MPJPE on FreiHand and 4.6 / 5.9 mm P-MPJPE on DexYCB and HO3Dv3, demonstrating that internal biological constraints significantly reduce both relative geometry and absolute metric errors.
Motivation
Precise hand tracking underpins VR/AR manipulation, embodied AI, and robotic teleoperation. Beyond relative finger articulation, these applications require knowing where the hand sits in 3D space and how large it is in real-world units. The core challenge is depth-scale ambiguity in monocular images: existing pipelines often rely on auxiliary depth modules or pre-computed depth maps, which become unreliable when only a hand is visible or when scenes differ from the training distribution.
ScaleHP instead exploits a stable biological prior: skeletal bone proportions are intrinsically correlated with overall hand dimensions. Because this scale cue comes from the hand itself, it remains robust to background changes and avoids the fragility of external depth estimation.
Contributions
Metric-Aware HPE
We explicitly model global hand scale from intrinsic anatomical bone proportions, mitigating monocular depth ambiguity and enabling metrically consistent predictions in camera space.
One-Stage ScaleHP
We propose the first one-stage end-to-end framework for metric-space hand pose estimation, centered on a scale token that interacts with 2D and 3D joint queries inside the metric decoder.
SOTA Performance
ScaleHP achieves 35.8 mm CS-MPJPE on FreiHand and 4.6 / 5.9 mm P-MPJPE on DexYCB and HO3Dv3, with ablations showing the scale token also reduces depth-axis geometry error.
Method
1 Frozen Detector
Grounding DINO provides robust hand detection and multi-scale image features while remaining frozen during training.
2 Metric 2D–3D Decoder
Twenty-one joint queries and a dedicated scale token interact via deformable attention to predict 2D keypoints, a canonical 3D pose, and a global scale scalar.
3 Scale Token
The scale token fuses morphological and appearance cues to estimate mean bone length, grounding absolute hand size in anthropometric priors.
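The relationship between a canonical pose and the mean bone length can be sketched as follows. This is an illustration only, not the authors' code: the 21-joint parent table (wrist first, then four joints per finger) and the convention that the canonical pose has unit mean bone length are assumptions, and the function names are hypothetical.

```python
import numpy as np

# Assumed joint layout: 0 = wrist, then 4 joints per finger from base to tip.
# The page does not specify the ordering; this parent table is hypothetical.
PARENTS = np.array([-1, 0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11,
                    0, 13, 14, 15, 0, 17, 18, 19])

def bone_lengths(joints):
    """Return the 20 bone lengths of a (21, 3) joint array."""
    child = np.arange(1, 21)
    return np.linalg.norm(joints[child] - joints[PARENTS[child]], axis=1)

def mean_bone_length(joints):
    """Average bone length, used here as the global hand scale."""
    return float(bone_lengths(joints).mean())

def to_canonical(joints):
    """Root-center a pose and normalize it to unit mean bone length (assumed convention)."""
    centered = joints - joints[0]
    return centered / mean_bone_length(centered)

def to_metric(canonical, scale):
    """Rescale a canonical pose by a predicted mean bone length (e.g. in mm)."""
    return canonical * scale
```

Under this convention, bone proportions are unchanged by normalization, so the canonical pose carries the shape cue while the single scalar carries the absolute size.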
4 Analytic Translation Solver
A training-free least-squares solve under perspective projection constraints reconstructs the full hand pose in camera metric space without an auxiliary depth network.
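A minimal sketch of such a solve, assuming a pinhole camera with known intrinsics `fx, fy, cx, cy` and a root-relative pose that is already in metric units. The exact formulation used by ScaleHP is not given on this page, so the linearization below and the names `project` and `solve_translation` are illustrative assumptions. Each joint contributes two linear equations in the unknown translation `t = (tx, ty, tz)`, obtained by multiplying the projection equation through by the depth.

```python
import numpy as np

def project(points, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels."""
    u = fx * points[:, 0] / points[:, 2] + cx
    v = fy * points[:, 1] / points[:, 2] + cy
    return np.stack([u, v], axis=1)

def solve_translation(rel_joints, keypoints_2d, fx, fy, cx, cy):
    """Least-squares translation t such that rel_joints + t projects onto keypoints_2d.

    rel_joints:   (N, 3) root-relative joints in metric units
    keypoints_2d: (N, 2) pixel coordinates
    """
    X, Y, Z = rel_joints.T
    du = keypoints_2d[:, 0] - cx
    dv = keypoints_2d[:, 1] - cy
    n = len(rel_joints)
    A = np.zeros((2 * n, 3))
    b = np.zeros(2 * n)
    # fx * (X + tx) = du * (Z + tz)  ->  fx * tx - du * tz = du * Z - fx * X
    A[0::2, 0] = fx
    A[0::2, 2] = -du
    b[0::2] = du * Z - fx * X
    # fy * (Y + ty) = dv * (Z + tz)  ->  fy * ty - dv * tz = dv * Z - fy * Y
    A[1::2, 1] = fy
    A[1::2, 2] = -dv
    b[1::2] = dv * Z - fy * Y
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t
```

In this model, scaling the relative pose by a factor k while keeping the 2D keypoints fixed scales the recovered translation by the same k, which is why an accurate global scale estimate matters for absolute depth.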
Results
CS-MPJPE directly measures 3D joint error in camera space without root or Procrustes alignment, making it the primary metric for absolute metric localization. All values are in millimeters.
| Method | FreiHand CS-MPJPE ↓ | DexYCB CS-MPJPE ↓ | HO3Dv3 CS-MPJPE ↓ |
|---|---|---|---|
| CMR + GS | 48.8 | 183.2 | 152.3 |
| HandDGP + GS | 46.2 | 222.1 | 132.6 |
| NFV | 42.4 | — | — |
| ScaleHP (Ours) | 35.8 | 136.3 | 50.7 |
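
For reference, the camera-space metric can be written in a few lines. This is a sketch of the standard definition, with a root-aligned variant included only for contrast; the function names and the choice of joint 0 as the root are assumptions, not taken from the ScaleHP code.

```python
import numpy as np

def cs_mpjpe(pred, gt):
    """Camera-space MPJPE: mean Euclidean joint error with no alignment."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def r_mpjpe(pred, gt, root=0):
    """Root-relative MPJPE: subtract each pose's root joint before comparing."""
    return cs_mpjpe(pred - pred[root], gt - gt[root])
```

A prediction with perfect articulation that is displaced 20 mm along the depth axis scores about 20 mm under `cs_mpjpe` and about 0 mm under `r_mpjpe`, which is why only the unaligned metric exposes absolute localization error.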
State-of-the-Art Comparison (Aligned Metrics)
Root-relative and Procrustes-aligned metrics focus on articulated pose quality rather than absolute metric localization. Under these conventional protocols, ScaleHP matches or improves on the best prior result in every column while also leading on camera-space accuracy.
| Method | FreiHand CS-MPJPE ↓ | FreiHand P-MPJPE ↓ | DexYCB R-MPJPE ↓ | DexYCB P-MPJPE ↓ | HO3Dv3 P-MPJPE ↓ |
|---|---|---|---|---|---|
| ObMan | 85.2 | 13.3 | — | — | — |
| MANO CNN | 71.3 | 11.0 | — | — | — |
| CMR-PG | 48.8 | 6.9 | — | — | — |
| I2L-MeshNet | 60.3 | 7.4 | — | — | — |
| HandDGP | 46.3 | 7.4 | — | — | — |
| MobRecon | 50.2 | 5.7 | 14.2 | 6.4 | — |
| METRO | — | 6.7 | 15.2 | 7.0 | — |
| HandOccNet | — | — | 14.0 | 5.8 | — |
| H2ONet | — | — | 14.0 | 5.7 | — |
| Deformer | — | — | 13.6 | 5.2 | — |
| Zhou et al. | — | — | 12.4 | 5.5 | — |
| TI-Net | — | — | 16.8 | 4.9 | — |
| MaskHand | — | 5.5 | 11.7 | 5.0 | 7.0 |
| S2HAND | — | 11.8 | — | — | 11.5 |
| ArtiBoost | — | — | 12.8 | — | 10.8 |
| HandGCAT | — | — | — | — | 9.1 |
| AMVUR | — | 6.2 | — | — | 8.7 |
| SPMHand | — | — | — | — | 8.6 |
| Hamba | — | 5.7 | — | — | 6.9 |
| HandOS | — | 5.0 | — | 5.2 | 6.8 |
| ScaleHP (Ours) | 35.8 | 5.0 | 10.3 | 4.6 | 5.9 |
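
Procrustes-aligned error removes global scale, rotation, and translation before measuring joint distance. The sketch below uses the standard SVD-based similarity alignment (Kabsch/Umeyama); it is a generic reference implementation with hypothetical function names, and the benchmark toolkits may differ in details.

```python
import numpy as np

def procrustes_align(pred, gt):
    """Similarity-align pred (N, 3) to gt (N, 3) with scale, rotation, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the 3x3 cross-covariance matrix.
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.array([1.0, 1.0, d])
    R = Vt.T @ np.diag(D) @ U.T
    # Optimal isotropic scale for the chosen rotation.
    s = (S * D).sum() / (p ** 2).sum()
    return s * p @ R.T + mu_g

def p_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: mean joint error after similarity alignment."""
    aligned = procrustes_align(pred, gt)
    return float(np.linalg.norm(aligned - gt, axis=-1).mean())
```

Because the alignment absorbs any global scale and placement error, a method can score well here while still being far off in camera space, which is what the CS-MPJPE column is meant to capture.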
Scale Token Ablation
| Method | FreiHand CS-MPJPE ↓ | DexYCB CS-MPJPE ↓ | HO3Dv3 CS-MPJPE ↓ |
|---|---|---|---|
| w/o Scale Token | 44.3 | 40.2 | 42.0 |
| w/ Scale Token | 35.8 | 30.0 | 33.4 |
CS-MPJPE (mm) on each benchmark. The scale token is critical for absolute metric recovery, lowering error by 8.5, 10.2, and 8.6 mm on FreiHand, DexYCB, and HO3Dv3, respectively.
Depth-Axis Regularization
| Method | P-MPJPE ↓ | X ↓ | Y ↓ | Z ↓ |
|---|---|---|---|---|
| w/o Scale Token | 5.6 | 2.3 | 2.4 | 3.5 |
| w/ Scale Token | 5.0 | 2.2 | 2.2 | 3.0 |
FreiHand aligned errors (mm). The scale token reduces depth-axis (Z) error the most: 0.5 mm, compared with 0.1 mm on X and 0.2 mm on Y.
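
A per-axis breakdown like the one in this table can be computed as the mean absolute error along each coordinate after alignment. The page does not state the exact definition used, so this is one plausible reading, and `per_axis_error` is a hypothetical helper.

```python
import numpy as np

def per_axis_error(aligned_pred, gt):
    """Mean absolute error along X, Y, Z for (..., 3) arrays of aligned joints."""
    err = np.abs(aligned_pred - gt).reshape(-1, 3)
    return err.mean(axis=0)
```

The input is expected to be already aligned to the ground truth (for example by Procrustes alignment), so the three values isolate where the remaining geometry error lies.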
Demo Videos
Real-world demonstrations of ScaleHP estimating hand pose in true metric space from monocular RGB input.
Real-time metric hand pose estimation from a single camera view.
Pick-and-place interaction demo with stable metric-scale hand tracking.
Qualitative Results
FreiHand: Comparison with Camera-Space Methods
Multi-method comparisons on the FreiHand evaluation set. Each plot overlays GT, ScaleHP, HandDGP, and CMR; top views reveal depth and scale errors in camera-space baselines.
In-the-Wild Results
Unconstrained real-world examples showing input images alongside metric-space reconstructions and multi-view renderings.
- Metric-space reconstruction with accurate image-plane projection.
- Robust pose under challenging hand-object interaction.
- Consistent 3D geometry across varying viewpoints.
- Reliable metric localization under severe occlusion.
Comparison with HaMeR
Metric-space 3D reconstructions compared against HaMeR on challenging in-the-wild samples. Top views highlight ScaleHP’s clearer depth ordering.
- ScaleHP preserves clearer metric depth ordering between interacting hands.
- More consistent absolute hand placement in camera space.
Downstream Application: Hand-to-Robot Retargeting
Why Retargeting Requires Metric Hand Pose
Hand-to-robot retargeting maps human hand motion to robot end-effector trajectories for teleoperation and dexterous manipulation. This pipeline needs more than relative finger articulation: the robot must know where the hand sits in the shared camera or workspace frame and how large it is in real physical units.
- Absolute 3D placement. Root-relative HPE only describes joint positions relative to a local origin (the root joint); it cannot specify the hand’s position relative to objects or the robot base.
- Metric scale consistency. Without absolute hand size, human-to-robot mapping suffers scale drift, grasp offset, and depth-axis misalignment during motion transfer.
- Robustness without external depth. Auxiliary depth modules or manual scale calibration break down when only the hand is visible or when deployment scenes differ from training data.
How ScaleHP Addresses These Requirements
Reliable Retargeting
Directly outputs camera-space poses at absolute metric scale, reducing depth ambiguity that causes grasp misalignment and unstable teleoperation.
Absolute Metric Scale
Recovers true hand dimensions in millimeters, preventing scale drift when transferring human motion to robot execution.
Camera-Space Coordinates
Predictions align naturally with the robot workspace, simplifying projection and inverse kinematics pipelines.
Because it meets the retargeting requirements above, ScaleHP's metric pose output supports stable hand-to-robot-arm mapping with improved spatial consistency in real manipulation tasks.
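As an illustration of why camera-space metric output simplifies the pipeline, moving predicted joints into a robot base frame reduces to one rigid transform plus a unit conversion. This sketch assumes predictions in millimeters and a known 4x4 camera-to-base transform `T_base_cam` (for example from hand-eye calibration); neither the transform nor the function name comes from the ScaleHP page.

```python
import numpy as np

def camera_to_base(joints_cam_mm, T_base_cam):
    """Map (N, 3) camera-space joints in mm to the robot base frame in meters.

    T_base_cam: 4x4 homogeneous transform giving the camera pose in the base
    frame, with translation in meters (assumed to be known from calibration).
    """
    joints_m = joints_cam_mm / 1000.0
    R, t = T_base_cam[:3, :3], T_base_cam[:3, 3]
    return joints_m @ R.T + t
```

With a root-relative prediction this step would be underdetermined, since the hand's offset from the camera and its physical size would both have to be supplied from another source.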
Summary
ScaleHP introduces a transformer decoder with a dedicated scale token that fuses local joint morphology and global appearance through multi-scale deformable attention. Combined with a training-free perspective projection solver, it recovers absolute 3D hand geometry through a single least-squares solve. The framework achieves state-of-the-art metric-space accuracy on FreiHand, DexYCB, and HO3Dv3 while generalizing robustly to unconstrained in-the-wild images.