ScaleHP: Estimating Hand Pose in Metric Space

Jing, Ruitao; Chen, Xingyu; Li, Hongyang; Jiang, Qing; Shi, Yukai; Zhang, Lei

Ruitao Jing^1,3,4,*, Xingyu Chen^2,*, Hongyang Li^4,5, Qing Jiang^4,5, Yukai Shi^1,4, Lei Zhang^3,4,5,†

¹ Tsinghua University ² Zhongguancun Academy ³ Visincept ⁴ International Digital Economy Academy (IDEA Research) ⁵ South China University of Technology
^*Equal contribution ^†Corresponding author

arXiv preprint

ScaleHP estimates hand pose in true metric space from a single image

ScaleHP is the first one-stage framework to explicitly predict global hand scale, enabling accurate recovery of hand pose in true metric space from a single RGB image.

Abstract

Accurate metric-space hand pose estimation (HPE) is essential for immersive human-computer interaction and robotics. However, most existing methods predict poses in a root-relative coordinate system and cannot recover the hand at absolute metric scale. We observe that intrinsic proportional relationships among human hand bones encode stable anthropometric priors that correlate with overall hand size. Leveraging this insight, we present ScaleHP, an end-to-end one-stage framework that bypasses fragile extrinsic depth modules to recover the hand in metric space.

ScaleHP employs a transformer-based decoder with a novel scale token to fuse multi-scale morphological and appearance features. By solving for metric coordinates through a perspective-constrained least-squares approach, we achieve high-precision pose estimation in the camera coordinate system. ScaleHP delivers state-of-the-art performance, including 35.8 mm CS-MPJPE on FreiHand and 4.6 mm / 5.9 mm P-MPJPE on DexYCB and HO3Dv3, respectively, demonstrating that internal biological constraints significantly reduce both relative geometry and absolute metric errors.

Motivation

Precise hand tracking underpins VR/AR manipulation, embodied AI, and robotic teleoperation. Beyond relative finger articulation, these applications require knowing where the hand sits in 3D space and how large it is in real-world units. The core challenge is depth-scale ambiguity in monocular images: existing pipelines often rely on auxiliary depth modules or pre-computed depth maps, which become unreliable when only a hand is visible or when scenes differ from the training distribution.

ScaleHP instead exploits a stable biological prior: skeletal bone proportions are intrinsically correlated with overall hand dimensions. Because this scale cue comes from the hand itself, it remains robust to background changes and avoids the fragility of external depth estimation.

Contributions

Metric-Aware HPE

We explicitly model global hand scale from intrinsic anatomical bone proportions, mitigating monocular depth ambiguity and enabling metrically consistent predictions in camera space.

One-Stage ScaleHP

We propose the first one-stage end-to-end framework for metric-space hand pose estimation, centered on a scale token that interacts with 2D and 3D joint queries inside the metric decoder.

SOTA Performance

ScaleHP achieves 35.8 mm CS-MPJPE on FreiHand and 4.6 mm / 5.9 mm P-MPJPE on DexYCB and HO3Dv3, respectively, with ablations showing the scale token also reduces depth-axis geometry error.

Method

Figure: Overview of ScaleHP. A frozen detector extracts hand features, a Metric 2D–3D Decoder with a scale token predicts 2D/3D joints and global scale, and an analytic module solves for translation to recover metric-space pose.

1 Frozen Detector

Grounding DINO provides robust hand detection and multi-scale image features while remaining frozen during training.

2 Metric 2D–3D Decoder

21 joint queries and a dedicated scale token interact via deformable attention to predict 2D keypoints, canonical 3D pose, and a global scale scalar.

3 Scale Token

The scale token fuses morphological and appearance cues to estimate mean bone length, grounding absolute hand size in anthropometric priors.
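To make the role of mean bone length concrete, here is a minimal sketch of how a predicted scale could turn a canonical pose into a metric one. The joint ordering in `PARENTS`, the function names, and the rule of rescaling by the ratio of mean bone lengths are illustrative assumptions; the page does not specify how the canonical pose and the scale scalar are combined.

```python
import numpy as np

# Assumed 21-joint layout (not taken from the paper): 0 = wrist, then four
# joints per finger in the order thumb, index, middle, ring, pinky.
PARENTS = np.array([-1, 0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11,
                    0, 13, 14, 15, 0, 17, 18, 19])


def mean_bone_length(joints):
    """Mean length of the 20 parent-child bones of a (21, 3) joint array."""
    child = np.arange(1, 21)
    bones = joints[child] - joints[PARENTS[child]]
    return float(np.linalg.norm(bones, axis=1).mean())


def to_metric(canonical_joints, predicted_scale_mm):
    """Rescale a canonical pose so its mean bone length equals the predicted scale.

    Assumption: the predicted scale is the target mean bone length in mm.
    """
    factor = predicted_scale_mm / mean_bone_length(canonical_joints)
    return canonical_joints * factor
```

Because the rescaling is uniform, bone proportions and joint angles are unchanged; only the overall hand size moves into metric units.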

4 Analytic Translation Solver

A training-free least-squares solve under perspective projection constraints reconstructs the full hand pose in camera metric space without an auxiliary depth network.
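The following is a sketch of one way such a solve can be set up, assuming a pinhole camera with known intrinsics and root-relative joints that are already at metric scale. The function name and the parameters `fx`, `fy`, `cx`, `cy` are hypothetical, and the exact formulation used by ScaleHP may differ. Each joint contributes two linear equations in the unknown translation t = (tx, ty, tz), obtained by multiplying the projection equation through by the depth.

```python
import numpy as np


def solve_translation(joints_3d, keypoints_2d, fx, fy, cx, cy):
    """Least-squares translation t so that joints_3d + t projects onto keypoints_2d.

    joints_3d:    (N, 3) root-relative joints in metric units.
    keypoints_2d: (N, 2) pixel coordinates of the same joints.
    Assumes a pinhole camera: u = fx * X / Z + cx, v = fy * Y / Z + cy.
    """
    X, Y, Z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    du = keypoints_2d[:, 0] - cx
    dv = keypoints_2d[:, 1] - cy
    n = len(joints_3d)

    A = np.zeros((2 * n, 3))
    b = np.zeros(2 * n)
    # Horizontal rows: fx * tx - du * tz = du * Z - fx * X
    A[:n, 0] = fx
    A[:n, 2] = -du
    b[:n] = du * Z - fx * X
    # Vertical rows: fy * ty - dv * tz = dv * Z - fy * Y
    A[n:, 1] = fy
    A[n:, 2] = -dv
    b[n:] = dv * Z - fy * Y

    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t
```

The system is linear because the 3D joints are fixed before the solve, which is why no learned depth network or iterative optimizer is needed in this sketch. Its accuracy depends directly on the hand scale: if the 3D joints are too large or too small, the recovered depth tz shifts proportionally.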

Results

CS-MPJPE directly measures 3D joint error in camera space without root or Procrustes alignment, making it the primary metric for absolute metric localization.
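The three metrics reported on this page differ only in the alignment applied before measuring joint error. The sketch below shows the standard definitions as commonly implemented; the function names and the choice of joint 0 as the root are assumptions, and the benchmarks' official evaluation scripts may differ in detail.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding (N, 3) joints."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())


def cs_mpjpe(pred, gt):
    """Camera-space error: no alignment of any kind."""
    return mpjpe(pred, gt)


def r_mpjpe(pred, gt, root=0):
    """Root-relative error: both poses are translated so the root is at the origin."""
    return mpjpe(pred - pred[root], gt - gt[root])


def p_mpjpe(pred, gt):
    """Procrustes-aligned error: best similarity transform (scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    # Flip the last singular direction if needed so R is a proper rotation.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.array([1.0, 1.0, d])
    R = U @ np.diag(D) @ Vt
    scale = (S * D).sum() / (P ** 2).sum()
    aligned = scale * P @ R + mu_g
    return mpjpe(aligned, gt)
```

Procrustes alignment removes global scale and position, so P-MPJPE cannot reveal whether a method places the hand at the right depth or size. Only CS-MPJPE penalizes those errors.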

35.8 mm CS-MPJPE on FreiHand
4.6 mm P-MPJPE on DexYCB
5.9 mm P-MPJPE on HO3Dv3

Method	FreiHand CS-MPJPE ↓	DexYCB CS-MPJPE ↓	HO3Dv3 CS-MPJPE ↓
CMR + GS	48.8	183.2	152.3
HandDGP + GS	46.2	222.1	132.6
NFV	42.4	—	—
ScaleHP (Ours)	35.8	136.3	50.7

State-of-the-Art Comparison (Aligned Metrics)

Root-relative and Procrustes-aligned metrics focus on articulated pose quality rather than absolute metric localization. ScaleHP remains state-of-the-art under these conventional protocols while excelling at camera-space accuracy.

Method	FreiHand CS-MPJPE ↓	FreiHand P-MPJPE ↓	DexYCB R-MPJPE ↓	DexYCB P-MPJPE ↓	HO3Dv3 P-MPJPE ↓
ObMan	85.2	13.3	—	—	—
MANO CNN	71.3	11.0	—	—	—
CMR-PG	48.8	6.9	—	—	—
I2L-MeshNet	60.3	7.4	—	—	—
HandDGP	46.3	7.4	—	—	—
MobRecon	50.2	5.7	14.2	6.4	—
METRO	—	6.7	15.2	7.0	—
HandOccNet	—	—	14.0	5.8	—
H2ONet	—	—	14.0	5.7	—
Deformer	—	—	13.6	5.2	—
Zhou et al.	—	—	12.4	5.5	—
TI-Net	—	—	16.8	4.9	—
MaskHand	—	5.5	11.7	5.0	7.0
S²HAND	—	11.8	—	—	11.5
ArtiBoost	—	—	12.8	—	10.8
HandGCAT	—	—	—	—	9.1
AMVUR	—	6.2	—	—	8.7
SPMHand	—	—	—	—	8.6
Hamba	—	5.7	—	—	6.9
HandOS	—	5.0	—	5.2	6.8
ScaleHP (Ours)	35.8	5.0	10.3	4.6	5.9

Scale Token Ablation

Method	FreiHand CS-MPJPE ↓	DexYCB CS-MPJPE ↓	HO3Dv3 CS-MPJPE ↓
w/o Scale Token	44.3	40.2	42.0
w/ Scale Token	35.8	30.0	33.4

CS-MPJPE (mm) on each benchmark. Adding the scale token lowers CS-MPJPE by 8.5 mm on FreiHand, 10.2 mm on DexYCB, and 8.6 mm on HO3Dv3, so it is critical for absolute metric recovery.

Depth-Axis Regularization

Method	P-MPJPE ↓	X ↓	Y ↓	Z ↓
w/o Scale Token	5.6	2.3	2.4	3.5
w/ Scale Token	5.0	2.2	2.2	3.0

FreiHand aligned errors (mm). The scale token reduces the depth-axis (Z) error the most, from 3.5 to 3.0, compared with 2.3 to 2.2 on X and 2.4 to 2.2 on Y.
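A per-axis breakdown like the one above can be computed as follows. This is a sketch that assumes the X, Y, and Z columns are mean absolute errors along each camera axis after alignment; the page does not state the exact definition, and the function name is hypothetical.

```python
import numpy as np


def per_axis_error(pred_aligned, gt):
    """Mean absolute error along X, Y, and Z for aligned (N, 3) joint arrays."""
    return np.abs(pred_aligned - gt).mean(axis=0)
```

Under this definition the per-axis values do not sum to the overall P-MPJPE, because the latter averages Euclidean norms rather than individual components.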

Demo Videos

Real-world demonstrations of ScaleHP estimating hand pose in true metric space from monocular RGB input.

Real-time metric hand pose estimation from a single camera view.

Pick-and-place interaction demo with stable metric-scale hand tracking.

Qualitative Results

FreiHand: Comparison with Camera-Space Methods

Multi-method comparisons on the FreiHand evaluation set. Each plot overlays GT, ScaleHP, HandDGP, and CMR; top views reveal depth and scale errors in camera-space baselines.

Figure panels: FreiHand samples 0010, 0009, and 0008, each shown in a Metric Space view and a Top View.

In-the-Wild Results

Unconstrained real-world examples showing input images alongside metric-space reconstructions and multi-view renderings.

Metric-space reconstruction with accurate image-plane projection.

Robust pose under challenging hand-object interaction.

Consistent 3D geometry across varying viewpoints.

Reliable metric localization under severe occlusion.

Comparison with HaMeR

Metric-space 3D reconstructions compared against HaMeR on challenging in-the-wild samples. Top views highlight ScaleHP’s clearer depth ordering.

Figure panels (two examples): Input, Metric Space, Front View, and Top View, shown for ScaleHP and HaMeR.

Example 1: ScaleHP preserves clearer metric depth ordering between interacting hands.

Example 2: More consistent absolute hand placement in camera space.

Downstream Application: Hand-to-Robot Retargeting

Why Retargeting Requires Metric Hand Pose

Hand-to-robot retargeting maps human hand motion to robot end-effector trajectories for teleoperation and dexterous manipulation. This pipeline needs more than relative finger articulation: the robot must know where the hand sits in the shared camera or workspace frame and how large it is in real physical units.

Absolute 3D placement. Root-relative HPE only describes joint positions relative to a local origin (the root joint); it cannot specify the hand’s position relative to objects or the robot base.
Metric scale consistency. Without absolute hand size, human-to-robot mapping suffers scale drift, grasp offset, and depth-axis misalignment during motion transfer.
Robustness without external depth. Auxiliary depth modules or manual scale calibration break down when only the hand is visible or when deployment scenes differ from training data.

How ScaleHP Addresses These Requirements

Reliable Retargeting

Directly outputs camera-space poses at absolute metric scale, reducing depth ambiguity that causes grasp misalignment and unstable teleoperation.

Absolute Metric Scale

Recovers true hand dimensions in millimeters, preventing scale drift when transferring human motion to robot execution.

Camera-Space Coordinates

Predictions align naturally with the robot workspace, simplifying projection and inverse kinematics pipelines.

Given the retargeting requirements above, ScaleHP's metric pose output enables stable hand-to-robot-arm mapping with improved spatial consistency in real manipulation tasks.
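As an illustration of why camera-space metric output simplifies the pipeline, the sketch below maps predicted joints into a robot base frame with a single rigid transform. The function name and the extrinsics `R_base_cam` and `t_base_cam_m` are hypothetical; they would come from a separate camera-to-robot calibration that this page does not describe.

```python
import numpy as np


def camera_to_base(joints_cam_mm, R_base_cam, t_base_cam_m):
    """Map (N, 3) camera-space joints in millimeters to the robot base frame in meters.

    R_base_cam:   (3, 3) rotation from the camera frame to the base frame (assumed known).
    t_base_cam_m: (3,) camera origin expressed in the base frame, in meters.
    """
    joints_cam_m = joints_cam_mm / 1000.0
    return joints_cam_m @ R_base_cam.T + t_base_cam_m
```

With root-relative or scale-ambiguous predictions, this step would additionally need an estimate of hand depth and size before the transform could be applied.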

Summary

ScaleHP introduces a transformer decoder with a dedicated scale token that fuses local joint morphology and global appearance through multi-scale deformable attention. Combined with a training-free perspective projection solver, it recovers absolute 3D hand geometry in a unified least-squares manner. The framework achieves state-of-the-art metric-space accuracy on FreiHand, DexYCB, and HO3Dv3 while generalizing robustly to unconstrained in-the-wild images.