Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

Source: arXiv:2605.28812 · Published 2026-05-27 · By Jiahe Pan, Stelian Coros, Jitendra Malik, Toru Lin

TL;DR

This paper addresses the core challenge of sim-to-real transfer for tactile sensing in contact-rich dexterous manipulation. Traditional tactile representations for sim-to-real RL either simplify tactile data into coarse, low-dimensional features that transfer robustly but lack richness, or use raw tactile signals that contain detail but are difficult to align between sim and real hardware. To bridge this gap, the authors propose a physics-grounded Center-of-Pressure (CoP) tactile representation that compresses taxel-level tactile forces into a 3D contact force vector and contact location. This representation preserves key dense contact information yet remains compact and physically interpretable, enabling direct alignment between simulation and hardware. A differentiable sensor calibration method estimates unknown taxel orientations by matching estimated contact wrenches with torques during static equilibrium, requiring no ground-truth force data. The CoP-conditioned policies are evaluated on two challenging blind manipulation tasks—a multi-shape peg-in-hole insertion and ball balancing on a plate—and achieve zero-shot sim-to-real transfer on a 16-DoF Allegro hand with distributed fingertip tactile sensors. Compared to binary contact, force-magnitude, raw taxels, and other ablations, CoP yields higher success rates and more robust control across out-of-distribution initializations. Further latent analyses show the policies implicitly encode physical object properties like mass, indicating meaningful physical representations are learned through CoP. Overall, the paper contributes an effective tactile representation with a principled calibration method that significantly improves sim-to-real dexterous manipulation without requiring complex tactile simulation or task-specific real-world data collection.

Key findings

CoP-conditioned policies achieve 78% overall success on peg-in-hole insertion vs 53% for binary contact and 48% for raw taxel observations (Table 1).
On the peg-in-hole task, CoP maintains 62% success under out-of-distribution peg poses, compared to 20% for binary contact (Table 1).
For ball balancing, CoP policies sustain an average time-to-fall of 4.60 s vs 1.99 s for binary contact and 1.49 s for the raw taxel baseline (Table 2).
CoP representation enables zero-shot sim-to-real transfer on a 16-DoF Allegro hand with physical XELA uSkin sensors covering fingertips and palm.
Sensor calibration estimates taxel orientations via gradient-based optimization matching estimated wrenches to joint torques without requiring ground-truth force measurements.
Latent policy states linearly decode ball positions with RMSE ≈ 1.3 cm and cluster by ball mass from 50 g to 250 g without explicit supervision (Fig. 7, Table 3).
CoP preserves both contact force and contact position information; ablations that use only force or only position underperform full CoP.
Raw taxel baselines perform worse than CoP or vector force due to high dimensionality and tactile simulation mismatch, showing CoP's robustness.

Threat model

The threat model assumes an adversary limited to presenting arbitrarily challenging contact scenarios and perturbations in object positions and orientations, including out-of-distribution initializations. The adversary cannot modify the calibrated tactile sensor parameters or the physical robot dynamics, and cannot manipulate sensor readings to spoof tactile inputs. Policies must therefore infer contact states robustly from noisy tactile measurements, with no adversarial sensor attacks to withstand and no ground-truth force information to rely on.

Methodology — deep read

The paper tackles the sim-to-real tactile representation challenge for contact-rich dexterous manipulation. Sensor calibration and physical robot dynamics are taken as fixed, while contact scenarios may be out-of-distribution and tactile signals imperfect. The core methodology has three parts: a novel tactile representation, a sensor calibration approach, and direct sim-to-real policy training and evaluation.

Data is collected from a multi-fingered 16-DoF Allegro robotic hand equipped with XELA uSkin tactile sensors, which have taxels arranged in grid arrays on fingertips and palm providing multi-axis force readings. Taxel origins are known from design specs, but taxel rotations relative to the sensor frame are unknown and must be calibrated from data. Calibration data is collected by applying random external fingertip contacts while maintaining the hand in static equilibrium via a PD controller; joint torques and raw taxel forces are recorded per timestep, spanning diverse contact positions and directions to excite various normal and shear forces.

The key tactile representation, Center-of-Pressure (CoP), models the tactile contact as a single 3D force vector and a 3D contact position within the sensor frame. It is derived from the raw taxel forces under a parametric stress distribution model that accounts for distance-dependent spreading of force through the compliant silicone layer and decomposes forces into normal and shear components using Gaussian radial weighting of taxel positions and normals. Because this model relates the CoP force vector to the taxel forces through a linear system, CoP is obtained by solving a regularized least-squares problem. The model is fully differentiable, permitting gradient-based optimization.
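As a rough illustration of the center-of-pressure idea, the sketch below collapses per-taxel normal forces into one total force and one force-weighted contact location. It is a simplified centroid computation, not the paper's stress distribution model or its regularized least-squares solve. The function name, the 2x2 taxel grid, and the millimetre units are assumptions made for illustration.

```python
def center_of_pressure(taxel_positions, normal_forces, eps=1e-9):
    """Return (total_force, cop_position) for one sensor pad.

    taxel_positions: list of (x, y, z) taxel origins in the sensor frame.
    normal_forces:   list of non-negative normal force readings, one per taxel.
    Returns (0.0, None) when the summed force is below eps (no contact).
    """
    total = sum(normal_forces)
    if total < eps:
        return 0.0, None
    # Force-weighted average of taxel positions, one coordinate at a time.
    cop = tuple(
        sum(f * p[k] for f, p in zip(normal_forces, taxel_positions)) / total
        for k in range(3)
    )
    return total, cop


# Hypothetical 2x2 taxel grid with 4 mm pitch (not the XELA uSkin layout).
positions = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (4.0, 4.0, 0.0)]
forces = [1.0, 3.0, 0.0, 0.0]

total, cop = center_of_pressure(positions, forces)
print(total, cop)
```

Whatever the taxel count, the output stays at one scalar force and one 3D location per pad, which is the compactness the representation relies on.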

The sensor calibration procedure learns taxel rotation matrices (parameterized as arbitrary 3x3 matrices projected to SO(3) via SVD) by minimizing the mean squared error between joint torques computed from estimated CoP forces and the recorded joint torques during static equilibrium. Forward kinematics and Jacobians translate CoP forces to hand joint torques. This gradient-based calibration requires no ground-truth force labels and uses PyTorch automatic differentiation.
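One way to write the calibration objective described above is sketched here. The symbols are assumptions introduced for illustration and are not taken from the paper: $M_i$ is the unconstrained 3x3 matrix for taxel $i$, $R_i$ its projection onto SO(3), $\hat{w}_t(R)$ the contact wrench estimated from taxel readings at timestep $t$, $J_t$ the contact Jacobian from forward kinematics, and $\tau_t$ the recorded joint torques. The projection shown is the standard nearest-rotation form; the sign convention relating contact wrench and joint torque depends on how the torques are recorded.

```latex
\begin{aligned}
M_i &= U_i \Sigma_i V_i^{\top}, \qquad
R_i = U_i \,\mathrm{diag}\!\left(1,\; 1,\; \det\!\left(U_i V_i^{\top}\right)\right) V_i^{\top}, \\
\hat{\tau}_t(R) &= J_t^{\top}\, \hat{w}_t(R), \\
\mathcal{L}(R) &= \frac{1}{T} \sum_{t=1}^{T} \left\lVert \hat{\tau}_t(R) - \tau_t \right\rVert_2^{2}.
\end{aligned}
```

Since every step from $M_i$ to $\mathcal{L}$ is differentiable, gradients can flow back to the unconstrained matrices, and no externally measured contact force appears anywhere in the loss.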

For sim-to-real alignment, standard rigid-body simulators only provide normal contact forces reliably; thus, shear forces are omitted from the CoP force vector to improve robustness. Domain randomization and actuator dynamics calibration via Bayesian optimization reduce sim-to-real gaps.
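Dropping shear can be pictured as keeping only the part of the contact force that lies along the surface normal. The sketch below shows that decomposition. It is a minimal example under the assumption that a surface normal is available for the contact, and the function name and sample values are hypothetical rather than the paper's implementation.

```python
import math


def split_normal_and_shear(force, normal):
    """Split a 3D force into its surface-normal part and its shear remainder.

    force:  [fx, fy, fz] contact force.
    normal: surface normal at the contact (normalized internally).
    """
    length = math.sqrt(sum(n * n for n in normal))
    unit = [n / length for n in normal]
    # Signed magnitude of the force along the normal direction.
    magnitude = sum(f * u for f, u in zip(force, unit))
    normal_part = [magnitude * u for u in unit]
    shear_part = [f - p for f, p in zip(force, normal_part)]
    return normal_part, shear_part


# Hypothetical reading: mostly normal load with a small tangential component.
force = [0.3, -0.4, 2.0]
normal = [0.0, 0.0, 1.0]

normal_part, shear_part = split_normal_and_shear(force, normal)
print(normal_part, shear_part)
```

Only `normal_part` would be passed on to the policy in this sketch; `shear_part` is computed just to show what is being discarded.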

Policies are trained with asymmetric PPO using observations including proprioception and the chosen contact representation. A recurrent network provides temporal context without stacking observations. The critic accesses privileged states for stable training. Zero-shot transfer policies are evaluated on two blind manipulation tasks without vision: (1) peg-in-hole insertion with multiple peg shapes under randomized initial poses, (2) ball balancing on a plate atop the fingertips with 4 different ball types varying in mass and surface properties. Success rates, completion times, and time-to-fall metrics are gathered over 10 real-world trials per configuration.

Baselines include proprioception-only, binary contact, force magnitude, vector force only, position only, raw taxel forces, and human expert. Ablations isolate contributions of CoP components. The experiments include robustness tests to out-of-distribution initializations and random masking of taxel forces.
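To make the differences between the contact representations concrete, the sketch below builds the per-fingertip observation for several of them from the same force and position. The `kind` labels, the contact threshold, and the feature ordering are assumptions for illustration. The proprioception-only and raw taxel baselines are left out because they do not derive from a single force and position pair.

```python
import math


def contact_features(force, position, kind, threshold=0.05):
    """Build one fingertip's contact observation for a given representation.

    force:     [fx, fy, fz] contact force.
    position:  [px, py, pz] contact location in the sensor frame.
    threshold: hypothetical force magnitude above which contact is declared.
    """
    magnitude = math.sqrt(sum(f * f for f in force))
    if kind == "binary":
        return [1.0 if magnitude > threshold else 0.0]
    if kind == "magnitude":
        return [magnitude]
    if kind == "vec":
        return list(force)
    if kind == "pos":
        return list(position)
    if kind == "cop":
        # Full representation: force vector followed by contact location.
        return list(force) + list(position)
    raise ValueError(f"unknown representation: {kind}")


force = [0.0, 0.0, 1.5]
position = [0.002, -0.001, 0.0]

for kind in ("binary", "magnitude", "vec", "pos", "cop"):
    print(kind, contact_features(force, position, kind))
```

Seen this way, the force-only and position-only ablations are each half of the full CoP observation.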

Post-training, the policy’s latent representations are linearly probed to decode the object state (position, velocity) and analyzed via PCA for emergent encoding of physical properties such as mass.
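A linear probe of this kind can be sketched as a least-squares fit from latent vectors to one coordinate of the object state, scored by RMSE. The example below uses synthetic data and a small ridge term for numerical stability. The function names, the latent dimension, and the data are assumptions for illustration, not the paper's probing setup.

```python
import math
import random


def solve_linear(a, b):
    """Solve a @ x = b for square a using Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_probe(latents, targets, ridge=1e-6):
    """Ridge least squares with a bias column: w = (Z^T Z + ridge I)^-1 Z^T y."""
    z = [row + [1.0] for row in latents]
    d = len(z[0])
    gram = [[sum(r[i] * r[j] for r in z) + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * y for r, y in zip(z, targets)) for i in range(d)]
    return solve_linear(gram, rhs)


def probe_rmse(weights, latents, targets):
    """Root-mean-square error of the probe's predictions."""
    preds = [sum(w * v for w, v in zip(weights, row + [1.0])) for row in latents]
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets))


# Synthetic stand-in: 3D "latents" and a target that is exactly linear in them.
rng = random.Random(0)
latents = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(200)]
true_w = [0.05, -0.02, 0.03]
targets = [sum(w * v for w, v in zip(true_w, z)) + 0.10 for z in latents]

weights = fit_linear_probe(latents, targets)
rmse = probe_rmse(weights, latents, targets)
print(weights, rmse)
```

In practice the probe would be fit on one set of rollouts and scored on held-out ones, so that a low RMSE reflects information in the latent state rather than overfitting.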

Overall, the methodology carefully integrates tactile sensor physics, differentiable calibration, simulation alignment, and rigorous zero-shot sim-to-real evaluation on complex multi-finger contact manipulation tasks.

Technical innovations

Introduction of the Center-of-Pressure (CoP) tactile representation combining a 3D contact force vector and contact location as a middle ground between coarse binary and raw taxel signals, enabling richer but transferable tactile inputs for sim-to-real transfer.
A differentiable sensor calibration method estimating unknown taxel frame rotations by minimizing joint torque errors from estimated CoP wrenches, requiring no ground-truth external force labels.
A physics-grounded parametric stress distribution model to map CoP forces to individual taxel forces accounting for compliant skin deformation and force spreading, maintaining differentiability for learning.
Direct zero-shot sim-to-real reinforcement learning on multi-finger tactile manipulation tasks using the aligned CoP representation without additional teacher-student distillation or task-specific real data training.

Datasets

Calibration dataset: tens of thousands of timesteps of recorded raw taxel forces, joint angles, and joint torques from random fingertip contacts during static equilibrium on the Allegro hand (internal, not public).
Peg-in-hole insertion: 6 custom peg-hole pairs with various shapes; 10 real-world trials per baseline and condition (internal).
Ball balancing: 4 different balls of diverse mass (50 g to 250 g), size, and texture; 10 real-world trials per ball and baseline (internal).

Baselines vs proposed

Proprioception-only (base) peg-in-hole success rate = 43% overall vs CoP = 78%
Binary contact peg-in-hole success rate = 53% overall vs CoP = 78%
Raw taxel forces peg-in-hole success rate = 48% overall vs CoP = 78%
Vector force only (vec) peg-in-hole success rate = 67% vs CoP = 78%
Position only (pos) peg-in-hole success rate = 50% vs CoP = 78%
Proprioception-only (base) ball-balancing time-to-fall = 1.38 s vs CoP = 4.60 s
Binary contact ball-balancing time-to-fall = 1.99 s vs CoP = 4.60 s
Raw taxel forces ball-balancing time-to-fall = 1.49 s vs CoP = 4.60 s

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.28812.

Fig 1: (a) CoP representation. (b) The proposed stress distribution model for XELA uSkin sensors [35].

Fig 2: Our proposed differentiable dynamics-based

Fig 3 (page 3).

Fig 4 (page 3).

Fig 5 (page 3).

Fig 6 (page 3).

Fig 7 (page 3).

Fig 8 (page 3).

Limitations

CoP abstracts raw taxel readings into force and location, which may discard sensor-specific tactile details potentially helpful for more complex tasks.
Simulated shear forces were unreliable, so only surface-normal CoP forces are used, limiting representation richness in contact dynamics.
Simulators used do not model all real tactile sensory interactions (e.g., self-collisions, environment contacts), causing distribution mismatches in real sensors.
Current approach is validated only on fixed-base dexterous hands equipped with XELA uSkin sensors; generalization to arm-hand systems and other sensor types remains open.
Actuator dynamics calibration partially compensates for unmodeled friction and delays but subtle hardware effects still challenge policy robustness.
Policies are blind (no vision), which simplifies perception but limits applicability to tasks requiring multimodal sensing.

Open questions / follow-ons

How can CoP be extended to incorporate reliable shear forces and richer multi-contact pressure distributions without sacrificing sim-to-real transfer robustness?
Can the differentiable calibration method be adapted to other tactile sensors with different architectures or full-hand tactile coverage?
How can CoP tactile representations be integrated effectively with vision and inertial modalities for general manipulation tasks beyond blind scenarios?
What are the comparative benefits of CoP in imitation learning or sample-efficient real-world reinforcement learning compared to sim-to-real RL alone?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work highlights the importance of physically grounded, intermediate representations in bridging the gap between simulated and real sensory inputs. In CAPTCHA contexts where tactile or contact-based biometrics or proof-of-human-interaction signals might be leveraged, CoP illustrates how rich yet transferable sensor representations can capture subtle contact dynamics robustly, facilitating zero-shot transfer from simulations or controlled settings to real-world deployments.

The differentiable calibration approach also exemplifies a lightweight methodology for aligning simulated and real sensor data without requiring extensive ground-truth force measurements, which could inspire analogous calibration or sensor alignment techniques in CAPTCHA systems relying on physical interaction modeling. Overall, the insights into balancing richness of tactile representation with robustness to sim-to-real gaps may inform the design of next-generation proof-of-interaction sensors and policies that utilize touch-like feedback for bot detection under realistic operating conditions.

Cite

bibtex

@article{arxiv2605_28812,
  title={Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation},
  author={Jiahe Pan and Stelian Coros and Jitendra Malik and Toru Lin},
  journal={arXiv preprint arXiv:2605.28812},
  year={2026},
  url={https://arxiv.org/abs/2605.28812}
}
