DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

Source: arXiv:2605.16257 · Published 2026-05-15 · By Hanwen Wang, Weizhi Zhao, Xiangyu Wang, Siyuan Huang, He Lin, Boyuan Zheng et al.

TL;DR

The paper benchmarks multiple modern policy architectures—including ACT, Diffusion Policies, π0.5, and GR00T N1.5—under single-task, multi-task, and randomized domain conditions. The results reveal significant challenges in achieving robust task success, especially on bimanual and long-horizon tasks, and notable shortcomings such as failures in precise button pressing, insertion, and memory for conditioned sequential actions. The benchmark exposes trade-offs between model scale, pretraining, and architecture design. Overall, the work highlights the limitations of current methods and the importance of dexterous hand-focused data and evaluation pipelines to advance robot learning of human-level dexterous manipulation.

Key findings

DexJoCo includes 11 tasks spanning tool-use, bimanual manipulation, long-horizon execution, and reasoning, demonstrating broader functional coverage than prior benchmarks (Table 1).
Collected 1,100+ human demonstration trajectories via low-cost glove+tracker teleoperation system costing ~$2300 USD.
Under randomized visual conditions (rand-full), success rates for all policies drop sharply by approximately 30% on average, indicating limited robustness.
π0.5 model achieves highest overall average success rate of 50.4% (rand-obj) vs baseline ACT at 35.5% and Diffusion Policy Transformer (DP-T) at 47.6% (Table 2).
DP-C model with FiLM observation injection excels on precise operations like Unlock iPad and Pinch Tongs, outperforming other policies on these tasks.
Multi-task joint training degrades performance of DP-T by ~10% average success compared to single-task training, while π0.5 gains on some tasks but loses overall success (Table 3).
Action head randomization decreases π0.5 success rate by ~3% relative to retaining pretrained weights (rand-AH ablation).
Language-conditioned generalization is poor: π0.5 trained on digit passwords defaults to fixed action bias rather than true linguistic understanding (App. A).

Threat model

n/a — This work focuses on benchmarking dexterous manipulation policies in simulation and does not address adversarial threats or security concerns.

Methodology — deep read

Threat Model & Assumptions: The adversary is not explicitly defined since this is a benchmark paper; the goal is to evaluate dexterous manipulation policy generalization and robustness rather than security. The assumption is that policies operate in simulation with known physics and observation modalities but face visual and dynamics domain shifts.
Data Collection: The authors developed a low-cost human teleoperation system using Rokoko motion-capture gloves for hand pose and HTC Vive trackers for wrist pose. They designed a retargeting neural network (GeoRT) to convert human fingertip poses to Allegro robotic hand joint angles, optimizing several loss terms to preserve motion fidelity and avoid collision. The system collects 1,100+ trajectories across 11 dexterous manipulation tasks implemented in MuJoCo.
Benchmark Architecture & Environment: The robotic setup comprises a Franka Panda arm, Allegro Hand, and Rethink Robotics base within MuJoCo simulator. Tasks include both unimanual and bimanual scenarios with object interactions, articulated mechanisms, temporal constraints, and success criteria involving pose, joint states, contacts, and sequence order. Observation modalities include third-person RGB and RGB-D, wrist and hand joint states, and object poses. Action spaces encode absolute end-effector poses and hand joint angles.
Teleoperation & Retargeting: Teleop records absolute end-effector and hand joint targets executed in the simulator. The retargeting MLP is trained self-supervised on fingertip workspace coverage without paired annotations, with careful loss design (Ldir, Lcover, Lflat, Lpinch, Lcol) to ensure faithful and collision-free grasp replication.
Domain Randomization: To test robustness, domain randomization is applied during trajectory replay including randomized camera views, lighting colors and directions, tabletop textures, object placements, table height, and physics parameters like friction and stiffness.
Policy Training & Evaluation: Four policy families are benchmarked, including ACT (conditional variational autoencoder), Diffusion Policy (Transformer and CNN variants), π0.5 and GR00T N1.5 (pretrained vision-language-action models fine-tuned via Low-Rank Adaptation). Policies predict k-step action chunks conditioned on last h observations plus optional language instructions. Evaluation metrics include average task success rates over 50 trials per task across 11 tasks. Evaluations consider randomized object placements (“rand-obj”) and full domain randomization including visual aspects (“rand-full”). Multi-task training is tested by jointly training on all 11 tasks. Additional ablations evaluate robustness to dynamics randomization and action-head reinitialization.
Failure Mode & Analysis: Detailed breakdowns of failure causes are provided for leading policies (π0.5, DP-C), identifying deficiencies in fine-grained button pressing, insertion alignment, and temporal memory.
Reproducibility: Code, benchmark tasks, and data formats (includes LeRobot Dataset v3.0, DP Zarr) are made publicly available via the project webpage. Hardware designs and retargeting algorithm are described in detail to enable replication.

Example Workflow: A human operator uses the Rokoko glove to teleoperate the Allegro hand in MuJoCo performing the "Hammer Nail" task. Recorded trajectories encode absolute end-effector poses plus joint angle targets. The collected data is replayed under varied camera, lighting, and physics conditions to augment training data. Policy models like π0.5 are fine-tuned on this data with vision and proprioceptive inputs. Evaluation runs 50 repeated trials per task with randomized object placements and reports mean success and failure modes to quantify robustness and dexterous skill levels.

Technical innovations

Introduction of 11 functionally grounded dexterous manipulation tasks specifically designed to highlight tool-use, bimanual coordination, long-horizon tasks, and reasoning, addressing limitations of prior benchmarks that focus on simpler in-hand or pick-and-place tasks.
Development of a low-cost, high-fidelity teleoperation data collection system combining motion-capture gloves and wrist trackers, paired with a self-supervised retargeting algorithm (GeoRT) to map human fingertip motions to robotic hand joint configurations efficiently.
Comprehensive domain randomization framework integrated into trajectory replay to systematically evaluate the robustness of dexterous manipulation policies under visual and dynamic variations.
Empirical benchmarking of diverse modern policy architectures—including large pretrained vision-language-action models and diffusion-based planners—under multi-task, randomized domain, and action-head adaptation settings, revealing detailed failure modes.
Public release of a unified benchmark pipeline with task environments, data formats, and evaluation protocols aimed to standardize dexterous manipulation research and promote reproducibility.

Datasets

DexJoCo Demonstration Dataset — 1,100+ trajectories — collected via low-cost teleoperation system in MuJoCo simulation

Baselines vs proposed

DP-T (Diffusion Policy Transformer): average success rate 50.4% under rand-obj vs π0.5: 50.4%, ACT: 35.5%, DP-C: 47.6% (Table 2)
π0.5: average success 50.4% (rand-obj), drops to 20% under rand-full visual randomization (Table 2)
Multi-task training performance: DP-T degrades from 50.4% (single-task) to 33.2%, π0.5 drops slightly to 45.5% (Table 3)
Action head randomization reduces π0.5 success from 50.4% to 48.7% average, retaining pretrained head is beneficial (Table 3)
DP-C outperforms others on precise button-pressing tasks (Unlock iPad, Pinch Tongs) with ~52.0% success under rand-obj, versus ACT ~31.3% (Table 2)
π0.5 outperforms DP-T on multi-task Click Mouse and Pinch Tongs improving task-specific success by ~10-15% (Table 3)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.16257.

Fig 1

Fig 1: Overview of DexJoCo. DexJoCo is a dexterous manipulation benchmark with a toolkit

Fig 2

Fig 2 (page 1).

Fig 3

Fig 3 (page 1).

Fig 4

Fig 4 (page 1).

Fig 5

Fig 5 (page 1).

Fig 6

Fig 6 (page 1).

Fig 7

Fig 7 (page 1).

Fig 8

Fig 8 (page 1).

Limitations

Limited sim-to-real fidelity - although domain randomization simulates variations, no real robot deployment or transfer evaluations are included.
Current observation and action spaces lack tactile or force sensing, limiting policies to vision and proprioception, which hinders contact-rich manipulation.
Multi-task training protocols degrade performance for some models, indicating insufficient scalability or capacity to handle diverse dexterous tasks.
Pretrained large models show limited ability to generalize language instructions beyond trained tokens, defaulting to fixed action biases.
Demonstration data collected is limited to 1,100 trajectories—while large for dexterous hands, still relatively small compared to manipulator gripper datasets.
Benchmark does not include adversarial or unexpected environment perturbations beyond randomized domain shifts.

Open questions / follow-ons

How to incorporate tactile and force sensing modalities into dexterous manipulation benchmarks and policies to improve contact-rich control?
What architectural advances or training schemes will better scale to multi-task learning without performance degradation in high-dimensional dexterous hands?
How to improve language-conditioned dexterous manipulation to achieve true semantic generalization beyond fixed action biases?
Which sim-to-real methods and simulation fidelity improvements are required for reliable zero-shot transfer of dexterous policies to physical robots?

Why it matters for bot defense

While DexJoCo is not directly related to CAPTCHA solving or bot defense, its contributions are relevant for researchers building sophisticated robot policies that require fine-grained manipulation and robust visual grounding under diverse conditions. For CAPTCHA applications that involve physical-world interactions or verification of human dexterity, insights from this benchmark on resilience to visual domain shifts, failure modes in precise actions, and multi-modal policy conditioning can inform design requirements. Additionally, the benchmarking pipeline and data collection methodologies illustrate approaches to rigorously evaluate high-dimensional action policies via human demonstration replay and domain randomization, potentially applicable to testing bot behavior in embodied environments. Overall, practitioners in bot defense and CAPTCHA should view DexJoCo primarily as a resource showcasing challenges and evaluation frameworks for generalist dexterous control, offering analogies for evaluating robustness and task generalization in vision-language-action models rather than direct application.

Cite

bibtex

@article{arxiv2605_16257,
  title={ DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo },
  author={ Hanwen Wang and Weizhi Zhao and Xiangyu Wang and Siyuan Huang and He Lin and Boyuan Zheng and Rongtao Xu and Gang Wang and Yao Mu and He Wang and Lue Fan and Hongsheng Li and Zhaoxiang Zhang and Tieniu Tan },
  journal={arXiv preprint arXiv:2605.16257},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.16257}
}

DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​