
Mana: Dexterous Manipulation of Articulated Tools

Source: arXiv:2606.13677 · Published 2026-06-11 · By Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu

TL;DR

This paper addresses dexterous manipulation of articulated tools with multi-fingered robotic hands. Articulated tools are difficult to handle because the hand must hold a stable grasp while actuating the tool's internal joints, which demands precise contact forces and finger coordination. Prior work has mostly focused on rigid-object manipulation or simplified scenarios, leaving articulated tool use largely unexplored, especially for thin, small tools on tabletops. The authors present Mana (Manipulation Animator), a sim-to-real framework that treats dexterous manipulation as an animation problem. Mana uses a coarse-to-fine pipeline: a small amount of human input specifies functional affordances on the tool mesh, and the system then automatically generates dense grasp keyframes that define manipulation skeletons. These keyframes are connected into trajectories using motion planning for collision-free reaching and short-horizon reinforcement learning (RL) for contact-rich transitions that require precise position-force control. The resulting simulation dataset is used to train a point-cloud-conditioned diffusion policy that performs grasping and in-hand manipulation and transfers zero-shot to a real robot with an Allegro hand.

Experiments cover four articulated tools (tongs, pliers, clothespins, syringes) that differ widely in scale and joint type. The system achieves an average success rate of around 70% across the grasping and functional in-hand manipulation phases, outperforming teleoperation and open-loop baselines by large margins. Ablations show that diverse grasp keyframes, force perturbation randomization, and data scale all matter for robustness. The approach enables autonomous grasping and manipulation of thin articulated tools from visual input using only simulation data, which points to a scalable way to tackle real-world dexterous tool use with complex contact-rich dynamics.

Key findings

  • Mana achieves approx. 70–80% success rates for both grasping and in-hand manipulation across four different articulated tools: tongs, pliers, clothespins, and syringes (Tab. 1).
  • The teleoperation baseline (using Geometric Retargeting) struggles with precise force control on thin tools, achieving at most 30% success on tongs and near 0% on clothespins and syringes.
  • Open-loop replay of Mana trajectories yields moderate performance (~40–60%), reflecting residual contact errors and pose estimation limitations.
  • The ablation in Fig. 7 shows that success rate scales with the size of the generated trajectory dataset and the number of grasp keyframes, which highlights the importance of dense functional state coverage.
  • Force perturbation randomization during RL training substantially improves robustness to real-world noise and contact dynamics (Fig. 7).
  • Mana’s fingertip hardware design, which uses flattened, compliant silicone, gives more stable contact on thin object handles, which is critical for applying functional forces.
  • Zero-shot sim-to-real transfer is achieved without real-world domain adaptation by training the diffusion policy entirely on Mana-generated simulation data.
  • Functional tool-use tasks that combine grasping and manipulation achieve 50–70% success (e.g., tong picking 7/10, plier cutting 5/10), with manual wrist teleoperation still used for final fine alignment.
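
With only 10 trials per condition, each reported rate carries considerable sampling uncertainty. The sketch below is not from the paper: it applies a standard 95% Wilson score interval to the 7/10 and 5/10 counts quoted in the list above. The function name `wilson_interval` and the choice of interval are illustrative assumptions.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half

# Counts taken from the tool-use results quoted above (10 trials each).
for name, k in (("tong picking", 7), ("plier cutting", 5)):
    lo, hi = wilson_interval(k, 10)
    print(f"{name}: {k}/10, 95% interval [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions the interval for 7/10 spans roughly 0.40 to 0.89 and the one for 5/10 roughly 0.24 to 0.76, so differences of 10 to 20 percentage points between conditions should be read with caution.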

Methodology — deep read

  1. Threat Model & Assumptions: No adversary model is considered, since this is a robotic manipulation problem rather than a security scenario. The system assumes full knowledge of the tool meshes and articulation models, plus a small amount of human-labeled functional affordances per tool. The robot must cope with realistic noise such as perception errors and actuator uncertainty, but not with adversarial sabotage.

  2. Data: Data generation begins with user annotation of functional affordance regions on the tool mesh (<1 minute per tool). Using these labels, a grasp keyframe generator densely samples stable, functional fingertip grasp/contact states across tool configurations in simulation, using collision-aware inverse kinematics and stability filtering in the IsaacLab simulator. The keyframes are then connected into full manipulation trajectories by two methods: motion planning with GPU-accelerated RRT-Connect for collision-free reaching and pre-grasp poses, and short-horizon reinforcement learning for dynamic, contact-rich transitions that require precise position-force control. The data include wrist poses, hand joint commands, object states, and phase labels such as grasping, opening, and closing.

  3. Architecture / Algorithm: The learned policy is a point-cloud-conditioned diffusion model that takes segmented tool point clouds (from RGB-D) plus robot proprioception as input and outputs delta 6D wrist poses and delta finger joint positions. The point clouds are first encoded by a Perceiver-style transformer that compresses them into tokens, which are then passed to a lightweight diffusion head that generates actions. Wrist commands are executed through a differential IK solver, and the fingers are driven by low-level PD control. The training loss is the standard denoising-diffusion L2 objective on noised action samples.

  4. Training Regime: The RL used for trajectory generation optimizes custom rewards that combine tool pose matching, hand pose matching, and contact maintenance terms. To improve sim-to-real transfer, training randomizes controller PD gains, physical properties, force perturbations, and action noise. Training details such as epochs and batch size are not precisely specified. The diffusion policy is trained on the full generated dataset of successful trajectories, with point-cloud randomization (noise and partial masking) applied during training.

  5. Evaluation Protocol: The authors evaluate each phase separately (grasping, opening, closing) on four tools with two object instances each, using 10 trials per object. Baselines are teleoperation via Geometric Retargeting and open-loop replay of trajectories with manually initialized object poses. The success metric is the binary task completion rate. Ablations cover the effects of dataset size, grasp keyframe diversity, and force randomization on success rate. Real-world deployment is zero-shot from simulation, without domain adaptation.

  6. Reproducibility: The paper uses the IsaacLab simulator for grasp evaluation and publicly available robot hardware (Allegro hand, xArm7). The authors provide a project page, which suggests a code/data release, but there is no explicit mention of frozen weights or a closed dataset. Random seeds and exact hyperparameters are not detailed, which limits exact reproduction.
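
The "standard denoising diffusion L2" loss mentioned in step 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes noise prediction (rather than sample prediction), a linear beta schedule, and a 22-dimensional action vector, none of which the summary confirms. The point-cloud and proprioception conditioning is omitted, and `predict_eps` is a hypothetical stand-in for the policy network.

```python
import math
import random

def alpha_bars(num_steps=100, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    out, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def add_noise(action, eps, abar):
    """Forward process: x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps."""
    return [math.sqrt(abar) * a + math.sqrt(1.0 - abar) * e
            for a, e in zip(action, eps)]

def denoising_loss(action, predict_eps, abars, rng):
    """Mean squared error between the true and predicted noise at a random step."""
    t = rng.randrange(len(abars))
    eps = [rng.gauss(0.0, 1.0) for _ in action]
    noisy = add_noise(action, eps, abars[t])
    pred = predict_eps(noisy, t)  # a real policy would also take observations
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(action)

# Hypothetical clean action: wrist delta plus finger joint deltas, flattened.
rng = random.Random(0)
action = [0.01 * i for i in range(22)]
untrained = lambda noisy, t: [0.0] * len(noisy)
print(denoising_loss(action, untrained, alpha_bars(), rng))
```

A trained network would drive this loss toward zero by recovering the injected noise from the noisy action and the conditioning tokens.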

Concrete example: For the syringe, a user first labels the plunger and barrel affordance regions. The system generates multiple grasps around the plunger handle at various insertion depths. Pre-grasp poses are motion-planned from a random initial hand configuration. Short-horizon RL policies are then trained to squeeze and push the plunger into the barrel while maintaining stable multi-finger contact under the required 3–7 N force. The resulting trajectories become training data for the diffusion policy, which is deployed zero-shot on the real Allegro hand with a RealSense camera. The robot grasps the syringe from the tabletop, lifts it, and pushes the plunger to inject fluid in an autonomous sequence, with manual wrist teleoperation used for delicate fine alignment.
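
The short-horizon RL stage described in step 4 and in the syringe example combines pose-tracking and contact terms with per-episode randomization. The sketch below shows one plausible shape for such a reward and sampler. The exponential kernels, weights, ranges, and the names `tracking_reward` and `sample_randomization` are all placeholder assumptions, since the summary does not give the paper's actual formulas or values.

```python
import math
import random

def tracking_reward(tool_err, hand_err, contacts_kept, contacts_total,
                    w_tool=1.0, w_hand=0.5, w_contact=0.5, scale=10.0):
    """Weighted sum of tool-pose, hand-pose, and contact-maintenance terms.

    tool_err / hand_err are nonnegative distances to the target keyframe;
    all weights and the kernel scale are hypothetical.
    """
    r_tool = math.exp(-scale * tool_err)
    r_hand = math.exp(-scale * hand_err)
    r_contact = contacts_kept / contacts_total if contacts_total else 0.0
    return w_tool * r_tool + w_hand * r_hand + w_contact * r_contact

def sample_randomization(rng, max_force=1.0):
    """Draw one episode's randomization parameters (ranges are placeholders)."""
    return {
        "pd_gain_scale": rng.uniform(0.8, 1.2),
        "friction": rng.uniform(0.5, 1.5),
        "mass_scale": rng.uniform(0.8, 1.2),
        "force_perturbation": [rng.uniform(-max_force, max_force) for _ in range(3)],
        "action_noise_std": rng.uniform(0.0, 0.02),
    }

rng = random.Random(0)
print(sample_randomization(rng))
print(tracking_reward(tool_err=0.02, hand_err=0.05, contacts_kept=3, contacts_total=4))
```

Keeping the contact term separate from the pose terms makes it possible to penalize a dropped fingertip even when the tool and hand poses still look close to the target keyframe.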

Technical innovations

  • Framing articulated tool manipulation as a computer animation problem with a coarse-to-fine pipeline that decomposes manipulation sequences into grasp keyframes and short dynamic transitions, avoiding brittle end-to-end RL.
  • Procedural tool-specific functional affordance annotation for fast automatic generation of dense, diverse grasp keyframes covering whole manipulation state spaces.
  • Hybrid trajectory generation method combining GPU-accelerated RRT-Connect motion planning for pre-grasp phases and short-horizon reinforcement learning for contact-rich in-hand transitions with position-force coordination.
  • Training a point-cloud-conditioned transformer diffusion policy on synthetic trajectories enabling zero-shot sim-to-real transfer of dexterous grasping and functional in-hand tool use from RGB-D observation.
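
For readers unfamiliar with the planner named in the list above, here is a toy CPU version of RRT-Connect for a 2D point among circular obstacles. It only illustrates the bidirectional extend/connect idea. The paper's planner is GPU-accelerated and runs in the robot's configuration space, and every function name and parameter below is an assumption made for this sketch.

```python
import math
import random

def collision_free(p, q, obstacles, step=0.01):
    """Check the straight segment p->q against circular obstacles (cx, cy, r)."""
    n = max(1, int(math.dist(p, q) / step))
    for i in range(n + 1):
        t = i / n
        x, y = p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])
        if any(math.hypot(x - cx, y - cy) <= r for cx, cy, r in obstacles):
            return False
    return True

def steer(p, q, eps):
    """Move from p toward q by at most eps; return q itself if within reach."""
    d = math.dist(p, q)
    if d <= eps:
        return q
    return (p[0] + eps * (q[0] - p[0]) / d, p[1] + eps * (q[1] - p[1]) / d)

def extend(tree, target, obstacles, eps):
    """Grow the tree one step toward target; return the new node or None."""
    near = min(tree, key=lambda node: math.dist(node, target))
    new = steer(near, target, eps)
    if new in tree or not collision_free(near, new, obstacles):
        return None
    tree[new] = near  # child -> parent
    return new

def connect(tree, target, obstacles, eps):
    """Keep extending toward target until it is reached or the tree is blocked."""
    while True:
        new = extend(tree, target, obstacles, eps)
        if new is None or new == target:
            return new

def trace(tree, node):
    """Follow parent links from node back to the tree root."""
    path = []
    while node is not None:
        path.append(node)
        node = tree[node]
    return path

def rrt_connect(start, goal, obstacles, eps=0.05, iters=5000, seed=0):
    rng = random.Random(seed)
    ta, tb = {start: None}, {goal: None}
    a_is_start = True
    for _ in range(iters):
        sample = (rng.random(), rng.random())  # unit-square workspace
        new = extend(ta, sample, obstacles, eps)
        if new is not None and connect(tb, new, obstacles, eps) is not None:
            path = trace(ta, new)[::-1] + trace(tb, new)[1:]
            return path if a_is_start else path[::-1]
        ta, tb = tb, ta  # alternate which tree is extended
        a_is_start = not a_is_start
    return None

path = rrt_connect((0.1, 0.1), (0.9, 0.9), obstacles=[(0.5, 0.5, 0.2)])
print(len(path) if path else "no path found")
```

Growing one tree from the start and one from the goal, and greedily connecting them, is what makes this planner well suited to free-space reaching. The contact-rich phases, where collisions with the tool are intended, are the part Mana hands over to RL.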

Datasets

  • Mana-generated articulated tool manipulation trajectories — Thousands of trajectories across 4 tool categories (tongs, pliers, clothespins, syringes) — Generated in IsaacLab simulation with human-provided affordance labels

Baselines vs proposed

  • Teleoperation (GeoRT): grasp success rates around 0.0–0.3 vs. Mana policy: 0.7–0.8 (Tab. 1)
  • Open-loop replay of Mana trajectories: grasp success approx. 0.3–0.6 vs. closed-loop Mana policy: 0.7–0.8
  • Teleoperation pliers cut success: 0.1 vs. Mana tool-use success: 0.5 (Tab. 2)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.13677.

Fig 1: Mana (Manipulation Animator) is a framework for learning dexterous manipulation of…

Fig 2: Physical Challenges of Articulated Tool Use. Left: Dexterous articulated tool manipu…

Fig 3: Mana Data System Overview. Mana takes a coarse-to-fine approach to generate tool…

Limitations

  • The current hardware's maximum torque (0.7 Nm) rules out stiff tools that require forces above 10 N (e.g., trigger mechanisms).
  • The focus on precision grasps excludes the power-grasping strategies common in humans; the Allegro hand's size and design limit the applicability of power grasps to thin tools.
  • Perception remains a challenge for sub-millimeter alignment and for detecting fingertip slip, especially under occlusion; addressing this will require integrating tactile/force sensing in future work.
  • Skill chaining currently requires manual wrist teleoperation for the fine alignment phases; fully autonomous sequential policy execution has not yet been realized.
  • Simulation models contact physics and force dynamics imperfectly, which can cause tool slippage and failure under perturbations in real-world tests.

Open questions / follow-ons

  • How can high-frequency tactile and force sensing be incorporated to detect and respond to slip under occlusion?
  • Can contact and friction physics be modeled more realistically to reduce the sim-to-real gap, especially for complex dynamic tool use?
  • How can fully autonomous policies chain multi-phase tool-use skills without manual teleoperation for fine alignment?
  • What hardware improvements (e.g., stronger fingers, smaller hands) would enable power-grasp tool use across a wider range of articulated objects?

Why it matters for bot defense

Although the paper is not directly about CAPTCHA or bot defense, its approach to dexterous articulated tool manipulation is instructive for robotics applications that need precise, contact-rich force control, visual perception, and sim-to-real policy learning. Bot-defense engineers interested in robotic system security may see the precise force coordination and sensor-noise handling described here as analogous to physical attack vectors or sensor manipulation in interactive devices. The coarse-to-fine data generation pipeline, which combines motion planning with reinforcement learning, may inspire similar multi-stage data augmentation for interaction models. The use of dense functional affordances to generate diverse simulated training data also underscores the value of semantic scene understanding and procedural annotation for training policies that are robust to distribution shift, a need shared by CAPTCHA solving and bot detection under adversarial conditions.

Cite

bibtex
@article{arxiv2606_13677,
  title={Mana: Dexterous Manipulation of Articulated Tools},
  author={Zhao-Heng Yin and Guanya Shi and Pieter Abbeel and C. Karen Liu},
  journal={arXiv preprint arXiv:2606.13677},
  year={2026},
  url={https://arxiv.org/abs/2606.13677}
}

Articles are CC BY 4.0 — feel free to quote with attribution