Touch-R1: Reinforcing Touch Reasoning in MLLMs

Source: arXiv:2605.27154 · Published 2026-05-26 · By Yingxin Lai, Yafei Zhou, Fucai Zhu, Siyu Zhu, Weihao Yuan

TL;DR

This paper addresses the underexplored problem of tactile reasoning in multimodal large language models (MLLMs). Prior tactile-language models are trained mainly with supervised or contrastive learning; they often lean on visual priors and fail to ground predictions in physical evidence. Touch-R1 instead applies reinforcement learning with rule-based rewards designed specifically for touch. It targets two challenges: the ordinal nature of tactile physical attributes (e.g., hardness, roughness) and sensor heterogeneity, which causes cross-sensor distribution shifts. The authors contribute TouchReason-1M, a large-scale synchronized tactile dataset with over 1 million tactile data pairs from four distinct optical tactile sensors, and TouchReason-Bench, a tactile reasoning evaluation benchmark. Touch-R1 is built on a Qwen2.5-VL-7B backbone and trained with a tactile-grounded GRPO objective that combines ordinal-aware accuracy, cross-sensor physical consistency, and input-side tactile grounding, so that the model is rewarded for using tactile evidence rather than superficial visual shortcuts. On TouchReason-Bench, Touch-R1 outperforms tactile-specialist and frontier MLLM baselines by large margins in average accuracy (e.g., +18.4 points over Octopi-13B and +24.7 points over GPT-4o). Its structured <perceive>-<compare>-<conclude> reasoning traces show the model verifying physical evidence and revising misleading visual priors.

Key findings

Touch-R1-7B achieves an average accuracy of 60.1 on TouchReason-Bench, compared to 47.5 for SToLa (a prior tactile-specialist model), 41.7 for Octopi-13B, and 35.4 for GPT-4o (a closed-source MLLM).
The ordinal-aware reward reduces ordinal mean absolute error (OMAE) from 0.42 to 0.31, improving predictions of ordinal tactile attributes.
The cross-sensor consistency reward raises the cross-sensor consistency (CSC) metric from 54.3 to 69.2, encouraging physical agreement across heterogeneous sensors.
Input-side tactile grounding objective forces model sensitivity to tactile input, helping prevent reliance on visual/object-category shortcuts.
TouchReason-1M contains 1.32M tactile frames collected from 1000+ objects over 9 material categories using 4 different optical tactile sensors.
Training using tactile-grounded GRPO with combined ordinal, consistency, and format rewards plus input grounding outperforms a Qwen2.5-VL-7B baseline by over 32 accuracy points overall.
TouchReason-Bench evaluates multiple tactile reasoning aspects including hardness, roughness, protrusion, material identification, ordinal reasoning, and visual-tactile conflict resolution.
Structured reasoning traces (<perceive>, <compare>, <conclude>) enable tactile-grounded explanations that revise misleading visual priors, as shown qualitatively and quantitatively.

Threat model

The adversary is implicitly the model's own tendency to exploit visual or object-category priors and shortcuts instead of the tactile input stream when predicting physical attributes. Touch-R1 assumes that tactile sensor inputs are not adversarially manipulated or spoofed; the failure mode is the model producing incorrect reasoning by ignoring tactile evidence. Training must therefore make the model resist shortcut learning and rely on tactile data.

Methodology — deep read

The study begins with a threat model in which an MLLM receives synchronized tactile and visual inputs and must reason about physical object properties, using tactile evidence from multiple sensors to overcome misleading visual priors. The adversary is implicitly the model's own tendency to rely on visual shortcuts or object-category priors rather than touch signals; this is not an external attack to defend against but a behavior the model must be trained out of.

Data provenance is key: the authors collected TouchReason-1M, a large-scale tactile-language multimodal dataset with 1.32M tactile frames from over 14,700 valid sequences (synchronized pairs) involving 1000+ real objects spanning 9 material categories. The data was collected with four optical tactile sensors (GelSight Mini, Xense, Tac3D, DM-Tac X) under a force-controlled protocol involving pressing and sliding interactions. Raw tactile images are converted into multiple complementary fields: deformation, shear, depth, and force. Human annotators labeled three ordinal attributes (hardness, roughness, and protrusion, each on a three-level scale) plus categorical material labels; the labels were cross-validated with high inter-annotator agreement (Cohen's kappa 0.78-0.82). The labeled set was then used to generate over 360K QA pairs with reasoning-based rationales grounded in tactile evidence, which were verified for quality.

Touch-R1 architecture builds on the Qwen2.5-VL-7B multimodal LLM backbone, augmented with a ViT-based tactile encoder that extracts features from tactile image sequences. Training occurs in 3 stages: (1) tactile dynamics pretraining with a future tactile token prediction loss (KL divergence to next frame tokens) to encourage sensitivity to deformation dynamics rather than static appearance; (2) supervised fine-tuning (SFT) on curated QA pairs from TouchReason-1M, where the model learns to output structured reasoning traces in <perceive>–<compare>–<conclude> segments answering questions about tactile physical attributes; (3) reinforcement learning using a tactile-grounded GRPO objective.

The GRPO reward function combines three output-side components: an ordinal-aware accuracy reward replacing binary correctness with graduated rewards that penalize according to ordinal distance between predicted and true attributes; a cross-sensor physical consistency reward measuring agreement among per-sensor and combined conclusions across different tactile sensors; and a structured-format reward enforcing strict adherence to the structured output template for reliable reward parsing. These output rewards ensure answers are not only correct but physically grounded and consistent across sensors.
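The two content rewards described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear decay with ordinal distance, the pairwise-agreement definition of consistency, and the function names `ordinal_reward` and `consistency_reward` are all assumptions.

```python
from itertools import combinations
from typing import Mapping

LEVELS = 3  # three-level ordinal scale, as in the annotation scheme


def ordinal_reward(pred: int, true: int, levels: int = LEVELS) -> float:
    """Graduated reward: 1.0 for an exact match, falling linearly to 0.0
    at the maximum ordinal distance. The linear shape is an assumption."""
    if not (0 <= pred < levels and 0 <= true < levels):
        raise ValueError("levels must lie in [0, levels)")
    return 1.0 - abs(pred - true) / (levels - 1)


def consistency_reward(per_sensor: Mapping[str, int]) -> float:
    """Fraction of sensor pairs whose predicted levels agree.
    Pairwise agreement is an assumed stand-in for the paper's metric."""
    pairs = list(combinations(per_sensor.values(), 2))
    if not pairs:
        return 1.0  # a single sensor is trivially self-consistent
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical per-sensor hardness predictions (0=soft, 1=medium, 2=hard).
preds = {"GelSight Mini": 2, "Xense": 2, "Tac3D": 1, "DM-Tac X": 2}
print(ordinal_reward(1, 2))        # off by one level on a 3-level scale
print(consistency_reward(preds))   # 3 of 6 sensor pairs agree
```

Unlike a binary reward, an off-by-one prediction here still earns partial credit, which is the property the ordinal-aware reward is meant to provide.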

To guarantee actual tactile input dependence (not just correct answers by visual priors), an input-side tactile grounding objective is included. This is a KL divergence loss between the policy distribution over tokens when given intact tactile sequences versus perturbed tactile streams (randomly masked/replaced with Gaussian noise), encouraging sensitivity to the presence or absence of real tactile data.
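A rough sketch of the quantity involved, assuming per-token logits are available for both the intact and the perturbed tactile input. The KL direction, the averaging over token positions, and the name `grounding_score` are illustrative assumptions; the summary does not specify them.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def grounding_score(intact_logits, perturbed_logits) -> float:
    """Mean per-token KL between the policy's output distribution with intact
    tactile input and with perturbed tactile input. A score near zero means
    the output barely depends on touch (a tactile-blind policy)."""
    kls = [
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(intact_logits, perturbed_logits)
    ]
    return sum(kls) / len(kls)


# Hypothetical logits over a 3-token vocabulary at two positions.
intact = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.0]]
blind = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.0]]      # unchanged by perturbation
grounded = [[0.0, 0.2, 1.8], [1.4, 0.1, 0.2]]    # shifts when touch is masked
print(grounding_score(intact, blind))     # 0.0: output ignores tactile input
print(grounding_score(intact, grounded))  # positive: output depends on touch
```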

Together, the total RL objective augments the GRPO reward with this tactile grounding constraint to optimize a policy that correctly uses physically consistent tactile evidence with explicit ordinal reasoning, structured outputs, and sensitivity to tactile input.
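One plausible way to write the combined objective is shown below. The notation is assumed for illustration only: the weights \lambda_c, \lambda_f, \lambda_g, the symbols q (question), t (intact tactile input), and \tilde{t} (perturbed tactile input), and the sign of the grounding term are not taken from the paper.

```latex
% Assumed notation; weights and symbols are hypothetical, not from the paper.
% Combined output-side reward for a sampled response o:
R(o) = R_{\mathrm{ord}}(o) + \lambda_{c}\, R_{\mathrm{cons}}(o) + \lambda_{f}\, R_{\mathrm{fmt}}(o)

% Group-relative advantage over G sampled responses (standard GRPO form):
\hat{A}_i = \frac{R(o_i) - \operatorname{mean}\{R(o_j)\}_{j=1}^{G}}
                 {\operatorname{std}\{R(o_j)\}_{j=1}^{G}}

% Total objective: GRPO term plus a bonus for sensitivity to tactile input:
\mathcal{J}(\theta) = \mathcal{J}_{\mathrm{GRPO}}(\theta; \hat{A})
  + \lambda_{g}\, D_{\mathrm{KL}}\!\left(
      \pi_\theta(\cdot \mid q, t) \,\middle\|\, \pi_\theta(\cdot \mid q, \tilde{t})
    \right)
```

Under this reading, the grounding term is maximized, so a policy whose outputs do not change when the tactile stream is masked or replaced with noise receives no bonus.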

Evaluation is performed on TouchReason-Bench, a held-out set of 200 unseen objects with 4,800 QA pairs across the four sensors and nine materials. Metrics include hardness, roughness, protrusion, and material classification accuracy; ordinal error (OMAE); compositional reasoning (exact match); comparative reasoning (pairwise and listwise consistency); and cross-sensor consistency. Multiple strong baselines are compared including open-source general multi-modal LLMs (Qwen2.5-VL, LLaVA, InternVL), tactile-specialist models (Octopi variants, SToLa, VTV-LLM), and closed-source frontier MLLMs (GPT-4o, Gemini-2.5-Pro, Claude-3.5-Sonnet).
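As a small illustration of two of these metrics, the sketch below computes plain accuracy and OMAE over integer-coded ordinal levels. It assumes OMAE is the mean absolute difference between predicted and true level indices, which matches the name but is not confirmed by the summary.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], trues: Sequence[int]) -> float:
    """Fraction of exact matches."""
    if len(preds) != len(trues) or not preds:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(preds, trues)) / len(preds)


def omae(preds: Sequence[int], trues: Sequence[int]) -> float:
    """Ordinal mean absolute error: average distance in levels
    between predicted and true ordinal labels (assumed definition)."""
    if len(preds) != len(trues) or not preds:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(preds, trues)) / len(preds)


# Hypothetical hardness labels (0=soft, 1=medium, 2=hard).
preds = [0, 1, 2, 2]
trues = [0, 2, 2, 0]
print(accuracy(preds, trues))  # 0.5
print(omae(preds, trues))      # 0.75
```

The two metrics can disagree: a model that is often off by one level and a model that is often off by two can have the same accuracy but different OMAE.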

An example end-to-end: given 8 tactile frames plus an Fz-depth force curve and a question (e.g. "What is the hardness?"), the model first encodes tactile dynamics, then generates structured reasoning in <perceive> describing tactile observations (e.g. deformation magnitude), next <compare> analyzing physical implications (e.g. contact deformation stronger for hard than soft objects), and finally <conclude> predicting the hardness level. Rewards assess if ordinal predictions are close to ground truth, if multiple sensors' conclusions agree, and if the reasoning structure is preserved. Input perturbations check if the answer depends on real tactile signals rather than shortcuts.
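Checking that the reasoning structure is preserved can be sketched with a strict parser. The closing tags (`</perceive>` and so on), the all-or-nothing scoring, and the names `parse_trace` and `format_reward` are assumptions; the summary only names the three segments.

```python
import re
from typing import Dict, Optional

# Assumed template: three tagged segments in fixed order, nothing else.
TRACE_RE = re.compile(
    r"\s*<perceive>(?P<perceive>.+?)</perceive>\s*"
    r"<compare>(?P<compare>.+?)</compare>\s*"
    r"<conclude>(?P<conclude>.+?)</conclude>\s*",
    re.DOTALL,
)


def parse_trace(text: str) -> Optional[Dict[str, str]]:
    """Return the three segments, or None if the template is violated."""
    m = TRACE_RE.fullmatch(text)
    if m is None:
        return None
    return {name: body.strip() for name, body in m.groupdict().items()}


def format_reward(text: str) -> float:
    """1.0 if the output follows the template exactly, else 0.0."""
    return 1.0 if parse_trace(text) is not None else 0.0


# Hypothetical model output for "What is the hardness?"
output = (
    "<perceive>large gel indentation with sharp contact edges</perceive>\n"
    "<compare>pattern is consistent with a rigid surface</compare>\n"
    "<conclude>hard</conclude>"
)
print(parse_trace(output)["conclude"])  # hard
print(format_reward("hard"))            # 0.0
```

A strict parser of this kind also makes the other rewards reliable, since the predicted level can always be read from the `<conclude>` segment.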

All training used 16 NVIDIA H200 GPUs. Optimization uses AdamW with a different learning rate for each stage; GRPO uses group size G=8 and a KL coefficient of 1e-3.
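To show what the group size means in practice, the sketch below computes group-relative advantages for G=8 sampled responses to one prompt, following the standard GRPO formulation. The use of the population standard deviation, the `eps` stabilizer, and the reward values are assumptions for illustration.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for G=8 responses sampled for the same question.
rewards = [1.0, 0.5, 0.5, 0.0, 1.0, 0.5, 0.0, 0.5]
for r, a in zip(rewards, group_advantages(rewards)):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Because advantages are computed relative to the group, a prompt on which all eight responses earn the same reward contributes no learning signal. Graduated rewards such as the ordinal one make such ties less likely than a binary reward does.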

The authors plan to release code and data publicly. However, some components of the pipeline are closed-source or proprietary (e.g., Gemini-2.5, used for QA generation).

Technical innovations

Design of a tactile-grounded GRPO reward combining ordinal-aware accuracy, cross-sensor physical consistency, and structured reasoning format control rewards to supervise tactile reasoning beyond simple correctness.
Input-side tactile grounding objective enforcing model dependence on tactile data via KL divergence between intact and perturbed tactile inputs, preventing tactile-blind shortcuts.
Construction of TouchReason-1M, a large-scale multimodal tactile reasoning dataset with over 1M synchronized tactile pairs collected from four heterogeneous optical tactile sensors under standardized force-controlled protocols.
Structured <perceive>–<compare>–<conclude> output format enabling interpretable tactile-grounded reasoning traces amenable to rule-based reinforcement learning rewards.

Datasets

TouchReason-1M — 1,323,000 tactile frames from 14,700 sequences over 1000+ objects — collected using 4 optical tactile sensors (GelSight Mini, Xense, Tac3D, DM-Tac X) under force-controlled protocols; human-labeled with ordinal physical attributes and material categories.
TouchReason-Bench — 4,800 QA pairs across 200 held-out objects and 4 sensors — held-out test benchmark derived from TouchReason-1M with verification QA for tactile perception and tactile-visual conflict resolution.

Baselines vs proposed

GPT-4o: Avg accuracy = 35.4% vs Touch-R1-7B: 60.1%
Octopi-13B: Avg accuracy = 41.7% vs Touch-R1-7B: 60.1%
SToLa: Avg accuracy = 47.5% vs Touch-R1-7B: 60.1%
Qwen2.5-VL-7B zero-shot: Avg accuracy = 28.0%; after cold-start supervised fine-tuning: 47.8%; with GRPO (binary reward): 50.1%; plus ordinal reward: 54.6%; plus consistency reward: 58.3%; full Touch-R1: 60.1%
Touch-R1 improves cross-sensor consistency (CSC) from 31.4 (Qwen2.5 zero-shot) to 71.3 and reduces ordinal mean absolute error (OMAE) from 0.83 to 0.24 compared to baselines.
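The ablation sequence above is easier to read as per-stage gains. The short script below derives them from the reported average accuracies; the stage labels are shortened for illustration.

```python
from typing import List, Tuple

# Average accuracy after each stage, as reported in the ablation above.
STAGES: List[Tuple[str, float]] = [
    ("zero-shot", 28.0),
    ("cold-start SFT", 47.8),
    ("GRPO, binary reward", 50.1),
    ("+ ordinal reward", 54.6),
    ("+ consistency reward", 58.3),
    ("full Touch-R1", 60.1),
]


def stage_gains(stages: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Accuracy gain of each stage over the previous one, in points."""
    return [
        (name, round(acc - prev_acc, 1))
        for (_, prev_acc), (name, acc) in zip(stages, stages[1:])
    ]


for name, gain in stage_gains(STAGES):
    print(f"{name}: +{gain} points")

total = round(STAGES[-1][1] - STAGES[0][1], 1)
print(f"total over zero-shot: +{total} points")  # +32.1
```

Supervised fine-tuning accounts for the largest single gain (+19.8 points), while the reinforcement learning stages together add a further 12.3 points, with the ordinal reward contributing the most among them (+4.5).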

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.27154.

Fig 1: Motivating example and benchmark overview. Left: Touch-R1 improves over representative closed-

Fig 2: Overview of TouchReason-1M collected with four optical tactile sensors on 1000+ objects across

Fig 3: Overview of TouchReason QA Pairs construction. The upper panel shows data collection, human

Fig 4: Touch-R1: a three-stage framework for tactile-grounded reasoning. (1) Tactile Dynamics

Limitations

The dataset and training focus exclusively on optical tactile sensors; thus, results may not generalize to other tactile sensing modalities (e.g., force/torque only, capacitive).
The QA pairs are generated by prompting an LLM (Gemini 2.5 Pro) on top of human attribute labels, which may introduce bias or errors despite filtering; fully manual end-to-end QA creation was limited.
Evaluation uses a curated benchmark with held-out objects but no adversarial attacks or real-world deployment tested; model robustness under adversarial tactile perturbations is unassessed.
Cross-sensor consistency reward assumes ordinal compatibility but does not guarantee pixel-level or fine-grained alignment across differing optical sensor outputs.
While tactile grounding objective penalizes tactile blindness, it is a proxy KL-based method that may not perfectly isolate reliance on tactile data versus confounded multimodal cues.
The structured reasoning format imposes template constraints that may limit flexibility of natural language outputs or unintentionally bias training.

Open questions / follow-ons

How well does the GRPO-based tactile grounding objective generalize to other tactile sensor modalities beyond optical tactile sensors?
Can the structured <perceive>-<compare>-<conclude> reasoning framework be extended to incorporate temporal reasoning over longer tactile sequences or active exploration?
What is the robustness of Touch-R1 to real-world tactile noise, sensor failures, or adversarial perturbations of tactile inputs?
How can the dataset and model scale to more complex multi-object scenes or fine-grained physical property estimation beyond ordinal categories?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work shows one way to integrate tactile evidence into multimodal reasoning models, with an emphasis on grounding predictions in physical sensor data rather than superficial visual cues or object-category priors. The reinforcement learning framework, with its explicit check that tactile input is actually used, could inspire approaches for verifying human interaction with haptics-enabled CAPTCHA systems or devices that require proof of physical contact. The challenges identified here (modality-specific ordinal labels, cross-sensor distribution shifts, and tactile-grounded explanation tracing) arise whenever low-level sensory inputs are integrated into reasoning models, and they may parallel the grounding of visual or behavioral signals. Deploying such tactile-grounded reasoning requires large, carefully annotated datasets and specialized sensors, which limits immediate applicability; the work is better read as a direction for future biometric and bot-detection research that incorporates touch or force-feedback modalities. Understanding and mitigating tactile-blind shortcuts may also be valuable for fraud detection models that rely on physical interaction evidence.

Cite

@article{arxiv2605_27154,
  title={Touch-R1: Reinforcing Touch Reasoning in MLLMs},
  author={Yingxin Lai and Yafei Zhou and Fucai Zhu and Siyu Zhu and Weihao Yuan},
  journal={arXiv preprint arXiv:2605.27154},
  year={2026},
  url={https://arxiv.org/abs/2605.27154}
}
