
RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation

Source: arXiv:2606.13040 · Published 2026-06-11 · By Dayu Xia, Yue Shi, Yao Mu, Huiting Ji, Chaofan Ma, Yingjie Zhou et al.

TL;DR

RoboProcessBench addresses the critical gap in evaluating whether vision-language models (VLMs) possess fine-grained process-aware understanding necessary for robotic manipulation tasks. Unlike prior evaluations focused primarily on binary success/failure outcomes, this benchmark decomposes manipulation execution into 12 diagnostic task families covering both static monitoring—such as phase and contact detection—and dynamic reasoning tasks like temporal ordering and progress estimation. The authors compile ProcessData, a curated dataset of nearly 58,000 QA pairs across 260 diverse manipulation tasks drawn from four robotics datasets with physically grounded execution traces, to enable rigorous evaluation and supervised fine-tuning of VLMs.

Extensive experiments expose significant limitations in current open- and closed-source VLMs when evaluated zero-shot on RoboProcessBench, especially on progress-centric and temporal reasoning tasks (e.g., primitive-local progress recognition reaches only 34.4% accuracy, near random chance). However, supervised fine-tuning on ProcessData-SFT substantially boosts performance across local state recognition, motion understanding, and primitive-level reasoning, demonstrating that many process-aware signals are learnable. Temporal reasoning and outcome prediction remain challenging and mark key directions for future model improvements. Overall, RoboProcessBench provides both a structured diagnostic evaluation suite and a learnable supervision substrate for advancing VLMs as process-aware judges, critics, and failure detectors in robotics.

Key findings

  • ProcessData contains ~58,000 process-aware QA items spanning 260 manipulation tasks across 12 diagnostic task families.
  • Zero-shot VLM accuracy varies widely; simpler static process tasks (e.g., motion state recognition T6) achieve around 58-69%, while progress-centric tasks (primitive-local progress T5) score near chance at ~34.4%.
  • Temporal reasoning tasks like temporal ordering (T8) achieve only around 17-22% zero-shot accuracy, close to random chance (20%), indicating poor inherent temporal understanding.
  • Two VLMs (Qwen2.5-VL-7B and InternVL-3-8B) fine-tuned on ProcessData-SFT improve static monitoring accuracy by ~43-44 points (to ~77.2%) and dynamic reasoning by ~21-24 points (to ~69%).
  • Fine-tuning improved primitive progress recognition (T5) accuracy from ~34% to ~45%, an 11-point gain, showing local progress signals are partially learnable.
  • Post-trained models reach 92-97% accuracy on current primitive recognition (T10) and primitive chain restoration (T12), highlighting gains in primitive-level cues.
  • Outcome prediction (T7) and temporal ordering (T8) remain difficult after fine-tuning, with modest or no improvement, which marks them as persistent bottlenecks.
  • RoboProcessBench uses strict episode-level splits between ProcessData-SFT and Eval, ensuring no leakage and reflecting true generalization.
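
To make "close to random chance" concrete, the sketch below computes a normal-approximation z-score of an observed accuracy against a chance rate. The 22.3% and 20% figures are the T8 numbers quoted in the findings above; the sample size `n_items=1000` is a made-up placeholder, since this summary does not give per-family item counts, and the helper name `z_score_vs_chance` is hypothetical.

```python
import math

def z_score_vs_chance(accuracy: float, chance: float, n_items: int) -> float:
    """Normal-approximation z-score of an observed accuracy against a chance rate."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must lie strictly between 0 and 1")
    # Standard error of a proportion under the null hypothesis "model guesses at chance".
    standard_error = math.sqrt(chance * (1.0 - chance) / n_items)
    return (accuracy - chance) / standard_error

# T8 best zero-shot accuracy (22.3%) vs. chance (20%); n_items is an assumed value.
z_t8 = z_score_vs_chance(0.223, 0.20, n_items=1000)
print(f"T8 z-score vs chance: {z_t8:.2f}")
```

Under this assumed sample size the z-score stays below the usual 1.96 threshold, which is one way to read "near chance"; with the real item counts the conclusion could differ.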

Threat model

The setting is a vision-language model for robotics that must interpret ongoing manipulation executions from visual observations and task metadata, without access to privileged internal robot states or control signals beyond those provided. The model must infer intermediate physical states, process progress, and temporal relationships from visual and contextual cues. No active attacker or adversarial manipulation is assumed.

Methodology — deep read

The paper targets robotic manipulation systems that employ VLMs as process judges, critics, and failure detectors. These roles require fine-grained understanding of the physical and temporal evolution of a manipulation, specifically its intermediate states rather than only final success. No explicit adversarial attack scenario is described; the model under evaluation is simply a VLM trying to predict these process states accurately from visual and contextual inputs.

The data come from four robotic manipulation datasets that cover different aspects of manipulation: GM-100 (goal-conditioned tasks), RH20T (contact-rich multimodal traces), REASSEMBLE (primitive-level action chains), and AIST-Bimanual (bimanual coordination). These provide physically grounded execution traces with aligned video, state, and annotation signals. The ~58,000 QA items cover 12 diagnostic families including phase, contact, motion, temporal order, outcome prediction, and primitive transitions. ProcessData is split into ~85% ProcessData-SFT for supervised fine-tuning and ~15% ProcessData-Eval for evaluation, with strict episode-level splits to prevent leakage.
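
An episode-level split means every QA item derived from the same episode lands on the same side of the split. A minimal sketch of one way to do this, assuming each item carries an `episode_id` field (a hypothetical name) and using a hash-based assignment that is not necessarily the authors' procedure:

```python
import hashlib

def assign_split(episode_id: str, eval_fraction: float = 0.15) -> str:
    """Deterministically map an episode to 'sft' or 'eval' by hashing its ID."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "eval" if bucket < eval_fraction else "sft"

def split_items(items, eval_fraction: float = 0.15):
    """Group QA items by the split of their parent episode."""
    splits = {"sft": [], "eval": []}
    for item in items:
        splits[assign_split(item["episode_id"], eval_fraction)].append(item)
    return splits

# Toy data: 100 invented episodes with 4 QA items each.
items = [{"episode_id": f"ep{i // 4}", "question_id": i} for i in range(400)]
splits = split_items(items)
sft_episodes = {it["episode_id"] for it in splits["sft"]}
eval_episodes = {it["episode_id"] for it in splits["eval"]}
print(len(sft_episodes), len(eval_episodes), sft_episodes & eval_episodes)
```

Because the split is decided per episode rather than per QA item, frames from one execution can never appear in both training and evaluation, which is the leakage the paper's protocol is meant to rule out.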

RoboProcessBench decomposes manipulation episodes into Local Process Units (LPUs), temporally localized segments anchored on contact onset, persistence, or primitive boundaries. From each LPU, task family-specific visual extractors select input frames or short clips (depending on static vs dynamic reasoning needs) and label constructors derive ground-truth answers mapped to multiple-choice questions. For example, phase recognition uses a single frame; temporal ordering uses three shuffled frames; primitive chain restoration uses a small clip and masked action slots.
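
As an illustration of the label-constructor idea, here is a sketch of how a temporal-ordering item could be built from three frames of one LPU. Everything specific is an assumption: the function name, the A/B/C labeling, and the choice of five answer options (picked only because it matches the 20% chance level quoted for T8; the paper's actual option design is not given in this summary).

```python
import itertools
import random

def make_temporal_ordering_item(frames_chronological, n_options=5, seed=0):
    """Build a multiple-choice ordering question from three chronologically sorted frames."""
    if len(frames_chronological) != 3:
        raise ValueError("expected exactly three frames")
    if not 2 <= n_options <= 6:
        raise ValueError("three frames allow between 2 and 6 distinct orderings")
    rng = random.Random(seed)

    # Present the frames in shuffled order under the labels A, B, C.
    shown = list(frames_chronological)
    rng.shuffle(shown)
    labels = "ABC"
    label_of = {frame: labels[i] for i, frame in enumerate(shown)}

    # The correct answer lists the labels in true chronological order.
    correct = "".join(label_of[frame] for frame in frames_chronological)
    all_orders = ["".join(p) for p in itertools.permutations(labels)]
    distractors = [order for order in all_orders if order != correct]

    options = [correct] + rng.sample(distractors, n_options - 1)
    rng.shuffle(options)
    return {
        "shown": dict(zip(labels, shown)),
        "options": options,
        "answer_index": options.index(correct),
    }

frames = ["t0.png", "t1.png", "t2.png"]  # invented file names
item = make_temporal_ordering_item(frames)
print(item["shown"], item["options"], item["answer_index"])
```

The ground truth here comes for free from frame timestamps, which mirrors the paper's preference for labels derived from execution traces rather than manual annotation.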

The evaluation formulates all tasks as multiple-choice visual question answering and reports accuracy. Baselines include open-source VLMs (Qwen2.5/3, InternVL variants, RoboBrain, GLM) and closed-source models (Gemini, GPT variants, Claude variants). Two VLMs (Qwen2.5-VL-7B, InternVL-3-8B) are further fine-tuned on ProcessData-SFT using LoRA with frozen vision encoders, on NVIDIA H200 GPUs. Training uses standard supervised fine-tuning with rank-8 LoRA over multiple epochs (the exact number of epochs is not clearly specified).
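
For readers unfamiliar with LoRA, the standard formulation (general background, not specific to this paper) keeps each adapted weight matrix frozen and learns a low-rank update. The rank r = 8 is the value reported above; the scaling factor α and the layer dimensions d and k are symbols whose values are not given in this summary.

```latex
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r = 8
```

Only A and B are trained, so each adapted matrix contributes r(d + k) trainable parameters instead of dk.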

The evaluation protocol emphasizes task-family-level diagnostic analysis rather than a single aggregate score, separating distinct capabilities such as local state monitoring and temporal reasoning. Zero-shot models show fragmentary strengths and widespread weaknesses, especially on primitive-local progress and temporal reasoning tasks. Post-training improves many local-state and primitive-aware accuracies, but temporal ordering and reliable outcome prediction remain challenging.
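
The task-family-level reporting described above amounts to grouping predictions by family before computing accuracy. A minimal sketch, assuming each evaluation record has `family`, `prediction`, and `answer` fields (hypothetical names) and using invented toy records:

```python
from collections import defaultdict

def per_family_accuracy(records):
    """Return {family: accuracy} from records with 'family', 'prediction', 'answer' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["family"]] += 1
        correct[rec["family"]] += int(rec["prediction"] == rec["answer"])
    return {family: correct[family] / total[family] for family in total}

# Toy records; family IDs follow the T1..T12 naming used in this summary, values are invented.
records = [
    {"family": "T6", "prediction": "B", "answer": "B"},
    {"family": "T6", "prediction": "A", "answer": "A"},
    {"family": "T8", "prediction": "C", "answer": "D"},
    {"family": "T8", "prediction": "A", "answer": "A"},
]
accuracy_by_family = per_family_accuracy(records)
print(accuracy_by_family)
```

Reporting the dictionary rather than its mean is what keeps a strong family such as motion-state recognition from masking a near-chance family such as temporal ordering.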

Reproducibility is supported by a public project webpage and the shared ProcessData dataset. Code release and frozen weights are not explicitly stated, although the paper implies that ProcessData and training scripts will be available. The benchmark construction favors source-native labels and calibrates them with a limited human audit rather than full manual labeling, which preserves physical grounding and scalability.

As a concrete evaluation example: a manipulation episode is segmented into LPUs by detected contact events, and a short video clip capturing one LPU is extracted along with the task context. The VLM is prompted with a multiple-choice question, e.g., "What local process phase is currently active?" The model predicts the answer from the frame(s) using learned visual-linguistic embeddings. Ground-truth labels derive from execution traces aligned with the video. Accuracy over many held-out LPUs and task families quantifies the model's process-awareness.
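
The prompting step in this example can be sketched as plain text formatting plus answer parsing. This is an illustrative sketch only: the prompt template, the helper names, the task description, and the four phase labels are all invented here and are not taken from the paper, and the image inputs and the model call itself are omitted.

```python
import re
import string

def format_mcq_prompt(question, options, task_context=""):
    """Render a lettered multiple-choice prompt (text part only)."""
    letters = string.ascii_uppercase
    lines = []
    if task_context:
        lines.append(f"Task: {task_context}")
    lines.append(question)
    lines += [f"{letters[i]}. {option}" for i, option in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def parse_choice(response, n_options):
    """Return the index of the first standalone option letter, or None if absent."""
    valid = string.ascii_uppercase[:n_options]
    match = re.search(rf"\b([{valid}])\b", response)
    return valid.index(match.group(1)) if match else None

# Hypothetical phase labels, used only to show the format.
options = ["approach", "contact", "manipulate", "release"]
prompt = format_mcq_prompt(
    "What local process phase is currently active?",
    options,
    task_context="insert the peg into the hole",
)
print(prompt)
print(parse_choice("The answer is B.", len(options)))
```

A simple parser like this can mistake a capitalized article "A" for an option letter, so a real harness would likely constrain the output format more tightly.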

Technical innovations

  • Decomposition of robotic manipulation process understanding into 12 complementary static and dynamic VLM-evaluable question families covering fine-grained local states, primitive progress, temporal order, and outcome.
  • Construction of ProcessData benchmark of ~58k physically grounded QA items across 260 tasks, aggregating complementary robotics datasets for unified process-level supervision.
  • Introduction of Local Process Units (LPUs) as temporally localized decision segments anchored on contact and primitive boundaries for fine-grained process-aware evaluation.
  • Demonstration that supervised fine-tuning on ProcessData-SFT significantly improves process understanding in two VLM backbones, indicating that many of these cues are learnable from visual input.
  • Identification of persistent bottlenecks in temporal ordering and outcome prediction, guiding future enhancements in temporal modeling and process reasoning.

Datasets

  • ProcessData — ~58,000 QA items — Aggregated from GM-100, RH20T, REASSEMBLE, and AIST-Bimanual robotic manipulation datasets

Baselines vs proposed

  • Qwen2.5-VL-7B zero-shot: average static monitoring (T1,T2,T4,T10) accuracy ~34%–42% vs ProcessData-SFT-Qwen: 58.5%–92.5%
  • InternVL-3-8B zero-shot: primitive-local progress (T5) 34.1% vs ProcessData-SFT-Intern: 45.3%
  • Temporal ordering (T8) zero-shot best 22.3% vs SFT: no meaningful improvement (accuracy stays at ~17–22%, near chance)
  • Motion-state recognition (T6) zero-shot best ~69.2% vs SFT ~92.4%
  • Primitive chain restoration (T12) zero-shot up to 91.3% vs SFT >95%
  • Outcome prediction (T7) zero-shot ~54–62% vs SFT 63–66%, small improvements only

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.13040.

Fig 1: Overview of RoboProcessBench.

Figs 2–8 (page 3).

Limitations

  • RoboProcessBench focuses on diagnostic process-aware visual question answering, not closed-loop robotic control or end-to-end policy evaluation.
  • Multiple-choice QA accuracy is an indirect proxy for process understanding; real-world robotic deployment requires further integration and validation.
  • Temporal reasoning and outcome prediction remain largely unsolved, indicating current datasets and model architectures may not capture longer-term dependencies or causal structure.
  • Dataset sources and tasks cover limited robotic embodiments, viewpoints, and modalities; extension to failure-rich, multi-robot scenarios is needed.
  • Manual audit is limited to small subsets; some labels rely on automated annotation pipelines, introducing potential label noise or bias.
  • Vision encoders were frozen during fine-tuning; joint vision-language adaptation might yield different results but was not explored.

Open questions / follow-ons

  • How can VLM architectures be enhanced to better capture long-horizon temporal dependencies and causal relations to improve temporal ordering and outcome prediction?
  • What role could explicit memory mechanisms or temporal consistency objectives play in overcoming the current dynamic reasoning bottlenecks?
  • Can multi-modal integration of additional sensor modalities (e.g., force/torque, proprioception) synergize with vision-language inputs to improve process understanding?
  • How do fine-tuned process-aware VLMs perform when integrated into closed-loop robotic control pipelines for real-time progress monitoring and failure prediction?

Why it matters for bot defense

For practitioners in bot defense and CAPTCHA-like human verification in robotic manipulation or interactive systems, RoboProcessBench represents a crucial advancement in benchmarking fine-grained process understanding beyond outcome-based metrics. Its diagnostic granularity can help identify specific VLM weaknesses in interpreting ongoing physical interactions, which is critical when designing robust, real-time verification of legitimate user manipulations versus automated or adversarial bot processes.

The learnable supervision signals and post-training results suggest that structured, physically grounded QA can be used to adapt VLM-based judge models for richer process validation, potentially improving detection of subtle failures or bot-like deviations in multi-step tasks. However, remaining challenges in temporal reconstruction and outcome prediction underscore the importance of careful temporal modeling and evaluation for any CAPTCHA system relying on VLM feedback to verify ongoing process authenticity.

Cite

bibtex
@article{arxiv2606_13040,
  title={RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation},
  author={Dayu Xia and Yue Shi and Yao Mu and Huiting Ji and Chaofan Ma and Yingjie Zhou and Hua Chen and Yang Liu and Jiezhang Cao and Guangtao Zhai},
  journal={arXiv preprint arXiv:2606.13040},
  year={2026},
  url={https://arxiv.org/abs/2606.13040}
}


Articles are CC BY 4.0 — feel free to quote with attribution