RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

Source: arXiv:2605.30326 · Published 2026-05-28 · By Chunru Lin, Hongxin Zhang, Fenghao Yu, Zhehuan Chen, Thomas L. Griffiths, Yejin Choi et al.

TL;DR

RoboWits addresses a fundamental gap in robotic manipulation benchmarks by targeting the evaluation of cognitive reasoning, creative tool use, and robustness under unexpected challenges rather than focusing solely on skill-level execution. The paper introduces a novel bi-manual robotic benchmark consisting of 30 seed tasks and 208 mutated tasks designed to systematically test geometry-, material-, and assembly-based reasoning at graded difficulty levels. To enable this scale and diversity, an automated multi-agent task generation pipeline is proposed, which generates, mutates, verifies, and instantiates tasks with ground-truth evaluation metrics in a physics simulator.

Empirical evaluation of state-of-the-art Vision-Language-Action (VLA) models, imitation learning baselines, and modular planners show a significant performance gap between solving baseline seed tasks and their mutated variants with unexpected constraints. While VLA models can leverage pretraining to achieve limited success on seed tasks, their performance often collapses on mutation tasks that require adaptation, creative problem solving, and robust strategy revision. Modular planners with oracle state access outperform but still fail to generalize well to mutations without real-time replanning. RoboWits thus reveals that current robotic methods lack true reasoning-driven adaptability in deceptive or constrained environments, providing a rich benchmark to diagnose and advance such capabilities.

Key findings

RoboWits benchmark contains 30 seed tasks and 208 mutated tasks spanning geometry-, material-, and assembly-based reasoning, with difficulty scores from 1 to 5.
Mutation strategies pivot, trap, and add substantially increase task difficulty as confirmed by difficulty score deltas: pivot mean Δ=0.26, trap Δ=0.42, add Δ=0.09 (Table 2).
Pre-trained VLA model π0 achieves 60.8% average success rate on seed tasks but drops to 44.4% on mutated tasks, showing brittleness under unexpected conditions (Table 3).
Pure imitation learning baseline ACT performs poorly, with 11.0% success on seed tasks and 4.8% on mutations, highlighting reliance on vision-language pre-training.
Oracle-state VLM planners achieve up to 52.5% success on seed tasks but show significant performance degradation on mutated tasks (~29% drop without closed-loop feedback).
Closed-loop VLM planners with re-planning outperform open-loop by ~27.4 percentage points on seed tasks and ~29.3 points on mutations, emphasizing the importance of reactive adaptation (Fig 7).
VLAs struggle disproportionately on tasks requiring material interaction and assembly reasoning versus simpler geometric tasks.
The task generation pipeline can produce diverse, high-fidelity reasoning-centric robotic tasks with minimal human verification of only 30 unique evaluation scripts.

Threat model

The adversary is the environment or task conditions presenting deceptive or unforeseen constraints such as altered object materials, fixed or occluded containers, traps, and distractors. They obstruct naive or previously learned solution strategies to test the robot's ability to reason, adapt, and creatively re-plan under unexpected challenges. The adversary is not a malicious agent but an implicit source of uncertainty that the robot must overcome with cognitive flexibility and robust manipulation. The robot does not have prior knowledge of these mutations and must infer them online.

Methodology — deep read

The authors target the threat model of robot manipulation in unconstrained real-world environments where unexpected changes or deceptive conditions may invalidate naive solutions. They assume the adversary is the environment imposing unforeseen constraints or misleading object properties (e.g., fixed containers, narrow openings) that require cognitive reasoning and adaptive tool use. The robot is evaluated on zero-shot or fine-tuned settings without prior knowledge of specific mutations.

Data provenance stems from automated generation of 30 high-level seed tasks covering geometric, material, and assembly reasoning, each defined by natural language instruction, object attributes with function and appearance properties, 3D scene configurations, and executable evaluation metrics coded as binary success and continuous progress scores. These seed tasks are expanded into 208 mutated tasks through three mutation strategies: pivot (changing core objects or constraints to require novel solutions), trap (adding deceptive but unhelpful objects with similar appearance), and add (adding distractor objects to increase scene clutter).

Task generation is performed by a cooperative multi-agent pipeline where foundation model-powered agents independently generate seed task specs, verify feasibility and necessity of reasoning, generate metrics, mutate tasks along defined strategies, verify mutations, and instantiate tasks in a physics simulator (Genesis) with stable object placement and physical properties including diverse materials like fluids and soft bodies.

Robotic models are evaluated on both seed and mutated tasks using a bi-manual robot simulation (two 7-DoF arms with grippers, multiple cameras) with randomized object positions. Baselines include imitation-learned policies trained on 50 demonstrations (ACT), pre-trained VLAs (π0 and π0.5), and modular planners employing large language models (e.g., GPT-4o) for high-level planning with or without oracle object states and perfect low-level controllers.

Training protocols fine-tune models on 50 human teleoperated demos per 10 seed tasks in a multi-task fashion. Testing involves 50 randomized trials per seed task and 5 trials per mutation task to evaluate zero-shot generalization. Metrics include success rate and average progress scores. Ablations analyze closed-loop replanning benefits in modular planners.

The pipeline code, task specs, demonstration data, and evaluation scripts are publicly released for reproducibility, enabling extension and future research. However, some dataset 3D assets use licensed or proprietary models.

A concrete example is the "retrieve cube" seed task, where a cube trapped inside a narrow container must be retrieved and placed on a target area. Mutation strategies create variations like fixing the container to the table (pivot), adding water cups as traps, or distracting objects. The models are then tested on their ability to revise strategies beyond direct grasping, e.g., pouring or flushing out the cube creatively.

Technical innovations

A multi-agent automated task generation pipeline that cooperatively generates, mutates, verifies, and instantiates diverse bi-manual reasoning-centric robotic tasks at scale.
Formalizing manipulation tasks via a unified schema capturing geometric, material, and assembly reasoning with natural language instructions and structured object attributes.
Three mutation strategies (pivot, trap, add) designed to systematically increase reasoning complexity and deceptive challenges in robotic tasks.
Amortization of metric generation across mutation families, enabling scalable, high-fidelity task evaluation with minimal human verification effort.
Integration of vision-language models with modular planners incorporating oracle state feedback and closed-loop re-planning to improve reasoning-driven robotic problem solving.

Datasets

RoboWits — 208 tasks including 30 seed and 178 mutated tasks — publicly released with 50 human teleoperated demonstrations for 10 seed tasks

Baselines vs proposed

ACT (imitation learning baseline): Success Rate seed = 11.0%, mutation = 4.8% vs proposed VLM Planner with oracle state: Success Rate seed = up to 60.8%, mutation = 44.4%
π0 (pre-trained VLA): Success Rate seed = 60.8%, mutation = 44.4% vs ACT: seed 11.0%, mutation 4.8%
VLM Controller (oracle planning + scripted controller): seed success up to 24%, mutation success below 10%, worse than VLM Planner with perfect primitives
VLM Planner: closed-loop replanning success higher by 27.4 percentage points on seed and 29.3 points on mutations compared to open-loop (Fig 7)
π0.5 (improved pre-trained VLA): seed success 12.6%, mutation success 9.8%, slightly better than π0 but still low on mutations

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.30326.

Fig 1

Fig 1: Creative problem solving under unexpected challenges. The figure contrasts the

Fig 2

Fig 2: Overview of our automated task generation pipeline, where foundation model-powered

Fig 3

Fig 3 (page 2).

Fig 4

Fig 4 (page 2).

Fig 5

Fig 5 (page 2).

Fig 6

Fig 6 (page 2).

Fig 7

Fig 7 (page 2).

Fig 8

Fig 8 (page 2).

Limitations

The evaluation focuses primarily on manipulation tasks within a tabletop bi-manual setup, limiting generalization to mobile or more complex robots.
Simulated environments may not capture full real-world physical noise, sensor uncertainty, and object variability.
Adversarial robustness mainly tested through predefined mutations; no comprehensive adversarial policy attacks or dynamic environmental changes were studied.
Limited number of human demonstrations (50 per task) may constrain training of large models, especially for zero-shot generalization.
While mutation strategies increase challenge, they rely on manual design constraints and heuristics which may limit diversity of unexpected environmental conditions.
Oracle-state planners assume error-free perception and action primitives, unrealistic in physical robot settings.
The benchmark currently spans a predefined set of reasoning categories; other cognitive skills like long-horizon planning or social interaction are out of scope.

Open questions / follow-ons

Can vision-language-action models be enhanced with explicit failure detection and recovery mechanisms to improve robustness to mutations?
How can real-world robot sensory noise and physics unpredictability be incorporated into RoboWits to better bridge sim-to-real gaps?
What role could unsupervised or reinforcement learning play in enabling zero-shot generalization to unexpected task mutations without fine-tuning?
How might multi-agent collaboration or external tool-use augmentation further expand reasoning and creative problem solving in such benchmarks?

Why it matters for bot defense

For bot-defense and CAPTCHA designers, RoboWits highlights the current brittleness of embodied agents when facing unexpected environmental constraints and deceptive inputs. Similar to CAPTCHA challenges that aim to expose superficial pattern recognition by bots, this benchmark exposes robotic policies overly reliant on rote skill execution without deeper reasoning or adaptation. Insights from RoboWits task mutation strategies could inspire adversarial challenge generation for CAPTCHA systems, emphasizing unexpected perturbations or traps that require genuine problem solving rather than memorized responses. The evaluation framework also reinforces the importance of closed-loop feedback and real-time adaptation, which parallels bot-defense mechanisms that rely on dynamic challenge adjustments to foil automated solvers. However, direct application requires careful translation from robotic physical reasoning to language or image-based CAPTCHA domains.

Cite

bibtex

@article{arxiv2605_30326,
  title={ RoboWits: Unexpected Challenges for Robotic Creative Problem Solving },
  author={ Chunru Lin and Hongxin Zhang and Fenghao Yu and Zhehuan Chen and Thomas L. Griffiths and Yejin Choi and David Held and Chuang Gan },
  journal={arXiv preprint arXiv:2605.30326},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.30326}
}

RoboWits: Unexpected Challenges for Robotic Creative Problem Solving ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​