MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Source: arXiv:2606.03203 · Published 2026-06-02 · By Jia Yu, Zilong Wang, Xinyang Jiang, Dongsheng Li, Shuo Wang

TL;DR

MedCUA-Bench addresses the critical gap in evaluating computer-use agents (CUAs) on clinical graphical user interfaces (GUIs), which differ substantially from mainstream web or desktop applications in complexity, domain specificity, and safety requirements. The benchmark reconstructs 18 diverse clinical scenarios from real product manuals and open-source medical software spanning 10 medical domains, providing authentic GUI environments without privacy or licensing conflicts. Each task is specified with paired high-level (intent) and low-level (stepwise) goals to separate clinical reasoning from UI manipulation, and employs a deterministic safety-aware checker that evaluates not only task completion but also five clinically relevant safety dimensions, capturing subtle but serious failure modes that previous benchmarks miss.

Empirical evaluation of 23 vision-capable agents, including six closed-source and thirteen open-source models, reveals a significant reliability gap. The best closed-source agent attains only 54.2% strict success overall and below 9% on the real deployed OpenEMR system. Open-source agents average a mere 2.5% success, with the best at 16.2%. These results pinpoint large differences in pixel-level grounding and clinical UI control capabilities, with agents often exhausting interaction budgets or failing to follow complex clinical workflows safely. MedCUA-Bench thus provides a rigorous, reproducible testbed that exposes bottlenecks in clinical software automation and highlights the need for advances in domain grounding, safety validation, and multi-modal interface comprehension.

Key findings

Best closed-source model (GPT-5.4) achieves 54.2% strict task success on 432 clinical tasks; mean closed-source success is 33.7%, mean open-source success is only 2.5%.
On the real OpenEMR EHR system, all agents score below 9% strict success, with GPT-5.4, Claude-Opus-4.7, and Claude-Sonnet-4.6 tied at 8.3%.
Open-source model Qwen2.5-VL-32B achieves 16.2% success, outperforming larger models like Qwen2.5-VL-72B-AWQ (6.2%) and newer Qwen3.5 models (~2%).
Synthetic HTML-based scenarios yield higher success (up to ~54% for closed-source models) than the real systems, highlighting a gap attributed to UI complexity and server-side validation.
Paired goal design shows that explicit step-level procedural guidance improves closed-source models’ success by 3.3-5.1 percentage points but often harms open-source models due to prompt length.
No critical safety violations were recorded over 9,936 episodes; major and minor violations were concentrated mostly in open-source agents.
The main failure mode across agents is step-budget truncation (30-step limit), not unsafe or incorrect terminations; strongest agents time out in ~43% of episodes.
Failure mode analysis shows that closed-source agents fail mainly through exploration timeouts, while open-source agents fail through format lockouts or heavy repetition loops, suggesting that the bottleneck is pixel grounding for the former and instruction following for the latter.

Threat model

n/a — The work does not address adversarial or malicious threats but rather aims to benchmark autonomous agents operating on clinical GUIs using screenshots only, under realistic deployment constraints (no privileged API or DOM access). The focus is on assessing reliability and safety failures arising from imperfect UI understanding and control by agents with limited modality, not active adversaries.

Methodology — deep read

Threat Model & Assumptions: No adversary is defined. The evaluation focuses on autonomous computer-use agents that operate from screenshots only, without access to the internal DOM, APIs, or accessibility trees. Agents must infer intent and execute GUI actions reliably without privileged access, simulating the API-restricted conditions likely in real clinical environments.
Data: MedCUA-Bench comprises 18 clinical scenarios across 10 medical domains (Outpatient, Inpatient, ICU, Nursing, Imaging, etc.) with a total of 432 tasks. Fifteen scenarios are synthetic HTML GUI reconstructions derived from real vendor manuals and workflow specs to replicate authentic UI layouts and interaction patterns without proprietary assets. One scenario runs OpenEMR v7.0.2 in Docker, seeded with five synthetic patient records. Two scenarios use the OHIF Viewer connected to public DICOMweb radiology/pathology data. All task content is synthetic or open-source, with no real patient data.
Architecture / Algorithm: The benchmark interfaces agents via screenshots only. Agents receive a screenshot, a natural-language task goal (either high-level intent or low-level procedural step), a short action history, and error feedback. Action space is low-level mouse clicks with integer screen coordinates, keyboard typing, navigation shortcuts, and messaging outputs. The benchmark uses the BrowserGym pixel interface protocol for interactions and disables privileged metadata layers.
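The action space described above can be illustrated with a small parser. This is a minimal sketch under stated assumptions, not the benchmark's actual interface: the textual syntax (`click(x, y)`, `type("...")`, `press("...")`, `message("...")`), the `Action` dataclass, the `parse_action` helper, and the 1280x720 viewport are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical viewport size; the source does not state the screenshot resolution.
VIEWPORT = (1280, 720)


@dataclass(frozen=True)
class Action:
    kind: str                   # "click", "type", "press", or "message"
    x: Optional[int] = None     # integer screen coordinates, clicks only
    y: Optional[int] = None
    text: Optional[str] = None  # payload for type / press / message


# Assumed textual action syntax, for illustration only.
_CLICK = re.compile(r"^click\((-?\d+),\s*(-?\d+)\)$")
_TEXT = re.compile(r'^(type|press|message)\("(.*)"\)$', re.DOTALL)


def parse_action(raw: str) -> Action:
    """Parse one model output into an Action; raise ValueError if malformed."""
    raw = raw.strip()
    m = _CLICK.match(raw)
    if m:
        x, y = int(m.group(1)), int(m.group(2))
        # Reject clicks that fall outside the visible screenshot.
        if not (0 <= x < VIEWPORT[0] and 0 <= y < VIEWPORT[1]):
            raise ValueError(f"click outside viewport: ({x}, {y})")
        return Action("click", x=x, y=y)
    m = _TEXT.match(raw)
    if m:
        return Action(m.group(1), text=m.group(2))
    raise ValueError(f"unrecognized action: {raw!r}")
```

A harness built this way could feed the `ValueError` message back to the agent as error feedback. An agent that keeps emitting malformed actions would then burn its step budget without ever touching the interface.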
Training Regime: No models are trained. The evaluation runs 23 pretrained vision-capable large language models of varying scale (~7B to 72B parameters), covering both closed-source and open-source architectures. Each agent attempts all 432 tasks once with a 30-step action budget per episode; an episode terminates on task success, a critical safety violation, or timeout.
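The episode protocol (one attempt per task, a 30-step budget, and three termination conditions) can be sketched as a simple loop. The `policy` and `step_fn` callables and the outcome labels are hypothetical stand-ins for the agent and the environment harness; only the 30-step budget and the three termination conditions come from the source.

```python
from typing import Callable, Tuple

MAX_STEPS = 30  # per-episode action budget stated in the source

# Hypothetical hooks: policy maps the step index to an action string;
# step_fn applies the action and returns (task_done, critical_violation).
PolicyFn = Callable[[int], str]
StepFn = Callable[[str], Tuple[bool, bool]]


def run_episode(policy: PolicyFn, step_fn: StepFn,
                max_steps: int = MAX_STEPS) -> Tuple[str, int]:
    """Run one episode; return (outcome, steps_used).

    outcome is one of "success", "critical_violation", or "timeout".
    """
    for step in range(1, max_steps + 1):
        action = policy(step)
        done, critical = step_fn(action)
        if critical:
            # A critical safety violation ends the episode immediately.
            return "critical_violation", step
        if done:
            return "success", step
    # Budget exhausted without success or a critical violation.
    return "timeout", max_steps
```

For example, `run_episode(lambda step: "noop", lambda action: (False, False))` exhausts the budget and returns `("timeout", 30)`, the truncation outcome that the summary reports as the dominant failure mode.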
Evaluation Protocol: Metrics include strict task completion measured by a deterministic safety-aware checker evaluating five clinically important safety dimensions: patient identity, data accuracy, information fidelity, record integrity, workflow safety. Severity weights assign penalties (critical=1.0, major=0.3, minor=0.05) that reduce episode rewards. Success requires completing the full task with zero safety violations. Performance is reported per model, per scenario fidelity tier (synthetic, OpenEMR, OHIF), and for both intent-level and step-level goal prompts.
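The severity-weighted scoring can be sketched as follows. The weights and the five dimension names come from the summary above, but the aggregation rule is an assumption: the source says only that penalties "reduce episode rewards", so this sketch subtracts the summed weights from a base reward of 1.0 and clamps at zero. The `Violation` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

# Severity weights as reported in the source.
SEVERITY_WEIGHTS = {"critical": 1.0, "major": 0.3, "minor": 0.05}

# The five safety dimensions named in the source (identifiers are ours).
SAFETY_DIMENSIONS = {
    "patient_identity", "data_accuracy", "information_fidelity",
    "record_integrity", "workflow_safety",
}


@dataclass(frozen=True)
class Violation:
    dimension: str
    severity: str

    def __post_init__(self) -> None:
        if self.dimension not in SAFETY_DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")
        if self.severity not in SEVERITY_WEIGHTS:
            raise ValueError(f"unknown severity: {self.severity}")


def episode_reward(task_completed: bool, violations: Sequence[Violation]) -> float:
    """ASSUMED rule: base reward minus summed severity weights, floored at 0."""
    base = 1.0 if task_completed else 0.0
    penalty = sum(SEVERITY_WEIGHTS[v.severity] for v in violations)
    return max(0.0, base - penalty)


def strict_success(task_completed: bool, violations: Sequence[Violation]) -> bool:
    """Strict success: the full task is completed with zero safety violations."""
    return task_completed and len(violations) == 0
```

Under this assumed rule, a completed task with a single minor violation keeps most of its reward but no longer counts as a strict success, which is the distinction the strict metric is meant to capture.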
Reproducibility: The full benchmark, including scenario seeds, deterministic checker, environment harness, and synthetic patient data, will be released under a permissive research license. Reported evaluations use one pass per agent-task pair; run-to-run variability is not reported. A code release is announced, but no repository URL is given yet.

Concrete Example: In the OpenEMR FIND_PATIENT task, agents must identify and click through specific forms and dialogs on a dense legacy EHR interface based solely on screenshots. GPT-5.4 reaches only 8.3% strict success on the OpenEMR tier, often needing exploration and multiple retries within the 30-step limit, whereas many open-source agents fail to converge, become trapped in repeated or invalid clicks, and are truncated. The deterministic checker audits final form submissions for correct identifiers, verifying clinical correctness and safety dimensions beyond simple UI completion.
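The identifier audit mentioned above can be pictured as a deterministic field-by-field comparison. This is a sketch only: the field names, the mapping from fields to safety dimensions and severities in `FIELD_RULES`, and the `audit_submission` helper are all hypothetical, since the source does not describe the checker's internals.

```python
from typing import Dict, List, Tuple

# Hypothetical rules: form field -> (safety dimension, severity if wrong).
FIELD_RULES: Dict[str, Tuple[str, str]] = {
    "patient_id": ("patient_identity", "critical"),
    "last_name": ("patient_identity", "major"),
    "date_of_birth": ("patient_identity", "major"),
}


def audit_submission(expected: Dict[str, str],
                     submitted: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Compare a submitted form against ground truth.

    Returns a list of (field, dimension, severity) findings. Fields are
    visited in sorted order so the result is deterministic.
    """
    findings: List[Tuple[str, str, str]] = []
    for field in sorted(FIELD_RULES):
        if field not in expected:
            continue  # this task does not constrain the field
        want = expected[field].strip()
        got = submitted.get(field, "").strip()  # a missing field counts as wrong
        if want != got:
            dimension, severity = FIELD_RULES[field]
            findings.append((field, dimension, severity))
    return findings
```

An empty findings list means the submission matched on every audited field. Any mismatch would surface as a safety finding even when the agent reached the final screen, so UI completion alone would not count as success.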

Technical innovations

Introduction of MedCUA-Bench, the first screenshot-only interactive benchmark explicitly targeting clinical GUIs with real medical domain complexity and safety assessment.
Paired natural-language goals at two granularities (intent-level and step-level) disentangle clinical reasoning failures from UI execution failures.
A deterministic, safety-aware evaluation checker assesses task success along five clinically critical safety dimensions with severity-weighted penalties, moving beyond binary success metrics.
Combining synthetic HTML reconstructions derived from clinical product manuals with open-source medical software (OpenEMR, OHIF Viewer) to provide a reproducible, realistic clinical software testbed despite proprietary constraints.

Datasets

MedCUA synthetic HTML clinical GUIs — 15 of the 18 scenarios (the 432-task total spans all 18 scenarios) — reconstructed from product manuals
OpenEMR v7.0.2 demonstration patients — 1 scenario, 5 synthetic patients — open-source EHR system
OHIF Viewer with DICOMweb endpoint — 2 scenarios — open-source radiology/pathology imaging studies

Baselines vs proposed

GPT-5.4 (closed-source): Strict success = 54.2% vs open-source best Qwen2.5-VL-32B = 16.2%
Claude-Opus-4.7: Success = 52.6% vs GPT-5-Mini = 14.6%
OpenEMR tier (real deployed EHR): best models (GPT-5.4, Claude-Opus-4.7, and Claude-Sonnet-4.6, tied) = 8.3% strict success vs up to ~54% on synthetic HTML scenarios
Qwen2.5-VL-72B-AWQ (open-source): 6.2% success vs smaller Qwen2.5-VL-32B: 16.2%
Paired Goal Experiment: with step-level goals, GPT-5.4 success improved from 51.9% to 56.5% (+4.6 percentage points); many open-source agents declined in performance under step-level goals

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.03203.

Fig 1: Why medical GUI agents require a dedicated benchmark. General-purpose GUI benchmarks lack realistic

Fig 2: Overview of MedCUA. Clinicians construct environments spanning ten domains and provide two goal

Limitations

Synthetic HTML scenarios replicate workflows and terminology but not full vendor UI chrome, so success rates on synthetic tasks may overestimate real-world performance.
The OpenEMR and OHIF tiers offer the most realistic but also the hardest evaluation; OpenEMR success rates remain very low (at most 8.3%) even for top models.
Human baseline pilot was limited (single operator on 24 tasks), leaving full multi-annotator human performance unmeasured.
Each model runs only one episode per task, so observed success rates are point estimates without run-to-run variability or confidence intervals.
The safety checker recorded zero critical violations, but it is rarely exercised because most episodes end in step-budget timeouts rather than incorrect completions; the safety claims are therefore preliminary.
No adversarial evaluation or assessments under UI distribution shifts; robustness to UI changes remains untested.
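Although the paper reports no confidence intervals, a reader can attach a rough binomial interval to any reported rate. The sketch below uses the standard Wilson score interval; it is not part of the benchmark. It treats tasks as independent draws, which is an assumption, and it reflects only the uncertainty from the finite task sample, not the run-to-run variability noted above. Success counts must be back-computed from the reported percentages (for example, 54.2% of 432 tasks rounds to 234), which is also an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half, center + half


# Back-computed count for the best overall rate (an assumption, see above).
low, high = wilson_interval(round(0.542 * 432), 432)
print(f"54.2% of 432 tasks: roughly {low:.1%} to {high:.1%}")
```

Intervals are much wider at the tier level, where a rate rests on far fewer tasks than the benchmark-wide 432.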

Open questions / follow-ons

How to improve pixel-level grounding and reliable interface element identification on complex, legacy clinical GUIs with dense information and small targets?
What training or fine-tuning strategies enable better generalization to real deployed EHR systems like OpenEMR with robust server-side validation?
How can safety evaluation be extended to handle confident but incorrect task completions as agent capabilities improve?
What user interaction modalities (e.g., hinting, multi-modal prompts) would improve agent planning and execution in complex clinical workflows?

Why it matters for bot defense

MedCUA-Bench shows that current vision-language agents still struggle with reliable pixel-level grounding and precise control in complex, safety-critical visual interfaces such as clinical GUIs. For bot-defense and CAPTCHA practitioners, the benchmark illustrates how hard it remains to build agents that operate purely from screenshots across heterogeneous, domain-specific layouts with dense, sensitive content. Because pixel-level localization errors and subtle failures in safety-critical workflows are not captured by binary success metrics, the results suggest that bot-detection heuristics could likewise become more nuanced, for example through CAPTCHA designs built on complex task flows, visual context understanding, and multi-step validation. The large gap between controlled synthetic environments and real deployed systems also underscores the importance of evaluating agents on authentic, high-fidelity interfaces rather than simplified proxies.

Cite

bibtex

@article{arxiv2606_03203,
  title={MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents},
  author={Jia Yu and Zilong Wang and Xinyang Jiang and Dongsheng Li and Shuo Wang},
  journal={arXiv preprint arXiv:2606.03203},
  year={2026},
  url={https://arxiv.org/abs/2606.03203}
}
