OpenComputer: Verifiable Software Worlds for Computer-Use Agents

Source: arXiv:2605.19769 · Published 2026-05-19 · By Jinbiao Wei, Qianran Ma, Yilun Zhao, Xiao Zhou, Kangqi Ni, Guo Gan et al.

TL;DR

OpenComputer addresses the challenge of constructing realistic, verifiable software environments for computer-use agents that operate real desktop applications. Existing benchmarks rely either on static datasets or on human-curated tasks that are expensive to scale and difficult to verify accurately. OpenComputer introduces a verifier-grounded framework that makes reliable, programmatic verification central to both environment construction and evaluation. It builds app-specific verifiers that expose structured inspection endpoints into real software, then iteratively improves them using execution-grounded feedback from calibration runs with strong agents. This approach enables automatic synthesis of 1,000 realistic multi-step tasks across 33 diverse desktop applications, each requiring precise verification of fine-grained application state.

Experiments show that even top proprietary models (e.g., GPT-5.4) fail to fully solve these tasks, reaching only about 68% success, while open-source models perform notably worse than they do on other benchmarks. Crucially, OpenComputer's hard-coded verifiers correlate far better with human judgments than LLM judges working from screenshots and action traces, exposing the limits of LLM-based evaluation in desktop agent settings. The self-evolving verification layer improves verifier reliability by repairing checker-side errors through iterative feedback loops. Overall, OpenComputer is a step toward scalable, auditable benchmarking for computer-use agents in realistic software environments.

Key findings

OpenComputer supports 33 desktop applications with 1,000 finalized tasks covering browsers, office suites, creative tools, development environments, and communication apps.
Hard-coded verifiers match human labelers on 94.1% of item-level criteria and 113/120 tasks, whereas LLM-as-judge evaluation reaches 92.2% and 95/120 respectively (Fig 3).
The strongest agent, GPT-5.4, attains 68.3% full task success and 88.4% average partial reward, averaging 19 steps per task and 16.5 seconds per step (Table 2).
Open-source agents show sizeable performance drops relative to OSWorld-Verified; e.g., GUI-OWL-1.5-8B falls from 52.3% on OSWorld-Verified to 5.7% on OpenComputer.
The self-evolving verification layer identifies and repairs 89.4% of verifier-side errors in calibration tasks within three iterations, boosting human-checker agreement from 85.2% to 94.1% (Table 4).
On a CLI-compatible subset, the CLI agent (Claude Code) reaches a 67.2% success rate with a much shorter completion time (141s), versus 73.0% success and 622s runtime for the GUI agent (GPT-5.4) (Table 3).
Verification endpoints query reliable native interfaces (e.g., CDP, D-Bus, SQLite) rather than relying on visual heuristics or weak proxies.
The evaluation harness records full interaction trajectories and scores task success as the fraction of programmatic checklist criteria passed, enabling partial credit.

Threat model

The adversary is a computer-use agent interacting with sandboxed real desktop software through available GUI or CLI action interfaces and screenshots. The adversary cannot tamper with the verifier code, initialization scripts, or underlying sandbox state. The evaluation assumes the verifier accurately reflects the true software state and that agents operate without privileged knowledge beyond visible screens and exposed APIs.

Methodology — deep read

The threat model assumes adversaries are computer-use agents interacting with real desktop applications through screenshots and GUI or CLI actions, attempting to complete specified multi-step tasks in a sandboxed desktop environment. Agents gain no privileged knowledge beyond the visible screen and application state exposed via verifiers. Agents cannot alter verifier code or environment snapshots.

OpenComputer constructs tasks τ = (x, e, c) comprising natural language instructions (x), executable environment initializers (e) to prepare realistic state (files, configs, settings), and verifiable success criteria (c). It supports 33 real desktop applications spanning multiple domains. The environment state s0 is initialized from e, and the agent acts to reach terminal state sT.
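The task tuple can be pictured as a small data structure. The following is a minimal sketch, not the paper's implementation: the class names, the dictionary-based state, and the toy notes app are all hypothetical stand-ins for a real sandboxed application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: the source only specifies the tuple (x, e, c).
State = Dict[str, object]


@dataclass
class Criterion:
    description: str
    check: Callable[[State], bool]  # evaluated on the terminal state s_T


@dataclass
class Task:
    instruction: str                  # x: natural language instruction
    initializer: Callable[[], State]  # e: builds the initial state s_0
    criteria: List[Criterion]         # c: verifiable success criteria


def seed_notes_app() -> State:
    """Toy initializer standing in for an initialization script."""
    return {"notes": ["groceries"], "theme": "light"}


task = Task(
    instruction="Add a note titled 'meeting' and switch to the dark theme.",
    initializer=seed_notes_app,
    criteria=[
        Criterion("note 'meeting' exists", lambda s: "meeting" in s["notes"]),
        Criterion("theme is dark", lambda s: s["theme"] == "dark"),
    ],
)

s0 = task.initializer()
# Stand-in for an agent trajectory that completes only the first subgoal.
sT = {"notes": s0["notes"] + ["meeting"], "theme": "light"}
results = [c.check(sT) for c in task.criteria]
print(results)
```

Keeping each criterion as a separate callable is what later makes criterion-level comparison and partial credit possible.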

The core novelty is the app-specific verifier generation. Each application is paired with a Python verifier module (Va) exposing CLI subcommands returning JSON reflecting detailed software state, obtained through native and robust inspection channels like browser debugging protocols (CDP), SQLite profile databases, D-Bus, UNO interfaces, file parsing, accessibility APIs, etc. This verifier is treated as a software artifact with a strict debug-fix-retry cycle involving unit and integration tests with realistic fixtures. Test plans cover positive/negative cases, argument validation, and failure modes.
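To make the shape of such a verifier module concrete, here is a minimal sketch of one CLI subcommand that reads state from an SQLite profile database and returns JSON. The subcommand name `list-bookmarks`, the `--db` flag, and the `bookmarks` table schema are assumptions made for illustration; the paper's actual endpoints are not shown in this summary. The fixture at the end mirrors the positive and negative test cases the text mentions.

```python
import argparse
import json
import os
import sqlite3
import tempfile


def list_bookmarks(db_path: str) -> dict:
    """Read bookmarks directly from a (hypothetical) profile database."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT title, url FROM bookmarks ORDER BY id"
        ).fetchall()
    finally:
        conn.close()
    return {"ok": True, "bookmarks": [{"title": t, "url": u} for t, u in rows]}


def main(argv=None) -> str:
    """Dispatch a subcommand and emit its result as a JSON string."""
    parser = argparse.ArgumentParser(prog="verifier")
    sub = parser.add_subparsers(dest="command", required=True)
    cmd = sub.add_parser("list-bookmarks")
    cmd.add_argument("--db", required=True)
    args = parser.parse_args(argv)
    try:
        payload = list_bookmarks(args.db)
    except sqlite3.Error as exc:
        # Failure mode: report a structured error instead of crashing.
        payload = {"ok": False, "error": str(exc)}
    out = json.dumps(payload)
    print(out)
    return out


# Fixture-based check: one seeded database, one database without the table.
with tempfile.TemporaryDirectory() as tmp:
    seeded = os.path.join(tmp, "profile.sqlite")
    conn = sqlite3.connect(seeded)
    conn.execute(
        "CREATE TABLE bookmarks (id INTEGER PRIMARY KEY, title TEXT, url TEXT)"
    )
    conn.execute(
        "INSERT INTO bookmarks (title, url) VALUES (?, ?)",
        ("arXiv", "https://arxiv.org"),
    )
    conn.commit()
    conn.close()
    good = json.loads(main(["list-bookmarks", "--db", seeded]))
    bad = json.loads(
        main(["list-bookmarks", "--db", os.path.join(tmp, "empty.sqlite")])
    )
```

Returning a structured error payload rather than raising keeps the output machine-readable even when inspection fails, which is one plausible way to cover the failure-mode tests.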

Next, a self-evolving verification layer refines the verifier. Approximately 15 calibration tasks per app run strong agents in sandboxed environments, recording full action-screenshot trajectories and final states. The verifier's programmatic verdicts on success criteria are compared against a criterion-level LLM judge's reference judgment. Discrepancies attributed to verifier errors lead to iterative verifier improvements (code, endpoints, documentation) with a bounded budget of three iterations.
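The calibration loop can be sketched as follows, under heavy simplification. Everything here is hypothetical: verdicts are plain dictionaries, the `repair` callback stands in for whatever edits the verifier code, and the fault-attribution step uses a ground-truth table that would not exist in practice (the paper attributes disagreements by diagnosis, not by oracle). Only the bounded budget of three iterations comes from the source.

```python
from typing import Callable, Dict, List, Tuple

Verdicts = Dict[str, bool]  # criterion id -> pass/fail


def disagreements(verifier_out: Verdicts, judge_out: Verdicts) -> List[str]:
    """Criteria where the programmatic verdict and the judge differ."""
    return sorted(
        k for k in verifier_out
        if k in judge_out and verifier_out[k] != judge_out[k]
    )


def evolve(
    verifier: Callable[[], Verdicts],
    judge_out: Verdicts,
    is_verifier_fault: Callable[[str], bool],
    repair: Callable[[Callable[[], Verdicts], List[str]], Callable[[], Verdicts]],
    max_iters: int = 3,
) -> Tuple[Callable[[], Verdicts], List[int]]:
    """Repair the verifier until no blamed disagreement remains or budget ends."""
    history = []
    for _ in range(max_iters):
        diff = disagreements(verifier(), judge_out)
        blamed = [c for c in diff if is_verifier_fault(c)]
        history.append(len(blamed))
        if not blamed:
            break
        verifier = repair(verifier, blamed)
    return verifier, history


# Toy setup: the verifier is wrong on c1 and c2, the judge is wrong on c3.
truth = {"c1": True, "c2": False, "c3": True}
judge = {"c1": True, "c2": False, "c3": False}


def make_verifier(bugs: Verdicts) -> Callable[[], Verdicts]:
    def run() -> Verdicts:
        return {k: bugs.get(k, v) for k, v in truth.items()}
    run.bugs = bugs  # kept on the function so the toy repair can edit it
    return run


def toy_repair(verifier, blamed):
    fixed = dict(verifier.bugs)
    fixed.pop(blamed[0])  # fix one bug per iteration
    return make_verifier(fixed)


final, history = evolve(
    make_verifier({"c1": False, "c2": True}),
    judge,
    is_verifier_fault=lambda c: judge[c] == truth[c],  # oracle, toy only
    repair=toy_repair,
)
print(history, final())
```

Note that the loop does not force agreement with the judge: the disagreement on `c3` survives because it is attributed to the judge, not the verifier.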

Task generation synthesizes realistic user goals as candidate tasks, filters them for complexity and generatability, then grounds them with verifier endpoints. If verification coverage is missing for an inspectable outcome, new endpoints are created. The environment is synthesized by packaging required files, folders, configs, and profiles into initialization scripts.
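One way to picture the filtering and grounding step is as a coverage check between what each candidate task needs to inspect and what the verifier already exposes. This is an illustrative sketch only: the `steps` estimate, the `min_steps` threshold, and the endpoint names are assumptions, not details from the paper.

```python
from typing import Dict, List, Set, Tuple


def ground_tasks(
    candidates: List[Dict],
    endpoints: Set[str],
    min_steps: int = 3,  # hypothetical complexity threshold
) -> Tuple[List[str], List[str]]:
    """Return retained task goals and the endpoints that still need creating."""
    grounded: List[str] = []
    to_create: Set[str] = set()
    for task in candidates:
        if task["steps"] < min_steps:
            continue  # filtered out as too simple
        # Missing coverage becomes a request for a new endpoint.
        to_create |= set(task["needs"]) - endpoints
        grounded.append(task["goal"])
    return grounded, sorted(to_create)


candidates = [
    {"goal": "Rename sheet and bold header", "steps": 4,
     "needs": ["get-sheet-names", "get-cell-format"]},
    {"goal": "Open the app", "steps": 1, "needs": []},
    {"goal": "Export chart as PNG", "steps": 5,
     "needs": ["get-sheet-names", "list-exported-files"]},
]
existing = {"get-sheet-names", "get-cell-format"}

grounded, to_create = ground_tasks(candidates, existing)
print(grounded, to_create)
```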

For evaluation, the harness runs each task from a fresh sandbox, records the full screenshot-action sequence, then executes all verifier checks on the final state. The reward is fractional — the fraction of passed checks. Partial credit supports nuanced scoring when subtasks are partially completed. The system supports local or cloud execution with Docker sandboxes for reproducibility.
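The scoring rule described above reduces to a few lines. A minimal sketch, assuming the verifier checks arrive as a list of booleans (the function name is hypothetical):

```python
from typing import List, Tuple


def score(checks: List[bool]) -> Tuple[float, bool]:
    """Return (fractional reward, full success) for one task's checklist."""
    if not checks:
        raise ValueError("a task needs at least one verifier check")
    passed = sum(1 for ok in checks if ok)
    reward = passed / len(checks)
    return reward, passed == len(checks)


partial = score([True, True, False, True])
full = score([True, True])
print(partial, full)
```

A task with three of four checks passing earns reward 0.75 but does not count as a success, which is why the reported average reward can sit well above the success rate.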

Example end-to-end: for a task to edit a spreadsheet and save it, e launches the spreadsheet application with seeded files, the verifier uses UNO interfaces to extract exact cell contents and formatting as JSON, the agent interacts via screenshots and GUI commands, and the verifier checks after the run that the target cells contain the expected values. During calibration, disagreements between the verifier and the LLM judge help surface and fix brittle verifier assumptions. After the bounded repair iterations, verifier outputs align closely with human labels, enabling automated, large-scale, auditable evaluation.

The evaluated agents span proprietary and open-source models; large open-source models run on two H100 GPUs. Agents are evaluated on the 1,000 finalized tasks under step budgets and time constraints. Metrics include task success rate (all criteria met), average reward (partial credit), average steps, and average time per step. OSWorld-Verified results are reported alongside for cross-benchmark context, and a human annotation agreement study is used to calibrate the evaluators.
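The four reported metrics can be aggregated from per-episode records roughly as below. This is a sketch under assumptions: the `Episode` record is hypothetical, and average time per step is computed here as total seconds over total steps, which may differ from the paper's exact definition.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Episode:
    checks: List[bool]  # verifier verdicts on the final state
    steps: int          # number of agent actions taken
    seconds: float      # wall-clock time for the whole episode


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate success rate, reward, steps, and time per step."""
    return {
        "success_rate": mean(1.0 if all(e.checks) else 0.0 for e in episodes),
        "avg_reward": mean(sum(e.checks) / len(e.checks) for e in episodes),
        "avg_steps": mean(e.steps for e in episodes),
        # Assumption: pooled over all steps, not averaged per episode.
        "avg_seconds_per_step": (
            sum(e.seconds for e in episodes) / sum(e.steps for e in episodes)
        ),
    }


summary = summarize([
    Episode(checks=[True, True], steps=10, seconds=100.0),
    Episode(checks=[True, False], steps=30, seconds=300.0),
])
print(summary)
```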

All code, verifier modules, tasks, and sandboxes are released in an extensible open-source infrastructure supporting task execution, verifier extension, and dataset augmentation. However, some proprietary agent weights and datasets are not public, limiting full reproducibility of model benchmarks. Task environments are deterministic due to sandbox seeding.

Overall, OpenComputer exemplifies a tight integration of verification-grounded environment synthesis, iterative verifier enhancement through agent executions, and rigorous, auditable benchmarking for computer-use AI agents on real desktop software workflows.

Technical innovations

Verifier generation pipeline building app-specific state inspection modules using native application APIs and protocols (e.g., CDP, D-Bus, SQLite) for structured, reliable verification.
Self-evolving verification layer that iteratively improves verifiers by comparing programmatic outputs against criterion-level LLM judgments on calibration executions and repairing checker code and endpoints based on disagreement diagnosis.
Verifier-aware task synthesis that automatically generates multi-step, realistic computer-use tasks grounded in machine-checkable success criteria derived from verifier inspection capabilities.
Partial credit reward computation enabling nuanced agent benchmarking by scoring fraction of verifier checklist items passed rather than binary success/failure.

Datasets

OpenComputer Benchmark — 1,000 finalized tasks — spans 33 desktop applications including browsers, office suites, creative software, development tools, file managers, and communication apps — publicly released.

Baselines vs proposed

GPT-5.4: task success rate = 68.3% vs Claude-Sonnet-4.6: 64.4% vs Kimi-K2.6: 58.8% (Table 2)
GUI-OWL-1.5-8B: OSWorld-Verified 52.3% vs OpenComputer success 5.7%
EvoCUA-8B: OSWorld-Verified 46.1% vs OpenComputer success 10.9%
LLM Judge evaluation alignment with humans: 70.8% task-level agreement vs 94.1% for hard-coded verifier (Fig 3)
CLI agent (Claude Code) success rate 67.2%, faster (141s) vs GUI agents GPT-5.4 73.0% success, 622s runtime (Table 3)
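For a quick sense of scale, the gaps in the list above can be derived directly from the quoted figures. The snippet only does arithmetic on numbers already stated here; the variable names are illustrative.

```python
# (OSWorld-Verified %, OpenComputer %) as quoted in the list above.
open_source = {
    "GUI-OWL-1.5-8B": (52.3, 5.7),
    "EvoCUA-8B": (46.1, 10.9),
}

# Absolute drop in percentage points when moving to OpenComputer.
drops = {
    name: round(osworld - opencomputer, 1)
    for name, (osworld, opencomputer) in open_source.items()
}

# Runtime ratio of the GUI agent (622s) to the CLI agent (141s).
runtime_ratio = round(622 / 141, 2)

# Success-rate gap between GUI and CLI agents, in percentage points.
success_gap = round(73.0 - 67.2, 1)

print(drops, runtime_ratio, success_gap)
```

On these figures the CLI agent gives up about six points of success for a runtime roughly four times shorter.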

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.19769.

Fig 1

Fig 1: Overview of the OpenComputer verifiable software-world synthesis pipeline. Phase 1 generates app-


Limitations

Evaluation focuses primarily on desktop environments under a sandboxed setup; real-world variability and environments with ongoing background processes are not explored.
Calibration and self-evolution of verifiers are based on a limited set (~15) of tasks per app; remaining verifier corner cases may persist.
No adversarial or robustness evaluation of agents or verifiers is reported, for example agents attempting to deceive the verifier or applications left in flaky states.
Performance variability and stability of open-source agents across different deployments or model versions are not fully characterized.
Some proprietary models and datasets used for benchmarking are closed-source, impacting reproducibility of reported leaderboards.
The task generation pipeline prioritizes multi-step workflows but may still underrepresent very long or highly complex real-world tasks.

Open questions / follow-ons

How to extend verifier-grounded benchmarks to include real-time, non-deterministic desktop environments with background processes and network activity?
Can verifier evolution integrate adversarial testing to proactively discover and fix subtle verification gaps or agent exploits?
How can agent training pipelines integrate the OpenComputer infrastructure to improve robustness and generalization across desktop applications?
What are the scalability limits of automatic verifier generation and task synthesis for novel or highly specialized software domains?

Why it matters for bot defense

OpenComputer's approach is highly relevant to bot-defense and CAPTCHA practitioners wanting rigorous, reliable evaluation of autonomous agents performing complex GUI-driven tasks. Its verifier-grounded paradigm ensures that task completion is checked by deep inspection of application state instead of superficial visual heuristics, reducing false positives and adversarial bypasses. Security practitioners can draw parallels in designing verifiable interaction protocols that capture fine-grained state to detect subtle automation failures. The self-evolving verifier method offers a valuable blueprint for continuous improvement of detection heuristics informed by real executions, rather than static rules. While OpenComputer targets desktop automation rather than interactive CAPTCHAs, its emphasis on verifiable environment synthesis and auditable partial-credit scoring provides a framework for crafting robust human-vs-bot challenges requiring multi-step interaction and stateful proof of work. It highlights the importance of coupling environment realism with trustworthy verification to avoid overestimating bot capabilities via weak proxies like screenshots or heuristic rewards.

Cite

bibtex

@article{arxiv2605_19769,
  title={OpenComputer: Verifiable Software Worlds for Computer-Use Agents},
  author={Jinbiao Wei and Qianran Ma and Yilun Zhao and Xiao Zhou and Kangqi Ni and Guo Gan and Arman Cohan},
  journal={arXiv preprint arXiv:2605.19769},
  year={2026},
  url={https://arxiv.org/abs/2605.19769}
}
