TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Source: arXiv:2605.22535 · Published 2026-05-21 · By Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng et al.

TL;DR

TerminalWorld addresses the challenge of reliably evaluating autonomous agents on authentic, real-world command-line interface (CLI) tasks. Existing benchmarks rely heavily on manually curated, expert-authored tasks that often emphasize contrived, adversarial puzzles rather than the evolving daily workflows developers actually perform. TerminalWorld introduces a scalable data engine that ingests over 80,000 publicly shared terminal session recordings from asciinema and automatically reverse-engineers them into 1,530 high-fidelity executable tasks spanning 18 categories, from short commands to workflows exceeding 50 steps. A subset of 200 tasks is manually verified and used to benchmark frontier large language models (LLMs) and terminal agents. Results across eight models and six agents show that these real-world workflows remain difficult: the top configuration solves only 62.5% of tasks, and execution cost rises with only limited performance gains. TerminalWorld also captures capabilities not measured by prior expert benchmarks, whose scores correlate only weakly with TerminalWorld performance (Pearson r=0.20). By distilling authentic human usage and reproducing environments with executable tests, TerminalWorld offers a scalable benchmark for measuring and guiding agent progress on naturalistic CLI problem solving.

Key findings

TerminalWorld reverse-engineers 1,530 validated tasks from 80,870 raw asciinema recordings, spanning 18 real-world categories with 1,280 unique commands (91% absent from Terminal-Bench).
Of these, 200 tasks form the VERIFIED subset, manually audited by experts for correctness and environment reproducibility.
Among eight frontier LLMs tested with the Terminus-2 scaffold, the highest pass rate is 62.5% (Claude Opus 4.7), with pass rates ranging from 49.0% to 62.5%.
Open-weight models (e.g. Kimi K2.6, GLM 5.1) achieve pass rates similar to or better than closed models, at 4-8x lower compute cost per pass ($0.11-$0.20 vs $0.51-$0.94).
Effort (turns, tokens used) correlates negatively with success (Pearson r = -0.49 and -0.62), revealing an efficiency paradox in which more compute yields diminishing returns.
Performance across TerminalWorld tasks only weakly correlates with scores on expert-curated Terminal-Bench (Pearson r=0.20), indicating distinct capability requirements.
Terminal agents commonly find alternative valid command sequences rather than replicating original human workflows, with median command-set overlap just 21.4%.
Agent framework choice affects cost-effectiveness but not maximum capability, suggesting that practical gains come from reducing exploration friction rather than from increasing model complexity.
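The correlation figures above are Pearson coefficients. As a reminder of what that statistic measures, here is a minimal sketch that computes it from scratch. The sample data is made up for illustration, and whether the paper correlates per attempt or per model is not stated here, so the per-attempt framing is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative data (not from the paper): tokens used per attempt
# and whether that attempt passed (1) or failed (0).
tokens = [12_000, 15_000, 40_000, 55_000, 18_000, 70_000]
passed = [1, 1, 0, 0, 1, 0]

r = pearson_r(tokens, passed)
print(f"r = {r:.2f}")  # negative: more tokens goes with fewer passes here
```

A negative value means that attempts spending more tokens tend to fail more often. It says nothing about causation, since hard tasks can drive both the extra effort and the failure.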

Threat model

The adversary is effectively the complexity and noisiness of real-world terminal environments that the agent must autonomously operate within. The agent must handle unknown system states, hidden dependencies, transient resources, and multi-step workflows without privileged oracle knowledge. The threat model assumes no direct manual intervention during task execution; however, adversarial human interference or targeted evasion attacks are not addressed.

Methodology — deep read

Threat Model & Assumptions: The work assumes a black-box real-world adversarial environment where agents must autonomously issue shell commands and interpret outputs in authentic terminal workflows. The adversary is effectively the complexity and noisiness of genuine developer environments and workflows, including hidden dependencies and long multi-step commands. The agent does not have privileged access to oracle solutions beyond observations.

Data: The authors harvest 80,870 terminal session recordings from asciinema.org, each containing a transcript of commands and outputs. Extensive filtering removes recordings that contain sensitive information (PII), use terminal UI or GUI applications, capture unreproducible workflows (e.g. Windows or proprietary tools), or are trivial or exploratory sessions. This yields 9,492 high-quality CLI workflows.
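The filtering step can be pictured as a chain of rejection rules applied to each recording. The sketch below is illustrative only: the `Recording` fields, the Linux-only platform check, and the `min_commands` threshold are assumptions made for the example, not the paper's schema or criteria.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    # All fields are hypothetical labels for illustration.
    rec_id: str
    has_pii: bool
    uses_tui_or_gui: bool
    platform: str          # e.g. "linux", "windows"
    num_commands: int
    is_exploratory: bool


def rejection_reason(rec, min_commands=2):
    """Return why a recording is dropped, or None if it is kept."""
    if rec.has_pii:
        return "sensitive information"
    if rec.uses_tui_or_gui:
        return "TUI or GUI application"
    if rec.platform != "linux":          # assumed proxy for "unreproducible"
        return "unreproducible platform"
    if rec.num_commands < min_commands:  # assumed proxy for "trivial"
        return "trivial session"
    if rec.is_exploratory:
        return "exploratory session"
    return None


def filter_recordings(recordings):
    """Split recordings into kept items and a per-reason count of drops."""
    kept, dropped = [], {}
    for rec in recordings:
        reason = rejection_reason(rec)
        if reason is None:
            kept.append(rec)
        else:
            dropped[reason] = dropped.get(reason, 0) + 1
    return kept, dropped


sample = [
    Recording("a", False, False, "linux", 12, False),
    Recording("b", True, False, "linux", 8, False),
    Recording("c", False, False, "windows", 5, False),
    Recording("d", False, False, "linux", 1, False),
]
kept, dropped = filter_recordings(sample)
print([r.rec_id for r in kept], dropped)
```

Keeping a count per rejection reason makes it easy to see which rule removes the most data when tuning the filters.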

Architecture/Algorithm: TERMINALWORLD is a pipeline composed of four automated stages:

Synthesizing tasks by using LLMs (Claude Sonnet 4.6) to parse transcripts and infer outcome-oriented instructions (goal statements) and clean reference bash scripts, removing typos, retries, and irrelevant outputs.
Reproducing executable Docker environments by using an LLM agent (Claude Code) to infer required dependencies, generate Dockerfiles and docker-compose files, iteratively refining these via build and runtime errors.
Generating executable test suites within Docker containers by capturing pre- and post-execution filesystem snapshots and synthesizing test assertions to check persistent outputs against instructions. A trial-based feedback loop tests the suite by running the reference solution, empty runs, and partial runs to eliminate false positives/negatives.
Verification and selection to distill valid tasks for final benchmark.
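The third stage relies on comparing filesystem state before and after the reference script runs. A minimal sketch of such a snapshot-and-diff step follows. It hashes file contents with the standard library and is an illustration of the idea, not the authors' implementation. The function names are hypothetical, and the file writes stand in for running a script inside a container.

```python
import hashlib
import os
import tempfile


def snapshot(root):
    """Map each file path (relative to root) to a SHA-256 digest of its contents."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            state[os.path.relpath(full, root)] = digest
    return state


def diff_snapshots(before, after):
    """Classify paths as created, deleted, or modified between two snapshots."""
    created = sorted(set(after) - set(before))
    deleted = sorted(set(before) - set(after))
    modified = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return {"created": created, "deleted": deleted, "modified": modified}


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "config.txt"), "w") as fh:
        fh.write("port=80\n")
    before = snapshot(root)

    # Stand-in for executing the reference script in the environment.
    with open(os.path.join(root, "config.txt"), "w") as fh:
        fh.write("port=8080\n")
    with open(os.path.join(root, "report.txt"), "w") as fh:
        fh.write("done\n")

    after = snapshot(root)
    changes = diff_snapshots(before, after)

print(changes)
# {'created': ['report.txt'], 'deleted': [], 'modified': ['config.txt']}
```

The resulting change set tells a test generator which persistent outputs exist to assert on, which is the role the snapshots play in the description above.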

Training Regime: Not applicable, as this is a benchmark dataset and evaluation platform rather than a learned model training paper.

Evaluation Protocol: The authors evaluate eight state-of-the-art LLMs (including Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro) and six terminal agents (Terminus-2, Claude Code, mini-SWE-agent, OpenHands, Codex CLI, Gemini CLI) on the 200-task manually verified subset. The standard Harbor evaluation harness orchestrates execution in isolated Docker containers and measures pass rate, average turns, tokens, time, and compute cost. Pass/fail decisions rely on the executable test suites.
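To make the reported metrics concrete, the sketch below aggregates per-task trial records into a pass rate, average turns and tokens, and a cost-per-pass figure. The `Trial` record is hypothetical, and cost per pass is assumed here to mean total spend divided by the number of passed tasks. The paper's exact definition may differ. This is not the Harbor harness API.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record of one agent attempt on one task.
    task_id: str
    passed: bool
    turns: int
    tokens: int
    cost_usd: float


def summarize(trials):
    """Aggregate trial records into headline benchmark metrics."""
    if not trials:
        raise ValueError("no trials to summarize")
    n = len(trials)
    passes = sum(1 for t in trials if t.passed)
    total_cost = sum(t.cost_usd for t in trials)
    return {
        "pass_rate": passes / n,
        "avg_turns": sum(t.turns for t in trials) / n,
        "avg_tokens": sum(t.tokens for t in trials) / n,
        # Assumed definition: all spend (including failures) per solved task.
        "cost_per_pass": total_cost / passes if passes else float("inf"),
    }


# Made-up illustrative data, not results from the paper.
trials = [
    Trial("t1", True, 5, 1_000, 0.10),
    Trial("t2", False, 20, 9_000, 0.60),
    Trial("t3", True, 7, 2_000, 0.20),
    Trial("t4", False, 12, 4_000, 0.30),
]
summary = summarize(trials)
print(summary)
```

Under this definition, failed attempts still count toward cost, so an agent that burns many tokens on unsolved tasks looks expensive even when its pass rate is competitive.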

Reproducibility: Code and data are publicly released at their GitHub repository. However, some underlying LLMs (e.g., Claude Opus) are closed-source. Docker environments for tasks are reproduced in the pipeline.

End-to-End Example: For a recording that captures "block all IPs with over 10 failed ssh logins in a log file," the pipeline parses the transcript into an instruction and a clean set of bash commands (grep, awk, ufw). Using the reference script, it synthesizes a Docker environment with the needed tools, replays the script, captures the resulting file changes, and generates tests that validate the output against the specified block list. The test suite is refined to avoid brittle string matching by checking file contents structurally. The task is admitted to the benchmark only when the test suite behaves as expected across all trial runs.
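The example task and its structural check can be sketched in a few lines. The recording itself uses grep, awk, and ufw. The Python version below is an analogue for illustration. The sshd-style log line format and the one-IP-per-line block-list file are assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed sshd-style line, e.g.
# "Failed password for root from 203.0.113.7 port 4242 ssh2"
FAILED_LOGIN = re.compile(r"Failed password for .* from (\d{1,3}(?:\.\d{1,3}){3})")


def ips_to_block(log_lines, threshold=10):
    """Return the set of IPs with more than `threshold` failed SSH logins."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return {ip for ip, n in counts.items() if n > threshold}


def blocklist_matches(file_text, expected_ips):
    """Structural check: compare sets of IPs, ignoring order and whitespace."""
    found = {line.strip() for line in file_text.splitlines() if line.strip()}
    return found == set(expected_ips)


log = (
    ["Failed password for root from 203.0.113.7 port 4242 ssh2"] * 11
    + ["Failed password for invalid user admin from 198.51.100.2 port 22 ssh2"] * 10
    + ["Accepted password for alice from 192.0.2.1 port 22 ssh2"]
)
expected = ips_to_block(log)
print(expected)  # {'203.0.113.7'}

# An agent's output with extra whitespace still passes the structural check.
print(blocklist_matches("  203.0.113.7 \n\n", expected))  # True
```

Comparing sets rather than raw strings is what makes the check tolerant of ordering and formatting differences, so an agent that reaches the same block list by a different command sequence is not penalized.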

Technical innovations

Automated pipeline to reverse-engineer authentic terminal tasks from large-scale in-the-wild terminal recordings rather than manual curation.
Combining LLMs with an execution-feedback loop to iteratively reconstruct and verify reproducible Docker environments for terminal workflows.
Trial-based test suite generation refining tests via dynamic state inspection and multiple execution conditions (full, partial, nop) to minimize false positives/negatives.
Systematic statistical analysis showing a pronounced efficiency paradox where more compute used by agents correlates with diminishing returns on task success.
Demonstration that real-world terminal tasks require diverse unique commands (1280+) and complex multi-step workflows underrepresented in existing benchmarks.
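The trial-based refinement listed above can be read as an audit: a candidate test suite should accept the full reference run and reject both a partial run and a no-op run. The sketch below expresses that audit with hypothetical types, where a "state" is a mapping from file path to contents and a suite is any callable returning pass or fail. It is an assumption-laden illustration, not the paper's code.

```python
from typing import Callable, Dict, List

State = Dict[str, str]  # hypothetical: file path -> contents after a run


def audit_suite(
    suite: Callable[[State], bool],
    full: State,
    partial: State,
    nop: State,
) -> List[str]:
    """Return a list of problems found when running a suite on three trials."""
    problems = []
    if not suite(full):
        problems.append("false negative: reference run rejected")
    if suite(partial):
        problems.append("false positive: partial run accepted")
    if suite(nop):
        problems.append("false positive: no-op run accepted")
    return problems


# Made-up trial states for a task that must write one IP to blocked.txt.
full_state = {"blocked.txt": "203.0.113.7\n"}
partial_state = {"blocked.txt": ""}   # file created, content missing
nop_state: State = {}                 # nothing was done


def weak_suite(state: State) -> bool:
    # Only checks that the file exists, so it accepts the partial run.
    return "blocked.txt" in state


def strict_suite(state: State) -> bool:
    # Checks the actual contents.
    return state.get("blocked.txt", "").split() == ["203.0.113.7"]


print(audit_suite(weak_suite, full_state, partial_state, nop_state))
print(audit_suite(strict_suite, full_state, partial_state, nop_state))
```

A suite that produces an empty problem list under this audit is neither too lenient nor too strict for the three trials, which is the property the feedback loop is described as enforcing.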

Datasets

TERMINALWORLD full benchmark — 1,530 validated tasks — synthesized from 80,870 public asciinema recordings
TERMINALWORLD-VERIFIED subset — 200 manually reviewed, rigorously checked tasks — curated from full benchmark

Baselines vs proposed

Claude Opus 4.7 (Terminus-2 scaffold) pass rate = 62.5% vs GPT-5.5 53.5%
Open-weight Kimi K2.6 pass rate = 57.5% at $0.15 per pass vs GPT-5.5 at $0.94 per pass
Agent Terminus-2 (Claude Opus 4.7) pass rate = 62.5% vs Claude Code agent (same model) 58.0%
Agent Terminus-2 (Gemini 3.1 Pro) pass rate = 55.0% vs Gemini CLI agent 56.0%
Terminal-Bench expert benchmark scores correlate weakly with TerminalWorld performance (Pearson r=0.20)
Failed agent attempts consume 3.3× more tokens and 1.4× more time than successful attempts, dominating evaluation cost
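The last comparison is a ratio of resource use between failed and successful attempts. A minimal sketch of how such ratios could be computed is shown below. It uses means, which is an assumption since the aggregation used in the paper is not stated here, and the attempt data is made up for illustration.

```python
from statistics import mean


def failure_overhead(attempts):
    """Ratio of mean tokens and mean time for failed vs successful attempts.

    `attempts` is an iterable of (passed, tokens, seconds) tuples.
    """
    attempts = list(attempts)
    passed = [a for a in attempts if a[0]]
    failed = [a for a in attempts if not a[0]]
    if not passed or not failed:
        raise ValueError("need at least one passed and one failed attempt")
    return {
        "token_ratio": mean(a[1] for a in failed) / mean(a[1] for a in passed),
        "time_ratio": mean(a[2] for a in failed) / mean(a[2] for a in passed),
    }


# Made-up illustrative attempts: (passed, tokens, seconds).
attempts = [
    (True, 10_000, 100),
    (True, 20_000, 200),
    (False, 40_000, 200),
    (False, 50_000, 250),
]
overhead = failure_overhead(attempts)
print(overhead)  # {'token_ratio': 3.0, 'time_ratio': 1.5}
```

A token ratio well above the time ratio, as in the figures reported above, is consistent with failed attempts spending their extra effort on many short exploratory turns rather than on a few long-running commands, though that reading is an interpretation.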

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.22535.

Fig 1: An overview of the TERMINALWORLD pipeline. Our data engine automates terminal

Fig 2: Statistical comparison of 1,530 TERMINALWORLD tasks and 241 unique Terminal-

Limitations

Evaluation limited to Linux-based CLI terminal workflows; excludes Text User Interfaces and GUI interactions.
Some reproducible environments discarded due to inaccessible external resources or Docker incompatibilities.
Test suite generation relies on LLM agents and execution feedback but may still miss subtle failure modes or brittle tests.
Several of the evaluated models are proprietary and closed-source, which limits full reproducibility of the results.
Efficiency paradox and agent failures indicate room for progress but also suggest current LLM-based agents lack robust long-term planning.
Benchmark coverage, while diverse, is constrained to publicly shared workloads on asciinema and thus reflects a biased subset of developer workflows.

Open questions / follow-ons

How can terminal agents overcome the efficiency paradox and improve long-horizon planning and stopping criteria to reduce wasted compute?
Can the automated task synthesis pipeline be extended to incorporate and evaluate TUI or GUI interactions within terminal environments?
What techniques could close the gap between the high diversity of real-world commands and the agents' ability to discover and leverage them efficiently?
How does the benchmark generalize to private or enterprise terminal workflows not publicly shared and potentially involving sensitive, proprietary tools?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners who build or analyze automated agents that interact with command-line environments, TerminalWorld highlights the complexity and diversity of the real-world CLI workflows such agents must handle. It shows that existing expert-curated benchmarks poorly predict real-world performance, which underlines the importance of grounding evaluations in genuine usage data. The scalable pipeline offers a methodology for generating and maintaining evolving task benchmarks as attacker tooling and developer practices change. The observed efficiency paradox also shows that agent cost and interaction complexity must be balanced carefully, since additional compute does not reliably buy higher task success. While CAPTCHAs do not typically interface directly with terminal agents, the principles of synthesizing authentic tasks and building rigorous test oracles informed by execution feedback apply to designing robust challenge-response tests that resist automation. Overall, practitioners should consider TerminalWorld a model for constructing realistic, large-scale evaluation protocols for agents operating in complex CLI or otherwise stateful, multi-step environments.

Cite

bibtex

@article{arxiv2605_22535,
  title={TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks},
  author={Zhaoyang Chu and Jiarui Hu and Xingyu Jiang and Pengyu Zou and Han Li and Chao Peng and Peter O'Hearn and Earl T. Barr and Mark Harman and Federica Sarro and He Ye},
  journal={arXiv preprint arXiv:2605.22535},
  year={2026},
  url={https://arxiv.org/abs/2605.22535}
}
