CLI-Anything: Towards Agent-Native Computer Use

Source: arXiv:2606.03854 · Published 2026-06-02 · By Yuhao Yang, Tianyu Fan, Chao Huang

TL;DR

CLI-Anything addresses a fundamental misalignment in current AI-based computer-use agents that control software primarily through brittle GUI interactions. The dominant paradigm forces agents to mimic human visual perception and mouse-clicking, which is fragile due to pixel-level dependencies, timing issues, and interface changes. Instead, CLI-Anything proposes an "agent-native" computer interface design, where existing applications are lifted into command-line harnesses exposing machine-readable protocols: structured commands, explicit state, deterministic execution, and programmatic feedback. This transformation eliminates lossy visual interpretation and aligns with agents' strengths in structured data processing and precise programmatic control. The paper details the methodology, architecture, verification framework, and ecosystem infrastructure (CLI-Hub) to operationalize this vision. CLI-Anything thus enables AI agents to interact with professional software environments more robustly, facilitating reliable state management, artifact-backed execution, and recoverable workflows. A detailed case study on Blender and other applications demonstrates feasibility and benefits.

Key findings

Generated CLI harnesses expose explicit contracts comprising state, commands, inspection, rendering, verification, and discovery layers (S, C, I, R, V, D).
The CLI-Hub catalog currently includes 65 CLI-Anything harnesses across 29 categories and 18 third-party public CLI entries, supporting ecosystem growth.
The Blender harness exposes 54 CLI commands in 12 groups, allowing full scene construction, manipulation, and backend rendering, and is backed by 228 verification tests.
Harness verification tests perform end-to-end validation by generating real output files (e.g., images, documents) and checking their structural correctness (e.g., ZIP structure checks, pixel-level analysis).
The preview protocol publishes stable on-disk bundles with artifact manifests, summaries, previews, and replay trails, enabling consistent human and agent views.
Stateful REPL and session management enable agents to inspect project state, perform undo/redo, and recover from intermediate errors, improving over stateless CLI approaches.
CLI harness commands support JSON output for programmatic inspection and error reporting, facilitating robust agent interactions.
Using native backend APIs or scripting interfaces (e.g., Blender's bpy) avoids flaky GUI automation and supports deterministic execution.
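
The structural checks mentioned in the findings above (magic bytes, ZIP structure) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the paper's test suite; the function names `looks_like_png` and `is_valid_zip_container` are hypothetical.

```python
import io
import zipfile
from typing import Sequence

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte signature that opens every PNG file


def looks_like_png(data: bytes) -> bool:
    """Magic-byte check: does the payload start with the PNG signature?"""
    return data.startswith(PNG_MAGIC)


def is_valid_zip_container(data: bytes, required_members: Sequence[str] = ()) -> bool:
    """Structural check for ZIP-based output files (hypothetical helper)."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            if archive.testzip() is not None:
                return False
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return False
    return all(member in names for member in required_members)
```

Checks like these confirm that an exported file is well-formed, but they say nothing about whether its content is what the agent intended, which matches the limitation the summary notes about semantic errors.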

Threat model

The adversary is implicit: it is the brittleness of pixel-level GUI interaction, driven by ambiguous visual perception and timing dependencies. The approach assumes the software backend is trusted and reachable through programmatic interfaces, which removes the opportunity for incorrect visual states or interface drift to derail the agent. The model does not address intentional adversarial manipulation of backend software or malicious artifacts; it focuses on robustness against interface brittleness and observation noise.

Methodology — deep read

Threat model & assumptions: The paper does not describe an explicit adversary. The setting it targets is one where AI agents operate pixel-based GUIs whose brittleness they cannot control. CLI-Anything instead treats real software backends as trusted execution units that provide stable APIs and state, which mitigates brittleness stemming from GUI changes and timing.
Data & provenance: The authors build CLI harnesses over real professional applications (Blender, LibreOffice, Draw.io, FreeCAD, Shotcut, etc.). Harness extraction involves interface archaeology to identify native backends, command systems, scripting APIs, or file formats. The harnesses expose persistent project state, commands, and preview artifacts. The ecosystem data includes CLI-Hub registries cataloging 65 internal harnesses plus 18 public CLIs.
Architecture/algorithm: CLI harnesses implement a contract H = (S, C, I, R, V, D) where S is explicit persistent state (project files, undo history), C is typed domain commands grouped semantically, I is inspection APIs returning JSON status or schema, R delegates rendering/export to real backends, V runs unit and end-to-end tests verifying output artifacts, and D supports discovery via metadata and install paths. Harness CLI shells provide a REPL and one-shot commands with JSON response modes for programmatic agent use. Commands mutate state, run backend invocations, and publish preview bundles following a preview protocol. The harnesses tightly integrate real backend tools (e.g., Blender’s bpy, LibreOffice headless) ensuring rendering truthfulness.
Training regime: No training is involved; rather the methodology centers on a repeatable harness-generation Standard Operating Procedure: locate backend contract via interface archaeology, model domain nouns and commands, design explicit typed command grammars and recoverable errors, implement state handling and backend API calling, render/export and generate preview artifacts, verify correctness via tests (both unit and E2E), and publish skill metadata for discovery.
Evaluation protocol: The authors evaluate with extensive testing of harness commands ensuring they produce valid artifacts checked via file content rules, magic bytes, pixel/media verification, and process exit codes. The paper reports the Blender harness with 228 tests and detailed coverage of commands. Preview bundles enable synchronized human and agent inspection. Usage statistics from CLI-Hub logs illustrate increasing agent-driven invocations. Cross-application examples showcase the generality of the approach across CAD, office, media editing, and debugging tools.
Reproducibility: The CLI-Anything ecosystem and CLI-Hub repositories are publicly released with harnesses, generated skills, skill metadata files, previewing tools, and registration infrastructure to enable third parties to add new harnesses. The system encourages community contributions and transparency through standardized contracts. The paper provides sample code snippets and architecture diagrams illustrating pipeline components from harness lift to preview publication.
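
To make the contract H = (S, C, I, R, V, D) more concrete, here is a toy sketch of three of its layers: explicit state (S), typed commands (C), and JSON inspection (I), with a metadata field standing in for discovery (D). Rendering (R) and verification (V) are omitted because they need a real backend. The class `Harness`, the handler `add_object`, and the JSON field names are all assumptions made for illustration, not the paper's API.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Harness:
    """Toy stand-in for parts of the contract; every name here is illustrative."""
    state: dict = field(default_factory=dict)  # S: explicit project state
    commands: Dict[str, Callable[[dict, dict], None]] = field(default_factory=dict)  # C
    metadata: dict = field(default_factory=dict)  # D: discovery metadata

    def run(self, name: str, args: dict) -> str:
        """Execute one command and always answer with a JSON document."""
        handler = self.commands.get(name)
        if handler is None:
            # Recoverable error: tell the caller which commands do exist.
            return json.dumps({"ok": False, "error": "unknown_command",
                               "available": sorted(self.commands)})
        try:
            handler(self.state, args)
        except (KeyError, ValueError) as exc:
            return json.dumps({"ok": False, "error": "bad_arguments",
                               "detail": str(exc)})
        return json.dumps({"ok": True, "state": self.state})

    def inspect(self) -> str:
        """I: report current state and the command surface as JSON."""
        return json.dumps({"state": self.state, "commands": sorted(self.commands)})


def add_object(state: dict, args: dict) -> None:
    """Hypothetical command handler: append a named object to the project."""
    name = args["name"]  # raises KeyError when the argument is missing
    state.setdefault("objects", []).append(name)
```

The point of the sketch is that every call, including a failed one, returns machine-readable JSON, so an agent can branch on `ok` and `error` instead of parsing free-form text.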

Example end-to-end execution: For Blender, the harness exposes commands grouped semantically by scene, object, material, modifier, camera, light, animation, rendering, preview, and session management. A high-level agent command mutates a JSON scene graph representing the project state. This triggers generation of a bpy Python script that rebuilds the scene in headless Blender, which is then invoked to produce real render outputs. Preview bundles are built from rendered hero and workbench images plus manifest files, enabling subsequent agent inspection. Verification tests confirm outputs have expected file structure and content. Undo stacks and session JSON ensure recoverable stateful workflows.
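
The scene-graph-to-script step described above might look roughly like the sketch below. It only generates script text and does not run Blender. The scene-graph layout (`objects`, `camera`, `type`, `location`) and the function name `build_bpy_script` are assumptions for illustration. The emitted `bpy` operators are standard Blender Python calls, but the generated script has not been validated against any particular Blender version.

```python
import json

# Scene-graph object types mapped to the bpy operator that creates them.
PRIMITIVES = {
    "cube": "bpy.ops.mesh.primitive_cube_add",
    "sphere": "bpy.ops.mesh.primitive_uv_sphere_add",
}


def build_bpy_script(scene: dict, output_path: str) -> str:
    """Translate a JSON-style scene graph into the text of a bpy script."""
    lines = [
        "import bpy",
        "bpy.ops.wm.read_factory_settings(use_empty=True)  # start from an empty scene",
    ]
    for obj in scene.get("objects", []):
        operator = PRIMITIVES.get(obj["type"])
        if operator is None:
            raise ValueError(f"unsupported object type: {obj['type']}")
        x, y, z = obj["location"]
        lines.append(f"{operator}(location=({x}, {y}, {z}))")
        # json.dumps yields a quoted, escaped literal that is also valid Python.
        lines.append(f"bpy.context.object.name = {json.dumps(obj['name'])}")
    camera = scene.get("camera")
    if camera is not None:
        cx, cy, cz = camera["location"]
        lines.append(f"bpy.ops.object.camera_add(location=({cx}, {cy}, {cz}))")
        lines.append("bpy.context.scene.camera = bpy.context.object")
    lines.append(f"bpy.context.scene.render.filepath = {json.dumps(output_path)}")
    lines.append("bpy.ops.render.render(write_still=True)")
    return "\n".join(lines) + "\n"
```

A harness could write the returned text to a file and hand it to headless Blender, for example with `blender --background --python script.py`. Because the script is rebuilt from the scene graph each time, the JSON state stays the single source of truth and the render is reproducible from it.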

Technical innovations

Formulation of a comprehensive agent-facing software contract (S, C, I, R, V, D) that elevates CLI harnesses beyond GUI automation.
Systematic harness-generation SOP combining interface archaeology, explicit typed CLI design, deterministic backend execution, preview bundles, and verification tests.
Preview-as-display protocol separating backend truth generation from stable visual artifacts, enabling synchronized read-only human and agent inspection.
Integration of stateful command-line REPL environments with project/session state, undo/redo stacks, and dry-run modes to support recoverable, long-horizon agent use.
CLI-Hub infrastructure providing discovery, installation, registry, and metadata management for scalable deployment of CLI harnesses.
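
The undo/redo and dry-run behaviour listed above can be illustrated with a snapshot-based session. This is a minimal sketch under the assumption that project state is a JSON-like dictionary; the class name `Session` and its methods are hypothetical and not taken from the paper.

```python
import copy
from typing import Callable, List, Optional


class Session:
    """Snapshot-based undo/redo over a JSON-like project state (illustrative only)."""

    def __init__(self, state: Optional[dict] = None) -> None:
        self.state: dict = state or {}
        self._undo: List[dict] = []
        self._redo: List[dict] = []

    def apply(self, mutate: Callable[[dict], None], dry_run: bool = False) -> dict:
        """Run `mutate` on a copy of the state; commit the copy unless dry_run is set."""
        candidate = copy.deepcopy(self.state)
        mutate(candidate)  # an exception here leaves self.state untouched
        if not dry_run:
            self._undo.append(self.state)
            self._redo.clear()  # a new edit invalidates the redo history
            self.state = candidate
        return candidate

    def undo(self) -> bool:
        """Restore the previous snapshot; return False when there is none."""
        if not self._undo:
            return False
        self._redo.append(self.state)
        self.state = self._undo.pop()
        return True

    def redo(self) -> bool:
        """Re-apply the most recently undone snapshot; return False when there is none."""
        if not self._redo:
            return False
        self._undo.append(self.state)
        self.state = self._redo.pop()
        return True
```

Mutating a copy first means a failed or dry-run command never corrupts the committed state, which is what makes an intermediate error recoverable. Full snapshots are the simplest choice here; a real harness with large projects would more likely store diffs or inverse operations.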

Baselines vs proposed

Prior GUI agent methods rely on pixel-level control, which is brittle; CLI-Anything replaces it with deterministic command execution and structured state exposed through stable agent contracts.
CLI-Hub current entries: 65 CLI-Anything harnesses vs 18 third-party CLIs available for discovery and installation.
Blender harness: 54 CLI commands, 12 groups, 228 verification tests demonstrating full-scene manipulation and rendering pipelines.

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.03854.

Fig 10: Midpoint frames from real manipulation demos. Each frame is extracted from a recorded run and

Fig 12: CLI-Hub usage mix and trend. Agent calls dominate the current call mix, the total call volume has

Fig 11: Top combined categories across the harness and public CLI registries. The current catalog spans AI,

Limitations

CLI-Anything relies on availability of stable native backends, scripting APIs, or file-format access, limiting applicability to software without such exposure.
Visual feedback is provided via previews but GUI event streaming or real-time visual interactions remain outside the command-line contract.
Verification focuses on artifact correctness but may miss semantic errors not captured by current tests.
CLI-Hub infrastructure is currently partly human-facing; agent-first discovery, install, and error recovery remain future work.
Some software domains with opaque state or non-deterministic outputs may require complementary GUI automation.

Open questions / follow-ons

How to extend agent-native interfaces to software lacking comprehensive CLI, API, or scripting support?
Can the preview protocol and verification layers be expanded to cover semantic correctness beyond structural checks?
What is the optimal design for agent-first CLI-Hub discovery, installation, and fault recovery protocols?
How might agent-native interfaces integrate multimodal feedback, including real-time GUI events and visual streams, alongside command protocols?

Why it matters for bot defense

CLI-Anything highlights a shift that is relevant to bot defense and CAPTCHA work: forcing automated agents to mimic human GUI interaction is brittle and error-prone, whereas interfaces tailored to agent strengths (programmatic control, explicit state, deterministic feedback) can greatly improve reliability and robustness. For CAPTCHA designers, this underscores the potential of moving beyond pixel-based heuristics or visual puzzles toward API-level or command-line-like protocols for access control or challenge generation. Conversely, bots that use agent-native command interfaces could evade detection strategies focused on GUI event patterns. Bot-defense practitioners should therefore anticipate a move toward more sophisticated backend-side verification and protocol-level defenses as agents gain native software interfaces supporting structured, testable operations. CLI-Anything provides practical guidance for transitioning software ecosystems from visual to semantic interaction layers, a potentially critical inflection point for future bot and CAPTCHA research.

Cite

bibtex

@article{arxiv2606_03854,
  title={ CLI-Anything: Towards Agent-Native Computer Use },
  author={ Yuhao Yang and Tianyu Fan and Chao Huang },
  journal={arXiv preprint arXiv:2606.03854},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.03854}
}
