Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

Source: arXiv:2606.05233 · Published 2026-06-03 · By Nicholas Saban

TL;DR

This paper investigates the reproducibility and domain-specificity of prompt-injection red-teaming results against state-of-the-art computer-using agents (CUAs). Prior literature reports high Attack Success Rates (ASR) of 42–98% using RL-optimized injection strings on older or vulnerable models. The authors reproduce these prompt injections as hand-crafted templates and evaluate them on current frontier models Claude Sonnet 4.6 and GPT-5.4 across a broad, public benchmark, CUA-HANDCRAFTED, comprising 793 multi-step web browsing tasks, 56 attack templates spanning 8 attack families, and multiple prompt settings. They find zero successful multi-step injections (0/140 episodes, with a 95% upper bound ASR of 2.6%) on these frontier models. This resistance localizes to the model weights rather than system prompt variations. However, the same model weights are highly vulnerable—up to 100% ASR—to analogous hand-crafted prompt injections in a sister coding-agent setting (SKILLBENCH). The authors argue prior high ASR figures owe more to undisclosed RL-optimized injection text rather than attack categories themselves, and that safety hardening exhibited by frontier CUAs is highly domain-conditioned to browser use cases. Lastly, the paper audits reproducibility of recent top red-teaming papers and finds most rely on retired models or undisclosed optimized strings, making their headline ASR figures irreproducible.

Key findings

Across 793 episodes with 56 hand-crafted injection templates, zero successful multi-step attacks occurred on Claude Sonnet 4.6 and GPT-5.4 (0/140 multi-step episodes, Clopper-Pearson 95% upper bound 2.60%).
Prompt ablation (L0_bare to L3_hardened system prompts) on Sonnet 4.6 still yielded 0% attack success, localizing safety to model weights rather than prompt text.
Single-step denial-of-service attacks remain weakly effective on legacy models (6.16–6.88% ASR on Sonnet 4 and GPT-4o).
Cross-domain testing with SKILLBENCH coding-agent benchmark on the same weights shows up to 100% ASR on hand-crafted skill injections (match same threat model class, different surface).
RL-optimized injection text discovered by prior work is qualitatively different and more effective than hand-crafted templates; the former is usually undisclosed and unreproducible.
An RL-based black-box attacker (AutoInject adaptive random suffix) with limited budget ($10 API spend, 5 queries per pair) fails to raise ASR above 0% on frontier models, establishing an upper-bound on low-budget attacks.
Recent literature red-teaming papers mostly rely on retired target models and unreleased optimized injection strings, preventing full reproducibility of their high ASR claims.
Browser-domain safety hardening seen between Sonnet 4.5 to 4.6 (~38x drop in ASR) does not generalize to coding agents, showing domain-specificity.

Threat model

The adversary is a black-box prompt-injection attacker who can inject arbitrary text into third-party untrusted inputs for a deployed computer-using agent but has no white-box access to the model weights, gradients, or safety training data. The attacker observes only the model’s outputs via canary tokens and is query- and cost-limited, mimicking realistic deployment attack conditions. They aim to induce harmful agent behavior through prompt injections but cannot modify system-level defenses or internal state beyond the injected input.

Methodology — deep read

Threat Model & Assumptions: The adversary is a black-box prompt injection attacker who can inject arbitrary text into the agent's input prompt via a third-party surface (browser web UI or coding skill file) but has no direct access to model weights, gradients, or training data. They observe only the agent's output (via canary tokens) and have query-limited black-box API access replicating a deployed attacker's capabilities.
Data & Benchmark Setup: The authors build CUA-HANDCRAFTED, a public benchmark hosting 8 instrumented web sites (HR Portal, Project Dashboard, CRM, Banking, etc.) supporting 24 multi-step web tasks (4-12 steps each) plus 6 single-step read tasks. There are 56 hand-crafted attack templates scripted based on the prose descriptions of major recent red-teaming papers. Attack templates span 8 categories (denial_of_service, multi-step, unauthorized_action, data_exfiltration, etc.) and 5 depth levels from simple injections (D1) to expert reasoning attacks (D5). Injections target multiple channels: hidden text, pop-ups, visible text, DOM modifications, help text. Each episode executes a task with and without injected adversarial strings and logs detailed per-step traces.
Architecture / Algorithms: The victim models are Claude Sonnet 4.6 and GPT-5.4, which are state-of-the-art commercial frontier CUAs. Injection attacks are hand-crafted prompt strings appended or inserted by the benchmarking framework. The paper also runs an RL-based attacker from AutoInject, which performs adaptive random suffix search over token suffixes appended to hand-crafted base templates. This attacker uses a Qwen2 1.5B-parameter policy with a GRPO algorithm, mutating suffix tokens and selecting based on a binary canary signal reward.
Training Regime for RL Attacker: RL attack evaluation is run within a strict budget of 5 queries per (task, base-template) pair and $10 API spend for 100 total episodes. The attacker performs iterative mutation and self-transfer of best suffixes to maximize injection success. This is a limited, query-limited black-box baseline, not full RL training.
Evaluation Protocol: Metrics adopt AgentDojo conventions: Benign utility (BU), Utility under Attack (UuA), Attack Success Rate (ASR), and Safety = 1 - ASR. The evaluation counts multi-step task episodes only if the entire sequence completes with the injection goal followed. Canary tokens embedded in injected strings detect compliance. Evaluations run over multiple phases progressively strengthening attacks, prompt conditions, and injection channels. Cross-domain evaluation compares results on coding skills in the SKILLBENCH benchmark to measure domain impact. Statistical confidence intervals provided for key results.
Reproducibility: All attack templates, benchmarks (CUA-HANDCRAFTED), per-episode logs, evaluation harness, RL baseline code, and canary detectors are publicly released under permissive licenses. The tested commercial models Claude Sonnet 4.6 and GPT-5.4 remain API-accessible at submission. The paper audits recent literature for reproducibility and finds lack of released optimized strings undermines prior claims.

Example Run: One multi-step HR Portal task is injected with a hand-crafted 'polite tone' template derived from RL-Hammer. The benchmark injects the string into page text, runs the model to complete the task (e.g., submit PTO request). The canary detector scans outputs for injected goal success. Over 40 such injections with different templates produce 0 successful takeovers on frontier models, but similar hand-crafted injections achieve 100% success in SKILLBENCH coding agent tasks with identical model weights.

Technical innovations

CUA-HANDCRAFTED benchmark: a large, public, multi-domain, multi-step adversarial evaluation suite spanning 793 episodes, 56 human-readable templates, 8 attack families, and 4 prompt conditions for browser CUAs.
Cross-domain evaluation showing domain-conditioned safety: the same model weights resist prompt injections on web browsing surfaces but remain vulnerable on coding agent skill injection, demonstrating that safety hardening is surface-specific rather than model-global.
Reproducibility audit of recent red-teaming literature highlighting widespread reliance on retired models and unreleased RL-optimized attack strings, which invalidates direct comparisons and reproducibility of headline ASR numbers.
Demonstration that robustness to hand-crafted injection templates resides in model weights, not system prompts, via prompt ablation experiments with multiple safety prompt levels.
Provision of an RL-attacker baseline (AutoInject adaptive random suffix) constrained by realistic query and cost budgets, establishing an empirical upper bound on black-box RL attack success on frontier CUAs.

Datasets

CUA-HANDCRAFTED — 793 episodes across 24 browser tasks, 56 attack templates, 8 attack families — public release (MIT/CC-BY-4.0)
SKILLBENCH — ~140 coding agent skill injection episodes across multiple tasks — public release alongside CUA-HANDCRAFTED

Baselines vs proposed

Hand-crafted attack templates on Sonnet 4.6 and GPT-5.4 (browser tasks): multi-step ASR = 0/140 (0%, 95% CI upper bound 2.6%)
Single-step DoS on legacy models Sonnet 4 and GPT-4o: ASR = 6.16% and 6.88% respectively
AutoInject RL baseline attacker on Sonnet 4.6 and GPT-5.4 within 5-query/$10 budget: ASR = 0/100 (0%)
Hand-crafted skill-file injection in SKILLBENCH coding benchmark on same weights: Sonnet 4.6 ASR up to 100%, GPT-5.4 up to 79%, GPT-5.4-mini up to 96%
Legacy model GPT-4o multi-step ASR: 17% (artifactual due to failed benign utility)
Anthropic system card reports Sonnet 4.5→4.6 browser injection ASR drop from 49.36% to 1.29% (38× improvement)

Limitations

The hand-crafted templates approximate but do not replicate the exact undisclosed optimized RL-generated injection strings, possibly underestimating attack potency.
RL attacker evaluation used a limited budget of 5 queries per (task, template) pair and $10 API spend, which may not capture full capabilities of extensive RL optimization.
Benchmark surfaces tested are limited to browser UI and coding skills; other CUA surfaces like tool APIs or file systems remain untested for cross-domain safety generalization.
Evaluation focuses on currently accessible frontier weights (Claude Sonnet 4.6 and GPT-5.4); rapid model updates may alter vulnerability profiles over time.
The single-step denial-of-service attacks are still effective at low rates on legacy models, suggesting some residual vulnerabilities remain outside multi-step injection attacks.
The accessibility tree truncation and page DOM size may affect injection efficacy, though authors argue this is unlikely to explain results fully.

Open questions / follow-ons

How can safety training approaches be adapted or extended to cover multiple CUA deployment surfaces beyond browsers and coding skills to prevent cross-domain injection failures?
What characteristics of RL-optimized injection strings enable them to evade frontier safety training, and can interpretability methods identify such features to enable proactive defenses?
How do more powerful or white-box attackers, with access to gradients or larger query budgets, affect injection vulnerability on frontier CUAs relative to the black-box, query-limited attacker tested here?
Can automated generation methods for hand-crafting injection templates better approximate RL-optimized attacks to close the gap between hand-crafted and optimized injection success rates?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work highlights the critical importance of considering the domain and deployment surface when assessing the safety of computer-using agents. Red-teaming results and attack success rates reported for prompt injection may not generalize across usage contexts—browser-based CUAs can exhibit strong, weight-based resistance, whereas similar injection methods may remain effective in coding or other agent modalities. Hence, security teams should evaluate vulnerabilities in the specific context where agents are deployed, without extrapolating protection guarantees across domains.

The reproducibility audit also underscores the pitfalls of relying solely on literature ASR figures without access to optimized attacker inputs or models, which can lead to overestimates of risk. Providing fully public, hand-crafted baseline injection benchmarks aids defensive tooling by establishing a reproducible attack floor. Finally, the confirmation that safety robustness is encoded in model weights rather than system prompts suggests investment in weight-level safety training and continual domain-specific red-teaming is essential to maintain CUA integrity in real-world bot-mitigating applications.

Cite

bibtex

@article{arxiv2606_05233,
  title={ Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming },
  author={ Nicholas Saban },
  journal={arXiv preprint arXiv:2606.05233},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.05233}
}

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​