WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections

Source: arXiv:2605.15030 · Published 2026-05-14 · By Tri Cao, Yulin Chen, Hieu Cao, Yibo Li, Khoi Le, Thong Nguyen et al.

TL;DR

WARD addresses the critical challenge of defending autonomous web agents from prompt injection attacks embedded in open web content, which can manipulate agents into executing malicious behaviors. Prior guard models suffer from limited generalization to unseen domains and attack modalities, high false positive rates on benign content, inefficiencies causing inference latency, and vulnerabilities to adaptive adversarial attacks including attacks targeting the guard itself. This work introduces WARD, a practical defense framework combining a large-scale, diverse dataset (WARD-Base) covering 177K samples from 709 URLs and 10 replicated platforms, a guard-targeted attack dataset (WARD-PIG), and an adaptive adversarial attack training procedure (A3T) that iteratively evolves attacker and guard models to improve robustness. Experiments demonstrate that WARD achieves near-perfect recall on multiple out-of-distribution (OOD) benchmarks, maintains very low false positive rates preserving agent utility, robustly withstands guard-targeted and adaptive attacks under significant distribution shifts, and runs efficiently in parallel with web agents without latency overhead. WARD thus provides a strong practical foundation for secure web-agent operations in adversarial open-web environments.

Key findings

WARD-Base dataset contains 177,585 samples from 709 distinct URLs, 10 replicated platforms, covering 4 injection locations, 6 attack goal types, and 13 injection channels.
WARD-0.8B and WARD-2B models achieve over 99% recall and accuracy on the WARD-Test dataset and out-of-distribution benchmarks Popup, EIA, VPI, and WASP, significantly outperforming 25 baseline models.
WARD maintains 100% recall under Prompt Injection on Guard (PIG) attacks across all tested benchmarks and attack modalities.
Under adaptive adversarial attack training (A3T) with up to 10 attack attempts per sample, WARD reduces Sample Success Rate (SSR) and Attempt Success Rate (ASR) to below 6%, whereas state-of-the-art baselines show SSR and ASR above 70%.
In a cross-domain adaptive attack evaluation on 1,700 samples spanning 13 diverse datasets (QA, summarization, RAG, code), WARD achieves average ASR around 3%, compared to over 90% for prior defenses like PromptArmor and PromptGuard.
WARD achieves near-zero false positive rates (<0.5%) and negligible agent utility degradation across multiple WebArena benchmark tasks, while existing guards have false positive rates often exceeding 5%.
WARD runs more efficiently than competing guard models, generating fewer output tokens and requiring less inference time, and can operate in parallel with agent models without adding latency.
The proposed Adaptive Adversarial Attack Training (A3T) co-evolves attacker and guard models using memory mechanisms and validator filtering to find progressively stronger adversarial samples and thereby improve robustness.

Threat model

The adversary can insert malicious prompt injection content into any web input the agent observes, including HTML and screenshots, aiming to manipulate the agent into unsafe or unauthorized actions. The adversary can also target the guard model itself (Prompt Injection on Guard - PIG) by embedding deceptive instructions seeking to cause misclassification. Attackers can adapt iteratively based on guard outputs but lack access to internal guard states or future model parameters. The guard must detect malicious input without knowledge of internal agent reasoning, operating solely on observed multimodal inputs.

Methodology — deep read

Threat Model & Assumptions: The adversary embeds deceptive prompt injection content into open web observations received by the web agent at each interactive step. Injection may occur in HTML, screenshot, or both modalities and aims to manipulate the agent or the guard model itself (Prompt Injection on Guard - PIG). The adversary can adapt attacks iteratively by observing guard responses but cannot access internal agent or guard states. The guard sees only the observed webpage content and classifies whether the input contains malicious injection.
Data: The authors build two complementary branches for data collection. The overlay branch collects real web pages from high-traffic URLs (709 URLs) across diverse categories, generating benign user tasks and recording HTML and screenshots. The native branch simulates 10 high-risk platforms (e.g., messaging, social media) replicating user-generated content scenarios. The combined WARD-Base dataset contains 177,585 samples (about half malicious injections) covering 4 injection locations (HTML only, screenshot only, both, none), 6 attack goal types (e.g., data exfiltration, unauthorized actions), and 13 injection channels (e.g., popups, banners, comments). Samples are labeled with detailed metadata and human/LLM-generated reasoning annotations via an iterative generator–evaluator loop.
Architecture / Algorithm: WARD is a guard model trained to classify inputs as malicious or benign and also localize the injection and infer the attack goal with natural language reasoning output. The input is multimodal: HTML content and screenshots paired with the user instruction. The model is fine-tuned with supervised learning on the annotated WARD-Base and WARD-PIG datasets. WARD-PIG extends WARD-Base by augmenting samples with guard-targeted PIG prompts to teach detection of adversarial manipulations targeting the guard itself.
Training Regime: Training proceeds in three stages. (1) Fine-tuning on WARD-Base for general prompt injection detection across domains and modalities with cross-entropy loss over all target tokens (label, location, goal, reasoning). (2) Further fine-tuning on WARD-PIG to build robustness against guard-targeted adversarial prompts. (3) Adaptive Adversarial Attack Training (A3T), an iterative process where a memory-based attacker generates new adversarial injections that bypass the current guard model; a validator filters valid attacks, and successful evasions are added to the training set used to update the guard model with Reinforcement Learning via PPO optimizing a reward based on correct label and location predictions. Training uses multiple cycles, each generating progressively harder adversarial samples. Training details such as batch size, epochs, optimizer, and seeds are described in appendices (not summarized here).
Evaluation Protocol: WARD is evaluated using multiple held-out benchmarks including WARD-Test (50 URLs/6 platforms disjoint from training), Popup, EIA, VPI, WASP, and PIArena for cross-domain testing. Metrics include recall, precision, F1, accuracy, false positive rate, agent utility degradation, and latency/runtime. Adaptive adversarial robustness is tested via A3T with up to 10 attack attempts per sample, measuring success rates of attacks. Robustness to PIG attacks (guard-targeted) is evaluated separately. Agent utility is assessed via false positive flags and degradation in task success using WebArena benchmarks with GPT-4o agents. Efficiency measured by output tokens and inference time on NVIDIA H200 GPU in parallel agent deployment. Baselines are 25 prior guard models spanning closed-source APIs, open-source instruct models, and existing guard architectures.
Reproducibility: Code and models are publicly released at https://github.com/caothientri2001vn/WARD-WebAgent. Datasets WARD-Base and WARD-PIG are described but links or access details are not specified. Detailed training prompts and PIG prompts are in appendices. Model checkpoints include WARD-0.8B and WARD-2B variant sizes.

Example End-to-End: Starting from collected benign webpage observation (HTML + screenshot + user instruction), authors generate multiple attack variants by injecting malicious prompts into specified channels and locations. These labeled samples (including reasoning annotations) train the base guard model. Then PIG prompts are inserted targeting the guard's output to fine-tune for robustness. Next, adaptive attacks are generated iteratively: a memory-based attacker produces candidate prompts attempting to evade current guard; a validator checks prompt validity; successful evasions augment training data to update guard via PPO. Over multiple cycles, guard model robustness improves, demonstrated by recall >99% and low false positives, as well as low attack success rates under adaptive stress tests.

Technical innovations

WARD-Base: A large-scale, multi-domain, multimodal dataset of 177K samples from real and simulated web platforms covering diverse injection locations, channels, and attack goals for training prompt injection guards.
WARD-PIG: A novel dataset augmenting prompt injection data with guard-targeted prompts that explicitly seek to manipulate the guard model itself, improving robustness to such adversarial attacks.
Adaptive Adversarial Attack Training (A3T): An iterative co-evolution training framework coupling a memory-augmented attacker generating adaptive prompt injections and a guard model updated via reinforcement learning to counter evolving attacks.
Dual-branch data collection pipeline combining overlay attacks on real webpages and native attacks from user-generated content simulation for comprehensive injection coverage.
Iterative generator–evaluator loop for reasoning generation that encourages the model to produce grounded explanations rather than relying on spurious correlations.

Datasets

WARD-Base — 177,585 samples — collected from 709 real URLs and 10 replicated simulated platforms
WARD-PIG — 21,000+ samples — derived from WARD-Base with guard-targeted prompt injections

Baselines vs proposed

WebAgentGuard-4b: Recall on WARD-Test = 78.4% vs WARD-0.8B: 99.6%
WebAgentGuard-8b: Recall on Popup benchmark = 85.2% vs WARD-2B: 100%
PromptArmor: Cross-domain attack success rate on PIArena = ~90% vs WARD-0.8B: 3.23%
DataSentinel: Cross-domain ASR on PIArena = 45.71% vs WARD-2B: 3.08%
False positive rate on WebArena tasks: WebAgentGuard-4b = 9.21% avg vs WARD-0.8b = 0.26%
Inference efficiency: WARD guard runs faster than agent model itself; existing guards slower and higher token output

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.15030.

Fig 2

Fig 2: Overview of the WARD framework.

Fig 5

Fig 5: Examples of popup-based prompt injection across datasets. Left: a popup example from

Fig 6

Fig 6: Representative failure case on Kleinanzeigen. The attack inserts a plausible “Smart

Fig 7

Fig 7: Injection channel distribution in WARD-Base. The left chart shows overlay-branch channels,

Limitations

WARD primarily evaluates on simulated and selected datasets; real-world web environments may have more diverse or novel attack types beyond current coverage.
The large WARD dataset and PIG augmentation datasets are not fully public, limiting independent reproducibility and external evaluation.
Adaptive adversarial training depends on the strength and design of the memory-based attacker; unknown if all realistic sophisticated attack strategies are covered.
Defenses tested mostly on scripted automated agents; potential gaps exist for human-in-the-loop or hybrid agents with different behavior.
Efficiency reported on high-end GPU hardware; latency and resource requirements on lower-end or embedded deployments remain to be evaluated.
Study focuses on malicious prompt injections but does not explicitly address side-channel or timing attack vectors.

Open questions / follow-ons

How well does WARD generalize to completely novel web platforms or injection modalities not seen in training, such as video or dynamic interactive content?
Can the adaptive adversarial attacker framework (A3T) be extended to cover multi-agent or multi-step coordinated attacks across sessions?
What are the trade-offs between guard model size, complexity, and latency for deployment in resource-constrained environments?
How can guard reasoning explanations be used for human-in-the-loop verification or automated downstream mitigation strategies?

Why it matters for bot defense

For bot-defense practitioners, WARD demonstrates that robust prompt injection detection must be multimodal and trained on large, diverse datasets spanning numerous injection locations, channels, and semantic attack goals to generalize well. Existing approaches focusing on text-only or static guards struggle with out-of-distribution attacks and guard-targeted adversarial manipulations. The Adaptive Adversarial Attack Training (A3T) method illustrates the importance of simulating evolving adversaries to improve robustness beyond static defenses. Captcha and bot detection systems could integrate similar guard models inspecting web content that agents process, benefiting from WARD's ability to maintain near-zero false positives to preserve legitimate user workflows while preventing adversarial hijacking. Efficiency evaluations indicate feasibility of running guard checks in parallel without user-perceived latency. More broadly, WARD's dataset construction and guard-targeted prompt modeling offer a framework to develop CAPTCHA challenges or gating mechanisms that are harder for adversarial bots to evade over time.

Cite

bibtex

@article{arxiv2605_15030,
  title={ WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections },
  author={ Tri Cao and Yulin Chen and Hieu Cao and Yibo Li and Khoi Le and Thong Nguyen and Yuexin Li and Yufei He and Yue Liu and Shuicheng Yan and Bryan Hooi },
  journal={arXiv preprint arXiv:2605.15030},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.15030}
}

WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​