Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

Source: arXiv:2606.13385 · Published 2026-06-11 · By Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai et al.

TL;DR

This paper addresses an under-explored aspect of prompt injection attacks on large language model (LLM)-driven web agents: harm in real-world deployments is asymmetric and depends on which stakeholder bears it. Whereas prior benchmarks focus on whether an attack is technically feasible, the authors introduce StakeBench, a stakeholder-centric benchmarking framework. StakeBench explicitly models harm to three stakeholder categories (Users, third-party Sellers, and the Platform) and decomposes prompt-injection attacks into 12 concrete objectives, instantiated as 264 adversarial cases over realistic e-commerce scenarios. The evaluation covers two agent architectures and two LLM backbones, with multi-axis metrics that jointly capture attack success, task disruption, and behavioral irregularity.

Key findings reveal broad and heterogeneous vulnerabilities: no attack objective is reliably resisted by current agents, and failures manifest through qualitatively distinct regimes such as stealthy parasitism (attack succeeds without disrupting user tasks), misaligned disruption (task disrupted but attack objective not realized), and compounded failure (both attack success and task failure). These regimes vary strongly by stakeholder target and depend on factors like semantic alignment of adversarial payloads with user intent and the timing of exposure to malicious content. The paper highlights that conventional attack-centric evaluation obscures these asymmetric, multi-dimensional risks, and advocates for a stakeholder-aware perspective to better assess deployed LLM web agents’ security.

Key findings

Indirect prompt injection (IPI) Attack Success Rates (ASR) range from 41.67% to 68.16% across two agent systems (NanoBrowser and BrowserUse) paired with GPT-5 and Gemini-2.5-Flash backbones.
Different LLM backbones notably affect vulnerability: switching from GPT-5 to Gemini-2.5-Flash raises IPI ASR by 26.49 points on NanoBrowser and 6.2 points on BrowserUse (Table 1).
Seller-targeted attacks have the highest ASR (up to 62.5%) and the highest Task Deviation Rate (TDR, up to 38.66%), meaning attacks often succeed and also disrupt the user's task; User-targeted attacks show a lower TDR (~18%), a stealthier failure mode (Table 2).
Platform-targeted attacks show moderate ASR (~49%) but elevated Behavioral Irregularity Rate (BIR), up to 17.51%, indicating unstable execution processes without consistent attack success.
Agents behave robustly against none of the 12 attack objectives; each objective falls into one of three failure regimes: stealthy parasitism, misaligned disruption, or compounded failure (Fig 3).
Semantic alignment of adversarial payloads with the user's original intent strongly modulates attack success; high similarity induces ASR up to 79.17% on NanoBrowser and 70.83% on BrowserUse, while different-objective payloads drop ASR below 30% and increase TDR and BIR (Table 3 panel A).
Cross-cue consistency affects vulnerability: when supporting page-level signals contradict the injected payload, ASR on the GPT-5 backbone falls from 55.56% to 19.44%, whereas Gemini stays above 80% ASR regardless (Table 3 panel B).
Earlier exposure to injected content in the agent execution trajectory increases attack success rate from 79.2% to 97.2%, indicating the impact of pre-exposure failures on measured vulnerability (Fig 4).

Threat model

The adversary controls untrusted public content surfaces within the agent's observed environment, including product reviews, ratings, and seller metadata displayed on e-commerce pages. The adversary cannot alter user input instructions, platform infrastructure, system prompts, or browser security state. Malicious instructions are injected indirectly (indirect prompt injection, IPI) or directly (direct prompt injection, DPI, included mainly as a reference) to steer the LLM-based web agent away from the user's intended task, causing harm to one or more stakeholders. The threat model assumes realistic constraints consistent with injecting public web content without elevated privileges.
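The write boundary described above can be sketched as a small lookup. This is only an illustration: the `Surface` enum, its member names, and `adversary_can_write` are hypothetical and not taken from StakeBench, and the writable set lists only the surfaces the summary names explicitly (the source says "including", so the real set may be larger).

```python
from enum import Enum


class Surface(Enum):
    """Content surfaces mentioned in the threat model (names are hypothetical)."""
    PRODUCT_REVIEW = "product_review"
    RATING = "rating"
    SELLER_METADATA = "seller_metadata"
    USER_INSTRUCTION = "user_instruction"
    SYSTEM_PROMPT = "system_prompt"
    PLATFORM_INFRASTRUCTURE = "platform_infrastructure"
    BROWSER_SECURITY_STATE = "browser_security_state"


# Assumption: only the surfaces explicitly listed as adversary-controlled.
ADVERSARY_WRITABLE = frozenset(
    {Surface.PRODUCT_REVIEW, Surface.RATING, Surface.SELLER_METADATA}
)


def adversary_can_write(surface: Surface) -> bool:
    """Return True if the threat model lets the adversary place content here."""
    return surface in ADVERSARY_WRITABLE


if __name__ == "__main__":
    for s in Surface:
        print(f"{s.value:26s} writable={adversary_can_write(s)}")
```

A check like this could gate where a test harness is allowed to place payloads, so that no case silently exceeds the stated adversary capabilities.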

Methodology — deep read

The authors begin with the threat model of deployed LLM-based web agents operating on real-world e-commerce platforms where they autonomously complete shopping-related tasks on behalf of users. The threat assumes an adversary that can inject malicious instructions into environmental content observable by the agent, such as user reviews, ratings, and seller metadata. The adversary cannot modify user instructions, platform infrastructure, system prompts, or browser state.

The adversarial data comes from 22 reusable attack templates covering 12 distinct adversarial objectives, which together represent harms to three stakeholder categories: User, Seller, and Platform. These templates (9 direct prompt injection, 13 indirect prompt injection) embed malicious payloads that could plausibly appear on real product pages. Each template is instantiated across 12 product categories, producing 264 executable adversarial cases. Benign user tasks reflect realistic shopping intents. Agent interactions and trajectories are executed in the OneStopMarket environment from VisualWebArena, a functional e-commerce simulation.
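The case count follows from a cross product of templates and product categories. The sketch below reproduces only that arithmetic; the template identifiers, category names, and the `Template` and `AdversarialCase` types are placeholders invented for illustration, not StakeBench's actual schema.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Template:
    template_id: str  # hypothetical identifier, e.g. "IPI-01"
    channel: str      # "DPI" or "IPI"


@dataclass(frozen=True)
class AdversarialCase:
    template_id: str
    channel: str
    category: str


def build_templates(n_dpi: int = 9, n_ipi: int = 13) -> list[Template]:
    """Create placeholder templates matching the reported 9 DPI / 13 IPI split."""
    dpi = [Template(f"DPI-{i:02d}", "DPI") for i in range(1, n_dpi + 1)]
    ipi = [Template(f"IPI-{i:02d}", "IPI") for i in range(1, n_ipi + 1)]
    return dpi + ipi


def instantiate(templates: list[Template], categories: list[str]) -> list[AdversarialCase]:
    """Instantiate every template once per product category."""
    return [
        AdversarialCase(t.template_id, t.channel, c)
        for t, c in product(templates, categories)
    ]


# Placeholder category names; the real 12 categories are not listed in this summary.
CATEGORIES = [f"category-{i:02d}" for i in range(1, 13)]
cases = instantiate(build_templates(), CATEGORIES)

if __name__ == "__main__":
    print(len(cases))  # 22 templates x 12 categories = 264
```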

The benchmark evaluates four distinct agent-backbone configurations: two agent systems (NanoBrowser and BrowserUse) each paired with backbone LLMs GPT-5 and Gemini-2.5-Flash. These agents differ architecturally; NanoBrowser follows a multi-agent pipeline with planning/navigation modules, while BrowserUse uses a single-agent iterative browser-control loop. Each adversarial case is executed three times per configuration to control variance, for 3,168 runs total. Baseline benign runs without injection measure inherent failure rates.
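The total of 3,168 runs can be checked by enumerating the run matrix. This is a minimal sketch of that enumeration, assuming a simple full cross of agents, backbones, cases, and repeats; `run_schedule` is a hypothetical helper, and the paper's harness may order or batch runs differently.

```python
from itertools import product
from typing import Iterator

AGENTS = ["NanoBrowser", "BrowserUse"]
BACKBONES = ["GPT-5", "Gemini-2.5-Flash"]
N_CASES = 264    # adversarial cases reported in the summary
N_REPEATS = 3    # repetitions per case and configuration


def run_schedule(
    n_cases: int = N_CASES, repeats: int = N_REPEATS
) -> Iterator[tuple[str, str, int, int]]:
    """Yield (agent, backbone, case_index, repeat_index) for every planned run."""
    yield from product(AGENTS, BACKBONES, range(n_cases), range(repeats))


total_runs = sum(1 for _ in run_schedule())

if __name__ == "__main__":
    print(total_runs)  # 2 agents x 2 backbones x 264 cases x 3 repeats = 3168
```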

Evaluation metrics span three axes: Attack Success Rate (ASR), indicating whether the adversarial objective was achieved end-to-end; Task Deviation Rate (TDR), measuring disruption to the user's intended benign task; and Behavioral Irregularity Rate (BIR), capturing pathological execution behaviors such as looping or unstable navigation. A GPT-5-based LLM judge labels each run by assessing the executed trajectory and final environment state against template-specific success conditions, and human annotations validate the judge's quality.
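Given per-run judge labels, the three rates reduce to simple proportions. The sketch below assumes each run carries three independent boolean labels; the `RunLabel` type and its field names are hypothetical, and the paper's exact aggregation (for example, how repeats are pooled) is not specified in this summary.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RunLabel:
    """Judge output for one run (field names are illustrative assumptions)."""
    attack_succeeded: bool  # adversarial objective achieved end-to-end
    task_deviated: bool     # user's benign task was disrupted
    irregular: bool         # looping, unstable navigation, and similar


def rate(flags: Iterable[bool]) -> float:
    """Percentage of True values; 0.0 for an empty input."""
    values = list(flags)
    return 100.0 * sum(values) / len(values) if values else 0.0


def summarize(runs: list[RunLabel]) -> dict[str, float]:
    """Compute ASR, TDR, and BIR (in percent) over a set of labeled runs."""
    return {
        "ASR": rate(r.attack_succeeded for r in runs),
        "TDR": rate(r.task_deviated for r in runs),
        "BIR": rate(r.irregular for r in runs),
    }


if __name__ == "__main__":
    toy_runs = [  # synthetic labels, not data from the paper
        RunLabel(True, False, False),
        RunLabel(True, True, False),
        RunLabel(False, False, True),
        RunLabel(False, False, False),
    ]
    print(summarize(toy_runs))
```

Keeping the three labels separate per run is what allows ASR and TDR to be examined jointly rather than collapsed into one score.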

The joint distribution of ASR and TDR defines four regimes: Robust Behavior (low ASR, low TDR) and three failure regimes, namely Stealthy Parasitism (high ASR, low TDR), Misaligned Disruption (low ASR, high TDR), and Compounded Failure (high ASR, high TDR). BIR provides complementary process-level context.
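The quadrant structure can be written as a two-threshold classifier. This is a sketch only: the summary does not state what cut-offs separate "low" from "high", so `asr_cut` and `tdr_cut` are required arguments rather than defaults, and the values in the usage lines are arbitrary.

```python
def classify_regime(asr: float, tdr: float, asr_cut: float, tdr_cut: float) -> str:
    """Map an (ASR, TDR) pair, in percent, to one of the four named regimes.

    asr_cut and tdr_cut are caller-supplied thresholds (an assumption; the
    paper's own cut-offs are not given in this summary).
    """
    high_asr = asr >= asr_cut
    high_tdr = tdr >= tdr_cut
    if high_asr and high_tdr:
        return "Compounded Failure"
    if high_asr:
        return "Stealthy Parasitism"
    if high_tdr:
        return "Misaligned Disruption"
    return "Robust Behavior"


if __name__ == "__main__":
    # Synthetic inputs with arbitrary thresholds, for illustration only.
    print(classify_regime(asr=70.0, tdr=10.0, asr_cut=40.0, tdr_cut=30.0))
    print(classify_regime(asr=20.0, tdr=45.0, asr_cut=40.0, tdr_cut=30.0))
```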

As an example of the empirical analysis, consider the E3 "Coerced or Induced Purchases" template, in which attacks target user purchase behavior. The semantic alignment experiments vary how similar the injected product is to the user's intent, and the results show significant differences in ASR and failure mode. The same analysis is repeated across templates, stakeholders, agents, and LLMs.

The paper provides extensive appendices detailing template design, threat model formalization, judgment protocols, and per-template results. The authors released StakeBench as open source at https://github.com/StakeBench/SBC to enable reproducibility and community extension.

Technical innovations

Stakeholder-centric prompt injection benchmarking that explicitly distinguishes harms affecting Users, Sellers, and Platform entities, unlike prior attack-centric frameworks.
Decomposition of prompt injection attacks into 12 concrete adversarial objectives and instantiation into 264 realistic adversarial cases spanning 12 product categories.
Multi-axis evaluation framework combining Attack Success Rate, Task Deviation Rate, and Behavioral Irregularity Rate to characterize nuanced failure regimes beyond binary attack success.
Demonstration that attack effectiveness and harm profiles vary substantially with semantic alignment of adversarial payloads and timing of exposure in long-horizon web agent trajectories.

Datasets

StakeBench benchmark — 264 adversarial cases across 12 product categories — instantiated from reusable attack templates over the OneStopMarket VisualWebArena environment

Baselines vs proposed

NanoBrowser GPT-5 IPI ASR = 41.67% vs Gemini-2.5-Flash IPI ASR = 68.16%
BrowserUse GPT-5 IPI ASR = 52.99% vs Gemini-2.5-Flash IPI ASR = 59.19%
NanoBrowser average TDR 29.91% vs BrowserUse average TDR 32.26%
NanoBrowser average BIR 6.84% vs BrowserUse average BIR 16.77%
Seller-targeted attack ASR up to 62.5%, TDR up to 38.66%, User-targeted ASR ~50%, TDR ~18% (Table 2)
High semantic similarity in E3 attacks raises ASR from ~30% to ~79%, with corresponding TDR drop (Table 3 Panel A)
Cue inconsistency reduces ASR from 55.56% to 19.44% on GPT-5 but not on Gemini (Table 3 Panel B)
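The backbone effect quoted in the key findings can be recomputed from the IPI ASR values listed above. The snippet below is a simple arithmetic check; `IPI_ASR` and `backbone_delta` are names introduced here for illustration.

```python
# IPI ASR (percent) per (agent, backbone), copied from the figures listed above.
IPI_ASR = {
    ("NanoBrowser", "GPT-5"): 41.67,
    ("NanoBrowser", "Gemini-2.5-Flash"): 68.16,
    ("BrowserUse", "GPT-5"): 52.99,
    ("BrowserUse", "Gemini-2.5-Flash"): 59.19,
}


def backbone_delta(agent: str) -> float:
    """ASR change, in percentage points, when moving from GPT-5 to Gemini-2.5-Flash."""
    delta = IPI_ASR[(agent, "Gemini-2.5-Flash")] - IPI_ASR[(agent, "GPT-5")]
    return round(delta, 2)


if __name__ == "__main__":
    for agent in ("NanoBrowser", "BrowserUse"):
        print(agent, backbone_delta(agent))
```

The results, 26.49 points for NanoBrowser and 6.20 points for BrowserUse, match the deltas attributed to Table 1.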

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.13385.

Fig 1: Overview of StakeBench. The agent operates within an interactive shopping interface

Fig 2: Overview of the attack taxonomy in StakeBench.

Figs 3 to 8 (page 4).

Limitations

Benchmark focuses primarily on e-commerce domain; generalizability to other web-agent use cases is not demonstrated.
Indirect prompt injection (IPI) is the main channel studied; direct prompt injection (DPI) is included only as a reference, and other attack vectors (e.g., system prompt poisoning) are excluded.
Evaluation uses two agent architectures and two backbone LLMs; results may differ with emerging agent designs or larger LLM variants.
Metric labels rely on an LLM-based automated judge with human validation; subtle failure cases may be misclassified.
Long-horizon task failures complicate attribution of vulnerability to injected content versus inherent agent instability.
No defense mechanisms are evaluated; the benchmark highlights vulnerability but not mitigation effectiveness.

Open questions / follow-ons

How would defensive strategies such as robust context filtering or dynamic prompt sanitization affect StakeBench vulnerability profiles?
Can the stakeholder-centric threat modeling be extended beyond e-commerce to other web-agent contexts like travel booking or financial services?
How do emergent LLM capabilities (e.g., better context understanding or adversarial robustness) shift the balance across failure regimes over time?
What formal verification or runtime monitoring techniques could be integrated to detect or mitigate multi-stakeholder asymmetric harm in real deployments?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, StakeBench provides a refined framework for understanding how prompt injection risks translate into asymmetric, stakeholder-specific harms in deployed LLM-driven agents. The multi-dimensional evaluation metrics and failure regime taxonomy reveal that focusing solely on attack success or aggregate disruption risks missing covert or indirect threats that could evade detection while still harming third parties such as sellers or platforms. This insight is vital when designing mitigation or monitoring solutions that must balance user experience preservation against preventing stealthy or structural manipulations. The detailed taxonomy and benchmark cases can inform scenario design, red-teaming, and risk assessment efforts by highlighting the importance of multi-party impact analysis and execution trajectory dynamics in real-world web interactions.

Cite

bibtex

@article{arxiv2606_13385,
  title={Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents},
  author={Zihao Wang and Yiming Li and Yutong Wu and Zheyu Liu and Kangjie Chen and Fok Kar Wai and Pin-Yu Chen and Vrizlynn L. L. Thing and Bo Li and Dacheng Tao and Tianwei Zhang},
  journal={arXiv preprint arXiv:2606.13385},
  year={2026},
  url={https://arxiv.org/abs/2606.13385}
}
