AgentWebBench: Benchmarking Multi-Agent Coordination in Agentic Web
Source: arXiv:2604.10938 · Published 2026-04-13 · By Shanshan Zhong, Kate Shen, Chenyan Xiong
TL;DR
AgentWebBench introduces the first large-scale benchmark designed specifically to evaluate multi-agent coordination in the emerging Agentic Web paradigm, where user agents interact with multiple autonomous content agents, each controlling its own website's data. The benchmark covers four web information access tasks that represent realistic user intents: ranked retrieval (web search, web recommendation) and open-ended synthesis (question answering, deep research). These run on a decentralized architecture of 100 websites containing over 18 million documents. The authors evaluate seven advanced large language models (LLMs) and three progressively complex coordination strategies, ranging from tool-based approaches to full multi-agent communication. Their experiments reveal a consistent performance gap between decentralized coordination and centralized retrieval baselines, but this gap narrows significantly as model scale increases and can even reverse on question answering thanks to iterative evidence gathering. AgentWebBench also enables systemic analyses, which reveal traffic concentration on a small set of popular websites, benefits from test-time scaling (multiple reasoning steps), and a tradeoff between interaction efficiency and coverage. Failure analyses further split bottlenecks into user-agent planning and answer synthesis versus content-agent retrieval quality and evidence. Overall, AgentWebBench offers empirical and analytic foundations for studying how autonomous LLM agents will coordinate to access and synthesize decentralized web information in realistic settings.
Key findings
- On web search, Multi-Agent coordination lags behind centralized classical IR (NDCG@3 with Gemini-3: Classical 52.21% vs Multi-Agent 44.34%), but the gap shrinks with larger models.
- On question answering, Multi-Agent coordination can outperform centralized retrieval, e.g., Gemini-3 accuracy is 67.92% for Multi-Agent vs 60.38% for Classical.
- Prompted website selection (ToolP) improves over embedding-similarity selection (ToolE) on web search, indicating that LLM reasoning helps website selection.
- Test-time scaling (thinking mode) improves coordination accuracy and reduces malformed requests, e.g., boosting Qwen3-30B web search NDCG@3 from ~30% to ~35%.
- AgentWebBench corpus comprises 100 websites with 18.4 million documents from ClueWeb22-B, served by autonomous content agents each with independent dense retrieval indices.
- Multi-agent coordination concentrates web traffic on a small subset of websites like sciencedirect.com and wikipedia.org compared to more diverse source distributions in centralized retrieval.
- Failure attribution shows user agents dominate errors on planning and website selection in web search and recommendations (50-60%), while content agents contribute more in retrieval-heavy deep research.
- Interaction efficiency analysis reveals stronger models (Gemini-3) perform better by making more agent turns and retrieval requests, trading off validity for coverage.
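The traffic-concentration finding can be made concrete with a simple statistic. The summary does not say which measure the paper uses, so the following is only a minimal sketch: it assumes a hypothetical log of which website served each retrieval request and computes the share of requests captured by the top-k sites. The function name `top_k_share` and the log contents are illustrative, not data from the paper.

```python
from collections import Counter

def top_k_share(requests, k=2):
    """Fraction of all requests that went to the k most-requested sites."""
    if not requests:
        return 0.0
    counts = Counter(requests)
    top = sum(n for _, n in counts.most_common(k))
    return top / len(requests)

# Hypothetical request logs (illustrative values, not results from the paper).
multi_agent_log = (
    ["wikipedia.org"] * 5 + ["sciencedirect.com"] * 3 + ["other.example"] * 2
)
centralized_log = [f"site{i}.example" for i in range(10)]  # ten distinct sources

print(top_k_share(multi_agent_log))  # 0.8
print(top_k_share(centralized_log))  # 0.2
```

A higher share for the same k means traffic is more concentrated; other statistics such as a Gini coefficient or entropy would serve the same purpose.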
Threat model
Adversaries are not the focus; the setting assumes benign autonomous agents operating with constrained access to corpora via website-specific interfaces. The user agent cannot directly access raw documents and must rely on content agents who control domain-specific retrieval, with no capability for adversarial disruption or information poisoning considered.
Methodology — deep read
The authors introduce AgentWebBench to study decentralized multi-agent coordination in web information access tasks. The setup assumes a user agent that acts on behalf of the user, has no direct corpus access, and interacts only with content agents that control individual website domains. Content agents manage retrieval over their own segments of a large web corpus under realistic domain constraints.
The dataset is drawn from ClueWeb22-B, filtered to 100 high-volume English-language websites with 18.4 million total documents. Queries and labels come from established datasets adapted per task: MS MARCO Web Search (354 queries), ORBIT user histories for web recommendation (281 samples), and DeepResearchGym for question answering (53 queries) and deep research long reports (331 queries).
AgentWebBench defines two agent types: a single user agent that selects websites and synthesizes answers without corpus access, and 100 content agents each controlling retrieval within one website. Retrieval within each content agent uses a dense retriever built from MiniCPM-Embedding-Light over the website's documents, accessible only through content-agent tool calls.
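The access constraint described above can be illustrated with a small sketch. It assumes precomputed document embeddings and plain cosine similarity; the class name `ContentAgent`, its `search` method, and the toy two-dimensional vectors are hypothetical stand-ins, and no real embedding model (such as MiniCPM-Embedding-Light) is called.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class ContentAgent:
    """Owns one website's index; callers never see the raw vectors."""

    def __init__(self, domain, doc_vectors):
        self.domain = domain
        self._doc_vectors = doc_vectors  # {doc_id: embedding}, private to this agent

    def search(self, query_vector, top_k=3):
        """Return the ids of the top_k documents most similar to the query."""
        scored = [
            (cosine(query_vector, vec), doc_id)
            for doc_id, vec in self._doc_vectors.items()
        ]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:top_k]]

# Toy index with hand-written vectors (illustrative only).
agent = ContentAgent(
    "site.example",
    {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]},
)
print(agent.search([1.0, 0.1], top_k=2))  # ['d1', 'd3']
```

The point of the design is that the user agent holds only a reference to `search`, so all document access is mediated by the site's own agent.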
They propose three coordination strategies: (1) ToolE with embedding similarity for website selection and tool-based document retrieval without agent communication; (2) ToolP that uses LLM reasoning with prompting to select websites but still tool-based retrieval; (3) Multi-Agent enabling agent-to-agent communication where user and content agents cooperate conversationally to perform dynamic website selection, iterative retrieval with summaries, and answer synthesis.
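As described, ToolE and ToolP share the same tool-based retrieval step and differ only in how websites are selected. A minimal sketch of that shared skeleton follows. All names here are hypothetical: word overlap stands in for embedding similarity to keep the example dependency-free, and `llm_selector_stub` is a hard-coded placeholder where a real system would prompt an LLM.

```python
from typing import Callable, Dict, List

Selector = Callable[[str, List[str]], List[str]]   # (query, domains) -> chosen domains
SearchTool = Callable[[str], List[str]]            # query -> document ids from one site

def run_tool_strategy(query: str, select: Selector,
                      tools: Dict[str, SearchTool]) -> Dict[str, List[str]]:
    """Pick websites with the selector, then call each chosen site's tool once."""
    chosen = select(query, list(tools))
    return {domain: tools[domain](query) for domain in chosen}

def make_overlap_selector(descriptions: Dict[str, str], top_n: int = 1) -> Selector:
    """ToolE stand-in: rank sites by word overlap with the site description."""
    def select(query: str, domains: List[str]) -> List[str]:
        query_words = set(query.lower().split())
        def score(domain: str) -> int:
            return len(query_words & set(descriptions[domain].lower().split()))
        return sorted(domains, key=score, reverse=True)[:top_n]
    return select

def llm_selector_stub(query: str, domains: List[str]) -> List[str]:
    """ToolP stand-in: placeholder for an LLM prompted with the query and site list."""
    return [d for d in domains if "wiki" in d]  # hard-coded choice for illustration

tools = {
    "wiki.example": lambda q: ["wiki-doc-1"],
    "journal.example": lambda q: ["paper-7", "paper-9"],
}
descriptions = {
    "wiki.example": "general encyclopedia articles",
    "journal.example": "peer reviewed science papers",
}

tool_e = run_tool_strategy("science papers on retrieval",
                           make_overlap_selector(descriptions), tools)
tool_p = run_tool_strategy("science papers on retrieval", llm_selector_stub, tools)
print(tool_e)  # {'journal.example': ['paper-7', 'paper-9']}
print(tool_p)  # {'wiki.example': ['wiki-doc-1']}
```

The Multi-Agent strategy is not covered by this skeleton, since it replaces the single tool call per site with a conversation between user and content agents.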
Seven advanced LLMs power the agents, including Qwen3 (various sizes up to 80B), GPT-5-mini, Gemini-3, and DeepSeek-V3.2. Agents are allowed up to 15 interaction turns. Some runs use "thinking mode," an explicit multi-step reasoning setting to improve planning quality. Agents operate with task-specific prompts and corpus information.
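The 15-turn cap amounts to a bounded interaction loop. The sketch below shows one plausible shape for it, assuming a user-agent policy that either asks a content agent or answers. The action format, `toy_policy`, and `toy_content_agent` are hypothetical and not taken from the paper.

```python
MAX_TURNS = 15  # turn cap stated in the summary

def coordinate(plan_next, ask_agent, max_turns=MAX_TURNS):
    """Run the user-agent loop until it answers or the turn budget is spent."""
    evidence = []
    for turn in range(1, max_turns + 1):
        action = plan_next(evidence)  # hypothetical user-agent policy
        if action["type"] == "answer":
            return {"answer": action["text"], "turns": turn, "evidence": evidence}
        reply = ask_agent(action["domain"], action["message"])
        evidence.append((action["domain"], reply))
    # Budget exhausted without an answer.
    return {"answer": None, "turns": max_turns, "evidence": evidence}

def toy_policy(evidence):
    """Ask two sites in order, then answer (illustrative behaviour only)."""
    sites = ["a.example", "b.example"]
    if len(evidence) < len(sites):
        return {"type": "ask", "domain": sites[len(evidence)], "message": "find sources"}
    return {"type": "answer", "text": f"synthesized from {len(evidence)} replies"}

def toy_content_agent(domain, message):
    return f"summary from {domain}"

result = coordinate(toy_policy, toy_content_agent)
print(result["turns"], result["answer"])  # 3 synthesized from 2 replies
```

A policy that never answers returns `answer=None` after `max_turns`, which is the case an adaptive stopping rule would try to avoid.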
For evaluation, they use task-appropriate metrics: NDCG@k and Recall@k for ranked retrieval (web search and recommendation), accuracy and token-level F1 for question answering short answers, and key-point recall, contradiction, clarity, and insight judged by LLMs for deep research reports. They compare the coordination methods against centralized classical IR baselines with universal corpus access.
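For reference, the ranked-retrieval and short-answer metrics named above can be computed as in the following sketch. It assumes the linear-gain form of DCG and whitespace tokenization for F1. The paper's exact variants (for example, exponential gain or answer normalization) are not stated in this summary, so treat these choices as assumptions.

```python
import math
from collections import Counter

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids, relevance, k):
    """relevance maps doc_id -> graded relevance; unlisted ids count as 0."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg else 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (whitespace tokens, lowercased)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

ranking = ["d2", "d1", "d3"]
print(round(ndcg_at_k(ranking, {"d1": 1}, k=3), 3))             # 0.631
print(recall_at_k(ranking, ["d1", "d4"], k=3))                  # 0.5
print(round(token_f1("the eiffel tower", "eiffel tower"), 3))   # 0.8
```

In the first example the only relevant document sits at rank 2, so NDCG@3 is 1/log2(3), about 0.631.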
Resources and code are released for reproducibility. Experiments study the effect of model scale, coordination complexity, test-time scaling, and interaction budgets. A failure mode analysis traces errors to user or content agents across discriminative tasks. Efficiency analysis profiles interaction turns, retrieval requests, agent coverage, and validity across models and tasks. The experiments provide a comprehensive empirical characterization of multi-agent coordination effectiveness, ecosystem impact, efficiency, and failure patterns in the Agentic Web setting.
Technical innovations
- First large-scale benchmark (AgentWebBench) evaluating multi-agent coordination spanning ranked retrieval and generative synthesis tasks on a realistic decentralized web corpus.
- Three progressive coordination strategies incorporating embedding-based selection, LLM reasoning, and autonomous agent-to-agent communication to model Agentic Web paradigms.
- Integration of dense in-domain retrieval within individual content agents across 100 website domains, enforcing decentralized access constraints.
- Analysis of multi-agent interaction design trade-offs including test-time scaling (thinking mode), interaction budgeting, and traffic concentration effects on web source diversity.
Datasets
- AgentWebBench corpus — 18,427,770 documents — derived from English subset of ClueWeb22-B (public)
- MS MARCO Web Search subset — 354 queries — publicly available benchmark adapted
- ORBIT user histories — 281 samples — used for web recommendation task
- DeepResearchGym short-answer subset — 53 queries — used for question answering
- DeepResearchGym long-answer subset — 331 queries — used for deep research reports
Baselines vs proposed
- Classical IR vs Multi-Agent (web search, Gemini-3): NDCG@3 52.21% vs 44.34%
- Classical IR vs Multi-Agent (question answering, Gemini-3): accuracy 60.38% vs 67.92%
- ToolE vs ToolP (web search): ToolP NDCG@3 40.29% vs ToolE 34.23% (Qwen3-14B) showing LLM reasoning helps website selection
- Test-time scaling improves Qwen3-4B web search NDCG@3 from ~30% (no thinking) to ~35% (thinking mode)
- Multi-Agent vs ToolP (web search): ToolP often stronger on retrieval tasks, Multi-Agent gains on generative tasks
- Within Qwen3 series, model scale correlates positively with all metrics (question answering accuracy improved by ~10% from 4B to 80B)
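The gaps implied by the reported numbers can be worked out directly. The short sketch below computes the difference in percentage points and the change relative to the first score for the three comparisons with exact figures above. The helper name `gap` is illustrative.

```python
def gap(baseline, other):
    """Return (difference in points, change relative to baseline in %)."""
    absolute = other - baseline
    return round(absolute, 2), round(100 * absolute / baseline, 1)

print(gap(52.21, 44.34))  # web search, Classical -> Multi-Agent: (-7.87, -15.1)
print(gap(60.38, 67.92))  # question answering, Classical -> Multi-Agent: (7.54, 12.5)
print(gap(34.23, 40.29))  # web search, ToolE -> ToolP (Qwen3-14B): (6.06, 17.7)
```

So, with Gemini-3, Multi-Agent trails classical IR by about 7.9 points on web search NDCG@3 but leads by about 7.5 points on question answering accuracy.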
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.10938.

Fig 1: Overview of AgentWebBench. The benchmark models the Agentic Web paradigm where user agents coordinate with multiple

Limitations
- Overall performance remains modest with room for improvement, especially on web recommendation where recall is near zero.
- Coordination strategies do not fully address retrieval instability and evidence reliability in content agents, affecting retrieval-heavy tasks.
- Dataset excludes many smaller or niche websites, possibly exaggerating traffic concentration effects.
- Failure analyses restricted to tasks with ground truth, leaving deep research failure modes less explored.
- Interaction turn budgets capped at 15 may limit exploration of longer planning horizons or adaptive stopping criteria.
- No adversarial evaluation against malicious or deceptive content agents or user agents reported.
Open questions / follow-ons
- How can user agents improve planning and answer synthesis to reduce failure rates in decentralized multi-agent coordination?
- What methods can content agents adopt to enhance retrieval stability and evidence reliability across diverse web domains?
- How can adaptive interaction budgeting and dynamic stopping criteria be designed to optimize the tradeoff between efficiency and coverage in multi-agent collaboration?
- What are the long-term ecosystem effects of traffic concentration on content diversity and discoverability in Agentic Web deployments?
Why it matters for bot defense
Bot-defense and CAPTCHA practitioners should note that the Agentic Web paradigm shifts information access toward coordinated multi-agent interactions mediated by website-controlled interfaces, rather than direct centralized retrieval. This decentralization concentrates traffic on a few dominant web domains, which may affect bot-detection signals based on traffic diversity or source reputation. Furthermore, user agents rely on iterative planning and synthesis that can be sensitive to retrieval failures or inconsistencies, which may introduce exploitable input-output dependencies. Agents that use test-time scaling show improved reliability through multi-step reasoning, suggesting that bot-detection systems may need to model temporal interaction patterns rather than single-step queries. Finally, the coordination protocols and interface limitations modeled in AgentWebBench expose attack surfaces where careful validation of request formatting and of agent adherence to protocols is needed to prevent automation abuse. Understanding multi-agent behaviors and failure modes through benchmarks like AgentWebBench can inform more robust challenge-response designs and anomaly detection in increasingly agentic web environments.
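As one illustration of the request-format validation mentioned above, a content-agent endpoint might reject malformed requests before doing any retrieval. This is a minimal sketch under an assumed schema: the field names `domain`, `query`, and `top_k`, and the limit `max_top_k`, are hypothetical and do not come from AgentWebBench.

```python
ALLOWED_FIELDS = {"domain", "query", "top_k"}  # hypothetical request schema

def validate_request(request, known_domains, max_top_k=20):
    """Return a list of problems; an empty list means the request is well-formed."""
    if not isinstance(request, dict):
        return ["request must be an object"]
    problems = []
    extra = set(request) - ALLOWED_FIELDS
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    domain = request.get("domain")
    if not isinstance(domain, str) or domain not in known_domains:
        problems.append("unknown or missing domain")
    query = request.get("query")
    if not isinstance(query, str) or not query.strip():
        problems.append("query must be a non-empty string")
    top_k = request.get("top_k", 1)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= max_top_k:
        problems.append("top_k out of range")
    return problems

known = {"wikipedia.org", "sciencedirect.com"}
print(validate_request({"domain": "wikipedia.org", "query": "agentic web"}, known))  # []
print(len(validate_request({"domain": "evil.example", "query": " ", "top_k": 500}, known)))  # 3
```

Counting rejected requests per client over time would give one of the temporal signals the paragraph above alludes to.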
Cite
@article{arxiv2604_10938,
title={AgentWebBench: Benchmarking Multi-Agent Coordination in Agentic Web},
author={Shanshan Zhong and Kate Shen and Chenyan Xiong},
journal={arXiv preprint arXiv:2604.10938},
year={2026},
url={https://arxiv.org/abs/2604.10938}
}