EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Source: arXiv:2606.13120 · Published 2026-06-11 · By Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng

TL;DR

This paper addresses the problem of benchmarking search agents—large language models (LLMs) integrated with search tools—on evolving, contamination-free knowledge. Existing benchmarks such as BrowseComp rely on static, fixed datasets that risk contamination by leaking into training data, enabling models to exploit parametric memorization instead of genuine web retrieval and reasoning. To overcome this fundamental flaw, the authors introduce EvoBrowseComp, an automated, continually updatable benchmark of 800 complex question-answer pairs (400 English, 400 Chinese) generated from live web data after a fixed timestamp (Jan 1, 2026).

The benchmark is constructed via a novel, fully automated three-agent collaborative framework: (1) a QA synthesis agent crawls fresh web knowledge, plus a limited amount of non-fresh knowledge, to iteratively generate multi-hop QA pairs; (2) an information filtering agent cross-validates retrieved evidence for credibility and filters out overly popular knowledge to block shortcuts; and (3) a high-level guidance agent formalizes reasoning paths as graphs and detects redundancies or shortcut reasoning to steer synthesis toward logically complex questions. Evaluation with state-of-the-art LLMs shows that even the top model with tool access stays just below 45% accuracy (44.8% on English), and that accuracy drops to under 11% without tools, demonstrating both the difficulty and the freshness of the benchmark. The pipeline enables continuous regeneration to prevent contamination and keep pace with evolving world knowledge and advances in search agents.

Key findings

EvoBrowseComp contains 800 complex questions, equally split between English and Chinese, all synthesized with knowledge emerging after January 1, 2026, preventing parametric memorization.
Even the best-performing model, Claude-Opus-4.6 with tool access, achieves only 44.8% accuracy on English questions and 36.8% on Chinese, highlighting benchmark difficulty.
Without tool access, Claude-Opus-4.6's accuracy drops dramatically to 6.0% (English) and 8.8% (Chinese), confirming that answering requires genuine retrieval and multi-hop reasoning, not static recall.
Over 90% of questions require reasoning across at least three distinct root web domains, necessitating broad horizontal search and synthesis.
Average question lengths are longer than prior benchmarks (142.5 tokens English, 162.3 Chinese) and reasoning graphs have similar or more nodes than BrowseComp, reflecting structural complexity.
93% of synthesized evidence lists were fully correct on human evaluation, 90% of questions were consistent and unambiguous, and 100% of answers logically follow from evidence, indicating high quality.
Models evaluated on EvoBrowseComp perform significantly worse than on prior BrowseComp datasets (e.g. GLM-5: 39.2% vs 62.0% English accuracy), confirming the challenge of evolving knowledge and contamination resistance.
Some LLMs frequently exceed the maximum of 40 allowed tool calls (DeepSeek-V4-Max: up to 82.5% of samples), revealing reasoning-efficiency issues despite strong capabilities.
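
The tool-call statistic in the last finding is a simple rate over per-sample call counts. The following is a minimal sketch, assuming a hypothetical list of integer call counts per sample and assuming "exceed" means strictly more than the cap; only the cap of 40 comes from the summary above, and the example counts are invented for illustration.

```python
MAX_TOOL_CALLS = 40  # cap stated in the summary above


def over_budget_rate(call_counts: list[int], cap: int = MAX_TOOL_CALLS) -> float:
    """Return the fraction of samples whose tool-call count exceeds the cap."""
    if not call_counts:
        return 0.0
    exceeded = sum(1 for n in call_counts if n > cap)
    return exceeded / len(call_counts)


# Hypothetical counts for illustration only, not data from the paper.
example_counts = [12, 41, 40, 57, 8, 44, 39, 63]
print(f"{over_budget_rate(example_counts):.1%} of samples exceeded the cap")
```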

Threat model

The adversary is a search agent leveraging an LLM augmented with web search and browsing tools, with prior knowledge cutoff before 2026-01-01. The adversary cannot access future or leaked answers and must rely on live retrieval and multi-hop reasoning over fresh evidence. The threat is cheating via parametric memorization or shortcut reasoning based on stale or over-covered facts, which the benchmark aims to block.

Methodology — deep read

Threat Model & Assumptions: The adversary is a search agent—an LLM augmented with web tools—attempting to answer complex questions relying on retrieving recent knowledge from the live web. The adversary does not have access to future information outside pre-training cutoffs, and cannot exploit parametric memorization of the test set due to its recency. The threat model aims to prevent test set contamination and reasoning shortcutting by encouraging reliance on genuine retrieval and multi-hop logical reasoning.
Data Collection: The benchmark comprises 800 QA pairs (400 English, 400 Chinese) synthesized from live web sources dated after January 1, 2026. Seed entities (~50K) are automatically collected via an LLM-search agent (DeepSeek-V3.2) in 9 domains subdivided into 50 fine-grained subdomains. For each entity, a QA synthesis agent iteratively interacts with search and visit web tools (Google Search, page scraping) to gather evidence and produce candidate QA pairs. Evidence contains fresh knowledge (post-timestamp) and limited non-fresh knowledge, each classified accordingly.
Three-Agent Collaborative Framework:

QA Synthesis Agent: Iteratively mines multi-turn web data to gather evidence lists (E = {ϵ1,...,ϵn}), and synthesizes complex, obfuscated QA pairs from them. Questions are progressively refined through multiple iterations.
Information Filtering Agent: Validates evidence credibility by cross-checking sources, labeling each piece of fresh evidence as "credible", "not credible" or "unclear". Non-fresh evidence is judged for popularity, so that overexposed facts that would enable reasoning shortcuts are blocked. Only credible fresh evidence and non-popular non-fresh evidence are retained; otherwise, the synthesis agent must recollect.
High-level Guidance Agent: Parses questions into reasoning graphs whose nodes are entities or attributes and whose edges represent logical operations (projection, intersection, complement). It detects redundancy (isolated nodes) and shortcuts (structural bypasses) in reasoning paths using the NetworkX toolkit. Based on these findings, it generates textual instructions that guide the QA synthesis agent to increase question complexity and avoid shortcuts in subsequent iterations.

Iteration continues until a question meets all criteria: no redundancy or shortcuts, at least 5 iterations, and at least 5 reasoning edges.
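
The graph checks and the stopping rule described above can be sketched in a few lines. The summary says the authors use NetworkX; this sketch uses plain Python instead so that it has no dependencies. The concrete shortcut definition is an assumption, not the paper's: an edge u→v is flagged when v is also reachable from u through a longer path, meaning the longer chain can be bypassed. All function names and the example graph are hypothetical.

```python
from collections import defaultdict

# An edge is (source, target, operation); operation names follow the summary above.
Edge = tuple[str, str, str]


def isolated_nodes(nodes: list[str], edges: list[Edge]) -> list[str]:
    """Nodes that touch no edge, treated here as redundant clues."""
    touched = {u for u, _, _ in edges} | {v for _, v, _ in edges}
    return sorted(n for n in nodes if n not in touched)


def _reachable(adj, src: str, dst: str, skip: tuple[str, str]) -> bool:
    """Depth-first search from src to dst that ignores the direct edge `skip`."""
    stack, seen = [src], {src}
    while stack:
        cur = stack.pop()
        for nxt in adj[cur]:
            if (cur, nxt) == skip:
                continue
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def shortcut_edges(edges: list[Edge]) -> list[tuple[str, str]]:
    """Assumed definition: edges whose target is also reachable via a longer path."""
    adj = defaultdict(list)
    for u, v, _ in edges:
        adj[u].append(v)
    return [(u, v) for u, v, _ in edges if _reachable(adj, u, v, (u, v))]


def accept(nodes, edges, iterations, min_iterations=5, min_edges=5) -> bool:
    """Stopping rule from the text: no redundancy, no shortcuts, >=5 iterations, >=5 edges."""
    return (
        not isolated_nodes(nodes, edges)
        and not shortcut_edges(edges)
        and iterations >= min_iterations
        and len(edges) >= min_edges
    )


# Hypothetical reasoning graph with one stray node and one bypass edge.
nodes = ["seed", "a", "b", "c", "d", "answer", "stray"]
edges = [
    ("seed", "a", "projection"),
    ("a", "b", "projection"),
    ("b", "c", "intersection"),
    ("d", "c", "intersection"),
    ("c", "answer", "projection"),
    ("seed", "c", "projection"),  # bypasses the seed -> a -> b -> c chain
]
print(isolated_nodes(nodes, edges))  # redundant nodes
print(shortcut_edges(edges))         # bypass edges
print(accept(nodes, edges, iterations=5))
```

In a loop like the one described, the two diagnostic lists would be the raw material for the guidance agent's textual instructions to the synthesis agent.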

Data Quality Assurance: Fluency and clarity of QA pairs are judged by an LLM (DeepSeek-V3.2). Unambiguousness and uniqueness are checked by cross-validating with 6 LLMs; questions that admit alternative answers or are too easy (answered correctly by >5 out of 6 LLMs) are discarded. Human evaluation of 100 sampled QA pairs shows that 87% pass the correctness, consistency, and answer-inference checks.
Domain & Complexity Analysis: The data covers 9 broad domains with a balanced distribution. Questions are longer and structurally more complex than in prior benchmarks, with an average of 4.2 distinct root domains per question and over 90% of questions requiring cross-source reasoning.
Evaluation Setup: Search agents (multiple LLMs) answer questions in multi-turn interaction with Google Search and Visit tools. An LLM judge (GLM-5-Chat) scores the accuracy of predicted answers. Models are evaluated both with and without tool access to measure reliance on parametric memorization.
Training & Hardware: Evaluation uses large-scale models deployed on NVIDIA H20 GPUs with up to 32 GPUs and long context length (128K tokens) to fairly handle complex, multi-hop questions.
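
The cross-validation rule in the data quality assurance step can be written as a small predicate. This is a sketch under stated assumptions: the `CrossCheck` record and its field names are hypothetical, and the "too easy" threshold is read literally from the summary (more than 5 of the 6 checker LLMs answering correctly).

```python
from dataclasses import dataclass


@dataclass
class CrossCheck:
    """Outcome of one checker LLM on a candidate QA pair (hypothetical record)."""
    answered_correctly: bool
    found_alternative_answer: bool


def keep_question(checks: list[CrossCheck], easy_threshold: int = 5) -> bool:
    """Discard questions with alternative answers or that too many checkers solve."""
    if any(c.found_alternative_answer for c in checks):
        return False  # ambiguous or non-unique answer
    correct = sum(c.answered_correctly for c in checks)
    return correct <= easy_threshold  # ">5 out of 6" is treated as too easy


# Illustrative case: five of six hypothetical checkers answer correctly.
checks = [CrossCheck(True, False)] * 5 + [CrossCheck(False, False)]
print(keep_question(checks))
```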

Example: For a seed entity, the QA synthesis agent queries the live web, collects and filters evidence, and synthesizes a multi-hop QA pair. The guidance agent analyzes the pair's reasoning graph for redundancies or shortcuts, then instructs the QA synthesis agent to refine the question so that it becomes more complex and rules out simple shortcut answers. Over multiple iterations, the question stabilizes, passes the quality filters, and is added to the dataset.
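
The evidence-filtering step in this walkthrough follows a simple retention rule: keep fresh evidence only if it is labeled credible, and keep non-fresh evidence only if it is not popular. A minimal sketch, assuming a hypothetical `Evidence` record whose labels have already been assigned by the filtering agent; how the labels are produced is outside the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Credibility(Enum):
    CREDIBLE = "credible"
    NOT_CREDIBLE = "not credible"
    UNCLEAR = "unclear"


@dataclass
class Evidence:
    """One evidence item with labels from the filtering agent (hypothetical schema)."""
    text: str
    is_fresh: bool
    credibility: Optional[Credibility] = None  # set for fresh evidence
    is_popular: Optional[bool] = None          # set for non-fresh evidence


def is_retained(ev: Evidence) -> bool:
    """Retain credible fresh evidence and non-popular non-fresh evidence."""
    if ev.is_fresh:
        return ev.credibility is Credibility.CREDIBLE
    return ev.is_popular is False


def filter_evidence(items: list[Evidence]) -> tuple[list[Evidence], bool]:
    """Return the retained items and whether the synthesis agent must recollect."""
    kept = [ev for ev in items if is_retained(ev)]
    return kept, len(kept) < len(items)


# Invented items for illustration only.
items = [
    Evidence("fresh fact A", is_fresh=True, credibility=Credibility.CREDIBLE),
    Evidence("fresh fact B", is_fresh=True, credibility=Credibility.UNCLEAR),
    Evidence("older niche fact", is_fresh=False, is_popular=False),
    Evidence("older famous fact", is_fresh=False, is_popular=True),
]
kept, must_recollect = filter_evidence(items)
print([ev.text for ev in kept], must_recollect)
```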

Technical innovations

A fully automated three-agent collaborative framework combining live-web retrieval, credibility/popularity filtering, and reasoning graph-guided synthesis to produce contamination-free evolving QA pairs.
Use of reasoning graphs with projection, intersection, and complement logical operations to detect and eliminate reasoning redundancies and shortcuts in synthesized questions.
Integration of credibility and popularity heuristics in a filtering agent to discard unreliable or over-covered evidence and enforce dependency on fresh, hard-to-memorize knowledge.
Continuous regeneration of the benchmark through an automated pipeline, enabling updates that maintain temporal freshness and prevent test set contamination without human annotation.

Datasets

EvoBrowseComp — 800 QA pairs (400 English, 400 Chinese) — Automatically synthesized from live web after 2026-01-01

Baselines vs proposed

Claude-Opus-4.6 w/ tools English accuracy: 44.8% vs. w/o tools: 6.0%
Claude-Opus-4.6 w/ tools Chinese accuracy: 36.8% vs. w/o tools: 8.8%
GLM-5 w/ tools English accuracy: 39.2% vs. w/o tools: 1.1%
Qwen3.5-397B w/ tools English accuracy: 42.0% vs. w/o tools: 2.9%
DeepSeek-V3.2 w/ tools English accuracy: 23.0% vs. w/o tools: 6.3%
Performance drop on EvoBrowseComp vs BrowseComp datasets: GLM-5 from 62.0% to 39.2% (English), DeepSeek-V3.2 from 51.4% to 23.0% (English), showing increased difficulty and contamination resistance
DeepSeek-V4-Flash performance varies with reasoning effort: higher effort improves accuracy but leads to excessive tool calls (up to 82.5% of samples exceed the allowed number), hurting efficiency
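
The gaps implied by these figures are easy to compute directly. The sketch below uses only the English accuracies listed above; the helper names are hypothetical, and the relative drop is an added convenience metric rather than one reported in the summary.

```python
# (with tools, without tools) English accuracy in percent, as listed above.
tool_vs_no_tool = {
    "Claude-Opus-4.6": (44.8, 6.0),
    "GLM-5": (39.2, 1.1),
    "Qwen3.5-397B": (42.0, 2.9),
    "DeepSeek-V3.2": (23.0, 6.3),
}

# (BrowseComp, EvoBrowseComp) English accuracy in percent, as listed above.
browsecomp_vs_evo = {
    "GLM-5": (62.0, 39.2),
    "DeepSeek-V3.2": (51.4, 23.0),
}


def gap(pair: tuple[float, float]) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(pair[0] - pair[1], 1)


def relative_drop(pair: tuple[float, float]) -> float:
    """Drop as a percentage of the first value, rounded to one decimal."""
    return round(100 * (pair[0] - pair[1]) / pair[0], 1)


for model, pair in tool_vs_no_tool.items():
    print(f"{model}: tools add {gap(pair)} points")

for model, pair in browsecomp_vs_evo.items():
    print(f"{model}: {gap(pair)} points lower ({relative_drop(pair)}% relative)")
```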

Limitations

Although the benchmark is largely contamination-free, a small fraction (~7%) of evidence lists may contain minor hallucinations or inaccuracies.
The benchmark focuses on open-web knowledge after a fixed timestamp (2026-01-01); evolving knowledge before or from less accessible sources might be missed.
Evaluation relies on automated LLM judges, whose scores correlate strongly with human judgments but may still diverge from them on individual cases.
Some newer models like DeepSeek-V4 exceeded tool call limits, pointing to unresolved practical limits in reasoning efficiency.
No explicit adversarial evaluation or robustness tests under adversarial retrieval manipulation are reported.
Human evaluation sample size is limited (100 QA pairs), which might not fully represent the entire dataset quality.

Open questions / follow-ons

How can reasoning efficiency be improved to prevent excessive tool calls while maintaining accuracy on complex multi-hop questions?
Can the framework be extended to more languages, domains, and knowledge sources, including non-textual web information?
What defenses are effective against adversarial retrieval attacks that attempt to poison or bias the evidence base in evolving benchmarks?
How do model improvements in reasoning and retrieval techniques affect performance on evolving benchmarks over time?

Why it matters for bot defense

For practitioners in bot defense and CAPTCHA design, EvoBrowseComp illustrates a rigorous methodology for evaluating AI agents’ genuine web retrieval and reasoning capabilities on fresh, contamination-free data. The evolving benchmark’s focus on multi-hop, cross-source reasoning highlights the ongoing difficulty in detecting sophisticated bots or assistants that attempt to shortcut via memorized knowledge. The three-agent framework and reasoning graph formalism provide a conceptual blueprint for designing adaptive challenge sets that resist memorization and promote proactive evidence verification. In particular, CAPTCHA and bot defense engineers might leverage similar evolving data synthesis and multi-agent verification pipelines to generate dynamic, high-complexity challenges that scale with attacker capability.

Additionally, the substantial drop in model accuracy when tool access is removed emphasizes the importance of challenging retrieval and synthesis beyond static memorization, relevant for threat models where AI may collude with external knowledge sources. The source diversity and reasoning graph analyses presented could guide design of CAPTCHA challenges that require bots to perform broad, multi-hop reasoning across distinct information sets. Overall, these techniques represent a promising direction for future-proofing evaluation of AI capabilities central to security tasks in bot detection and user verification.

Cite

bibtex

@article{arxiv2606_13120,
  title={EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge},
  author={Yunhan Wang and Jiaan Wang and Lianzhe Huang and Xianfeng Zeng and Fandong Meng},
  journal={arXiv preprint arXiv:2606.13120},
  year={2026},
  url={https://arxiv.org/abs/2606.13120}
}
