SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

Source: arXiv:2605.22219 · Published 2026-05-21 · By Ningyuan Li, Haiyang Shen, Mugeng Liu, Yudong Han, Zhuofan Shi, Sixiong Xie et al.

TL;DR

SGR-Bench addresses state-gated retrieval (SGR), a specialized form of web data retrieval in which an agent must not only identify the correct website but also establish and maintain the correct site-specific retrieval state (e.g., filters, views, hierarchies) to surface answer-bearing evidence. Existing benchmarks largely focus on source discovery and browsing and neglect the retrieval-state control that governs access to specialized, gated data. The benchmark comprises 100 expert-curated tasks across 6 major domain families and 12 public data ecosystems, each requiring a structured, ordered-table output derived from navigating site-specific controls.

The authors evaluate eight CLI-based large language model (LLM) agentic search systems and three commercial search-agent products on SGR-Bench. The best CLI-based system achieved 66.18% item-level F1, while row-level (full entry) F1 scores and ordering accuracy were considerably lower. A detailed manual audit of failed retrieval trajectories reveals that the dominant error modes are retrieval-scope drift (37.2%) and criterion mismatch (27.6%), indicating that agents often reach the right website but set or preserve filters incorrectly, causing incomplete or inaccurate answer sets. Final answer composition errors account for a much smaller fraction of failure cases (10.3%). These results highlight retrieval-state control as the key bottleneck in complex specialized retrieval tasks, a capability overlooked in prior evaluations.

Key findings

  • Top CLI-based system GPT-5.5 achieves 66.18% item-level F1 but only 43.37% row-level F1 on SGR-Bench, showing a 22.81-point gap due to loss of retrieval-state context.
  • Across all 8 CLI agents and 3 commercial systems, average item-level F1 is 47.85% and average row-level F1 is 26.81%, indicating that agents often retrieve partial evidence without assembling correct full rows.
  • A manual audit of 156 failed CLI trajectories attributes 37.2% of failures to retrieval-scope drift and 27.6% to criterion mismatch; together these two error types account for 64.7% of failures.
  • Final answer composition errors appear in only 10.3% of audited failures, underscoring that the bottleneck is maintaining site-specific retrieval states rather than synthesis or formatting.
  • CLI-based LLM agents outperform commercial products on mean Item-F1 (53.42% vs 33.00%), suggesting fine-grained control via CLI benefits retrieval-state preservation.
  • Scholarly archives and environmental monitoring tasks yield higher item-level F1 (>57%) compared to regulatory resources and official statistics (<37%), reflecting difficulty in preserving interacting filters and scopes.
  • Constraint-guided task variants yield slightly higher item-level F1 than goal-oriented variants (+1.90 points, 48.71% vs 46.81%), while goal-oriented variants have marginally higher row-level F1 (+2.24 points), so explicit guidance does not uniformly help.
  • Across all systems, item-level F1 consistently exceeds row-level F1, confirming agents recover field values more often than correctly scoped full rows (Fig 2a).
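
The headline numbers above can be cross-checked with a few lines of arithmetic. The sketch below assumes the reported group means are unweighted per-system averages (an assumption, not something the summary states); the function and variable names are illustrative.

```python
# Figures quoted in the findings above; names are illustrative only.
N_CLI, N_COMMERCIAL = 8, 3
CLI_MEAN_ITEM_F1 = 53.42
COMMERCIAL_MEAN_ITEM_F1 = 33.00


def pooled_mean(mean_a: float, n_a: int, mean_b: float, n_b: int) -> float:
    """Mean over the union of two groups, weighting each group by its size."""
    return (mean_a * n_a + mean_b * n_b) / (n_a + n_b)


# Assumption: each group mean is an unweighted average over its systems.
all_systems_item_f1 = pooled_mean(
    CLI_MEAN_ITEM_F1, N_CLI, COMMERCIAL_MEAN_ITEM_F1, N_COMMERCIAL
)

# Gaps are differences of percentages, so they are in percentage points.
item_row_gap = 66.18 - 43.37      # best CLI system, item-level vs row-level F1
variant_item_gap = 48.71 - 46.81  # constraint-guided vs goal-oriented item F1

print(round(all_systems_item_f1, 2))  # 47.85
print(round(item_row_gap, 2))         # 22.81
print(round(variant_item_gap, 2))     # 1.9
```

Under that assumption, the pooled mean of the two group averages reproduces the 47.85% all-systems item-level figure.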

Threat model

The paper does not model an adversary in the security sense. The actor under study is an automated search agent tasked with locating answer-bearing evidence on specialized websites by establishing site-specific retrieval states. The agent is honest but fallible: its errors arise from the complexity of controlling site-specific filters, views, and scopes when evidence is exposed only partially and hierarchically. The benchmark therefore evaluates agent capability under realistic constraints and does not consider hostile attackers, evasion, or poisoning.

Methodology — deep read

The paper introduces the state-gated retrieval (SGR) task formalization, where an agent must identify the target specialized website W and establish a sequence of retrieval states s1,...,sk through filters, views, hierarchy nodes, or scopes such that answer-bearing evidence E(a) is surfaced. The retrieval state is key since evidence is generally not accessible under the default state s0. SGR-Bench defines tasks that require chained, dependent control of the retrieval state to expose nested or scoped data.
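
As a toy illustration of this formalization, the sketch below models a hypothetical in-memory site whose records become visible only when the retrieval state contains their gating controls. `RetrievalState`, `GatedSite`, and the sample records are invented for this example and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RetrievalState:
    """A retrieval state: the set of active (control, value) pairs."""
    controls: frozenset = field(default_factory=frozenset)

    def apply(self, control: str, value: str) -> "RetrievalState":
        # Setting a control replaces any earlier value for that control.
        kept = {(c, v) for c, v in self.controls if c != control}
        return RetrievalState(frozenset(kept | {(control, value)}))


@dataclass
class GatedSite:
    """Hypothetical website W; each record is paired with its gating controls."""
    records: list  # list of (gate: dict, row: tuple)

    def visible(self, state: RetrievalState) -> list:
        active = dict(state.controls)
        return [row for gate, row in self.records
                if all(active.get(c) == v for c, v in gate.items())]


# Made-up data: one ungated landing row and one row gated by two controls.
site = GatedSite(records=[
    ({}, ("landing page summary",)),
    ({"dataset": "air-quality", "year": "2024"}, ("station-17", "PM2.5", "12.4")),
])

s0 = RetrievalState()                    # default state
s1 = s0.apply("dataset", "air-quality")  # first control set
s2 = s1.apply("year", "2024")            # evidence is now exposed
drifted = s2.apply("year", "2023")       # scope drift hides it again

print(len(site.visible(s0)), len(site.visible(s1)),
      len(site.visible(s2)), len(site.visible(drifted)))  # 1 1 2 1
```

The gated row appears only under `s2`; overwriting a single control afterwards hides it again, which is the kind of scope drift the audit describes.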

To build SGR-Bench, the authors use a four-stage data curation pipeline: candidate website curation, task design protocol adherence, task drafting, and candidate filtering/validation. Candidate sources are drawn from English Wikipedia external links, prioritized via a Qwen-Plus LLM as likely information-dense retrieval sites, then dual-reviewed by annotators to ensure suitability. The task design protocol enforces six requirements: domain specificity, long-tail source grounding, answer uniqueness and verifiability, ground-truth stability, shortcut resistance, and logical dependency of retrieval steps.

Task drafting is assisted by ChatGPT-5.2 Pro, which proposes questions, answers, and solution trajectories involving that site’s retrieval controls. Drafts undergo form-level review to ensure natural phrasing and amend ill-formed questions or output formats. Next, tasks are preliminarily screened out if solvable via direct shortcuts or unstable in grounding. Three rounds of expert validation verify answer identifiability, necessity of retrieval-state gating, and robustness against shortcutting.

The dataset contains 100 tasks spanning 6 source families (environmental monitoring, regulatory, scholarly, life sciences, official statistics, vulnerability databases) and 12 public data ecosystems. Tasks are evenly split between constraint-guided variants, which state the retrieval logic explicitly, and goal-oriented variants, which leave it implicit. All outputs use ordered-table schemas with 2 to 44 rows per task (median 4).
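
One way to picture a task record is the sketch below. It is a hypothetical schema, not the released dataset format: the class name, field names, and validation rules are assumptions derived only from the description above.

```python
from dataclasses import dataclass
from typing import Literal

# The six source families named in the summary.
SOURCE_FAMILIES = frozenset({
    "environmental monitoring", "regulatory", "scholarly",
    "life sciences", "official statistics", "vulnerability databases",
})


@dataclass(frozen=True)
class SGRTask:
    """Hypothetical task record; field names are illustrative assumptions."""
    task_id: str
    source_family: str
    variant: Literal["constraint-guided", "goal-oriented"]
    columns: tuple[str, ...]
    gold_rows: tuple[tuple[str, ...], ...]  # ordered: row order is scored

    def __post_init__(self) -> None:
        if self.source_family not in SOURCE_FAMILIES:
            raise ValueError(f"unknown source family: {self.source_family!r}")
        if self.variant not in ("constraint-guided", "goal-oriented"):
            raise ValueError(f"unknown variant: {self.variant!r}")
        if not 2 <= len(self.gold_rows) <= 44:
            raise ValueError("expected between 2 and 44 gold rows")
        if any(len(row) != len(self.columns) for row in self.gold_rows):
            raise ValueError("every row must have one cell per column")


# Made-up example values, for illustration only.
example = SGRTask(
    task_id="demo-1",
    source_family="scholarly",
    variant="goal-oriented",
    columns=("title", "year"),
    gold_rows=(("Paper A", "2021"), ("Paper B", "2019")),
)
print(example.variant, len(example.gold_rows))  # goal-oriented 2
```

Storing the gold rows as an ordered tuple keeps the row order available for an ordering metric.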

For evaluation, eight CLI-based LLM agentic search systems and three commercial search-agent products are tested. All agents receive identical task prompts specifying information needs and output formats. CLI agents use default configurations of the Codex CLI or Claude Code CLI interfaces with medium effort and thinking modes. Commercial systems are evaluated via manual web interaction. Outputs are first canonicalized to remove trivial spelling, punctuation, and formatting mismatches; aligned row-level scoring then computes item-level F1, row-level F1, and pairwise order accuracy (P.O.A.).
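
The three metrics can be approximated with the sketch below. It is not the paper's scorer: the alignment procedure is not described here, so this version swaps in a simpler bag-of-cells count for item-level F1, exact canonical-row matching for row-level F1, and a plain pairwise comparison over shared rows for ordering. The canonicalization rule and all function names are assumptions.

```python
import re
from collections import Counter
from itertools import combinations


def canon(cell: str) -> str:
    """Collapse whitespace, strip trailing . , ; : and casefold (assumed rule)."""
    return re.sub(r"\s+", " ", cell).strip().rstrip(".,;:").casefold()


def canon_row(row) -> tuple:
    return tuple(canon(c) for c in row)


def f1(pred: Counter, gold: Counter) -> float:
    overlap = sum((pred & gold).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


def item_f1(pred_rows, gold_rows) -> float:
    """F1 over (column index, value) cells, ignoring which row they came from."""
    def cells(rows):
        return Counter((j, v) for r in rows for j, v in enumerate(canon_row(r)))
    return f1(cells(pred_rows), cells(gold_rows))


def row_f1(pred_rows, gold_rows) -> float:
    """F1 over whole rows; a row counts only if every cell matches."""
    return f1(Counter(map(canon_row, pred_rows)), Counter(map(canon_row, gold_rows)))


def pairwise_order_accuracy(pred_rows, gold_rows) -> float:
    """Share of gold-ordered pairs of shared rows that keep their relative order."""
    pred_pos = {}
    for i, r in enumerate(map(canon_row, pred_rows)):
        pred_pos.setdefault(r, i)
    shared = [canon_row(r) for r in gold_rows if canon_row(r) in pred_pos]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    return sum(pred_pos[a] < pred_pos[b] for a, b in pairs) / len(pairs)


# Made-up rows: one wrong cell in the last row, first two rows swapped.
gold = [("Station A", "PM2.5", "12.4"), ("Station B", "PM2.5", "9.1"),
        ("Station C", "PM2.5", "7.0")]
pred = [("station b", "PM2.5", "9.1"), ("Station A", "PM2.5", "12.4"),
        ("Station C", "PM10", "7.0")]

print(round(item_f1(pred, gold), 4))        # 0.8889
print(round(row_f1(pred, gold), 4))         # 0.6667
print(pairwise_order_accuracy(pred, gold))  # 0.0
```

In this made-up case a single wrong cell costs one item out of nine but invalidates a whole row, which mirrors the reported pattern of item-level F1 exceeding row-level F1.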

A manual audit of 156 failed CLI-based agent retrieval trajectories categorizes failure types and finds that failure to preserve site-specific retrieval state (filters, scopes) is the main cause of error. The analysis also profiles difficulty by source family and task variant. Models are accessed through the OpenRouter platform and the official OpenAI API. Source code and dataset are publicly available to support reproducibility.

Technical innovations

  • Definition and formalization of state-gated retrieval (SGR) as a retrieval task requiring site-specific retrieval state control to expose gated evidence.
  • Creation of SGR-Bench: a 100-task benchmark explicitly targeting retrieval-state gated access on specialized data-retrieval websites, spanning 6 domains and 12 data ecosystems.
  • Paired constraint-guided and goal-oriented task formulations to enable controlled comparison of explicit vs implicit retrieval-state guidance on the same underlying problems.
  • A multi-stage data curation and expert validation pipeline ensuring domain specificity, shortcut resistance, logical dependency, and stable ground-truths for evaluation robustness.
  • Comprehensive empirical evaluation across eight CLI-based LLM agents and three commercial search products, combined with manual trajectory audit isolating retrieval-scope drift and criterion mismatch as dominant failure modes.

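Datasets

  • SGR-Bench: 100 expert-curated tasks across 6 source families and 12 public data ecosystems, split evenly between constraint-guided and goal-oriented variants; released publicly by the authors.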

Baselines vs proposed

  • GPT-5.5 CLI: Item-F1 = 66.18% vs lowest CLI Seed-2.0 Pro: 29.88%
  • All-systems average (8 CLI + 3 commercial): Item-F1 = 47.85%, Row-F1 = 26.81%
  • CLI agents average: Item-F1 = 53.42%
  • Commercial agents average: Item-F1 = 33.00%
  • Constraint-guided tasks: Item-F1 average 48.71%, Goal-oriented tasks: 46.81%
  • Google Search AI Mode (Commercial): Item-F1 = 14.87% vs GPT-5.5 CLI: 66.18%
  • OpenAI Deep Research (Commercial): Item-F1 = 54.20% vs Claude Opus 4.7 CLI: 61.38%

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.22219.

Fig 1: Overview of the SGR-Bench four-stage data curation pipeline.

Figs 2 to 8 (page 5).

Limitations

  • The benchmark focuses on specialized public data-retrieval websites curated from Wikipedia external links, which may not cover all domain-specific retrieval environments or corporate intranet settings.
  • The evaluation does not cover active attackers who deliberately deceive agents or poison retrieval-state controls, which limits any security threat analysis.
  • The trajectory audit covers only 156 failed CLI runs, so the error distribution may not generalize to all failed cases or to commercial agent failures.
  • The commercial system evaluation is manual and limited by interface constraints and timing, potentially biasing comparisons against the automated CLI-based evaluations.
  • The benchmark uses carefully curated, shortcut-resistant tasks, but real-world deployments may face spurious or dynamically changing website controls that are not captured here.

Open questions / follow-ons

  • How to integrate browser-grounded interaction with retrieval-state gating to jointly optimize navigation and evidence exposure?
  • Can training agents with explicit supervision on retrieval-state transitions improve preservation of site-specific retrieval context?
  • How to extend SGR capabilities to handle dynamic or frequently changing website controls and state dependencies?
  • What adversarial or robustness challenges arise when agents attempt to bypass or manipulate site-specific retrieval gating?

Why it matters for bot defense

SGR-Bench highlights a crucial dimension of bot defense and automated agent design: effectiveness on specialized retrieval platforms depends not only on locating relevant sources but on maintaining precise, stateful control of site-specific retrieval settings (filters, views, scopes). For bot-defense engineers, this implies that automated crawlers or scraping agents may produce incomplete or incorrect data unless they model retrieval-state dependencies properly.

Moreover, CAPTCHA or human verification challenges could be placed selectively to detect forced retrieval-state drift or automated trial-and-error attempts to guess the correct filters, either of which may reveal bot behavior. CAPTCHA systems might also account for the complexity of retrieval-state gating when defining human-session behavioral baselines, using failure to maintain state coherence as a signal for scraping bots. Finally, SGR-Bench's paired explicit/implicit query formulations suggest new evaluation axes for bot detection based on an agent's capability for structured retrieval-state control.

Cite

bibtex
@article{arxiv2605_22219,
  title={SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval},
  author={Ningyuan Li and Haiyang Shen and Mugeng Liu and Yudong Han and Zhuofan Shi and Sixiong Xie and Yun Ma},
  journal={arXiv preprint arXiv:2605.22219},
  year={2026},
  url={https://arxiv.org/abs/2605.22219}
}

Articles are CC BY 4.0 — feel free to quote with attribution