DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Source: arXiv:2606.12871 · Published 2026-06-11 · By Jingxuan Han, Wei Liu, Mingyang Zhu, Youpeng Wang, Ziwen Wang, Lin Qiu et al.

TL;DR

This paper addresses the challenge of evaluating Search Agents (SAs) that leverage large language models for complex, open-ended daily information-seeking tasks. Existing benchmarks focus on specialized or static domain-specific problems, which do not reflect the evolving, broad, and practical information demands of everyday users. To bridge this gap, the authors propose DailyReport—a large-scale, open-ended benchmark consisting of 150 real-world daily search tasks with 3,546 fine-grained evaluation rubrics. Tasks are decomposed into subtasks and evaluated with novel cascade rubrics along disentangled dimensions including instruction following, factuality, and rationality. The benchmark also incorporates a user-centric cascade performance attribution and aggregation to produce interpretable dimension-wise scores and a user preference score.

The authors evaluate 17 state-of-the-art agentic systems spanning native deep research agents, large language models augmented with search tools, and LLMs with agentic code frameworks. Results reveal that while current agents demonstrate strong instruction-following ability (near 0.96 average score), they struggle with factuality and rationality, especially on open-ended, analysis-centric tasks. User preference scores for all systems fall below the acceptable threshold, highlighting a significant gap between generated responses and real user expectations. The work contributes a dynamic, up-to-date, and user-grounded benchmark, along with a detailed evaluation methodology enabling diagnostic performance attribution for future research.

Key findings

DailyReport consists of 150 open-ended daily search tasks and 3,546 associated rubrics spanning 10 high-level domains and 35 fine-grained categories.
Evaluation decomposes tasks into subtasks assessed along three disentangled dimensions: instruction following, factuality, and rationality using cascade rubrics.
Instruction following scores are relatively high across 17 agentic systems (up to ~0.98), but factuality scores remain low (often below 0.75) and rationality scores are intermediate (~0.8).
User preference scores range from 2.17 to 2.89 (out of 4), with no system reaching the acceptable user satisfaction threshold of 3.
LLMs augmented with search tools outperform native deep research agents and LLMs running agentic code frameworks on DailyReport tasks (e.g., GPT 5.4 with search achieves UserPref 2.89).
Analysis-centric tasks show better instruction following and rationality but lower factuality compared to retrieval-centric tasks (Figure 3).
Trace analysis reveals search tool usage correlates strongly with performance; however, referencing quality remains insufficient since Reference_Support scores lag behind Reference_Ratio.
Robustness tests show low variance in scores across multiple evaluation runs with different judge LLMs, confirming the benchmark’s practical reliability.

Threat model

n/a — The paper does not focus on security or adversarial threat models but rather on benchmarking the functional capabilities of search agents in realistic user settings without malicious adversaries.

Methodology — deep read

Threat Model & Assumptions: The authors consider an adversarial-free setting focusing on measuring the capabilities of search agents to fulfill real-world daily user information needs reliably. The adversary is not explicitly modeled; rather, the evaluation assumes tasks derived from authentic user queries without intentional manipulation.
Data: The benchmark contains 150 open-ended tasks designed by experts based on trending topics collected from major social media platforms such as Facebook, Reddit, Twitter, Weibo, Xiaohongshu, and Zhihu. These tasks cover 10 broad domains and 35 subcategories, containing 3,546 rubrics annotated through a hybrid workflow combining human experts and large language models. Tasks are split as 100 retrieval-centric and 50 analysis-centric to reflect different cognitive and information demands.
Architecture / Algorithm: The key methodology innovation lies in the evaluation framework. Each task is decomposed into atomic subtasks grounded in user constraints ensuring traceability. The evaluation uses cascade rubrics applied along three orthogonal dimensions: instruction following (objective), factuality (verified via web search), and rationality (subjective logical coherence). Judgments are assigned from adapted judge LLMs trained or prompted to independently score each dimension per subtask. The cascade rubric structure respects hierarchical dependencies; for example, factuality is only evaluated if instruction following is satisfied.
Training Regime: Not applicable as this is an evaluation benchmark paper; however, the authors mention using Gemini-3-flash as the judge model for scoring, selected after a meta-evaluation comparing judge LLM accuracy and cost against human expert annotations.
Evaluation Protocol: Each SA’s response to each subtask is scored on the three dimensions and aggregated using a cascade attribution formula weighting subtasks based on user-perceived importance. This produces dimension-wise scores and an aggregated user preference score (1 to 4 scale representing unhelpful to perfect). The approach enables interpretable performance attribution and highlights critical failure points. The benchmark evaluates 17 diverse state-of-the-art systems grouped as native deep research agents, LLMs with search tools, and LLMs with agentic code frameworks.
Reproducibility: The authors release the dataset and code publicly on GitHub. The tasks include detailed annotations and rubrics to facilitate reproducibility. However, some evaluated SAs are proprietary models, limiting exact replication of all results. The judge model version and prompt details are provided for evaluator replication.

Concrete Example: For a task like "List the Chinese universities in the 2026 QS Top 100 rankings and analyze their strengths and weaknesses," the evaluation first decomposes the task into subtasks (e.g., listing universities, analyzing strengths, analyzing weaknesses). Instruction following checks if the response fully covers the list. If yes, factuality verifies university names and ranking claims via web search. Finally, rationality judges the logical consistency of the analysis. Each subtask’s rubric score cascades into the overall dimension scores, weighted by subtask importance derived from user preference ablations.

Technical innovations

A novel user-centric cascade evaluation pipeline that decomposes open-ended tasks into subtasks and applies hierarchical rubrics along disentangled dimensions (instruction following, factuality, rationality).
Design of cascade performance attribution and user-centric aggregation algorithms that incorporate subtask importance to produce interpretable dimension-wise scores and user preference metrics.
A large-scale, continuously evolving benchmark dataset capturing daily real-world user search needs from trending multi-platform topics, covering broad domains and task types not seen in prior specialized benchmarks.
Use of judge LLMs (e.g., Gemini-3-flash) for rubric assessment incorporating web search verification, improving evaluation automation while maintaining alignment with expert human annotations.

Datasets

DailyReport — 150 tasks with 3,546 rubrics — constructed from trending topics and user comments from platforms including Facebook, Reddit, Twitter, Weibo, Xiaohongshu, Zhihu (public release on GitHub).

Baselines vs proposed

Native Deep Research Agents (e.g., OpenAI o3 Deep Research): UserPref = 2.42, Instruction Following = 0.967, Factuality = 0.616, Rationality = 0.856
LLMs with Search Tools (e.g., GPT 5.4): UserPref = 2.89, Instruction Following = 0.982, Factuality = 0.835, Rationality = 0.930
LLMs with Claude Code (e.g., CC-GPT 5.4): UserPref = 2.87, Instruction Following = 0.989, Factuality = 0.813, Rationality = 0.933
Across systems, UserPref scores < 3 (acceptable threshold), indicating all fall short of user expectations
Analysis-centric tasks yield higher instruction following (+0.03 avg) and rationality (+0.02 avg) but lower factuality (-0.04 avg) compared to retrieval-centric tasks (Figure 3)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.12871.

Fig 1: DailyReport structure. We construct daily search tasks and cascade rubrics for evaluating

Fig 2: provides the detailed task characteristics of DailyReport. Compared with existing researches,

Fig 3 (page 2).

Fig 4 (page 4).

Limitations

Evaluation relies on judge language models whose scoring may not perfectly correlate with human judgment despite high reported accuracy, especially on subjective metrics.
Benchmark tasks are fixed at 150, not covering all possible domains or evolving topics beyond current trending data.
User preference scores aggregate multiple metrics but still simplify complex user satisfaction into four levels, potentially losing granularity.
Absence of explicit adversarial or robustness testing against malicious agent behaviors or deliberate misinformation generation.
Limited analysis of latency, computational cost, or scalability of evaluated agents in practical deployment scenarios.
Evaluation primarily assesses textual report quality, lacking multimodal or interactive agent capacities.

Open questions / follow-ons

How can future search agents improve factuality given the high error rates even when references are provided?
What advanced techniques can better integrate real-time web retrieval with reasoning to handle evolving and conflicting information robustly?
Can evaluation frameworks extend to interactive, multi-turn search dialogs or multimodal information synthesis?
How to design agents and benchmarks to be robust against adversarial misinformation or deceptive content sources?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, DailyReport offers a framework to rigorously evaluate the reasoning and factual grounding capabilities of autonomous search agents in open-ended, real-world information tasks. Understanding the limitations in factuality and rationality identified by this benchmark can inform the development of detection mechanisms distinguishing between genuine human-like search behavior and bot-generated outputs that may hallucinate or misrepresent facts. The cascade rubric methodology and user-centric aggregation could be adapted to evaluate the trustworthiness and alignment of agent responses, crucial for preventing automated misinformation spread or deceptive interactions. Furthermore, analyzing search-tool invocation patterns and reference quality metrics offers insights on operational signals that bot-detection systems might monitor to assess behavioral authenticity beyond superficial language fluency.

Cite

bibtex

@article{arxiv2606_12871,
  title={ DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks },
  author={ Jingxuan Han and Wei Liu and Mingyang Zhu and Youpeng Wang and Ziwen Wang and Lin Qiu and Xuezhi Cao and Xunliang Cai and Zheren Fu and Licheng Zhang and Zhendong Mao },
  journal={arXiv preprint arXiv:2606.12871},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.12871}
}

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​