Towards Self-Evolving Agentic Literature Retrieval

Source: arXiv:2605.14306 · Published 2026-05-14 · By Yuwen Du, Tian Jin, Jing Kang, Xianghe Pang, Jingyi Chai, Tingjia Miao et al.

TL;DR

This paper addresses two linked challenges in scientific literature retrieval: understanding complex, multi-constraint research intents in depth, and guaranteeing that retrieved sources are authentic and verifiable. Traditional keyword-based systems fail to capture nuanced intents, while generative large language models (LLMs) understand semantics better but incur high computational costs and can return hallucinated or fabricated papers. To overcome these trade-offs, the authors propose PaSaMaster, a self-evolving agentic retrieval system that iteratively refines search intent from ranked evidence, treats retrieval as an intent–paper relevance ranking problem to eliminate hallucinations, and reserves computationally expensive intent planning for a frontier LLM while delegating large-scale retrieval and scoring to lightweight models operating on verified corpora. Evaluated on PaSaMaster-Bench, a new benchmark of 244 complex literature search tasks spanning 38 scientific disciplines, PaSaMaster outperforms existing keyword-based, semantic, generative, and fixed-pipeline agentic retrieval methods. It improves F1-score over Google Scholar by more than 15× and exceeds GPT-5.2, a state-of-the-art generative LLM, by 30% in relative F1-score, while maintaining zero hallucination and using about 1% of GPT-5.2’s cost. This demonstrates the promise of self-evolving, evidence-grounded retrieval as a scalable and trustworthy approach to agentic AI-assisted scientific knowledge discovery.

Key findings

PaSaMaster improves F1-score@20 by 15.6× over keyword-based Google Scholar (1.39% to 21.69%) on PaSaMaster-Bench.
PaSaMaster achieves 0% source hallucination rate, while generative LLM baselines exhibit hallucination rates up to 37.79% (MiniMax-M2.7).
PaSaMaster outperforms the generative LLM GPT-5.2 by 30.0% in relative F1-score (21.69% vs 16.69%) while using about 1% of GPT-5.2’s token cost ($0.05 vs $6.06).
PaSaMaster attains the highest Recall@20 (31.84%), Precision@20 (22.19%), and NDCG@20 (37.93%), outperforming semantic retrieval (Bohrium, NDCG 22.39%) and the fixed-pipeline agentic Google Scholar Labs (F1 18.87%).
Generative LLM baselines hallucinate across citation fields (title, author, date, and link errors), which undermines source reliability.
PaSaMaster’s self-evolving retrieval iteratively refines intents using ranked evidence, improving coverage and alignment with complex constraints.
The cost-efficient separation architecture confines expensive frontier-LLM usage to planning and hands large-scale retrieval and scoring to lightweight models, keeping the overall cost low.
PaSaMaster-Bench covers 244 expert-curated complex queries with strict constraint checklists spanning 38 disciplines, emphasizing realistic multi-constraint intents.
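
The headline ratios above can be re-derived from the reported figures. The short check below is only arithmetic on numbers quoted in this summary; the helper names are illustrative and do not come from the paper.

```python
# Figures copied from the findings above: F1@20 in percent, cost in US dollars.
f1 = {"PaSaMaster": 21.69, "Google Scholar": 1.39, "GPT-5.2": 16.69}
cost_usd = {"PaSaMaster": 0.05, "GPT-5.2": 6.06}


def fold_improvement(new: float, old: float) -> float:
    """How many times larger `new` is than `old`."""
    return new / old


def relative_gain_pct(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


scholar_fold = fold_improvement(f1["PaSaMaster"], f1["Google Scholar"])
gpt_gain = relative_gain_pct(f1["PaSaMaster"], f1["GPT-5.2"])
cost_share = cost_usd["PaSaMaster"] / cost_usd["GPT-5.2"] * 100.0

print(f"{scholar_fold:.1f}x")  # 15.6x
print(f"{gpt_gain:.1f}%")      # 30.0%
print(f"{cost_share:.2f}%")    # 0.83%
```

The 30.0% figure is a relative gain; the absolute gap to GPT-5.2 is 5.00 F1 points, and the "1%" cost claim rounds up from roughly 0.83%.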

Threat model

The paper does not define a formal adversary. It assumes that no one injects fabricated or malicious papers into the verified scientific corpus and that verification tooling and knowledge repositories are not directly manipulated. The threat it targets is hallucinated or fabricated retrieval results produced by LLM generation, and the defense is to return only real, indexed papers.

Methodology — deep read

Threat Model & Assumptions: PaSaMaster assumes that the underlying indexed scientific corpora are not manipulated. No adversary is explicitly modeled; the system aims to prevent hallucinated (fabricated) papers by relying solely on verified corpora and evidence extraction. Its two goals are source authenticity and a deep understanding of complex research intents expressed as natural-language queries.
Data: The PaSaMaster-Bench benchmark was constructed from 244 expert-curated literature search tasks spanning 38 scientific disciplines. Each query reflects a real complex research bottleneck, decomposed into explicit constraint checklists specifying verifiable criteria for retrieved papers. Multiple retrieval systems (web-enabled frontier LLMs, PaSaMaster retrieval, traditional web search) provide candidate papers. Domain experts annotate candidates against the checklists, admitting only perfectly satisfying papers as ground truth.
Architecture/Algorithm: PaSaMaster consists of two main agent roles, a Navigator and a Librarian swarm, operating over a customized scientific corpus structured in three tiers: metadata (D_meta), abstracts (D_abs), and passage-level evidence chunks (D_chunk). Given a query q, the Navigator uses a frontier LLM policy π_Nav to generate a retrieval strategy S and a verification checklist C that encodes concrete relevance requirements. The Librarian agents {π_Lib} retrieve candidates using multiple retrieval tools, then verify them by extracting evidence passages and applying a trained relevance Scorer model that outputs checklist-level scores and evidence-grounded rationales. Final paper relevance scores combine the averaged checklist scores with a calibrated confidence, and the scored candidates are then reranked listwise.
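
The score aggregation step can be sketched as follows. The summary only says that averaged checklist scores and a calibrated confidence are "combined"; the product used here is an assumption made for illustration, and `ScoredPaper`, `relevance`, and `rank` are hypothetical names, not the authors' API. The listwise reranking stage is not modeled.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScoredPaper:
    paper_id: str
    checklist_scores: list[float]  # one Scorer output in [0, 1] per checklist item
    confidence: float              # calibrated confidence in [0, 1]


def relevance(paper: ScoredPaper) -> float:
    # ASSUMPTION: combine by multiplication; the paper's exact rule is not given here.
    return mean(paper.checklist_scores) * paper.confidence


def rank(papers: list[ScoredPaper], top_k: int = 20) -> list[ScoredPaper]:
    """Sort candidates by the combined score, best first, and keep the top_k."""
    return sorted(papers, key=relevance, reverse=True)[:top_k]


candidates = [
    ScoredPaper("B", [1.0, 1.0, 1.0], 0.60),
    ScoredPaper("C", [0.2, 0.4, 0.0], 0.95),
    ScoredPaper("A", [1.0, 0.5, 1.0], 0.90),
]
ranked_ids = [p.paper_id for p in rank(candidates)]
print(ranked_ids)  # ['A', 'B', 'C']
```

Under this assumed rule, a paper that satisfies every checklist item with low confidence ("B") can rank below one that satisfies most items with high confidence ("A"), which is one reason the real combination rule matters.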

A key innovation is iterative self-evolving retrieval: after each retrieval round t, the Navigator reflects on the ranked results P_scored^(t), identifies intent coverage gaps or ambiguities, and updates S^(t+1) and C^(t+1) for the next round, progressively refining its understanding of the research intent.
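
A minimal control-flow sketch of this loop is shown below. The three callables stand in for the Navigator's planning, the Librarians' retrieval and scoring, and the Navigator's reflection; their names and signatures are hypothetical. The stopping rule (plan unchanged, or a round budget) and the choice to keep each paper's best score across rounds are assumptions, since the summary only says the process runs "until convergence".

```python
from typing import Callable, Dict, List, Tuple

Plan = Tuple[List[str], List[str]]  # (retrieval strategy S, checklist C)
Ranked = List[Tuple[str, float]]    # (paper id, score), best first


def self_evolving_search(
    query: str,
    plan_fn: Callable[[str], Plan],
    retrieve_and_score_fn: Callable[[List[str], List[str]], Ranked],
    reflect_fn: Callable[[str, Ranked, Plan], Plan],
    max_rounds: int = 3,  # ASSUMPTION: fixed round budget
) -> Ranked:
    strategy, checklist = plan_fn(query)
    best: Dict[str, float] = {}
    for _ in range(max_rounds):
        ranked = retrieve_and_score_fn(strategy, checklist)
        for paper_id, score in ranked:
            # ASSUMPTION: keep the best score seen for each paper across rounds.
            best[paper_id] = max(score, best.get(paper_id, 0.0))
        new_plan = reflect_fn(query, ranked, (strategy, checklist))
        if new_plan == (strategy, checklist):  # ASSUMPTION: unchanged plan = converged
            break
        strategy, checklist = new_plan
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins so the sketch runs end to end.
toy_index = {
    "kw:graphs": [("p1", 0.4)],
    "kw:graph neural networks": [("p2", 0.9), ("p1", 0.5)],
}


def toy_plan(query: str) -> Plan:
    return (["kw:" + query], ["mentions " + query])


def toy_retrieve(strategy: List[str], checklist: List[str]) -> Ranked:
    hits = [hit for term in strategy for hit in toy_index.get(term, [])]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def toy_reflect(query: str, ranked: Ranked, plan: Plan) -> Plan:
    strategy, checklist = plan
    refined = "kw:graph neural networks"
    if ranked and ranked[0][1] < 0.8 and refined not in strategy:
        return (strategy + [refined], checklist)
    return plan


result = self_evolving_search("graphs", toy_plan, toy_retrieve, toy_reflect)
print(result)  # [('p2', 0.9), ('p1', 0.5)]
```

In the toy run, the first round finds only a weak match, the reflection step widens the strategy, and the second round surfaces a stronger paper before the plan stops changing.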

Training Regime: The lightweight Scorer model used by the Librarians is trained via knowledge distillation. Synthetic natural-language queries are generated from large multidisciplinary paper clusters, and the full PaSaMaster system produces noisy candidate sets for these queries. A stronger teacher model then annotates query–paper pairs with checklist satisfaction scores and evidence-backed rationales, which lets the Scorer learn structured verification at lower inference cost. The paper does not fully specify training hyperparameters, epochs, or optimization details.
Evaluation Protocol: Systems are evaluated on PaSaMaster-Bench tasks by comparing their top-20 returned paper lists against expert-verified ground truth sets, using Recall@20, Precision@20, F1@20, and NDCG@20. Additionally, hallucination rates (proportion of fabricated or unverifiable papers) and computational cost (token usage and dollar cost) are measured. Baselines include lexical retrieval (Google Scholar), semantic retrieval (OpenScholar, Bohrium), generative LLM agents (DeepSeek, Kimi, MiniMax, GLM-5, Gemini, GPT-5.2), and fixed-pipeline agentic systems (Google Scholar Labs).
Reproducibility: The authors released code and benchmark data at https://github.com/sjtu-sai-agents/PaSaMaster, facilitating reproduction of main results. The customized corpus and annotated PaSaMaster-Bench represent novel contributions.
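
The evaluation metrics named above can be sketched for a single query as follows. This uses binary relevance (a returned paper is either in the expert ground-truth set or not), divides precision by the number of papers actually returned, and assumes returned ids are unique; the benchmark's exact definitions may differ, and the function names are illustrative.

```python
import math
from typing import Dict, Sequence, Set


def metrics_at_k(returned: Sequence[str], truth: Set[str], k: int = 20) -> Dict[str, float]:
    """Recall@k, Precision@k, F1@k, and binary-relevance NDCG@k for one query."""
    top = list(returned)[:k]
    hits = [1 if paper_id in truth else 0 for paper_id in top]
    n_hit = sum(hits)
    precision = n_hit / len(top) if top else 0.0
    recall = n_hit / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_hit else 0.0
    # DCG discounts a hit at 0-based rank i by log2(i + 2).
    dcg = sum(hit / math.log2(i + 2) for i, hit in enumerate(hits))
    ideal = sum(1 / math.log2(i + 2) for i in range(min(k, len(truth))))
    ndcg = dcg / ideal if ideal else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "ndcg": ndcg}


def hallucination_rate(returned: Sequence[str], verified_corpus: Set[str]) -> float:
    """Share of returned papers that cannot be found in a verified corpus."""
    if not returned:
        return 0.0
    missing = sum(1 for paper_id in returned if paper_id not in verified_corpus)
    return missing / len(returned)


returned = ["a", "x", "b", "y"]
m = metrics_at_k(returned, truth={"a", "b", "c"})
h = hallucination_rate(returned, verified_corpus={"a", "b", "c", "x"})
print(m["precision"], h)  # 0.5 0.25
```

In the example, "x" is a real but irrelevant paper, so it lowers precision without counting as a hallucination, while "y" is absent from the verified corpus and does.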

Example end-to-end: Given a complex query, the Navigator agent analyzes the language and generates an initial retrieval strategy and checklist. Librarians retrieve candidate papers using semantic retrieval, citation network expansion, and web-to-repository verification. For each candidate, passage-level evidence supporting individual checklist criteria is extracted. The Scorer assigns satisfaction scores and provides evidence-grounded rationales. Results are ranked and fed back to the Navigator, which updates the query interpretation and retrieval plan. This iterative process continues until convergence, yielding a final ranked list of verifiable, relevant scientific papers aligned with the query's constraints.

Technical innovations

A self-evolving retrieval framework where ranked evidence iteratively refines search intents for complex multi-constraint natural-language queries, unlike fixed one-shot queries.
Hallucination-free literature discovery formulated as intent–paper relevance ranking grounded in verified corpora with explicit evidence passages, rather than generative citation synthesis.
Separation of computationally expensive frontier LLM-based intent understanding from large-scale retrieval/scoring, enabling cost-efficient scalability with lightweight verifier models.
Construction of PaSaMaster-Bench, a multidisciplinary benchmark with 244 expert-curated complex queries and strict constraint checklists, providing rigorous evaluation of agentic retrieval.
Design and distillation of a lightweight evidence-grounded Scorer trained on noisy system outputs and teacher annotations to verify candidate papers at scale.

Datasets

PaSaMaster-Bench — 244 complex literature retrieval queries with expert-annotated ground-truth paper sets — curated by authors
Customized scientific corpora D comprising over 160M papers restructured into metadata, abstracts, and passage-level evidence chunks — internal corpus

Baselines vs proposed

Google Scholar (Lexical Retrieval): F1-score@20 = 1.39% vs PaSaMaster 21.69%
OpenScholar (Semantic Retrieval): F1-score@20 = 7.92% vs PaSaMaster 21.69%
Bohrium Science Navigator: F1-score@20 = 12.26% vs PaSaMaster 21.69%
DeepSeek-v3.2 (Generative LLM): F1-score@20 = 15.56%, hallucination 20.57%, cost $0.28 vs PaSaMaster 21.69%, 0 hallucination, $0.05
Kimi-K2.5 (Generative LLM): F1-score@20 = 17.36%, hallucination 35.67%, cost $0.16 vs PaSaMaster 21.69%, 0 hallucination, $0.05
MiniMax-M2.7: F1-score@20 = 15.11%, hallucination 37.79%, cost $0.18 vs PaSaMaster 21.69%, 0 hallucination, $0.05
GLM-5: F1-score@20 = 18.18%, hallucination 29.07%, cost $0.56 vs PaSaMaster 21.69%, 0 hallucination, $0.05
Gemini-3.1-pro: F1-score@20 = 12.48%, hallucination 32.41%, cost $0.38 vs PaSaMaster 21.69%, 0 hallucination, $0.05
GPT-5.2: F1-score@20 = 16.69%, hallucination 11.80%, cost $6.06 vs PaSaMaster 21.69%, 0 hallucination, $0.05
Google Scholar Labs (Fixed-Pipeline Agentic): F1-score@20 = 18.87%, hallucination 0% vs PaSaMaster 21.69%
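
One way to read the generative-LLM rows together is as a dominance check across the three reported axes. The sketch below uses only the figures listed above; the `Row` type and `dominates` helper are illustrative. Google Scholar Labs and the lexical and semantic baselines are left out because no cost is listed for them here.

```python
from typing import NamedTuple


class Row(NamedTuple):
    f1: float             # F1-score@20, percent (higher is better)
    hallucination: float  # hallucination rate, percent (lower is better)
    cost_usd: float       # reported cost in US dollars (lower is better)


generative_baselines = {
    "DeepSeek-v3.2": Row(15.56, 20.57, 0.28),
    "Kimi-K2.5": Row(17.36, 35.67, 0.16),
    "MiniMax-M2.7": Row(15.11, 37.79, 0.18),
    "GLM-5": Row(18.18, 29.07, 0.56),
    "Gemini-3.1-pro": Row(12.48, 32.41, 0.38),
    "GPT-5.2": Row(16.69, 11.80, 6.06),
}
pasamaster = Row(21.69, 0.0, 0.05)


def dominates(a: Row, b: Row) -> bool:
    """True if `a` is no worse than `b` on every axis and strictly better on one."""
    no_worse = (a.f1 >= b.f1 and a.hallucination <= b.hallucination
                and a.cost_usd <= b.cost_usd)
    strictly_better = (a.f1 > b.f1 or a.hallucination < b.hallucination
                       or a.cost_usd < b.cost_usd)
    return no_worse and strictly_better


dominated = [name for name, row in generative_baselines.items()
             if dominates(pasamaster, row)]
print(len(dominated), "of", len(generative_baselines))  # 6 of 6
```

On these reported numbers, PaSaMaster is better on F1, hallucination rate, and cost than every listed generative baseline at once, rather than trading one axis for another.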

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.14306.

Fig 1: Overview of PaSaMaster. PaSaMaster is a self-evolving agentic literature retrieval system that separates intent-aware planning

Fig 2: Overview of the PaSaMaster-Bench data curation pipeline, comprising three stages. (1) Question Generation: domain

Fig 3 (page 3).

Fig 4 (page 3).

Limitations

The paper does not report robustness under adversarial attacks or deliberate retrieval poisoning.
Details of Scorer training hyperparameters, architecture, and dataset construction are limited, potentially affecting reproducibility.
PaSaMaster-Bench, while multidisciplinary, covers only 244 queries; scalability to larger or evolving scientific corpora is untested.
Evaluation focuses on complex queries but not on user interaction latency or experience in live retrieval.
The reliance on high-quality customized corpora and engineered checklists may limit generalization to open-domain web-scale retrieval.

Open questions / follow-ons

Can the self-evolving approach extend to open-domain web retrieval beyond curated scientific corpora while preserving authenticity?
How does PaSaMaster perform under evolving or noisy corpora where document metadata or texts may contain inaccuracies?
What are the trade-offs in latency and user experience when running multiple retrieval rounds in practice?
Can adversarial attacks on the Scorer model or the Navigator agent (e.g., query perturbations) undermine the hallucination-free guarantee?

Why it matters for bot defense

From a bot-defense and CAPTCHA perspective, PaSaMaster's approach to guaranteeing output authenticity and preventing hallucinated (fabricated) sources is analogous to preventing automated agents from spoofing or fabricating identity or proof within a system. Its evidence-grounded verification, iterative refinement, and separation of planning from execution could inspire verification and adaptive challenge mechanisms in CAPTCHA or bot-defense frameworks that must check the authenticity and intent coherence of human users versus automated agents. The explicit grounding in verified evidence parallels CAPTCHA’s role of requiring verifiable human inputs rather than probabilistic guesses. Moreover, the benchmark’s rigorous evaluation on complex multi-constraint queries could inform the testing of bot-defense systems with more realistic adversarial queries and adaptive interaction flows rather than static challenges.

Cite

bibtex

@article{arxiv2605_14306,
  title={Towards Self-Evolving Agentic Literature Retrieval},
  author={Yuwen Du and Tian Jin and Jing Kang and Xianghe Pang and Jingyi Chai and Tingjia Miao and Fenyi Liu and WenHao Wang and Sikai Yao and Yuzhi Zhang and Siheng Chen},
  journal={arXiv preprint arXiv:2605.14306},
  year={2026},
  url={https://arxiv.org/abs/2605.14306}
}
