Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning

Source: arXiv:2605.25920 · Published 2026-05-25 · By Wei Fan, Yining Zhou, Mufan Zhang, Yanbing Weng, Yiran HU, Tianshi Zheng et al.

TL;DR

This paper addresses an overlooked problem in legal AI: the temporal consistency of legal reasoning when large language models (LLMs) are augmented with agentic search. Laws are amended periodically, and applying a statute outside the period in which it was in force leads to incorrect legal conclusions and violates the principle of lex retro non agit (the law does not act retroactively). The authors observe that existing legal LLMs and search agents suffer from temporal bias linked to their training cutoff dates and rarely include explicit temporal constraints in their retrieval queries. Moreover, web search alone often fails to retrieve the precise article-level statutory provisions needed for accurate reasoning.

To address these challenges, the paper proposes LegalSearch-R1, a reinforcement learning framework that jointly trains a 7B-parameter agent on temporally indexed legal data spanning multiple amendment versions. The agent combines a temporally aware local retrieval-augmented generation (RAG) module over a curated statute corpus with online web search for broader legal commentary. Retrieved candidates are filtered by time so that only provisions applicable in the query’s reference year are retained. RL training uses token-level GRPO with entropy-based advantage shaping to accelerate the learning of temporal query planning. On a new benchmark covering 13 legal tasks, LegalSearch-R1 significantly outperforms state-of-the-art deep research agents and specialized legal LLMs, excelling especially at temporal consistency and out-of-domain generalization.

This work demonstrates the critical importance and feasibility of explicitly modeling and enforcing temporal boundaries in legal agentic search, moving beyond static parametric knowledge or untargeted retrieval. It also showcases the value of coupling precise statute retrieval with flexible web search under RL, yielding a system better aligned with legal principles and real-world reasoning requirements.

Key findings

LegalSearch-R1 improves temporal consistency by 57.7% to 80.3% over strong deep research baselines on temporally indexed queries.
LegalSearch-R1 outperforms state-of-the-art legal LLMs and deep research agents by 12.9% to 29.8% averaged over 13 legal reasoning tasks.
On the Legal Article Recitation (LAR) task, which requires verbatim recall of statute text, LegalSearch-R1 achieves 96.73 ROUGE-L versus 63.30 for the best baseline (DeepSeek-V3.2), a gain of about 33 points.
The RL agent increases temporally-aware search queries by 121% to 454% compared to baselines, learning to proactively embed temporal constraints without explicit query supervision.
Ablation removing the temporal-enhanced RAG module reduces training reward at convergence from 0.561 to 0.469, showing the critical impact of local, temporally-filtered statute retrieval.
LegalSearch-R1 with 7B parameters matches or outperforms models with 4x to 30x more parameters (e.g., Qwen-3 30B, DeepSeek-V3.2) on out-of-domain professional exam tasks.
Baseline legal LLMs and agentic search systems show strong performance degradation on statute versions before or after their training cutoff years, confirming temporal bias.

Threat model

The threat model assumes an adversary capable of probing a legal reasoning system with queries tied to specific temporal contexts. The adversary cannot influence statute content but exploits model temporal biases to elicit incorrect legal conclusions by leveraging outdated or future statutes. The system defends by enforcing temporal filtering and learning to embed temporal context in query formulation, mitigating time-based knowledge drift.

Methodology — deep read

Threat Model & Assumptions: The setting is a legal AI system that must ground its answers in the statute version valid at the time in question. The core limitations are that a model trained up to a cutoff date lacks later legal knowledge, and that retrieval agents do not incorporate temporal constraints, so they misapply statutes that were not in force at the relevant time. The system’s goal is to match the temporal context (year) of a query to the applicable statute edition, following the legal principle of lex retro non agit.
Data: The authors curate a temporally-indexed benchmark consisting of 13 legal tasks, drawn from LawBench, LexEval, DISC-LawEval, and expert annotations. The dataset contains both in-domain tasks with training and test splits (3,584 training samples, 896 in-domain test) and six out-of-domain professional exam tasks (768 test samples). Questions explicitly embed temporal context (case year or statute effective year). The benchmark includes Legal Article Recitation (LAR), with crowdsourced annotations of 16 statutes across 13 amendment versions from 2000 to 2025.
Architecture / Algorithm: LegalSearch-R1 uses a 7B-parameter Qwen-based backbone. The agent follows a multi-turn reasoning loop under the ReAct framework, producing intermediate thoughts (<think>), plans (<plan>), tool calls (<tool_call>), and answers (<answer>). It has three external tools:

web_search: accesses an online API restricted to legal domains.
browse_webpage: extracts legal content verbatim, preserving temporal metadata.
rag_retrieve: temporally enhanced local RAG retrieval over a curated Chinese statute corpus indexed by amendment periods.
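
A minimal sketch of how one turn of such a tagged loop could be parsed and dispatched. The tag names and tool names are taken from the description above; the JSON payload inside the tool-call tag, the `parse_turn` and `run_turn` helpers, and the callables in `tools` are assumptions made for illustration, not the paper's schema.

```python
import json
import re

# Tag names follow the summary; everything else here is illustrative.
TAG_PATTERN = re.compile(r"<(think|plan|tool_call|answer)>(.*?)</\1>", re.DOTALL)


def parse_turn(text):
    """Return (tag, content) pairs in the order they appear in a model turn."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(text)]


def run_turn(text, tools):
    """Dispatch the first tool call in a turn, or return the final answer.

    `tools` maps a tool name to a Python callable (hypothetical stand-ins
    for web_search, browse_webpage, and rag_retrieve).
    """
    for tag, content in parse_turn(text):
        if tag == "tool_call":
            # Assumed payload shape: {"name": ..., "arguments": {...}}
            call = json.loads(content)
            return "observation", tools[call["name"]](**call["arguments"])
        if tag == "answer":
            return "answer", content
    return "none", None
```

In a full loop, an "observation" result would be appended to the context and the model queried again until an "answer" is produced or a turn limit is reached.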

The RAG module extracts temporal references and legal keywords from the query using an LLM analyzer, then applies temporal filtering so that only statute versions valid in the query year remain as candidates. The surviving candidates are scored by a weighted fusion of exact keyword matching, dense vector retrieval (FAISS), and BM25 sparse scores, combined using Reciprocal Rank Fusion. Filtering before ranking means an out-of-period version cannot be returned, however well it matches the query text.
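
The filter-then-fuse order described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `StatuteVersion` record, the `retrieve` helper, the ranker callables, and the article ids are hypothetical, and `k=60` is the conventional Reciprocal Rank Fusion constant rather than a value reported in the summary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StatuteVersion:
    article_id: str          # illustrative id scheme, e.g. "art-72"
    valid_from: int          # first year this wording is in force
    valid_to: Optional[int]  # last year in force; None means still current
    text: str


def valid_in(version, year):
    """True if this amendment version is in force in `year`."""
    return version.valid_from <= year and (
        version.valid_to is None or year <= version.valid_to
    )


def rrf_fuse(rankings, weights=None, k=60):
    """Weighted Reciprocal Rank Fusion: score(d) = sum_r w_r / (k + rank_r(d))."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for weight, ranking in zip(weights, rankings):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Sort by descending score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve(corpus, year, rankers, query, weights=None, top_k=3):
    """Filter by validity year first, then fuse each ranker's ordering."""
    candidates = [v for v in corpus if valid_in(v, year)]
    by_id = {v.article_id: v for v in candidates}
    # Each ranker is assumed to return article ids, best first.
    rankings = [ranker(query, candidates) for ranker in rankers]
    return [by_id[i] for i in rrf_fuse(rankings, weights)[:top_k]]
```

Here each ranker stands in for one signal (keyword match, dense retrieval, BM25). The `by_id` lookup assumes validity periods for the same article do not overlap, so at most one version per article survives the filter.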

Training Regime: The agent is trained end-to-end via token-level Group Relative Policy Optimization (GRPO), a PPO variant, maximizing a reward that balances syntactic correctness and task accuracy. An entropy-based advantage shaping term focuses learning on planning tokens that embed temporal constraints, accelerating the emergence of temporally aware query formulation. Training proceeds over 112 steps with rollouts generating multi-turn legal reasoning and retrieval actions. Hyperparameters include shaping coefficient α and clipping parameter ϵ; precise values are in the appendix.
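
To make the advantage computation concrete, here is a small sketch of group-relative advantages with an additive entropy bonus. The group normalization is the standard GRPO step. The shaping term is an assumed form (a bonus proportional to token entropy, capped so it cannot flip the advantage's sign); the paper's exact shaping term and the values of `alpha` and `kappa` are not given in this summary, so the defaults below are placeholders.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def shape_token_advantages(advantage, token_entropies, alpha=0.4, kappa=2.0):
    """Broadcast one rollout-level advantage to its tokens and add an entropy bonus.

    Assumed bonus: min(alpha * H_t, |A| / kappa). It grows with the policy's
    entropy at token t but stays below |A| when kappa > 1, so the shaped
    advantage keeps the sign of A. Entropies are treated as constants here,
    i.e. no gradient would flow through them in a real training loop.
    """
    cap = abs(advantage) / kappa
    return [advantage + min(alpha * h, cap) for h in token_entropies]
```

For a group of rewards such as `[1.0, 0.0, 0.0, 1.0]`, the successful rollouts get an advantage near +1 and the failed ones near -1; high-entropy tokens inside a rollout then receive a somewhat larger shaped value than low-entropy ones.
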
Evaluation Protocol: The evaluation covers 13 distinct legal tasks characterized by different answer forms. Metrics include ROUGE-L for Legal Article Recitation (LAR), binary LLM judge scores for legal consultation correctness, and accuracy for multiple-choice style tasks. The benchmark contains both in-domain and out-of-domain tests to assess generalization. Baselines span legal LLMs (DISC-LawLLM, LegalDelta), vanilla LLMs (Qwen series, GPT-4o-mini), large reasoning models, and recent deep research agents (Search-R1, DeepResearcher, DeepPlanner). Ablations test the effect of disabling the temporal RAG retrieval module.
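
For readers unfamiliar with the LAR metric, the sketch below computes a ROUGE-L F-score from the longest common subsequence (LCS). It tokenizes by character, which is a common choice for Chinese text but an assumption here; the paper's tokenization and `beta` setting are not stated in this summary.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score over character tokens, in the range [0, 1]."""
    cand, ref = list(candidate), list(reference)
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

Scores such as 96.73 correspond to this value multiplied by 100. Because LCS rewards in-order overlap, reciting a different amendment version of the same article is penalized wherever the wording differs.
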
Reproducibility: The authors release code and data at their GitHub repository. The statute corpus and temporally indexed benchmark are publicly available. The backbone model Qwen-2.5-7B is used consistently for fair baseline comparison. Weights for some baseline models (e.g., LRAS) are unavailable, so those baselines are omitted. The appendix provides detailed prompts and tool schemas.

Concrete Example End-to-End: Given a 2010 criminal probation query, the agent extracts the temporal reference '2010' and the legal keywords 'probation' and 'criminal law'. It first runs an online web search scoped to documents from around 2010, then browses relevant web pages for broader interpretation. Next, it calls the temporal RAG module to retrieve the statutes (Art. 72 and 74) in the version valid for the 2009-2011 period, filtering out the 2023 amendments. Combining this evidence, the agent reasons from the provisions in effect in 2010 and concludes that Yao is eligible for probation. Baselines without temporal filtering wrongly apply the 2023 law and conclude that Yao is ineligible.

This detailed methodology reflects a novel integration of temporal awareness, a hybrid retrieval toolset, and RL fine-tuning for robust legal reasoning under real-world constraints.

Technical innovations

Integration of a temporal-filtered local statute RAG module paired with online web search for precise and broad legal knowledge retrieval.
End-to-end reinforcement learning training using token-level GRPO with entropy-based advantage shaping to accelerate learning of temporal query formulation in legal agentic search.
Construction of a temporally-indexed legal benchmark spanning 13 tasks across multiple statute amendment versions to explicitly evaluate temporal consistency.
Design of a hybrid query analyzer that extracts temporal references and legal keywords from free-text queries to enable effective temporal filtering of statute retrieval candidates.
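
For orientation, the clipped surrogate that GRPO inherits from PPO can be written in a token-level form as below. This is the generic form under one common reading of "token-level" (normalizing over all tokens in the group), not an equation reproduced from the paper. Here $G$ is the group size, $o_i$ the $i$-th rollout for query $q$, and $\hat{A}_{i,t}$ stands for whatever shaped advantage the method assigns to token $t$; any KL regularization term is omitted.

```latex
\[
J(\theta) = \mathbb{E}\left[
  \frac{1}{\sum_{i=1}^{G} |o_i|}
  \sum_{i=1}^{G} \sum_{t=1}^{|o_i|}
  \min\Big( \rho_{i,t}\, \hat{A}_{i,t},\;
            \operatorname{clip}\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big)
\right],
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q,\, o_{i,<t})}
                  {\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q,\, o_{i,<t})}
\]
```

Before any shaping, the group-relative advantage is $\hat{A}_{i,t} = (r_i - \operatorname{mean}(r_{1..G})) / \operatorname{std}(r_{1..G})$, shared by every token of rollout $i$.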

Datasets

Temporally-indexed legal benchmark — 3,584 training + 1,664 test samples — aggregated from LawBench, LexEval, DISC-LawEval, and expert annotations, publicly released
Chinese statutory law corpus — covering 16 major statutes with 13 amendment versions (2000-2025) — curated by authors for RAG retrieval

Baselines vs proposed

DISC-LawLLM (with CoT): average score 25.92 vs LegalSearch-R1: 55.90
LegalDelta (without CoT): average score 48.71 vs LegalSearch-R1: 55.90
GPT-4o-mini: average score 16.35 vs LegalSearch-R1: 55.90
DeepSeek-V3.2: LAR ROUGE-L 63.30 vs LegalSearch-R1: 96.73
DeepPlanner: average score 39.19 vs LegalSearch-R1: 55.90
LegalSearch-R1 without temporal RAG: converged reward 0.469 vs with temporal RAG: 0.561
Out-of-domain tasks: best baseline (DeepPlanner) accuracy 53.91 vs LegalSearch-R1: 63.67

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.25920.

Fig 1: Temporal inconsistency in legal reasoning,

Fig 2: LAR ROUGE-L scores across amendment

Fig 3: Number of temporally-aware search queries.

Fig 4: Overview of LegalSearch-R1. The agent performs multi-turn legal reasoning over a hybrid online-local

Limitations

Evaluation and dataset focus exclusively on the Chinese civil law system, limiting generalizability to common law systems with stronger reliance on precedents.
The RAG corpus indexes only statutory provisions and excludes judicial precedent or case law, which are important in many jurisdictions.
Multilingual and cross-jurisdictional performance remains unassessed; only Chinese-language norms were tested.
Certain recent open-source legal LLMs were excluded due to incompatible output formatting, possibly omitting competitive baselines.
The approach depends on high-quality temporal annotations and statute indexing, which may be labor-intensive to replicate.

Open questions / follow-ons

How can temporal consistency methods be extended to common law systems, which emphasize precedent retrieval and interpretation rather than statutes alone?
Can the temporal-enhanced RAG and RL training methods be generalized to multilingual and multi-jurisdictional legal corpora with different amendment dynamics?
How can temporally aware retrieval and reasoning be integrated with diverse legal document types such as case law, administrative regulations, and commentary in a unified framework?
What are effective approaches to automate or scale the temporal annotation and statute indexing process across jurisdictions and languages?

Why it matters for bot defense

For practitioners building robust bot-defense and CAPTCHA systems in the legal AI domain or similarly time-sensitive fields, this work underscores the critical need to incorporate explicit temporal context during knowledge retrieval and reasoning. Static parametric knowledge is inadequate when domain rules change discretely over time, as stale or future knowledge leads to invalid conclusions. Enforcing temporal filtering and training models to integrate temporal signals via reinforcement learning can greatly improve system trustworthiness and reduce failure modes related to temporal drift.

Moreover, this paper offers a blueprint for combining local, structured retrieval with flexible web search tools under a learned policy that optimizes query planning. For CAPTCHA systems or bot-detection engines operating in evolving regulatory or policy contexts, adopting temporally-aware retrieval pipelines can enable higher fidelity in threat or rule matching. Overall, the insights promote rigorous temporal modeling as a critical dimension often overlooked in reinforcement learning guided search agents and knowledge-augmented LLMs.

Cite

bibtex

@article{arxiv2605_25920,
  title={Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning},
  author={Wei Fan and Yining Zhou and Mufan Zhang and Yanbing Weng and Yiran HU and Tianshi Zheng and Baixuan Xu and Chunyang Li and Jianhui Yang and Haoran Li and Yangqiu Song},
  journal={arXiv preprint arXiv:2605.25920},
  year={2026},
  url={https://arxiv.org/abs/2605.25920}
}
