LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
Source: arXiv:2605.31584 · Published 2026-05-29 · By Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
TL;DR
This paper addresses the challenge of long-context reasoning in large language models (LLMs), which struggle to locate and integrate relevant information within lengthy, distractor-filled documents. Existing reinforcement learning approaches with verifiable rewards (RLVR) show promise but suffer from limited training data quality and sparse outcome-only rewards that fail to supervise intermediate reasoning steps. The authors propose LONGTRACERL, which pairs a new data construction pipeline with a rubric reward. The pipeline generates multi-hop questions from knowledge graph random walks and uses search agent trajectories to build tiered distractors of varying confusability, yielding more realistic and difficult training contexts. The rubric reward provides fine-grained process supervision by measuring recall of gold entities along the reasoning chain, and it is applied only to responses with correct final answers to discourage reward hacking. Experiments on five long-context benchmarks and three model scales (4B, 8B, 30B) show that LONGTRACERL consistently outperforms strong baselines, improving average accuracy by up to 5.7 points on the 4B model and encouraging comprehensive, evidence-grounded reasoning.
Key findings
- LONGTRACERL improves average benchmark accuracy on Qwen3-4B-Thinking-2507 from 53.3 to 59.0 (+5.7 points) and surpasses the previous best baseline, LongRLVR, by +2.5 points.
- On the challenging AA-LCR benchmark, LONGTRACERL raises accuracy from 33.2 to 41.8 (+8.6 points) on the 4B model.
- Trajectory-tiered distractors contain 50.03% of documents overlapping with gold rubric entities by macro average, significantly higher than 1.35% for random distractors, yielding stronger training signals.
- Removing the (positive-only) rubric reward lowers the average score from 59.0 to 53.7 on the 4B model, indicating that the rubric reward drives the improvement.
- Best rubric reward weight α=0.3 outperforms α=0.1 and α=0.5, balancing outcome and intermediate reasoning supervision.
- Positive-only rubric reward outperforms positive&negative variant by +1.9 points (59.0 vs. 57.1), preventing reward hacking.
- LONGTRACERL shows gains across three model sizes (4B, 8B, 30B), with 4B and 30B showing +5.7 and +3.2 average improvements respectively.
- Rubric reward encourages longer, more deliberate response generation, mitigating shortcut reasoning.
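The overlap statistic in the findings above (50.03% vs. 1.35% by macro average) can be illustrated with a small sketch. It assumes that "overlap" means a distractor document mentions at least one gold rubric entity, and it uses case-insensitive substring matching; both are assumptions for illustration, since the matching procedure is not described here. The function names and the toy examples are hypothetical.

```python
from typing import Iterable


def overlap_rate(distractors: list[str], gold_entities: Iterable[str]) -> float:
    """Fraction of distractor documents that mention at least one gold entity.

    Assumption: case-insensitive substring matching stands in for whatever
    entity matching the paper actually uses.
    """
    if not distractors:
        return 0.0
    gold = [e.lower() for e in gold_entities]
    hits = sum(any(e in doc.lower() for e in gold) for doc in distractors)
    return hits / len(distractors)


def macro_average_overlap(examples: list[dict]) -> float:
    """Average the per-example overlap rates so every example weighs equally."""
    rates = [overlap_rate(ex["distractors"], ex["gold_entities"]) for ex in examples]
    return sum(rates) / len(rates) if rates else 0.0


# Toy data, invented for illustration only.
examples = [
    {"distractors": ["RedOne produced pop records.", "A page about tennis."],
     "gold_entities": ["RedOne"]},
    {"distractors": ["Unrelated article."], "gold_entities": ["Lady Gaga"]},
]
print(macro_average_overlap(examples))
```

A macro average is computed per example and then averaged, so an example with many distractors does not dominate the statistic.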
Threat model
The threat model centers on the language model as an agent learning to maximize rewards when reasoning over long and distractor-rich contexts. The adversary is the model itself, which may attempt to game the reward signal (e.g., by shortcutting reasoning or enumerating entities to inflate rubric scores). External adversaries or attackers are not considered. The framework assumes correct final answer detection can be reliably performed by an LLM judge, and that distractors are realistic but fixed; adversaries cannot alter the context or inject malicious content.
Methodology — deep read
The paper tackles long-context reasoning failures by addressing two key limitations of prior RLVR methods: poor training data quality and overly sparse outcome-only rewards. As in the threat model above, the only adversary considered is the language model itself as it optimizes rewards over very long contexts filled with realistic distractors.
Data is constructed via a four-step pipeline:
- Multi-hop question generation: Using the KILT Wikipedia snapshot, controlled 8-step random walks over the Wikipedia hyperlink graph produce entity chains. A powerful LLM (GPT-5.2) synthesizes paraphrased multi-hop questions with unique, answerable attributes of the final entity, along with intermediate gold entities for supervision.
- Agent search trajectories: A search agent executes multi-round queries, document opening, reading, and citing actions to answer each question. Multiple (K=5) independent trajectories are executed; only those that reach the correct answer are retained.
- Tiered distractor extraction: From agent trajectories, documents opened but not cited are tier-1 high-confusability distractors; documents appearing in search results but not opened are tier-2 low-confusability distractors.
- Long-context assembly: Gold passages are combined with tier-1 distractors first, then tier-2 distractors, until a target context length (128K tokens) is reached; documents are then shuffled to avoid positional bias.
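The tiering and assembly steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trajectory` record, the choice to merge retained trajectories by set union, the whitespace token counter, and the stop-at-first-overflow rule are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run."""
    searched: set[str] = field(default_factory=set)  # doc ids returned by search
    opened: set[str] = field(default_factory=set)    # doc ids the agent opened
    cited: set[str] = field(default_factory=set)     # doc ids cited in the answer
    correct: bool = False                            # reached the correct answer?


def tier_distractors(trajs: list[Trajectory], gold_ids: set[str]):
    """Tier 1: opened but not cited. Tier 2: in search results but never opened."""
    kept = [t for t in trajs if t.correct]  # only successful runs are retained
    # Assumption: evidence from the retained runs is merged by set union.
    opened = set().union(*(t.opened for t in kept))
    cited = set().union(*(t.cited for t in kept))
    searched = set().union(*(t.searched for t in kept))
    tier1 = opened - cited - gold_ids
    tier2 = searched - opened - cited - gold_ids
    return sorted(tier1), sorted(tier2)


def assemble_context(gold_ids, tier1, tier2, docs, budget,
                     count_tokens=lambda s: len(s.split()), seed=0):
    """Add gold docs, then tier 1, then tier 2 until the budget is hit; shuffle."""
    chosen = sorted(gold_ids)
    used = sum(count_tokens(docs[d]) for d in chosen)
    for doc_id in [*tier1, *tier2]:  # tier 1 is exhausted before tier 2
        cost = count_tokens(docs[doc_id])
        if used + cost > budget:
            break
        chosen.append(doc_id)
        used += cost
    random.Random(seed).shuffle(chosen)  # avoid positional bias
    return chosen


trajs = [
    Trajectory(searched={"g", "a", "b", "c"}, opened={"g", "a"}, cited={"g"}, correct=True),
    Trajectory(searched={"z"}, opened={"z"}, correct=False),  # discarded: wrong answer
]
tier1, tier2 = tier_distractors(trajs, gold_ids={"g"})
docs = {k: "word " * 10 for k in "gabcz"}  # each toy doc counts as 10 tokens
context = assemble_context({"g"}, tier1, tier2, docs, budget=30)
print(tier1, tier2, sorted(context))
```

In practice the budget would be the 128K-token target and `count_tokens` would be the model's tokenizer; the whitespace counter here only keeps the sketch self-contained.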
The architecture is a standard reasoning-capable LLM (Qwen3 4B, 8B, or 30B variants) fine-tuned with reinforcement learning using a Group Relative Policy Optimization (GRPO) algorithm. The reward combines:
- Outcome reward (binary correctness of the final answer), judged by an LLM.
- Rubric reward: recall fraction of gold entities mentioned in the model’s reasoning response. Rubric rewards are group-normalized and applied only to responses with correct final answers (positive-only strategy) to prevent shortcutting and reward hacking.
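One way the two reward terms could be combined is sketched below. The summary does not give the exact formula, so several pieces are assumptions: the rubric score is z-scored over the whole response group, the weighted term is simply added to the binary outcome reward, and entity mentions are detected by case-insensitive substring matching. The function names are hypothetical, and `alpha=0.3` is the best-performing weight reported in the ablation.

```python
import statistics


def rubric_recall(response: str, gold_entities: list[str]) -> float:
    """Fraction of gold entities mentioned in the response (substring match)."""
    if not gold_entities:
        return 0.0
    text = response.lower()
    return sum(e.lower() in text for e in gold_entities) / len(gold_entities)


def composite_rewards(responses, correct, gold_entities, alpha=0.3, eps=1e-6):
    """Outcome reward plus a group-normalized rubric term on correct responses.

    Assumptions: z-score normalization over the group and additive combination.
    """
    recalls = [rubric_recall(r, gold_entities) for r in responses]
    mean = statistics.fmean(recalls)
    std = statistics.pstdev(recalls)
    rewards = []
    for recall, ok in zip(recalls, correct):
        outcome = 1.0 if ok else 0.0
        # Positive-only: incorrect responses never receive the rubric term,
        # so listing entities without answering correctly earns nothing.
        bonus = alpha * (recall - mean) / (std + eps) if ok else 0.0
        rewards.append(outcome + bonus)
    return rewards


responses = [
    "Lady Gaga worked with RedOne on Just Dance.",  # correct, cites both entities
    "The answer is Just Dance.",                    # correct, no supporting entities
    "Lady Gaga and RedOne, answer unknown.",        # wrong, but enumerates entities
]
rewards = composite_rewards(responses, [True, True, False], ["Lady Gaga", "RedOne"])
print([round(r, 3) for r in rewards])
```

In this toy group the entity-enumerating but incorrect response gets no credit, while the two correct responses are separated by how much of the gold chain they mention.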
Training regime: Training uses the Slime framework with a maximum context length of 160K tokens (128K prompt + 32K generation), batch size 128, group size G=8, and learning rate 2e-6, for 200 iterations on 32 × H800 GPUs. Rollouts are sampled at temperature 1.0 for exploration; evaluation uses temperature 0.6 with a maximum length of 32K tokens. Checkpoints are saved every 20 steps, and the best checkpoint is selected per run.
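For reference, the reported hyperparameters can be collected in one place. The sketch below is a plain Python dataclass, not Slime's actual configuration schema; the class and field names are hypothetical, and lengths are kept in units of K tokens because the summary does not say whether K means 1,000 or 1,024.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the reported settings (not Slime's schema)."""
    prompt_len_k: int = 128          # prompt budget, in K tokens
    generation_len_k: int = 32       # generation budget, in K tokens
    batch_size: int = 128
    group_size: int = 8              # G responses sampled per prompt in GRPO
    learning_rate: float = 2e-6
    iterations: int = 200
    num_gpus: int = 32               # H800
    rollout_temperature: float = 1.0
    eval_temperature: float = 0.6
    save_every: int = 20             # checkpoint interval, in steps

    @property
    def max_context_k(self) -> int:
        """Total context: prompt plus generation."""
        return self.prompt_len_k + self.generation_len_k

    @property
    def num_checkpoints(self) -> int:
        """Checkpoints available for best-checkpoint selection."""
        return self.iterations // self.save_every


cfg = TrainConfig()
print(cfg.max_context_k, cfg.num_checkpoints)
```

With these values the prompt and generation budgets add up to the stated 160K maximum, and a 200-iteration run yields 10 checkpoints to select from.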
Evaluation: Five diverse long-context benchmarks (AA-LCR, MRCR, Frames, LongBench v2, LongReason) cover real-world, multi-hop, and synthetic scenarios ranging from 8K to 2M tokens. Metrics are accuracy or macro-averaged scores as appropriate. Ablations study the effect of rubric weight α, distractor strategy (random, search, traj-random, traj-tiered), and reward design (positive-only vs positive&negative). Multiple scales and model families are tested.
Reproducibility: Code, weights, and datasets are released; the dataset queries are based on the KILT Wikipedia snapshot. Some details (e.g., the exact LLM prompts for question synthesis) are described only in outline.
Concrete example end-to-end: For a query about Lady Gaga and her singles produced by RedOne, the model is given a long context assembled from multiple gold passages (Wikipedia articles about Lady Gaga and her songs), tier-1 distractors (related music producer articles), and tier-2 distractors (less relevant search results). The model’s response is scored with a composite reward that combines the outcome reward for a correct answer with a rubric reward for referencing the gold reasoning entities. This example highlights the difference from prior setups with easy distractors and outcome-only rewards, which fail to distinguish reasoning errors.
Technical innovations
- Use of search agent trajectories to produce tiered distractors with graded confusability for more challenging and realistic long-context RL training data.
- Design of an entity-level rubric reward based on recall of gold entities along the reasoning chain providing fine-grained process supervision.
- Introduction of a positive-only rubric reward strategy that applies the rubric reward exclusively to correct final answers, preventing reward hacking.
- Integration of group-relative normalization of rubric rewards across response groups to stabilize reward scales despite varying question difficulty.
Datasets
- LONGTRACERL training set — 2,815 QA examples with 8-hop multi-hop questions from KILT Wikipedia snapshot — generated by this work
- DocQA — 1,591 QA examples with 2K–20K token context — prior work (Wan et al., 2025)
- LoongRL — 15,000 QA examples with 16K token context — prior work (Wang et al., 2025)
- LongRLVR — 18,870 QA examples with 8K–64K token context — prior work (Chen et al., 2026)
Baselines vs proposed
- Qwen3-4B-Thinking-2507 Base model: average score = 53.3 vs LONGTRACERL: 59.0 (+5.7)
- LONGTRACERL vs LongRLVR (8B backbone): average score 43.8 vs 40.9 (+2.9)
- Full model vs rubric reward ablation (LONGTRACERL-GRPO), 4B backbone: 59.0 vs 53.7 (+5.3)
- Distractor strategy: traj-tiered (59.0) vs traj-random (57.4), search (56.7), random (55.7) on 4B model
- Rubric weight α=0.3 (59.0) outperforms α=0.1 (58.3) and α=0.5 (57.1)
- Positive-only rubric reward (59.0) outperforms positive&negative (57.1) on 4B model
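The deltas quoted in this summary can be rechecked with a few lines of arithmetic. The scores below are copied from the figures reported above; the dictionary labels are informal shorthand, not names from the paper.

```python
# (before, after) score pairs taken from the numbers reported in this summary.
comparisons = {
    "4B base -> LONGTRACERL (average)": (53.3, 59.0),
    "4B base -> LONGTRACERL (AA-LCR)": (33.2, 41.8),
    "8B LongRLVR -> LONGTRACERL (average)": (40.9, 43.8),
    "4B without rubric -> full (average)": (53.7, 59.0),
    "4B positive&negative -> positive-only (average)": (57.1, 59.0),
}

# Round to one decimal to match the reported precision and avoid float noise.
deltas = {name: round(after - before, 1) for name, (before, after) in comparisons.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```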
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.31584.

Fig 1: Comparison between prior long-context RL

Fig 2: Overview of the LONGTRACERL training data construction pipeline.

Limitations
- Training data is solely grounded on a single knowledge source: the KILT Wikipedia snapshot, limiting diversity of reasoning domains and patterns.
- Distractor construction depends heavily on capabilities of the deployed search agent; stronger/weaker agents may yield different distractor quality and impact training effectiveness.
- Rubric rewards require annotated gold entity chains, which may be costly or infeasible for more open-ended or less structured reasoning tasks.
- Evaluation focuses on Qwen family models; generalization to very different LLM architectures or modalities is untested.
- The method implicitly assumes correct final answer detection by an LLM judge, which may introduce noise or bias into reward signals.
- No adversarial evaluation of robustness against purposeful manipulation of reasoning traces or intentionally confusing distractors.
Open questions / follow-ons
- How does the quality and design of the search agent affect the difficulty and usefulness of trajectory-derived distractors for training?
- Can the rubric reward framework be extended to open-domain or less structured reasoning tasks lacking clear gold entity chains?
- What are the trade-offs of different normalization and aggregation strategies for processing intermediate reasoning rewards?
- How does LONGTRACERL perform under adversarial or intentionally misleading distractors designed to confuse entity linking?
Why it matters for bot defense
From a bot-defense and CAPTCHA perspective, LONGTRACERL’s approach of leveraging rich, multi-hop reasoning traces and fine-grained process rewards to improve long-context understanding offers valuable lessons. The tiered distractor design based on realistic search trajectories highlights the importance of constructing sophisticated challenge instances that more closely mimic real adversarial information noise, rather than simple random or superficial distractors. This method could inspire CAPTCHA challenge generation strategies where bots are tested not just on outcome correctness (e.g., solving a puzzle) but also on demonstrating transparent, stepwise reasoning evidenced by references to intermediate 'gold' entities or features.
Furthermore, the rubric reward’s selective application only to correct outcomes to prevent shortcutting resembles approaches to enforce authenticity in bot behavior by requiring aligned intermediate justifications. Although LONGTRACERL targets LLM reasoning improvements rather than traditional bot detection, its ideas on designing reward signals that penalize shallow strategies and encourage evidence-grounded responses may help refine CAPTCHA systems that seek to differentiate human users from bots employing superficial heuristics or narrow pattern matches. Finally, the extensive long-context benchmarks and analytic methodologies here provide useful evaluation paradigms to test system robustness under realistic, hard distractors—a key challenge shared with bot detection and CAPTCHA robustness.
Cite
@article{arxiv2605_31584,
title={LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards},
author={Nianyi Lin and Jiajie Zhang and Lei Hou and Juanzi Li},
journal={arXiv preprint arXiv:2605.31584},
year={2026},
url={https://arxiv.org/abs/2605.31584}
}