Context Training with Active Information Seeking
Source: arXiv:2605.13050 · Published 2026-05-13 · By Zeyu Huang, Adhiguna Kuncoro, Qixuan Feng, Jiajun Shen, Lucio Dery, Arthur Szlam et al.
TL;DR
This paper addresses a limitation of large language model (LLM) adaptation methods that optimize the model's context without updating weights. Existing context optimizers operate as closed systems that rely solely on the model's intrinsic knowledge, so they struggle to incorporate newly produced, niche, or external knowledge. The authors augment context optimizers with active information seeking through external tools such as Wikipedia search and web browsers. Naively adding these tools to a standard sequential context optimization pipeline, however, leads to context pollution and degraded performance. To overcome this, they introduce a beam-search-style training procedure that maintains and prunes multiple candidate context versions, which enables robust exploration and filters out noisy or harmful context updates. Evaluated on low-resource machine translation (FLORES+), healthcare dialogues (HealthBench), and reasoning-intensive coding and exam tasks (LiveCodeBench and Humanity's Last Exam), the approach substantially improves over prior context training and strong model baselines. The method is data-efficient, robust to hyperparameters, and generalizes well across different models.
Key findings
- Directly augmenting sequential context training with information seeking tools (Seq-IS) reduces average FLORES+ translation score from 31.13 to 29.68, demonstrating context pollution without search-based control.
- BeamSearch-IS improves the average FLORES+ translation score to 34.51, a 2.57-point gain over the best non-IS baseline (BoN = 31.94) and a 4.14-point gain over the larger Gemini-2.5-Pro model (30.37).
- On HealthBench, Seq-IS yields 0.4484 overall score versus 0.4629 for Seq (no IS), but BeamSearch-IS achieves 0.5026, comparable to Gemini-2.5-Pro (0.5030).
- BeamSearch-IS outperforms baselines in emergency referrals on HealthBench, indicating better recognition and routing capabilities.
- In reasoning-heavy tasks LiveCodeBench and Humanity’s Last Exam, BeamSearch-IS improves pass@1 and average@8 accuracy (e.g., 8.63% on HLE vs 6.5% baseline), showing external info aids complex reasoning.
- Sequential training exhibits context collapse and local optima effects (Figs. 2 & 3), with optimizer stuck repeatedly expanding and pruning dictionary entries.
- Beam search maintains diverse candidate contexts and evaluates them on validation sets to prune noisy or low-quality updates, acting as a verification and backtracking mechanism.
- The method is sample-efficient, achieving gains with only 128 training and 64 validation examples in low-resource regimes.
Threat model
No explicit adversary is modeled. The implicit threat is external information pollution: the optimizer agent queries uncontrolled web sources and may incorporate inaccurate, irrelevant, or harmful content into the context. The system assumes external information may be noisy but does not consider an active adversary trying to exploit or fool the information-seeking tools. Any such influence would be limited to contaminating external data; the model weights and the internal execution of the LLM backbone are out of reach.
Methodology — deep read
Threat Model and Assumptions: The adversary is implicit in the context pollution risk from external web sources. The optimizer agent lacks perfect knowledge and may inject noisy or misleading content when searching external tools. No adversarial attacks are explicitly studied. The system assumes access to external tools (Wikipedia, BrowserUse) and a frozen LLM backbone (Gemini-2.5-Flash) that cannot update parameters but can update context state.
Data: The paper evaluates on four benchmarks selected for requiring external knowledge retrieval or reasoning:
- FLORES+ low-resource translation into five less-common languages, with 128 training and 64 validation samples.
- HealthBench physician-authored clinical dialogues with rubric-based scoring, same small training budget.
- LiveCodeBench and Humanity's Last Exam reasoning and programming datasets with ~100 samples per subdomain.
Splits and datasets are provided in Appendix 8.2; data is sourced from prior public benchmarks.
Architecture/Algorithm: The core system models learning as state optimization on a modifiable state S representing the text context C (stored as a structured database with unique resource items containing summaries, raw content, metadata, and embeddings). Two LLM-based agents operate iteratively:
- Executor agent runs task instances conditioned on the current context.
- Optimizer agent reads executor output and feedback, then updates the context.
The optimizer is augmented with external tools:
- WikipediaSearchTool for quick info retrieval on knowledge gaps.
- BrowserUseTool for dynamic web navigation and complex info extraction.
Contexts are managed as version-controlled code repositories, enabling branching, committing, and reverting.
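The version-controlled context described above can be pictured with a minimal Python sketch. This is an illustration only, not the paper's schema or implementation: the names `Resource` and `ContextRepo`, their fields, and the commit-index scheme are all assumptions, and embeddings are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Resource:
    """One context item. Field names are hypothetical, not the paper's schema."""
    resource_id: str
    summary: str
    raw_content: str


@dataclass
class ContextRepo:
    """Append-only history of context snapshots, addressed by commit index."""
    commits: List[Dict[str, Resource]] = field(default_factory=lambda: [{}])
    head: int = 0

    def working_copy(self) -> Dict[str, Resource]:
        # Return a copy so edits do not touch history until committed.
        return dict(self.commits[self.head])

    def commit(self, snapshot: Dict[str, Resource]) -> int:
        self.commits.append(dict(snapshot))
        self.head = len(self.commits) - 1
        return self.head

    def revert(self, commit_id: int) -> None:
        if not 0 <= commit_id < len(self.commits):
            raise IndexError(f"unknown commit {commit_id}")
        self.head = commit_id

    def branch(self) -> "ContextRepo":
        # A branch gets its own copy of the history and evolves independently.
        return ContextRepo(commits=[dict(c) for c in self.commits], head=self.head)


repo = ContextRepo()

draft = repo.working_copy()
draft["r1"] = Resource("r1", "Glossary", "hello -> bonjour")
clean_commit = repo.commit(draft)

noisy = repo.working_copy()
noisy["r2"] = Resource("r2", "Unverified web page", "possibly wrong content")
repo.commit(noisy)

side_branch = repo.branch()   # keeps the noisy state for further exploration
repo.revert(clean_commit)     # main line backs out the noisy update

main_ids = sorted(repo.working_copy())
branch_ids = sorted(side_branch.working_copy())
```

Storing whole snapshots keeps reverting trivial at the cost of memory; a real system would more likely store diffs, but the summary does not say which the authors chose.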
Training Regime: Standard sequential (greedy) approach updates one context per step based on training batch feedback. This suffers from context pollution and local optima. Authors propose beam-search training maintaining K candidate contexts per step. At each iteration, each candidate is expanded into M child contexts via optimizer updates plus external tool calls exploring diverse update strategies. Candidates are then pruned based on validation performance (held-out set), with elitism preserving best old candidate as a 'Do Nothing' fallback. This encourages exploration while discarding harmful noisy updates. Training is conducted with limited labeled data (128 train / 64 val examples), using the Gemini-2.5-Flash LLM as backbone.
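The loop described in the training regime can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: `beam_search_train`, `toy_propose`, and `toy_validate` are hypothetical names, the optimizer's tool-augmented update is replaced by a random edit, and the validation score is a made-up function that rewards useful entries and penalizes noisy ones.

```python
import random
from typing import Callable, List, Tuple

Context = Tuple[str, ...]  # a context is modeled as an immutable tuple of entries


def beam_search_train(
    initial: Context,
    propose: Callable[[Context, random.Random], Context],
    validate: Callable[[Context], float],
    beam_width: int = 2,   # K candidates kept per step
    expansions: int = 3,   # M children per candidate
    steps: int = 4,
    seed: int = 0,
) -> Tuple[Context, float]:
    rng = random.Random(seed)
    beam: List[Tuple[float, Context]] = [(validate(initial), initial)]
    for _ in range(steps):
        # Elitism: the best old candidate stays in the pool as the
        # "Do Nothing" fallback, so the best score can never decrease.
        pool = [beam[0]]
        for _, parent in beam:
            for _ in range(expansions):
                child = propose(parent, rng)
                pool.append((validate(child), child))
        # Deduplicate, then prune to the top-K contexts by validation score.
        unique = {ctx: score for score, ctx in pool}
        ranked = sorted(unique.items(), key=lambda item: item[1], reverse=True)
        beam = [(score, ctx) for ctx, score in ranked[:beam_width]]
    best_score, best_ctx = beam[0]
    return best_ctx, best_score


USEFUL = ("glossary", "grammar", "examples")
NOISE = ("spam", "off-topic")


def toy_propose(ctx: Context, rng: random.Random) -> Context:
    # Stand-in for an optimizer update backed by external tool calls.
    entry = rng.choice(USEFUL + NOISE)
    return ctx if entry in ctx else ctx + (entry,)


def toy_validate(ctx: Context) -> float:
    # Stand-in for held-out validation performance.
    good = sum(1 for e in ctx if e in USEFUL)
    bad = sum(1 for e in ctx if e in NOISE)
    return float(good) - 1.5 * bad


best_ctx, best_score = beam_search_train((), toy_propose, toy_validate)
```

Because the best old candidate is always re-entered into the pool, a step in which every child is polluted leaves the best context unchanged, which is the backtracking behavior the summary attributes to the beam procedure.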
Evaluation Protocol: Metrics vary by dataset:
- FLORES+: ChrF++ translation scores.
- HealthBench: rubric-based numeric scores.
- LiveCodeBench: pass@1 and pass@8 correctness.
- Humanity’s Last Exam: average@8 accuracy.
Baselines include the base LLM zero-shot, a Best-of-N (BoN) heuristic, sequential context training (Seq), and a beam-search variant without information seeking. The sequential and beam methods augmented with information seeking (Seq-IS, BeamSearch-IS) are also compared. Validation sets are used for context candidate pruning during training.
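The summary does not say how the sampling-based metrics are computed. As a hedged illustration, the sketch below uses the combinatorial pass@k estimator that is common in code-generation evaluation, and reads average@8 as the mean per-sample accuracy over 8 samples. Both choices, and the function names `pass_at_k` and `average_at_k`, are assumptions rather than details confirmed by the paper.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without
    replacement from n samples (c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def average_at_k(outcomes: Sequence[bool]) -> float:
    """Mean correctness over the sampled attempts for one problem."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for one problem: 2 of 8 samples are correct.
outcomes = [True, False, False, True, False, False, False, False]
n, c = len(outcomes), sum(outcomes)

p1 = pass_at_k(n, c, 1)        # chance a single sample is correct
p8 = pass_at_k(n, c, 8)        # chance at least one of all 8 is correct
avg8 = average_at_k(outcomes)  # mean accuracy over the 8 samples
```

Under these definitions pass@1 estimated from 8 samples coincides with average@8 for a single problem, while pass@8 only asks whether any sample succeeded.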
Reproducibility: Details on context database schema, tool interfaces, and prompts are in appendices. Code or weights release is not mentioned explicitly. Datasets are established public benchmarks. Training hyperparameters and data splits are described in supplementary material.
Concrete Example: On FLORES+ translation, starting from a baseline context, the beam search optimizer generates multiple candidate updated contexts by using Wikipedia search or web browsing to find relevant dictionaries, grammar rules, or examples. These candidates are expanded across iterations, tested on held-out validation samples, and only the best-performing contexts survive, preventing the optimizer from accepting noisy web information that causes context pollution. This process yields an optimized context that significantly improves translation quality as measured by ChrF++.
Technical innovations
- Augmentation of context optimizer agents with external information-seeking tools (WikipediaSearch and BrowserUse) enabling retrieval of missing knowledge beyond frozen LLM parameters.
- Formulating context as a structured, version-controlled resource database that supports fine-grained read/write operations for precise and recoverable updates.
- Beam-search training procedure maintaining multiple candidate contexts per step to enable exploration and validation-based pruning, mitigating context pollution and escaping local optima in context optimization.
- Integration of a 'Do Nothing' fallback candidate during beam search to ensure robustness against harmful or noisy information seeking updates.
Datasets
- FLORES+ — diverse low-resource language translation dataset used with 128 training and 64 validation samples
- HealthBench — clinical multi-turn conversation benchmark with rubric-based scoring, ~128 training samples
- LiveCodeBench — reasoning and competitive coding tasks, ~100 samples per subdomain
- Humanity’s Last Exam (HLE) — complex exam tasks covering math, physics, and reasoning, ~100 samples per domain
Baselines vs proposed
- FLORES+ translation: Seq (no IS) ChrF++ = 31.13 vs Seq-IS 29.68 vs BeamSearch-IS 34.51 vs BoN 31.94 vs Gemini-2.5-Pro 30.37
- HealthBench overall score: Seq 0.4629 vs Seq-IS 0.4484 vs BeamSearch-IS 0.5026 vs Gemini-2.5-Pro 0.5030
- LiveCodeBench pass@1: Gemini-2.5-Flash baseline ~49% vs BeamSearch-IS higher by ~3.9% on hard subsets
- Humanity's Last Exam average@8 accuracy: baseline 6.5% vs BeamSearch-IS 8.63% vs Seq-IS 5.38%
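The gaps between BeamSearch-IS and each baseline follow directly from the scores listed above. The short sketch below recomputes them; the helper name `gains` and the dictionary layout are illustrative, and only the numbers come from the summary.

```python
from typing import Dict

# Scores as reported in the list above.
flores: Dict[str, float] = {
    "Seq": 31.13,
    "Seq-IS": 29.68,
    "BoN": 31.94,
    "Gemini-2.5-Pro": 30.37,
    "BeamSearch-IS": 34.51,
}
healthbench: Dict[str, float] = {
    "Seq": 0.4629,
    "Seq-IS": 0.4484,
    "Gemini-2.5-Pro": 0.5030,
    "BeamSearch-IS": 0.5026,
}


def gains(scores: Dict[str, float], proposed: str = "BeamSearch-IS") -> Dict[str, float]:
    """Absolute difference between the proposed method and every other entry."""
    return {
        name: round(scores[proposed] - value, 4)
        for name, value in scores.items()
        if name != proposed
    }


flores_gains = gains(flores)
healthbench_gains = gains(healthbench)

for name, delta in flores_gains.items():
    print(f"FLORES+ ChrF++ vs {name}: {delta:+.2f}")
for name, delta in healthbench_gains.items():
    print(f"HealthBench vs {name}: {delta:+.4f}")
```

On FLORES+ the margin is 2.57 points over BoN and 4.14 over Gemini-2.5-Pro; on HealthBench, BeamSearch-IS trails Gemini-2.5-Pro by 0.0004, which is why the summary calls the two comparable.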
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.13050.

Fig 1 (page 1).

Fig 7: Data efficiency and hyperparameter robustness. (a) The heatmap compares different methods across varying

Fig 8: The heatmaps visualize the utility of retrieved resources (x-
Limitations
- No explicit adversarial robustness or security evaluation against attacks on external tools or query manipulation.
- Dependence on external tools (Wikipedia, web browser) limits applicability in offline or restrictive environments.
- Computational overhead due to beam search maintaining multiple context candidates, increasing cost compared to sequential training.
- No detailed ablation on the sensitivity of beam size K or the number of expansions M across tasks.
- Use of only a small set of languages and domains; broader generalization to very different NLP tasks remains to be demonstrated.
- No code or weights release is mentioned, which limits reproducibility by the community.
Open questions / follow-ons
- How robust is the system to adversarial attacks specifically targeting the external information sources, e.g., misinformation or adversarial web content?
- Can the beam search procedure be scaled or adapted for much larger-scale tasks or continuous online learning scenarios?
- What is the impact of expanding to other or more diverse external tools beyond Wikipedia and general web browsing?
- How does the method perform with larger backbone models or in zero/few-shot transfer to entirely new domains not covered in training?
Why it matters for bot defense
For bot-defense and CAPTCHA developers, this paper illustrates critical challenges and mitigations when leveraging external data sources within model contexts. The discovery that naively incorporating web-based tools can degrade system performance due to context pollution highlights the importance of controlled, multi-candidate search and validation mechanisms when automating information retrieval. Their beam search approach to maintain diverse candidate context states and reject poor updates via validation aligns with robust prompt or context management strategies needed to ensure the integrity of adaptive systems exposed to potentially adversarial external inputs. Moreover, the structured context database abstraction and version control enable precise, reversible edits that could inspire similar engineering designs in CAPTCHA systems that incorporate external knowledge or dynamic prompt generation. While their findings focus on natural language generation, the broader principle of balancing information retrieval with noise control is highly applicable to maintaining the security and reliability of adaptive bot-defense models.
Cite
@article{arxiv2605_13050,
title={ Context Training with Active Information Seeking },
author={ Zeyu Huang and Adhiguna Kuncoro and Qixuan Feng and Jiajun Shen and Lucio Dery and Arthur Szlam and Marc'Aurelio Ranzato },
journal={arXiv preprint arXiv:2605.13050},
year={ 2026 },
url={https://arxiv.org/abs/2605.13050}
}