Self-Improving Language Models with Bidirectional Evolutionary Search
Source: arXiv:2605.28814 · Published 2026-05-27 · By Guowei Xu, Zhenting Qi, Huangyuan Su, Weirui Ye, Himabindu Lakkaraju, Sham M. Kakade et al.
TL;DR
This paper addresses key limitations in current sample-generation methods used for self-improving language models and agentic systems, specifically best-of-N sampling and tree search. These traditional methods rely on sparse verification signals and generate candidates mainly through autoregressive expansions, restricting exploration to high-probability regions under the model’s distribution. To overcome this, the authors propose Bidirectional Evolutionary Search (BES), a novel search framework combining forward candidate evolution and backward goal decomposition. Forward search augments autoregressive expansion with four evolutionary operators—combination, deletion, translocation, and crossover—to recombine and edit candidate trajectories beyond single rollout extensions. Backward search decomposes the original problem into detailed sub-goals that provide dense, intermediate verification signals to guide the forward search more effectively.
Theoretical analysis shows expansion-only search confines candidates to a narrow entropy shell that limits exploration, while evolutionary operators can escape this shell by breaking inter-block dependence and increasing diversity. Moreover, backward search exponentially reduces the number of samples needed to find correct solutions by leveraging fine-grained sub-goal verification. Empirically, BES outperforms strong baselines on challenging post-training logical and multi-hop reasoning tasks where standard methods struggle or degrade performance. At inference time, BES achieves higher quality and more stable solutions on several open problem solving benchmarks compared to top open-source evolutionary frameworks. Ablations confirm the necessity of both directions and evolutionary operators. Overall, BES provides a theoretically principled and empirically validated approach for enhancing language model search beyond previous autoregressive and verification limitations.
Key findings
- Expansion-only search candidates concentrate in a narrow entropy shell with probability at least 1-exp(-Omega(T)) (Theorem 4.4a), limiting candidate diversity.
- Evolution operators generate candidates with expected log-probability beyond the entropy shell boundary by Omega(T), enabling escape from model distribution confines (Theorem 4.4b).
- Backward goal decomposition reduces sample complexity exponentially from Omega(1/prod(pi)) to O(pmin^-1 log(m/delta)) for collecting sub-goal evidence, where pi are sub-goal success probabilities (Theorem 4.5).
- On Knights-and-Knaves logical reasoning post-training, BES steadily improves validation accuracy versus nearly flat or degrading baselines GRPO and MaxRL (Fig. 3).
- In multi-hop reasoning post-training on MuSiQue, BES improves accuracy by +3.0% on Llama-3.2-3B and +3.8% on Llama-3.1-8B versus GRPO and Tree-GRPO baselines (Table 1).
- At inference with GPT-5 backbone on three open problem solving benchmarks, BES achieves the best average and best objective values compared to OpenEvolve, GEPA, and ShinkaEvolve (Table 2).
- Ablation studies show removing answer reweighting or evolutionary operators reduces BES performance but still outperforms baselines, confirming both components’ importance (Fig. 4).
- BES incurs ~30% more wall-clock training time than Tree-GRPO but gains significantly better accuracy and higher valid search/action rates (Table 3).
Threat model
The adversary in this context is implicit and conceptualized as the challenge of finding correct, high-quality solutions (trajectories) for complex problems using language models and verifier feedback. The adversary cannot bias or alter the verifier beyond its defined rule-based or learned function and cannot produce candidate trajectories outside the policy distribution without search operators. The capability of the adversary is effectively bounded by the sampling and verification mechanisms, motivating the search enhancements to overcome sparse signals and limited sampling coverage.
Methodology — deep read
The paper formalizes the problem as maximizing a verifier score V(x,y) over terminal trajectories y for input problem x, produced by a policy πθ that generates stepwise trajectories y = (y1,...,yT). The challenge is that the optimal y⋆ lies in low-probability regions under πθ, making naive sampling inefficient. The threat model assumes an adversary limited to standard model sampling and verification signals; the approach focuses on enhancing exploration and verification quality.
Data: The authors evaluate on several benchmarks including Knights-and-Knaves logical puzzles (~5k training problems with 1.3k validation), MuSiQue multi-hop QA (~3-4 hop subset), and three open problem solving datasets (Circle Packing Square, Rectangle, and Heilbronn Convex) using various LLM backbones (Gemma-3-1B, Llama-3.2-3B, Llama-3.1-8B, GPT-5). Verifiers are task-dependent and include rule-based checkers, code executors, or embedding similarity models.
Architecture/Algorithm: BES consists of two coupled components:
Forward search maintains a candidate pool P of partial trajectories. It applies two operator types to produce new candidates at each step:
- Expansion: autoregressively sample 1..K steps from πθ to extend a candidate.
- Evolution: four operators (combination, deletion, translocation, crossover) that edit or recombine segments from one or two parent trajectories in P to produce diverse new candidates beyond single rollouts. Parent nodes for operators are selected by a Boltzmann distribution over scores derived from backward verification.
Backward search builds a recursive goal tree by decomposing the problem into sub-goals via policy prompts. Each node n in forward search is scored against this goal tree by aggregating verifiers for the node's solution coverage of sub-goals. This provides dense, graded feedback rather than sparse terminal signals.
The forward and backward searches alternate, with backward search invoked every K forward steps to refine sub-goals and rescore candidates, guiding exploration.
Training regime: For post-training, BES is used as a sample generation module that replaces i.i.d. rollouts to produce higher quality trajectories for reinforcement learning algorithms like MaxRL and GRPO. Training is run for multiple epochs as detailed per experiment. Computation uses standard LLM hardware stacks; explicit seeds and hyperparameters include λ=0.1 for exploration bonus, linear temperature annealing for operator selection.
Evaluation metrics include validation accuracy (logical and multi-hop reasoning), number of valid searches/actions, finish ratio, mean and best objective function values on open problems, wall-clock time, and API cost. Baselines include GRPO, MaxRL, Tree-GRPO for post-training and OpenEvolve, GEPA, ShinkaEvolve for inference.
Empirical evaluation includes ablations on evolutionary operators and answer reweighting to isolate component impacts. Cost analyses compare compute overhead versus achieved improvements.
The paper provides theoretical proofs underpinning the entropy shell confinement of expansion-only search and superiority of evolutionary operators, as well as sample complexity benefits from backward-guided sub-goal search.
Code and trained models are publicly available, enhancing reproducibility.
Overall, the methodology builds a rigorous, multi-faceted search framework with evolutionary diversity and decomposed verification that is practically trained and evaluated on challenging benchmarks with clear efficiency and quality gains.
Technical innovations
- Introduction of evolutionary operators (combination, deletion, translocation, crossover) for recombining partial trajectories in forward search to escape the narrow entropy shell of expansion-only generation.
- Coupling of forward search with backward goal decomposition that recursively breaks problems into verifiable sub-goals, providing dense intermediate feedback to guide search.
- Theoretical proof that expansion-only sampling confines candidates within a narrow entropy shell, while evolution operators allow escape beyond this region to improve exploration.
- Theoretical result that backward search exponentially reduces the samples needed to find correct answers by leveraging fine-grained sub-goal success probabilities.
Datasets
- Knights-and-Knaves — ~5,000 training problems, 1,300 validation — publicly described puzzle dataset
- MuSiQue (3-4 hop subset) — not explicitly sized here — public multi-hop QA dataset
- Circle Packing (Square) — unspecified size — open problem solving benchmark
- Circle Packing (Rectangle) — unspecified size — open problem solving benchmark
- Heilbronn Convex — unspecified size — open problem solving benchmark
Baselines vs proposed
- GRPO post-training (Logical Reasoning): validation accuracy baseline stagnant; BES steadily improves accuracy (Fig 3)
- MaxRL post-training (Logical Reasoning): minimal gains; BES outperforms both (Fig 3)
- GRPO post-training (MuSiQue Multi-hop Reasoning): Accuracy 2.1% vs BES 7.0% (+4.9%) for Llama-3.2-3B (Table 1)
- Tree-GRPO post-training (MuSiQue): Accuracy 3.9% vs BES 7.0% (+3.1%) for Llama-3.2-3B; 7.4% vs 10.4% (+3.0%) for Llama-3.1-8B (Table 1)
- OpenEvolve inference (Circle Packing Sq.): Avg 2.531 vs BES 2.623 (+0.092) with GPT-5 backbone (Table 2)
- GEPA inference (Circle Packing Sq.): Avg 2.613 vs BES 2.623 (+0.01) (Table 2)
- ShinkaEvolve inference (Circle Packing Sq.): Avg 2.464 vs BES 2.623 (+0.159) (Table 2)
- BES shows lower variance and higher best-case performance across all three open problem solving benchmarks (Table 2)
Limitations
- The backward search relies on the quality of sub-goal decomposition and verifiers, which can be task-dependent and may require manual engineering or suitable heuristic functions.
- Evolutionary operators are currently defined as direct edits on step sequences; in domains without straightforward concatenation, they rely on prompting heuristics that may be less precise.
- The paper does not extensively evaluate adversarial robustness or the impact of highly misleading verifiers prone to reward hacking.
- Experiments are limited to selected reasoning and open problem datasets; distribution shift or generalization to very different tasks is not studied in depth.
- Some method components (e.g., answer reweighting in post-training) interact with BES, making it hard to isolate effects fully.
- Computational cost overhead (~30% higher wall-clock time than Tree-GRPO) may be prohibitive in resource-constrained settings.
Open questions / follow-ons
- How to automatically and robustly generate or learn effective sub-goal verifiers across diverse domains without manual engineering?
- Can evolutionary operators be further extended or adapted to more abstract or continuous action/state spaces beyond discrete token or step sequences?
- How does BES perform in adversarial or noisy verifier settings where feedback signals may be misleading or incorrect?
- What are the theoretical and empirical limits of BES scalability in very high-dimensional trajectory spaces or with very long-horizon tasks?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners focused on improving robustness against advanced language-model-based solvers, BES offers a principled approach to generate more diverse and higher-quality candidate solutions beyond what autoregressive expansions can achieve. Its bidirectional framework, combining fine-grained sub-goal decomposition with evolutionary search, could inspire new CAPTCHA challenge designs that require solving compositional sub-goals verified independently, making automatic solving by bots computationally more challenging.
Moreover, understanding that usual generation methods are confined to narrow hypothesis shells indicates a potential weakness exploitable in CAPTCHA construction by enforcing solutions in low-probability regions. Simultaneously, defenders might consider integrating evolutionary-style recombination detection or multi-goal verification feedback to detect or impede automated solver attempts employing BES-like strategies. Overall, BES enhances the toolkit for reasoning about bot solvers and designing challenges robust to their advanced search capabilities.
Cite
@article{arxiv2605_28814,
title={ Self-Improving Language Models with Bidirectional Evolutionary Search },
author={ Guowei Xu and Zhenting Qi and Huangyuan Su and Weirui Ye and Himabindu Lakkaraju and Sham M. Kakade and Yilun Du },
journal={arXiv preprint arXiv:2605.28814},
year={ 2026 },
url={https://arxiv.org/abs/2605.28814}
}