ExpRL: Exploratory RL for LLM Mid-Training

Source: arXiv:2606.17024 · Published 2026-06-15 · By Violet Xiang, Amrith Setlur, Chase Blagden, Nick Haber, Aviral Kumar

TL;DR

This paper addresses how to improve large language model (LLM) reasoning through reinforcement learning (RL) when sparse rewards give too little signal because the base model rarely samples productive reasoning paths. Traditional mid-training methods teach primitive skills from manually curated reasoning traces, which must be specified by hand and may not scale to harder problems that need broader solution strategies. ExpRL is an automated RL-based mid-training approach that uses large corpora of human-written question-answer pairs as reference solutions. The references serve as reward scaffolds rather than imitation targets: an LLM judge compares on-policy samples against them to produce dense outcome- and process-level rewards. By giving graded feedback on partial progress, ExpRL broadens the model's coverage of productive reasoning paths and yields better initializations for subsequent sparse-reward RL.

Key findings

ExpRL variants outperform SFT, sparse-reward GRPO, and self-distillation as initializations for downstream sparse-reward RL on held-out math benchmarks (AIME25, AIME26, HMMT, IMO Answer). For example, ExpRL-Process reaches 63.41% pass@1 on AIME26 vs 58.75% for the sparse-reward GRPO baseline after Stage-II RL (Table 1).
After Stage-I mid-training with ExpRL alone, the model shows improved pass@1 and pass@16 on held-out math benchmarks, e.g. on HMMT, ExpRL-Outcome achieves 44.19% pass@1 vs 40.60% for base, and ExpRL-Process achieves 45.24%.
Entropy and diversity metrics during training indicate ExpRL maintains higher token-level entropy and unlocks solvable prompts faster than sparse-reward GRPO, suggesting broader coverage of reasoning behaviors (Figure 3).
ExpRL increases verification, self-correction, and backtracking behaviors in reasoning traces compared to the base LLM and other baselines, indicating a shift toward more adaptive search strategies (Figure 5).
Self-distillation suffers from large KL divergence between teacher and student, leading to unstable optimization, whereas ExpRL remains within reachable policy KL space (Figure 4).
ExpRL generalizes beyond math to mixed-domain (math, science QA, coding) mid-training, using a smaller judge model for dense rewards. It improves pass@1 on every domain except coding, where sparse-reward RL remains stronger because it has execution feedback (Table 4).
Process-level dense rewards that assign credit to intermediate prefixes enable better credit assignment and exploration compared to outcome-level rewards alone.
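The entropy and KL comparisons in the findings above can be made concrete with a minimal sketch. It assumes each position's next-token distribution is available as a plain list of probabilities; the function names are illustrative and are not taken from the paper's code.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(step_probs: Sequence[Sequence[float]]) -> float:
    """Average per-position entropy over one sampled trace."""
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

Higher mean token entropy indicates a policy that still spreads probability over several continuations, while a large KL between teacher and student distributions is the kind of gap the summary associates with unstable self-distillation.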

Threat model

n/a - The paper is focused on improving RL-based training for LLM reasoning coverage rather than on defense against adversaries or attacks.

Methodology — deep read

Threat Model & Assumptions: The paper does not frame a security adversary. The method assumes reference solutions are available for the training questions but hides them from the policy during learning: the policy samples with no oracle access to the references, and only the judge sees them. The goal is to improve exploratory coverage of reasoning paths rather than to defend against attacks.
Data: The main mid-training dataset 𝒟mid combines challenging math problems with human-written step-wise reference solutions from the InT and POPE datasets, totaling thousands of examples. Downstream test benchmarks are held-out math datasets: HMMT-Nov2025, IMO-AnswerBench, AIME 2025, and AIME 2026. A larger mixed-domain dataset of math, science QA, and coding problems with their references is used in the scale experiments.
Architecture/Algorithm: ExpRL treats the LLM policy 𝜋𝜃 as a trainable chain-of-thought reasoning model. It uses an LLM judge, often the same base model or a smaller one, to score sampled on-policy reasoning traces against the hidden reference solution. Two dense reward types are defined: ExpRL-Outcome gives a scalar score to the full rollout, and ExpRL-Process assigns scores to intermediate prefixes produced by a fixed, delimiter-based slicing rule. The process advantages are the incremental changes between consecutive prefix scores, which gives stepwise feedback. Policy optimization maximizes expected reward minus a KL regularization term that keeps the policy close to the base policy, using GRPO-style updates for outcome rewards and REINFORCE for process rewards, with rewards normalized as appropriate.
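As a rough illustration of the process reward described above, the sketch below slices a trace into cumulative prefixes and turns prefix scores into incremental advantages. The blank-line delimiter, the `judge` callable, and the keyword-overlap `toy_judge` are all assumptions made for illustration; the paper uses an LLM judge, and its exact slicing rule is not reproduced here.

```python
from typing import Callable, List


def split_prefixes(trace: str, delimiter: str = "\n\n") -> List[str]:
    """Return cumulative prefixes, one per non-empty delimiter-separated segment."""
    segments = [s for s in trace.split(delimiter) if s.strip()]
    return [delimiter.join(segments[: i + 1]) for i in range(len(segments))]


def process_advantages(prefix_scores: List[float], baseline: float = 0.0) -> List[float]:
    """Advantage of each segment = score of its prefix minus the previous prefix score."""
    advantages = []
    previous = baseline
    for score in prefix_scores:
        advantages.append(score - previous)
        previous = score
    return advantages


def toy_judge(prefix: str, reference: str) -> float:
    """Hypothetical stand-in for the LLM judge: share of reference words seen so far."""
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0
    return len(ref_words & set(prefix.lower().split())) / len(ref_words)


def score_trace(trace: str, reference: str,
                judge: Callable[[str, str], float] = toy_judge) -> List[float]:
    """Score every prefix against the hidden reference, then difference the scores."""
    scores = [judge(prefix, reference) for prefix in split_prefixes(trace)]
    return process_advantages(scores)
```

Because the advantages are differences of consecutive scores, they telescope: their sum equals the final prefix score minus the baseline, so a segment that lowers the judge's score receives a negative advantage without changing the total credit for the trace.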
Training Regime: Stage-I ExpRL mid-training runs for 230 optimization steps with batches of ~36 prompts and 10 sampled rollouts per prompt, at temperature 0.8 and a maximum length of 16,384 tokens so that overly long traces do not have their reward clipped. Stage-II sparse-reward RL then runs for 500 steps, initialized from the ExpRL policy, with standard binary final-answer rewards. The implementation uses GRPO-style updates for outcome rewards and REINFORCE for process rewards.
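The Stage-I sampling budget implied by these settings can be worked out with a small sketch. The class and field names are hypothetical, and the batch size is treated as exactly 36 even though the summary gives it as approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageOneConfig:
    """Stage-I settings as reported in the summary (names are illustrative)."""
    steps: int = 230
    prompts_per_batch: int = 36      # reported as "~36", assumed exact here
    rollouts_per_prompt: int = 10
    temperature: float = 0.8
    max_tokens: int = 16_384

    @property
    def rollouts_per_step(self) -> int:
        return self.prompts_per_batch * self.rollouts_per_prompt

    @property
    def total_rollouts(self) -> int:
        return self.steps * self.rollouts_per_step

    @property
    def max_generated_tokens(self) -> int:
        """Upper bound only: assumes every rollout hits the length cap."""
        return self.total_rollouts * self.max_tokens


cfg = StageOneConfig()
print(cfg.rollouts_per_step)  # 360
print(cfg.total_rollouts)     # 82800
```

Under these assumptions Stage-I samples about 82,800 rollouts in total, each of which also needs at least one judge call, so judge cost scales with the same number.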
Evaluation Protocol: Metrics are pass@1 and pass@k (k = 16, 128), the probability that at least one of k independent samples reaches a correct answer. Comparisons cover the base model, SFT, sparse-reward GRPO, self-distillation, and the ExpRL variants on multiple held-out benchmarks. Behavioral analyses track reasoning actions such as verification and backtracking. Training dynamics such as entropy and KL divergence are measured to assess exploration and policy shift.
Reproducibility: Code is publicly released at https://github.com/violetxi/ExpRL. The datasets are combinations of published reasoning datasets (InT, POPE, HMMT, IMO-AnswerBench, etc.), though some are likely not fully public. Models are instruction-tuned members of the Qwen3 family, such as Qwen3-4B-Instruct. Seeds and hardware are not specified. Hyperparameter sensitivity and ablations are partly covered in the appendices.

Technical innovations

Use of reference solutions as reward scaffolds instead of imitation targets in RL mid-training improves exploration coverage.
Combination of outcome-level and novel process-level dense rewards computed by an LLM judge that scores partial reasoning prefixes to assign credit.
Construction of segment-level advantage functions that measure incremental improvement relative to previous prefixes to better localize learning signals.
On-policy RL training using KL-regularized policy optimization with GRPO and REINFORCE adapted to dense rewards from LLM-judge comparisons.
Use of a smaller, reference-conditioned judge model to reliably score larger policy model rollouts, enabling scalable mixed-domain mid-training.
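To show why dense judge scores matter for GRPO-style updates, the sketch below z-scores rewards within the group of rollouts sampled for one prompt. This is a common GRPO-style formulation and is an assumption here; the summary does not give the paper's exact normalization, and the reward values are made up for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative values only: a hard prompt where no rollout reaches the final answer.
sparse_rewards = [0.0, 0.0, 0.0, 0.0]   # binary final-answer reward
dense_rewards = [0.1, 0.4, 0.2, 0.7]    # hypothetical graded judge scores

sparse_adv = group_normalized_advantages(sparse_rewards)
dense_adv = group_normalized_advantages(dense_rewards)
```

When every rollout in a group fails, the binary rewards are identical and all advantages are zero, so the prompt contributes no gradient. Graded scores for partial progress still separate the rollouts, which is the exploration signal the summary attributes to ExpRL.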

Datasets

InT + POPE (math reasoning QA pairs) — thousands of examples — public research datasets
HMMT-Nov2025 (held-out math benchmark) — unknown exact size — held-out test set
IMO-AnswerBench — unknown size — test benchmark
AIME 2025, AIME 2026 — math benchmark contests — held-out
Mixed-domain 4,001 examples, spanning math, science QA, coding — internal constructed mixture

Baselines vs proposed

Base Qwen3-4B-Instruct after downstream RL: e.g. AIME26 pass@1 = 51.40% vs ExpRL-Process after downstream RL = 63.41%
Supervised fine-tuning (SFT) on reference traces: e.g. AIME26 pass@1 = 30.26% vs ExpRL-Process = 63.41%
Sparse-reward GRPO: e.g. AIME26 pass@1 after downstream RL = 58.75% vs ExpRL-Outcome = 61.74%
Self-distillation baseline: e.g. AIME26 pass@1 = 58.41% vs ExpRL-Outcome = 61.74%
Stage-I pass@1 on HMMT: Base model 40.60% vs ExpRL-Outcome 44.19% vs SFT 20.09%
Training dynamics: the ExpRL-Outcome and ExpRL-Process variants maintain higher token-level entropy and unlock solvable prompts faster than sparse-reward GRPO.
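The AIME26 comparisons above can be restated as percentage-point gains with a short sketch. The numbers are copied from the list above; the dictionary keys and the helper name are illustrative.

```python
# AIME26 pass@1 (%) as quoted in this summary.
aime26_pass1 = {
    "base": 51.40,
    "sft": 30.26,
    "sparse_grpo": 58.75,
    "self_distillation": 58.41,
    "exprl_outcome": 61.74,
    "exprl_process": 63.41,
}


def gain_pp(proposed: str, baseline: str, table: dict = aime26_pass1) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(table[proposed] - table[baseline], 2)


gains_vs_grpo = {
    name: gain_pp(name, "sparse_grpo")
    for name in ("exprl_outcome", "exprl_process")
}
```

On these figures ExpRL-Process is 4.66 points above sparse-reward GRPO and ExpRL-Outcome is 2.99 points above it, so the process-level variant accounts for the larger share of the reported improvement.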

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.17024.

Fig 10: Stage-I pass@k on held-out answer-based benchmarks after RL priming. ExpRL improves

Limitations

Relies on availability of high-quality human-written reference solutions for mid-training; not evaluated without references.
Final performance can still be limited by the downstream sparse-reward RL algorithm; ExpRL only provides a better initialization.
Coding domain shows weaker gains as partial correctness is harder to assess via reference comparisons.
Judging accuracy and reward faithfulness depend on LLM judge calibration, which may vary across domains.
Training details such as seed stability, compute requirements, and sensitivity to hyperparameters are not fully explored.
No adversarial evaluation of robustness to manipulation or spurious correlations in reward assignment.

Open questions / follow-ons

How well does ExpRL scale to much larger LLMs beyond 8 billion parameters?
What is the impact of judge model quality and miscalibration on reward reliability and training stability?
Can ExpRL dense reward scaffolding be combined with online environment or interactive feedback during RL?
How can the method be extended to unlock compositional reasoning strategies beyond partial progress on references?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, ExpRL demonstrates a promising approach to training LLM policies for complex reasoning tasks where sparse reward signals are insufficient. By using large sets of human-provided references as scaffolding to generate dense, granular rewards and reinforce partial progress, the model more robustly explores diverse reasoning strategies critical for solving challenging problems. This method emphasizes the importance of reward design in RL to overcome exploration bottlenecks and suggests that similar dense reward scaffolds could be leveraged to train LLMs to better recognize bot-generated patterns or to refine challenge generation with finer-grained feedback. However, it relies on curated references and judge models for reward scoring, which may limit direct applicability in open-world adversarial settings typical in bot detection. Future work incorporating judgment calibration and mixed-domain reward models would be needed to translate these ideas to bot-defense contexts. Still, the principle of shaping exploration with dense reference-guided rewards offers useful insights for designing LLM-based bot detectors or CAPTCHA solvers that require nuanced stepwise reasoning coverage.

Cite

bibtex

@article{arxiv2606_17024,
  title={ExpRL: Exploratory RL for LLM Mid-Training},
  author={Violet Xiang and Amrith Setlur and Chase Blagden and Nick Haber and Aviral Kumar},
  journal={arXiv preprint arXiv:2606.17024},
  year={2026},
  url={https://arxiv.org/abs/2606.17024}
}
