Self-Trained Verification for Training- and Test-Time Self-Improvement

Source: arXiv:2605.30290 · Published 2026-05-28 · By Chen Henry Wu, Aditi Raghunathan

TL;DR

This paper addresses how to improve reasoning models' self-verification, which matters for both test-time iterative refinement and training-time self-training. The authors propose Self-Trained Verification (STV), which trains a verifier to catch subtle errors in plausible but incorrect model-generated solutions by distilling from a reference-conditioned teacher verifier that has access to ground-truth solutions. This supplies the training signal for feedback quality that verifiers otherwise lack. STV substantially improves verifier-guided refinement (V-R) loops on hard math and scientific reasoning tasks, roughly doubling pass@1 on hard math and raising it 14× on scientific reasoning relative to untrained verifiers. Moreover, placing STV verifiers in the generator's training loop (verifier-in-the-loop, ViL) yields a further 33% test-time improvement and a 30% gain in the generator's standalone performance, even with no verifier present at test time.

The study reveals that STV verifiers provide more calibrated verdicts and actionable feedback that guide generators to explore solution refinements effectively, outperforming standard approaches like supervised fine-tuning, reinforcement learning on verdicts, and meta-verifiers. The method scales well across verifier sizes, and smaller trained verifiers can approach larger untrained ones. Their experiments use the Qwen3 model family on challenging benchmarks such as DAPO math and SciKnowEval scientific reasoning, demonstrating generality beyond math domains. The authors emphasize that training stronger verifiers unlocks new frontiers of self-improvement for both generator and verification components, which may be a key route forward in reasoning with large language models.

Key findings

The STV verifier roughly doubles final-round pass@1 on hard math problems compared to untrained verifiers (e.g., 5.5% vs 2.7% on the Hardest split) and achieves a 14× lift on scientific reasoning tasks (1.5% to 21%).
A Qwen3-8B generator guided by an STV verifier outperforms a 4× larger 32B generator without verification (27.4% vs 17.8% pass@1 on the Hard split).
Verifier-in-the-loop training (ViL) starting from an RL-converged generator yields a further 33% pass@1 gain at final round with verifier feedback and a 30% standalone pass@1 gain without verifier at test time.
On-policy distillation (OPD) from a reference-conditioned teacher verifier outperforms supervised fine-tuning (SFT) approaches for training the verifier.
Smaller STV verifiers (4B, 1.7B) nearly match or exceed untrained larger verifiers (8B) when paired with the same generator.
STV mitigates verifier score inflation and miscalibration, showing 3–5× higher precision at fixed coverage for accept decisions in V-R loops.
Verifier-guided refinement substantially outperforms best-of-N resampling for test-time compute scaling with both base and STV-trained generators.
Diagnostic feedback from the STV verifier adds value beyond simple verdict labels, improving pass@1 by +3.2% over ground-truth verdict only baselines.
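
The "precision at fixed coverage" finding can be made concrete with a small sketch. It assumes the verifier emits a scalar confidence score per candidate solution and that the top fraction by score is accepted; the function name, the scores, and the toy data are hypothetical, and the paper's exact definition may differ.

```python
from typing import List, Tuple


def precision_at_coverage(scored: List[Tuple[float, bool]], coverage: float) -> float:
    """Precision of accept decisions when the top `coverage` fraction is accepted.

    scored: (verifier_score, is_actually_correct) pairs. Scores are hypothetical.
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    # Rank candidates from most to least confident.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Accept at least one candidate so the ratio is always defined.
    n_accept = max(1, round(coverage * len(ranked)))
    accepted = ranked[:n_accept]
    return sum(1 for _, correct in accepted if correct) / n_accept


# Toy data (assumed): an inflated verifier scores everything high, so its
# ranking carries little information about which solutions are correct.
inflated = [(0.95, False), (0.94, True), (0.93, False), (0.92, False),
            (0.91, False), (0.90, True), (0.89, False), (0.88, False)]
# A better-calibrated verifier separates correct from incorrect solutions.
calibrated = [(0.90, True), (0.85, True), (0.40, False), (0.35, False),
              (0.30, False), (0.25, False), (0.20, False), (0.10, False)]

print(precision_at_coverage(inflated, 0.25))    # 0.5
print(precision_at_coverage(calibrated, 0.25))  # 1.0
```

Both toy verifiers see the same two correct solutions out of eight; only the ranking differs, which is what the metric isolates.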

Threat model

The adversary is the generator itself, which produces plausible but flawed solutions that the verifier must detect. At test time the verifier has no oracle access: it must judge correctness and provide feedback from the problem and the candidate solution alone. Ground-truth reference solutions are used only during training, through a privileged teacher verifier.

Methodology — deep read

The paper focuses on reasoning tasks where a generator model produces candidate solutions and a verifier judges correctness and provides feedback in a verification-refinement (V-R) loop. The fundamental bottleneck is the verifier's ability to detect subtle errors and give actionable feedback without human-labeled data.

Threat Model & Assumptions: The adversary here is the flawed generator that produces plausible but incorrect solutions. The verifier aims to detect errors without ground-truth feedback but can be trained using privileged information (reference solutions). Verifiers must work at test time without access to references.
Data: Experiments use the DAPO math benchmark (Yu et al., 2025) split into "Hard" and "Hardest" subsets based on an 8B model's baseline pass@1 performance (e.g., Hardest with pass@1=0). Around 150 test problems per split remain after embedding-based de-duplication. Also evaluated on SciKnowEval, a multi-domain scientific reasoning dataset. Both train and test sets are used for generator and verifier training.
Architecture / Algorithm: The verifier is parameterized as V_θ(x, y_{r-1}) predicting verdict and feedback without the reference. The teacher verifier V*(x, y_{r-1}, y*(x)) has access to the ground-truth solution y*(x) and is used as a privileged teacher. Training aligns student verifier output distributions to the teacher’s using on-policy distillation (OPD) with Jensen-Shannon divergence (α=0.5). They also add reinforcement learning (RL) on verdict accuracy as an auxiliary objective.
Training Regime: Training is done with multiple rollout samples (32 per problem) of the generator-verifier loop, running up to 20 rounds of verification-refinement. On-policy distillation is done by sampling student prefixes and matching to teacher feedback distributions. Both untrained and RL-trained base generators are used, and verifier-in-the-loop training (ViL) continues RL training of the generator with a frozen STV verifier providing feedback.
Evaluation Protocol: Metrics include pass@1 (exact match correctness) measured at various verification rounds. Evaluations include comparisons to untrained verifiers, meta-verifiers with proxy human feedback, SFT and RL baselines, and best-of-N sampling baselines for compute-matched comparisons. Statistical significance tests (p<0.01 or <0.05) confirm gains. Precision-coverage curves and calibration plots demonstrate reduced verifier score inflation.
Reproducibility: Code and model weights are released at their public GitHub: https://github.com/ar-forum/stv. Datasets are mostly public or from prior published benchmarks.
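
The distillation objective described above can be illustrated with a minimal sketch of a generalized Jensen-Shannon divergence between teacher and student next-token distributions. It assumes the common form α·KL(P‖M) + (1−α)·KL(Q‖M) with mixture M = αP + (1−α)Q, which reduces to the standard symmetric JSD at α=0.5. The function names and the per-token averaging in `opd_loss` are assumptions for illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(teacher: Sequence[float], student: Sequence[float],
                    alpha: float = 0.5) -> float:
    """alpha * KL(teacher || mix) + (1 - alpha) * KL(student || mix)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    # The mixture is positive wherever either distribution is positive,
    # so both KL terms stay finite.
    mixture = [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]
    return (alpha * kl_divergence(teacher, mixture)
            + (1.0 - alpha) * kl_divergence(student, mixture))


def opd_loss(teacher_dists: Sequence[Sequence[float]],
             student_dists: Sequence[Sequence[float]],
             alpha: float = 0.5) -> float:
    """Mean per-token divergence along one student-sampled sequence (assumed)."""
    per_token = [generalized_jsd(t, s, alpha)
                 for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)


# Toy next-token distributions over a 3-token vocabulary at two positions.
# In the on-policy setting, both models are scored on tokens the student
# sampled; the teacher additionally conditions on the reference solution.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.4, 0.4, 0.2], [0.1, 0.8, 0.1]]
print(opd_loss(teacher, student))
```

The second position contributes zero because the two distributions agree there, so the loss is driven entirely by the position where the student departs from the teacher.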

Example end-to-end: For a math problem x, the generator proposes a solution y0. The reference-conditioned teacher verifier, with access to the ground truth y*, generates feedback f1 highlighting logical gaps or errors in y0. The student verifier is trained via OPD to imitate this feedback without seeing y*. At test time, the student verifier judges y0 and provides feedback f1. If rejected, the generator refines to y1 using f1. This repeats until acceptance or max rounds. This loop substantially improves final pass@1 and also trains the generator to better incorporate verification feedback at training time via ViL.
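
The test-time loop in this example can be sketched as follows. This is a minimal illustration under stated assumptions: the `generate` and `verify` callables, their signatures, and the toy stand-ins are hypothetical, and only the 20-round cap comes from the summary above.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces:
#   generate(problem, previous_solution, feedback) -> new solution
#   verify(problem, solution) -> (accepted, feedback)
Generate = Callable[[str, str, str], str]
Verify = Callable[[str, str], Tuple[bool, str]]


def verify_refine(problem: str, generate: Generate, verify: Verify,
                  max_rounds: int = 20) -> Tuple[str, List[str]]:
    """Run verification-refinement until acceptance or max_rounds is reached."""
    solution = generate(problem, "", "")  # initial proposal y0
    history = [solution]
    for _ in range(max_rounds):
        accepted, feedback = verify(problem, solution)
        if accepted:
            break
        # Rejected: refine the previous attempt using the verifier's feedback.
        solution = generate(problem, solution, feedback)
        history.append(solution)
    return solution, history


# Toy stand-ins (assumed) so the loop can be executed without any model.
def toy_generate(problem: str, previous: str, feedback: str) -> str:
    if not previous:
        return "6"  # deliberately wrong first attempt
    step = -1 if feedback == "too high" else 1
    return str(int(previous) + step)


def toy_verify(problem: str, solution: str) -> Tuple[bool, str]:
    value = int(solution)
    if value == 4:  # stub check for the toy problem "2+2"
        return True, ""
    return False, "too high" if value > 4 else "too low"


final, history = verify_refine("2+2", toy_generate, toy_verify)
print(final, history)  # 4 ['6', '5', '4']
```

The verifier here sees only the problem and the candidate solution, matching the test-time constraint that no reference is available.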

Technical innovations

Self-trained verification (STV) trains verifiers to diagnose flaws in plausible but incorrect solutions by distilling from a privileged teacher verifier that conditions on reference solutions.
Use of on-policy distillation (OPD) to align the student verifier's distribution with the reference-conditioned teacher, avoiding off-policy generalization degradation seen in supervised fine-tuning.
Verifier-in-the-loop training (ViL) for the generator inside the verification-refinement loop using feedback from a frozen STV verifier, yielding both test-time and training-time improvements.
Demonstration that smaller trained verifiers can nearly match larger verifiers, reducing test-time compute without loss of verification quality.

Datasets

DAPO math benchmark – ∼150 test problems per difficulty split after filtering – publicly released by Yu et al. (2025)
SciKnowEval scientific reasoning – released by Feng et al. (2024)

Baselines vs proposed

No verifier training: pass@1 peaks at 2.7% on Hardest problems vs STV verifier 5.5%
Qwen3-32B large generator without verifier: pass@1 17.8% on Hard vs STV-verifier-guided Qwen3-8B generator 27.4%
Verdict RL verifier: marginal pass@1 improvement compared to no verifier training
Meta-verifier RL verifier (proxy with GPT-5.2): no significant gain over no verifier training
SFT verifier: no pass@1 gains, worse than OPD-based STV verifier
Verifier-in-the-loop (ViL) training from an RLVR-converged generator: 33% relative pass@1 gain at final round vs starting point
ViL generator standalone (no verifier) pass@1 up 30% vs standard RL baseline
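
The headline multipliers follow directly from the pass@1 figures quoted in this summary. The short check below only redoes that arithmetic; the helper name and dictionary labels are illustrative.

```python
def relative_lift(baseline: float, improved: float) -> float:
    """Ratio of improved to baseline pass@1 (both in percent)."""
    return improved / baseline


# (baseline pass@1 %, improved pass@1 %) as quoted in this summary.
reported = {
    "Hardest math, untrained vs STV verifier": (2.7, 5.5),
    "Scientific reasoning, untrained vs STV verifier": (1.5, 21.0),
    "Hard math, 32B without verifier vs 8B with STV": (17.8, 27.4),
}

for label, (baseline, improved) in reported.items():
    print(f"{label}: {relative_lift(baseline, improved):.2f}x")
```

The first ratio is about 2.04, consistent with "roughly doubles", and the second is 14, matching the stated 14× lift.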

Limitations

STV requires access to reference oracles during training for the privileged teacher verifier, limiting applicability where labeled references are unavailable.
Experiments primarily focus on math and science reasoning tasks with structured answers; unclear how well STV extends to less structured or open-ended generation.
Computational cost is nontrivial due to repeated 20-round verification-refinement and multiple rollouts during training and evaluation.
Evaluation excludes adversarial or deliberately tricky inputs designed to fool verifier or generator.
Potential limitations in scaling verifier and generator to much larger models or datasets remain unexplored.
No explicit ablation of how verifier architecture variations affect STV gains.

Open questions / follow-ons

Can STV be extended beyond verification tasks to other capabilities that are 'non-verifiable from scratch but easier with references'?
How to scale STV and ViL training efficiently to larger models and more diverse reasoning benchmarks including code or open-ended tasks?
What is the optimal allocation of compute between generator pretraining, verifier training, and verification rounds for practical deployment?
Can other forms of self-supervision signals (e.g., uncertainty estimates, multiple references, environment feedback) further improve verifier training?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work underscores the vital role of verification components that can accurately identify subtle errors and provide actionable feedback in automated reasoning pipelines. In CAPTCHA schemes relying on automated puzzle-solving or challenge-response generation, improved verification can curb false acceptances of flawed or adversarial solutions, enhancing security. The STV method's ability to train verifiers without human labels suggests a scalable approach to calibrate and refine verification modules that gate solution acceptance. Moreover, iterative verifier-guided refinement and training-time improvements hint at automated systems that better detect and block bot attempts by continuously self-improving.

While this paper targets math and scientific reasoning, the core insight—training verification models by leveraging privileged information accessible only during training—can inspire CAPTCHAs and bot-detectors that dynamically adapt verification rigor. However, integrating such complex V-R pipelines in real-time challenge-response environments may involve latency and compute trade-offs. Evaluating STV-style verifiers' effectiveness against real-world automated attacks and adversarial bots would be a valuable next step.

Cite

@article{arxiv2605_30290,
  title={Self-Trained Verification for Training- and Test-Time Self-Improvement},
  author={Chen Henry Wu and Aditi Raghunathan},
  journal={arXiv preprint arXiv:2605.30290},
  year={2026},
  url={https://arxiv.org/abs/2605.30290}
}
