Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

Source: arXiv:2606.13680 · Published 2026-06-11 · By Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen, Avinash Atreya, Hanjie Chen et al.

TL;DR

This paper addresses the challenge of improving complex reasoning tasks in large language models by teaching them to reason by analogy rather than relying solely on parametric knowledge. Conventional retrieval-augmented generation methods retrieve contexts based on lexical or semantic similarity, which often poorly correlates with reasoning utility—retrieved problems that look similar on the surface may require different solution strategies, and vice versa. The authors propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that first distills gold-standard reasoning relevance labels using a strong judge model to identify which candidate solutions provide transferable reasoning benefits. A dense retriever is then trained to surface these structurally analogous examples, and finally the policy model is fine-tuned via reinforcement learning conditioned on retrieved analogous demonstrations. This allows the model to learn to leverage reasoning patterns from analogous problems under verifiable outcome rewards, instead of simply imitating reference solutions or training curricula.

RA-RFT demonstrates consistent improvements across challenging mathematical reasoning benchmarks (AIME 2024/25, HMMT 2025, BrUMO 2025) on the Qwen3 model family. It outperforms strong baselines including standard RL fine-tuning methods like GRPO, on-policy self-distillation (OPSD), and curriculum-based RL (QuestA). Notably, on AIME 2025 average@32 accuracy, RA-RFT improves by 7.1 and 2.8 absolute points over GRPO on Qwen3-1.7B and Qwen3-4B, respectively. Ablations confirm that gains come from retrieval grounded in reasoning utility rather than semantic similarity, and from optimizing under reinforcement signal rather than supervised imitation. RA-RFT also surfaces diverse, complementary reasoning scaffolds that aid different problems, unlocking more meaningful context augmentation for reinforcement fine-tuning.

Key findings

  • RA-RFT improves AIME 2025 average@32 accuracy by +7.1 points on Qwen3-1.7B and +2.8 points on Qwen3-4B over the GRPO baseline.
  • RA-RFT yields average accuracy gains of +4.1 and +2.6 points across four benchmarks (AIME24, AIME25, HMMT25, BrUMO25) on Qwen3-1.7B and Qwen3-4B respectively.
  • Gold-relevance distillation with a GPT-4o judge model enables training a retriever that reaches recall@1 of 43.5%, versus 14.7% without fine-tuning (Table 3), substantially improving retrieval quality for reasoning utility.
  • Retrieval augmentation under supervised fine-tuning (RA-SFT) gives a negligible improvement (+0.8 points), while RA-RFT’s reinforcement objective unlocks much larger gains (+7.1 points on AIME25, Table 2).
  • The top-1 retrieved analogical context improves per-sample accuracy over the baseline for the majority of problems; moreover, retrieved contexts exhibit significant diversity, with the best and worst retrievals differing widely in utility (Fig 4).
  • Reason-ModernColBERT, a late-interaction multi-vector retriever, outperforms single-vector Qwen3-Embedding-4B both before and after fine-tuning on reasoning supervision (Table 3).
  • RA-RFT starts with lower initial accuracy than GRPO due to unfamiliar retrieved traces but surpasses it steadily during training (Fig 3), demonstrating effective learning to leverage analogies.
  • Randomly assigning unrelated retrieval contexts harms performance, confirming the importance of reasoning-aware retrieval.

Methodology — deep read

  1. Threat Model and Assumptions: The adversary is not explicitly modeled as this work focuses on improving model reasoning performance. The main assumption is that the model’s parametric knowledge is insufficient to reason correctly on novel problems and needs external analogous examples to scaffold reasoning. The retriever aims to surface reasoning-utility relevant examples rather than semantically similar ones, addressing mismatches in reasoning patterns.

  2. Data: Training queries comprise 12,500 problems from QuestA, selected for fairness and covering diverse mathematical reasoning challenges. The retrieval corpus contains 220,000 problems from the OpenR1-Math-220K dataset, excluding any overlap with the training and evaluation benchmarks (AIME 2024/25, HMMT Feb 2025, BrUMO 2025). Each problem in the corpus is paired with a teacher-generated step-by-step reasoning trace from a strong model (Qwen3-235B-A22B). Problem types are coarsely labeled to restrict judge comparisons and reduce computational cost.

  3. Architecture and Algorithm: RA-RFT involves three key components:

  • Gold-Relevance Distillation: A judge model (GPT-4o) performs pairwise binary labeling on (query problem, candidate corpus problem) pairs to judge if their reasoning traces share transferable reasoning patterns, regardless of lexical or semantic similarity. This yields binary relevance labels for training.
  • Reasoning-Aware Retriever Training: A dense retriever Rθ based on Reason-ModernColBERT (a late-interaction multi-vector retriever) is fine-tuned using contrastive learning with the InfoNCE loss. Positive examples are those judged reasoning-relevant and negatives include other corpus entries, encouraging the retriever to rank truly analogous reasoning traces higher.
  • Reinforcement Fine-Tuning with Retrieved Demonstrations: The target policy model Mϕ (Qwen3-1.7B or 4B) is fine-tuned with reinforcement learning from verifiable rewards (RLVR), implemented with Group Relative Policy Optimization (GRPO). For each training query, top-k retrieved reasoning traces from Rθ are concatenated to the query as context. The model generates multiple sampled solutions given these traces. Rewards are computed by comparing generated answers to gold answers. The policy gradient objective maximizes normalized advantages, conditioning on retrieved analogous traces to improve reasoning success rate by leveraging external structured scaffolding.

  4. Training Regime: The Qwen3-1.7B and Qwen3-4B policy models start from instruction-tuned checkpoints. Retriever fine-tuning uses a contrastive loss with temperature τ=0.05. Reinforcement fine-tuning uses GRPO with 16 rollouts per query, no KL penalty, a learning rate of 1e-6, and a maximum rollout length of 32,768 tokens. Training runs on 64 NVIDIA H100 GPUs. Both the retriever and the policy model receive full-parameter updates; random seeds are not specified. Retrieval uses k=1 trace per query during both training and inference.

  5. Evaluation Protocol: Performance is measured by average@32 accuracy, the fraction of correct answers among 32 sampled solutions per problem, averaged over problems, across four competition-level mathematical reasoning benchmarks. Baselines include the base instruction-tuned models, GRPO, OPSD (self-distillation), and QuestA curriculum training. Additional analyses measure retriever recall@1 on a held-out gold-relevance set of 10k samples. Ablations test the effect of retrieval augmentation under supervised (cross-entropy) versus reinforcement objectives, the choice of retriever model, and the impact of context diversity on per-problem accuracy. No statistical tests or cross-validation are reported; evaluation is on held-out competition benchmarks.

  6. Reproducibility: The retriever is fine-tuned from the publicly available Reason-ModernColBERT checkpoint. Training data comes from the publicly released QuestA and OpenR1-Math-220K datasets. The policy models are from Alibaba’s Qwen3 series; teacher traces come from the large Qwen3-235B-A22B checkpoint, and judging uses GPT-4o, which is closed access. The release of code and weights for RA-RFT is not explicitly mentioned. Some key hyperparameters and resource specifications are provided.
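The contrastive objective used for retriever training can be illustrated with a minimal sketch. It assumes one judged-relevant positive per query and a handful of negatives, and it takes precomputed similarity scores as input; the function name `info_nce_loss` and the toy scores are hypothetical, and only the temperature τ=0.05 comes from the summary above.

```python
import math

def info_nce_loss(pos_score, neg_scores, temperature=0.05):
    """InfoNCE for one query: negative log-softmax of the positive's score
    over the positive plus all negatives, after dividing by the temperature."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Hypothetical similarity scores between a query and corpus traces.
well_ranked = info_nce_loss(0.9, [0.2, 0.1, 0.3])   # positive scored highest
badly_ranked = info_nce_loss(0.2, [0.9, 0.1, 0.3])  # a negative scored highest
```

With a small temperature such as 0.05, score differences are amplified, so a positive ranked below a negative produces a much larger loss than one ranked first.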

Concrete Example End-to-End: Consider a training query q involving a complex algebraic identity. Gold-relevance distillation uses GPT-4o to evaluate many candidate traces from the corpus and find those whose reasoning patterns are analogously useful for solving q. These labeled pairs train a dense retriever to rank such analogous traces highly. During RL fine-tuning, the retriever is queried with q and returns the top trace c1. The query plus c1 is fed to the policy, which samples multiple reasoning attempts. Correct answers yield positive rewards that propagate through GRPO to update the policy weights so that the model makes better use of retrieved analogies. Over training, the model learns to attend to and apply structurally relevant reasoning exemplars, boosting final reasoning accuracy.
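The reward and advantage step of this walkthrough can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the prompt template, the exact-match reward, and the use of the population standard deviation are assumptions, and all function names are hypothetical.

```python
import statistics

def build_prompt(query, retrieved_trace):
    # Hypothetical template: the paper's actual prompt format is not given here.
    return (
        "Here is a worked solution to a related problem:\n"
        f"{retrieved_trace}\n\n"
        "Now solve the following problem:\n"
        f"{query}"
    )

def outcome_reward(predicted_answer, gold_answer):
    # Verifiable outcome reward, assumed here to be an exact string match.
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by the mean and standard deviation
    of its own group, which is the core idea of GRPO-style advantages."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

prompt = build_prompt("What is 6 * 7?", "To compute 5 * 8, add 8 five times ...")
sampled_answers = ["42", "41", "42", "7"]  # toy group of 4 rollouts
rewards = [outcome_reward(a, "42") for a in sampled_answers]
advantages = group_relative_advantages(rewards)
```

Correct rollouts get positive advantages and incorrect ones negative; when every rollout in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.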

Technical innovations

  • Gold-relevance distillation uses a judge model to label retrieval supervision based on reasoning utility rather than surface lexical or semantic similarity.
  • Training a dense multi-vector retriever with contrastive learning to rank retrieved examples by transferable reasoning patterns instead of embedding closeness.
  • Incorporating retrieved reasoning traces as context during reinforcement fine-tuning (RA-RFT) to teach the model to reason by analogy under verifiable outcome rewards.
  • Showing that retrieval augmentation benefits are unlocked only under reinforcement learning objectives (RLVR) rather than supervised fine-tuning, which passively imitates.
  • Demonstrating that reasoning-aware retrieval surfaces diverse complementary solution strategies that provide distinct scaffolding for different problems.
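For readers unfamiliar with multi-vector retrieval, the following sketch shows late-interaction scoring in the ColBERT style (MaxSim), which is assumed here to be what Reason-ModernColBERT uses given its name and description. The toy two-dimensional token vectors and helper names are hypothetical.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: for each query token vector, take its best cosine
    match among the document token vectors, then sum over query tokens."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)

# Toy token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # matches each query token exactly
doc_b = [[1.0, 1.0]]                          # one vector, partial match only
score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

A single-vector retriever compresses each text into one embedding, like `doc_b` here, whereas the multi-vector form lets individual query tokens find their own best matches, which is one plausible reason for the gap reported in Table 3.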

Datasets

  • QuestA Training Set: 12.5k math reasoning problems; public dataset used for training queries
  • OpenR1-Math-220K: 220k math problems with reasoning traces; public, used as the retrieval corpus
  • AIME 2024/2025 benchmarks: 30 competition problems each; public math contest dataset
  • HMMT February 2025: math competition dataset; public
  • BrUMO 2025: math reasoning benchmark; public

Baselines vs proposed

  • GRPO baseline on Qwen3-1.7B: AIME 2025 average@32 accuracy = 41.6%, RA-RFT = 48.7%
  • GRPO baseline on Qwen3-4B: AIME 2025 average@32 accuracy = 66.4%, RA-RFT = 69.2%
  • OPSD on Qwen3-1.7B (average@16): 38.3% on AIME24 vs RA-RFT 55.1% (average@32; the evaluation protocols differ, but RA-RFT is markedly stronger)
  • QuestA on Qwen3-1.7B: 42.7% on AIME25 vs RA-RFT 48.7%
  • Random trace retrieval on Qwen3-1.7B with RA-RFT: 37.6% average accuracy vs 47.4% with reasoning-aware retriever
  • Qwen3-Embedding-4B retriever recall@1 = 14.7% vs Reason-ModernColBERT (trained) recall@1 = 43.5%
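The two metrics behind these numbers can be made concrete with a short sketch. It assumes average@k is the mean per-problem fraction of correct samples, and recall@1 is the fraction of queries whose top-1 retrieved item is labeled relevant; both definitions are the usual ones but are assumptions about the paper's exact protocol, and the toy data is made up.

```python
def average_at_k(correct_flags_per_problem):
    """Mean over problems of the fraction of correct samples per problem."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return sum(per_problem) / len(per_problem)

def recall_at_1(top1_ids, relevant_ids_per_query):
    """Fraction of queries whose top-ranked item is in the relevant set."""
    hits = sum(1 for top, rel in zip(top1_ids, relevant_ids_per_query) if top in rel)
    return hits / len(top1_ids)

# Toy example with k=4 samples per problem (the paper uses k=32).
flags = [
    [True, True, False, False],   # 2 of 4 correct
    [True, True, True, True],     # 4 of 4 correct
    [False, False, False, False], # 0 of 4 correct
]
avg = average_at_k(flags)

# Toy retrieval results: top-1 id per query and the judged-relevant ids.
r1 = recall_at_1(["a", "b", "c", "d"], [{"a"}, {"x"}, {"c", "y"}, {"z"}])
```

Because average@k averages over samples rather than taking the best one, it rewards consistent correctness, which makes it stricter than pass@k on the same rollouts.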

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.13680.

Fig 1: Motivation of RA-RFT. Left: A training query may be retrieved with a surface-similar but reasoning-irrelevant

Fig 2: Overview of the RA-RFT framework.

Fig 4: Per-sample accuracy of RA-RFT under different retrieved contexts versus raw GRPO on Qwen3-1.7B. Each

Limitations

  • The approach relies on a strong judge model (GPT-4o) for gold-relevance distillation, which may not always be available or cost-effective.
  • Evaluation is limited to mathematical reasoning benchmarks; applicability to other reasoning domains or languages is untested.
  • Reproducibility is constrained by reliance on the closed GPT-4o judge model, and the release of code and weights is not mentioned.
  • No adversarial or out-of-distribution robustness testing was performed to assess retriever/model behavior under challenging or deceptive inputs.
  • The retriever and policy models are trained separately rather than end-to-end; joint optimization could further improve integration.
  • The impact of retrieval corpus size, quality, or domain mismatch on performance is not extensively analyzed.

Open questions / follow-ons

  • How well does retrieval-augmented reinforcement fine-tuning generalize to open-domain or less-structured reasoning tasks beyond math competitions?
  • Can the retrieval and policy models be co-trained end-to-end to better align retrieval with evolving policy capabilities?
  • What are the effects of integrating multi-hop or dynamic retrieval strategies that adapt during inference and training?
  • How sensitive is RA-RFT to the quality and diversity of the retrieval corpus, and can automated corpus construction improve performance?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work highlights that retrieval augmentation for models performing complex reasoning benefits greatly from retrieval based on reasoning utility rather than superficial similarity. Standard lexical or semantic similarity retrieval may fail to provide relevant contextual analogies, misleading model outputs. Integrating reasoning-aware retrieval mechanisms that surface diverse, structurally analogous examples can significantly improve model reasoning accuracy, especially when combined with reinforcement learning objectives that optimize for outcome correctness rather than imitation. This suggests that CAPTCHA or bot-detection systems leveraging language models with reasoning capabilities could enhance robustness by conditioning on externally retrieved, reasoning-relevant exemplars to better handle novel challenges and obfuscated bot query patterns. Careful design of retrieval supervision and retriever fine-tuning is critical to avoid retrieval noise degrading model reliability.

Cite

bibtex
@article{arxiv2606_13680,
  title={ Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning },
  author={ Zilin Xiao and Qi Ma and Chun-cheng Jason Chen and Xintao Chen and Avinash Atreya and Hanjie Chen and Vicente Ordonez },
  journal={arXiv preprint arXiv:2606.13680},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.13680}
}

Articles are CC BY 4.0 — feel free to quote with attribution