Reward Modeling for Multi-Agent Orchestration

Source: arXiv:2606.13598 · Published 2026-06-11 · By King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Semih Yavuz et al.

TL;DR

This paper addresses the challenge of efficiently training and scaling Multi-Agent Systems (MAS) built on Large Language Models (LLMs), where orchestration of specialized agents is key to task performance but is hindered by limited annotated supervision and high computational cost. The authors propose Orchestration Reward Modeling (Orch-RM), a self-supervised reward modeling framework that directly scores orchestration plans without requiring expensive full sub-agent rollouts or human annotations. Orch-RM constructs win-lose pairs from intermediate model checkpoints and multi-agent execution artifacts and trains a Bradley-Terry model to evaluate orchestration quality. This enables more efficient test-time trajectory selection and reward-guided orchestrator training.

Experimentally, Orch-RM improves test-time scaling accuracy by up to 8% across reasoning and QA benchmarks compared to strong baselines, while reducing token usage for verification by an order of magnitude or more. When used as a reward signal for continued orchestrator training, Orch-RM consistently outperforms trajectory-level reinforcement learning approaches with 10x-46x fewer tokens consumed, and even enables training from scratch to approach performance of RL-trained models. These results demonstrate that orchestration-level reward modeling is a scalable and effective supervision paradigm for automated multi-agent orchestration.

Key findings

Orch-RM improves Best-of-N (N=8) test-time scaling accuracy on AIME 24&25 from 63.33% to 68.33% while using only 2.38M tokens, versus 142.80M tokens for the GPT-5-mini trajectory-level judge.
On BrowseComp+, Orch-RM achieves 14.00% accuracy versus 9.50% for the majority-vote baseline, using 8.26M tokens compared to 142.80M for trajectory-level methods.
Continued training with Orch-RM increases majority-vote accuracy on AIME 24&25 from a 65% baseline to 68.33%, with ~10x fewer training tokens than GRPO trajectory-level RL.
Orch-RM enables training orchestrators from scratch to reach 61.67% majority-vote accuracy on AIME 24&25, more than 2.6 times the untrained base model's 23.33%, without trajectory-level supervision.
Combining the specialized-over-base and correct-over-incorrect pair sources in a 3:1 ratio gave the best reward model and the best test-time scaling performance among the mixtures tested.
Orch-RM delivers strong accuracy-efficiency tradeoffs across diverse domains: mathematical reasoning, web-based question answering, multi-hop reasoning, and scientific reasoning (GPQA).
Orchestration-level scoring cuts the number of expensive sub-agent rollouts needed during training and inference, reducing token usage for verification and learning by ~10x to ~46x on the tested datasets.
Orch-RM consistently outperforms orchestration-level LLM-as-a-Judge methods and off-the-shelf reward models across benchmarks (e.g., HotpotQA, AIME, BrowseComp+).

Threat model

N/A; the paper focuses on improving orchestration policy training and inference efficiency rather than security or adversarial threat modeling. The framework assumes no explicit adversarial manipulation or attack on the reward model or orchestrator.

Methodology — deep read

The paper focuses on building an efficient orchestration-level reward model "Orch-RM" to guide multi-agent orchestration within large language model systems. The key steps are:

Threat Model & Assumptions: The adversary is not explicitly modeled since the focus is on self-supervised reward learning and policy training. Instead, the system assumes no human-labeled preference pairs for orchestration quality, relying on intermediate execution artifacts and outcomes for supervision.
Data Collection & Construction: Using the existing MAS-Orchestra framework, the authors collect intermediate orchestration plans (z) produced by different orchestrators and build two kinds of win-lose pairs:
- "Specialized-over-base": pairs comparing trajectories from a trained orchestrator πθ with those from a base model π0.
- "Correct-over-incorrect": pairs constructed by labeling trajectories based on the final correctness of their answers, without human annotation.
The two sources are combined in a 3:1 ratio of specialized-over-base to correct-over-incorrect pairs, averaging 12 and 4 pairs per input respectively.
Reward Model Architecture & Objective: The reward model rφ takes a query x and an orchestration plan z as input and predicts a scalar quality score without executing full trajectories. The model is trained using the Bradley-Terry loss, which maximizes the logistic likelihood that the higher-quality orchestration in each pair receives the higher score: $\mathcal{L}_{\mathrm{RM}}(\phi) = -\mathbb{E}_{(x, z_w, z_l)}\left[\log \sigma\left(r_\phi(x, z_w) - r_\phi(x, z_l)\right)\right]$, where $z_w$ and $z_l$ are the winning and losing plans of a pair and $\sigma$ is the logistic sigmoid.
Training Regime: The reward model is initialized from pre-trained Skywork-Reward-LLaMA-3.1-8B and fine-tuned on constructed pair data. The mixture ratio of pairwise data sources was carefully tuned (best at 3:1 specialized to correctness).
Integration for Test-time Scaling: At inference, multiple candidate orchestrations are sampled from the policy πθ. Orch-RM scores each orchestration, selecting the highest-scoring orchestration prior to sub-agent rollout. This avoids expensive rollouts of all candidates, improving computational efficiency.
Integration for Orchestrator Training: The reward model outputs scalar rewards for candidate orchestrations during policy optimization using Group Relative Policy Optimization (GRPO). This dense feedback at the orchestration level replaces sparse rewards from final task correctness and avoids costly rollout executions.
Evaluation Protocol: Benchmarks include AIME 24&25 (mathematical reasoning), BrowseComp+ (web QA), HotpotQA (multi-hop reasoning), and GPQA (scientific reasoning out of distribution). Metrics focus on accuracy under majority voting and Best-of-N selection (N=8), as well as token usage for verification and training cost. Baselines include trajectory-level policies (MAS-Orchestra), reward modeling (log P, Skywork-LLaMA), LLM judges (GPT-4.1-mini, GPT-5-mini, GPT-5.4-mini), and training methods (RFT, GRPO, DPO).
Reproducibility: The authors provided code at https://github.com/Wang-ML-Lab/OrchRM. They used publicly available models like Qwen2.5-7B-Instruct and GPT-OSS-120B. Datasets are standard benchmarks; however, intermediate MAS-Orchestra training artifacts are required, which are publicly shared.
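The Bradley-Terry objective in the steps above can be made concrete with a small numeric sketch. This is only an illustration of the loss on precomputed scores, not the authors' training code: the function names `log_sigmoid` and `bradley_terry_loss` are hypothetical, and plain Python floats stand in for the outputs of rφ(x, z_w) and rφ(x, z_l).

```python
import math
from typing import Sequence


def log_sigmoid(t: float) -> float:
    """Numerically stable log(sigmoid(t))."""
    # For t >= 0 use -log(1 + e^{-t}); for t < 0 use t - log(1 + e^{t}).
    # Both branches only exponentiate a non-positive number, so no overflow.
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def bradley_terry_loss(win_scores: Sequence[float],
                       lose_scores: Sequence[float]) -> float:
    """Mean of -log sigma(r_w - r_l) over a batch of win-lose pairs.

    win_scores[i] and lose_scores[i] are assumed to be reward-model scores
    for the winning and losing plan of the same query.
    """
    if len(win_scores) != len(lose_scores) or not win_scores:
        raise ValueError("need equally sized, non-empty score lists")
    total = sum(-log_sigmoid(w - l) for w, l in zip(win_scores, lose_scores))
    return total / len(win_scores)
```

When the two plans receive equal scores the per-pair loss is log 2, and it shrinks toward zero as the winning plan's margin grows, which is the pressure that pushes rφ to rank better orchestrations higher.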

One end-to-end example: Given an AIME math problem as the input query x, the orchestrator πθ generates multiple candidate orchestration plans z_1, ..., z_N. Orch-RM scores each plan z_i from the query and the plan alone, without dispatching sub-agents. Only the highest-scoring plan ẑ is executed in full to obtain a final answer. The reward model, itself trained on pairwise orchestration comparisons, also drives continued training of πθ with GRPO, which reuses Orch-RM scores instead of full rollouts and so improves accuracy efficiently over iterations.
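The walk-through above can be sketched in a few lines. The sketch assumes a hypothetical `score_fn(query, plan)` callable standing in for Orch-RM, and it uses the standard GRPO group normalization (score minus group mean, divided by group standard deviation); whether the authors normalize in exactly this way is an assumption, not something this summary confirms.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


def select_best_plan(query: str,
                     plans: Sequence[str],
                     score_fn: Callable[[str, str], float]) -> Tuple[int, List[float]]:
    """Score every candidate plan and return (index of the best plan, all scores).

    No sub-agent is executed here: score_fn is a hypothetical stand-in for the
    reward model and sees only the query and the plan text.
    """
    if not plans:
        raise ValueError("need at least one candidate plan")
    scores = [score_fn(query, plan) for plan in plans]
    best_index = max(range(len(plans)), key=scores.__getitem__)
    return best_index, scores


def group_relative_advantages(scores: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group.

    Assumption: population standard deviation plus a small eps, so a group
    with identical scores yields all-zero advantages instead of dividing by 0.
    """
    if not scores:
        raise ValueError("need at least one score")
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]
```

At test time only the plan at `best_index` would be dispatched to sub-agents; during training the same per-plan scores would be turned into advantages, so neither path needs a full rollout of every candidate.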

Technical innovations

Introduction of Orch-RM, a self-supervised orchestration-level reward modeling framework that learns to score multi-agent orchestrations without costly full trajectory rollouts or human annotations.
Construction of win-lose pair datasets by leveraging intermediate model checkpoints and correctness labels to train a Bradley-Terry reward model specialized for MAS orchestration evaluation.
Application of orchestration-level reward scores for both efficient test-time scaling (prioritizing candidate orchestrations) and practical reward-guided orchestrator training that reduces token usage drastically.
Demonstration that combining specialized-over-base comparisons with correct-over-incorrect comparisons enhances reward model generalization and evaluation quality, improving downstream accuracy.
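To illustrate how the two pair sources might be assembled and mixed, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Pair` record, the function names, the exhaustive cross-product pairing, and the down-sampling used to hit the 3:1 ratio are not taken from the paper, which may pair and sample differently.

```python
import random
from itertools import product
from typing import List, NamedTuple, Sequence


class Pair(NamedTuple):
    query: str
    win: str      # plan treated as preferred
    lose: str     # plan treated as dispreferred
    source: str   # which construction rule produced the pair


def specialized_over_base(query: str, specialized_plans: Sequence[str],
                          base_plans: Sequence[str]) -> List[Pair]:
    """Every trained-orchestrator plan wins over every base-model plan."""
    return [Pair(query, w, l, "specialized-over-base")
            for w, l in product(specialized_plans, base_plans)]


def correct_over_incorrect(query: str, plans: Sequence[str],
                           is_correct: Sequence[bool]) -> List[Pair]:
    """Plans whose final answer was correct win over plans whose answer was not."""
    wins = [p for p, ok in zip(plans, is_correct) if ok]
    loses = [p for p, ok in zip(plans, is_correct) if not ok]
    return [Pair(query, w, l, "correct-over-incorrect")
            for w, l in product(wins, loses)]


def mix_pairs(spec: List[Pair], corr: List[Pair],
              ratio: int = 3, seed: int = 0) -> List[Pair]:
    """Down-sample both sources so that spec : corr is exactly ratio : 1."""
    rng = random.Random(seed)
    n_corr = min(len(corr), len(spec) // ratio)
    n_spec = n_corr * ratio
    mixed = rng.sample(spec, n_spec) + rng.sample(corr, n_corr)
    rng.shuffle(mixed)
    return mixed
```

With four specialized plans against three base plans, and two correct against two incorrect plans, this yields 12 and 4 pairs for one query, matching the per-input averages reported in the methodology.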

Datasets

AIME 24&25 — Moderate-sized mathematical reasoning dataset from American Invitational Mathematics Examination series.
BrowseComp+ — Web-based question answering benchmark requiring tool use and planning.
HotpotQA — Multi-hop question answering dataset.
GPQA — Scientific reasoning out-of-distribution benchmark.

Baselines vs proposed

MAS-Orchestra baseline (no verification): AIME 24&25 accuracy = 63.33%; Orch-RM = 68.33%
Majority Voting baseline: BrowseComp+ accuracy = 9.50%; Orch-RM = 14.00%
GPT-5-mini trajectory-level LLM-as-a-Judge: BrowseComp+ accuracy = 12.50%; Orch-RM (orchestration-level) = 14.00% with ~17x fewer tokens
GRPO trajectory-level continued training majority-vote accuracy on BrowseComp+ = 9.50%; Orch-RM continued training = 11.00% with >10x less token usage
Training from scratch: Base model majority-vote accuracy on AIME 24&25 = 23.33%; Orch-RM from scratch = 61.67%
DPO training majority-vote accuracy on AIME 24&25 = 65%; Orch-RM continued training = 68.33%

Limitations

Orch-RM’s performance depends heavily on the diversity and quality of intermediate orchestrator artifacts used for constructing training pairs; limited diversity may reduce reward model generalization.
The reward models are currently trained separately per domain, limiting cross-domain generalizability and requiring domain-specific data collection efforts.
Orch-RM relies on the correctness of final outcomes for some pair constructions, which may propagate errors if the correctness labels are noisy or limited.
The approach has not been tested under active adversarial conditions or against deliberately manipulated orchestration plans, leaving robustness questions open.
While Orch-RM reduces token usage, inference still depends on sampling multiple candidate orchestrations, which can be costly for very large N.
The study focuses on specific multi-agent workflows; applicability to more heterogeneous agent ecosystems or non-LLM sub-agents remains unexplored.

Open questions / follow-ons

Can orchestration-level reward models be effectively trained jointly on multi-domain, multi-task data to improve generalizability across diverse MAS settings?
How robust are orchestration-level reward models to adversarial or out-of-distribution orchestrations, especially those crafted to fool reward signals?
What architectural or training modifications could further reduce the token cost of reward-guided MAS training without degrading performance?
Could intermediate sub-agent output supervision be incorporated alongside orchestration-level signals to improve granularity of feedback while maintaining efficiency?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners working with multi-agent LLM systems to verify or generate human-like challenge responses, Orch-RM presents a promising method to efficiently evaluate and train multi-agent orchestrators. By scoring the orchestration itself before costly sub-agent execution, Orch-RM enables rapid selection of higher-quality agent plans, reducing computational overhead during test-time scaling. This has direct implications for scalable CAPTCHA generation systems where multiple adversarial or puzzle-solving agents collaborate. Furthermore, replacing full trajectory-level feedback with orchestration-level signals facilitates more frequent and resource-efficient orchestrator updates, improving adaptability without human annotation bottlenecks. Bot-defense engineers could leverage Orch-RM principles to design multi-agent evaluation layers that prioritize candidate solutions before deploying expensive verification steps, thereby balancing accuracy, robustness, and operational cost in large-scale CAPTCHA challenges.

Cite

bibtex

@article{arxiv2606_13598,
  title={Reward Modeling for Multi-Agent Orchestration},
  author={King Yeung Tsang and Zihao Zhao and Vishal Venkataramani and Haizhou Shi and Zixuan Ke and Semih Yavuz and Shafiq Joty and Hao Wang},
  journal={arXiv preprint arXiv:2606.13598},
  year={2026},
  url={https://arxiv.org/abs/2606.13598}
}
