MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

Source: arXiv:2605.06623 · Published 2026-05-07 · By Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan, Yutong Wang, Zhenxi Song et al.

TL;DR

MASPO addresses a fundamental challenge in LLM-based Multi-Agent Systems (MAS): how to jointly optimize role-specific prompts across interacting agents when each agent only sees a slice of the task and ground-truth labels are unavailable for intermediate steps. Prior prompt optimization work either tunes agents in isolation (missing inter-agent covariate shift) or relies on fixed candidate pools with Bayesian search (TPE-based MIPRO/MASS), both of which fail to resolve the credit assignment problem—where an agent can produce a locally valid output that systematically misleads downstream peers. MASPO treats the prompt set P = {p1...pN} as joint learnable parameters and introduces a composite reward that measures Local Validity, Lookahead Potential (immediate successors), and Global Alignment (final system output), evaluated without ground-truth labels via LLM-as-judge comparisons. A misalignment mining loop explicitly surfaces traces where local success coexists with global failure and injects them as hard negatives into the prompt optimizer.

The optimization loop runs as a coordinate-ascent beam search over the agent graph's topological order. Each agent is optimized for T steps per round over D rounds with a beam width of K=2. At each step the Optimizer model (Gemini-2.5-Pro) generates Ksub=2 candidate variations per parent prompt, using execution traces as few-shot context. A Beam Refresh mechanism re-anchors cumulative scores at the start of each new round to counteract covariate shift as peer agents evolve, using a centered win-rate (J_new = R(p, pbest) − 0.5) rather than stale absolute scores. The sample pool is kept deliberately small (|D|=50, mini-batch |B|=10) to limit data requirements and API cost.

Across 6 benchmarks (MATH-500, AGIEval-MATH, AQuA, GPQA-Diamond, MBPP, HumanEval-ET) using Qwen3-8B as the backbone agent model, MASPO achieves a +2.90 average accuracy improvement over the strongest prompt-optimization baseline (SPO) and +5.06 / +2.73 over unoptimized Sequential / Hierarchical MAS respectively. Optimized prompts also transfer to stronger backbone models (DeepSeek-V3, GLM-4.6, Claude-Sonnet-4, Gemini-2.5-Pro) without re-optimization, yielding consistent gains of roughly 2–4 points average accuracy.

Key findings

MASPO achieves +2.90 average accuracy over the best competing prompt-optimization baseline (SPO) across 6 benchmarks (MATH-500, AGIEval-MATH, AQuA, GPQA-Diamond, MBPP, HumanEval-ET) using Qwen3-8B as the MAS backbone.
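On Sequential MAS, MASPO improves average accuracy by a reported 5.06 points over the unoptimized baseline (the averages quoted here, 65.31 → 70.39, differ by 5.08, a small discrepancy this summary cannot resolve); on Hierarchical MAS, by 2.73 points (68.32 → 71.05).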
GPQA-Diamond shows the largest single-benchmark gain: +10.04 points in Sequential MAS (48.04 → 58.08) for MASPO over TPE, the next-best joint-capable baseline.
Beam Refresh ablation reveals that Kendall's top-1 beam overlap across rounds is only 0.63 without refresh, confirming that a large fraction of retained candidates become stale as peer agents evolve; removing Beam Refresh drops average accuracy from 70.39 to 68.53.
Removing Joint Evaluation (reverting to local-only scoring) drops average accuracy from 70.39 to 67.77; removing Misalignment Sampling drops it to 69.68; substituting success-case sampling further degrades to 69.29—worse than random sampling—confirming hard negatives are the signal source.
Optimized prompts transfer across architectures without re-optimization: on Gemini-2.5-Pro backbone, applying Qwen3-8B-optimized prompts raises average accuracy from 84.93 to 87.14 (+2.21); on DeepSeek-V3, from 71.79 to 75.86 (+4.07).
Using Qwen3-8B as both optimizer and evaluator (instead of Gemini-2.5-Pro) still yields gains over vanilla MAS (67.70 avg vs. 65.31 Sequential MAS baseline), though below the full-system result of 70.39, demonstrating partial independence from a frontier optimizer.
SPO given an identical search budget or an identical Gemini API call budget still underperforms MASPO (67.79 and 67.86 respectively vs. 70.39), indicating that the gap is not explained by compute budget alone.

Threat model

n/a — MASPO is a prompt optimization framework for multi-agent LLM systems, not a security or adversarial robustness paper. The paper identifies an internal structural failure mode ('Local-Global Misalignment') where an agent satisfies its local prompt constraints but degrades downstream performance, but this is framed as a coordination problem rather than an external adversarial attack. No external adversary, attacker capabilities, or security threat model are defined.

Methodology — deep read

Threat model and problem framing: MASPO is not a security paper, but it does frame an adversarial-style failure mode: the 'Local-Global Misalignment' scenario where an upstream agent satisfies its role-specific prompt perfectly yet produces output that misleads downstream agents, causing system-wide failure. The adversary here is structural—the combinatorial coupling of agent prompts—not an external attacker. Assumptions: agents are orchestrated via a fixed directed communication graph G=(V,E) with known topology; the backbone LLM is frozen (inference-only, temperature=0); only system prompts are tunable; no ground-truth labels are available for intermediate agent outputs; and a powerful meta-optimizer (Gemini-2.5-Pro) is accessible for prompt generation and evaluation.

Data provenance and splits: Each benchmark (MATH-500, AGIEval-MATH Level-5, AQuA, GPQA-Diamond, MBPP, HumanEval-ET) is used with a deliberately small unlabeled sample pool of |D|=50 instances drawn from the respective test/dev sets. During each optimization iteration a mini-batch of |B|=10 is randomly sampled from this pool for trace collection and joint reward computation. The paper does not report a separate held-out test split for optimization—evaluation metrics are computed on the full benchmark test sets after optimization concludes. The small pool design is intentional to demonstrate low-data applicability, but this also means optimization and final evaluation data may partially overlap (the paper does not explicitly clarify whether the 50-sample pool is disjoint from the reported test set).

Architecture and algorithm—core loop: MASPO instantiates a MAS as a directed graph where each agent vi has a learnable prompt pi and a frozen LLM inference function fi. The optimization objective (Eq. 3) is to find P* = argmax E[R(Φ(G,P,q), o*_glob)], but since intermediate ground truth is unavailable, MASPO replaces R with a self-supervised composite proxy. The loop proceeds in topological agent order over D=3 rounds, optimizing each agent for T=3 steps per round (interleaved, not fully converged, to prevent upstream overfitting to stale downstream behaviors). At each step, the Optimizer model Mopt (Gemini-2.5-Pro, temperature=0.7) receives the current parent prompt plus a mini-batch of execution traces (context C, query q, output o) as few-shot examples, generating Ksub=2 candidate variations per parent in the beam. Additionally, up to Kmis=3 misalignment cases are injected from a memory buffer Bmis to bias generation toward repairing coordination failures (Eq. 4).
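The interleaved visiting order described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `optimization_schedule`, the successor-map representation of the graph, and the three-agent chain are assumptions made for the example; only D=3 rounds, T=3 steps, and the topological ordering come from the summary.

```python
from graphlib import TopologicalSorter


def optimization_schedule(successors, rounds=3, steps=3):
    """Yield (round, agent, step) visits in interleaved coordinate-ascent order.

    `successors` maps each agent to the agents that consume its output
    (a hypothetical encoding of the communication graph G).
    """
    # TopologicalSorter expects node -> predecessors, so invert the map.
    predecessors = {node: set() for node in successors}
    for node, outs in successors.items():
        for out in outs:
            predecessors.setdefault(out, set()).add(node)
    order = list(TopologicalSorter(predecessors).static_order())

    for d in range(rounds):          # D rounds over the whole graph
        for agent in order:          # upstream agents first
            for t in range(steps):   # only T steps before moving on
                yield d, agent, t


# Hypothetical three-agent sequential chain.
chain = {"agent1": ["agent2"], "agent2": ["agent3"], "agent3": []}
schedule = list(optimization_schedule(chain))
print(len(schedule))   # 27 visits: 3 rounds x 3 agents x 3 steps
print(schedule[:4])
```

Each agent gets only T steps before the loop moves downstream, and it is revisited in later rounds once its peers have changed. That is the property the summary credits with avoiding overfitting to stale downstream behavior.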

Architecture and algorithm—Joint Reward Model: Each candidate prompt p' is evaluated against the parent prompt p via a pairwise LLM judge (Evaluator model Meval, Gemini-2.5-Pro, temperature=0) across three dimensions (Eq. 5): Local Validity (α=0.4): does the candidate's output o'_i beat the baseline output o_i in role-specific quality? Lookahead Potential (β=0.4): when downstream agents vj ∈ Nout(vi) receive the new context generated by vi under p', do their outputs o'_j improve over oj? This requires actually running the successor agents under the new upstream output. Global Alignment (θ=0.2): does the final system response o'_glob improve over oglob? The composite score R is the average over the mini-batch of binary indicator sums weighted by α, β, θ (constraint: α+β+θ=1). The ablation in Figure 2 shows the performance landscape over (α, β) confirming that local+lookahead dominate; lower global weight is optimal, likely because global signal is noisy with only 10 samples.
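As a rough sketch of how the three judge outcomes combine into one score, the snippet below averages weighted binary indicators over a mini-batch. The `Judgement` record and `joint_reward` function are hypothetical names. Two choices are assumptions not confirmed by the summary: lookahead wins are averaged across multiple successors, and an agent with no successors gets a lookahead term of 0. The weights are the reported defaults (α=0.4, β=0.4, θ=0.2).

```python
from dataclasses import dataclass, field


@dataclass
class Judgement:
    """Pairwise judge outcomes for one sample (candidate vs. parent prompt)."""
    local_win: bool                       # candidate output beats parent output
    successor_wins: list = field(default_factory=list)  # one bool per successor
    global_win: bool = False              # final system output improved


def joint_reward(judgements, alpha=0.4, beta=0.4, theta=0.2):
    """Mini-batch average of weighted local / lookahead / global indicators."""
    if abs(alpha + beta + theta - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if not judgements:
        raise ValueError("empty mini-batch")
    total = 0.0
    for j in judgements:
        # Assumption: average over successors; 0.0 when there are none.
        if j.successor_wins:
            lookahead = sum(j.successor_wins) / len(j.successor_wins)
        else:
            lookahead = 0.0
        total += alpha * j.local_win + beta * lookahead + theta * j.global_win
    return total / len(judgements)


batch = [
    Judgement(local_win=True, successor_wins=[True], global_win=True),
    Judgement(local_win=True, successor_wins=[False], global_win=False),
]
score = joint_reward(batch)
print(score)
```

The second sample wins locally but loses downstream, so it contributes only the α term. Samples of this kind are exactly what the misalignment mining step looks for.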

Misalignment Mining: After computing the joint reward for each sample k, samples satisfying I(o'_i ≻ oi)(k)=1 AND (Lookahead(k)=0 OR Global(k)=0) are flagged as misalignment cases (Eq. 6) and stored in buffer Bmis. These are hard negatives: the agent did its local job but broke the system. On the next generation step, up to Kmis=3 such traces are prepended to the optimizer's context, explicitly forcing Mopt to generate prompts that avoid these failure patterns. Sensitivity analysis (Table 2, Section III) shows Kmis=3 is near-optimal; success-case sampling instead degrades performance, confirming the hard-negative hypothesis.
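The flagging rule reduces to a simple filter: local win AND (lookahead loss OR global loss). The sketch below assumes each sample is a dictionary of boolean judge outcomes. The field names, the helper names `mine_misalignments` and `sample_hard_negatives`, and the uniform random choice of which buffered cases to inject are all illustrative assumptions; only the rule itself and Kmis=3 come from the summary.

```python
import random


def mine_misalignments(records):
    """Keep samples where the agent won locally but the system did not benefit."""
    return [
        r for r in records
        if r["local_win"] and (not r["lookahead_win"] or not r["global_win"])
    ]


def sample_hard_negatives(buffer, k_mis=3, seed=None):
    """Pick at most k_mis buffered cases to prepend to the optimizer context."""
    if len(buffer) <= k_mis:
        return list(buffer)
    # Assumption: uniform sampling; the summary does not say how cases are chosen.
    return random.Random(seed).sample(buffer, k_mis)


records = [
    {"id": "q1", "local_win": True, "lookahead_win": True, "global_win": True},
    {"id": "q2", "local_win": True, "lookahead_win": False, "global_win": True},
    {"id": "q3", "local_win": False, "lookahead_win": False, "global_win": False},
    {"id": "q4", "local_win": True, "lookahead_win": True, "global_win": False},
]
buffer = mine_misalignments(records)
print([r["id"] for r in buffer])  # ['q2', 'q4']
```

Sample q3 is excluded even though everything failed: without a local win it is an ordinary failure rather than a local-global misalignment.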

Beam search and Beam Refresh: The framework maintains a beam of K=2 top-scoring prompt candidates. Cumulative scores J(p') = R(p', p_parent; Biter) + J(p_parent) are accumulated across optimization steps within a round. When an agent is revisited in a new round (after peer agents have been updated), all stale cumulative scores are discarded and each candidate is re-scored relative to the current global best prompt p_best using a centered win-rate: J_new(p) = R(p, p_best; Biter) − 0.5 (Eq. 8). Subtracting 0.5 places the break-even point at zero, so candidates that score below parity against p_best receive negative scores, anchoring the beam to the current performance level. The Kendall top-1 overlap of 0.63 across rounds (reported for the no-refresh ablation) quantifies how stale historical rankings become and is the empirical motivation for the refresh.
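The two scoring rules can be illustrated as follows. This is a sketch under stated assumptions: prompts are represented by string identifiers, and the rewards against p_best are passed in as precomputed numbers in [0, 1] rather than produced by an LLM judge. The function names are hypothetical.

```python
def cumulative_score(reward_vs_parent, parent_score):
    """Within a round: J(p') = R(p', p_parent) + J(p_parent)."""
    return reward_vs_parent + parent_score


def refresh_beam(reward_vs_best, beam_width=2):
    """At the start of a new round: drop old scores, re-anchor to p_best.

    `reward_vs_best` maps a candidate id to R(p, p_best) on the current
    mini-batch. Returns the top `beam_width` (id, J_new) pairs.
    """
    rescored = {p: r - 0.5 for p, r in reward_vs_best.items()}
    ranked = sorted(rescored, key=rescored.get, reverse=True)
    return [(p, rescored[p]) for p in ranked[:beam_width]]


# Hypothetical rewards measured after peer agents were updated.
fresh_rewards = {"prompt_a": 0.7, "prompt_b": 0.4, "prompt_c": 0.55}
beam = refresh_beam(fresh_rewards)
print([p for p, _ in beam])  # ['prompt_a', 'prompt_c']
```

Here prompt_b scores below 0.5 against p_best, so its refreshed score is negative and it falls out of the beam regardless of how much cumulative score it had built up in earlier rounds.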

Evaluation protocol and baselines: Final evaluation uses full benchmark test sets. Baselines span four tiers: (1) single-agent methods (Vanilla, CoT, SC-CoT, Self-Refine); (2) fixed-architecture multi-agent methods (Sequential MAS, Hierarchical MAS, AgentDropout); (3) prompt-optimized MAS with TPE (MIPRO/MASS style, fixed candidate pool); (4) prompt-optimized MAS with SPO (single-agent optimizer adapted to MAS). No statistical significance tests (e.g., bootstrap, t-test) are reported—results are point estimates. Ablation is thorough across 8 dimensions (Table 2) but all ablations use Sequential MAS + Qwen3-8B only. Cross-model transfer (Table 3) tests prompt portability to DeepSeek-V3, GLM-4.6, Claude-Sonnet-4, Gemini-2.5-Pro without re-optimization. Code is released at https://github.com/wangzx1219/MASPO.

Concrete end-to-end example (Sequential MAS, GPQA-Diamond): The system has N agents in a sequential chain. Starting from random/minimal initial prompts, in Round 1 the optimizer targets Agent 1: it samples a mini-batch of 10 GPQA queries, runs the full chain, collects traces (query, context, output) for Agent 1, identifies misalignment cases where Agent 1's output looks locally coherent but leads Agent 2 or the final answer astray, and feeds up to Kmis=3 of those cases plus the 10 randomly sampled traces to Gemini-2.5-Pro to generate 2 candidate prompt variants per parent. Each variant is scored on local, lookahead (Agent 2's output quality), and global (final answer) dimensions. The top-2 are retained in the beam. This repeats for T=3 steps, then Agent 2 is optimized similarly for 3 steps, and so on down the chain. This full topological pass constitutes one round; D=3 rounds are run. After the final round, MASPO's Sequential MAS reaches 58.08% on GPQA vs. 48.04% for TPE-optimized Sequential MAS, a +10.04 point gain on this benchmark specifically.

Technical innovations

Multi-Granularity Joint Reward (Eq. 5) that decomposes prompt quality into Local Validity, Lookahead Potential (immediate successor output quality), and Global Alignment without requiring ground-truth labels for intermediate agents—prior work (SPO, TPE-based MIPRO/MASS) used only local or final-outcome signals.
Misalignment-Aware Sampling (Eq. 6) that explicitly mines traces where local success coexists with global failure and injects them as hard negatives into the prompt generator, contrasting with standard random or success-biased sampling used in prior prompt optimizers.
Beam Refresh Mechanism (Eq. 8) that discards stale cumulative beam scores and re-anchors candidates to the current global best via a centered win-rate whenever peer agents are updated, directly addressing covariate shift in the non-stationary MAS optimization landscape—not addressed in prior beam-search prompt optimizers.
Interleaved coordinate-ascent scheduling (step size T per agent per round, D rounds) that prevents upstream agents from overfitting to early suboptimal downstream behaviors, unlike fully sequential convergence used in prior multi-agent optimization (MIPRO, MASS).
Trace-Guided Generative Proposal (Eq. 4) that grounds prompt mutations in actual inter-agent execution traces (including upstream context C), enabling the optimizer to reason about communication dependencies rather than treating prompt optimization as context-free string mutation.

Datasets

MATH-500 — 500 problems — public (Hendrycks et al., 2021)
AGIEval-MATH (Level-5 subset) — size not explicitly stated — public (Zhong et al., 2024)
AQuA — size not explicitly stated — public (Patel et al., 2021)
GPQA-Diamond — size not explicitly stated — public (Rein et al., 2024)
MBPP — size not explicitly stated — public (Austin et al., 2021)
HumanEval-ET — size not explicitly stated — public (Dong et al., 2025)

Baselines vs proposed

Vanilla (single-agent): Avg = 65.27 vs MASPO Sequential MAS: 70.39
CoT (single-agent): Avg = 65.86 vs MASPO Sequential MAS: 70.39
SC(CoT) (single-agent): Avg = 66.48 vs MASPO Sequential MAS: 70.39
Self-Refine (single-agent): Avg = 66.27 vs MASPO Sequential MAS: 70.39
AgentDropout: Avg = 66.89 vs MASPO Sequential MAS: 70.39
Sequential MAS (no optimization): Avg = 65.31 vs MASPO Sequential MAS: 70.39
Sequential MAS + TPE: Avg = 66.49 vs MASPO Sequential MAS: 70.39
Sequential MAS + SPO: Avg = 66.56 vs MASPO Sequential MAS: 70.39
Hierarchical MAS (no optimization): Avg = 68.32 vs MASPO Hierarchical MAS: 71.05
Hierarchical MAS + TPE: Avg = 68.47 vs MASPO Hierarchical MAS: 71.05
Hierarchical MAS + SPO: Avg = 69.01 vs MASPO Hierarchical MAS: 71.05
SPO + Same Search Budget: Avg = 67.79 vs MASPO: 70.39
SPO + Same Gemini Budget: Avg = 67.86 vs MASPO: 70.39
Self-Optimized (Qwen3-8B optimizer+evaluator): Avg = 67.70 vs MASPO (Gemini-2.5-Pro optimizer+evaluator): 70.39
DeepSeek-V3 MAS (no prompt opt): Avg = 71.79 vs + Optimized Prompt: 75.86
GLM-4.6 MAS (no prompt opt): Avg = 75.61 vs + Optimized Prompt: 78.41
Claude-Sonnet-4 MAS (no prompt opt): Avg = 77.58 vs + Optimized Prompt: 79.73
Gemini-2.5-Pro MAS (no prompt opt): Avg = 84.93 vs + Optimized Prompt: 87.14
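The cross-model transfer rows above can be turned into per-backbone gains with a short calculation. The averages are copied from the four transfer rows in this list; the variable names are illustrative.

```python
# (average without prompt optimization, average with transferred prompts)
transfer = {
    "DeepSeek-V3": (71.79, 75.86),
    "GLM-4.6": (75.61, 78.41),
    "Claude-Sonnet-4": (77.58, 79.73),
    "Gemini-2.5-Pro": (84.93, 87.14),
}

# Round to two decimals to match the precision of the reported averages.
gains = {model: round(after - before, 2) for model, (before, after) in transfer.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain:.2f}")
```

The resulting gains are +4.07, +2.80, +2.15, and +2.21 points, which matches the "roughly 2–4 points" range stated in the TL;DR.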

Limitations

Optimization and test data overlap risk: The 50-sample pool used for prompt optimization is drawn from benchmark datasets, and the paper does not explicitly confirm these 50 samples are held out from the reported test-set accuracy figures—if they overlap, reported gains could be inflated.
No statistical significance testing: All results are single point estimates; no bootstrap confidence intervals, t-tests, or multiple-run variance are reported, making it difficult to assess whether +2.90 average improvement is robust or within noise on smaller benchmarks.
Reliance on a frontier LLM as optimizer/evaluator: The primary results use Gemini-2.5-Pro for both Mopt and Meval, which is expensive and may not be accessible in production settings; the self-optimized Qwen3-8B variant drops 2.7 points on average, suggesting significant capability dependence.
Ablations restricted to Sequential MAS on Qwen3-8B: All 8 ablation dimensions (Table 2) are run on one topology and one backbone; it is unclear whether component contributions hold for Hierarchical MAS or other agent graph structures.
Fixed graph topology assumption: MASPO assumes the communication graph G is given and fixed; it does not address topology optimization or dynamic agent routing, limiting applicability to systems where the graph is known a priori.
No adversarial or distribution-shift evaluation: All benchmarks use in-distribution queries; there is no test of prompt robustness to out-of-distribution inputs, prompt injection, or adversarial queries—relevant if deployed in bot-defense pipelines.
Compute cost not fully quantified: While the paper demonstrates compute-controlled ablations against SPO, the absolute cost of MASPO (number of Gemini API calls per optimization run) is not reported in the main text, making real-world deployment budgeting difficult.

Open questions / follow-ons

Can MASPO's misalignment-aware optimization be extended to dynamic or emergent graph topologies (e.g., where agent routing is itself learned), and does the joint reward remain well-defined when the graph structure changes mid-optimization?
The paper shows that 1-step lookahead plus global alignment suffices (2- and 3-step lookahead add negligible gain), but the theoretical reason is not established—is this because Gemini-2.5-Pro's global alignment signal is already high-quality, or because long-range dependencies in the tested benchmarks are shallow? Would deeper tasks (e.g., multi-hop research synthesis) require deeper lookahead?
Misalignment cases are mined using the same LLM judge that evaluates candidates—how robust is this to judge miscalibration or systematic biases in Gemini-2.5-Pro's preference judgments, and would a disagreement-based ensemble of evaluators improve reliability?
The 50-sample optimization pool is fixed per benchmark; how does pool composition (domain diversity, difficulty distribution) affect optimized prompt quality, and can active learning or uncertainty sampling improve sample efficiency further?

Why it matters for bot defense

Bot-defense and CAPTCHA systems increasingly use LLM-based multi-agent pipelines for tasks like abuse signal classification, challenge difficulty calibration, behavior anomaly summarization, and risk scoring—each handled by a specialized agent in a sequential or hierarchical chain. MASPO is directly relevant as a tool for automatically refining the system prompts that govern each agent's role without requiring labeled intermediate outputs. The misalignment mining mechanism is particularly useful in abuse-detection chains where an upstream 'evidence extraction' agent may produce outputs that look locally coherent but systematically mislead a downstream 'verdict' agent—exactly the Local-Global Misalignment failure mode MASPO targets. The compute-controlled ablation confirms that MASPO's gains are algorithmic, not just API-call inflation, which matters for cost-sensitive production deployments.

Practical caveats for a bot-defense engineer: First, the framework's reliance on Gemini-2.5-Pro as meta-optimizer introduces a dependency on an external frontier model, which may raise data-privacy concerns if abuse traces or user behavior signals are fed as optimization context. The self-optimized Qwen3-8B variant mitigates this but with a ~2.7-point average performance drop. Second, MASPO is evaluated only on clean, in-distribution benchmarks; bot-defense prompts must be robust to adversarial inputs and prompt-injection attempts, neither of which is tested here. Third, the 50-sample pool assumption may be unrealistic in cold-start abuse scenarios where labeled or even unlabeled representative queries are scarce. A practitioner should treat MASPO as a strong prompt-optimization baseline for MAS pipelines but should independently evaluate prompt robustness under distribution shift and adversarial conditions before production deployment.

Cite

bibtex

@article{arxiv2605_06623,
  title={MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems},
  author={Zhexuan Wang and Xuebo Liu and Li Wang and Zifei Shan and Yutong Wang and Zhenxi Song and Min Zhang},
  journal={arXiv preprint arXiv:2605.06623},
  year={2026},
  url={https://arxiv.org/abs/2605.06623}
}
