IEMAS: An Incentive-Efficiency Routing Framework for Open Agentic Web Ecosystems

Source: arXiv:2603.17302 · Published 2026-03-18 · By Hongze Liu, Chang Guo, Yingzeng Li, Mengru Wang, Jiong Lou, Shijing Yuan et al.

TL;DR

This paper addresses the challenge of efficient and incentive-compatible routing in open, distributed multi-agent systems (MAS) built from large language model (LLM) inference agents. Existing approaches typically handle single-query routing without accounting for long-term resource reuse like KV caching or system-level many-to-many matching, and rely on generic incentive mechanisms that miss LLM-specific factors such as cache locality and stochastic latency. The authors introduce IEMAS, a distributed routing framework combining a probabilistic predictive model for quality-of-service (QoS) estimation under uncertainty with a VCG-based auction mechanism formulated as a Min-Cost Max-Flow (MCMF) problem. The design ensures truthful capability reporting and social welfare maximization while preserving cache-affinity to minimize redundant computation. Implemented atop the vLLM inference engine and evaluated on multi-turn dialogue, long context, and reasoning benchmarks, IEMAS achieves 35% lower average service cost and up to 2.9× latency reduction compared to multiple prior routing baselines. It also maintains strong incentive compatibility for strategic clients. The agentic hub architecture supports scalability by clustering agents and localizing auctions, reducing communication overhead and preserving proprietary internals. This work advances the state of agentic web ecosystems by jointly optimizing economic incentives and computational efficiency in LLM MAS scheduling.

Key findings

  • IEMAS achieves KV cache hit rates of 80.2% on CoQA multi-turn tasks versus 53.1% for the strongest baseline (GMTRouter).
  • Latency on the HotpotQA reasoning benchmark drops from 2139.8 ms (RouterDC baseline) to 284.2 ms with IEMAS, a roughly 7.5× improvement.
  • IEMAS reduces average service cost by approximately 35%, with CoQA cost dropping from 10.507 (MFRouter baseline) to 6.944.
  • The Hoeffding Tree online predictor achieves normalized mean absolute errors (NMAE) of 0.101 for latency and 0.090 for cost, enabling accurate QoS estimation under uncertainty.
  • VCG-based auction mechanism combined with MCMF ensures dominant strategy incentive compatibility for clients, verified against strategic bidding behaviors.
  • IEMAS satisfies weak budget balance, collecting sufficient client payments to cover agent service costs without system deficit.
  • Proxy hub clustering reduces global computation and communication cost while preserving social welfare and mitigating incentive rationality conflicts.
  • Simulations show that truthful clients accumulate steadily increasing utility across auction rounds, while aggressive, conservative, and random bidding strategies incur penalties.

Threat model

The adversary is a self-interested client agent who may strategically misreport their utility (valuation) to gain favorable routing and pricing; agents are assumed honest due to verifiable, proxy-measured service costs. The mechanism's design focuses on ensuring incentive compatibility (truthfulness) for clients under incomplete information and stochastic LLM inference conditions. Adversaries cannot manipulate the proxy's cost measurements or the cache-affinity scoring mechanism.

Methodology — deep read

  1. Threat Model & Assumptions: The adversary is modeled as a self-interested client agent who might strategically misreport their true valuation for LLM inference services. Agents themselves are assumed honest about service costs because the proxy measures those costs verifiably. The goal is to ensure incentive compatibility for clients, so that manipulative bidding does not pay off.

  2. Data: The experiments simulate a heterogeneous agent population composed of LLaMA-3-7B, Qwen-4B, and Qwen-8B models. Evaluation tasks include CoQA (multi-turn dialogue), QuAC (long-context QA), and HotpotQA (complex reasoning). Simulations leverage vLLM inference instrumented with realistic GPU-based constraints (batch size 12, GPU memory usage limit 0.6) to reflect computational restrictions common in LLM agent ecosystems.

  3. Architecture/Algorithm: IEMAS comprises three key components: (a) a resource-aware predictive model layer that uses Hoeffding Tree Regression and Classification to estimate latency, cost, and quality for each client-agent-task tuple under uncertain conditions, explicitly accounting for KV cache affinity through longest common prefix computations; (b) a VCG-auction-based bipartite matching mechanism formulated as a Min-Cost Max-Flow (MCMF) problem which finds welfare-maximizing many-to-many routing assignments between clients and agents subject to capacity constraints; (c) a proxy-hub distributed architecture that clusters agents in capability-based groups and localizes auctions to reduce communication overhead and maintain scalability.

  4. Training Regime: Predictors are seeded through brief warm-up phases involving representative multi-turn dialogues to reduce cold-start bias. During the main simulation runs, the proxy collects online latency, cost, and performance signals feeding back into the predictors for continuous learning. Hyperparameters for Hoeffding Trees are set to enable online incremental updates suitable for streaming environments.

  5. Evaluation Protocol: Metrics include KV cache hit rate, median latency, service cost, and predictive accuracy (NMAE). IEMAS is compared against five baselines: four learned LLM routers (GraphRouter, GMTRouter, MFRouter, RouterDC) and random routing. Economic evaluations simulate client bidding behaviors (honest, aggressive, conservative, random) across 100 auction rounds to verify incentive compatibility and social welfare outcomes. Experiments cover both average-case and tail latency.

  6. Reproducibility: The authors publish related code, run their simulations on the vLLM platform with common LLM models, and provide detailed algorithmic descriptions. However, some inputs (e.g., proprietary agent capability profiles) and large-scale deployment data remain simulated, which limits exact reproduction of full-scale agentic web conditions.
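
The cache-affinity feature described in step 3 can be illustrated with a minimal sketch. It assumes prompts are already tokenized into integer IDs and that each agent exposes the token sequences it currently holds in its KV cache; the function names (`lcp_len`, `affinity`) and the normalization by query length are illustrative assumptions, not the paper's definitions.

```python
from typing import Dict, List, Sequence


def lcp_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def affinity(query: Sequence[int], cached: List[Sequence[int]]) -> float:
    """Fraction of the query covered by the best-matching cached prefix.

    Assumption: dividing by the query length keeps the score in [0, 1].
    """
    if not query:
        return 0.0
    best = max((lcp_len(query, c) for c in cached), default=0)
    return best / len(query)


# Hypothetical cache contents per agent (token IDs are made up).
agent_caches: Dict[str, List[Sequence[int]]] = {
    "agent_a": [[1, 2, 3, 4], [9, 9]],
    "agent_b": [[1, 2, 7]],
    "agent_c": [],
}
query = [1, 2, 3, 4, 5, 6]

scores = {name: affinity(query, cache) for name, cache in agent_caches.items()}
best_agent = max(scores, key=scores.get)
print(scores, best_agent)
```

In this toy case the agent that already holds the first four tokens of the query scores highest, which is the signal a predictor could use to expect a cheaper, faster prefill on that agent.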

Example end-to-end: For a batch of concurrent client requests on the CoQA dataset, each query is first assigned to an appropriate proxy hub. The proxy computes affinity scores (longest-common-prefix overlaps) between each query and the cached states of each agent. These features are fed into the Hoeffding Tree predictors to yield latency, cost, and quality estimates. The proxy then constructs a weighted bipartite flow network from these estimates, solves the MCMF problem to obtain welfare-maximizing task-agent matches, computes VCG payments that make truthful client valuations the best strategy, and dispatches queries to the assigned agents, preserving cache locality and minimizing redundant computation. Observed latency and cost are fed back to update the predictors online for future rounds.
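
The matching step in this walkthrough can be sketched as a small min-cost max-flow problem. The code below is an illustration under stated assumptions, not the authors' implementation: each client submits one request, net welfare per (client, agent) pair is given as an integer, and a zero-cost bypass edge lets a client stay unserved when no assignment is worthwhile. The class `MinCostFlow`, the function `route`, and all numbers are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple


class MinCostFlow:
    """Successive shortest paths with a queue-based Bellman-Ford search."""

    def __init__(self, n: int) -> None:
        self.n = n
        # Each edge is [to, residual_capacity, cost, index_of_reverse_edge].
        self.graph: List[List[List[int]]] = [[] for _ in range(n)]

    def add_edge(self, u: int, v: int, cap: int, cost: int) -> None:
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s: int, t: int) -> Tuple[int, int]:
        inf = float("inf")
        total_flow = total_cost = 0
        while True:
            dist = [inf] * self.n
            prev = [(-1, -1)] * self.n
            queued = [False] * self.n
            dist[s] = 0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                queued[u] = False
                for i, (v, cap, cost, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
                        if not queued[v]:
                            queued[v] = True
                            queue.append(v)
            if dist[t] == inf:  # no augmenting path left
                return total_flow, total_cost
            push, v = inf, t
            while v != s:  # find the bottleneck capacity on the path
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            v = t
            while v != s:  # apply the augmentation
                u, i = prev[v]
                edge = self.graph[u][i]
                edge[1] -= push
                self.graph[v][edge[3]][1] += push
                v = u
            total_flow += push
            total_cost += push * dist[t]


def route(welfare: Dict[Tuple[str, str], int], capacity: Dict[str, int]):
    """Return (client -> agent match, total welfare) for one auction round."""
    clients = sorted({c for c, _ in welfare})
    agents = sorted(capacity)
    s, t = 0, 1
    cid = {c: 2 + i for i, c in enumerate(clients)}
    aid = {a: 2 + len(clients) + j for j, a in enumerate(agents)}
    net = MinCostFlow(2 + len(clients) + len(agents))
    for c in clients:
        net.add_edge(s, cid[c], 1, 0)
        net.add_edge(cid[c], t, 1, 0)  # bypass: unserved at zero welfare
    for (c, a), w in welfare.items():
        net.add_edge(cid[c], aid[a], 1, -w)  # maximize welfare = minimize -w
    for a in agents:
        net.add_edge(aid[a], t, capacity[a], 0)
    _, cost = net.solve(s, t)
    agent_of = {v: a for a, v in aid.items()}
    match = {c: agent_of[v] for c in clients
             for v, cap, _, _ in net.graph[cid[c]] if v in agent_of and cap == 0}
    return match, -cost


# Hypothetical net welfare (predicted value minus predicted cost), integer units.
welfare = {("c1", "a1"): 9, ("c1", "a2"): 6, ("c2", "a1"): 8,
           ("c2", "a2"): 3, ("c3", "a1"): 4, ("c3", "a2"): -2}
capacity = {"a1": 1, "a2": 1}
match, total_welfare = route(welfare, capacity)
print(match, total_welfare)
```

Negating welfare turns maximization into a min-cost problem, and the bypass edges guarantee that the maximum flow always equals the number of clients, so the solver never forces a negative-welfare assignment just to increase flow.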

Technical innovations

  • Introduction of a cache-affinity scoring mechanism based on longest common prefix overlap to explicitly preserve KV cache locality in open LLM agent routing.
  • Integration of a probabilistic QoS predictive model using Hoeffding Tree regressors/classifiers to estimate latency, cost, and quality under stochastic LLM inference conditions.
  • Formulation of the client-agent routing problem as a Min-Cost Max-Flow bipartite b-matching optimized with VCG payments to achieve incentive-compatible, welfare-maximizing allocations.
  • Design of a proxy hub architecture that clusters agents and localizes auctions, enabling scalable distributed incentive-efficient scheduling with privacy preservation.
  • Demonstration of weak budget balance and exact allocative efficiency combining MCMF with VCG in heterogeneous, capacity-constrained LLM ecosystems.
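
To make the VCG payment rule and the weak budget balance property concrete, here is a toy sketch using Clarke pivot payments. It deliberately swaps the MCMF solver for a brute-force search over allocations so the block stays short; the function names, valuations, and the flat service cost of 3 are all hypothetical, and each client is assumed to submit a single request.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Table = Dict[Tuple[str, str], int]


def best_allocation(clients: List[str], agents: List[str], value: Table,
                    cost: Table, capacity: Dict[str, int]):
    """Brute-force welfare maximization (stand-in for the MCMF solver)."""
    best_w = 0
    best_alloc: Dict[str, Optional[str]] = {c: None for c in clients}
    for choice in product([None, *agents], repeat=len(clients)):
        if any(choice.count(a) > capacity[a] for a in agents):
            continue  # violates an agent's capacity
        w = sum(value[c, a] - cost[c, a]
                for c, a in zip(clients, choice) if a is not None)
        if w > best_w:
            best_w, best_alloc = w, dict(zip(clients, choice))
    return best_w, best_alloc


def vcg_payments(clients, agents, value, cost, capacity):
    """Clarke pivot: a client pays the welfare loss it imposes on the rest."""
    w_all, alloc = best_allocation(clients, agents, value, cost, capacity)
    pay = {}
    for c in clients:
        a = alloc[c]
        if a is None:
            pay[c] = 0
            continue
        others = [x for x in clients if x != c]
        w_without, _ = best_allocation(others, agents, value, cost, capacity)
        pay[c] = w_without - (w_all - value[c, a])
    return alloc, pay


clients, agents = ["c1", "c2", "c3"], ["a1", "a2"]
capacity = {"a1": 1, "a2": 1}
value = {("c1", "a1"): 12, ("c1", "a2"): 9, ("c2", "a1"): 11,
         ("c2", "a2"): 6, ("c3", "a1"): 7, ("c3", "a2"): 1}
cost = {key: 3 for key in value}  # assumed flat service cost

alloc, pay = vcg_payments(clients, agents, value, cost, capacity)
served_cost = sum(cost[c, a] for c, a in alloc.items() if a is not None)
print(alloc, pay, sum(pay.values()) >= served_cost)

# Aggressive misreport: c3 claims a value of 20 for a1 (true value is 7).
lie = dict(value)
lie["c3", "a1"] = 20
alloc_lie, pay_lie = vcg_payments(clients, agents, lie, cost, capacity)
utility_lie = value["c3", "a1"] - pay_lie["c3"]
print(alloc_lie["c3"], pay_lie["c3"], utility_lie)
```

In this instance total payments exceed the cost of the served requests, and the overbidding client wins the slot but ends up with negative utility, which matches the qualitative behavior the summary attributes to the mechanism.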

Datasets

  • CoQA — ~8,000+ conversational QA dialogues — public
  • QuAC — ~14,000 QA dialogues with long context — public
  • HotpotQA — ~113,000 multi-hop reasoning QA pairs — public

Baselines vs proposed

  • GraphRouter: CoQA cost = 10.655 vs IEMAS = 6.944
  • GMTRouter: CoQA KV cache hit rate = 53.1% vs IEMAS = 80.2%
  • MFRouter: QuAC latency = 306.2ms vs IEMAS = 162.1ms
  • RouterDC: HotpotQA latency = 2139.8ms vs IEMAS = 284.2ms
  • Random routing: CoQA cost = 13.65 vs IEMAS = 6.944
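
The relative improvements implied by these figures can be recomputed directly. The short script below uses only the numbers listed above; the helper names are illustrative.

```python
def reduction_pct(baseline: float, proposed: float) -> float:
    """Percentage reduction of `proposed` relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


def speedup(baseline: float, proposed: float) -> float:
    """How many times smaller `proposed` is than `baseline`."""
    return baseline / proposed


# (baseline name, metric, baseline value, IEMAS value), as listed above.
rows = [
    ("GraphRouter", "CoQA cost", 10.655, 6.944),
    ("MFRouter", "QuAC latency (ms)", 306.2, 162.1),
    ("RouterDC", "HotpotQA latency (ms)", 2139.8, 284.2),
    ("Random routing", "CoQA cost", 13.65, 6.944),
]

results = {
    (name, metric): (reduction_pct(b, p), speedup(b, p))
    for name, metric, b, p in rows
}
for (name, metric), (pct, factor) in results.items():
    print(f"{name:15s} {metric:22s} -{pct:5.1f}%  ({factor:.2f}x)")

# KV cache hit rate is a percentage, so report the gap in percentage points.
hit_rate_gain_points = 80.2 - 53.1
print(f"GMTRouter       CoQA hit rate          +{hit_rate_gain_points:.1f} points")
```

The cost reductions against GraphRouter and MFRouter (10.507 to 6.944) both land near the 35% headline figure, while the HotpotQA latency gap against RouterDC works out to about 7.5×.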

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2603.17302.

Fig 1: The Illustration of Agentic Web Routing.

Fig 2: IEMAS Overview. (a) Coarse-Grained Clustering: Incoming web queries are first allocated to specific Agent Hubs via a fast,

Fig 3 to Fig 8 (page 4).

Limitations

  • The agent service costs are assumed verifiable and honest; strategic behavior of agents themselves is not explored.
  • Experimental evaluation is conducted via simulation rather than deployment over real-world open multi-agent networks.
  • IEMAS relies on relatively stable agent clusters and does not evaluate highly dynamic agent population changes or adversarial collusion.
  • Predictive models focus on latency, cost, and accuracy but do not model other QoS aspects such as reliability or fairness explicitly.
  • Budget balance is only weak (no deficit, but payments need not equal costs); the classic impossibility theorem implies that full incentive compatibility, individual rationality, and exact budget balance cannot all hold simultaneously in every scenario.
  • Evaluation datasets do not cover all possible LLM application domains, limiting generalizability to certain inference workloads.

Open questions / follow-ons

  • How to extend incentive compatibility guarantees to strategic agent providers who may misreport capabilities or abort tasks.
  • Adapting IEMAS to highly dynamic agent ecosystems with churn, unreliable participants, or collusive behaviors.
  • Incorporating richer QoS metrics like reliability, fairness, and privacy compliance into the incentive-efficiency co-design.
  • Exploring theoretical limits of budget balance and incentive rationality trade-offs in multi-agent LLM markets with heterogeneous resources.

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, IEMAS illustrates a rigorous incentive-aware scheduling approach in decentralized multi-agent LLM ecosystems that jointly optimizes service quality and cost-efficiency while handling non-cooperative participants. Although the focus is on agentic web inference routing rather than direct bot detection, principles such as truthful capability revelation, leveraging local state (cache affinity) for routing efficiency, and distributed auction-based matching can inspire secure coordination among distributed challenge providers or adaptive CAPTCHA delivery infrastructure.

By explicitly modeling stochastic latency and cost with economic incentives, the approach offers guidance on mitigating misuse or abuse risks posed by strategically-acting clients in open AI service platforms. The proxy-hub architecture further highlights design patterns for privacy-preserving decentralized control without centralized command, a key aspect for large-scale CAPTCHA deployment sensitive to adversarial probing or resource exhaustion. Thus, bot-defense engineers can adapt ideas from IEMAS's mechanism design and resource-aware prediction to strengthen adaptive puzzle routing and verification systems that must balance user experience with cost under adversarial conditions.

Cite

bibtex
@article{arxiv2603_17302,
  title={IEMAS: An Incentive-Efficiency Routing Framework for Open Agentic Web Ecosystems},
  author={Hongze Liu and Chang Guo and Yingzeng Li and Mengru Wang and Jiong Lou and Shijing Yuan and Hefeng Zhou and Chentao Wu and Jie LI},
  journal={arXiv preprint arXiv:2603.17302},
  year={2026},
  url={https://arxiv.org/abs/2603.17302}
}

Articles are CC BY 4.0 — feel free to quote with attribution