Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

Source: arXiv:2605.14241 · Published 2026-05-14 · By Kexin Chu, Dawei Xiang, Wei Zhang

TL;DR

This paper addresses the runtime provider-routing problem faced by tool-augmented LLM agents. When multiple functionally equivalent providers (e.g., web-search APIs, retrievers, LLM backends) expose the same interface, a gateway/router must choose which provider to call under fluctuating load and varying provider quality. The challenge is that providers differ in latency, reliability, and answer quality, and online gold labels for answer correctness are generally not available at deployment time. Prior approaches used additive reward models combining latency and quality, but these can misrank providers by allowing low-latency but poor-quality providers to dominate.

The authors propose LQM-ContextRoute, a novel contextual bandit routing algorithm that ranks providers by expected answer quality per service cycle, treating latency as a capacity cost rather than an additive penalty. They integrate query-specific quality estimates learned by a LinUCB model with an LLM judge providing proxy feedback on answer quality. This allows the router to dynamically adapt to changing load and heterogeneous provider quality without ground-truth labels at runtime. Empirically, LQM-ContextRoute outperforms strong baselines (including SW-UCB and ContextRoute) across multiple same-function pools (web search, StrategyQA, retriever pools) under realistic load patterns, improving accuracy/F1 by up to +18 percentage points and NDCG by 2.9–3.2 points while reducing latency and SLA misses. The work shows the importance of latency-quality matching rather than additive latency penalties when provider quality is heterogeneous and load changes dynamically.

Key findings

LQM-ContextRoute improves F1 by +2.18 percentage points over SW-UCB on the main web-search load benchmark while maintaining latency-quality Pareto efficiency.
In high-heterogeneity StrategyQA routing, LQM-ContextRoute increases accuracy by up to +18 pp versus SW-UCB, avoiding additive-reward collapse favoring low-quality fast providers.
On heterogeneous retriever pools (SciFact and NFCorpus), LQM-ContextRoute improves NDCG by +2.91 to +3.22 pp compared to SW-UCB while maintaining SLA > 95%.
LQM-ContextRoute cuts latency by 50–67% relative to static single-provider routing across the STEP, ROTATION, and GRADUAL load patterns.
Switching from an additive latency-quality reward to the renewal-reward rate, quality / (1 + latency / L_ref), improves the correctness of provider ranking when latency and quality are misaligned (Theorem 2).
LQM-ContextRoute uses an LLM judge (LLAMA-3.2-1B) for online quality feedback, validated to recover 88–98% of the accuracy achieved with a 30B oracle scorer.
Adding query-contextual LinUCB quality arms and latency estimation is key for adaptation; ablations confirm that the latency-quality matching score contributes the largest share of the gains.
LQM-ContextRoute performs neutrally in stable-dominant provider pools (e.g., ReAct loop), showing it does not degrade performance when heterogeneity is low.

Threat model

The adversary is the uncontrolled runtime environment and workload variability affecting provider latency, availability, and quality. No attacker is assumed to manipulate responses maliciously, but the router must handle incomplete feedback and non-stationary load, either of which can degrade quality when a query is misrouted. The router cannot query all providers simultaneously and receives latency and quality feedback only for the provider it chose, so bandit feedback and the absence of gold labels are the core constraints.

Methodology — deep read

Threat Model and Assumptions: The adversary is implicitly the environment, which induces runtime load and dynamic provider performance; no malicious tampering is assumed. Providers are functionally equivalent through a shared interface but differ in latency, availability, and answer quality. The router lacks online gold labels and receives only bandit feedback: a latency and quality signal for the chosen provider on each query. The goal is to maximize quality under an operator-specified latency budget (service-time SLA). Load is non-stationary and provider quality may vary by query.
Data: The main benchmark is a curated set of 200 questions (100 HotpotQA + 100 TriviaQA) covering web search queries. For each query-provider pair, the provider is executed offline, and a Qwen3-30B reader extracts an answer that is scored by F1 to build a provider-response score matrix. Latency data is drawn from measured service profiles of three search APIs (Tavily, Brave, DuckDuckGo) covering warm, loaded, and overloaded states across multiple load patterns (STEP, ROTATION, SPIKE, GRADUAL). Additional datasets include StrategyQA for high heterogeneity, two retriever pools (SciFact and NFCorpus) for retrieval routing, and an LLM-backend ladder.
Architecture / Algorithm: LQM-ContextRoute is a contextual bandit model with one LinUCB arm per provider. For each provider i, it maintains:

A linear contextual quality estimator u_i(x) = x^T A_i^{-1} b_i predicting answer quality given query embedding x.
An exponential moving average (EMA) estimate of provider latency τ_i.
An uncertainty estimate (LinUCB exploration bonus) c_i(x).

The key innovation is the provider scoring function, which combines expected quality per service cycle with a deflated exploration bonus:

score_i(x) = u_i(x) / (1 + τ_i / L_ref) + α_ucb * c_i(x) / (1 + λ * max(0, max_j u_j(x) - u_i(x)))

The first term treats latency as a capacity cost (a denominator) rather than an additive penalty. The second term is the exploration bonus, deflated asymmetrically so that arms estimated to be worse than the best arm for the current context are explored less. L_ref is a latency budget parameter aligned with the SLA.
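The scoring rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lqm_scores`, the argument layout, and the default values for `alpha_ucb` and `lam` are hypothetical, and the quality, latency, and uncertainty estimates are assumed to be computed elsewhere.

```python
from typing import List, Sequence


def lqm_scores(
    u: Sequence[float],      # estimated quality u_i(x) per provider
    tau: Sequence[float],    # EMA latency tau_i per provider (same unit as l_ref)
    c: Sequence[float],      # LinUCB uncertainty c_i(x) per provider
    l_ref: float,            # latency budget L_ref
    alpha_ucb: float = 1.0,  # exploration weight (assumed default)
    lam: float = 1.0,        # deflation strength lambda (assumed default)
) -> List[float]:
    """Return quality-per-service-cycle plus a deflated exploration bonus."""
    best_u = max(u)
    scores = []
    for u_i, tau_i, c_i in zip(u, tau, c):
        rate = u_i / (1.0 + tau_i / l_ref)           # latency as capacity cost
        gap = max(0.0, best_u - u_i)                 # shortfall vs. best estimate
        bonus = alpha_ucb * c_i / (1.0 + lam * gap)  # less bonus for dominated arms
        scores.append(rate + bonus)
    return scores


if __name__ == "__main__":
    # Two illustrative providers: fast but weak, slow but strong.
    print(lqm_scores(u=[0.3, 0.8], tau=[0.2, 1.2], c=[0.5, 0.5], l_ref=1.5))
```

Because the gap is measured against the best current estimate, the leading arm always keeps its full bonus while trailing arms have theirs shrunk, which is the asymmetry described above.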

Training Regime: Routing runs online for T=200 rounds (one query per round), repeated over 50 different random seeds. Each round:

Query q_t is embedded into a vector x_t.
The score score_i(x_t) is computed for every provider.
The provider i_t with the highest score is selected.
Latency τ_{i_t} and judge quality feedback u_{i_t} ∈ [0,1] are observed.
The latency EMA and the LinUCB parameters A_i, b_i are updated for the chosen provider only.

Unchosen arms receive no update because only bandit feedback is available. The LLM judge, LLAMA-3.2-1B, serves as the proxy reward.
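The per-round procedure can be sketched as follows. This is an illustrative sketch under several assumptions rather than the paper's Algorithm 1: the class and function names are hypothetical, the bonus uses the standard LinUCB width sqrt(x^T A^{-1} x), A is initialized to the identity, the EMA smoothing factor and starting latency are made-up values, and the provider pool in `demo` is simulated with fixed latency and quality.

```python
import math
from typing import Callable, List, Sequence, Tuple


class Arm:
    """Per-provider state: LinUCB statistics plus an EMA of observed latency."""

    def __init__(self, dim: int, init_latency: float = 0.5, ema_beta: float = 0.2):
        # Store A^{-1} directly; A starts as the identity matrix.
        self.a_inv = [[float(r == c) for c in range(dim)] for r in range(dim)]
        self.b = [0.0] * dim
        self.tau = init_latency   # assumed starting latency estimate (seconds)
        self.ema_beta = ema_beta  # assumed EMA smoothing factor

    def _a_inv_times(self, x: Sequence[float]) -> List[float]:
        return [sum(a * xk for a, xk in zip(row, x)) for row in self.a_inv]

    def quality(self, x: Sequence[float]) -> float:
        # u(x) = x^T A^{-1} b, using the symmetry of A^{-1}.
        return sum(v * bk for v, bk in zip(self._a_inv_times(x), self.b))

    def bonus(self, x: Sequence[float]) -> float:
        # Standard LinUCB width c(x) = sqrt(x^T A^{-1} x).
        return math.sqrt(max(0.0, sum(v * xk for v, xk in zip(self._a_inv_times(x), x))))

    def update(self, x: Sequence[float], reward: float, latency: float) -> None:
        # A += x x^T, applied to A^{-1} via the Sherman-Morrison identity.
        v = self._a_inv_times(x)
        denom = 1.0 + sum(vk * xk for vk, xk in zip(v, x))
        for r in range(len(v)):
            for c in range(len(v)):
                self.a_inv[r][c] -= v[r] * v[c] / denom
        self.b = [bk + reward * xk for bk, xk in zip(self.b, x)]
        self.tau = (1.0 - self.ema_beta) * self.tau + self.ema_beta * latency


def route_once(
    arms: List[Arm],
    x: Sequence[float],
    call_provider: Callable[[int, Sequence[float]], Tuple[float, float]],
    l_ref: float = 1.5,
    alpha_ucb: float = 1.0,
    lam: float = 1.0,
) -> int:
    """Score all arms, call the best one, and update only that arm."""
    u = [arm.quality(x) for arm in arms]
    best_u = max(u)
    scores = [
        u_i / (1.0 + arm.tau / l_ref)
        + alpha_ucb * arm.bonus(x) / (1.0 + lam * max(0.0, best_u - u_i))
        for arm, u_i in zip(arms, u)
    ]
    chosen = max(range(len(arms)), key=scores.__getitem__)
    latency, judged_quality = call_provider(chosen, x)  # bandit feedback only
    arms[chosen].update(x, judged_quality, latency)
    return chosen


def demo(rounds: int = 200) -> List[int]:
    # Simulated pool: provider 0 is fast but weak, provider 1 is slower but strong.
    profiles = [(0.2, 0.2), (0.6, 0.9)]  # (latency in seconds, judged quality)
    arms = [Arm(dim=1), Arm(dim=1)]
    x = [1.0]  # constant context standing in for a query embedding
    return [route_once(arms, x, lambda i, _x: profiles[i]) for _ in range(rounds)]


if __name__ == "__main__":
    picks = demo()
    print("calls per provider:", [picks.count(i) for i in range(2)])
```

Keeping A^{-1} up to date with a rank-one update avoids a matrix inversion on every round. In a real deployment `call_provider` would wrap the provider call plus the judge, and `x` would be a query embedding.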

Evaluation Protocol: Metrics include answer quality (F1, accuracy, NDCG), mean latency, and the fraction of calls that meet a 1.5 s SLA threshold. Baselines include static routing (always pick one provider), round-robin, reactive cooldown routing, EMA-Greedy, SW-UCB, and a prior contextual bandit router (ContextRoute). Ablations test scoring-only and context-only submodules. Multiple non-stationary load patterns test adaptation to dynamic load.
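The latency-side metrics are simple to compute from a log of per-call latencies. The sketch below is an assumption-laden illustration: the function name and dictionary keys are hypothetical, and it assumes a call meets the SLA when its latency is less than or equal to the 1.5 s threshold, a boundary convention this summary does not specify.

```python
from typing import Dict, Sequence


def latency_summary(latencies_s: Sequence[float], sla_s: float = 1.5) -> Dict[str, float]:
    """Mean latency and the fraction of calls inside the SLA threshold."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    n = len(latencies_s)
    # Assumption: a call exactly at the threshold counts as meeting the SLA.
    within = sum(1 for value in latencies_s if value <= sla_s) / n
    return {
        "mean_latency_s": sum(latencies_s) / n,
        "sla_fraction": within,
        "sla_miss_fraction": 1.0 - within,
    }


if __name__ == "__main__":
    print(latency_summary([0.4, 0.8, 1.6, 2.0]))
```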

Significance tests and detailed ablations isolate the contributions of latency-quality matching (versus additive rewards), exploration-bonus modulation, and query context. Slices by provider heterogeneity show when the improvements emerge.

Reproducibility: Code and detailed experimental conditions are not explicitly released. The main benchmark provider-response data uses offline Qwen3-30B reader answers, and latency profiles come from proprietary measurements. The judge model is publicly known (LLAMA-3.2-1B). Random seeds and hyperparameters (e.g., λ=1, α_ucb) are specified. The algorithm pseudocode (Algorithm 1) is provided. Some details about latency calibration and quality proxies are in appendices but not fully open-sourced.

Technical innovations

Formulation of same-function provider routing under runtime load as a constrained problem that maximizes expected quality under a latency budget, rather than as an additive latency-quality reward.
Introduction of a latency-quality matching score based on the renewal-reward rate u_i / (1 + τ_i / L_ref), treating latency as a service-cycle cost.
Integration of query-specific quality estimation with LinUCB contextual bandits combined with LLM-as-judge quality feedback providing proxy rewards without gold labels.
Exploration bonus deflation mechanism to reduce exploration on providers predicted to be dominated for the current query context.
Empirical demonstration that additive latency penalties cause routing failures favoring low-quality fast providers, avoided by the renewal-rate objective.
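A small numeric example helps show why the two objectives can rank providers differently. Everything here is illustrative: the additive form (quality minus weighted normalized latency), the weight of 1.0, and the two provider profiles are assumptions made for this sketch, not the baseline reward or data from the paper.

```python
from typing import Callable, Dict, Tuple

L_REF = 1.5  # latency budget in seconds (assumed for this example)


def additive_reward(quality: float, latency: float, weight: float = 1.0) -> float:
    # Hypothetical additive trade-off: latency is subtracted as a penalty.
    return quality - weight * latency / L_REF


def renewal_rate(quality: float, latency: float) -> float:
    # Renewal-reward rate: latency enters as a capacity cost in the denominator.
    return quality / (1.0 + latency / L_REF)


def best_provider(
    providers: Dict[str, Tuple[float, float]],
    score: Callable[[float, float], float],
) -> str:
    return max(providers, key=lambda name: score(*providers[name]))


# (quality, latency in seconds); made-up profiles for illustration only.
PROVIDERS = {
    "fast_low_quality": (0.3, 0.2),
    "slow_high_quality": (0.8, 1.2),
}

if __name__ == "__main__":
    print("additive picks:", best_provider(PROVIDERS, additive_reward))
    print("rate picks:    ", best_provider(PROVIDERS, renewal_rate))
```

With these numbers the additive reward prefers the fast provider (about 0.17 versus about 0.0), while the rate prefers the higher-quality one (about 0.44 versus about 0.26). The example shows that the rankings can diverge, not that the rate always favors the higher-quality provider.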

Datasets

WebSearch Benchmark — 200 queries (100 HotpotQA + 100 TriviaQA) — internal with Qwen3-30B reader scoring
StrategyQA — unspecified size, high heterogeneity question answering benchmark — public (Geva et al., 2021)
SciFact Retriever Pool — 300 test claims — from BEIR (Thakur et al., 2021)
NFCorpus Retriever Pool — unspecified queries — internal corpus with 3 retrievers
LLM Provider Ladder — unspecified size — internal
Live Search Provider Latency Profile — 270 calls from Tavily, Brave, DuckDuckGo — internal measurements

Baselines vs proposed

SW-UCB: main web-search F1 = 0.581 vs LQM-ContextRoute F1 = 0.603 (+2.18 pp)
SW-UCB: StrategyQA accuracy = 0.413 vs LQM-ContextRoute = 0.601 (+18 pp)
SW-UCB: SciFact retriever NDCG = 0.672 vs LQM-ContextRoute = 0.717 (+2.91 pp)
SW-UCB: NFCorpus retriever NDCG = 0.294 vs LQM-ContextRoute = 0.326 (+3.22 pp)
Static-T1 (always Tavily): latency 783 ms vs LQM-ContextRoute 386 ms on STEP load, with an F1 improvement of 0.036
ContextRoute: main F1 = 0.598 vs LQM-ContextRoute = 0.603 (+0.85 pp)
LQM-ONLY ablation: StrategyQA accuracy = 0.547 vs LQM-ContextRoute = 0.601 (+5.4 pp)

Limitations

Main benchmark relies on pre-recorded provider-response scores, so does not simulate full live traffic or reflect agent planning complexity.
Non-search evaluation datasets and LLM-provider pools are smaller scale and may not represent production heterogeneity breadth.
Assumes a same-function provider pool with identical interfaces; does not address multi-tool selection or schema mismatches.
Online quality feedback depends on a reliable LLM judge proxy which may not generalize or be perfectly calibrated.
No explicit adversarial evaluation or robustness to provider faults beyond load simulations.
No open-source code or weight release; some latency profiles and quality labels come from private/internal data.

Open questions / follow-ons

How would the router perform and adapt under malicious or adversarial provider manipulations rather than just runtime load variations?
Can improved or multi-modal online quality estimation methods reduce reliance on proxy LLM judges and further improve routing?
How might LQM-ContextRoute integrate with upstream multi-tool selection or pipeline planning rather than routing within one tool type?
What are the performance and fairness implications when scaling provider pools beyond a few options, or under stricter latency SLAs?

Why it matters for bot defense

Bot-defense and CAPTCHA systems increasingly rely on integrating multiple third-party services for challenge generation, verification, and user interaction via shared interfaces. This paper's methodology for intelligent provider routing under fluctuating load and heterogeneous quality is directly relevant to maintaining service reliability and user experience. By treating latency as a capacity cost and combining query-specific quality estimation with online proxy feedback, bot-defense gateways can better balance rapid response times against robustness and correctness of challenge generation or verification.

The insights around the failure modes of additive latency-quality trade-offs highlight risks in naively choosing faster but lower-quality CAPTCHA challenge generators or verification backends. Implementing latency-quality matching routing mechanisms can enhance overall system throughput while preserving security guarantees, especially under variable load or degraded provider conditions. For CAPTCHA engineers, LQM-ContextRoute’s framework offers a principled approach to adaptively optimize multi-provider use without requiring costly ground-truth labels during deployment.

Cite

@article{arxiv2605_14241,
  title={Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents},
  author={Kexin Chu and Dawei Xiang and Wei Zhang},
  journal={arXiv preprint arXiv:2605.14241},
  year={2026},
  url={https://arxiv.org/abs/2605.14241}
}
