Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services

Source: arXiv:2606.19992 · Published 2026-06-18 · By Mugeng Liu, Shuoqi Li, Yixuan Zhang, Yun Ma

TL;DR

This paper addresses a fundamental inefficiency in how large language model (LLM) agents interact with web services to execute multi-step, procedural workflows. Current agentic web services expose static, single-shot endpoints that force agents to interleave network calls with reasoning and decision-making at every step, resulting in high latency, excessive client-server traffic, brittle failure recovery, and inefficient data fetching. The authors propose a new abstraction called tool programs, which encode an entire multi-step service interaction, including control flow constructs like loops and conditionals, as an executable program with explicitly annotated effect types (READ vs WRITE). This program is then executed atomically by a trusted service-side runtime that can optimize execution, enforce exactly-once semantics for state-changing calls, and safely retry on partial failure.

The TOOLPRO system implements this idea in practice by combining constraint-guided program construction to ensure LLM-generated programs compile and run correctly, effect-aware replay to avoid duplicating write side effects during repair-driven retries, and a profile-driven policy to adaptively decide when executing a consolidated program is more efficient than repeated stepwise calls. Evaluated on multiple real-world workloads spanning read-only and read-write scenarios from open-source MCP-style applications, TOOLPRO demonstrates latency reductions up to 53.4% and client-side traffic reductions up to 96.1%, with larger gains for workflows with higher network latency and complexity. These results validate that encapsulating multi-step agent intent as executable tool programs unlocks superior end-to-end performance, reliability, and scalability for flexible agent orchestration of web services.

Key findings

TOOLPRO reduces end-to-end latency by up to 53.4% compared to the stepwise MCP Web Service (MWS) interface across real workflows (Section 4.3, Fig 3).
Client-side traffic is cut by up to 96.1% primarily by reducing repeated client-to-LLM exchanges (Section 4.3, Fig 4).
TOOLPRO improves task accuracy on complex realistic workflows from 0.60 to 0.93 on cbench1–3 and from 0.20 to 0.80 on the cross-service benchmark cbench4 (Section 4.3).
Disabling effect-aware replay raises average latency by 19.7% and increases fallback frequency from 0/15 to 3/15 runs in retry scenarios (Section 4.3).
Under higher network latency (RTT 100ms to 200ms+), TOOLPRO’s latency improvements over MWS widen, justifying the consolidation policy’s adaptivity (Section 4.4, Fig 5).
Under low-latency conditions (RTT < 100ms), the TOOLPRO policy selects stepwise mode more often, which avoids program build costs and keeps performance robust (Section 4.4).
TOOLPRO’s profile-driven consolidation policy accurately predicts when tool program execution is more efficient than stepwise calling based on latency components and workflow length N (Section 3.4).
LLM coding success rates vary by model: qwen3-coder-flash achieves 80% while GPT-5.1 and Gemini-3-flash-preview reach 100% success with no fallback (Section 4.3).

Threat model

The adversary model is implicit: the system assumes an honest-but-faulty software environment where LLM-generated tool programs may contain errors (syntactic, semantic, or runtime failures) but are not malicious. The runtime must prevent duplicated or inconsistent state modifications from partial failures and retries, enforcing exactly-once semantics for state-changing calls. The execution is sandboxed (WebAssembly) to prevent unsafe operations beyond tool interface calls. There is no explicit adversary attempting to subvert the tool program interface or escalate privileges.

Methodology — deep read

Threat Model & Assumptions: The system assumes an LLM-based client agent that orchestrates multi-step workflows by invoking a set of static endpoints exposed by web services. The adversary is not explicitly modeled in this work, but safety is needed to prevent duplicated side effects from retries under partial failure, and execution must be sandboxed with constrained program interfaces. The underlying service may fail or return errors. The paper assumes each endpoint has a known effect annotation (READ or WRITE) indicating whether it modifies state.

Data: The paper is not data-driven in the conventional sense. Instead, the workload evaluation uses three open-source applications (Memos, Directus, MinIO) exposed as MCP-style web services with defined endpoints. The authors constructed procedural workflows that require loops, conditionals, and intermediate bindings, each parameterized by N, the number of endpoint calls. Both read-only and read-write variants were tested. Supplemental benchmarks stress nondeterministic retrieval, branching, coordinated writes, non-idempotent effects, and cross-service workflows.

Architecture / Algorithm: Tool programs are executable workflows submitted by the client representing multi-step endpoint interactions with control flow and explicit effect-type annotations (READ/WRITE) at each call site. TOOLPRO’s server-side runtime enforces a constrained program surface disallowing arbitrary code—for example, only structured control flow, no exceptions or unsafe operations, and all interactions go through a unified CALL(e,a,eff) interface. The runtime compiles and executes the program inside a WebAssembly sandbox.
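The unified call interface described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Runtime` class, the `list_memos` and `archive_memo` endpoint names, and the check that rejects a call whose annotation disagrees with the endpoint's declared effect are all assumptions made for illustration. Only the `CALL(e, a, eff)` shape and the READ/WRITE effect types come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, List, Tuple


class Effect(Enum):
    READ = "READ"
    WRITE = "WRITE"


@dataclass
class Endpoint:
    handler: Callable[[Dict[str, Any]], Any]
    effect: Effect  # effect declared by the service for this endpoint


class Runtime:
    """Hypothetical stand-in for the service-side runtime."""

    def __init__(self, endpoints: Dict[str, Endpoint]) -> None:
        self.endpoints = endpoints
        self.trace: List[Tuple[str, Effect]] = []

    def call(self, e: str, a: Dict[str, Any], eff: Effect) -> Any:
        # Mirrors CALL(e, a, eff): endpoint name, arguments, annotated effect.
        if e not in self.endpoints:
            raise KeyError(f"unknown endpoint: {e}")
        endpoint = self.endpoints[e]
        # Assumption: a call-site annotation must match the declared effect.
        if endpoint.effect is not eff:
            raise ValueError(f"{e} is {endpoint.effect.value}, annotated {eff.value}")
        self.trace.append((e, eff))
        return endpoint.handler(a)


def archive_tagged(rt: Runtime, tag: str) -> int:
    """A tool program: structured control flow, all I/O through rt.call."""
    memos = rt.call("list_memos", {}, Effect.READ)
    archived = 0
    for memo in memos:
        if tag in memo["tags"]:
            rt.call("archive_memo", {"id": memo["id"]}, Effect.WRITE)
            archived += 1
    return archived


def make_demo_runtime() -> Tuple[Runtime, List[Dict[str, Any]]]:
    """Build a runtime over an in-memory store (illustrative data only)."""
    store = [
        {"id": 1, "tags": ["todo"], "archived": False},
        {"id": 2, "tags": ["done"], "archived": False},
        {"id": 3, "tags": ["done", "work"], "archived": False},
    ]

    def list_memos(_args: Dict[str, Any]) -> List[Dict[str, Any]]:
        return [dict(m) for m in store]

    def archive_memo(args: Dict[str, Any]) -> bool:
        for m in store:
            if m["id"] == args["id"]:
                m["archived"] = True
                return True
        return False

    endpoints = {
        "list_memos": Endpoint(list_memos, Effect.READ),
        "archive_memo": Endpoint(archive_memo, Effect.WRITE),
    }
    return Runtime(endpoints), store
```

Calling `archive_tagged(rt, "done")` on the demo runtime runs the loop and the conditional in one place, with every endpoint interaction recorded in `rt.trace` alongside its effect type.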

Three key mechanisms enable practicality:

Constraint-Guided Construction: The client synthesizes candidate programs and applies lightweight checks (endpoint coverage, control-flow skeleton, value-flow sanity). The server applies a deterministic projection that rewrites the code to the constrained surface, then compiles and executes it. If a failure occurs, the server uses detailed compiler and runtime feedback to attempt bounded in-place repairs (e.g. fixing missing symbols or correcting control flow) up to an attempt budget. If repair fails, the system falls back to stepwise calling and reports diagnostics.
Effect-Aware Replay: To enforce exactly-once semantics for WRITE calls under repair and re-execution, the runtime intercepts every dynamic CALL. READ calls are always forwarded; WRITE calls that already completed are replayed from a per-intent-instance log of their outcomes rather than re-issued. Replay is enabled only if the new execution leaves the arguments and order of the committed WRITE prefix unchanged. If the new execution diverges from that prefix, the system falls back to stepwise execution.
Profile-Driven Consolidation Policy: A lightweight online policy predicts when to consolidate the workflow into a tool program based on runtime profiles of network RTT (T_RTT), client decision overhead (T_DEC), and program build cost (T_BUILD). Given the estimated number of endpoint calls N, a cost model determines whether the savings from fewer round trips outweigh the build overhead. Early bootstrap runs initialize the profiles. This adaptive policy enables robust performance across heterogeneous conditions.
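Of the three mechanisms above, effect-aware replay is the easiest to misread, so here is a minimal Python sketch of the idea. It assumes a simple in-memory log keyed by call order; the `ReplayLog`, `ReplayMismatch`, and `run_with_repair` names are hypothetical and the paper's actual data structures may differ. Repair is modeled as trying a list of successive program versions.

```python
from typing import Any, Callable, Dict, List, Tuple

Backend = Callable[[str, Dict[str, Any]], Any]
CallFn = Callable[[str, Dict[str, Any], str], Any]


class ReplayMismatch(Exception):
    """A re-execution diverged from the committed WRITE prefix."""


class ReplayLog:
    """Per-intent-instance log of completed WRITE calls (hypothetical)."""

    def __init__(self) -> None:
        self.committed: List[Tuple[str, Dict[str, Any], Any]] = []
        self.cursor = 0

    def begin_attempt(self) -> None:
        # Each re-execution walks the committed prefix from the start.
        self.cursor = 0

    def call(self, backend: Backend, e: str, a: Dict[str, Any], eff: str) -> Any:
        if eff == "READ":
            return backend(e, a)  # READ calls are always forwarded
        if self.cursor < len(self.committed):
            logged_e, logged_a, result = self.committed[self.cursor]
            if (logged_e, logged_a) != (e, a):
                raise ReplayMismatch(f"WRITE #{self.cursor} changed: {e} {a}")
            self.cursor += 1
            return result  # replay the logged outcome, no second side effect
        result = backend(e, a)  # first time this WRITE is reached
        self.committed.append((e, dict(a), result))
        self.cursor += 1
        return result


def run_with_repair(log: ReplayLog, backend: Backend,
                    attempts: List[Callable[[CallFn], Any]]) -> Any:
    """Try each program version in turn; stands in for the repair loop."""
    for program in attempts:
        log.begin_attempt()
        try:
            return program(lambda e, a, eff: log.call(backend, e, a, eff))
        except ReplayMismatch:
            raise  # caller should fall back to stepwise execution
        except Exception:
            continue  # treat as a repairable failure, try the next version
    raise RuntimeError("attempt budget exhausted; fall back to stepwise calling")
```

If a first program version performs a WRITE and then crashes, the repaired version sees the logged result for that WRITE instead of issuing it again. A repaired version that changes the WRITE's arguments raises `ReplayMismatch`, which corresponds to the fallback case described above.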

Training / Execution Regime: The paper does not train a model. LLMs generate programs on the client side from the intent or task plus the tool specifications. The runtime executes the resulting Wasm programs in a server-side sandbox. The repair loop is bounded and feedback-driven, with a fixed attempt budget.

Evaluation Protocol: Metrics are end-to-end latency (total time spanning client LLM inference, client-server network, server execution) and client-side traffic (bytes for client-server requests and client-LLM prompts). Baselines include fully stepwise MCP Web Service calling (MWS), TOOLPRO-step (stepwise with intent-structured guidance, no consolidation), TOOLPRO-prog (always program mode, no policy), and full TOOLPRO (with adaptive consolidation). Evaluations run on real MCP applications with workflows parameterized by N=10, with additional complex workflows cbench1-4. Sensitivity analyses explore network RTT variation (adding delays and geographic deployments) and workflow complexity.

Reproducibility: The paper includes a public code release at https://github.com/morgen52/toolpro_icml26. Benchmarks use open-source MCP applications. The LLMs used include qwen3-coder-flash, GPT-5.1, and Gemini-3-flash-preview. Details of the experimental environment and workload construction are given in the appendices. Exact seeds and randomization details for program synthesis are not described.

Example End-to-End Workflow: For a read-write workflow (e.g., book the highest-rated hotel), the client LLM synthesizes a tool program that loops over hotel listing endpoints, collects details, selects the best hotel, and issues a booking call. The program is checked client-side, projected and compiled to Wasm server-side, and executed with effect-aware call interception guarding the WRITE booking call. On partial failure, the repair feedback loop and effect-aware replay ensure the hotel is not booked twice. The profile policy, based on measured RTT and workflow complexity, determines whether to execute the program or fall back to stepwise calls. Performance gains come chiefly from fewer network round trips and less prompting.
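The round-trip saving in this hotel example can be made concrete with a small Python sketch. Everything here is illustrative: the `HotelService` class, the `list_hotels`, `get_hotel`, and `book_hotel` endpoint names, and the ratings are assumptions, and the sketch counts round trips only, ignoring compilation, sandboxing, and replay.

```python
from typing import Any, Callable, Dict, List

# Hypothetical hotel ids and ratings, for illustration only.
HOTELS = {"h1": 4.1, "h2": 4.8, "h3": 3.9}

Call = Callable[[str, Dict[str, Any]], Any]


class HotelService:
    def __init__(self) -> None:
        self.round_trips = 0
        self.bookings: List[str] = []

    def _dispatch(self, endpoint: str, args: Dict[str, Any]) -> Any:
        if endpoint == "list_hotels":
            return sorted(HOTELS)
        if endpoint == "get_hotel":
            return {"id": args["id"], "rating": HOTELS[args["id"]]}
        if endpoint == "book_hotel":
            self.bookings.append(args["id"])
            return {"booked": args["id"]}
        raise KeyError(endpoint)

    def request(self, endpoint: str, args: Dict[str, Any]) -> Any:
        """Stepwise mode: one client-server round trip per endpoint call."""
        self.round_trips += 1
        return self._dispatch(endpoint, args)

    def submit(self, program: Callable[[Call], Any]) -> Any:
        """Program mode: one round trip; endpoint calls run service-side."""
        self.round_trips += 1
        return program(self._dispatch)


def book_best(call: Call) -> Any:
    """The workflow: list hotels, fetch details, book the highest rated."""
    ids = call("list_hotels", {})
    details = [call("get_hotel", {"id": i}) for i in ids]
    best = max(details, key=lambda d: d["rating"])
    return call("book_hotel", {"id": best["id"]})
```

Running `book_best(svc.request)` costs five round trips with three hotels (one list, three detail fetches, one booking), while `svc.submit(book_best)` costs one. In the stepwise case each of those round trips would also involve a client-side LLM decision, which this sketch does not model.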

Uncertainties: Some repair heuristics and exact program synthesis prompt details are not deeply elaborated. The cost model for consolidation is parameterized from lightweight profiles, but accuracy/stability details are limited. The LLM models’ influence on program correctness and fallback rates is experimentally characterized but not deeply analyzed. The interface design enforces strong constraints for safety, but its impact on expressiveness is not quantified.

Technical innovations

Introduction of tool programs as executable multi-step workflows with explicit READ/WRITE effect typing to replace brittle static endpoint sequences.
Constraint-guided construction combining client-side synthesis with server-side deterministic projection and feedback-driven bounded in-place program repair to ensure reliable execution under strict sandbox constraints.
Effect-aware replay mechanism maintaining exactly-once semantics by logging and replaying WRITE calls across repair-driven re-executions, preventing duplicated state changes.
Profile-driven adaptive consolidation policy using online latency and cost estimates to decide when the build overhead of tool programs justifies replacing stepwise calling.
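The consolidation decision can be sketched as a simple cost comparison. The exact cost model is not given in this summary, so the form below is an assumption: stepwise cost is taken as N round trips each paying network RTT plus client decision overhead, and program cost as one build plus one round trip. The `Profile` fields and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    t_rtt: float    # profiled network round-trip time, seconds
    t_dec: float    # profiled client decision overhead per step, seconds
    t_build: float  # profiled cost of building a tool program, seconds


def stepwise_cost(p: Profile, n: int) -> float:
    # Assumed form: every endpoint call pays a round trip and a decision.
    return n * (p.t_rtt + p.t_dec)


def program_cost(p: Profile) -> float:
    # Assumed form: build once, submit once; server execution time ignored.
    return p.t_build + p.t_rtt


def should_consolidate(p: Profile, n: int) -> bool:
    """Choose program mode when its estimated cost is lower."""
    return program_cost(p) < stepwise_cost(p, n)


def break_even_calls(p: Profile) -> float:
    """Workflow length N above which program mode wins under this model."""
    return program_cost(p) / (p.t_rtt + p.t_dec)
```

With illustrative values of `t_dec=0.5` and `t_build=6.0`, a ten-call workflow stays stepwise at an RTT of 10ms and consolidates at an RTT of 200ms, which matches the qualitative trend reported in the sensitivity analysis but not its exact thresholds.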

Datasets

Memos — size not explicitly stated — open source MCP-style service
Directus — size not explicitly stated — open source MCP-style service
MinIO — size not explicitly stated — open source MCP-style service
Supplemental realistic workflows cbench1–cbench4 — constructed benchmarks covering nondeterminism, branching, coordinated writes, cross-service interaction

Baselines vs proposed

MWS (stepwise MCP Web Service): End-to-end latency ~16.8–24.7s across workflows vs TOOLPRO: latency reduction up to 53.4% (e.g., cbench4 from 52.68s to 24.54s)
MWS client-side traffic: up to 96.1% reduction by TOOLPRO
TOOLPRO-step vs TOOLPRO-prog: TOOLPRO-step better at low network latency (1–100ms RTT), TOOLPRO-prog better at high latency (1000–2000ms RTT)
Disabling effect-aware replay in TOOLPRO increases latency by 19.7% and fallback rate from 0 to 3 out of 15 runs
Task accuracy on complex workflows improved from 0.60 (MWS) to 0.93 (TOOLPRO) in cbench1–3 and from 0.20 to 0.80 in cbench4

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.19992.

Fig 1: From static endpoints to tool programs. Stepwise endpoints force the agent to repeatedly call endpoints and re-prompt to

Fig 2: Three mechanisms of TOOLPRO.

Fig 3 (page 2).

Fig 4 (page 2).

Limitations

The approach relies heavily on LLMs synthesizing correct tool programs, with fallback needed if compilation or repair fails; success rates vary by LLM.
TOOLPRO assumes strict effect typing (READ vs WRITE) and conservative replay, which may limit applicability to services with complex or undocumented side effects.
Evaluation focuses on benchmark procedural workflows on open-source MCP services; broader evaluation on large-scale or commercial agentic web services is not presented.
Effect-aware replay is conservative: when replay consistency cannot be guaranteed, for example because a re-execution changes the arguments or order of committed WRITE calls, the runtime abandons program mode and falls back to stepwise calling. This may limit efficiency gains in some nondeterministic or complex scenarios.
The cost model and profile-driven consolidation rely on stable latency characteristics; rapidly changing network conditions or workloads with dynamic procedural complexity could challenge robust mode selection.
No adversarial or security threat evaluation is provided regarding malicious LLM-generated programs or sandbox escape attempts.

Open questions / follow-ons

How to extend tool programs and effect-aware replay to more complex or distributed stateful services with more nuanced side effect semantics?
Can the program synthesis and repair mechanisms be improved to increase success rates and reduce fallback, especially for more expressive programming interfaces or heterogeneous LLM models?
How does TOOLPRO interact with adversarial or corrupted tool programs? What security mechanisms beyond sandboxing can ensure integrity and compliance?
Could adaptive or learned consolidation policies improve over the current heuristic latency model to better handle dynamic workloads and network variations?

Why it matters for bot defense

This work is relevant to bot-defense and CAPTCHA practitioners who need to support secure, efficient, and robust automation against programmatic web services. Moving from brittle stepwise endpoint calls to consolidated tool programs can substantially reduce client-server round trips and the data leakage risk inherent in repetitive prompting. Effect-aware replay ensures that retries, which are common on unreliable networks and in adversarial settings, do not produce duplicated side effects or inconsistent state. That property is critical for preserving state integrity when automated agents interact with sensitive backend services.

From a bot-defense perspective, TOOLPRO's approach raises new considerations for threat modeling. Executing rich multi-step programs server-side under a constrained sandbox could expose new attack surfaces if malicious programs manage to bypass or subvert constraints. Therefore, defense infrastructure must carefully vet synthesized tool programs and ensure robust sandboxing and effect semantics enforcement. Conversely, the structured interface and explicit effect typing offer observable contracts that enable better detection and monitoring of suspicious or anomalous automation behavior compared to opaque stepwise endpoints. Overall, practitioners designing agent-facing automated services would benefit from embracing programmatic interaction patterns similar to TOOLPRO to reduce latency and client-side fingerprinting while ensuring side-effect safety—a valuable direction for future CAPTCHA-resilient bots and agentic web systems.

Cite

bibtex

@article{arxiv2606_19992,
  title={Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services},
  author={Mugeng Liu and Shuoqi Li and Yixuan Zhang and Yun Ma},
  journal={arXiv preprint arXiv:2606.19992},
  year={2026},
  url={https://arxiv.org/abs/2606.19992}
}
