Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

Source: arXiv:2605.21470 · Published 2026-05-20 · By Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini, Christos Kozyrakis

TL;DR

This paper targets the latency and accuracy limits of computer-use agents (CUAs) that automate complex web tasks from natural language instructions. Existing CUAs run a sequential loop that invokes a large language model (LLM) at every step to choose an action such as a click or keystroke, which leads to high latency (often minutes) and frequent failures from incorrect tool ordering and non-deterministic execution. The authors propose agent just-in-time (JIT) compilation, which translates a natural language task directly into optimized executable code at runtime. This enables static verification of tool preconditions and postconditions, caching of reusable tool implementations, and principled parallel scheduling of subtasks. The system has three parts: (1) a JIT-Planner that generates multiple code plans, validates them against a novel invariant-enforcing tool protocol, estimates their cost over control flow graphs, and selects the minimum-cost plan; (2) a JIT-Scheduler that picks a parallelization strategy adaptively via Monte Carlo sampling from learned latency distributions; and (3) a formal tool interface protocol that enforces compositional correctness of action sequences. On 37 diverse web automation tasks across 5 real applications, JIT-Planner achieves on average a 10.4× speedup and a +28% absolute accuracy improvement over a strong Browser-Use baseline, and JIT-Scheduler improves latency by up to 2.4× and accuracy by up to +9% over commercial CUAs from Anthropic and OpenAI. The results show how much static, cost-aware plan selection and scheduling matter for cutting redundant LLM calls and avoiding common failure modes in web automation agents.

Key findings

JIT-Planner achieves a 10.4× speedup and +28% accuracy increase over Browser-Use baseline across 5 web apps and 37 tasks (Table 1).
Use of invariant-enforcing tool protocols reduces plan failure rate from 80% to 43%, mainly by cutting tool ordering errors from 59% to 25% (Figure 7).
Best-cost plans identified by the planner have 5.3× lower latency than worst-cost plans, driven largely by reduction in unnecessary LLM calls and nested loops (Table 1, Figure 6).
JIT-Scheduler’s Monte Carlo cost estimation enables adaptive selection among serial, parallel, or hedging strategies, yielding 2.4× speedup and +9% accuracy improvement over OpenAI CUA (Figure 10).
Across task cardinality levels, JIT-Planner speedups range from 8.7× to 11.8×, indicating that the optimization benefits scale with task complexity.
Parallel hedged plan generation (8 workers) completes within a 20 s latency budget 92% of the time (pass@20s), compared with 68% for serial retry (Figure 9).
Latency differs by up to 1.5× across static scheduling strategies, depending on task structure.
Invariant checks detect cache staleness caused by UI changes at runtime, enabling graceful fallback and re-planning.

Threat model

The adversary is the complexity and variability of web environments causing nondeterministic UI states, unexpected page changes, and stochastic interaction latencies, which can result in failures if tool sequences violate state invariants or scheduling is suboptimal. The system assumes programmatic access to web application state for precondition checking and no active adversarial interference such as deliberate detection or poisoning.

Methodology — deep read

The authors target a threat model where web automation agents execute programmed sequences of tool calls (e.g., click, type) on a browser to complete user tasks, and optimize to minimize latency and execution errors caused by incorrect tool ordering and inefficient scheduling. Adversaries here correspond to the system complexity and variability of web environments, not malicious actors.

Datasets come from two major benchmarks: REAL (Garg et al., 2025) including Dashdish (food delivery), Gomail (email), and Omnizon (e-commerce), plus WebArena (Zhou et al., 2024) with GitLab and Reddit tasks. A total of 37 tasks were manually curated, varying in cardinality and step length. Each configuration was run across 3 trials.

The core architecture introduces a JIT compiler for agent plans that compiles natural language tasks into code consisting of cached reusable tools rather than primitive click/type actions. The planning phase (JIT-Planner) uses a language model to sample candidate plans in parallel across multiple workers. Each plan is verified against an invariant-enforcing tool protocol that specifies preconditions and postconditions of each tool call, allowing static checking of state flow correctness and type safety. Plans are represented as control flow graphs (CFGs), traversed to validate invariants and accumulate cost estimates penalizing nested LLM calls to approximate latency.
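The static check described above can be sketched as follows. This is a minimal illustration, not the paper's protocol: the `Tool` class, the fact names, and the three example tools are hypothetical, and the sketch only handles straight-line plans rather than full control flow graphs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tool:
    """A cached tool annotated with the state facts it needs and produces (hypothetical schema)."""
    name: str
    requires: frozenset = field(default_factory=frozenset)  # preconditions
    provides: frozenset = field(default_factory=frozenset)  # postconditions


def verify_plan(plan, initial_state=frozenset()):
    """Walk a straight-line plan without executing it.

    Returns (True, None) if every tool's preconditions are established by
    earlier steps, otherwise (False, message) for the first violation.
    """
    state = set(initial_state)
    for step, tool in enumerate(plan):
        missing = tool.requires - state
        if missing:
            return False, f"step {step} ({tool.name}) is missing {sorted(missing)}"
        state |= tool.provides
    return True, None


# Hypothetical tools for a food-delivery app.
list_restaurants = Tool("list_restaurants",
                        provides=frozenset({"restaurants_listed"}))
select_menu_item = Tool("select_menu_item",
                        requires=frozenset({"restaurants_listed"}),
                        provides=frozenset({"item_selected"}))
place_order = Tool("place_order",
                   requires=frozenset({"item_selected"}),
                   provides=frozenset({"order_placed"}))

good_plan = [list_restaurants, select_menu_item, place_order]
bad_plan = [select_menu_item, list_restaurants, place_order]  # wrong tool order

print(verify_plan(good_plan))
print(verify_plan(bad_plan))
```

Because the check only inspects declared preconditions and postconditions, a misordered plan such as `bad_plan` can be rejected before any browser action or LLM call is made.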

Plan candidates are generated and checked repeatedly until a configurable number (k) of valid plans has been collected or a timeout is reached. Cost estimation applies a decay factor γ=10 per level of CFG depth, so that LLM calls inside loops weigh more heavily. The minimum-cost valid plan is selected as the final plan.
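One way to read the cost rule is sketched below. The exact formula is an assumption: here each LLM call contributes γ raised to its loop-nesting depth, tools are treated as free, and the plan is a simple nested tree standing in for a CFG. The `Node` type and the three candidate plans are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

GAMMA = 10  # per-depth factor from the summary; how it enters the cost is assumed


@dataclass
class Node:
    kind: str                                   # "tool", "llm", or "loop"
    children: List["Node"] = field(default_factory=list)


def plan_cost(nodes, depth=0):
    """Count LLM calls, weighting each by GAMMA ** (loop nesting depth)."""
    total = 0
    for node in nodes:
        if node.kind == "llm":
            total += GAMMA ** depth
        elif node.kind == "loop":
            total += plan_cost(node.children, depth + 1)
        elif node.kind != "tool":               # cached tools add no LLM cost here
            raise ValueError(f"unknown node kind: {node.kind}")
    return total


def select_plan(valid_plans):
    """Return the name of the cheapest plan among already-validated candidates."""
    return min(valid_plans, key=lambda name: plan_cost(valid_plans[name]))


# Hypothetical candidates: the same task with the LLM call at different depths.
candidates = {
    "llm_once": [Node("tool"), Node("llm"), Node("tool")],
    "llm_in_loop": [Node("tool"), Node("loop", [Node("tool"), Node("llm")])],
    "llm_in_nested_loop": [Node("loop", [Node("loop", [Node("llm")])])],
}

for name, plan in candidates.items():
    print(name, plan_cost(plan))
print("selected:", select_plan(candidates))
```

Under this reading, a single LLM call costs 1 at the top level, 10 inside one loop, and 100 inside two nested loops, so the selector favors plans that hoist LLM calls out of loops.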

The scheduling component (JIT-Scheduler) uses a separate LLM to predict which page elements each candidate schedule (Serial, Parallel, Hedge) will use. For each strategy, it then runs multiple Monte Carlo trials that sample from offline-learned latency distributions for those DOM elements. The strategy that minimizes expected latency plus overhead is selected dynamically per task.
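The selection step can be sketched roughly as below. The strategy semantics are assumptions made for illustration (serial sums element latencies, parallel takes the slowest element, hedge keeps the fastest of two duplicate serial runs), and the element names, latency samples, and overhead values are placeholders rather than measurements from the paper.

```python
import random

STRATEGIES = ("serial", "parallel", "hedge")


def sample_latency(strategy, latency_samples, elements, rng, replicas=2):
    """Draw one simulated end-to-end latency (seconds) for a strategy."""
    def draw():
        # One empirical draw per DOM element the plan is predicted to touch.
        return [rng.choice(latency_samples[e]) for e in elements]

    if strategy == "serial":      # interactions run one after another
        return sum(draw())
    if strategy == "parallel":    # independent interactions run concurrently
        return max(draw())
    if strategy == "hedge":       # duplicate serial runs, keep the fastest
        return min(sum(draw()) for _ in range(replicas))
    raise ValueError(f"unknown strategy: {strategy}")


def choose_strategy(latency_samples, elements, overhead, trials=2000, seed=0):
    """Estimate expected latency plus overhead per strategy and pick the minimum."""
    rng = random.Random(seed)
    expected = {}
    for strategy in STRATEGIES:
        total = sum(sample_latency(strategy, latency_samples, elements, rng)
                    for _ in range(trials))
        expected[strategy] = total / trials + overhead[strategy]
    best = min(expected, key=expected.get)
    return best, expected


# Placeholder latency traces (seconds) and per-strategy overheads.
latency_samples = {
    "search_box": [0.8, 1.0, 1.1, 4.0],
    "results_list": [1.5, 2.0, 2.2, 6.0],
}
overhead = {"serial": 0.0, "parallel": 0.5, "hedge": 0.5}

best, expected = choose_strategy(latency_samples,
                                 ["search_box", "results_list"], overhead)
print(best, expected)
```

Sampling from recorded traces rather than fitting a parametric model keeps heavy latency tails in the estimate, which is what makes hedging attractive on some tasks and wasteful on others.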

Offline infrastructure is used to synthesize tool code from execution traces for caching, and to learn latency distributions from recorded browser interactions.

Reproducibility: According to the paper, the code is not publicly released, but detailed pseudo-code algorithms and protocol specifications are provided. The evaluations use state-diff checking and pre-defined oracle schedulers as upper-bound baselines.

End-to-end example: For a task like “order cheapest Taco Bell item,” the planner generates multiple code plans that combine cached tools (list restaurants, select menu item, place order). It statically rejects plans that violate preconditions (e.g., accessing an item before listing restaurants) and penalizes plans with unnecessary LLM calls. The scheduler then estimates latencies for serial, parallel, and hedge execution using Monte Carlo sampling over cached latency models and picks the best. The resulting plan executes with significantly fewer LLM calls and lower latency than prior sequential agent loops.
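A compiled plan for this task might look roughly like the sketch below. Everything here is a stand-in: the tool functions operate on an in-memory dictionary instead of a browser, and the menu items and prices are made-up placeholder data. The point is only that the "cheapest item" decision becomes ordinary code rather than a per-step LLM call.

```python
# Hypothetical in-memory stand-in for the web app; real cached tools would drive a browser.
MENUS = {
    "Taco Bell": {"Crunchy Taco": 1.89, "Bean Burrito": 1.99, "Nachos": 2.49},
    "Pizza Place": {"Slice": 3.50},
}
ORDERS = []


def list_restaurants():
    """Cached tool (stub): return the restaurant names currently visible."""
    return sorted(MENUS)


def list_menu_items(restaurant):
    """Cached tool (stub): return {item name: price} for one restaurant."""
    return dict(MENUS[restaurant])


def place_order(restaurant, item):
    """Cached tool (stub): record the order and return a confirmation."""
    ORDERS.append((restaurant, item))
    return {"restaurant": restaurant, "item": item, "status": "placed"}


def plan_order_cheapest(restaurant_name):
    """Compiled plan: list restaurants, read the menu, order the cheapest item."""
    if restaurant_name not in list_restaurants():
        raise LookupError(f"restaurant not found: {restaurant_name}")
    menu = list_menu_items(restaurant_name)
    cheapest = min(menu, key=menu.get)  # plain code, no LLM call needed
    return place_order(restaurant_name, cheapest)


confirmation = plan_order_cheapest("Taco Bell")
print(confirmation)
```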

Technical innovations

Introduction of an invariant-enforcing tool protocol that specifies preconditions and postconditions, enabling static compositional verification of tool sequences at compile time.
Application of just-in-time compilation to compile natural language tasks into optimized executable code plans, allowing static cost estimation and plan selection.
Monte Carlo-based cost-aware scheduling over learned latency distributions, enabling adaptive selection among serial, parallel, and hedging execution strategies.
Use of control flow graph traversal combined with decay-penalized LLM call counting to estimate latency cost of candidate plans prior to execution.

Datasets

REAL benchmark — 27 curated tasks in Dashdish, Gomail, Omnizon — public benchmark (Garg et al., 2025)
WebArena benchmark — 10 tasks from GitLab and Reddit applications — public benchmark (Zhou et al., 2024)

Baselines vs proposed

Browser-Use baseline (Yang et al., 2024): Latency = 122.1s, Accuracy = 61% vs JIT-Planner: Latency = 11.7s, Accuracy = 89% (+28%)
Browser-Use + Cache: Latency = 80.1s, Accuracy = 88% vs JIT-Planner: Latency = 11.7s, Accuracy = 89%
Anthropic CUA: Latency = 141.7s, Accuracy = 79% vs JIT-Scheduler (Gemini-2.5-Pro): Latency = 109.9s, Accuracy = 86%
OpenAI CUA: Latency = 258.7s, Accuracy = 77.8% vs JIT-Scheduler: Latency = 109.9s, Accuracy = 86%
Fixed scheduling strategies (Serial, Parallel, Hedge) vary by up to 1.5× in latency and 15% in accuracy; JIT-Scheduler adapts to achieve better overall trade-offs (Figure 10)
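The headline ratios can be recomputed from the figures listed above. The snippet below is a simple arithmetic check using only those latency and accuracy values; the dictionary layout and function name are illustrative.

```python
# (latency in seconds, accuracy in percent), copied from the comparisons above.
RESULTS = {
    "Browser-Use": (122.1, 61.0),
    "Browser-Use + Cache": (80.1, 88.0),
    "JIT-Planner": (11.7, 89.0),
    "Anthropic CUA": (141.7, 79.0),
    "OpenAI CUA": (258.7, 77.8),
    "JIT-Scheduler": (109.9, 86.0),
}


def compare(baseline, proposed):
    """Return (latency speedup, accuracy gain in percentage points)."""
    base_latency, base_accuracy = RESULTS[baseline]
    new_latency, new_accuracy = RESULTS[proposed]
    return base_latency / new_latency, new_accuracy - base_accuracy


for baseline, proposed in [
    ("Browser-Use", "JIT-Planner"),
    ("Browser-Use + Cache", "JIT-Planner"),
    ("Anthropic CUA", "JIT-Scheduler"),
    ("OpenAI CUA", "JIT-Scheduler"),
]:
    speedup, gain = compare(baseline, proposed)
    print(f"{proposed} vs {baseline}: {speedup:.1f}x faster, {gain:+.1f} points")
```

These work out to roughly 10.4× and +28 points against Browser-Use, 6.8× and +1 point against Browser-Use + Cache, 1.3× and +7 points against the Anthropic CUA, and 2.4× and +8.2 points against the OpenAI CUA. The last gap is slightly below the rounded +9% quoted elsewhere in this summary, which may rest on unrounded values.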

Limitations

Requires substantial offline setup per application: tool synthesis (25–90 min) and latency trace collection (25–45 min), limiting applicability to frequently used sites.
Cached tools can become stale or invalid after UI changes; fallback to re-planning incurs latency penalty.
Does not explicitly handle adversarial stochasticity such as CAPTCHAs, bot detectors, or rate limiting; stochastic failures cause fallback to generic slower options.
Evaluated only on web environments with programmatic state access; desktop or mobile environment applicability not demonstrated.
Code base and pretrained models are not publicly released, limiting reproducibility.
Evaluation uses manual task curation and only 3 trials per task, limiting statistical robustness.

Open questions / follow-ons

How can the invariant-enforcing protocol be extended to better handle stochastic or adversarial web conditions like CAPTCHAs or rate limiting?
Can the JIT compilation approach be adapted to desktop or mobile GUI automation with limited programmatic state inspection?
What are the trade-offs of increasing the number of candidate plans sampled in JIT-Planner on latency, accuracy, and resource usage?
How can online or dynamic revalidation of tool cache be efficiently implemented to handle frequent UI changes?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work highlights critical challenges in automating complex web tasks reliably at low latency, which is precisely the domain where CAPTCHAs and other challenge-response mechanisms intervene to detect bots. The invariant-enforcing tool protocol and static plan validation techniques present a novel way to reduce nondeterministic behavior and failure rates of web automation agents, making them more robust and efficient. From a defense perspective, understanding that modern CUAs use cached reusable tools with strict pre/postcondition checks and adaptive scheduling suggests that defenses relying on nondeterministic UI changes or inducing tool misuse could remain effective.

Furthermore, the scheduler’s ability to choose parallelism and hedging strategies based on learned latency distributions indicates increased sophistication in evading detection mechanisms relying on simple timing heuristics. Researchers in bot defense should consider that attackers may improve automation success by employing JIT compilation approaches akin to those presented. This raises the bar for CAPTCHA and behavioral detectors, which might need to incorporate additional unpredictability or state obfuscation to counter such optimized agents. The paper’s findings therefore inform bot-defense researchers of emerging automation capabilities and suggest potential points of fragility (e.g., enforcing unpredictable UI state or invalidating cached tools) for robust CAPTCHA design.

Cite

bibtex

@article{arxiv2605_21470,
  title={Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling},
  author={Caleb Winston and Ron Yifeng Wang and Azalia Mirhoseini and Christos Kozyrakis},
  journal={arXiv preprint arXiv:2605.21470},
  year={2026},
  url={https://arxiv.org/abs/2605.21470}
}
