Skim: Speculative Execution for Fast and Efficient Web Agents

Source: arXiv:2605.16565 · Published 2026-05-15 · By Mike Wong, Kevin Hsieh, Suman Nath, Ravi Netravali

TL;DR

Current LLM-based web agents are slow and expensive when automating complex multi-step tasks on live, purpose-built websites. They invoke heavyweight components (frontier LLM reasoning, full browser rendering, and iterative ReAct planning) at every step regardless of that step’s complexity, which yields long task runtimes (30-120 seconds) and substantial API costs ($0.20-$0.50 per task). The authors observe that websites exhibit stable URL patterns, answer formats, and task-to-trajectory mappings for repeated query types, so many queries do not need the full agent loop.

Skim is a speculative execution framework that exploits this semi-static structure instead of paying uniform inference and browser costs on every step. Offline, it precomputes site profiles that encode URL templates, search semantics, answer schemas, and capability metadata. At runtime, it matches an incoming query to a template, synthesizes the destination URL, retrieves the page with a lightweight HTTP fetch or minimal browser rendering, and extracts the answer with a small model. A verifier checks the result; on failure, execution cascades to the full ReAct agent, warm-started from the fast path’s final URL so that some navigation progress is preserved. On standard benchmarks with three backbone agents, Skim reduces median per-task cost by 1.9× (nearly half) and median per-task latency by 33.4% (about one-third) with no loss in accuracy. An alternate aggregate mode spends the saved compute on multiple speculative attempts and improves task accuracy by up to 16.7 percentage points.

Key findings

Median per-task latency reduced by 33.4% across WebVoyager and WebShop benchmarks when using Skim with backbone agents BrowserUse, AgentOccam, and WebVoyager.
Median per-task cost reduced by 1.9× compared to full ReAct agent execution on the same benchmarks with no accuracy degradation.
Hand-engineered programs exploiting website structure achieve 66.7-94.9% faster execution and 17.7-100.7× lower cost than off-the-shelf ReAct agents on representative tasks without accuracy loss.
In 55.8% of benchmark tasks, answers can be retrieved directly via HTTP fetch without browser rendering, supporting lightweight execution.
66.7% of steps in typical web-agent trajectories are purely navigational, amenable to direct URL synthesis rather than multi-step reasoning.
Non-frontier extraction models nearly match frontier model accuracy when extracting answers from denoised (focused) page regions but degrade substantially on full page content (e.g., Qwen2.5-14B reaches 42.7% accuracy on denoised regions versus 26.1% on full pages, while GPT-4o reaches 45.7% and 45.0% respectively).
Skim’s lightweight verifier effectively gates speculative outputs, and failed verifications warm-start the full agent at the speculative URL to recover progress.
Aggregating multiple speculative fast-path executions with verifier ranking can increase end-to-end accuracy by up to 16.7 percentage points within the baseline compute budget.
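
The findings above mix two ways of reporting savings: multiplicative factors (1.9×, 17.7-100.7×) and percentage reductions (33.4%, 66.7-94.9%). The short sketch below converts between them so the figures can be compared on one scale. It assumes that "N× lower" means the new value is the baseline divided by N and that "P% faster" means runtime drops by P%; the function names are illustrative and do not come from the paper.

```python
def percent_saved(reduction_factor: float) -> float:
    """Convert an 'N-times lower' factor into the percentage of the baseline saved."""
    return (1.0 - 1.0 / reduction_factor) * 100.0


def speedup_factor(percent_reduction: float) -> float:
    """Convert a 'P% lower' figure into the equivalent multiplicative factor."""
    return 1.0 / (1.0 - percent_reduction / 100.0)


if __name__ == "__main__":
    # Skim's headline numbers from the summary above.
    print(f"1.9x lower cost     = {percent_saved(1.9):.1f}% of cost saved")
    print(f"33.4% lower latency = {speedup_factor(33.4):.2f}x")
    # Range reported for the hand-engineered programs.
    print(f"66.7% lower latency = {speedup_factor(66.7):.2f}x")
    print(f"94.9% lower latency = {speedup_factor(94.9):.2f}x")
```

Under these assumptions, a 1.9× cost reduction corresponds to roughly 47% of the baseline cost saved, which is consistent with the "nearly half" description in the TL;DR.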

Threat model

The threat model focuses on the internal computational overhead and inefficiency of current web-agent execution rather than external adversarial attacks. The adversary can be viewed as the complexity and unpredictability of web navigation and dynamic content that cause repeated invocation of heavy frontier LLM inference, browser rendering, and multi-step planning. The system assumes sites have slow-changing structural properties that can be profiled offline and relies on runtime verification to catch mismatches. It does not consider adversarial sites actively trying to deceive or evade the agent or attempts to bypass bot defenses.

Methodology — deep read

The authors first define the threat and environment model as an existing web agent interacting with live, purpose-built websites to complete read-only query tasks via browsing and interaction. The adversary is the runtime environment complexity posing combinatorial inference and navigation costs; no adversarial tampering is considered. Data derives from standard web-agent benchmarks (WebVoyager and WebShop), covering 151 representative tasks across multiple sites such as Amazon, arXiv, GitHub, and others. Task text queries are mapped to desired information from live websites. Split details are not explicitly stated but evaluation covers diverse realistic queries.

Skim’s architecture comprises three major phases: offline site profiling, runtime speculative execution, and fallback escalation. The offline profiler generates reusable site profiles capturing typed URL templates (parameterized patterns for product pages, search filters, pagination), search semantics (query and filter handling), answer schemas (expected structure and format of result pages), and capability metadata (HTTP accessibility, JavaScript dependency, bot anti-detection signals). This profiling runs once per site through automated crawling, probing, and analysis using lightweight HTTP fetches escalated to full browser if needed. The profiler also synthesizes guidance for query construction with a local LLM assisting in learning which parameters improve retrieval.
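
To make the four profile components concrete, here is a minimal sketch of what a site profile record might look like. The paper's actual profile format is not described in this summary, so every class, field, and example value below is a hypothetical illustration, including the rule that picks the cheapest fetch tier from the capability metadata.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class FetchTier(Enum):
    HTTP = "http"        # plain HTTP fetch, no rendering
    BROWSER = "browser"  # full browser navigation and rendering


@dataclass
class UrlTemplate:
    name: str                    # e.g. "search" or "product_page"
    pattern: str                 # e.g. "https://shop.example.com/s?q={q}"
    param_regex: dict[str, str]  # parameter name -> regex a valid value must match

    def accepts(self, params: dict[str, str]) -> bool:
        """True when exactly the expected parameters are present and all are valid."""
        if set(params) != set(self.param_regex):
            return False
        return all(re.fullmatch(self.param_regex[k], v) for k, v in params.items())


@dataclass
class SiteProfile:
    site: str
    templates: list[UrlTemplate]
    search_semantics: dict[str, str] = field(default_factory=dict)  # notes on query/filter handling
    answer_schema: dict[str, str] = field(default_factory=dict)     # answer field -> expected regex
    http_accessible: bool = True   # capability metadata (assumed flags)
    needs_javascript: bool = False
    bot_detection: bool = False

    def min_fetch_tier(self) -> FetchTier:
        """Pick the cheapest acquisition tier the capability metadata allows."""
        if self.http_accessible and not self.needs_javascript and not self.bot_detection:
            return FetchTier.HTTP
        return FetchTier.BROWSER


if __name__ == "__main__":
    search = UrlTemplate("search", "https://shop.example.com/s?q={q}", {"q": r"[\w\s\-]{1,80}"})
    profile = SiteProfile("shop.example.com", [search], answer_schema={"price": r"\$\d+(\.\d{2})?"})
    print(profile.min_fetch_tier(), search.accepts({"q": "blue headphones"}))
```

Keeping the capability flags in the profile means the runtime can decide between an HTTP fetch and a browser render without probing the site again on every query.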

At runtime, given a query, Skim’s URL synthesizer uses a non-frontier model combined with regex constraints from the profile to identify the appropriate URL template and extract site-valid parameter values from the natural-language query. The result is a candidate destination URL that skips multi-step navigation. Skim then chooses the minimal execution resources for each task step along three axes: page acquisition (direct HTTP fetch or full browser navigation), page rendering (HTTP-only or full browser render), and reasoning model (small extraction model or frontier model). The speculative fast path fetches the page at the minimal tier, extracts the answer with the small model, and checks the output against the query and the answer schema with a lightweight verifier. If verification fails, execution escalates to the full ReAct agent, warm-started at the fast path’s final URL to retain navigation progress.
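
The regex-constrained URL synthesis step can be sketched as follows. In Skim the template choice and parameter values are proposed by a small model; in this sketch they are passed in directly so the example stays self-contained. The template, the host `shop.example.com`, and the parameter names are assumptions made for illustration, not details from the paper.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass(frozen=True)
class TemplateSpec:
    base: str                    # URL without the query string
    constraints: dict[str, str]  # parameter name -> regex for site-valid values
    required: frozenset[str]     # parameters that must be present


# Hypothetical profile entry for a shopping site's search page.
TEMPLATES = {
    "search": TemplateSpec(
        base="https://shop.example.com/s",
        constraints={"q": r"[\w\s\-]{1,80}", "max_price": r"\d{1,6}"},
        required=frozenset({"q"}),
    ),
}


def synthesize_url(template_name: str, proposed: dict[str, str]) -> Optional[str]:
    """Build a destination URL, or return None if any proposed value is invalid.

    Returning None (instead of guessing) lets the caller fall back to the full agent.
    """
    spec = TEMPLATES.get(template_name)
    if spec is None:
        return None
    for key, value in proposed.items():
        pattern = spec.constraints.get(key)
        if pattern is None or re.fullmatch(pattern, value) is None:
            return None
    if not spec.required.issubset(proposed):
        return None
    return f"{spec.base}?{urlencode(proposed)}"


if __name__ == "__main__":
    # Values a small model might extract from
    # "find the cheapest blue headphones under $100".
    print(synthesize_url("search", {"q": "blue headphones", "max_price": "100"}))
    print(synthesize_url("search", {"q": "blue headphones", "max_price": "cheap"}))
```

The first call yields a URL with both parameters encoded; the second is rejected because `cheap` does not satisfy the numeric constraint, which is the kind of model error the regex check is meant to catch before any page is fetched.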

Training of models used in Skim’s pipeline (URL synthesizer, resource predictor, verifier) involves using site-specific task collections with labeled templates, URL parameters, and answers. Hyperparameters, epochs, or hardware details are not specified in the paper. The extraction models include a small non-frontier LLM (e.g., Qwen2.5-14B) and a frontier model (e.g., GPT-4o) used as baseline.

Evaluation measures per-task latency, cost (in dollars attributable to LLM inference), and accuracy across three backbone agents (WebVoyager, AgentOccam, BrowserUse) and two benchmarks. The authors compare Skim’s accelerate mode (committing to the first verified fast-path result) and aggregate mode (multiple speculative attempts ranked by the verifier) against full ReAct execution. Ablations include hand-engineered programs that hard-code URL templates and extraction regions. Statistical tests are not detailed; results show consistent latency and cost reductions without accuracy loss across hundreds of tasks.
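
The difference between the two modes can be illustrated with a small sketch. It assumes the verifier returns a numeric score that is compared against a threshold; the summary only says that the verifier gates and ranks outputs, so the scoring interface, the threshold, and the function names here are hypothetical.

```python
from typing import Callable, Iterable, Optional

Attempt = Callable[[], Optional[str]]  # one speculative fast-path run
Scorer = Callable[[str], float]        # assumed verifier interface: answer -> confidence


def accelerate(attempts: Iterable[Attempt], score: Scorer, threshold: float) -> Optional[str]:
    """Commit to the first attempt whose verifier score clears the threshold."""
    for attempt in attempts:
        answer = attempt()
        if answer is not None and score(answer) >= threshold:
            return answer
    return None  # caller escalates to the full agent


def aggregate(attempts: Iterable[Attempt], score: Scorer, threshold: float) -> Optional[str]:
    """Run every attempt, then return the highest-scoring verified answer."""
    answers = [a for a in (attempt() for attempt in attempts) if a is not None]
    verified = [(score(a), a) for a in answers if score(a) >= threshold]
    if not verified:
        return None
    return max(verified, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    scores = {"$89.99": 0.6, "$79.99": 0.9}
    runs = [lambda: "$89.99", lambda: None, lambda: "$79.99"]
    print(accelerate(runs, scores.__getitem__, threshold=0.5))  # stops at the first verified answer
    print(aggregate(runs, scores.__getitem__, threshold=0.5))   # keeps the best-ranked answer
```

Accelerate mode minimizes latency and cost because it stops early, while aggregate mode spends the saved budget on extra attempts in exchange for a better chance of picking the right answer.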

The system is reproducible in principle through its offline profiling pipeline and modular runtime components, but the paper does not state that code is publicly released. The data (live web pages and query benchmarks) are partly public, but some profiling results and the dependency on commercial backbone agents may limit full reproduction. End-to-end example: a query like “find the cheapest blue headphones under $100 on Amazon” generates a URL with encoded filters, fetches the page via HTTP, extracts the price region with a small model, and verifies the answer against the schema and the query. If verification passes, Skim returns the answer; otherwise it falls back to the ReAct agent, starting from that URL.

Technical innovations

Offline automated site profiling to encode reusable URL templates, search semantics, answer schemas, and capability metadata enabling lightweight runtime specialization.
Runtime speculative execution framework that combines direct URL synthesis, minimal HTTP fetches, and small extraction models with lightweight verification gating to accelerate web-agent tasks.
Hybrid runtime template matching combining lightweight LLM intent classification with regex-based parameter validation for robust URL parameter extraction.
Fallback cascade that warm-starts the full iterative ReAct agent from the speculative path’s final URL to preserve navigational progress and avoid recomputation.
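
The control flow of the fast path and its fallback cascade can be sketched as below. Each component (URL synthesizer, fetcher, extractor, verifier, full agent) is passed in as a plain callable because the paper's real interfaces are not given in this summary; the signatures and the `Outcome` record are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    answer: str
    path: str                 # "fast" or "full"
    start_url: Optional[str]  # URL the full agent was warm-started from, if any


def run_with_fallback(
    query: str,
    synthesize_url: Callable[[str], Optional[str]],
    fetch: Callable[[str], str],
    extract: Callable[[str, str], Optional[str]],
    verify: Callable[[str, str], bool],
    full_agent: Callable[[str, Optional[str]], str],
) -> Outcome:
    """Try the speculative fast path; on any failure, escalate to the full agent."""
    url = synthesize_url(query)
    if url is not None:
        page = fetch(url)
        answer = extract(query, page)
        if answer is not None and verify(query, answer):
            return Outcome(answer, "fast", None)
    # Warm start: hand the speculative URL to the full agent so it does not
    # repeat the navigation. With url=None this degrades to a cold start.
    return Outcome(full_agent(query, url), "full", url)


if __name__ == "__main__":
    outcome = run_with_fallback(
        "price of item 42",
        synthesize_url=lambda q: "https://shop.example.com/item/42",
        fetch=lambda u: "<span class='price'>$19.99</span>",
        extract=lambda q, page: "$19.99",
        verify=lambda q, a: a.startswith("$"),
        full_agent=lambda q, u: "answer from full agent",
    )
    print(outcome)
```

Passing the speculative URL into the full agent is what distinguishes this cascade from a simple retry: even a failed speculation can save the agent its navigation steps.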

Datasets

WebVoyager benchmark — 151 tasks — public benchmark of realistic web-agent queries
WebShop benchmark — size not explicitly stated — public benchmark for web shopping tasks

Baselines vs proposed

Off-the-shelf ReAct agent: per-task latency = 30-120 s, per-task cost = $0.20-$0.50; Skim (accelerate mode): median latency reduced by 33.4%, median cost reduced by 1.9× with no accuracy loss
Hand-engineered specialized programs: 66.7-94.9% faster and 17.7-100.7× cheaper than ReAct agent on same tasks with no loss in accuracy
Non-frontier extraction model Qwen2.5-14B: 42.7% accuracy on denoised pages vs 26.1% on full pages; GPT-4o frontier model achieves 45.7% and 45.0% respectively

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.16565.

Fig 1: Workflow of a representative ReAct-based web

Fig 2: Distribution of number of ReAct steps needed for

Fig 3: Per-step breakdown of latencies. Actions involve

Fig 4: Distribution of the percentage of steps per task that

Fig 5: Web state input at one step of a task. Red circle

Fig 6: Latencies of hand-

Fig 7: Costs of hand-

Limitations

Skim is designed and evaluated mainly on read-only web tasks with stable site structures; state-mutating or less stable sites are passed directly to full agents.
Site profile construction is offline and may require occasional re-profiling to handle structural drift; the evaluation does not fully explore dynamic site changes.
Dataset coverage and benchmark diversity are limited to a set of popular sites; generalization to highly dynamic or non-purpose-built sites remains untested.
Verification depends on predefined answer schemas; complex or unpredictable answer formats might reduce effectiveness.
No evaluation under adversarial conditions, poisoned sites, or intentional bot-detection attempts was reported.
Details about training hyperparameters, model sizes beyond Qwen2.5-14B and GPT-4o, and hardware usage are sparse, limiting reproducibility.

Open questions / follow-ons

How well does Skim adapt to sites with frequent or unpredictable structural changes, or those using aggressive client-side rendering?
Can the verification mechanisms be extended to handle more complex or heterogeneous answer formats beyond structured schemas?
What is the impact of adversarial or misleading site content designed to trigger verification failures or force fallback to expensive execution?
How would the approach generalize to stateful interaction tasks that modify server-side state or session context?

Why it matters for bot defense

From a bot-defense and CAPTCHA engineering perspective, this paper reveals that large-scale LLM-based web agents still execute many redundant and expensive operations due to treating all navigation and extraction steps uniformly. Skim’s speculative execution with offline site profiling and runtime verification provides a powerful way to shortcut and accelerate these agents by exploiting site structure stability. This suggests that bot defenders could anticipate similar specialization optimizations and should consider how CAPTCHAs or dynamic challenges might affect such profile-driven URL synthesis and extraction pipelines.

Additionally, the tiered resource model and verification cascade illustrate a practical way to combine fast approximations with fallback mechanisms. The same pattern could inform bot-defense architectures that trade off verification cost, response time, and fallback escalation when verifying user or agent authenticity. Verifiers that check output against expected answer schemas align conceptually with CAPTCHA challenges that verify consistency of user behavior. However, because Skim depends on site stability and predictable patterns, defenses that introduce structural unpredictability or dynamic content variation could degrade this kind of speculative acceleration and push agents back onto the expensive full execution path.

Cite

bibtex

@article{arxiv2605_16565,
  title={Skim: Speculative Execution for Fast and Efficient Web Agents},
  author={Mike Wong and Kevin Hsieh and Suman Nath and Ravi Netravali},
  journal={arXiv preprint arXiv:2605.16565},
  year={2026},
  url={https://arxiv.org/abs/2605.16565}
}
