Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

Source: arXiv:2606.04391 · Published 2026-06-03 · By Jiaxi Li, Ke Deng, Yun Wang, Jingyuan Huang, Yucheng Shi, Qiaoyu Tan et al.

TL;DR

This paper addresses a key limitation of online skill learning for web automation agents: existing approaches select skills once per task, based on the initial instruction, and keep them fixed throughout execution. This static reuse ignores how the evolving webpage state changes which skills are relevant and applicable as the agent works through a multi-step task. The authors propose State-Grounded Dynamic Retrieval (SGDR), which enables dynamic, stepwise skill reuse conditioned on both the overall task goal and the current webpage state. SGDR extracts reusable intermediate-level procedural skills from successful trajectories via sliding-window extraction and represents each skill as a dual text–code pair to support both retrieval and execution. During task execution, SGDR retrieves and reranks skills relevant to the current execution state and injects them at each step. On the WebArena benchmark, across five domains and two backbone LLMs, SGDR consistently outperforms strong baselines: it improves success rate over the strongest baseline by 10.6% (GPT-4.1) and 10.0% (QWEN3-4B) in relative terms, while also requiring fewer steps on average. The results indicate that adaptive, state-grounded skill retrieval aligns reusable procedural knowledge with evolving web environments better than prior static task-level methods.

Key findings

SGDR achieves an average success rate of 37.5% with GPT-4.1 and 24.3% with QWEN3-4B on the WebArena benchmark, surpassing the strongest baseline (CER) by 3.6 and 2.2 percentage points respectively, representing relative gains of 10.6% and 10.0%.
SGDR reduces the mean number of steps per task: 4.8 steps with GPT-4.1 versus 5.2–6.4 for the baselines, and, with QWEN3-4B, 11.1% fewer steps than Vanilla and 13.8% fewer than CER.
Combining task goal and current state in skill retrieval (with weighting α=0.5) outperforms retrieval conditioned on task or state alone, improving success rates by approximately 2–6 percentage points depending on domain.
MMR reranking to diversify retrieved skills improves over relevance-only top-M retrieval by around 1.5–3.5 percentage points, showing redundancy reduction boosts reuse efficiency.
Sliding-window skill extraction at intermediate granularity outperforms full-trajectory and single-action skill extraction on success rates by multiple points, balancing reusability and procedural completeness.
SGDR maintains stronger cumulative success rates across sequential tasks than baselines, especially on domains like Admin and Reddit, demonstrating effective online adaptation.
The Gitlab domain shows smaller gains, likely because its tasks require whole-task contextual knowledge that local skills do not capture.

Threat model

The work assumes a benign deployment scenario where an agent sequentially solves web tasks and improves by accumulating skills from prior successful executions. There is no explicit adversary threatening the agent’s internal state, retrieval process, or skill integrity. The retrieval mechanism does not consider adversarial manipulation of webpage states or malicious skill injection. Thus, the threat model is effectively absent or minimal, focusing instead on optimizing adaptive skill reuse under evolving legitimate webpage states.

Methodology — deep read

The authors formalize the problem as online skill learning for language-based web agents that solve a sequential stream of tasks. After each task, the agent extracts reusable skills from the completed trajectory and adds them to its skill library. On later tasks it can reuse only skills learned from earlier tasks, and no ground-truth success signal is available during execution. Each task consists of multiple interaction steps, and the webpage states and actions are recorded as a trajectory.
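The online protocol described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `run_stream`, `solve`, `judge_success`, and `extract_skills` are hypothetical names, and the agent, evaluator, and extractor are passed in as plain callables.

```python
from typing import Callable, Sequence

def run_stream(
    tasks: Sequence[str],
    solve: Callable[[str, tuple], list],          # hypothetical agent: (task, skills) -> trajectory
    judge_success: Callable[[str, list], bool],   # hypothetical evaluator standing in for ground truth
    extract_skills: Callable[[list], list],       # hypothetical skill extractor
) -> tuple[list[bool], list]:
    """Process tasks one at a time, growing the skill library after each task."""
    library: list = []
    outcomes: list[bool] = []
    for task in tasks:
        # The agent sees only skills learned from earlier tasks in the stream.
        trajectory = solve(task, tuple(library))
        success = judge_success(task, trajectory)
        outcomes.append(success)
        # Skills are added only when the trajectory is judged successful.
        if success:
            library.extend(extract_skills(trajectory))
    return outcomes, library
```

Passing the library as an immutable snapshot makes it explicit that a task cannot benefit from skills extracted from its own trajectory.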

SGDR first segments successful trajectories using sliding windows of varying length (2 to 5 steps) to identify candidate reusable sub-procedures. Each candidate is passed to an LLM, which assesses whether it is reusable and generates a paired skill consisting of (1) a natural-language description and (2) executable code that implements the procedure. The skill code can be invoked as a single callable action during future execution. Each skill is verified by replacing the trajectory segment with the skill call and executing it to check that task success is preserved. Only verified skills are added to the library.
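The window enumeration can be sketched as follows. The 2-to-5-step range comes from the summary above; everything else is an assumption for illustration: `sliding_windows` and `induce_skills` are hypothetical names, and `propose` and `verify` are placeholder callables standing in for the LLM skill generator and the replay-based verification.

```python
from typing import Callable, Iterator, Optional, Sequence

def sliding_windows(
    trajectory: Sequence, min_len: int = 2, max_len: int = 5
) -> Iterator[tuple[int, tuple]]:
    """Yield (start_index, window) for every contiguous window of 2 to 5 steps."""
    n = len(trajectory)
    for length in range(min_len, max_len + 1):
        for start in range(0, n - length + 1):
            yield start, tuple(trajectory[start:start + length])

def induce_skills(
    trajectory: Sequence,
    propose: Callable[[tuple], Optional[object]],             # hypothetical: returns None if not reusable
    verify: Callable[[Sequence, int, int, object], bool],     # hypothetical: replay with the skill call
) -> list:
    """Keep only proposed skills that pass verification."""
    skills = []
    for start, window in sliding_windows(trajectory):
        skill = propose(window)
        if skill is not None and verify(trajectory, start, len(window), skill):
            skills.append(skill)
    return skills
```

A trajectory shorter than two steps yields no windows, so it contributes no skills under this sketch.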

At each step of a new task, the agent uses an LLM to produce a compact summary of the current webpage state and embeds it. Together with the task instruction embedding, this forms a combined query. Skill descriptions in the library are also embedded. Retrieval computes a weighted cosine similarity score that combines task relevance and state relevance. The top-M candidates are then reranked using Maximal Marginal Relevance (MMR), which reduces redundancy and encourages diversity by trading off relevance against novelty through a tunable hyperparameter λ. The top-5 reranked skills are injected into the agent's context as callable procedures for that step only, so that reuse stays grounded in the evolving web state.
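A minimal sketch of the scoring and reranking step, under stated assumptions. The summary does not give the exact formula, so the form `alpha * cos(skill, task) + (1 - alpha) * cos(skill, state)` is assumed, as is the standard MMR objective `lam * relevance - (1 - lam) * max similarity to already selected skills`. The function names and the defaults for `lam` and `top_m` are placeholders; only α=0.5 and the final top-5 come from the summary.

```python
import math
from typing import Sequence

Vector = Sequence[float]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, returning 0.0 for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def combined_score(skill: Vector, task: Vector, state: Vector, alpha: float = 0.5) -> float:
    """Assumed form: alpha weights task relevance, (1 - alpha) weights state relevance."""
    return alpha * cosine(skill, task) + (1 - alpha) * cosine(skill, state)

def retrieve(
    skill_vecs: dict[str, Vector],
    task: Vector,
    state: Vector,
    alpha: float = 0.5,
    lam: float = 0.5,    # placeholder value, not from the paper
    top_m: int = 10,     # placeholder value, not from the paper
    top_k: int = 5,
) -> list[str]:
    """Take the top-M skills by relevance, then greedily pick top-K by MMR."""
    relevance = {
        name: combined_score(vec, task, state, alpha) for name, vec in skill_vecs.items()
    }
    candidates = sorted(relevance, key=relevance.get, reverse=True)[:top_m]
    selected: list[str] = []
    while candidates and len(selected) < top_k:
        def mmr(name: str) -> float:
            # Redundancy is the highest similarity to any skill already chosen.
            redundancy = max(
                (cosine(skill_vecs[name], skill_vecs[s]) for s in selected), default=0.0
            )
            return lam * relevance[name] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Setting `lam=1.0` reduces this to relevance-only top-K selection, which corresponds to the relevance-only variant the ablation compares against.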

Experiments use the WebArena benchmark with five website domains and two backbone LLMs (GPT-4.1 and QWEN3-4B). Tasks are streamed sequentially by domain, with a separate skill library per domain. Baselines include Vanilla (no skill reuse) and three recent online skill learning methods (AWM, ASI, CER), which retrieve skills only once at task start. Evaluation metrics are binary task success rate and average steps to completion. Ablations analyze the impact of the retrieval weighting α, the MMR relevance-diversity parameter λ, and skill extraction granularity.

The same backbone LLM is used for all LLM-based components: skill induction, summarization, action planning, and evaluation. Skills are extracted only from trajectories judged successful by an evaluator model (trained or oracle), since ground-truth success labels are unavailable online. The sliding-window extractor enumerates overlapping sub-segments and verifies them using the evaluator. Retrieval embeddings come from text encoders (not specified in detail). Details on hyperparameters, LLM prompts, exact retrieval M and L values, and seed settings are provided in the appendix.

An end-to-end example: a completed task trajectory is segmented into windows of 2–5 observation-action steps, and each window is passed to an LLM to generate a text description and code. Verified reusable skills (e.g., "open account settings") are stored. On a subsequent related task, at each intermediate webpage state the agent generates a summary embedding, combines it with the task instruction embedding, and retrieves a diverse set of relevant skills. The retrieved skills are injected as callable code snippets that supplement the agent's primitive action set, so multi-step automation adapts to the evolving webpage context.
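To make the dual text–code representation concrete, here is a hedged sketch of what a stored skill might look like. The `Skill` and `FakePage` classes, the `click` primitive, and the element names are all hypothetical. The summary only states that a skill pairs a natural-language description with executable code that can be invoked as a single action.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class FakePage:
    """Stand-in for a browser page; it only records the primitive actions it receives."""
    log: list[str] = field(default_factory=list)

    def click(self, target: str) -> None:
        self.log.append(f"click:{target}")

@dataclass(frozen=True)
class Skill:
    name: str
    description: str                   # text side: embedded and matched during retrieval
    run: Callable[[FakePage], None]    # code side: invoked as one callable action

def open_account_settings(page: FakePage) -> None:
    # Hypothetical two-step procedure collapsed into a single skill.
    page.click("account menu")
    page.click("settings")

skill = Skill(
    name="open_account_settings",
    description="Open the account settings page from any logged-in page.",
    run=open_account_settings,
)

def action_space(primitives: list[str], retrieved: list[Skill]) -> list[str]:
    """Skills retrieved for the current step extend the primitive action set."""
    return primitives + [s.name for s in retrieved]
```

Calling `skill.run(page)` performs both clicks as one agent action, which is one way the reported reduction in step count could arise.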

Technical innovations

Sliding-window based skill extraction from evaluator-verified sub-trajectories enables intermediate-granularity reusable procedural skills rather than whole-task workflows or single actions.
Dual text–code skill representation facilitates both natural language retrieval and executable action injection, integrating semantic intent with practical execution.
State-grounded dynamic retrieval scores skills by combining similarity to both the current task instruction and evolving webpage state embeddings, enabling context-adaptive skill reuse.
Maximal Marginal Relevance (MMR) reranking balances relevance and diversity to reduce procedural redundancy among retrieved skills, improving efficiency in skill selection across steps.

Datasets

WebArena — several hundred multi-step web tasks across 5 domains (Shopping, Admin, Reddit, Gitlab, Map) — public benchmark from Zhou et al. 2024

Baselines vs proposed

Vanilla (no skills): GPT-4.1 success rate = 28.3% vs SGDR = 37.5%
Agent Workflow Memory (AWM): GPT-4.1 success rate = 27.8% vs SGDR = 37.5%
Agent Skill Induction (ASI): GPT-4.1 success rate = 33.0% vs SGDR = 37.5%
Contextual Experience Replay (CER): GPT-4.1 success rate = 33.9% vs SGDR = 37.5%
Vanilla: QWEN3-4B success rate = 16.5% vs SGDR = 24.3%
AWM: QWEN3-4B success rate = 15.7% vs SGDR = 24.3%
ASI: QWEN3-4B success rate = 20.8% vs SGDR = 24.3%
CER: QWEN3-4B success rate = 22.1% vs SGDR = 24.3%
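The absolute and relative gains quoted in the key findings can be checked directly from the success rates listed above. The short script below is only a worked arithmetic example; `gains` is a hypothetical helper name.

```python
# Success rates (%) as listed above.
baselines = {
    "GPT-4.1": {"Vanilla": 28.3, "AWM": 27.8, "ASI": 33.0, "CER": 33.9},
    "QWEN3-4B": {"Vanilla": 16.5, "AWM": 15.7, "ASI": 20.8, "CER": 22.1},
}
sgdr = {"GPT-4.1": 37.5, "QWEN3-4B": 24.3}

def gains(baseline: float, proposed: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = proposed - baseline
    return absolute, 100.0 * absolute / baseline

for model, rates in baselines.items():
    strongest = max(rates, key=rates.get)  # strongest baseline for this backbone
    absolute, relative = gains(rates[strongest], sgdr[model])
    print(f"{model}: +{absolute:.1f} pp over {strongest} ({relative:.1f}% relative)")
```

For both backbones the strongest baseline is CER, so the gains are 3.6 points (10.6% relative) for GPT-4.1 and 2.2 points (10.0% relative) for QWEN3-4B.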

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.04391.

Fig 3: Overview of our method SGDR. Completed trajectories are segmented with sliding windows to induce

Limitations

Evaluations are restricted to WebArena with five domains that may not cover the full diversity of real-world websites and interaction patterns.
Only two backbone LLMs (GPT-4.1 and QWEN3-4B) are evaluated, so generalization to other model families and sizes is untested. GPT-4.1 is proprietary, which may also limit replication.
SGDR focuses on reusable sub-procedures local to small trajectory windows, potentially limiting effectiveness on tasks requiring holistic context or long-range dependencies (e.g., Gitlab domain).
The skill extraction and retrieval rely on LLM-based embedding and summarization without detailed ablation of embedding architectures or robustness under noisy webpage states.
No adversarial evaluation or robustness testing against malicious or non-standard webpage structures was reported.
The approach requires an evaluator model to judge task success post-completion, which may not always be practical or available in deployment scenarios.

Open questions / follow-ons

How well does SGDR generalize to broader and more diverse real-world websites beyond WebArena’s domains?
Can the approach be extended to handle multi-domain or cross-website tasks requiring skill transfer or composition?
How robust is the skill extraction and retrieval process when webpage states include noisy, dynamic, or adversarial DOM changes?
Could fully end-to-end learning replace the evaluator or embedding components to jointly optimize skill induction and retrieval?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, SGDR’s key insight is that dynamically grounding reusable procedural skills on the evolving state of a web agent’s environment increases automation efficiency and success. This suggests that bot detection strategies relying on static behavioral signatures may become less effective if agents adapt continually and reuse skills contextually. Defenders might anticipate that advanced web agents will exhibit state-aware dynamic interaction patterns rather than fixed scripted behaviors.

From a CAPTCHA perspective, understanding these adaptive skill reuse mechanisms can guide the design of challenges that disrupt intermediate states critical for skill applicability, or detect inconsistencies in dynamically selected procedural patterns. Incorporating state-grounded behavioral analysis into bot detection could enhance differentiation between human and sophisticated automated agents. SGDR highlights the increasing sophistication of web agents adapting flexibly to web context evolution, raising the bar for bot-defense techniques that rely on identifying static skill invocation sequences.

Cite

@article{arxiv2606_04391,
  title={Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval},
  author={Jiaxi Li and Ke Deng and Yun Wang and Jingyuan Huang and Yucheng Shi and Qiaoyu Tan and Jin Lu and Ninghao Liu},
  journal={arXiv preprint arXiv:2606.04391},
  year={2026},
  url={https://arxiv.org/abs/2606.04391}
}
