From History to State: Constant-Context Skill Learning for LLM Agents

Source: arXiv:2605.05413 · Published 2026-05-06 · By Haoyang Xie, Xinyuan Wang, Yancheng Wang, Puda Zhao, Feng Ju

TL;DR

This paper addresses the tension between capability, cost, and privacy in LLM-based personal agents. Cloud-hosted models are capable but expose sensitive intermediate state to external APIs, while local models preserve privacy but underperform on multi-step tasks. Both paradigms also suffer from prompt context that grows with every step: each step re-reads long skill descriptions and appends the full interaction history, so cumulative token cost grows roughly quadratically with episode length. The paper proposes 'constant-context skill learning,' a framework that relocates reusable procedural knowledge from prompt-time context into the weights of lightweight, per-task-family LoRA adapters, while keeping the inference prompt bounded to only the current observation, a one-step context snippet, and a compact tracker-rendered 'state block' that summarizes task progress deterministically.

The core mechanism is a three-stage pipeline: (1) a deterministic task tracker compresses execution history into a structured state block (no LLM calls involved), (2) these state blocks are used to construct step-level supervised fine-tuning (SFT) targets from expert trajectories, and (3) the adapter is further refined via online GRPO-style reinforcement learning using subgoal rewards derived from the same tracker fields. The base model remains fully frozen throughout; only the LoRA adapter (~2% of parameters, ~0.5–0.7 GB on disk) is updated per task family, making it possible to add new workflows without catastrophic forgetting.

Evaluated on ALFWorld, WebShop, and SciWorld using Qwen3-4B, Qwen3-8B, and Llama-3.1-8B backbones, the method achieves competitive or state-of-the-art performance while reducing prompt tokens per turn by 2–7x relative to ReAct baselines. With Qwen3-8B SFT+RL, the system reaches 89.6% unseen success on ALFWorld, 76.8% success on WebShop, and 66.4% unseen success on SciWorld, matching or exceeding published results from methods that use multiple H100 GPUs, while training entirely on a single A100 80GB GPU.

Key findings

With Qwen3-8B SFT+RL, the method achieves 89.6% unseen success on ALFWorld (seen: 83.6%), 76.8% success on WebShop, and 66.4% unseen success on SciWorld (seen: 62.9%), per Table 1.
RL refinement over SFT-only improves performance by roughly 13–28 percentage points on ALFWorld, 15–17 points on WebShop, and 5–15 points on SciWorld across all three backbones (Table 1).
Prompt tokens per turn with Qwen3-8B drop from 1,310 (ReAct-full) and 380 (ReAct-1step) to ~184 on ALFWorld, from 3,093/1,059 to ~488 on WebShop, and from 1,938/1,481 to ~496 on SciWorld (Table 2).
Total tokens per episode fall by ~10–12x on ALFWorld (34,199 → 3,251), ~14x on WebShop (46,531 → 3,353), and ~2–3x on SciWorld (42,280 → ~17,894) relative to ReAct-full (Table 2).
Ablation on WebShop with Qwen3-8B shows: current observation alone achieves 1.2% SR; adding one-step context raises it to 5.6%; adding the state block raises it to 23.6%; SFT training raises it to 62.2%; full SFT+RL reaches 76.8% (Table 3).
Removing the state block from the trained full method drops WebShop SR by 15.6 points (76.8% → 61.2%), confirming the model uses the state block as execution state rather than memorizing trajectories (Table 3).
Using only terminal task reward (no subgoal shaping) during RL drops WebShop SR by 4.6 points (76.8% → 72.2%); each reward component (progress, error penalty, step cost) contributes incrementally (Fig 3).
On SciWorld, the method achieves a score of 79.7 with Qwen3-8B, exceeding the previously strongest published result of 75.9 from EMPO2, while all training fits on a single A100 80GB GPU vs. multi-H100 setups used by some baselines (Table 4).
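The token-reduction factors quoted above can be rechecked with a few lines of arithmetic. The sketch below is only a sanity check on the figures already listed in these findings (Qwen3-8B, Table 2); the dictionary layout and function names are illustrative and not from the paper.

```python
# Token figures as quoted in the key findings above (Qwen3-8B, Table 2).
PROMPT_TOKENS_PER_TURN = {
    # benchmark: (ReAct-full, ReAct-1step, proposed)
    "ALFWorld": (1310, 380, 184),
    "WebShop": (3093, 1059, 488),
    "SciWorld": (1938, 1481, 496),
}
TOTAL_TOKENS_PER_EPISODE = {
    # benchmark: (ReAct-full, proposed)
    "ALFWorld": (34199, 3251),
    "WebShop": (46531, 3353),
    "SciWorld": (42280, 17894),
}


def reduction(baseline: float, proposed: float) -> float:
    """How many times fewer tokens the proposed method uses."""
    return baseline / proposed


def summarize() -> dict:
    """Reduction factors per benchmark, rounded to one decimal place."""
    rows = {}
    for name, (full, one_step, ours) in PROMPT_TOKENS_PER_TURN.items():
        ep_full, ep_ours = TOTAL_TOKENS_PER_EPISODE[name]
        rows[name] = {
            "per_turn_vs_full": round(reduction(full, ours), 1),
            "per_turn_vs_1step": round(reduction(one_step, ours), 1),
            "per_episode_vs_full": round(reduction(ep_full, ep_ours), 1),
        }
    return rows


if __name__ == "__main__":
    for benchmark, factors in summarize().items():
        print(benchmark, factors)
```

The per-turn factors land between roughly 2x and 7x, which matches the range stated in the TL;DR.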

Threat model

n/a — this is not a security paper in the adversarial ML sense. The privacy motivation is architectural: the adversary is implicitly any external cloud API provider or third-party service that receives intermediate agent state during task execution (e.g., email contents, calendar entries, document fragments passed as observations). The proposed defense is structural — by keeping the base model frozen and local, and bounding the inference context to a compact deterministic state block rather than raw sensitive data, less sensitive intermediate state needs to be transmitted. However, no formal threat model is stated, no information-theoretic or empirical privacy guarantee is provided, and the paper does not analyze whether the state block itself could leak sensitive information.

Methodology — deep read

Threat model and assumptions. This is not a security paper in the adversarial ML sense, but the framing establishes a practical threat model around privacy and cost. The adversary is implicitly a cloud API provider or any external service that receives intermediate agent state (emails, calendar events, tool outputs). The key assumption is that for recurring personal workflows, a modest set of successful execution trajectories is available — either from user demonstrations or from prior successful agent runs — and that these trajectories are sufficient to train a task-family LoRA adapter. The framework assumes a fixed, known action space and a text-in/text-out environment interface. There is no adversarial perturbation of observations or rewards; the threat is data leakage and token cost, not active manipulation.

Data: provenance, size, splits, and preprocessing. Three benchmarks are used. ALFWorld provides household navigation and manipulation tasks; evaluation uses 140 seen and 134 unseen games. WebShop is a text-mode product search and purchase benchmark; evaluation uses 500 test goals from a 1,000-product setting. SciWorld provides procedural science-lab tasks; evaluation uses 194 seen and 211 unseen variants following the ETO split structure. Training trajectories per task family are drawn from successful expert executions (exact counts are deferred to Appendix 7.1, which is truncated in the provided text). Preprocessing replays each successful trajectory through the deterministic tracker to produce step-level (input, action) pairs; no LLM summarization is used during this stage. The inputs are formatted as Format(g, o_t, (o_{t-1}, a_{t-1}), b_t), where b_t is the tracker-rendered state block. The expert action a*_t is the target output: raw executable environment action text, not a reasoning trace.
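The replay step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt template inside `format_input`, the `update`/`render` callable interface, and the toy tracker are all assumptions introduced here, since the summary only gives the abstract form Format(g, o_t, (o_{t-1}, a_{t-1}), b_t).

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]
Step = Tuple[str, str]  # (observation o_t, expert action a*_t)


def format_input(goal: str, obs: str, prev: Optional[Step], state_block: str) -> str:
    """Hypothetical template for Format(g, o_t, (o_{t-1}, a_{t-1}), b_t)."""
    prev_obs, prev_action = prev if prev is not None else ("none", "none")
    return (
        f"Goal: {goal}\n"
        f"Previous observation: {prev_obs}\n"
        f"Previous action: {prev_action}\n"
        f"State: {state_block}\n"
        f"Observation: {obs}\n"
        "Action:"
    )


def build_sft_pairs(
    goal: str,
    trajectory: List[Step],
    update: Callable[[State, Optional[str], str], State],
    render: Callable[[State], str],
) -> List[Tuple[str, str]]:
    """Replay one successful trajectory through a tracker to get (x_t, a*_t) pairs."""
    state: State = {}
    prev: Optional[Step] = None
    pairs = []
    for obs, expert_action in trajectory:
        prev_action = prev[1] if prev is not None else None
        state = update(state, prev_action, obs)  # deterministic, no LLM call
        pairs.append((format_input(goal, obs, prev, render(state)), expert_action))
        prev = (obs, expert_action)  # only one step of history is carried forward
    return pairs


def toy_update(state: State, prev_action: Optional[str], obs: str) -> State:
    """Stand-in tracker rule (assumption): count the actions taken so far."""
    new_state = dict(state)
    new_state["steps_taken"] = state.get("steps_taken", 0) + (0 if prev_action is None else 1)
    return new_state


def toy_render(state: State) -> str:
    """Render the state as a compact one-line block."""
    return "{" + ", ".join(f"{k}: {v}" for k, v in sorted(state.items())) + "}"
```

Each input carries only the goal, the previous observation and action, the current observation, and the state block, so its length does not depend on how many steps came before.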

Architecture and novel components. The base model (Qwen3-8B, Qwen3-4B, or Llama-3.1-8B) is kept entirely frozen. Per task family, a LoRA adapter (rank r << min(d_in, d_out)) is trained, updating approximately 2% of parameters and occupying 0.5–0.7 GB on disk. The LoRA parameterization is standard: W_k = W_0 + (α/r) * B_LoRA * A_LoRA. The novel component is the deterministic task tracker: a lightweight, hand-coded algorithmic module (not an LLM) that maintains a structured state m_t, updated by parsing observations and actions via environment-specific rules (Algorithm 1). The tracker renders a compact state block b_t containing only the fields relevant to the current phase/subgoal — e.g., in ALFWorld: target object, holding status, destination receptacle, checked locations; in WebShop: current query, inspected product, selected options, remaining options, purchase readiness; in SciWorld: current phase, selected entity, inventory, answer room, completed subgoals. This is the key architectural novelty: the state block is deterministic, reproducible, and costs no inference-time LLM calls, yet it provides the policy with sufficient decision-relevant context to avoid needing the full history.
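The LoRA parameterization quoted above, W = W_0 + (α/r) · B · A, can be made concrete with a small dependency-free sketch. The matrix shapes, the helper names, and the example dimensions are assumptions for illustration; the paper's actual rank, scaling, and target modules are in its truncated appendix, so the trainable fraction computed here is not meant to reproduce the reported ~2%.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w0: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W_0 + (alpha / r) * B @ A, with B of shape d_out x r and A of shape r x d_in."""
    r = len(a)  # rank equals the number of rows of A
    scale = alpha / r
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]


def lora_param_fraction(d_out: int, d_in: int, r: int) -> float:
    """Trainable adapter parameters relative to one frozen d_out x d_in matrix."""
    return r * (d_in + d_out) / (d_in * d_out)


if __name__ == "__main__":
    w0 = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
    b = [[1.0], [2.0]]             # d_out x r, trainable
    a = [[3.0, 4.0]]               # r x d_in, trainable
    print(lora_effective_weight(w0, b, a, alpha=2.0))
    print(lora_param_fraction(4096, 4096, 16))
```

Because only B and A are trained, W_0 is never modified, which is why a new task family can be added as a separate adapter without touching the weights that other adapters depend on.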

Training regime. Stage 1 (SFT): standard next-token prediction on step-level (x_t, a*_t) pairs with cross-entropy loss (Eq. 5). Stage 2 (RL): GRPO-style group-normalized policy gradient (Eq. 8–9). For each task instance, K=4 rollouts are sampled from the current policy. Step-level discounted returns G^(i)_t are computed and normalized across all steps from all K rollouts of the same task instance to obtain advantages A^(i)_t. The RL loss (Eq. 9) combines a policy gradient term with a KL-like penalty to the frozen SFT adapter (not the original base model), keeping RL close to the procedural behavior learned during SFT. The reward is r_t = r_env_t + r_prog_t − r_err_t, where r_env is the benchmark success/score signal, r_prog rewards subgoal advancement, and r_err penalizes invalid, repetitive, or phase-inconsistent actions and includes a step cost. The reward rules were drafted offline using GPT-5.5 prompted with the task specification and state-block schema (Appendix 7.5), then made fully deterministic; no LLM judge is called at rollout time. All training runs fit on a single NVIDIA A100 80GB GPU. Stochastic inference uses temperature 0.4 and top-p 0.95, while token-cost accounting uses deterministic decoding. Results are reported as mean±std over three inference seeds. Full hyperparameters are in Appendix 7.1 (truncated in the provided text).
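The group-normalized advantage computation described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Eq. 8: the discount factor, the epsilon term, and the use of the population standard deviation are choices made here, since the summary does not give those details.

```python
from statistics import mean, pstdev
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one rollout."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def group_normalized_advantages(
    group_rewards: List[List[float]],
    gamma: float = 0.99,  # assumed value, not taken from the paper
    eps: float = 1e-8,    # assumed stabilizer for zero-variance groups
) -> List[List[float]]:
    """Normalize step-level returns over all steps of all rollouts of one task instance."""
    group_returns = [discounted_returns(rewards, gamma) for rewards in group_rewards]
    flat = [g for rollout in group_returns for g in rollout]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(g - mu) / (sigma + eps) for g in rollout] for rollout in group_returns]


if __name__ == "__main__":
    # K = 4 rollouts of the same task instance, with made-up per-step rewards.
    rollouts = [[0.0, 0.0, 1.0], [0.0, 0.0], [0.2, 0.0, 0.0, 1.0], [0.0]]
    for advantages in group_normalized_advantages(rollouts, gamma=0.9):
        print([round(a, 3) for a in advantages])
```

Normalizing within the group means a rollout is credited only for doing better than its siblings on the same task, so no separate value network is needed to supply a baseline.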

Evaluation protocol. Performance is reported as mean±std over three inference seeds (Table 1). Baselines serve two purposes: (a) controlled token-cost comparison against ReAct-1step (latest obs+action only) and ReAct-full (full truncated history) run under identical tokenizer, split, and action parser; (b) performance comparison against published state-of-the-art agent-training results copied from original papers (Table 4), with the caveat that these are not fully controlled (different training data, objectives, implementations). Ablation is conducted on WebShop with Qwen3-8B across prompt design variants (no training) and trained leave-one-out variants (Table 3), plus reward component decomposition (Fig 3). A unified single-module variant across all ALFWorld families is also evaluated as a modularity check (74.3%/70.9% SFT, 86.4%/88.1% RL). Data efficiency on SFT is studied in Appendix 7.3 (truncated).

Concrete end-to-end example (ALFWorld). A task instruction g specifies 'put a clean lettuce in countertop.' The tracker initializes m_0 with target=lettuce, destination=countertop, phase=search. At step t=1, observation o_1 = 'You are in the kitchen. You see: fridge, countertop, sink.' The tracker renders b_1: {target: lettuce, holding: False, destination: countertop, checked: [], phase: search}. The input is x_1 = Format(g, o_1, (∅,∅), b_1): no history, just the current observation plus the state block. The LoRA-augmented frozen model generates a_1 = 'open fridge.' At t=2, o_2 = 'You open the fridge. You see: lettuce, tomato.' The tracker updates checked=[fridge] and target_visible=True, so b_2 now reflects visibility. The model generates a_2 = 'take lettuce from fridge.' The state block b_t provides a progress signal (e.g., holding=True after pick-up) without the agent needing to re-read the full history. During RL, r_prog fires when the agent correctly advances the subgoal (e.g., picks up the target), and r_err fires on a phase-inconsistent action (e.g., 'go to sink' once the lettuce is already clean and the agent should be heading to the countertop).
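The walkthrough above can be replayed with a toy rule-based tracker. This is a sketch under assumptions, not the paper's Algorithm 1: the regular expressions, the `clean` phase name, and the choice to show `target_visible` only when it is true are all hypothetical. Only the field names and the example observations come from the text.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AlfworldState:
    target: str
    destination: str
    holding: bool = False
    target_visible: bool = False
    checked: List[str] = field(default_factory=list)
    phase: str = "search"


def visible_objects(observation: str) -> List[str]:
    """Parse the comma-separated object list after 'You see:' (empty if absent)."""
    _, _, tail = observation.partition("You see:")
    return [name.strip(" .") for name in tail.split(",") if name.strip(" .")]


def update(state: AlfworldState, prev_action: Optional[str], observation: str) -> AlfworldState:
    """Apply hand-written rules to the previous action and the new observation."""
    if prev_action is not None:
        opened = re.fullmatch(r"open (\w+)", prev_action)
        if opened and opened.group(1) not in state.checked:
            state.checked.append(opened.group(1))
        taken = re.fullmatch(r"take (\w+) from (\w+)", prev_action)
        if taken and taken.group(1) == state.target:
            state.holding = True
            state.phase = "clean"  # hypothetical next phase for a 'clean' task
    state.target_visible = state.target in visible_objects(observation)
    return state


def render(state: AlfworldState) -> str:
    """Render the compact state block b_t as a single line."""
    fields = [
        f"target: {state.target}",
        f"holding: {state.holding}",
        f"destination: {state.destination}",
        f"checked: [{', '.join(state.checked)}]",
    ]
    if state.target_visible:
        fields.append("target_visible: True")
    fields.append(f"phase: {state.phase}")
    return "{" + ", ".join(fields) + "}"


if __name__ == "__main__":
    s = AlfworldState(target="lettuce", destination="countertop")
    print(render(update(s, None, "You are in the kitchen. You see: fridge, countertop, sink.")))
    print(render(update(s, "open fridge", "You open the fridge. You see: lettuce, tomato.")))
```

The tracker holds a fixed set of fields and, apart from the short list of checked locations, nothing that grows with the episode, so the rendered block stays compact at every step.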

Reproducibility. The paper reports that all training fits on a single A100 80GB GPU. Hyperparameters, decoding settings, and compute details are in Appendix 7.1 (truncated in the provided text). Code release, frozen weights, and dataset availability are not explicitly mentioned in the provided text, so their status is unclear.

Technical innovations

Deterministic task tracker as a non-LLM state-compression module: unlike prior approaches (e.g., ReAct-style history replay or memory-augmented agents such as MemGen) that either replay full histories or use LLM summarizers, the tracker uses environment-specific parsing rules to produce a reproducible, constant-size state block at zero inference-time LLM cost.
Context-to-weights formulation: recurring procedural knowledge is moved from prompt-time text (as in ReAct-style skill prompts and retrieved memory agents) into the weights of per-task-family LoRA adapters, bounding prompt length independently of episode length (Eq. 2).
Tracker-aligned subgoal reward design: RL rewards are defined over the same state-block fields rendered by the tracker, creating a tight alignment between the policy's observation interface and its reward signal and avoiding the need for an LLM judge at rollout time (in contrast to methods such as ETO or GiGPO, which may use preference-based or environment-terminal signals).
Modular SFT+RL training pipeline with frozen base: each task family trains an independent LoRA adapter on top of a permanently frozen backbone, enabling addition of new skills without catastrophic forgetting and keeping per-adapter cost to ~0.5–0.7 GB, viable for local deployment.
Offline LLM-assisted reward rule generation followed by deterministic rollout execution: GPT-5.5 is used once offline to draft deterministic reward rules from the state-block schema; no LLM judge is queried during the RL training loop, making RL training cost-stable and reproducible.
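The tracker-aligned reward idea in the list above can be illustrated with a deterministic rule over state-block fields. Everything concrete here is assumed for the sake of the sketch: the `completed` and `phase` field layout, the per-phase verb whitelist, and the bonus, penalty, and step-cost magnitudes are hypothetical, and the paper's actual rules are in its Appendix 7.5.

```python
from typing import Dict, Mapping, Sequence


def step_reward(
    prev_block: Mapping[str, object],
    next_block: Mapping[str, object],
    action: str,
    recent_actions: Sequence[str],
    allowed_verbs: Mapping[str, Sequence[str]],
    env_reward: float,
    progress_bonus: float = 0.2,  # assumed magnitude
    error_penalty: float = 0.1,   # assumed magnitude
    step_cost: float = 0.01,      # assumed magnitude
) -> Dict[str, float]:
    """Compute r_t = r_env + r_prog - r_err from two consecutive state blocks."""
    # Progress: subgoals that became complete during this step.
    newly_done = set(next_block["completed"]) - set(prev_block["completed"])
    r_prog = progress_bonus * len(newly_done)

    # Error: repeated action, or a verb not permitted in the current phase.
    verb = action.split(" ", 1)[0]
    phase_ok = verb in allowed_verbs.get(prev_block["phase"], ())
    repeated = action in recent_actions
    r_err = step_cost + (error_penalty if (repeated or not phase_ok) else 0.0)

    return {
        "env": env_reward,
        "prog": r_prog,
        "err": r_err,
        "total": env_reward + r_prog - r_err,
    }


if __name__ == "__main__":
    allowed = {"search": ("go", "open", "take"), "place": ("go", "put")}
    before = {"phase": "search", "completed": []}
    after = {"phase": "place", "completed": ["holding_target"]}
    print(step_reward(before, after, "take lettuce from fridge", [], allowed, 0.0))
```

Since the function reads only tracker fields and the action string, the same rollout always receives the same reward, which is the property that removes the need for an LLM judge inside the RL loop.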

Datasets

ALFWorld — 140 seen + 134 unseen evaluation games — public benchmark (Shridhar et al.)
WebShop — 500 test goals from 1,000-product text-mode setting — public benchmark (Yao et al.)
SciWorld — 194 seen + 211 unseen variants, ETO split — public benchmark (Wang et al.)

Baselines vs proposed

ReAct-1step (Qwen3-8B, ALFWorld unseen): prompt tokens/turn = 358.6 (SR not directly reported in Table 2) vs proposed: prompt tokens/turn = 179.9
ReAct-full (Qwen3-8B, ALFWorld unseen): prompt tokens/turn = 1,320.4, total tokens/episode = 34,145 vs proposed SFT+RL: 179.9 tokens/turn, 2,851 total tokens/episode
ReAct-full (Qwen3-8B, WebShop): prompt tokens/turn = 3,092.7, total tokens/episode = 46,531 vs proposed SFT+RL: 488.0 tokens/turn, 3,353 total tokens/episode
ETO (ALFWorld): SR = 72.4% vs proposed (Qwen3-8B): SR = 89.6%
GiGPO (ALFWorld): SR = 90.8% vs proposed (Qwen3-8B): SR = 89.6%
GiGPO (WebShop): SR = 72.8% vs proposed (Qwen3-8B): SR = 76.8%
EMPO2 (WebShop): SR = 76.9% vs proposed (Qwen3-8B): SR = 76.8%
EMPO2 (SciWorld): score = 75.9 vs proposed (Qwen3-8B): score = 79.7
MemGen (ALFWorld): SR = 90.6% vs proposed (Qwen3-8B): SR = 89.6%
Early Exp.: ALFWorld SR = 85.9%, WebShop SR = 62.2%, SciWorld score = 68.0 vs proposed (Qwen3-8B): 89.6%, 76.8%, 79.7 respectively
WebShop ablation — current obs only (no training): SR = 1.2% vs proposed SFT+RL: SR = 76.8% (Table 3)
WebShop ablation — w/o state block (trained, RL): SR = 61.2% vs proposed full method: SR = 76.8% (Table 3)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.05413.

Fig 1: Context-to-weights skill learning pipeline.

Fig 2 (page 4).

Fig 3 (page 4).

Fig 4 (page 4).

Fig 5 (page 4).

Fig 6 (page 4).

Fig 7 (page 4).

Fig 8 (page 4).

Limitations

The deterministic tracker must be hand-engineered per task family, requiring environment-specific parsing rules; this is a significant engineering cost that limits out-of-the-box generalization to novel or open-ended environments where subgoal structure is not pre-defined.
Evaluation is restricted to three text-mode benchmarks (ALFWorld, WebShop, SciWorld); there is no evaluation on real browser automation, GUI agents, or code execution environments where observations are noisier and task structure is less regular.
The comparison with published baselines (Table 4) is explicitly noted as uncontrolled — methods differ in training data quantity, trajectory construction, model checkpoints, and implementation details, making direct attribution of performance differences unreliable.
Privacy claims (the core motivation) are not empirically evaluated; there is no experiment demonstrating that the bounded-context interface actually prevents sensitive data leakage compared to cloud API approaches, nor any analysis of what information might leak through the state block itself.
The RL reward rules are drafted with GPT-5.5 offline; the paper does not report sensitivity analysis on reward design choices or how much human iteration was required, making the reward engineering burden unclear and potentially non-trivial for new task families.
Code release, model weights, and tracker implementations are not explicitly confirmed as publicly available in the provided text, limiting immediate reproducibility despite the detailed algorithmic descriptions.
The modular per-family adapter design scales linearly in the number of task families (0.5–0.7 GB per adapter); for personal agents with many heterogeneous workflows, this could become impractical on resource-constrained local devices, and this scaling regime is not studied.

Open questions / follow-ons

Can the deterministic tracker be replaced or bootstrapped with a learned or LLM-assisted component for task families where subgoal structure is unknown a priori, without sacrificing the reproducibility and zero-inference-cost properties that make it practical?
How does constant-context skill learning behave under distribution shift at inference time — e.g., when a personal workflow changes (new UI layout, new product categories, new lab protocols) — and can the LoRA adapter be efficiently updated incrementally from a small number of new demonstrations?
The privacy claim motivating the work is not empirically evaluated: does the compact state block representation actually reduce the sensitivity of transmitted information compared to raw observation histories, and can formal differential privacy or information-theoretic bounds be established?
What is the minimum number of successful expert trajectories needed to train a functional SFT module across different task families (Appendix 7.3 addresses this for WebShop but is truncated), and does the answer change significantly for real-world workflows that are noisier and less structured than benchmark tasks?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this paper is most directly relevant as a capability benchmark for automated web agents. The WebShop results (76.8% success rate on a text-mode product search and purchase task) represent a concrete, measurable capability level for a local 8B-parameter model that operates under a bounded-context interface. The fact that this is achieved with ~488 prompt tokens per turn and ~3,353 total tokens per episode, running entirely on a single A100 GPU, is a meaningful signal: the cost and complexity bar for deploying capable web automation agents continues to fall. The method's LoRA adapter approach means that a task-family module for, e.g., 'navigate and complete a purchase flow' or 'solve a multi-step form' could be trained from a modest set of demonstrations and deployed locally without cloud API dependencies — making behavioral fingerprinting based on API provider characteristics less reliable as a detection signal.

The deterministic state tracker is also worth noting from a detection perspective. Unlike history-replaying ReAct agents that produce verbose, predictable prompt structures, or memory-augmented agents that issue retrieval calls, this architecture produces short, structurally uniform inputs at every step. This changes the observable interaction pattern: fewer tokens per action, more consistent action latency (no growing context to process), and no retrieval side-channels. Bot-defense systems that rely on timing anomalies, unusually short dwell times correlated with long LLM inference, or characteristic verbose reasoning traces in network traffic would need to adapt. Conversely, the rigid deterministic state block structure — and the fact that tracker rules are hand-engineered per task family — may introduce detectable behavioral regularities specific to this architecture, such as stereotyped action sequences within a subgoal phase that differ from human browsing variance.

Cite

bibtex

@article{arxiv2605_05413,
  title={From History to State: Constant-Context Skill Learning for LLM Agents},
  author={Haoyang Xie and Xinyuan Wang and Yancheng Wang and Puda Zhao and Feng Ju},
  journal={arXiv preprint arXiv:2605.05413},
  year={2026},
  url={https://arxiv.org/abs/2605.05413}
}
