SkillOS: Learning Skill Curation for Self-Evolving Agents

Source: arXiv:2605.06614 · Published 2026-05-07 · By Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang, Bhavana Dalvi Mishra et al.

TL;DR

SkillOS addresses a fundamental limitation of LLM-based agents deployed in streaming settings: they solve tasks in isolation and discard all experience afterward. The paper frames this as a skill curation problem — how to distill, store, update, and delete reusable procedural knowledge (skills) from past trajectories so that a frozen agent executor can perform better on future related tasks. The core bottleneck identified is that prior approaches either require manual skill authoring (Anthropic's SKILL.md repository), use heuristic rules for skill operations without downstream feedback, or only optimize over short task horizons where complex operations like skill deletion and update receive almost no learning signal.

SkillOS proposes a modular two-agent architecture: a frozen executor (Qwen3-8B during training, swappable at test time) that retrieves and applies skills via BM25, and a trainable skill curator (also Qwen3-8B) that performs insert/update/delete operations on an external Markdown-based SkillRepo after each task. The key training recipe involves grouping related tasks by annotated skill-relevant attributes so that curation decisions made on early tasks in a group are evaluated by their downstream impact on later tasks in the same group — creating a longer-horizon credit assignment signal. A composite reward combining task success, function call validity, skill content quality (judged by Qwen3-32B), and repository compression is used to train the curator via GRPO.

Across ALFWorld, WebShop, AIME24, AIME25, and GPQA-Diamond, SkillOS consistently outperforms memory-free and memory-based baselines (ReasoningBank, MemP) in both success rate and interaction efficiency. The trained 8B curator generalizes to unseen executors including Qwen3-32B and Gemini-2.5-Pro at test time, and notably outperforms using Gemini-2.5-Pro directly as the curator, suggesting that executor-grounded RL training produces more compatible skill curation behavior than raw model scale.

Key findings

On ALFWorld with Qwen3-8B executor, SkillOS achieves 61.2% average success rate vs. 55.7% for the strongest baseline ReasoningBank — a +9.8% relative improvement — while reducing average interaction steps from 21.1 (No Memory) to 18.9.
On ALFWorld with Gemini-2.5-Pro executor, SkillOS (Qwen3-8B curator) achieves 80.2% SR vs. 66.4% No Memory baseline, a +13.8pp absolute gain; SkillOS-gemini (Gemini-2.5-Pro curator) achieves 79.3%, meaning the smaller RL-trained 8B curator outperforms the frontier model curator.
On WebShop with Qwen3-32B executor, SkillOS achieves a score of 49.2 and SR of 16.5% vs. 41.5/12.2% for No Memory, with fewer steps (15.9 vs. 17.0), showing simultaneous effectiveness and efficiency gains.
On single-turn reasoning tasks (AIME24+AIME25+GPQA-Diamond) with Gemini-2.5-Pro executor, SkillOS reaches 88.6% average accuracy vs. 81.8% No Memory and 83.5% ReasoningBank: a +6.8pp gain over No Memory and a +5.1pp gain over the ReasoningBank memory baseline.
Ablation on ALFWorld shows that removing grouped task streams (w/o grouping) causes the single largest performance drop: SR falls from 61.2% to 57.3% and steps increase from 18.9 to 20.6, indicating that long-horizon skill dependency modeling is the most critical training design decision among those ablated.
Removing the content quality reward (w/o r_cnt) drops SR from 61.2% to 58.6%; removing the compression reward (w/o r_comp) drops SR to 60.0% — both auxiliary rewards contribute independently.
Cross-task generalization (Fig. 3) shows that reasoning-trained curators transfer positively to agentic tasks (+3.6pp to +7.4pp), while WebShop/ALFWorld-trained curators transfer less strongly to other domains, suggesting that abstract reasoning skills are more domain-agnostic than environment-specific procedural skills.
Training behavioral analysis (Fig. 4) shows that insert operations dominate early training while update operations increase over time, indicating the curator evolves from repository population to skill refinement — a qualitatively sensible progression that was not explicitly supervised.

Methodology — deep read

Threat model and problem framing: There is no adversarial threat model in the security sense. The problem is instead framed as a learning problem under delayed and indirect feedback: a skill curator must decide how to update a shared skill repository after each task, but the quality of a curation decision is only revealed through the executor's performance on future related tasks. The curator has access to the executor's trajectory, a self-judged correctness signal, and the retrieved subset of currently relevant skills. The executor is frozen and cannot be fine-tuned — all learning is concentrated in the curator.

Data provenance and preprocessing: For agentic tasks, training and test splits from ALFWorld (Shridhar et al., 2021) and WebShop (Yao et al., 2022) are used. For reasoning, 33,000 samples are randomly drawn from DeepMath-103k (He et al., 2026a). Task grouping is a critical preprocessing step: each task instance is annotated with skill-relevant tags (e.g., 'algebra', 'Fourier transformation' for math; task type labels for ALFWorld) using Gemini-2.5-Pro as a tagger. Tasks are then partitioned into groups G_m where members share high tag similarity, serving as a proxy for skill dependency. The detailed grouping algorithm is in Appendix B.2 (not reproduced in the truncated text). Group size and the exact similarity threshold are not explicitly stated in the main paper.
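The grouping step can be illustrated with a small sketch. The paper's actual algorithm is in Appendix B.2 and is not reproduced here, so everything below is an assumption made for illustration: the Jaccard similarity measure, the greedy seed-based assignment, the 0.5 threshold, and the names `jaccard` and `group_tasks`.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two tag sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_tasks(task_tags: Dict[str, Set[str]], threshold: float = 0.5) -> List[List[str]]:
    """Greedily place each task in the first group whose seed task is similar enough.

    Hypothetical stand-in for the paper's grouping algorithm (Appendix B.2).
    """
    groups: List[List[str]] = []
    for task_id, tags in task_tags.items():
        for group in groups:
            seed_tags = task_tags[group[0]]  # first member acts as the group seed
            if jaccard(tags, seed_tags) >= threshold:
                group.append(task_id)
                break
        else:
            groups.append([task_id])  # no similar group found: start a new one
    return groups


# Illustrative tags only; in the paper they come from a Gemini-2.5-Pro tagger.
example_tags = {
    "t1": {"algebra", "polynomials"},
    "t2": {"algebra", "polynomials", "inequalities"},
    "t3": {"fourier transformation", "integrals"},
    "t4": {"fourier transformation", "integrals"},
}
groups = group_tasks(example_tags)
print(groups)
```

Each resulting group plays the role of one G_m: tasks inside it are executed in sequence so that early curation decisions can be scored by later task outcomes.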

Architecture: SkillOS uses a modular two-agent design. The agent executor π_L (frozen Qwen3-8B during training) takes a task, current environment observation, and a BM25-retrieved subset of the SkillRepo as input, and produces actions using the ReAct framework for agentic tasks and chain-of-thought for reasoning tasks. Skills are stored as Markdown files with YAML frontmatter (name, description/trigger condition) and freeform Markdown body (workflows, constraints, heuristics). The skill curator π_S (trainable Qwen3-8B) takes as input the executor's trajectory ξ_t, a binary self-judged correctness indicator 1_{ξ_t}, and retrieved skills S̃_t, then outputs a sequence of structured function calls from {insert_skill, update_skill, delete_skill}. Each function call is executed against the SkillRepo file system. The curator is the sole trainable component; its outputs are discrete tool-call sequences over a variable number of skill operations per task.
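A minimal sketch of the skill representation and the three curator operations follows. It keeps skills in memory rather than on a file system, and the argument names, the boolean return values (used here as a simple validity signal), and the example skill text are assumptions for illustration, not the paper's API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Skill:
    name: str
    description: str  # trigger condition, stored in the YAML frontmatter
    body: str         # freeform Markdown: workflows, constraints, heuristics

    def to_markdown(self) -> str:
        """Render the skill as a Markdown file with YAML frontmatter."""
        return (
            "---\n"
            f"name: {self.name}\n"
            f"description: {self.description}\n"
            "---\n"
            f"{self.body}\n"
        )


class SkillRepo:
    """In-memory stand-in for the file-system-backed SkillRepo (assumption)."""

    def __init__(self) -> None:
        self.skills: Dict[str, Skill] = {}

    def insert_skill(self, name: str, description: str, body: str) -> bool:
        if name in self.skills:
            return False  # treated as an invalid call: the skill already exists
        self.skills[name] = Skill(name, description, body)
        return True

    def update_skill(self, name: str, body: str) -> bool:
        if name not in self.skills:
            return False  # cannot update a skill that is not in the repo
        self.skills[name].body = body
        return True

    def delete_skill(self, name: str) -> bool:
        return self.skills.pop(name, None) is not None


repo = SkillRepo()
calls = [
    repo.insert_skill("clean_object_workflow", "Use when a task asks to clean an object",
                      "# Workflow\n1. Locate sink"),
    repo.update_skill("clean_object_workflow", "# Workflow\n1. Locate sink\n2. Rinse object"),
    repo.delete_skill("missing_skill"),  # fails: no such skill
]
print(repo.skills["clean_object_workflow"].to_markdown())
```

Returning a success flag per call makes it easy to compute a fraction of valid function calls over a curator output, which is one plausible reading of the validity signal described in the training regime.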

Training regime: Policy optimization uses GRPO (Shao et al., 2024) with group size N=8 rollouts per task group. Within each rollout, the executor runs sequentially through all tasks in the group, and the curator updates the SkillRepo after each task — so different rollouts develop different SkillRepo histories. The composite reward r = r_task + λ_f·r_fc + λ_u·r_cnt + λ_c·r_comp is computed over the full group trajectory. Reward weights are: λ_f=1.0 (function call validity), λ_u=0.1 (content quality from Qwen3-32B judge), λ_c=0.05 (compression). r_task is defined as average success over tasks 2..|G| (excluding the first, which has no prior curation). The compression reward is 1 - |S_i|/|χ_i| (ratio of repository size to curator input context size), penalizing verbatim trajectory storage. KL regularization is explicitly dropped from GRPO to encourage exploration. Learning rate is 1×10^-6, batch size 32. Training runs on 16 H100 GPUs using the verl framework: ~3 days for ALFWorld, ~2.5 days for reasoning tasks, ~5 days for WebShop. Seed strategy and number of training steps are not specified in the main text.
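The reward and advantage computation described above can be sketched as follows. The weights and the definitions of r_task and r_comp come from the summary; the [0, 1] scale for r_fc and r_cnt, the function names, and the division by the group standard deviation (the standard GRPO normalization from Shao et al., 2024, which the paper may or may not keep) are assumptions.

```python
from statistics import mean, pstdev
from typing import List

LAMBDA_F, LAMBDA_U, LAMBDA_C = 1.0, 0.1, 0.05  # weights reported in the summary


def task_reward(successes: List[int]) -> float:
    """Average success over tasks 2..|G|; the first task has no prior curation."""
    later = successes[1:]
    return mean(later) if later else 0.0


def compression_reward(repo_tokens: int, context_tokens: int) -> float:
    """1 - |S|/|chi|: a smaller repository relative to curator input scores higher."""
    return 1.0 - repo_tokens / context_tokens


def composite_reward(successes: List[int], r_fc: float, r_cnt: float,
                     repo_tokens: int, context_tokens: int) -> float:
    """r = r_task + lambda_f * r_fc + lambda_u * r_cnt + lambda_c * r_comp."""
    return (
        task_reward(successes)
        + LAMBDA_F * r_fc
        + LAMBDA_U * r_cnt
        + LAMBDA_C * compression_reward(repo_tokens, context_tokens)
    )


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize each rollout's reward within its group.

    No KL penalty appears anywhere, matching the summary's note that it is dropped.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative inputs only (not results from the paper).
r = composite_reward(successes=[0, 1, 1, 0], r_fc=1.0, r_cnt=0.5,
                     repo_tokens=200, context_tokens=1000)
advantages = group_relative_advantages([1.0, 2.0])
print(r, advantages)
```

With λ_f = 1.0 the validity term can contribute as much as the task term, while the content and compression terms act as small shaping bonuses.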

Evaluation protocol: Three runs are conducted per configuration; mean and standard deviation are reported. Effectiveness metrics: success rate (SR) for agentic tasks, accuracy for reasoning. Efficiency metrics: interaction steps per task (agentic) and tokens per problem (reasoning). Baselines include: (i) No Memory (frozen executor, no skill repo), (ii) ReasoningBank (Ouyang et al., 2026; distills reusable insights from trajectories), (iii) MemP (Fang et al., 2025b; heuristic procedural memory management), (iv) SkillOS-base (same architecture, curator not RL-trained), (v) SkillOS-gemini (Gemini-2.5-Pro used directly as curator without RL). Generalization is tested by swapping in Qwen3-32B and Gemini-2.5-Pro as executors at test time after training with the Qwen3-8B executor. Cross-task generalization uses curators trained on one domain and evaluated on another (Fig. 3). Ablations on ALFWorld remove r_cnt, r_comp, and grouped training one at a time (Table 3). There is no held-out adversarial evaluation and no distribution-shift testing beyond cross-domain transfer.

Concrete example end-to-end: For ALFWorld's 'Clean' task type (27 test instances), the curator starts with an empty SkillRepo. After the executor (Qwen3-8B) attempts task x_1 in a group — say, cleaning a mug — and produces a trajectory with a self-judged correctness label, the curator calls insert_skill('clean_object_workflow.md', content='# Workflow\n1. Locate sink...') and update_skill on any retrieved related skills. The updated SkillRepo is indexed; BM25 retrieves the new skill when task x_2 (cleaning a plate) arrives. The executor now conditions on the skill content and avoids redundant environment exploration. The reward for this group is: r_task = average success of x_2 through x_|G|; r_fc = fraction of valid function calls; r_cnt = Qwen3-32B's quality score; r_comp = compression ratio. GRPO computes advantage relative to the mean of 8 parallel rollouts and updates π_S. SkillOS achieves 54.3% SR on Clean vs. 33.3% No Memory and 49.4% ReasoningBank (Qwen3-8B executor), confirming the benefit of learned curation on multi-step procedural tasks.
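The retrieval step in this walkthrough can be sketched with a small from-scratch Okapi BM25 scorer. The whitespace tokenizer, the parameter values k1 = 1.5 and b = 0.75, the skill texts, and the query are illustrative assumptions; the paper states only that BM25 is used.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def bm25_scores(query: str, docs: Dict[str, str],
                k1: float = 1.5, b: float = 0.75) -> Dict[str, float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores: Dict[str, float] = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return scores


# Hypothetical skill texts standing in for SkillRepo contents.
skills = {
    "clean_object_workflow": "clean an object locate sink rinse object place at target",
    "heat_object_workflow": "heat an object locate microwave heat object place at target",
}
scores = bm25_scores("clean a plate and put it on the shelf", skills)
top = max(scores, key=scores.get)
print(top, scores)
```

Here only the cleaning skill shares a term with the query, so it is the one that would be placed in the executor's context for task x_2.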

Technical innovations

Grouped task stream construction for long-horizon skill curation: tasks are clustered by LLM-annotated skill-relevant attributes so that within-group sequential execution creates a credit assignment chain from curation decisions to downstream task outcomes — unlike prior RL-for-skills work (Wang et al., 2025a; Ye et al., 2026) that only optimizes over short local horizons.
Composite reward that decomposes delayed executor feedback into four separable signals (task outcome, function call validity, LLM-judged content quality, compression ratio), enabling GRPO to learn both when and how to insert/update/delete skills rather than collapsing all signal into a single sparse outcome.
Executor-frozen, curator-trainable modular architecture that enables the trained skill curator to transfer across executor backbones (Qwen3-8B, Qwen3-32B, Gemini-2.5-Pro) without retraining, addressing the curator-executor mismatch that occurs when a stronger model curates skills misaligned with a weaker executor's behavior.
Explicit KL term removal in GRPO for the skill curator, justified as encouraging exploration of diverse curation strategies in a space with high combinatorial action complexity (sequences of structured file I/O calls).
Markdown-file skill representation with YAML frontmatter (trigger condition) and freeform body that supports structured evolution — the paper shows qualitatively that skills develop emergent sections (e.g., 'When NOT to Use', 'Prerequisite Constraints') beyond the initially suggested template through RL training.

Datasets

ALFWorld — training + test split from ALFRED benchmark (text-based household tasks, 140 test instances across 6 subtypes) — public (Shridhar et al., 2021)
WebShop — training + test split (online shopping simulation) — public (Yao et al., 2022)
AIME24 — standard competition math benchmark — public
AIME25 — standard competition math benchmark — public
GPQA-Diamond — graduate-level science QA (Rein et al., 2024) — public
DeepMath-103k — 33,000 randomly sampled math reasoning problems used for curator training — public (He et al., 2026a)

Baselines vs proposed

No Memory (Qwen3-8B executor, ALFWorld): SR = 47.9%, Steps = 21.1 vs. SkillOS: SR = 61.2%, Steps = 18.9
ReasoningBank (Qwen3-8B executor, ALFWorld): SR = 55.7%, Steps = 20.1 vs. SkillOS: SR = 61.2%, Steps = 18.9
MemP (Qwen3-8B executor, ALFWorld): SR = 49.7%, Steps = 21.0 vs. SkillOS: SR = 61.2%, Steps = 18.9
SkillOS-base (Qwen3-8B executor, ALFWorld): SR = 53.1%, Steps = 20.4 vs. SkillOS: SR = 61.2%, Steps = 18.9
SkillOS-gemini (Gemini-2.5-Pro curator, Qwen3-8B executor, ALFWorld): SR = 50.7%, Steps = 20.8 vs. SkillOS (Qwen3-8B curator): SR = 61.2%, Steps = 18.9
No Memory (Gemini-2.5-Pro executor, ALFWorld): SR = 66.4%, Steps = 17.7 vs. SkillOS (Qwen3-8B curator): SR = 80.2%, Steps = 14.8
No Memory (Qwen3-8B executor, WebShop): Score = 33.3, SR = 9.8%, Steps = 20.3 vs. SkillOS: Score = 40.6, SR = 16.5%, Steps = 19.4
No Memory (Qwen3-32B executor, WebShop): Score = 41.5, SR = 12.2%, Steps = 17.0 vs. SkillOS: Score = 49.2, SR = 16.5%, Steps = 15.9
No Memory (Gemini-2.5-Pro executor, Reasoning): Avg. Acc = 81.8% vs. SkillOS: Avg. Acc = 88.6%
ReasoningBank (Gemini-2.5-Pro executor, Reasoning): Avg. Acc = 83.5% vs. SkillOS: Avg. Acc = 88.6%
SkillOS-GRPO w/o grouping (ALFWorld ablation): SR = 57.3%, Steps = 20.6 vs. SkillOS full: SR = 61.2%, Steps = 18.9
SkillOS-GRPO w/o r_cnt (ALFWorld ablation): SR = 58.6%, Steps = 20.1 vs. SkillOS full: SR = 61.2%, Steps = 18.9

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.06614.

Fig 1: SkillOS pairs a frozen Agent Executor with a trainable Skill Curator. The executor retrieves

Fig 2: SkillOS training pipeline. Each training step samples a group of related tasks and

Fig 3 (page 2).

Fig 4 (page 2).

Fig 5 (page 2).

Fig 6 (page 2).

Fig 7 (page 2).

Fig 8 (page 2).

Limitations

Task grouping relies on Gemini-2.5-Pro for tag annotation, introducing a dependency on a closed frontier model during data preprocessing — the quality of groups (and thus training signal density) is directly coupled to the tagger's reliability, which is not ablated.
Content quality reward uses Qwen3-32B as a judge during training, adding a non-trivial compute dependency and a potential reward hacking surface where the curator learns to produce text that scores well with the judge but is not actually useful to the executor.
Compression reward is defined as 1 - |S_i|/|χ_i| (repository tokens / context tokens), which may incentivize pathological brevity rather than genuine abstraction; the paper does not show cases where over-compression harmed performance.
The grouped training instances and their construction algorithm are described at a high level only in the main paper (details deferred to Appendix B.2), making the core data pipeline difficult to reproduce without the appendix.
All training uses Qwen3-8B as the executor; while the paper tests generalization to Qwen3-32B and Gemini-2.5-Pro at inference time, the skill curator was never trained with a mismatched executor, so the full extent of the curator-executor mismatch phenomenon is not characterized.
No adversarial or distribution-shifted evaluation: all test tasks come from the same benchmarks as training data (same environment dynamics, same task type distributions), so robustness to novel task categories or adversarially constructed task streams is unknown.
Training compute cost (roughly 2.5–5 days on 16 H100s per benchmark) is substantial for an 8B curator; the paper does not report whether a smaller curator model (e.g., 3B) or fewer training steps would suffice, limiting accessibility for practitioners without large GPU clusters.

Open questions / follow-ons

How does skill quality degrade when the task stream contains distributional shift or adversarially ordered tasks that violate the grouped dependency assumption — does the curator learn to insert conflicting skills or fail to delete stale ones?
The compression reward discourages verbatim copying but does not directly enforce factual accuracy of distilled skills; can the curator hallucinate plausible-sounding but incorrect procedural knowledge that passes the quality judge while degrading executor performance on edge cases?
Reasoning-trained curators transfer better cross-domain than environment-specific curators (Fig. 3) — is this because abstract skills are inherently more transferable, or because the reasoning task grouping produces more generalizable RL training signal? Disentangling these would inform curriculum design.
The paper shows that the curator-executor mismatch exists (Gemini-2.5-Pro curator underperforms 8B RL-trained curator when executor is Qwen3-8B), but does not characterize whether joint training of curator and executor would collapse this gap or introduce new instabilities.

Why it matters for bot defense

For bot-defense engineers, SkillOS is not directly applicable as a defense mechanism, but it is highly relevant as a capability model for what sophisticated bot agents could look like in the near future. A SkillOS-style bot would accumulate reusable procedural skills across sessions (site-specific navigation patterns, CAPTCHA bypass heuristics, behavioral mimicry strategies) and refine them over time rather than re-exploring from scratch. The efficiency gains are particularly notable: on ALFWorld with the Qwen3-8B executor, average interaction steps fall to 18.9 from 21.1 (No Memory) and 20.1 (ReasoningBank), roughly 10% and 6% fewer respectively. A bot with a curated skill library would be harder to detect via interaction-count anomaly signals, since it converges on correct behavior faster and with less exploratory noise. The cross-executor generalization finding is also relevant: a single shared skill repository could be paired with different underlying LLM backends, making attribution and fingerprinting harder.

From a defensive standpoint, the paper's design suggests that task-grouping structure in user behavior logs could be a detection signal: a bot running SkillOS-style curation would exhibit correlated behavioral patterns across sessions that share skill-relevant attributes, potentially detectable as non-independent session clusters. Additionally, the compression reward pushes toward concise, abstracted skill representations rather than raw trajectory replay, so bot behavior would become increasingly stereotyped and procedurally smooth over time, which could be a detectable signature if baseline human variance models are calibrated appropriately. Bot-defense teams should also note that the curator is only an 8B model trained in a few days. The reported runs used 16 H100 GPUs rather than commodity hardware, but the barrier to deploying self-evolving agents against web properties is still lower than frontier-model-centric threat modeling might suggest.

Cite

bibtex

@article{arxiv2605_06614,
  title={SkillOS: Learning Skill Curation for Self-Evolving Agents},
  author={Siru Ouyang and Jun Yan and Yanfei Chen and Rujun Han and Zifeng Wang and Bhavana Dalvi Mishra and Rui Meng and Chun-Liang Li and Yizhu Jiao and Kaiwen Zha and Maohao Shen and Vishy Tirumalashetty and George Lee and Jiawei Han and Tomas Pfister and Chen-Yu Lee},
  journal={arXiv preprint arXiv:2605.06614},
  year={2026},
  url={https://arxiv.org/abs/2605.06614}
}
