OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Source: arXiv:2605.23657 · Published 2026-05-22 · By Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao

TL;DR

This paper addresses the challenge of systematically evaluating the rapidly growing and diverse open-source skill ecosystem designed to augment large language model (LLM) agents. Skills are structured workflow instructions intended to help LLM agents complete complex downstream tasks, but it remains unclear how well different agents and models leverage these skills, how to measure skill quality, and how to advise users on skill selection under cost-performance trade-offs. The authors propose OpenSkillEval, a novel automatic evaluation framework that generates dynamic, realistic task instances across five real-world task categories (presentation generation, web design, poster generation, data visualization, and report generation) using an artifact-driven reverse engineering approach. This allows continuous, evolving evaluation aligned with practical user needs rather than static benchmarks.

Using over 600 task instances and 30 popular open-source skills, OpenSkillEval systematically assesses multiple state-of-the-art models and agent frameworks, analyzing both skill usage trajectories during execution and the quality of final outputs. The key results show that making a skill available does not ensure an agent uses it effectively; the benefit of skill augmentation depends strongly on the underlying model's capabilities and on the agent framework. Many popular skills do not consistently outperform baseline agents without any skills, and skill-augmented agents often read skills selectively and follow them only partially. These findings highlight the need for dynamic, realistic evaluation and informed skill deployment in autonomous LLM agents.

Key findings

Agents explicitly read the provided SKILL.md file in only about 48% of cases under default settings; strong models like Claude Opus 4.6 read skills only ~20% of the time (Fig 3a).
Explicitly forcing skill usage via prompt increases the skill read rate to 94%, with earlier access on average (Fig 3a).
Even after reading skills, agents skip on average ~20-30% of prescribed skill steps and sometimes contradict instructions (Fig 3b).
Skill usage varies substantially by task category; for example, presentation generation skill read rates exceed 95%, but data visualization and report generation remain below 50%.
Claude Code (Claude 4.6) and GPT-5.5 (Codex framework) show the highest average task scores (4.4-4.6/5), outperforming other agents across task domains (Table 1).
Presentation and poster generation are the most challenging tasks, particularly on visual design dimensions, which score below 4.0 on average.
Front-end web design tasks achieve relatively strong results across agents, especially for navigation and interaction completeness (scores ~4.7-4.9), but responsive design remains weak.
Many publicly popular open-source skills do not reliably improve performance over base agents without skills, sometimes adding execution cost without quality gains.

Threat model

n/a. The paper does not model a security adversary. It studies an autonomous agent environment in which the challenge is whether agents recognize and correctly follow provided skills when completing tasks in realistic, open-ended, and noisy contexts.

Methodology — deep read

Threat Model & Assumptions: No adversary is modeled, since this is an evaluation-framework paper. The implicit challenge is an autonomous agent working on open-ended downstream tasks with access to a set of skills. The framework measures how well agents recognize, access, and accurately follow skills in realistic, evolving task environments that reflect practical user needs.
Data Provenance, Size, Labels, Splits: OpenSkillEval covers five downstream task categories with 677 total dynamically generated task instances derived from real-world artifacts. Artifacts and source materials are collected from publicly accessible domains such as academic papers, websites (Awwwards), open data portals (Our World in Data, Kaggle), and industry reports. No static benchmark or reference outputs exist; instead, task specifications are automatically inferred from artifacts using LLMs. Validation uses a verifier LLM to ensure task coherence and consistency.
Architecture / Algorithm: The evaluation pipeline consists of: (a) Automatic test case generation via artifact-driven backward task extraction using LLMs (e.g., Claude-4.6-Opus, GPT-5.2) that produce task specifications (T), natural language instructions (I), and source packages (S). (b) Skills collection from multiple public skill repositories, filtered by community adoption for relevance. (c) Agent trajectory analysis using the Agent Trajectory Interchange Format (ATIF) to parse and standardize execution traces from heterogeneous agent frameworks. An agent-as-judge procedure decomposes each skill into workflow steps for fine-grained comparison against actual agent trajectories to assess skill invocation, adherence, and step-level following. (d) Artifact-level analysis applies task-specific automatic scoring rubrics covering completeness, content quality, visual design, accuracy, and interaction functionality.
Training Regime: Not applicable; the study evaluates existing agent systems and skills without training new models. Evaluated agents include Claude Code (Claude 4.6 series), Codex (GPT series), Gemini CLI (Gemini 3.1 Pro), Kimi Code CLI (Kimi K2.6), Minimax, and DeepSeek V4 Pro integrated with Claude Code.
Evaluation Protocol: Each agent is prompted to complete tasks under two settings: a default setting in which skill usage is uninstructed, and a force-using setting in which explicit prompts require skill invocation. Evaluation metrics include the skill read rate, step-level adherence (follow/skip/contradict rates), and final artifact quality scores on a 1–5 scale covering, depending on the task, content quality, visual design, completeness, data accuracy, and interaction functionality. Trajectory and artifact evaluations are performed over 100 randomly sampled tasks per agent per category. Cross-agent and cross-task comparisons identify relative performance, and skill-level comparisons assess skill quality. Statistical significance methods are not detailed. The framework supports continuous benchmarking as artifacts and skills evolve.
Reproducibility: Code and benchmark data (task instances, skill sets) are publicly available via the project website. Skills are collected from open public repositories at a fixed snapshot in time. Detailed task schemas, evaluation criteria, and additional case studies are provided in the appendix.
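The step-level adherence metrics named under Evaluation Protocol can be illustrated with a minimal Python sketch. It assumes the agent-as-judge has already assigned one label to each decomposed skill step; `StepJudgment`, `adherence_rates`, and the label strings are hypothetical names chosen for illustration and are not taken from the paper's code.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed label vocabulary, mirroring the follow/skip/contradict rates in the text.
LABELS = ("follow", "skip", "contradict")


@dataclass
class StepJudgment:
    """One judged skill step (hypothetical structure)."""
    step_id: int
    label: str  # expected to be one of LABELS


def adherence_rates(judgments):
    """Return the fraction of judged steps carrying each label."""
    if not judgments:
        return {label: 0.0 for label in LABELS}
    counts = Counter(j.label for j in judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(judgments)
    return {label: counts[label] / total for label in LABELS}


if __name__ == "__main__":
    # Illustrative labels only; these are not results from the paper.
    labels = ["follow"] * 7 + ["skip"] * 2 + ["contradict"]
    judged = [StepJudgment(i, label) for i, label in enumerate(labels)]
    print(adherence_rates(judged))
```

Computing the rates per run and then averaging across runs is one possible aggregation; the summary does not state which aggregation the authors use.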

Example end-to-end evaluation: For a data visualization task, realistic data and an analytical goal are sampled from an open data portal. The LLM generates a concrete task specification and instruction. An agent with a data visualization skill is tasked to produce a chart. Execution traces in ATIF format record how and when the agent reads and follows the skill workflow steps. The final artifact (chart image) is automatically scored for insight expression, visual quality, completeness, and data accuracy against expected properties derived from the task specification. The framework scores the agent’s performance, compares it to baseline agents without skills, and notes how fully skill steps were followed or skipped.
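The walkthrough above can be sketched as a small data model. This is an illustrative Python sketch under stated assumptions: `TaskInstance`, `RunResult`, and `skill_gain` are hypothetical names, the overall score is assumed to be an unweighted mean of rubric dimensions, and the numbers are made up for the example rather than taken from the paper.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class TaskInstance:
    """A generated task: specification (T), instruction (I), sources (S)."""
    category: str
    specification: str
    instruction: str
    source_files: list = field(default_factory=list)


@dataclass
class RunResult:
    """Rubric scores (1-5) for one agent run on one task."""
    used_skill: bool
    rubric_scores: dict

    def overall(self):
        # Assumption: unweighted mean over rubric dimensions.
        return mean(self.rubric_scores.values())


def skill_gain(skilled, baseline):
    """Overall score difference between a skill-augmented run and a no-skill run."""
    return skilled.overall() - baseline.overall()


if __name__ == "__main__":
    task = TaskInstance(
        category="data visualization",
        specification="Chart showing a trend in an open dataset",
        instruction="Create a chart that highlights the main trend.",
        source_files=["data.csv"],  # hypothetical file name
    )
    # Illustrative scores only.
    skilled = RunResult(True, {"insight": 4, "visual": 4, "completeness": 5, "accuracy": 5})
    baseline = RunResult(False, {"insight": 4, "visual": 4, "completeness": 4, "accuracy": 4})
    print(task.category, skill_gain(skilled, baseline))
```

A positive gain would suggest the skill helped on that task, while a gain near zero combined with higher execution cost corresponds to the case where a skill adds cost without improving quality.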

Technical innovations

Artifact-driven dynamic test case generation pipeline that infers realistic user task intents by reverse engineering from real-world artifacts rather than relying on static benchmarks.
Agent trajectory trace analysis using the Agent Trajectory Interchange Format (ATIF) combined with an agent-as-judge method to assess fine-grained skill invocation and adherence during runtime.
Automatic skill-level and agent-level benchmarking under unified task settings spanning five diverse real-world task categories with no fixed canonical outputs.
Force-using skill invocation setting that explicitly prompts agents to use skills, allowing controlled measurement of skill usage effects and agent autonomy.
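To make the default versus force-using comparison concrete, the sketch below computes a skill read rate and the average position of the first skill read from simplified traces. The trace layout (a list of dicts with `action` and `path` keys) is a hypothetical stand-in and does not reproduce the actual ATIF schema; the function names are likewise assumptions.

```python
SKILL_FILE = "SKILL.md"


def first_skill_read(trace):
    """Index of the first step that reads the skill file, or None if never read."""
    for index, step in enumerate(trace):
        path = str(step.get("path", ""))
        if step.get("action") == "read_file" and path.endswith(SKILL_FILE):
            return index
    return None


def read_rate(traces):
    """Fraction of runs in which the skill file was read at least once."""
    if not traces:
        return 0.0
    hits = sum(first_skill_read(trace) is not None for trace in traces)
    return hits / len(traces)


def mean_first_read_position(traces):
    """Average step index of the first skill read, over runs that read it."""
    positions = [p for p in map(first_skill_read, traces) if p is not None]
    return sum(positions) / len(positions) if positions else None


if __name__ == "__main__":
    # Toy traces for illustration; not data from the paper.
    default_runs = [
        [{"action": "run_shell", "path": ""}, {"action": "write_file", "path": "out.html"}],
        [{"action": "run_shell", "path": ""}, {"action": "read_file", "path": "skills/pptx/SKILL.md"}],
    ]
    forced_runs = [
        [{"action": "read_file", "path": "skills/pptx/SKILL.md"}],
        [{"action": "read_file", "path": "skills/pptx/SKILL.md"}],
    ]
    print(read_rate(default_runs), mean_first_read_position(default_runs))
    print(read_rate(forced_runs), mean_first_read_position(forced_runs))
```

Tracking the first-read position alongside the read rate reflects the observation that forced prompting leads to earlier access on average, not only more frequent access.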

Datasets

OpenSkillEval benchmark — 677 dynamically generated task instances — derived from publicly available artifacts and datasets including Awwwards websites, Kaggle datasets, industry reports, academic papers, and open data portals (Our World in Data).
Skill collection — 30 popular open-source skills — curated from public skill repositories such as clawhub.ai, skills.sh, openskills.space, and skillsmp.com.

Baselines vs proposed

Presentation Generation Content Quality: Claude Opus 4.6 4.48 ± 0.35 vs GPT-5.3-codex 3.55 ± 0.52
Front-end Web Design Visual Design: GPT-5.5 4.71 ± 0.53 vs Kimi K2.6 4.10 ± 1.30
Data Visualization Insight Expression: Claude Opus 4.40 ± 0.62 vs GPT-5.3-codex 2.79 ± 1.12
Report Generation Completeness: Claude Sonnet 4.81 ± 0.50 vs GPT-5.3-codex 3.21 ± 1.19
Poster Generation Visual Design: Claude Opus 4.00 ± 0.60 vs GPT-5.3-codex 3.19 ± 0.76
Agent average overall artifact quality: Claude Code agents ~4.43–4.51 vs GPT-5.3-codex 3.76
Skill usage rate (SKILL.md file read): default ~48% average vs force-using ~94% average (Fig 3a)
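As a worked reading of the per-dimension comparisons listed above, the following Python sketch computes the gap between the two reported means in each row. The means are copied from the list; `COMPARISONS` and `score_gaps` are hypothetical names. The gaps are plain differences of means and say nothing about statistical significance, which the summary notes is not detailed.

```python
# (dimension, agent_a, mean_a, agent_b, mean_b), with means taken from the list above.
COMPARISONS = [
    ("Presentation Generation / Content Quality", "Claude Opus 4.6", 4.48, "GPT-5.3-codex", 3.55),
    ("Front-end Web Design / Visual Design", "GPT-5.5", 4.71, "Kimi K2.6", 4.10),
    ("Data Visualization / Insight Expression", "Claude Opus", 4.40, "GPT-5.3-codex", 2.79),
    ("Report Generation / Completeness", "Claude Sonnet", 4.81, "GPT-5.3-codex", 3.21),
    ("Poster Generation / Visual Design", "Claude Opus", 4.00, "GPT-5.3-codex", 3.19),
]


def score_gaps(rows):
    """Map each dimension to mean_a - mean_b, rounded to two decimals."""
    return {name: round(mean_a - mean_b, 2) for name, _, mean_a, _, mean_b in rows}


if __name__ == "__main__":
    for name, gap in score_gaps(COMPARISONS).items():
        print(f"{name}: {gap:+.2f}")
```

The largest gaps in this list fall on data visualization insight expression and report generation completeness, both at roughly 1.6 points on the 5-point scale. The reported standard deviations are large in several rows, so individual runs can overlap even where the means differ.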

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.23657.

Fig 1: Overview of the OpenSkillEval framework. The framework supports automatic test case
Fig 2 (page 3).
Fig 3 (page 3).
Fig 4 (page 3).
Fig 5 (page 3).
Fig 6 (page 3).
Fig 7 (page 3).
Fig 8: Web-based interface for human evaluation of generated task instances.

Limitations

The evaluation framework focuses on publicly available open-source skills with community adoption but may omit many emerging or less popular skills.
Agent autonomy in skill use is preserved, but the balance between forced usage and naturalistic behavior may not fully capture real-world usage patterns.
The evaluation relies heavily on automatic or LLM-based artifact scoring and verification, which may introduce bias or errors compared to human evaluation.
No detailed adversarial attacks or robustness testing of skills or agent misuse scenarios were conducted.
The analysis focuses on a snapshot in time for skills and agent versions; ecosystem dynamics and continual updates could change results over time.
The evaluated agents are a selected, non-exhaustive set of state-of-the-art systems; generalization to other models or architectures is unknown.

Open questions / follow-ons

How can skill usage be better integrated or enforced in autonomous agents to minimize step skipping or contradiction while preserving flexibility?
Can skill design and format be standardized or improved to increase alignment with diverse agent architectures and improve transferability across tasks?
What are the effects of adversarial or corrupted skills on agent behavior and output quality, and how can agents detect or mitigate them?
How can automatic artifact evaluation be improved with multimodal or human-in-the-loop approaches to better capture visual and content quality nuances?

Why it matters for bot defense

From a bot-defense and CAPTCHA perspective, OpenSkillEval provides a methodologically rigorous framework for auditing complex LLM agent workflows augmented with structured skill instructions. The work highlights that simply adding predefined workflow skills to an LLM agent does not guarantee their effective use or consistent performance improvements. This insight underlines the importance of dynamic, task-grounded evaluation rather than static benchmarks in assessing agent reliability and robustness, which is critical when deploying LLM agents in adversarial or cost-sensitive environments such as bot detection pipelines. Furthermore, the trajectory trace analysis technique could inspire analogous methods for tracing step-wise agent behavior and skill use in security-critical AI systems, potentially aiding anomaly detection or auditing of autonomous AI actions. Finally, recognizing that skill augmentation depends heavily on the synergy between model capabilities and agent framework may guide practitioners to carefully select or tailor skills for bot defense use cases rather than relying on popular or generic augmentations.

Cite

bibtex

@article{arxiv2605_23657,
  title={OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents},
  author={Jiahao Ying and Boxian Ai and Wei Tang and Siyuan Liu and Yixin Cao},
  journal={arXiv preprint arXiv:2605.23657},
  year={2026},
  url={https://arxiv.org/abs/2605.23657}
}
