Agentization of Digital Assets for the Agentic Web: Concepts, Techniques, and Benchmark

Source: arXiv:2604.04226 · Published 2026-04-05 · By Linyao Chen, Bo Huang, Qinlao Zhao, Shuai Shao, Zhi Han, Zicai Cui et al.

TL;DR

This paper addresses the challenge of transforming static digital assets, particularly code repositories, into autonomous, interoperable agents within the envisioned Agentic Web ecosystem. The Agentic Web paradigm relies on autonomous agents interacting and collaborating over standardized protocols like Agent-to-Agent (A2A). However, manual creation of these agents is costly and difficult to scale. The authors formalize an "agentization" process that automatically transforms diverse digital assets into A2A-compliant agents by tackling key hurdles such as inconsistent execution environments, unstructured internal skills, and semantic interface gaps. They develop a framework called the A2A-Agentization Agent which autonomously configures environments, extracts reusable skills as tools, builds internal agent logic, and generates standardized agent cards that describe capabilities for external discoverability.

The paper also introduces A2A-Agentization Bench, the first comprehensive benchmark for evaluating agentization quality, covering both fidelity of functionality and interoperability for multi-agent collaboration. The benchmark contains 35 diverse, real-world code repositories and 522 task instances spanning single- and multi-repo scenarios, annotated with ground-truth agent skills and tested for execution correctness. Experiments with four state-of-the-art autonomous coding agent frameworks show that automated agentization is feasible, but current systems achieve only moderate success rates (~35-37%) on single-repo tasks and face notable challenges in multi-agent orchestration. The results expose critical bottlenecks in environment setup, skill extraction, and the semantically precise self-description needed to realize the Agentic Web at scale.

Key findings

Claude Code and EnvX frameworks achieve 100% agentization success (Pass@1) in transforming repositories into A2A agents, showcasing reliable deployment.
Execution success rates on single-repo tasks reach up to 36.9% (Claude Code) and 35.1% (EnvX), indicating partial activation of repository functionality.
Skill F1-score, measuring the precision and recall of agent self-description via AgentCards, peaks at 66.2% for EnvX and 63.0% for Claude Code, correlating strongly with orchestration success.
Orchestration success in complex multi-repo scenarios reaches only around 44.4% for the best agents, revealing scalability challenges in multi-agent collaboration.
OpenHands framework, while scoring lowest on skill specification quality (F1=approx. 42%), achieves the highest overall multi-repo execution success rate (46.2%) due to execution robustness.
The benchmark’s 522 task instances are drawn from 35 repositories spanning 9 domains including vision, document processing, chemistry, and finance, ensuring realistic heterogeneity.
Task difficulty categorization reveals that over one-third of single-repo tasks are medium or hard (2+ complexity indicators), and multi-repo tasks often require cross-domain linear workflows across 2-4 repositories.
Token consumption for agentization ranges from approximately 2.3M to 4.2M tokens per repository depending on the framework, indicating substantial compute overhead.

Threat model

The work does not focus on adversarial threat scenarios but rather studies constructive capabilities of agentizing digital assets under a benign setting. The assumed adversary is absent or passive. The main challenge is enabling automated environment replication, skill extraction, and semantic interface alignment without human intervention. Security, manipulation, or adversarial agent attacks are out of scope.

Methodology — deep read

The paper formalizes the agentization process as an automated pipeline converting static digital assets (focusing on code repositories) into agents compliant with the Agent-to-Agent (A2A) protocol for interoperability within the Agentic Web.

As stated in the threat model, the setting is adversary-free: the focus is on constructive automation challenges, and no adversarial attacks are considered. The goal is reliable, functional, and discoverable agent generation.

Data consists of 35 curated GitHub repositories selected for diversity across 9 domains (vision, documents, finance, chemistry, security, NLP, etc.) totaling 522 evaluation tasks split into single-repo and multi-repo scenarios. The repositories include heterogeneous file formats, complex dependencies, and real-world workflows. Human experts manually annotate these repositories to extract 127 ground-truth atomic agent skills expressing distinct reusable functionalities. Tasks are programmatically generated and verified, ensuring strict dependence on repository-specific functions. Task difficulties are categorized by environment complexity, non-determinism, domain expertise, and complexity of workflows.

The agentization pipeline has four stages:

Environment Setup: Automatic synthesis of reproducible compute environments encapsulating all dependencies and configurations needed to reliably run repository code. This uses dependency manifests and config files.
Skill Extraction: Identification, wrapping, and validation of atomic functional units from code as executable tools. This requires dynamic verification in the environment.
Inner Agent Instantiation: Construction of a cognitive reasoning loop (e.g., via ReAct framework) integrating the extracted tools into a planning executor that can invoke skills appropriately.
Final Agent Generation: Creation of an AgentCard describing the agent identity and skills in a standardized, protocol-compliant metadata format for discovery and orchestration by other agents.

The authors implement the above as the A2A-Agentization Agent, which performs end-to-end autonomous repository transformation.
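
The four stages above can be sketched as a small pipeline. This is a minimal illustration only, not the authors' implementation: the `Skill` and `AgentizedRepo` types, the function names, the manifest format, and the smoke-test validation are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Skill:
    """One atomic, reusable capability wrapped as a callable tool (hypothetical type)."""
    name: str
    description: str
    run: Callable[..., object]


@dataclass
class AgentizedRepo:
    """Accumulates the artifacts produced by each stage (hypothetical type)."""
    repo: str
    environment: Dict[str, str] = field(default_factory=dict)
    skills: List[Skill] = field(default_factory=list)
    tools: Dict[str, Callable[..., object]] = field(default_factory=dict)
    agent_card: Dict[str, object] = field(default_factory=dict)


def setup_environment(state: AgentizedRepo, manifest: Dict[str, str]) -> AgentizedRepo:
    # Stage 1: record pinned dependencies from a manifest (assumed format).
    state.environment = dict(manifest)
    return state


def extract_skills(state: AgentizedRepo,
                   candidates: List[Tuple[Skill, tuple]]) -> AgentizedRepo:
    # Stage 2: keep only candidates that survive a smoke test with sample arguments.
    for skill, sample_args in candidates:
        try:
            skill.run(*sample_args)
        except Exception:
            continue  # candidate failed dynamic verification, so it is dropped
        state.skills.append(skill)
    return state


def instantiate_inner_agent(state: AgentizedRepo) -> AgentizedRepo:
    # Stage 3: expose validated skills as a tool registry for a reasoning loop.
    state.tools = {skill.name: skill.run for skill in state.skills}
    return state


def generate_agent_card(state: AgentizedRepo) -> AgentizedRepo:
    # Stage 4: describe the agent and its skills for external discovery.
    state.agent_card = {
        "name": state.repo,
        "skills": [{"id": s.name, "description": s.description} for s in state.skills],
    }
    return state


def agentize(repo: str, manifest: Dict[str, str],
             candidates: List[Tuple[Skill, tuple]]) -> AgentizedRepo:
    state = AgentizedRepo(repo=repo)
    state = setup_environment(state, manifest)
    state = extract_skills(state, candidates)
    state = instantiate_inner_agent(state)
    return generate_agent_card(state)


def molecular_weight(counts: Dict[str, int]) -> float:
    # Toy stand-in for repository code.
    weights = {"H": 1.008, "O": 15.999}
    return sum(weights[element] * n for element, n in counts.items())


def broken_skill() -> None:
    raise RuntimeError("missing dependency")


result = agentize(
    "toy-chem-repo",
    {"python": "3.11"},
    [
        (Skill("molecular_weight", "Compute molecular weight", molecular_weight),
         ({"H": 2, "O": 1},)),
        (Skill("broken", "Fails validation", broken_skill), ()),
    ],
)
print(result.agent_card)
```

In this toy run the failing candidate is filtered out during skill extraction, so only the validated skill reaches the tool registry and the agent card.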

For evaluation, the A2A-Agentization Bench provides the executable tasks and procedures in three stages:

Agentization Process Assessment: measuring pass@k success rates of agent creation and token cost.
Capability Inheritance Assessment: deploying agents to solve single-repo tasks that require correct invocation of repository internal functions, judged by LLM-based trajectory matching.
Collaborative Execution Assessment: evaluating multi-agent workflows involving multiple agents collaborating on interdependent repositories, assessing skill description quality as F1-score against ground truth (specification quality), orchestration success (correct task dispatching), and final execution success.
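
As a rough illustration of the specification-quality metric, the sketch below computes precision, recall, and F1 between the skills an agent advertises and the annotated ground truth. It assumes skills are compared by exact identifier match; the benchmark may well use a softer (for example, semantic) matching rule, and the skill names are invented for the example.

```python
from typing import Iterable, Tuple


def skill_f1(predicted: Iterable[str], ground_truth: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over two sets of skill identifiers."""
    pred, gold = set(predicted), set(ground_truth)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical skills advertised in an AgentCard vs. hypothetical annotations.
precision, recall, f1 = skill_f1(
    ["balance_reaction", "molecular_weight", "plot_spectrum"],
    ["balance_reaction", "molecular_weight", "convert_units", "parse_smiles"],
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Two of the three advertised skills are correct and two of the four annotated skills are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.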

Four state-of-the-art agent frameworks (Claude Code, Codex CLI, OpenHands, EnvX) are evaluated under a controlled orchestration setup that uses a centralized orchestrator with oracle task decompositions. Inner-agent reasoning uses the Claude Code agent. The results report per-framework metrics for each pipeline stage, which allows the strengths and limitations of each framework to be compared.
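
To make "orchestration success" concrete, here is a toy sketch of a centralized orchestrator that receives an oracle decomposition and must route each subtask to the right agent using only the skill descriptions the agents publish. Everything here is an assumption for illustration: the real orchestrator is presumably LLM-driven rather than keyword-based, and scoring success as the fraction of correctly dispatched subtasks is one possible reading of the metric.

```python
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    return set(text.lower().replace(",", " ").split())


def dispatch(subtask: str, cards: List[Dict]) -> str:
    """Pick the agent whose best-matching skill description shares the most words."""
    words = tokenize(subtask)

    def score(card: Dict) -> int:
        return max(
            (len(words & tokenize(skill["description"])) for skill in card["skills"]),
            default=0,
        )

    return max(cards, key=score)["name"]


# Hypothetical agent cards, reduced to the fields this sketch needs.
cards = [
    {"name": "chem-agent",
     "skills": [{"description": "compute molecular weight of a compound"}]},
    {"name": "doc-agent",
     "skills": [{"description": "extract tables from a pdf document"}]},
]

# Oracle decomposition: (subtask, agent that should handle it).
oracle_plan: List[Tuple[str, str]] = [
    ("compute the molecular weight of caffeine", "chem-agent"),
    ("extract tables from the supplier pdf", "doc-agent"),
]

correct = sum(dispatch(subtask, cards) == expected for subtask, expected in oracle_plan)
orchestration_success = correct / len(oracle_plan)
print(orchestration_success)
```

The sketch shows why specification quality matters for routing: if an agent's card omitted or misdescribed a skill, the dispatcher would have nothing to match the subtask against.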

Reproducibility: The paper provides extensive benchmark data, task annotations, and procedures. The underlying models (e.g., Claude-Sonnet) are partially closed-source, so exact replication may depend on model availability. The list of benchmark repositories is publicly available, but not all auxiliary code or agent configurations are specified in full detail.

A concrete example: To agentize a repository in computational chemistry, the Agentization Agent first builds a Docker environment encapsulating required dependencies, extracts atomic functionalities like reaction balancing or molecular weight calculations as skills wrapped into callable tools, instantiates an inner agent reasoning loop that sequences these skills, and finally generates an AgentCard describing these skills for discovery. This agent is then evaluated on single- and multi-repo chemistry tasks requiring chaining computations with finance or document agents in the ecosystem, reporting success rates on activating repository capabilities and orchestrated execution.
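
For the chemistry example, the generated AgentCard might look roughly like the structure below. The field names follow my reading of the public A2A specification and should be checked against the current version; every value (agent name, URL, skill ids, descriptions) is hypothetical, and `find_agents_for` is an invented helper showing how another agent could use cards for discovery.

```python
import json
from typing import Dict, List

# Hypothetical AgentCard for a computational-chemistry repository.
agent_card = {
    "name": "chem-tools-agent",
    "description": "Agentized computational-chemistry repository.",
    "url": "http://localhost:8000/",  # placeholder endpoint
    "version": "0.1.0",
    "capabilities": {"streaming": False},
    "defaultInputModes": ["text/plain"],
    "defaultOutputModes": ["text/plain"],
    "skills": [
        {
            "id": "balance_reaction",
            "name": "Balance reaction",
            "description": "Balance a chemical reaction given reactants and products.",
            "tags": ["chemistry"],
        },
        {
            "id": "molecular_weight",
            "name": "Molecular weight",
            "description": "Compute the molecular weight of a compound.",
            "tags": ["chemistry"],
        },
    ],
}


def find_agents_for(skill_id: str, cards: List[Dict]) -> List[str]:
    """Return the names of agents whose card advertises the given skill id."""
    return [
        card["name"]
        for card in cards
        if any(skill["id"] == skill_id for skill in card["skills"])
    ]


serialized = json.dumps(agent_card, indent=2)
print(find_agents_for("molecular_weight", [agent_card]))
```

Because the card is plain JSON, it can be served to other agents as-is; the quality of the `skills` entries is what the Skill F1-score is meant to capture.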

Technical innovations

Formalization of the agentization process as a four-stage transformation (environment setup, skill extraction, inner agent instantiation, and agent card generation) to produce A2A-compliant interoperable agents from static digital assets.
Development of an autonomous A2A-Agentization Agent framework that addresses environment heterogeneity, unstructured skills extraction, and semantic interface gaps in an integrated pipeline.
Creation of A2A-Agentization Bench, the first benchmark tailored for evaluating end-to-end agentization quality focusing on fidelity and multi-agent interoperability across diverse, real-world code repositories.
Use of a unified LLM-based judge mechanism to verify agent execution correctness and task success at both single-agent and multi-agent interaction levels.
Skill and capability annotations that adhere strictly to atomic, reusable functions, enabling precise comparison and orchestrability assessment via the Skill F1-score.

Datasets

A2A-Agentization Bench — 35 GitHub repositories, 522 tasks — curated from open-source domains with manual skill annotations and structured ground-truth workflows

Baselines vs proposed

Pass@1 Agentization Success: Claude Code = 100% vs EnvX = 100% vs Codex = 94.28% vs OpenHands = 94.28%
Single-Repo Execution Success Rate: Claude Code = 36.9% vs EnvX = 35.1% vs Codex = 34.5% vs OpenHands = 33.9%
Skill F1-Score (Agent Specification Quality): EnvX = 66.2% vs Claude Code = 63.0% vs Codex = 42.1% (OpenHands score truncated, but lower)
Multi-Repo Orchestration Success Rate: EnvX and Claude Code at around 44.4% (with greater robustness on hard scenarios) vs others at ~25-30%
Multi-Repo Execution Success Rate: OpenHands highest at 46.2% despite the lowest Skill F1 (execution robustness compensates for weaker specification)
Token Consumption (Agentization Cost) per repo: EnvX ~4.2M vs Claude Code ~3.3M vs Codex ~2.3M vs OpenHands ~3.3M tokens
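
A quick back-of-the-envelope check of these figures, assuming the per-repository token costs are averages over all 35 benchmark repositories and that Pass@1 is computed per repository. Under those assumptions, the reported 94.28% is consistent with 33 of 35 repositories succeeding (33/35 is 94.2857...%, truncated to two decimals); this is an inference, not a figure stated in the source.

```python
import math

NUM_REPOS = 35  # benchmark size reported above

# Approximate per-repository agentization cost, in millions of tokens.
tokens_per_repo_millions = {
    "EnvX": 4.2,
    "Claude Code": 3.3,
    "Codex": 2.3,
    "OpenHands": 3.3,
}

# Estimated total cost of agentizing the whole benchmark once per framework.
total_tokens_millions = {
    name: round(per_repo * NUM_REPOS, 1)
    for name, per_repo in tokens_per_repo_millions.items()
}


def truncated_percent(successes: int, total: int, decimals: int = 2) -> float:
    """Success rate as a percentage, truncated (not rounded) to `decimals` places."""
    scale = 10 ** decimals
    return math.floor(successes / total * 100 * scale) / scale


pass_at_1_if_33_of_35 = truncated_percent(33, NUM_REPOS)

print(total_tokens_millions)
print(pass_at_1_if_33_of_35)
```

On these assumptions a single pass over the benchmark costs on the order of 80M to 147M tokens per framework, which gives a sense of the compute overhead noted in the key findings.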

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.04226.

Fig 1: Conceptual illustration of agentization of digital assets for the Agentic Web. Through agentization, …

Fig 2: Processing pipeline of repository agentization. The repository agentization process starts from raw …

Limitations

Overall single-repo execution success rates are moderate (<37%), indicating partial but incomplete activation of repository functionality.
Multi-repo coordination remains challenging: the best orchestration success rate is around 44.4% and the best multi-repo execution success rate is 46.2%, exposing bottlenecks in cross-agent interoperability.
The evaluation relies heavily on LLM-based judge mechanisms whose accuracy and biases are not deeply analyzed.
The benchmark and experiments focus exclusively on code repositories as digital assets, leaving validation on other asset types (documents, media, services) unexplored.
The adversarial robustness, security, and attack resilience of the agentization pipeline and resulting agents are not evaluated, limiting operational confidence.
Underlying models like Claude-Sonnet are not fully open, which may hinder exact experimental reproduction and limit community benchmarking.
The agentization process requires heavy token consumption (millions per repo), which may impose computational and cost barriers in large-scale deployments.

Open questions / follow-ons

How can agentization pipelines be improved to significantly boost functional activation rates beyond ~37% success on real-world repositories?
What methods can enhance semantic precision and discoverability of agent capabilities to scale robust multi-agent orchestration?
How generalizable are the agentization approaches and benchmarks to non-code digital assets such as documents, media, or web services?
What adversarial vulnerabilities or misuse threats arise in fully autonomous agentization and deployment, and how can they be mitigated?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work is relevant because it automates the generation of autonomous, interoperable agents from existing software assets. Understanding these agentization methods and benchmarks gives insight into how agents, malicious or benign, might be synthesized at scale, potentially amplifying bot capabilities. The difficulty of verifying an agent's functional fidelity and interoperability points to both an opportunity and a source of complexity for bot detection and intervention strategies. The benchmark and evaluation pipeline could also inspire analogous frameworks for rigorously testing the robustness of CAPTCHA-resistant agents or the integrity of automated defenses. However, because the study focuses on constructive agent creation and interoperability rather than adversarial detection, engineers should combine these findings with threat-specific research when anticipating advanced bot ecosystems enabled by autonomous agent synthesis.

Cite

@article{arxiv2604_04226,
  title={Agentization of Digital Assets for the Agentic Web: Concepts, Techniques, and Benchmark},
  author={Linyao Chen and Bo Huang and Qinlao Zhao and Shuai Shao and Zhi Han and Zicai Cui and Ziheng Zhang and Guangtao Zeng and Wenzheng Tang and Yikun Wang and Yuanjian Zhou and Zimian Peng and Yong Yu and Weiwen Liu and Hiroki Kobayashi and Weinan Zhang},
  journal={arXiv preprint arXiv:2604.04226},
  year={2026},
  url={https://arxiv.org/abs/2604.04226}
}
