
A Prompt-Aware Structuring Framework for Reliable Reuse of AI-Generated Content in the Agentic Web

Source: arXiv:2605.09283 · Published 2026-05-10 · By Shusaku Egami, Masahiro Hamasaki

TL;DR

This paper addresses the challenge of verifying and reusing AI-generated content (AIGC) reliably in the emerging Agentic Web, where autonomous AI agents generate and reuse content extensively. Existing mechanisms lack transparency around the provenance, generation context, and trustworthiness of AIGC, leading to risks such as chained hallucinations and license violations when content is reused. The authors propose a novel framework that automatically attaches structured metadata—including modularized prompts, contexts, internal thoughts, model details, hyperparameters, and confidence scores—to AIGC at generation time. This metadata is encapsulated in JSON-LD format and cryptographically sealed with verifiable credentials (VCs) to ensure provenance and integrity.

They demonstrate the framework’s utility by applying it to instruction-following fine-tuning tasks on ComplexBench. Using multiple teacher LLMs, the framework generates structured AIGC with rich metadata, which is then curated via prompt-level fidelity checks before fine-tuning a smaller student model. Selectively fine-tuning on AIGC curated with the structured metadata improves the instruction-following metrics (RFR and FRFR) over a baseline that selects training data at random. The gains are modest and reported without significance testing, so they suggest rather than confirm that explicitly managing prompt structure and generation conditions enhances reuse quality. Overall, the work lays foundations for reliable AIGC reuse in an increasingly autonomous web ecosystem.

Key findings

  • 74% of the webpages newly detected by web crawlers in April 2025 contained AI-generated content (AIGC), highlighting the scale of the shift toward the Agentic Web.
  • The proposed metadata schema captures modularized prompt components (Role, Background, Requirements, Example, OutputFormat), generation context, model info, hyperparameters, confidence scores, and cryptographic proofs in JSON-LD format.
  • A total of 1,150 main POML (Prompt Orchestration Markup Language) files and 5,024 modular prompt files were generated using GPT-5-mini to structure ComplexBench prompts.
  • Fine-tuning a student Llama-3.2-1B-Instruct model on curated AIGC yielded a Requirements Following Ratio (RFR) of 47.30%, versus 45.45% for fine-tuning on randomly selected data (+1.85 percentage points), and raised the Full Requirements Following Ratio (FRFR) from 15.79% to 16.07% (+0.28 percentage points).
  • Teacher models’ RFR ranged from 61.17% (gpt-oss-20b) to 65.02% (Qwen3-32B), so the student remains well below its teachers even though curation gives a relative improvement.
  • Confidence scores are derived by aggregating token-level log probabilities and attached as metadata, enabling mechanical quality assessment.
  • Verifiable Credentials cryptographically sign the entire AIGC and metadata bundle with the issuer’s private key, ensuring tamper evidence and verifiable provenance.
  • Evaluation relies on GPT-5.1-mini as a judge for instruction-following fidelity on ComplexBench’s specific prompt questions.
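
The confidence-score finding above can be illustrated with a minimal sketch. The exact aggregation the authors use is not stated here, so the mean log probability and its exponential (the geometric mean of the token probabilities) are assumptions. The function name `confidence_from_logprobs` and the output field names are hypothetical.

```python
import math
from typing import Dict, Sequence


def confidence_from_logprobs(token_logprobs: Sequence[float]) -> Dict[str, float]:
    """Aggregate per-token log probabilities into sequence-level scores.

    Assumption: mean log probability is the aggregate; the paper's exact
    formula may differ.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return {
        "mean_logprob": mean_logprob,
        # exp(mean log p) equals the geometric mean of the token probabilities
        "geo_mean_prob": math.exp(mean_logprob),
        # the least likely token is a cheap signal for a weak spot in the output
        "min_logprob": min(token_logprobs),
        "num_tokens": float(len(token_logprobs)),
    }


# Hypothetical log probabilities for a three-token output.
scores = confidence_from_logprobs([math.log(0.9), math.log(0.8), math.log(0.5)])
print(scores)
```

Averaging rather than summing keeps the score comparable across outputs of different lengths, which matters when several teacher outputs for the same prompt are ranked against each other.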

Threat model

The adversary is an untrusted AI agent or content issuer who may generate and publish low-quality, hallucinated, or non-compliant AI-generated content without proper provenance or licensing. The framework assumes the adversary cannot break cryptographic signatures or forge verifiable credentials but could attempt to distribute tampered or misleading AIGC. Honest AI agents use the verifiable credentials and structured metadata to verify content integrity, provenance, and adherence to prompt instructions before reuse.
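
The tamper-evidence property assumed in this threat model can be sketched with the standard library alone. This is not a Verifiable Credential implementation: an HMAC with a shared secret stands in for the issuer's asymmetric signature, and sorted-key JSON stands in for a proper canonicalization scheme. The names `seal`, `verify`, and the proof type `HmacSha256Demo` are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(record: dict) -> bytes:
    # Simplified canonical form: sorted keys, no extra whitespace.
    # Assumption: a real credential would use a standardized canonicalization.
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def seal(record: dict, key: bytes) -> dict:
    """Attach a keyed digest computed over the content and its metadata."""
    tag = hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()
    return {"record": record, "proof": {"type": "HmacSha256Demo", "value": tag}}


def verify(sealed: dict, key: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    expected = hmac.new(
        key, canonical_bytes(sealed["record"]), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, sealed["proof"]["value"])


key = b"demo-shared-secret"  # hypothetical key, for illustration only
sealed = seal(
    {"content": "Paris is the capital of France.", "model": "example-model"}, key
)

# An adversary who edits the content without the key cannot produce a valid proof.
tampered = json.loads(json.dumps(sealed))  # deep copy
tampered["record"]["content"] = "Lyon is the capital of France."

print(verify(sealed, key))    # True
print(verify(tampered, key))  # False
```

With a real public-key signature, any agent could run the verification step using only the issuer's public key, which is what makes the check usable by untrusting third parties.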

Methodology — deep read

  1. Threat Model & Assumptions: The adversary is an untrusted AI agent or content issuer who might publish hallucinated or low-quality AIGC without provenance or license compliance. Honest agents are assumed to want to verify reliability, reproducibility, and attribution. Adversaries are assumed unable to break cryptographic signatures or forge verifiable credentials.

  2. Data: The main dataset used is ComplexBench, designed to evaluate complex instruction-following with compositional constraints. The authors created modular prompt datasets from ComplexBench’s prompts by decomposing them into Role, Background, Requirements, Example, and OutputFormat modules using the POML markup language. GPT-5-mini was used to process and modularize prompts, resulting in 1,150 main POML files and 5,024 total modular files.

  3. Architecture / Algorithm: The framework encapsulates raw AIGC outputs generated by teacher models (Qwen3-32B, gpt-oss-20b, Llama-3.1-8B-Instruct) along with their modular prompt components, model details, hyperparameters, and confidence scores into JSON-LD formatted metadata. Confidence scores are computed by aggregating token log probabilities from model inference. Optional chain-of-thought internal reasoning is extracted using tag-based pattern matching. All metadata is cryptographically signed and issued as Verifiable Credentials to assure integrity and provenance.

  4. Training Regime: The fine-tuning target is a smaller student model Llama-3.2-1B-Instruct. Two regimes were compared: (i) Random-FT, where training samples were randomly selected from outputs of the three teacher models, and (ii) Curated-FT, where the single best output per prompt was selected based on metadata-driven instruction-following fidelity. The models were trained on 68% of ComplexBench training data and evaluated on the held-out 32% test split. Other training hyperparameters are not specified in detail.

  5. Evaluation Protocol: Outputs from the fine-tuned student models were evaluated with GPT-5.1-mini as an automated judge against ComplexBench’s question sets. Two metrics were reported: Requirements Following Ratio (RFR), the percentage of questions answered affirmatively, and Full Requirements Following Ratio (FRFR), the percentage of prompts that satisfy all of their instructions. Baselines included the teacher models and random-selection fine-tuning. The paper does not mention cross-validation, statistical significance testing, or distribution-shift evaluation, so the results show relative improvements from metadata-driven curation without a measure of their statistical reliability.

  6. Reproducibility: The paper does not mention releasing code, dataset splits, or trained weights. ComplexBench and LLMs used are publicly known or accessible. The metadata schema is explained in detail, promoting conceptual reproducibility, but exact regeneration of results requires access to teacher models and the LLM evaluators.
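
The two selection regimes in item 4 and the two metrics in item 5 can be sketched as follows. The data layout (one list of yes/no judgments per teacher output) and every function name are assumptions for illustration. In the paper the metrics are computed on the student's outputs; here they are applied to the selected teacher outputs only to show the arithmetic.

```python
import random
from typing import Dict, List

Judgments = List[bool]  # one yes/no answer per requirement question


def rfr(per_prompt: List[Judgments]) -> float:
    """Requirements Following Ratio: % of all questions answered 'yes'."""
    answers = [a for judgments in per_prompt for a in judgments]
    return 100.0 * sum(answers) / len(answers)


def frfr(per_prompt: List[Judgments]) -> float:
    """Full RFR: % of prompts whose questions were all answered 'yes'."""
    return 100.0 * sum(all(j) for j in per_prompt) / len(per_prompt)


def curated_pick(candidates: Dict[str, Judgments]) -> str:
    """Curated-FT style: keep the teacher output satisfying the most requirements.

    Ties go to the alphabetically first teacher so the choice is deterministic.
    """
    return max(
        sorted(candidates),
        key=lambda t: sum(candidates[t]) / len(candidates[t]),
    )


def random_pick(candidates: Dict[str, Judgments], rng: random.Random) -> str:
    """Random-FT style: keep any teacher output, ignoring the metadata."""
    return rng.choice(sorted(candidates))


# Hypothetical fidelity judgments for two prompts and two teachers.
judgments = {
    "p1": {"teacher_a": [True, True, False], "teacher_b": [True, True, True]},
    "p2": {"teacher_a": [True, False], "teacher_b": [False, False]},
}

curated = {p: curated_pick(c) for p, c in judgments.items()}
selected = [judgments[p][t] for p, t in curated.items()]
print(curated)         # {'p1': 'teacher_b', 'p2': 'teacher_a'}
print(rfr(selected))   # 80.0
print(frfr(selected))  # 50.0
```

RFR gives partial credit per requirement while FRFR is all-or-nothing per prompt, which is why FRFR values are much lower than RFR values for the same model.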

Example episode end-to-end: Starting with ComplexBench prompts, GPT-5-mini decomposes each prompt into modular POML components. Multiple teacher models then generate outputs using these prompts, attaching the full structured metadata including prompts, model info, hyperparameters, confidence, and self-generated chain-of-thought reasoning when available. These outputs are structured into verifiable credentials signed cryptographically. A curation agent evaluates each output’s adherence to prompt constraints mechanically via modular prompt components and associated metadata, selecting the best output per prompt. This curated data trains a student Llama-3.2-1B-Instruct model for instruction-following improvement, which is then evaluated with GPT-5.1-mini on held-out prompts from ComplexBench.
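
The metadata-attachment step of this episode might look like the sketch below. The `<think>` tag, the `ex:` vocabulary, the `https://example.org/aigc#` namespace, and all function and field names are assumptions; the paper's actual JSON-LD schema and tag patterns are not reproduced here.

```python
import json
import re
from typing import Dict, Optional, Tuple

# Assumption: the model wraps its reasoning in <think>...</think> tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thought(raw: str) -> Tuple[Optional[str], str]:
    """Separate tagged internal reasoning from the final answer."""
    match = THINK_RE.search(raw)
    if match is None:
        return None, raw.strip()
    answer = (raw[: match.start()] + raw[match.end():]).strip()
    return match.group(1).strip(), answer


def build_record(
    raw_output: str,
    prompt_modules: Dict[str, str],
    model: str,
    hyperparameters: Dict[str, float],
    confidence: float,
) -> dict:
    """Bundle the content with its generation context as a JSON-LD-style dict."""
    thought, answer = split_thought(raw_output)
    record = {
        "@context": {"ex": "https://example.org/aigc#"},  # hypothetical vocabulary
        "@type": "ex:GeneratedContent",
        "ex:content": answer,
        "ex:prompt": {f"ex:{name}": text for name, text in prompt_modules.items()},
        "ex:model": model,
        "ex:hyperparameters": hyperparameters,
        "ex:confidence": confidence,
    }
    if thought is not None:  # reasoning is optional metadata
        record["ex:internalThought"] = thought
    return record


record = build_record(
    raw_output="<think>List three items, one per line.</think>apple\npear\nplum",
    prompt_modules={
        "Role": "You are a helpful assistant.",
        "Requirements": "Name exactly three fruits.",
        "OutputFormat": "One item per line.",
    },
    model="example-teacher-model",
    hyperparameters={"ex:temperature": 0.7},
    confidence=0.71,
)
print(json.dumps(record, indent=2))
```

Keeping each prompt module under its own key is what lets a later curation agent check an output against, say, only the Requirements or only the OutputFormat module.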

Technical innovations

  • Automatic extraction and modularization of prompts into structured components (Role, Background, Requirements, Example, OutputFormat) using POML markup and LLM processing.
  • Comprehensive metadata schema encapsulating full content context including prompts, model parameters, confidence scores, and internal reasoning attached to each AIGC output in JSON-LD format.
  • Integration of Verifiable Credentials (VCs) to cryptographically attest provenance and integrity of generated content and its metadata.
  • Metadata-driven curation mechanism mechanically evaluating instruction-following fidelity at prompt module granularity to select high-quality training data for fine-tuning.
  • Demonstration that fine-tuning student models on curated AIGC improves instruction-following metrics over random selection baselines.

Datasets

  • ComplexBench: a public instruction-following benchmark whose prompts carry compositional constraints. The exact prompt count is not stated here; 1,150 main POML files were derived from its prompts.

Baselines vs proposed

  • Llama-3.2-1B-Instruct (Random-FT): RFR = 45.45%, FRFR = 15.79% vs proposed Curated-FT: RFR = 47.30%, FRFR = 16.07%
  • Teacher Llama-3.1-8B-Instruct: RFR = 63.02%, FRFR = 33.24%
  • Teacher gpt-oss-20b: RFR = 61.17%, FRFR = 37.67%
  • Teacher Qwen3-32B: RFR = 65.02%, FRFR = 39.89%
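
To make the size of these differences explicit, the short sketch below computes the gaps from the figures listed above. It separates absolute gains (percentage points) from relative gains (percent of the baseline value). The helper names are hypothetical.

```python
# Figures copied from the list above (all values are percentages).
results = {
    "Random-FT": {"RFR": 45.45, "FRFR": 15.79},
    "Curated-FT": {"RFR": 47.30, "FRFR": 16.07},
    "Qwen3-32B (teacher)": {"RFR": 65.02, "FRFR": 39.89},
}


def point_gap(a: str, b: str, metric: str) -> float:
    """Absolute difference a - b in percentage points."""
    return round(results[a][metric] - results[b][metric], 2)


def relative_gain(a: str, b: str, metric: str) -> float:
    """Difference a - b as a percentage of b's value."""
    return round(100.0 * (results[a][metric] - results[b][metric]) / results[b][metric], 2)


for metric in ("RFR", "FRFR"):
    print(
        metric,
        "curation gain:", point_gap("Curated-FT", "Random-FT", metric), "points,",
        relative_gain("Curated-FT", "Random-FT", metric), "% relative;",
        "gap to best teacher:", point_gap("Qwen3-32B (teacher)", "Curated-FT", metric), "points",
    )
```

The curation gain on RFR is 1.85 points, while the remaining gap to the strongest teacher is 17.72 points, so curation narrows but does not close the student-teacher gap.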

Limitations

  • Evaluation metrics rely solely on an LLM-based evaluator (GPT-5.1-mini) without human calibration or inter-annotator agreement checks, raising questions about ground truth reliability.
  • No adversarial robustness testing or evaluation against intentionally manipulated or malicious AIGC inputs to verify security of verifiable credentials.
  • Limited scope focusing only on instruction-following fidelity; other quality aspects like factual correctness, fluency, or ethical compliance are not addressed.
  • Experiment applies only to ComplexBench dataset and instruction-following fine-tuning; generality to other tasks or modalities is untested.
  • Incentives for widespread adoption of structured metadata by AIGC issuers remain unclear, limiting practical deployment in real-world agentic web environments.
  • Details on training regime hyperparameters, epochs, and hardware used for fine-tuning are sparse, affecting reproducibility.

Open questions / follow-ons

  • How effective are verifiable credentials in detecting and mitigating malicious agents attempting to inject false or low-quality AIGC in the Agentic Web?
  • What incentives or economic models can motivate AIGC issuers to adopt prompt-aware metadata frameworks given the primarily indirect benefits?
  • How does the framework generalize to multi-modal AI-generated content beyond textual data and to diverse downstream tasks?
  • Can human-in-the-loop or ensemble evaluation methods improve reliability beyond LLM-based automated judges for instruction-following and content quality?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this framework highlights the importance of considering provenance and generation metadata when interacting with AI-generated content on the web. As attackers may automate content generation via LLM-based agents in bot campaigns, integrating structured metadata and verifiable credentials can help detect and filter unreliable or malicious outputs early. Additionally, such metadata-aware curation methods could be employed to evaluate the fidelity and trustworthiness of prompt outputs in security-sensitive pipelines, preventing chained hallucinations or propagation of erroneous data that can compromise detection or challenge generation. More broadly, as the AI ecosystem shifts toward autonomous agents reusing content, maintaining explicit provenance and confidence measures becomes critical in any system relying on AI assistance or automation for security decisions.

Cite

bibtex
@article{arxiv2605_09283,
  title={A Prompt-Aware Structuring Framework for Reliable Reuse of AI-Generated Content in the Agentic Web},
  author={Shusaku Egami and Masahiro Hamasaki},
  journal={arXiv preprint arXiv:2605.09283},
  year={2026},
  url={https://arxiv.org/abs/2605.09283}
}


Articles are CC BY 4.0 — feel free to quote with attribution