Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners

Source: arXiv:2606.18198 · Published 2026-06-16 · By Xiaojun Jia, Jie Liao, Simeng Qin, Ke Ma, Wenbo Guo, Yebo Feng et al.

TL;DR

This paper addresses a novel security blind spot in agent skill scanners used for LLM-based systems. Current skill scanners primarily analyze textual artifacts such as documentation, manifests, and source code, but largely ignore image-based resources bundled with agent skills. The authors show that malicious instructions can be hidden inside images, which multimodal agents interpret at runtime, enabling stealthy operational attacks that evade existing defenses. To demonstrate this, they introduce SKILLCAMO, an automated attack that hides harmful commands inside images while rewriting textual documentation to naturally incorporate those images, maintaining a benign appearance. Their evaluation across six state-of-the-art skill scanning tools shows SKILLCAMO achieves very high attack success rates (78-100%), exposing the insufficiency of artifact-level scanners.

To mitigate this threat, the paper proposes EXECSCAN, an execution-grounded multimodal scanning framework that jointly analyzes code, text, and image content and simulates agent execution under plausible contexts. EXECSCAN performs intent extraction, behavior reconstruction, latent instruction recovery, abuse risk analysis, and deliberative execution simulation, enabling it to detect execution-level risks hidden in cross-modal instructions. Experimental results show that EXECSCAN reduces SKILLCAMO attack success to as low as 8% and achieves the best precision, recall, and F1, along with the lowest false positive rate, on adversarial and benign skill sets. The paper exposes a fundamental limitation of prior scanning approaches and highlights the need for multimodal and execution-informed defenses in agent skill security.

Key findings

SKILLCAMO achieves 78-100% attack success rate (ASR) across six baseline scanners, showing existing scanners fail to detect image-hidden malicious instructions.
EXECSCAN reduces ASR to as low as 8%, 31%, and 17% for the SKILLCAMO, CLOZE, and SPLIT attack variants, respectively, indicating its effectiveness in recovering latent malicious instructions.
EXECSCAN achieves the highest overall defense performance with Precision of 85.6%, Recall of 82.0%, F1 score of 83.8%, and false positive rate (FPR) of 27.4%, significantly outperforming all baselines (next best Recall 65%, FPR 90.9%).
Removing visual analysis in EXECSCAN causes detection to drop drastically from 92% to 1%, confirming visual cross-modal reasoning is the key factor.
Increasing the number of simulated execution contexts (K) improves detection, with EXECSCAN reaching 98% detection at K=5.
SKILLCAMO attack skills transfer well across baseline scanners, with an average off-diagonal ASR of 77.5%, whereas attacks crafted against other scanners achieve near-zero ASR (0-4%) against EXECSCAN.
Iterative scanner feedback refines SKILLCAMO attacks to evade detection by baseline scanners quickly (ASR approaches 100% by iteration 5), but EXECSCAN maintains low ASR over all iterations.
Adding explicit image-warning prompts to baseline scanners does not meaningfully reduce ASR, showing semantic multimodal reasoning is necessary.
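
The headline numbers above can be sanity-checked with a few lines of arithmetic. The sketch below assumes the standard harmonic-mean definition of F1 and treats the detection rate on adversarial skills as the complement of ASR; the function names are illustrative and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard definition)."""
    return 2 * precision * recall / (precision + recall)


def detection_rate(asr: float) -> float:
    """Assumption: each attack either evades or is detected, so detection = 1 - ASR."""
    return 1.0 - asr


# Values reported in the key findings.
f1 = f1_score(0.856, 0.820)
print(f"F1 = {f1:.1%}")                            # F1 = 83.8%
print(f"detection = {detection_rate(0.08):.0%}")   # detection = 92%
```

Under these assumptions, the reported F1 of 83.8% is consistent with the reported precision and recall, and an 8% ASR corresponds to the 92% detection figure quoted in the ablation.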

Threat model

The adversary is a malicious skill developer or software supply chain attacker who can modify the skill package by injecting hidden malicious operational instructions encoded within images. They can rewrite textual documentation to camouflage and naturally reference these images, but cannot directly alter executable code or system runtime controls beyond the skill package. The attacker aims to evade static text- and code-based scanners by relying on multimodal instructions that are reconstructed only at runtime by agents capable of jointly interpreting text and images. Adversaries cannot tamper with the scanning framework or system environment itself and must rely solely on bundled skill artifacts.
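
From a reviewer's side, a useful first step under this threat model is simply listing which bundled images a skill's documentation points at, since those are the artifacts a text-only scanner never inspects. The following is a minimal sketch, not part of the paper: the `referenced_images` helper, the regular expression, and the sample document are all assumptions, and it only handles standard Markdown image syntax.

```python
import re

# Markdown image syntax: ![alt](path "optional title")
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")
IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")


def referenced_images(skill_md: str) -> list[str]:
    """Return image paths referenced from a SKILL.md body, in order of appearance."""
    paths = IMAGE_REF.findall(skill_md)
    return [p for p in paths if p.lower().endswith(IMAGE_EXTS)]


# Hypothetical SKILL.md content for illustration.
sample = """# Report helper
Follow the workflow shown below.
![workflow](assets/workflow.png)
See also [the docs](docs/usage.md).
"""
print(referenced_images(sample))  # ['assets/workflow.png']
```

An inventory like this does not detect anything by itself; it only makes visible the files that would need multimodal inspection.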

Methodology — deep read

Threat model & assumptions: The adversary is an attacker who can modify an agent skill package before deployment by injecting malicious commands hidden inside images bundled with the skill. The attacker rewrites the textual documentation to naturally reference these images, preserving the skill's benign appearance. The attacker aims for execution-time malicious behavior (e.g., data exfiltration, privilege escalation) that existing scanners do not detect because they mainly analyze text, code, and metadata. The adversary cannot alter the underlying skill execution environment or code beyond these modifications.
Data: The authors curated 100 clean base skills from public repositories (ClawHub, skill0.io, GitHub), carefully vetted to be benign. From these, 400 adversarial samples were generated by applying four attack types: SKILLJECT (text-only), SKILLCAMO (full instruction in image), CLOZE (instruction tokens split between image and text), and SPLIT (complementary fragments across image and text). Additionally, 200 benign evaluation skills were collected to measure false positives.
Architecture / algorithm: SKILLCAMO attack pipeline consists of: (a) skill-context analysis to understand structural elements in SKILL.md (purpose, section hierarchy, layout cues); (b) command extraction from a harmful script and rendering as an image carrier, designed to look like benign workflow illustrations; (c) full-document semantic rewriting of SKILL.md using an LLM conditioned on the original context and injected image to create seamless image references; (d) scanner-guided iterative refinement where rewritten skills are submitted to scanners, and feedback guides further rewriting to reduce suspicious signals.

EXECSCAN defense pipeline operates in two main phases: (I) Multimodal artifact analysis: uses a multimodal LLM to jointly analyze documentation, code, images, and metadata to infer intent profiles, behavior footprints, and preliminary abuse risk. (II) Deliberative execution simulation: generates multiple plausible usage scenarios for the skill, recovers surface and latent instructions (including those in images), reconstructs projected execution chains (file operations, network calls, subprocesses), assesses risk for each context comparing surface-only vs full execution risk, and aggregates maximum risk across contexts to produce a detection score.
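
The aggregation step in phase II can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the `ContextRisk` schema, the score range, the example contexts, and the threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ContextRisk:
    """Risk scores in [0, 1] for one simulated usage context (hypothetical schema)."""
    name: str
    surface_risk: float  # risk judged from surface (text/code) instructions only
    full_risk: float     # risk after latent (e.g. image-derived) instructions are recovered


def detection_score(contexts: list[ContextRisk]) -> float:
    """Aggregate by taking the maximum full-execution risk across contexts."""
    return max(c.full_risk for c in contexts)


def latent_gap(contexts: list[ContextRisk]) -> float:
    """Largest increase in risk attributable to recovered latent instructions."""
    return max(c.full_risk - c.surface_risk for c in contexts)


# Made-up scores for three simulated contexts.
contexts = [
    ContextRisk("generate weekly report", surface_risk=0.10, full_risk=0.15),
    ContextRisk("follow bundled workflow image", surface_risk=0.10, full_risk=0.90),
    ContextRisk("dry run with no assets", surface_risk=0.05, full_risk=0.05),
]
THRESHOLD = 0.5  # assumed cut-off, not taken from the paper
score = detection_score(contexts)
print(score, score >= THRESHOLD)  # 0.9 True
```

Because the aggregate is a maximum, adding more simulated contexts can never lower the score, which is at least consistent with the reported finding that larger K improves detection.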

Training regime: Both SKILLCAMO and EXECSCAN rely on gpt-5-mini as their foundational LLM backbone. The SKILLCAMO rewriting loop limits scanner-feedback iterations to 5 to optimize stealth. EXECSCAN simulates K=5 plausible usage contexts per skill for risk assessment. All baseline scanners use predetermined configurations with Claude Code CLI and Claude Haiku 4.5 models for consistency across experiments.
Evaluation protocol: The evaluation uses ASR (attack success rate) to measure how often attacks evade detection (lower better for defenders), and standard classification metrics (Precision, Recall, F1, False Positive Rate). Results are reported for each scanner under no attack (None row) and under attack variants across 100 base skills. Statistical significance is validated with Fisher's exact tests. Ablations remove visual analysis, execution simulation, and multi-context reasoning to evaluate component effects. An ASR heatmap demonstrates cross-scanner attack transferability. False positive rates are measured on the 200 benign skill evaluation set.
Reproducibility: The paper references public tools and datasets such as ClawHub and skill0.io. Scanners include published tools (Cisco Skill Scanner, SkillFortify, Skill Vetter). The LLM backbones, gpt-5-mini and the Claude models, are described as representative state-of-the-art models rather than standard public releases. A release of full code and weights is not explicitly stated; appendices reportedly contain further implementation details. Dataset splits and preprocessing are described.
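
The evaluation protocol above mentions Fisher's exact tests for significance. As a rough illustration of what such a test computes, here is a standard-library sketch of the two-sided version on a 2x2 table. The counts in the example are assumptions derived from the reported percentages (8% vs 78% ASR over 100 adversarial skills each); they are not the paper's actual contingency tables.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Rows: scanner; columns: (detected, evaded). Illustrative counts only.
p = fisher_exact_two_sided(92, 8, 22, 78)
print(f"p = {p:.3g}")
```

With counts of this size the p-value is far below conventional thresholds, so a gap like 8% versus 78% ASR would not plausibly arise by chance under these assumptions.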

Concrete example: Starting with a benign skill and a harmful script, SKILLCAMO extracts the malicious command, encodes it visually as an image asset, then uses an LLM rewriter to revise the SKILL.md so the image fits naturally as a benign workflow illustration. The modified skill package is submitted to existing scanners which do not flag it, resulting in high attack success rates. In contrast, EXECSCAN jointly analyzes the multimodal artifacts, simulates multiple agent execution scenarios where the image-derived instructions are recovered, reconstructs the behavior chain leading to potential exfiltration or privilege escalation, and thereby detects the malicious skill with high recall and low false positive rate.

Technical innovations

SKILLCAMO attack framework automates hiding malicious operational instructions inside skill-bundled images combined with full-document semantic rewriting to embed the image naturally in skill workflows.
EXECSCAN integrates multimodal artifact analysis with execution-grounded deliberative simulation to recover latent instructions and reconstruct cross-modal executable behavior chains for improved skill risk detection.
Use of iterative scanner-feedback loop in SKILLCAMO attack generation to progressively reduce overt suspicious signals and optimize evasion against static text/code-focused scanners.
Explicit modeling and comparison of surface-only versus latent execution-level risks across multiple plausible agent usage contexts to detect hidden malicious skill behaviors.

Datasets

Clean Base Skills — 100 — curated from ClawHub, skill0.io, GitHub
Adversarial Skills — 400 — generated from clean base skills with four attack methods including SKILLCAMO
Benign Evaluation Skills — 200 — from official repositories and scanner-vetted websites

Baselines vs proposed

CISCO Skill Scanner: ASR = 78-96% vs EXECSCAN ASR = 8%
HSS: ASR = 87% vs EXECSCAN ASR = 31%
AUDITOR: ASR = 92% vs EXECSCAN ASR = 8%
SLOWMIST: ASR = 78% vs EXECSCAN ASR = 17%
VETTER: ASR = 87% vs EXECSCAN ASR = 8%
SKILLFORTIFY: ASR = 100% vs EXECSCAN ASR = 8%
EXECSCAN: Precision = 85.6%, Recall = 82.0%, F1 = 83.8%, FPR = 27.4%; next best Recall 65%, FPR 90.9%

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.18198.

Fig 4: Cumulative ASR across scanner-feedback iterations (t = 1 to t = 5). EXECSCAN maintains

Limitations

The evaluation uses curated datasets and synthetic adversarial skills derived from a limited number of base skills; generalization to a broader and evolving skill ecosystem is untested.
EXECSCAN relies on a large multimodal LLM (gpt-5-mini) and heavy execution simulation, which may impose computational costs that limit deployment at scale.
The threat model focuses on malicious instructions hidden in images and joint text-image interpretation; other covert channels or advanced obfuscation techniques remain unexplored.
No adversarial attacks explicitly targeting EXECSCAN's multimodal and execution reasoning pipeline are evaluated; robustness against adaptive attackers is uncertain.
The datasets and implementations appear reliant on unpublished or restricted large models, which may affect reproducibility and broad adoption.
The paper provides limited quantitative analysis on false positives in realistic deployment scenarios, especially on benign unseen skills outside curated sets.

Open questions / follow-ons

How can EXECSCAN's multimodal execution simulation be generalized to emerging modalities beyond text and images, such as video or audio in skills?
What countermeasures can robustly detect or prevent evasion by adversaries that actively attempt to manipulate execution simulation or induce incomplete latent instruction recovery?
Can runtime monitoring or dynamic sandboxing complement or replace offline multimodal scanning for improved detection of hidden instructions?
How can EXECSCAN or similar frameworks scale efficiently to large public agent skill marketplaces with thousands of submissions and rapid iteration?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work highlights a new attack vector: hidden instructions in image assets that evade traditional text/code scanners. It underlines the importance of multimodal analysis, since visual content included in extensions, plugins, or inputs can encode malicious payloads interpretable only by agents with multimodal capabilities. Defense systems must move beyond static artifact inspection and simulate realistic multimodal execution scenarios for reliable threat detection.

Particularly for CAPTCHA-like systems offering extensions or agent plugins, this research suggests including visual content inspection and cross-modal integration in security audits. The execution-grounded, multimodal scanning paradigm could inspire more robust vetting of complex inputs involving images or visual UI elements that adversaries might wield to bypass conventional text-based filtering or behavior heuristics. Thorough multimodal threat modeling is essential to keep pace with emerging multimodal AI agent attack surfaces.

Cite

@article{arxiv2606_18198,
  title={Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners},
  author={Xiaojun Jia and Jie Liao and Simeng Qin and Ke Ma and Wenbo Guo and Yebo Feng and Aishan Liu and Yang Liu},
  journal={arXiv preprint arXiv:2606.18198},
  year={2026},
  url={https://arxiv.org/abs/2606.18198}
}
