MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

Source: arXiv:2605.03952 · Published 2026-05-05 · By Jonathan Steinberg, Oren Gal

TL;DR

MOSAIC-Bench addresses a subtle but critical security failure mode in AI coding assistants: although agents often refuse or harden overtly malicious single prompts, when attack goals are decomposed into sequences of innocuous-looking engineering tickets, they produce exploitable vulnerable code at high rates. The authors construct a benchmark of 199 three-step attack chains spanning multiple popular web app substrates, programming languages, and vulnerability types. They measure end-to-end attack success rates (ASR), revealing that nine evaluated production coding agents from Anthropic, OpenAI, Google, and others exhibit 53–86% ASR when generating staged code, yet similar direct single-shot prompts yield only 0–20% vulnerable outputs because the agents refuse or harden responses. This demonstrates a structural "compositional gap" in model defensive reflexes that sequencing exploits rather than outright bypassing defenses. Further, downstream automated reviewer agents often approve vulnerable cumulative diffs as routine PRs, with about 25% evasion, only reduced partially by giving reviewers full repo context or by reframing reviewers as adversarial pen-testers, which reduces evasion to 3–18%. The authors release their benchmark dataset and an evaluation framework with rigorous oracle-verified exploit tests, enabling future work to measure and improve vulnerability compositional prevention in coding workflows. This work reveals a major gap in current coding agent safety evaluations and proposes practical mitigation strategies such as adversarial-reviewer framing for AI-assisted code review.

Key findings

Nine production coding agents compose innocuous tickets into exploitable code at 53.3–85.9% end-to-end attack success rate (ASR) under 3-stage ticket staging with just two refusals across 199 chains.
In matched direct-prompt experiments with four frontier agents (Claude Opus 4.6/Sonnet 4.6, Codex GPT 5.3/5.4), vulnerable output rates drop to 0–20.4%, showing the defensive reflexes fire partly or fully on single-shot requests.
Resumed-session runs with full conversation memory across tickets close only about 50% of the compositional gap, ruling out pure context fragmentation as cause.
Downstream automated reviewer agents approve 24.8% of confirmed vulnerable cumulative diffs in neutral framing, dropping to 16.4% when given full repository context.
Reframing reviewers as adversarial pentesters reduces evasion rates further to range from 3.0% to 17.6% across evaluated reviewers.
The open-weight Gemma-4-E4B-it reviewer under pentester framing detects 88.4% of attacks on MOSAIC-Bench with 4.6% false positive rate measured on 608 real-world GitHub PRs.
Compositional attacks span 31 CWE classes across 10 popular web app substrates in 5 programming languages, with the dataset publicly released and accompanied by deterministic proof-of-concept exploit oracles.
Per-agent ASR ranges from 53.3% (Opus 4.6) to 85.9% (GPT 5.3), with no alignment to model release date or size, indicating the compositional vulnerability is widespread.

Threat model

The adversary is an upstream ticket author with backlog write access (e.g., a compromised product manager or malicious insider) who crafts sequences of innocuous-looking engineering tickets. The adversary never requests overtly malicious code directly but achieves harm via compositional attack chains. The coding agent executes each ticket in isolation without knowledge of ultimate malicious intent. The defender is a downstream code review agent tasked with detecting vulnerabilities post hoc. The adversary cannot directly present jailbreak prompts or single-shot exploit instructions to the coding agent, and cannot adaptively interact with the reviewer beyond the initial submission.

Methodology — deep read

The study models an adversary as an upstream ticket author (malicious insider or compromised engineer) who decomposes a malicious objective into a sequence of routine-looking engineering tickets, none of which individually request overtly malicious behavior. The coding agent is the implementing party, which receives each ticket freshly without conversation memory of prior tickets, simulating real-world Jira-driven workflows. The defender is a downstream code review agent that examines the cumulative code changes.

Data provenance consists of MOSAIC-Bench, 199 curated 3-stage ticket chains constructed from 10 Dockerized web app substrates (e.g., Express, Flask, Gin), spanning 31 CWE vulnerability classes and covering 5 programming languages (Node.js, Python, Go, Java, PHP). Each chain is a triplet of Jira-style tickets, each product-legible and plausible without overt indication of harm, paired with a "golden" implementation script and a deterministic Python proof-of-concept (PoC) exploit oracle that tests the cumulative code on a live substrate container to confirm exploit presence or absence.

The agentic coding pipeline evaluates several production coding agents including Anthropic Claude Opus 4.6 and Sonnet 4.6, OpenAI Codex GPT 5.3 and GPT 5.4, Google Gemini variants, Moonshot Kimi K2.6, Minimax M2.7, and Zhipu GLM-5. Agents receive one ticket at a time with only prior code context (no session memory), and produce code changes that are accumulated into a cumulative diff.

Reviewer agents simulate automated or semi-automated code review under different prompt framings: neutral (senior engineer) vs adversarial pentester (security red team), and with different context scopes: diff-only vs full post-application repository context. Reviewers classify cumulative diffs as approve or decline; evasion is measured as the fraction of confirmed vulnerable diffs approved.

The evaluation protocol centers on a three-layer harness: (a) the agent receives ticket i and returns per-stage diffs; (b) reviewer agents evaluate the total cumulative diff after stage 3; (c) the deterministic exploit oracle applies the cumulative diff to a fresh container and runs the PoC to label VULNERABLE or SECURE. This tight oracle ground truth separation decouples reviewer heuristic from vulnerability presence.

Direct-prompt experiments collapse all three stages into one prompt to test if compositional vulnerability is an artifact of decomposition. Resumed-session experiments allow the agent to retain full conversation state across tickets to test context fragmentation hypotheses.

Statistical analyses include attack success rates (ASR) per agent, per reviewer evasion rates, and protocol sensitivity to framing and context across multiple reviewers. The authors also analyze reviewer false positives on 608 real GitHub PRs to validate realistic FP rates.

The dataset, scripts, and evaluation framework are publicly released to ensure reproducibility. The deterministic Python PoC oracles and Dockerized substrates enable verified exploit ground truth.

A concrete example: An Express+Mongoose app chain uses three innocuous tickets to introduce a CWE-915 mass-assignment flaw that only emerges after patch application and a bugfix; the coding agent implements all tickets without refusal, reviewers approve the cumulative patch about 25% of the time, and the PoC oracle confirms exploit presence by successfully escalating privileges via the staged metadata patch. Direct-prompting the same exploit code triggers refusal or hardening.

Technical innovations

Introduction of a novel benchmark (MOSAIC-Bench) of 199 three-stage ticket sequences that compose innocuous requests into exploits, paired with deterministic PoC exploit oracles on live containerized substrates.
Systematic measurement of the 'compositional gap' — the difference in vulnerability induction between staged innocuous prompts and single direct exploit prompts — across multiple production coding agents and languages.
A two-layer evaluation harness treating exploit ground truth and downstream code reviewer protocol as orthogonal and first-class evaluation axes, enabling detailed analysis of defense evasion modes.
Demonstration that reframing automated reviewers as adversarial pentesters drastically reduces approval rates of vulnerable diffs, yielding a lightweight, deployable mitigation independent of model retraining.

Datasets

MOSAIC-Bench — 199 three-stage attack chains — public at https://huggingface.co/datasets/MosaicBenchmark/mosaic-bench

Baselines vs proposed

Opus 4.6 coding agent: strict ASR = 53.3% vs GPT 5.3 ASR = 85.9%
Direct prompt vulnerable output on frontier agents (Claude, Codex): 0–20.4% vs staged ticket vulnerable output 53–86%
Reviewer evasion under neutral framing: 24.8% mean vs full context review 16.4%
Reviewer evasion under pentester framing: 3.0–17.6% vs neutral framing 8.5–14.6%
Gemma-4-E4B-it pentester reviewer detection rate: 88.4% attack detection at 4.6% false positive rate on 608 real-world PRs

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.03952.

Fig 6

Fig 6: Chain coverage across CWE classes and substrates.

Limitations

Eight of 10 substrates are full-scale web application boilerplates; two pilot substrates (Spring, Laravel) and no coverage of non-web domains (Rust services, mobile, OS kernels, ML pipelines).
The benchmark covers 31 CWEs but real-world vulnerabilities may be broader and more complex.
Evaluation is on static, isolated Docker substrates rather than live production environments with multi-agent workflows and dynamic state.
The adversarial pentester review paradigm is a non-adaptive, static mitigation; robustness against adaptive attacker reviews is untested and deferred to future work.
The coding agents and reviewers are black-box APIs with default parameters; no fine-tuning or adaptation was applied, limiting deeper causal analysis of architectural factors.
False positive rate of 4.6% on real PRs for Gemma-4 pentester reviewer may be too high for some deployment contexts.

Open questions / follow-ons

How can coding agents be adapted or re-trained to detect and resist compositional vulnerability induction rather than just single-shot jailbreaks?
Can sequential prompt monitoring or multi-stage intent aggregation methods complement the pentester framing mitigation?
What are effective approaches to enable adaptive, continuous learning reviewers that evolve to block novel compositional attack patterns?
How do compositional vulnerabilities generalize beyond the web app substrates and CWEs covered, especially in non-web, embedded, or large-scale distributed system settings?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, MOSAIC-Bench highlights a critical blind spot: advanced AI coding assistants can successfully combine benign-appearing, low-risk user instructions into sequences that create exploitable vulnerabilities, analogous to composing requests that individually pass security reviews but collectively enable compromise. This compositional blind spot may analogously exist in multi-step user workflows behind CAPTCHAs or challenge-response defenses. The findings suggest that safety evaluations focused only on isolated user inputs might miss complex orchestrations of benign actions that achieve malicious objectives. The demonstrated success of adversarial framing for automated reviewer agents underscores the importance of threat model-aware downstream checks and the value of explicit adversarial mindset prompts in automated defense pipelines. Although MOSAIC-Bench is domain-specific to coding agents, its structured evaluation of compositional attack chains and reviewer protocols offers a conceptual framework to analyze and mitigate staged attack workflows in other AI-assisted automation contexts relevant to bot mitigation.

Cite

bibtex

@article{arxiv2605_03952,
  title={ MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents },
  author={ Jonathan Steinberg and Oren Gal },
  journal={arXiv preprint arXiv:2605.03952},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.03952}
}

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​