
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

Source: arXiv:2605.08043 · Published 2026-05-08 · By Tianfei Ren, Zhipeng Yan, Yiming Zhao, Zhen Fang, Yu Zeng, Guohui Zhang et al.

TL;DR

SCOPE addresses a core failure mode in complex text-to-image generation that the authors term the 'Conceptual Rift': even when multi-step systems retrieve information, verify outputs, and attempt repairs, the semantic requirements embedded in a prompt (called 'semantic commitments') are not tracked as unified operational units across these stages. A commitment satisfied in one step can be silently lost or misattributed in another. The paper formalizes this problem and proposes a specification-guided orchestration framework that materializes commitments as a persistent, structured data object — a triple z = (E, C, U) of entities, constraints, and unresolved unknowns — which every pipeline stage reads from and writes back to. Skills (retrieval, reasoning, repair) are invoked conditionally based on the current state of this specification rather than applied as fixed pipeline steps.

The framework, SCOPE, runs a fixed core loop (Decomposer → Synthesizer → Generator → Verifier) up to T=3 iterations, with the structured specification serving as the shared interface. Verification is item-level rather than holistic: each entity and constraint receives a pass/fail/uncertain verdict, and failures are routed either to further semantic resolution (if unknowns remain) or to targeted repair (prompt rewriting, local editing, or full regeneration). This routing logic is the central architectural novelty compared to prior self-refinement systems that use free-form critique.

To evaluate commitment-level fulfillment, the authors introduce Gen-Arena, a 300-instance human-annotated benchmark across six categories (cartoon, game, sports, entertainment, competition, ceremony) with 1,954 entities and 2,533 constraints, and a strict evaluation metric called Entity-Gated Intent Pass Rate (EGIP) that requires every entity and every dependent constraint to pass before an instance is credited. SCOPE achieves 0.60 EGIP on Gen-Arena versus 0.21 for the next-best baseline (Nano Banana Pro), and also outperforms all baselines on external benchmarks WISE-V (0.907) and MindBench (0.61).

Key findings

  • SCOPE achieves 0.60 EGIP on Gen-Arena, outperforming the next-best baseline Nano Banana Pro (0.21) by 39 percentage points; all other evaluated models score ≤ 0.07 (Table 1).
  • On WISE-V, SCOPE achieves an overall WiScore of 0.907 versus Nano Banana Pro's 0.876, a gain of 3.1 points (about 3.5% relative), and scores first in 5 of 6 reported sub-categories (Table 2).
  • On MindBench (from Mind-Brush), SCOPE achieves 0.61 overall versus Nano Banana Pro's 0.41, a 48.8% relative improvement in overall accuracy; knowledge sub-score is 0.59 and reasoning 0.63 (Table 2).
  • Ablation: removing retrieval and reasoning skills drops EGIP from 0.60 to 0.22 (near Direct single at 0.21), showing structured decomposition alone is insufficient without semantic resolution of unknowns (Table 3).
  • Ablation: removing repair skills (SCOPE w/o repair) drops EGIP from 0.60 to 0.42, an 18-point gap attributable solely to verification-guided post-generation repair (Table 3).
  • Self-refine without a structured specification — using the same 3-generation budget but free-form critique and prompt rewriting — achieves only 0.39 EGIP, essentially matching best-of-3 direct generation (0.40), indicating that free-form iterative refinement does not reliably recover from missing entity failures (Table 3).
  • High item-level entity pass rates do not predict high EGIP: Qwen-Image and Z-Image-Turbo both achieve entity-level pass > 0.83 yet EGIP ≈ 0.01–0.02, demonstrating that cascading entity-to-constraint failures dominate under a strict instance-level pass criterion (Table 4).
  • Gen-Arena contains 300 instances, 1,954 entities, 2,533 constraints, and 310 reference images across 6 categories, with all annotations done by human annotators rather than automatic pipelines.
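A back-of-the-envelope calculation shows why strict instance-level credit punishes even high per-item accuracy. The sketch below uses the Gen-Arena counts reported above and assumes, purely for illustration, that every item passes independently with the same probability. That independence assumption is not from the paper, and the function name is hypothetical.

```python
# Illustrative arithmetic only: assumes items pass independently with equal
# probability, which is a simplification and not a claim made by the paper.

INSTANCES = 300
ENTITIES = 1954
CONSTRAINTS = 2533

avg_entities = ENTITIES / INSTANCES          # about 6.5 entities per instance
avg_constraints = CONSTRAINTS / INSTANCES    # about 8.4 constraints per instance


def all_items_pass(p_item: float, n_items: float) -> float:
    """Probability that n_items independent checks all pass at rate p_item."""
    return p_item ** n_items


p = 0.83  # per-item pass rate, matching the entity-level figure quoted above
entities_only = all_items_pass(p, avg_entities)
entities_and_constraints = all_items_pass(p, avg_entities + avg_constraints)

print(f"avg entities/instance:    {avg_entities:.2f}")
print(f"avg constraints/instance: {avg_constraints:.2f}")
print(f"P(all entities pass):     {entities_only:.2f}")
print(f"P(all items pass):        {entities_and_constraints:.2f}")
```

Under these assumptions, a 0.83 per-item rate leaves roughly a 0.3 chance that all entities in an instance pass, and well under 0.1 that every entity and constraint passes. Real per-item rates differ between entities and constraints and are not independent, so this is only a rough intuition for the gap between item-level and instance-level scores.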

Threat model

n/a — SCOPE is a generative image pipeline paper, not a security or adversarial robustness paper. There is no explicit adversarial threat model. The implicit 'threat' is pipeline-internal: the Conceptual Rift describes how semantic commitments silently lose continuity across stages of the same system, not an external adversary. No user-adversarial, model-extraction, or prompt-injection threat is discussed.

Methodology — deep read

Threat model and assumptions (pipeline design perspective): The paper has no security threat model. The closest analogue is the Conceptual Rift itself: the risk that a commitment grounded in step 1 is no longer recognizable as the same unit in step 3. The framework assumes access to a capable MLLM backend (GPT-5.4), a strong image generator/editor (Nano Banana Pro), a web retrieval API (Google Search), and a capable multimodal evaluator (Gemini 3-Pro). It does not assume the base generator can handle complex scenes in one shot; instead, it assumes iterative correction is feasible within T=3 attempts.

Data — Gen-Arena benchmark: Gen-Arena is manually constructed. Human annotators write natural-language prompts and collect reference images for entities whose appearance cannot be reliably described by text alone (e.g., specific real athletes, fictional characters). Annotators then identify all required visible entities per prompt and write atomic constraints (attribute, relation, layout). Each constraint is linked to the set of entities it depends on, enabling the entity-gated evaluation cascade. The final dataset: 300 instances, 1,954 entities, 2,533 constraints, 310 reference images, spanning six categories (cartoon, game, sports, entertainment, competition, ceremony). No train/validation/test split is reported, so it appears to be a pure held-out evaluation benchmark. No public release of the data or code is mentioned in the paper.

Architecture — Structured Semantic Specification: The core data structure is z = (E, C, U): E is the set of target entities, C is the set of verifiable constraints (attribute, relation, layout), and U is the set of unresolved unknowns. Each unknown in U is owned by a prompt-level, entity-level, or constraint-level item, creating an ownership graph. When a skill resolves an unknown, it writes back to the owning node. When the verifier detects a failure, it maps that failure through the same graph to identify whether the root cause is an unresolved unknown or a pure visual realization failure. This ownership graph is the key structural innovation over prior work that tracks constraints as an unordered list.
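The specification and its ownership links can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class names (`Item`, `Unknown`, `Spec`), the field names, and the choice to treat unknowns owned by a constraint's entities as blocking that constraint are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Item:
    """An entity or constraint: one commitment the image must satisfy."""
    item_id: str
    text: str
    depends_on: List[str] = field(default_factory=list)    # entity ids (constraints only)
    details: Dict[str, str] = field(default_factory=dict)  # resolved facts written back


@dataclass
class Unknown:
    unknown_id: str
    question: str
    owner: str                      # "prompt", an entity id, or a constraint id
    answer: Optional[str] = None    # None while unresolved


@dataclass
class Spec:
    """Hypothetical container for z = (E, C, U)."""
    entities: Dict[str, Item]
    constraints: Dict[str, Item]
    unknowns: Dict[str, Unknown]
    prompt_details: Dict[str, str] = field(default_factory=dict)

    def resolve(self, unknown_id: str, answer: str) -> None:
        """Record an answer and write it back to the owning node."""
        u = self.unknowns[unknown_id]
        u.answer = answer
        if u.owner == "prompt":
            target = self.prompt_details
        elif u.owner in self.entities:
            target = self.entities[u.owner].details
        else:
            target = self.constraints[u.owner].details
        target[unknown_id] = answer

    def blocking_unknowns(self, item_id: str) -> List[Unknown]:
        """Open unknowns owned by the item or by entities it depends on."""
        owners = {item_id}
        if item_id in self.constraints:
            owners.update(self.constraints[item_id].depends_on)
        return [u for u in self.unknowns.values()
                if u.owner in owners and u.answer is None]


spec = Spec(
    entities={"E1": Item("E1", "Nouran Gohar")},
    constraints={"C1": Item("C1", "gold medal around neck", depends_on=["E1"])},
    unknowns={"U1": Unknown("U1", "What does Nouran Gohar look like?", owner="E1")},
)
before = [u.unknown_id for u in spec.blocking_unknowns("C1")]
spec.resolve("U1", "reference image retrieved")
after = [u.unknown_id for u in spec.blocking_unknowns("C1")]
print(before, after)
```

In this sketch a failed constraint with a non-empty `blocking_unknowns` list would be treated as a semantic gap, while one with an empty list would be treated as a visual realization failure.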

Architecture — Pipeline and Skills: The fixed core loop is: (1) Decomposer parses the user prompt into z₀ = (E₀, C₀, U₀) using the MLLM. (2) Synthesizer consolidates resolved information in the current specification into a coherent generation prompt s_t. (3) Generator produces image y_t from s_t (using Nano Banana Pro). (4) Verifier checks y_t against all items in I = E ∪ C, returning a per-item verdict ∈ {pass, fail, uncertain} and a textual reason. Retrieval is invoked before synthesis when a commitment requires external factual or visual evidence (e.g., what a specific athlete looks like). Reasoning is invoked when an unknown is implicit or underspecified and can be resolved by MLLM inference. Repair is invoked after verification when a failure is no longer associated with an unresolved unknown: prompt rewriting (when the synthesis prompt misrepresents the spec), local image editing (for localized defects), or full regeneration (for broad or entangled failures). The choice among repair modalities is policy-driven by the Verifier's textual reason, though the precise decision logic is not fully specified in the paper.
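The control flow described above can be sketched as a bounded loop over injected stage functions. This is a simplified illustration under stated assumptions: the specification is a plain dict, every stage is a hypothetical callable supplied by the caller, and repair is modelled as updating the specification before the next generation, which folds prompt rewriting, local editing, and regeneration into one step. The paper does not specify the routing at this level of detail.

```python
def open_owners(spec):
    """Ids of items that still own at least one unresolved unknown."""
    return {u["owner"] for u in spec["unknowns"] if not u["resolved"]}


def run_loop(prompt, decompose, resolve, synthesize, generate, verify, repair,
             max_iters=3):
    """Decompose once, then iterate synthesize -> generate -> verify (T = max_iters)."""
    spec = decompose(prompt)                   # z_0 = (E_0, C_0, U_0)
    image, verdicts = None, {}
    for t in range(1, max_iters + 1):
        if open_owners(spec):                  # semantic gap: retrieval / reasoning
            spec = resolve(spec)
        image = generate(synthesize(spec))     # s_t -> y_t
        verdicts = verify(image, spec)         # item id -> pass / fail / uncertain
        failed = [i for i, v in verdicts.items() if v != "pass"]
        if not failed:
            return image, verdicts, t
        unresolved = open_owners(spec)
        to_repair = [i for i in failed if i not in unresolved]
        if to_repair:                          # spec is complete: realization failure
            spec = repair(spec, to_repair)
    return image, verdicts, max_iters


# --- Toy stand-ins for the real stages (all hypothetical) -------------------
def toy_decompose(prompt):
    return {"items": ["A", "B"],
            "unknowns": [{"owner": "A", "resolved": False}],
            "hints": set()}


def toy_resolve(spec):
    for u in spec["unknowns"]:
        u["resolved"] = True
    return spec


def toy_synthesize(spec):
    return " ".join(spec["items"]) + " | hints: " + ",".join(sorted(spec["hints"]))


def toy_generate(text):
    return text  # the "image" is just the prompt text in this toy


def toy_verify(image, spec):
    # "B" only renders correctly once a repair hint has been added for it.
    return {i: ("pass" if i != "B" or "B" in spec["hints"] else "fail")
            for i in spec["items"]}


def toy_repair(spec, failed):
    spec["hints"].update(failed)
    return spec


image, verdicts, iterations = run_loop(
    "toy prompt", toy_decompose, toy_resolve, toy_synthesize,
    toy_generate, toy_verify, toy_repair)
print(iterations, verdicts)
```

With these toy stages the first iteration resolves the unknown for "A" and fails on "B", the repair step adds a hint, and the second iteration passes, so the loop stops before exhausting its budget.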

Training regime: SCOPE is a training-free orchestration framework — there is no model training described. All intelligence comes from the frozen MLLM backend (GPT-5.4) and generator (Nano Banana Pro). No epochs, batch sizes, or optimization procedures are reported. The maximum number of iterations T=3 is the only tunable lifecycle parameter mentioned.

Evaluation protocol: On Gen-Arena, the primary metric is EGIP (Equation 2): an instance passes only if the product of all entity pass indicators and all constraint pass indicators equals 1, i.e., a logical AND over all requirements. Gemini 3-Pro serves as the automated evaluator, judging entities and constraints item by item; when reference images are provided, the evaluator compares against them explicitly. Secondary metrics reported are Entity Pass (item-level entity satisfaction rate) and Gated Constraint Pass (constraint satisfaction rate after applying entity gates). On WISE-V, the official WiScore metric is used per category and overall. On MindBench, accuracy across knowledge and reasoning sub-tasks is used. The ablation variants (Table 3) are Direct (single), Direct (best-of-3), Self-refine w/o spec, SCOPE w/o R&R (retrieval and reasoning), SCOPE w/o repair, and full SCOPE; they share the generation backend and, apart from Direct (single), a budget of 3 attempts. No statistical significance tests (e.g., bootstrap confidence intervals) are reported.
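The entity-gated scoring can be illustrated with a few lines of code. This is a sketch based on the description above rather than the paper's evaluation code: the names are hypothetical, and treating an "uncertain" verdict as a failure is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ConstraintResult:
    name: str
    depends_on: List[str]   # ids of the entities this constraint depends on
    passed: bool            # raw verifier verdict ("uncertain" treated as False)


def gated_pass(c: ConstraintResult, entity_pass: Dict[str, bool]) -> bool:
    """A constraint counts only if it passed and all its entities passed."""
    return c.passed and all(entity_pass[e] for e in c.depends_on)


def instance_pass(entity_pass: Dict[str, bool],
                  constraints: List[ConstraintResult]) -> bool:
    """Strict AND over every entity and every gated constraint."""
    return (all(entity_pass.values())
            and all(gated_pass(c, entity_pass) for c in constraints))


Instance = Tuple[Dict[str, bool], List[ConstraintResult]]


def egip(instances: List[Instance]) -> float:
    return sum(instance_pass(e, c) for e, c in instances) / len(instances)


def entity_pass_rate(instances: List[Instance]) -> float:
    flags = [ok for e, _ in instances for ok in e.values()]
    return sum(flags) / len(flags)


def gated_constraint_pass_rate(instances: List[Instance]) -> float:
    flags = [gated_pass(c, e) for e, cs in instances for c in cs]
    return sum(flags) / len(flags)


instances: List[Instance] = [
    ({"E1": True, "E2": True},
     [ConstraintResult("C1", ["E1", "E2"], True)]),
    ({"E1": True, "E2": False},
     [ConstraintResult("C1", ["E1"], True),
      ConstraintResult("C2", ["E2"], True)]),   # gated out: E2 failed
]

print("EGIP:", egip(instances))
print("Entity Pass:", entity_pass_rate(instances))
print("Gated Constraint Pass:", gated_constraint_pass_rate(instances))
```

In the toy data, three of four entities pass, yet only one of two instances is credited, and the constraint that depends on the failed entity is gated out even though its raw verdict was a pass.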

Concrete end-to-end example (from Figure 3/4): Consider the squash final prompt: 'Nouran Gohar stands at the center of a small podium, raising a large silver trophy above her head with both hands. Hania El Hammamy stands to the right, clapping and smiling toward her. A gold medal hangs around Nouran's neck...' The Decomposer extracts entities {Nouran Gohar (E1), Hania El Hammamy (E2), silver trophy (E3), presentation podium (E4), gold medal (E5)} and constraints such as 'Nouran raises trophy [E1, E3]', 'Hania stands to the right [E2, E1]', 'gold medal around neck [E1, E5]'. Because Nouran Gohar and Hania El Hammamy are real persons whose appearance is not fully specified by name alone, U contains unknowns flagged as requiring visual evidence. Retrieval fetches reference images. The Synthesizer produces a detailed prompt incorporating retrieved appearance cues. After the first generation, if the Verifier marks 'Hania El Hammamy' as uncertain (identity not confirmed), that item is added to the failed-item set F_t; since its unknown has already been resolved (a reference was retrieved), SCOPE invokes repair rather than further retrieval, selecting local editing or regeneration depending on failure breadth. The final image passes all 5 entity checks and all 5 constraint checks, so the instance counts as an EGIP pass.
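The routing decision in this example can be written out for a single item. The sketch below encodes the entities and three of the constraints by name and routes a non-passing item either to semantic resolution or to repair. The `route` function and its return labels are hypothetical, and the choice among repair modalities is deliberately left out because the paper does not fully specify it.

```python
from typing import Dict, List, Set

ENTITIES: List[str] = [
    "Nouran Gohar", "Hania El Hammamy", "silver trophy",
    "presentation podium", "gold medal",
]

# Constraint -> entities it depends on (three of the constraints from the example).
CONSTRAINT_DEPS: Dict[str, List[str]] = {
    "Nouran raises trophy": ["Nouran Gohar", "silver trophy"],
    "Hania stands to the right": ["Hania El Hammamy", "Nouran Gohar"],
    "gold medal around neck": ["Nouran Gohar", "gold medal"],
}


def route(item: str, verdict: str, open_unknown_owners: Set[str]) -> str:
    """Return 'none', 'resolve' (retrieval/reasoning), or 'repair'."""
    if verdict == "pass":
        return "none"
    # An item is blocked if it, or an entity it depends on, owns an open unknown.
    owners = {item, *CONSTRAINT_DEPS.get(item, [])}
    if owners & open_unknown_owners:
        return "resolve"
    return "repair"


# Before retrieval: both players' appearances are still unknown.
before = route("Hania El Hammamy", "uncertain",
               {"Nouran Gohar", "Hania El Hammamy"})

# After retrieval: references were fetched, so no unknowns remain open.
after = route("Hania El Hammamy", "uncertain", set())

# A failed constraint inherits the open unknown of an entity it depends on.
constraint = route("Hania stands to the right", "fail", {"Hania El Hammamy"})

print(before, after, constraint)
```

The second call corresponds to the case described above: the identity is still uncertain after retrieval, so the item goes to repair rather than back to retrieval.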

Reproducibility: No code release or data release is mentioned. The benchmark (Gen-Arena) appears to be a new contribution but no repository link is provided in the truncated text. The framework relies on proprietary backends (GPT-5.4, Nano Banana Pro, Google Search API, Gemini 3-Pro), making full reproduction impossible without API access. Model names like 'GPT-5.4' and 'Nano Banana Pro' suggest this is a May 2026 preprint using frontier models that may not be publicly accessible at the time of reading.

Technical innovations

  • Structured Semantic Specification z = (E, C, U) with ownership links between unknowns and their parent commitments, enabling cross-stage traceability that prior agentic generation pipelines (e.g., T2I-CoPilot, GEMS) lack because they use intervention-specific intermediate representations.
  • Verification-guided repair routing that distinguishes semantic gaps (unresolved unknowns → retrieval/reasoning) from visual realization failures (spec is complete but image violates it → repair), rather than applying repair uniformly as in self-refinement approaches (e.g., Li et al. 2025b).
  • Entity-Gated Intent Pass Rate (EGIP), a strict instance-level pass criterion that gates constraint evaluation on entity satisfaction, explicitly modeling the prerequisite dependency structure that holistic alignment scores and independent checklist protocols (as used by prior benchmarks) do not capture.
  • Gen-Arena benchmark with human-annotated entity-constraint dependency graphs and reference images for identity-grounded entities, providing a structured evaluation resource for complex multi-entity scene generation across six real-world categories.
  • Conditional skill invocation anchored to specific unresolved or violated items in the specification, contrasting with fixed-pipeline multi-step methods where retrieval, reasoning, and repair are applied uniformly regardless of per-commitment status.

Datasets

  • Gen-Arena — 300 instances, 1,954 entities, 2,533 constraints, 310 reference images, 6 categories — human-annotated, introduced by this paper, no public release mentioned
  • WISE-V — size not specified in truncated text — Niu et al. 2025, external benchmark for world-knowledge image generation
  • MindBench (from Mind-Brush) — size not specified in truncated text — He et al. 2026a, external benchmark for reasoning-intensive visual generation

Baselines vs proposed

  • Nano Banana Pro (direct): Gen-Arena EGIP = 0.21 vs SCOPE: 0.60
  • Nano Banana (direct): Gen-Arena EGIP = 0.07 vs SCOPE: 0.60
  • FLUX.1-dev (direct): Gen-Arena EGIP = 0.01 vs SCOPE: 0.60
  • Qwen-Image (direct): Gen-Arena EGIP = 0.02 vs SCOPE: 0.60
  • Direct (best-of-3): Gen-Arena EGIP = 0.40 vs SCOPE: 0.60
  • Self-refine w/o spec: Gen-Arena EGIP = 0.39 vs SCOPE: 0.60
  • SCOPE w/o R&R: Gen-Arena EGIP = 0.22 vs SCOPE: 0.60
  • SCOPE w/o repair: Gen-Arena EGIP = 0.42 vs SCOPE: 0.60
  • Nano Banana Pro (direct): WISE-V overall WiScore = 0.876 vs SCOPE: 0.907
  • GPT-Image-1.5 (direct): WISE-V overall WiScore = 0.825 vs SCOPE: 0.907
  • Nano Banana Pro (direct): MindBench overall = 0.41 vs SCOPE: 0.61
  • Mind-Brush (He et al. 2026a): MindBench overall = 0.31 vs SCOPE: 0.61
  • Nano Banana Pro: Gen-Arena Entity Pass = 0.82, Gated Constraint Pass = 0.59 vs SCOPE: Entity Pass = 0.92, Gated Constraint Pass = 0.83

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.08043.

Fig 1: Examples generated by SCOPE across knowledge-intensive events, reference-heavy intellectual properties, …

Fig 2: Overview of SCOPE. The user prompt is decomposed into an evolving structured semantic specification …

Fig 3: Summary of the Gen-Arena construction.

Figs 4–8 (page 5): no captions available.

Limitations

  • High inference-time cost: SCOPE requires up to 3 full iterations, each involving MLLM calls (Decomposer, Synthesizer, Verifier), image generation/editing, and optionally retrieval — latency and API cost are substantially higher than one-shot generation; no wall-clock time or token-cost figures are reported.
  • Verifier error propagation: the entire repair routing depends on item-level verification accuracy; false negatives (missed failures) leave genuine violations unaddressed, while false positives (hallucinated failures) trigger unnecessary and potentially destructive repairs; verifier calibration is acknowledged but not measured.
  • Closed ecosystem: SCOPE is built entirely on proprietary, non-reproducible backends (GPT-5.4, Nano Banana Pro, Google Search API, Gemini 3-Pro); results cannot be reproduced by independent researchers without identical API access, and performance will change as these models are updated.
  • Gen-Arena dataset is not publicly released (based on available text), limiting community benchmarking and independent validation of the EGIP metric's reliability.
  • No ablation on the number of iterations T or the sensitivity of results to T=3; it is unclear whether gains plateau after T=1 or T=2, or whether extending to T>3 would provide further improvement.
  • The repair modality selection policy (prompt rewrite vs. local edit vs. regeneration) is described qualitatively but the decision logic is not fully specified, making it unclear whether this routing is rule-based, LLM-decided, or heuristic — this is a reproducibility gap.
  • Gen-Arena is human-annotated by the same research group that proposes SCOPE, introducing potential benchmark-to-method alignment bias; no independent annotation validation (e.g., inter-annotator agreement scores) is reported in the available text.

Open questions / follow-ons

  • Verifier calibration and hallucination rates: how often does the Gemini 3-Pro verifier produce false positives or false negatives on entity/constraint checks, and how much of the EGIP ceiling is attributable to verifier error rather than generation failure?
  • Generalization to open-domain prompts: Gen-Arena covers six curated categories; it is unknown whether SCOPE's EGIP advantage holds for truly open-domain user prompts where commitment decomposition is more ambiguous and retrieval targets are less well-defined.
  • Efficiency-accuracy tradeoff: can selective skill invocation (e.g., invoking retrieval only for named entities, skipping repair for low-severity failures) recover most of the EGIP gain while reducing inference cost to near one-shot levels?
  • Scalability of the specification: for very long or compositionally complex prompts (e.g., 20+ entities, 50+ constraints), does the MLLM-driven Decomposer produce reliable specifications, and does the ownership graph remain tractable for verification routing?

Why it matters for bot defense

For bot-defense and CAPTCHA engineers, SCOPE is most directly relevant as a demonstration of how structured commitment tracking can enable reliable multi-constraint verification, a capability that mirrors the challenge-response evaluation problem. CAPTCHA systems frequently need to verify that a solver has satisfied multiple simultaneous constraints (select all images containing X where Y is also present, or complete a visual task with specific spatial relationships). The EGIP metric's entity-gated strict-AND pass criterion is essentially a formalization of what a CAPTCHA grader does: a response is only correct if all sub-requirements are simultaneously satisfied. The paper's finding that high item-level pass rates (>83% entity pass for Qwen-Image and Z-Image-Turbo) can coexist with near-zero instance-level pass rates (EGIP ≈ 0.01–0.02) is a direct quantitative illustration of why holistic similarity scores are inadequate for grading complex challenges, a point often made informally in the CAPTCHA literature.

More speculatively, SCOPE's architecture could inform the design of generative CAPTCHA challenges where the goal is to generate prompts that are difficult for current one-shot generators to satisfy under strict multi-constraint evaluation. The main results show that even a strong frontier model (Nano Banana Pro) fails 79% of Gen-Arena instances under EGIP, while SCOPE's orchestration framework passes 60%, leaving a 40% residual failure rate even with the best agentic approach evaluated. This gap could be exploited as a difficulty calibration lever: prompts requiring simultaneous entity identity grounding, relational constraints, and layout constraints remain hard even for sophisticated multi-step solvers. However, SCOPE also demonstrates that with sufficient API calls and retrieval access, many such challenges become solvable, which is a warning for CAPTCHA designers who rely on visual complexity alone without restricting solver resources.

Cite

bibtex
@article{arxiv2605_08043,
  title={SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation},
  author={Tianfei Ren and Zhipeng Yan and Yiming Zhao and Zhen Fang and Yu Zeng and Guohui Zhang and Hang Xu and Xiaoxiao Ma and Shiting Huang and Ke Xu and Wenxuan Huang and Lionel Z. Wang and Lin Chen and Zehui Chen and Jie Huang and Feng Zhao},
  journal={arXiv preprint arXiv:2605.08043},
  year={2026},
  url={https://arxiv.org/abs/2605.08043}
}


Articles are CC BY 4.0 — feel free to quote with attribution