GenClaw: Code-Driven Agentic Image Generation

Source: arXiv:2605.30248 · Published 2026-05-28 · By Junyan Ye, Jun He, Zilong Huang, Dongzhi Jiang, Xuan Yang, Rui Chen et al.

TL;DR

GenClaw addresses a key limitation of current agentic image generation systems, which rely heavily on black-box end-to-end pixel synthesis models controlled primarily via natural language prompts. These agents often become prompt optimizers stuck in trial-and-error loops without direct canvas manipulation abilities. GenClaw introduces a novel code-driven agentic paradigm that decomposes the image generation process into three human-like stages: conceptualization, sketching via executable code (SVG, HTML, Three.js), and final photorealistic coloring via image generation models. This code-based intermediate representation acts as a transparent, editable "digital brush" enabling precise spatial layout, quantity control, text rendering, physical simulation, and layered editing that pure natural language cannot specify reliably. The pipeline thus transforms image generation from black-box pixel guessing to a white-box staged workflow with intrinsic interpretability and fine-grained control. Experimental results across compositional reasoning benchmarks (GenEval++), long text rendering (LongText-Bench), and editing tasks demonstrate that GenClaw outperforms prior SOTA open-source and proprietary agentic systems, with particular gains in object counting accuracy, spatial reasoning, and text layout fidelity. The layered modular design allows clear fault localization and interactivity. Overall, GenClaw is a pioneering step toward highly controllable, interpretable, and agentic visual synthesis that synergizes LLM coding capabilities with generative image models’ texture synthesis strengths.

Key findings

GenClaw achieves an overall GenEval++ score of 0.878, surpassing previous agentic models like Mind-Brush (0.782) and GenAgent (0.725), and close to proprietary Gemini-3.1 Flash-Image (0.775).
On GenEval++ object counting, GenClaw scores 0.950 compared to 0.700+ for prior agentic models and below 0.950 for top closed-source baselines.
GenClaw improves on spatial relation tasks (Pos/Count 0.925, Pos/Size 0.825) by using explicit SVG-based code layouts, significantly better than text-driven methods scoring below ~0.7.
LongText-Bench results show GenClaw excels in English and Chinese long-text rendering with accuracy 0.989 and 0.988, surpassing strong multimodal models like Qwen-Image (0.943, 0.946) and Gemini-3.0 Pro-Image (0.981, 0.949).
Code-based intermediate sketches enable precise text layout, eliminating spelling errors typical in pure pixel-based generation.
GenClaw’s structured JSONL layered editing reduces unintended pixel corruption during localized editing compared to traditional image editing approaches.
Agent’s review module allows tracing errors to conceptual, code, or pixel layers, improving interpretability and debuggability.
Using Three.js and Python scripts enables physical simulation features like mirror reflection in composite scenes, an advancement beyond black-box generation.

Threat model

n/a — The paper does not focus on adversarial or security threats. The main challenge addressed is inherent ambiguity and lack of explicit operational control in black-box image generation agents rather than adversarial attacks.

Methodology — deep read

Threat model & assumptions: The adversary is not explicitly defined as this is not a security-focused paper, but the system assumes trustworthy inputs and focuses on controllability rather than adversarial robustness. The key challenge addressed is how to empower agents with better operational control over visual structures beyond prompt engineering. The adversary is essentially the inherent ambiguity and stochasticity in natural language driven black-box image generation.
Data: The system is evaluated on several standard image generation benchmarks—GenEval++ (complex composition with counting and spatial relation labels), LongText-Bench (long text rendering in English and Chinese), ImgEdit (editing tasks), and Mind-Bench (world knowledge and reasoning). Exact dataset sizes are not detailed, but these are public or well-known benchmarks. No additional proprietary data appears used. Preprocessing involves parsing user prompts and reference information for the cognitive layer.
Architecture / algorithm: GenClaw is structured into three layers: (1) Cognitive Structuring Layer uses a large language model (Claude-ops-4.6) plus external knowledge and search to parse user intent into structured JSONL records including objects, text, spatial constraints, and reasoning results; (2) Executable Canvas Layer translates these structured records into code sketches using SVG for 2D layout and text, HTML/CSS for text-heavy layouts, Three.js scripts for 3D physical scenarios, and JSONL for layered representations; the code is executed to produce intermediate renderings capturing spatial composition and object counts explicitly; (3) Visual Generation and Review Layer employs an image generation model (Gemini-3.1 Flash-Image) to add textures, materials, and photorealistic details onto the code-generated sketches rather than generating the image from scratch. A VLM-based review module evaluates alignment between the final image and earlier layers to detect semantic or structural errors.
Training regime: The paper focuses on system design and evaluation rather than novel training of foundational models; Claude-ops-4.6 and Gemini-3.1 Flash-Image are used off-the-shelf. Hyperparameters are not detailed. The system uses inference-time code generation and orchestration logic.
Evaluation protocol: Metrics include compositional reasoning accuracy on GenEval++ sub-tasks (Counting, Color, Position, Size, Multi-Count) and overall score; official metrics on LongText-Bench for Chinese/English text rendering; qualitative evaluation on editing and physical simulation tasks. Baselines are strong open-source and closed-source models including GPT-Image, Qwen-Image, Nano-Banana, GenAgent, Mind-Brush, and others. GenClaw’s layered pipeline enables ablation-like analysis by tracing error sources to individual layers. No formal statistical tests are described. Held-out attacker scenarios or distribution shifts are not covered.
Reproducibility: The paper provides a GitHub link with code and demos (https://github.com/yejy53/GenClaw). The backbone LLM (Claude-ops-4.6) and generator (Gemini-3.1 Flash-Image) are closed or proprietary. Datasets used for evaluation are public or standard benchmarks. No frozen weights release is mentioned.

Example end-to-end: A user request for a complex infographic triggers the cognitive layer to parse object counts, text sections, and spatial constraints, retrieving external knowledge as needed. The executable canvas layer generates SVG code representing precise object positions and text layout, which is rendered to produce a sketch intermediate image. This sketch, combined with text layers, conditions Gemini-3.1 Flash-Image to produce a photorealistic final image with rich texture and lighting. The VLM evaluates all outputs verifying consistency. If errors occur, the system diagnoses whether they stem from knowledge retrieval, code generation, or final rendering for targeted revision.

Technical innovations

Introduction of code as an intermediate, executable, and editable canvas representation (SVG, HTML, Three.js) between natural language intent and pixel synthesis, enabling precise spatial and logical control.
Decoupling image generation into a staged pipeline (Conceptualize - Sketch - Color) mimicking human artistic workflows, improving interpretability and controllability compared to black-box prompt-to-image models.
Use of structured JSONL layered representations for localized image editing, reducing collateral pixel corruption and enabling fine-grained interactive modifications.
Integration of physical simulation code (e.g., Three.js mirror reflections) into the generative workflow, allowing explicit modeling of complex physical laws prior to photorealistic rendering.

Datasets

GenEval++ — size unspecified — public compositional scene reasoning benchmark
LongText-Bench — size unspecified — benchmark for English and Chinese long-text rendering
ImgEdit — size unspecified — image editing benchmark
Mind-Bench — size unspecified — world knowledge and reasoning tasks evaluation

Baselines vs proposed

GenEval++ overall: GPT-Image-1.5 = 0.750 vs GenClaw = 0.878
GenEval++ counting: Mind-Brush = 0.700 vs GenClaw = 0.950
LongText-Bench English: Qwen-Image = 0.943 vs GenClaw = 0.989
LongText-Bench Chinese: Gemini-3.0 Pro-Image = 0.949 vs GenClaw = 0.988
Agentic models overall: GenAgent = 0.725 vs GenClaw = 0.878
Physical simulation and layered editing tasks show qualitative improvements; no numeric baseline reported

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.30248.

Fig 1

Fig 1: | Code-Driven Agentic Image Generation. Existing methods are bottlenecked by end-to-end black-box

Fig 2

Fig 2: | Showcase of GenClaw in complex scene composition, text rendering, physical simulation, and layered image

Fig 3

Fig 3 (page 1).

Fig 4

Fig 4 (page 1).

Fig 5

Fig 5 (page 1).

Fig 6

Fig 6 (page 3).

Fig 3

Fig 3: | Overall architecture of the proposed framework. By emulating the human drawing workflow, the agentic

Fig 4

Fig 4: | Qualitative comparison of instruction following in complex compositions. Compared to purely text-driven

Limitations

The system depends on accurate code generation from LLMs; logical or parsing errors at the code stage can propagate.
Final photorealistic rendering still relies on pretrained image generation models that may hallucinate textures or details beyond code constraints.
Evaluation is limited to available benchmarks and separate manual/qualitative tests; more extensive user or adversarial testing is needed.
The approach requires integration of multiple modules (LLM, VLM, image models, rendering engines), increasing complexity and deployment overhead.
No formal adversarial testing or robustness evaluation under distribution shifts or offensive inputs.
Physical simulation capabilities are demonstrated for simple scenarios; scaling to complex physics or dynamics remains to be investigated.

Open questions / follow-ons

How robust is the code-driven generation pipeline to errors in code synthesis or execution failures, and how to best mitigate cascading failure modes?
Can the approach be extended to support real-time interactive editing with user feedback loops and multi-agent collaboration?
What are the limits of physical simulation integration, and can more accurate or learned physical models be incorporated?
How generalizable is GenClaw's paradigm to other modalities beyond images, such as video or 3D content generation?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, GenClaw demonstrates a promising direction by leveraging executable intermediate representations that provide transparent, editable structure to generative workflows. By incorporating code-level control over spatial layout, text rendering, and layered editing, such systems enable higher-fidelity, less hallucination-prone synthetic images with interpretable provenance. This staged architecture could reduce unintended generation ambiguities attackers might exploit—such as text misrendering or object counting errors—and improve defenses by enabling traceable generation steps rather than opaque black-box synthesis. However, the reliance on code-generation correctness and final texture synthesis by image models still presents attack surfaces. Overall, the paper invites bot-defense engineers to reconsider generation pipelines from purely linguistic to structured code-conditioned systems, potentially raising the bar for reliable, attack-resistant image synthesis relevant to CAPTCHAs and bot detection.

Cite

bibtex

@article{arxiv2605_30248,
  title={ GenClaw: Code-Driven Agentic Image Generation },
  author={ Junyan Ye and Jun He and Zilong Huang and Dongzhi Jiang and Xuan Yang and Rui Chen and Weijia Li },
  journal={arXiv preprint arXiv:2605.30248},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.30248}
}

GenClaw: Code-Driven Agentic Image Generation ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​