Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation

Source: arXiv:2605.28091 · Published 2026-05-27 · By Niantong Li, Guangzheng Hu, Weixu Qiao, Ying Ba, Qichen Hong, Shijun Shen et al.

TL;DR

This paper addresses the growing gap between text-to-image (T2I) generation capabilities and the metrics used to evaluate them. While modern T2I models excel at prompt compliance and basic image quality, real-world users, especially creative professionals, need more nuanced assessments of fidelity to reality and authentic creative expression. Existing benchmarks focus mainly on coarse text-image alignment and aesthetics. They have saturated on these traditional criteria and fail to distinguish leading models on subtle yet crucial aspects such as physical plausibility, cultural accuracy, and compositional creativity.

To tackle this, the authors present Qwen-Image-Bench, a large-scale, creator-centric evaluation benchmark co-designed with expert artists. It introduces two novel evaluation pillars, Real-world Fidelity and Creative Generation, alongside the conventional dimensions (Quality, Aesthetics, Alignment), organized in a hierarchical taxonomy with 5 top-level pillars, 23 second-level sub-capabilities, and 56 fine-grained rubric-based facets. The benchmark comprises 1,000 bilingual prompts carefully crafted to stress multiple fine-grained aspects simultaneously. The authors also train a unified judge model (Q-Judger) on 130K human expert annotations; it scores every image-prompt pair on all 56 facets, producing interpretable, granular diagnostic scores rather than a single opaque metric.

Evaluation across 18 cutting-edge T2I models reveals that Qwen-Image-Bench achieves strong correlation with human judgments (Spearman ρ ≈ 0.92) and successfully distinguishes models on the newly introduced Real-world Fidelity and Creative Generation dimensions where prior benchmarks lack discrimination. Notably, state-of-the-art models still struggle with physical logic, anatomical fidelity, and contact interaction, scoring under 44 on these facets. The benchmark offers a practical, interpretable suite for nuanced T2I model analysis aligned with professional creative workflows.

Key findings

Qwen-Image-Bench defines 5 evaluation pillars: Quality, Aesthetics, Alignment, Real-world Fidelity, and Creative Generation, with 23 sub-capabilities and 56 atomic third-level facets.
1,000 bilingual (English and Chinese) prompts are designed to jointly exercise 3-5 pillars per prompt with balanced coverage of all 56 facets.
The unified judge model (Q-Judger) trained on 130,000 expert-annotated prompt-image pairs achieves Spearman rank correlations >0.89 across all pillars and 0.92 overall versus human rankings of 18 T2I models.
Qwen-Image-Bench scores each facet on a rubric-grounded scale with four options (Fail, Pass, Excel, or N/A), allowing detailed diagnosis rather than a single score.
Evaluation of 18 advanced T2I models shows the largest inter-model score variance in the two new pillars, Real-world Fidelity and Creative Generation, confirming that these capture capability gaps missed by prior benchmarks.
The most challenging facets for all models include Physical Logic, Anatomical Fidelity, Animals, and Contact Interaction, with top model scores below 44/100 in these areas.
The split into short and long prompts and the bilingual design allow analysis of robustness to prompt length and language, addressing real-world usage scenarios.
Q-Judger is publicly released along with the taxonomy and prompts, enabling reproducible offline evaluation.
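
As a rough illustration of how the rank agreement reported above could be checked, the sketch below computes a Spearman correlation between judge-derived and human-derived model scores in plain Python. The model names and scores are hypothetical placeholders, not values from the paper, and the helper names (`average_ranks`, `spearman_rho`) are illustrative rather than part of any released tooling.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs: List[float], ys: List[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four placeholder models (not figures from the paper).
judge_scores: Dict[str, float] = {
    "model_a": 61.0, "model_b": 55.5, "model_c": 48.0, "model_d": 52.0,
}
human_scores: Dict[str, float] = {
    "model_a": 4.5, "model_b": 3.6, "model_c": 3.8, "model_d": 3.1,
}

models = sorted(judge_scores)
rho = spearman_rho(
    [judge_scores[m] for m in models],
    [human_scores[m] for m in models],
)
print(f"Spearman rho = {rho:.2f}")
```

In practice a library routine such as `scipy.stats.spearmanr` would likely be used instead; the hand-rolled version is shown only to make the rank-then-correlate step explicit.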

Threat model

The adversary is a modern text-to-image generative model that produces images from complex, multi-faceted prompts probing creative and real-world fidelity dimensions. It has no intent to deceive or manipulate the evaluation; it stands in for typical high-capacity production T2I systems, whose outputs reflect capability limits rather than adversarial evasion. The framework assumes full access to prompts and images and does not consider attack vectors such as prompt injection or model circumvention.

Methodology — deep read

Threat Model & Assumptions: The benchmark treats modern T2I models as the systems under test: they generate images from complex textual prompts that reflect real-world creative demands, including fidelity to physical reality and to creative intent. The evaluation does not model malicious manipulation; it targets authentic artistic assessment. The models are commodity generative systems with multimodal training, and their generation is assumed to be non-adversarial.
Data: The prompt set consists of 1,000 bilingual English and Chinese prompts, evenly split into 500 short and 500 long prompts, carefully stratified to balance coverage of all 56 evaluation facets and five pillars. Prompts were drafted by large language models (Qwen3-Max and ChatGPT-5.2) and extensively refined through expert artist review to ensure clarity, discriminative power, and adherence to evaluation rubrics. Annotated training data for the judge consists of over 130,000 prompt-image pairs labeled by 80 professional annotators with art backgrounds, under strict blind and triple review protocols.
Architecture / Algorithm: The unified judge model, Q-Judger, is fine-tuned from the large multimodal foundation model Qwen3.6-27B using the MS-SWIFT multimodal fine-tuning framework. The input is a triplet of prompt, image, and a rubric-based checklist encoding the third-level facets. The output is an independent discrete label per facet, chosen from four options (Fail, Pass, Excel, N/A). The model supports a chain-of-thought reasoning mode for better interpretability during scoring.
Training Regime: Q-Judger was fine-tuned on the 130K human-labeled samples, balanced across source models and facet coverage. Batch size, number of epochs, and optimizer settings are not fully disclosed in the excerpt; training is described as following standard MS-SWIFT multimodal tuning practice. The training involved random seeding and standardized evaluation.
Evaluation Protocol: The authors evaluate 18 state-of-the-art T2I models by generating images for all benchmark prompts, then scoring with Q-Judger. Scores are aggregated bottom-up from facet (L3) to sub-capability (L2), pillar (L1), and overall levels using mean aggregation excluding N/A facets. Spearman rank correlation was computed against human expert rankings across pillars and overall, showing high consistency (ρ up to 0.92). The benchmark enables detailed per-dimension diagnosis and highlights ceiling facets where even the best models score under 44.
Reproducibility: The full prompt set and the Q-Judger model are publicly released on HuggingFace and ModelScope. Taxonomy details and scoring rubric templates are provided. The benchmark supports fully offline evaluation enabling independent replication and future comparative studies.
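
The bottom-up aggregation described under Evaluation Protocol can be sketched roughly as follows. This is a minimal illustration under several assumptions: the numeric mapping of labels (Fail=0, Pass=1, Excel=2), the sub-capability names in `TAXONOMY`, and the choice to average sub-capabilities with equal weight are all hypothetical, since the summary only states that means are taken while excluding N/A facets.

```python
from statistics import mean
from typing import Dict, Iterable, Optional

# Assumed numeric mapping for the rubric labels; N/A becomes None and is skipped.
LABEL_TO_SCORE: Dict[str, Optional[int]] = {
    "Fail": 0, "Pass": 1, "Excel": 2, "N/A": None,
}

# Hypothetical slice of the taxonomy: pillar -> sub-capability -> facets.
# Pillar and facet names appear in the summary; sub-capability names are made up.
TAXONOMY = {
    "Real-world Fidelity": {
        "Physics": ["Physical Logic", "Contact Interaction"],
        "Biology": ["Anatomical Fidelity"],
    },
    "Alignment": {
        "Text": ["Text Rendering"],
    },
}


def mean_or_none(values: Iterable[Optional[float]]) -> Optional[float]:
    """Mean of the non-None values, or None when nothing is applicable."""
    kept = [v for v in values if v is not None]
    return mean(kept) if kept else None


def aggregate(facet_labels: Dict[str, str]) -> Dict[str, object]:
    """Roll facet labels (L3) up to sub-capability (L2), pillar (L1), and overall."""
    facet_scores = {f: LABEL_TO_SCORE[lab] for f, lab in facet_labels.items()}
    sub_scores: Dict[str, Optional[float]] = {}
    pillar_scores: Dict[str, Optional[float]] = {}
    for pillar, subs in TAXONOMY.items():
        for sub, facets in subs.items():
            sub_scores[sub] = mean_or_none(facet_scores.get(f) for f in facets)
        pillar_scores[pillar] = mean_or_none(sub_scores[s] for s in subs)
    return {
        "sub_capability": sub_scores,
        "pillar": pillar_scores,
        "overall": mean_or_none(pillar_scores.values()),
    }


# Hypothetical labels for one image-prompt pair.
result = aggregate({
    "Physical Logic": "Fail",
    "Contact Interaction": "Pass",
    "Anatomical Fidelity": "N/A",
    "Text Rendering": "Excel",
})
print(result)
```

Propagating `None` upward means a sub-capability whose facets are all N/A drops out of its pillar's mean instead of counting as zero, which matches the stated exclusion of N/A facets.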

Example: For the prompt "Create an adorable recipe journal page for stir-fried pork with chili...", expert annotators score each image on detailed facets such as text rendering, visual storytelling, color harmony, and world knowledge, with each facet scored independently as 0, 1, or 2. Q-Judger takes the prompt, image, and facet rubrics and produces the full vector of facet scores. These scores are then averaged to surface creative strengths and physical-accuracy limitations, giving model developers actionable feedback.
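
To make the averaging step in the example above concrete, the following sketch turns several per-image facet vectors into a per-facet profile. The 0-100 scaling (mean score divided by the maximum of 2) is an assumption made here for illustration; the summary reports scores out of 100 but does not spell out the normalization. All sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Facet name -> 0, 1, 2, or None when the facet is N/A for that prompt.
FacetVector = Dict[str, Optional[int]]


def facet_profile(samples: List[FacetVector], max_score: int = 2) -> Dict[str, float]:
    """Average each facet over the samples where it applies, scaled to 0-100."""
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for vector in samples:
        for facet, score in vector.items():
            if score is None:  # N/A: facet not exercised by this prompt
                continue
            totals[facet] += score
            counts[facet] += 1
    return {f: 100.0 * totals[f] / (counts[f] * max_score) for f in counts}


# Hypothetical judge outputs for three images (illustrative values only).
samples = [
    {"Text Rendering": 2, "Color Harmony": 1, "World Knowledge": None},
    {"Text Rendering": 1, "Color Harmony": 2, "World Knowledge": 0},
    {"Text Rendering": 2, "Color Harmony": 1, "World Knowledge": 1},
]

profile = facet_profile(samples)
weakest = min(profile, key=profile.get)
print(profile, "weakest facet:", weakest)
```

Sorting the resulting profile gives a quick view of which facets are weakest for a given model.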

Technical innovations

A hierarchical, creator-centric evaluation taxonomy with 5 pillars (including novel Real-world Fidelity and Creative Generation pillars), 23 sub-capabilities, and 56 atomic rubric-grounded facets.
An expert-in-the-loop prompt factory producing 1,000 bilingual, stratified prompts that jointly exercise multiple fine-grained facets across pillars, balancing linguistic complexity and language.
A unified judge model (Q-Judger) fine-tuned on 130K human expert-annotated prompt-image pairs, yielding fine-grained, rubric-anchored discrete scores (Fail, Pass, Excel, or N/A) on all 56 facets per sample.
A bottom-up score aggregation method that preserves interpretability at the facet, sub-capability, pillar, and overall levels, enabling diagnosis of model strengths and weaknesses aligned with professional artistic workflows.

Datasets

Qwen-Image-Bench Prompt Set — 1,000 bilingual prompts (English and Chinese) — publicly released at HuggingFace and ModelScope
Qwen-Image-Bench Annotation Data — 130,000 prompt-image pairs with 56-facet fine-grained human expert labels — internal, not fully public but annotations underpin judge model

Baselines vs proposed

GPT Image 2: Overall score = 64.7, highest among 18 models on Qwen-Image-Bench
Qwen Image 2.0 Pro: Overall score ranks 5th across all evaluated models
Correlation of Q-Judger with human experts: Spearman ρ = 0.92 overall; 0.89 on Quality, Aesthetics, Alignment; 0.92 on Real-world Fidelity and Creative Generation
Key facets Physical Logic, Anatomical Fidelity, Animals, Contact Interaction: best models score below 44/100, indicating systemic challenges
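
One way to surface such ceiling facets from a table of per-model facet scores is sketched below. The function name `ceiling_facets`, the model names, and the numbers are hypothetical; only the 44/100 threshold comes from the summary.

```python
from typing import Dict, List, Tuple


def ceiling_facets(
    scores: Dict[str, Dict[str, float]], threshold: float = 44.0
) -> List[Tuple[str, str, float]]:
    """Return (facet, best_model, best_score) where even the best model is below threshold."""
    facets = set().union(*(s.keys() for s in scores.values()))
    result = []
    for facet in sorted(facets):
        # Best-scoring model among those that report this facet.
        best_model, best_score = max(
            ((m, s[facet]) for m, s in scores.items() if facet in s),
            key=lambda pair: pair[1],
        )
        if best_score < threshold:
            result.append((facet, best_model, best_score))
    return result


# Hypothetical per-model facet scores on a 0-100 scale (not from the paper).
scores = {
    "model_a": {"Physical Logic": 41.0, "Color Harmony": 78.0},
    "model_b": {"Physical Logic": 43.5, "Color Harmony": 70.0},
}

hard = ceiling_facets(scores)
print(hard)
```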

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.28091.

Fig 1: Qwen-Image-Bench Evaluation Dimensions.

Fig 2: Pipeline for constructing real-world application prompts.

Limitations

Limited detail on judge model training hyperparameters and exact fine-tuning protocol, hindering exact replication.
The annotation process is labor-intensive, and reliance on art-academy experts may introduce cultural or domain biases despite the bilingual prompts.
Focus on creator-centric artistic evaluation may not fully capture adversarial robustness or manipulation vulnerabilities.
Benchmark prompt set size (1,000) is moderate but possibly insufficient to exhaustively cover all creative scenarios or languages beyond English and Chinese.
Evaluation limited to static images, without consideration of video or interactive multimodal generation modalities.

Open questions / follow-ons

How well does Qwen-Image-Bench generalize to languages beyond English and Chinese and to more diverse creative cultures?
Can the rubric-driven multi-facet scoring model adapt to evolving artistic styles or novel creative paradigms dynamically?
How robust is the judge model to adversarially crafted images intended to fool specific facets of evaluation?
Can this fine-grained diagnostic feedback effectively guide targeted data augmentation and model improvement in a closed-loop training pipeline?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, Qwen-Image-Bench offers useful insight into building highly granular evaluation methodologies grounded in human expert criteria rather than opaque heuristics or single-score metrics. Its hierarchical taxonomy and rubric-driven scoring favor interpretability and fine-grained diagnostics over black-box judgments, a principle that applies beyond creative AI to security-critical systems that need a nuanced understanding of user intent.

Applying similar creator-centric design to bot detection tasks could enable distinguishing subtle behaviors that aggregate measurements miss, improving reliability. The blend of expert-in-the-loop annotation and multimodal model scoring exemplifies a practical path to scale trusted human standards in complex assessment domains. However, the benchmark’s focus on artistic workflows may require adaptation to security domain specifics, such as adversarial threat modeling and real-time evaluation constraints.

Cite

bibtex

@article{arxiv2605_28091,
  title={ Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation },
  author={ Niantong Li and Guangzheng Hu and Weixu Qiao and Ying Ba and Qichen Hong and Shijun Shen and Jinlin Wang and Fan Zhou and Jianye Kang and Xin Shang and Ziyi He and Wei Wang and Dalin Li and Jiahao Li and Jie Zhang and Kaiyuan Gao and Kun Yan and Lihan Jiang and Ningyuan Tang and Shengming Yin and Tianhe Wu and Xiao Xu and Xiaoyue Chen and Yuxiang Chen and Yan Shu and Yanran Zhang and Yilei Chen and Yixian Xu and Zekai Zhang and Zhendong Wang and Zihao Liu and Zikai Zhou and Hongzhu Shi and Yi Wang and Bing Zhao and Hu Wei and Lin Qu and Chenfei Wu },
  journal={arXiv preprint arXiv:2605.28091},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.28091}
}
