AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images
Source: arXiv:2604.28177 · Published 2026-04-30 · By Bo Zhang, Tzu-Yen Ma, Zichen Tang, Junpeng Ding, Zirui Wang, Yizhuo Zhao et al.
TL;DR
AEGIS introduces a comprehensive benchmark designed specifically for forensic analysis of AI-generated academic images, addressing a critical gap left by existing generic forensic datasets. It advances the field through three key innovations: (1) coverage of seven academic image categories with 39 fine-grained subtypes, reflecting the distinctive structural and semantic complexity of scholarly visuals; (2) simulation of four common yet subtle forgery strategies across 25 state-of-the-art generative models, stress-testing forensics against evolving adversarial threats; and (3) a multi-dimensional evaluation framework assessing forgery detection, textual artifact recognition, manipulation classification, and tampering localization, aligned with real-world academic review workflows. In evaluations, even powerful multimodal LLMs such as GPT-5.1 reach only 48.80% overall performance, and tampering localization IoU peaks at 30.09%, evidencing how challenging this domain remains for current forensics. Multimodal LLMs and expert detectors show complementary strengths: MLLMs reach 84.74% accuracy in textual artifact recognition, while expert models peak at 79.54% in binary authenticity detection, underscoring the need for approaches that integrate high-level reasoning with low-level sensing. The benchmark exposes a significant gap between the rapid advance of generative models and current forensic capabilities, motivating development toward expert-level AI forensic agents.
Key findings
- GPT-5.1 achieves only 48.80% overall performance on AEGIS, with a localization IoU of 30.09%, indicating limited ability to perform forensic analysis of academic images.
- Among expert vision-only forensic models, the highest pixel-level IoU for tampering localization is 30.09%, showing spatial attribution remains a major challenge.
- Forgeries produced by 11 of the generative models are detected with average accuracy below 50%, and those from 4 models fall below 30%, demonstrating that generative advances outpace forensic detection abilities.
- Multimodal large language models (MLLMs) attain 84.74% accuracy in textual artifact recognition and 60.07% accuracy in manipulation classification, outperforming some expert detectors in reasoning tasks.
- Expert detectors achieve up to 79.54% accuracy in binary forgery detection but exhibit lower robustness to post-processing perturbations than MLLMs.
- Performance significantly degrades on dense, texture-rich academic categories like stained micrographs and medical images compared to more structured types like charts and diagrams.
- Few-shot prompting improves detection task accuracy but impairs fine-grained manipulation classification, revealing a trade-off between pattern recognition and multi-step reasoning.
- Chain-of-thought prompting boosts detection accuracy but reduces manipulation classification performance, suggesting challenges for hypothetical reasoning about forgery types.
Threat model
The adversary is an academic image forger who uses advanced generative models to fabricate or subtly manipulate scholarly images with the goal of evading forensic detection during academic peer review or publication. They may produce entire forged images or localized alterations, employing diverse forgery strategies such as textual fabrication and regional editing. The adversary does not have access to detection model internals but aims to maximize visual fidelity and semantic coherence. They cannot manipulate peer review metadata or external evidence beyond the image content itself.
Methodology — deep read
The study begins by establishing a threat model focused on AI-generated forgeries of academic images used in high-stakes scholarly publications, assuming adversaries use advanced generative models to fabricate whole images or subtly modify them locally. The adversary aims to evade detection via diverse forgery strategies but is assumed not to tamper with anything beyond the image content itself, such as peer-review metadata or other external evidence.
AEGIS constructs its dataset by parsing over 4,000 open-access academic papers from PubMed Central to extract figures and panels, then manually curating a high-quality set of 8,210 academic image panels classified into seven categories (Chart, Medical Imaging, Physical Object, Micrograph, Stained Micrograph, Diagram, Others) and 39 fine-grained subtypes to capture domain complexity. This hierarchical taxonomy enables forensic models to be evaluated on realistic scholarly images rather than generic scenes.
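To make the taxonomy concrete, the following is a minimal sketch of how a two-level category/subtype mapping could be represented; the seven category names follow the paper, while the subtype lists shown here are illustrative placeholders rather than the actual 39 subtypes.

```python
# Hypothetical representation of AEGIS's two-level taxonomy.
# Top-level categories follow the paper; the subtypes below are
# illustrative placeholders, not the benchmark's actual 39 subtypes.
TAXONOMY = {
    "Chart": ["line_chart", "bar_chart", "scatter_plot"],
    "Medical Imaging": ["mri", "ct", "ultrasound"],
    "Physical Object": ["apparatus_photo", "specimen_photo"],
    "Micrograph": ["electron_micrograph", "optical_micrograph"],
    "Stained Micrograph": ["he_stain", "ihc_stain"],
    "Diagram": ["flowchart", "schematic"],
    "Others": ["composite_panel"],
}

def subtype_to_category(subtype: str) -> str:
    """Map a fine-grained subtype back to its top-level category."""
    for category, subtypes in TAXONOMY.items():
        if subtype in subtypes:
            return category
    raise KeyError(f"Unknown subtype: {subtype}")
```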
Four representative forgery strategies—Text Constraint Fabrication (TCF), Image Inference Forgery (IIF), Targeted Region Restoration (TRR), and Targeted Region Editing (TRE)—are algorithmically simulated using 25 modern generative models spanning diffusion, hybrid, and unified multimodal architectures. These generate over 7,000 synthetic forgeries, carefully verified through a dual-review with expert annotators to ensure plausibility and visual fidelity.
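To make the simulation setup concrete, here is a hedged sketch of how forged-sample jobs might be enumerated by pairing forgery strategies with generator backends; the strategy codes come from the paper, but the record fields and generator identifiers are assumptions.

```python
from dataclasses import dataclass
from itertools import product

# The four forgery strategies named in the paper.
STRATEGIES = ["TCF", "IIF", "TRR", "TRE"]

# Placeholder generator identifiers; the benchmark uses 25 real models.
GENERATORS = ["diffusion_model_a", "hybrid_model_b", "unified_mllm_c"]

@dataclass
class ForgedSample:
    source_panel: str     # path to the authentic academic panel
    strategy: str         # one of STRATEGIES
    generator: str        # which generative model produced the forgery
    region: tuple | None  # (x, y, w, h) for localized edits, None for whole-image forgeries

def enumerate_jobs(panels: list[str]) -> list[ForgedSample]:
    """Cross every source panel with every (strategy, generator) pair."""
    return [
        ForgedSample(source_panel=p, strategy=s, generator=g, region=None)
        for p, (s, g) in product(panels, product(STRATEGIES, GENERATORS))
    ]
```

In the actual benchmark each generated forgery then passes a dual review by expert annotators before it is admitted, as described above.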
AEGIS defines four forensic evaluation tasks: Forgery Scope Discrimination (coarse no/partial/entire forgery classification), Textual Artifact Recognition (binary assessment of AI-generated text traces), Manipulation Classification (structural reasoning about insert/remove/alter manipulations), and Tampering Pinpointing (spatial localization via bounding boxes for MLLMs and pixel masks for experts). Task-specific metrics include Accuracy, Macro-F1, region-level Correct Localization Accuracy, pixel-level IoU and F1, and a composed Normalized Forensic Index (NFI) to balance capabilities across tasks.
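Since the exact metric formulations are not reproduced in this summary, the snippet below sketches standard pixel-level IoU/F1 over binary tampering masks and a simple region-level correct-localization check against a bounding-box IoU threshold; the 0.5 threshold is an assumption, not the paper's stated setting.

```python
import numpy as np

def pixel_iou_f1(pred_mask: np.ndarray, gt_mask: np.ndarray) -> tuple[float, float]:
    """Pixel-level IoU and F1 between binary tampering masks (1 = tampered)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0          # both masks empty -> perfect score
    denom = pred.sum() + gt.sum()
    f1 = 2 * inter / denom if denom else 1.0
    return float(iou), float(f1)

def correct_localization(pred_box, gt_box, thresh: float = 0.5) -> bool:
    """Region-level check: bounding-box IoU above a threshold counts as correct.
    Boxes are (x1, y1, x2, y2); the 0.5 threshold is an assumption."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_p + area_g - inter
    return (inter / union if union else 0.0) >= thresh
```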
The benchmark evaluates 14 proprietary and 11 open-source MLLMs, a unified multimodal generation-understanding model, six vision-only expert detectors, and three hybrid MLLM-assisted models. Models are tested under zero-shot, chain-of-thought, and few-shot prompting conditions using standardized high-resolution PNG inputs to avoid compression confounds. Ablations include impact analysis of post-processing perturbations (Gaussian blur, JPEG compression, scaling) and prompting strategies.
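A minimal sketch of the three post-processing perturbations named in the robustness ablation (Gaussian blur, JPEG compression, and rescaling), implemented here with Pillow; the blur radius, JPEG quality, and scale factor are assumed values, not the paper's settings.

```python
from io import BytesIO
from PIL import Image, ImageFilter

def gaussian_blur(img: Image.Image, radius: float = 1.5) -> Image.Image:
    """Apply Gaussian blur; the radius is an assumed ablation setting."""
    return img.filter(ImageFilter.GaussianBlur(radius=radius))

def jpeg_compress(img: Image.Image, quality: int = 75) -> Image.Image:
    """Round-trip through JPEG at a given quality to introduce compression artifacts."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()

def rescale(img: Image.Image, factor: float = 0.5) -> Image.Image:
    """Downscale then upscale back to the original size, losing high-frequency detail."""
    w, h = img.size
    small = img.resize((max(1, int(w * factor)), max(1, int(h * factor))),
                       Image.Resampling.BICUBIC)
    return small.resize((w, h), Image.Resampling.BICUBIC)
```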
The manipulation classification task, for example, follows this pipeline: the input is an image with a highlighted red region together with the original caption, and the model outputs the manipulation type (insert/alter/remove/not sure). Localization predictions use bounding boxes or pixel masks compared against ground-truth annotations to compute IoU/F1. The extensive annotation and evaluation setup thus simulates an end-to-end academic forensic review workflow with multi-dimensional judgments.
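To illustrate that pipeline, here is a hedged sketch of how a manipulation-classification prompt might be assembled and a free-text model answer normalized to the four options; the prompt wording is an assumption and does not reproduce the paper's actual protocol.

```python
import re

OPTIONS = ["insert", "remove", "alter", "not sure"]

def build_prompt(caption: str) -> str:
    """Assemble a manipulation-classification prompt.
    The wording is illustrative; the image with the highlighted red region
    is passed to the model separately as the visual input."""
    return (
        "The attached academic figure has a region highlighted in red.\n"
        f"Original caption: {caption}\n"
        "Which manipulation was applied to the highlighted region? "
        "Answer with exactly one of: insert, remove, alter, not sure."
    )

def parse_answer(raw: str) -> str:
    """Map a free-text model response onto one of the four options."""
    text = raw.strip().lower()
    for opt in OPTIONS:
        if re.search(rf"\b{re.escape(opt)}\b", text):
            return opt
    return "not sure"
```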
The authors release the benchmark code under Apache 2.0 and data under CC BY 4.0, promoting reproducibility. However, some recent forensic models without released weights are not covered. Overall, AEGIS systematically diagnoses forensic model weaknesses along interpretability, localization, reasoning, and robustness dimensions specifically tailored for complex academic imagery.
Technical innovations
- A hierarchical taxonomy organizing academic images into seven high-level categories and 39 fine-grained subtypes to capture domain-specific complexity.
- Simulation of four diverse academic forgery strategies (TCF, IIF, TRR, TRE) using 25 generative models representing diffusion, hybrid, and unified multimodal architectures.
- A multi-dimensional forensic evaluation framework spanning forgery detection, textual artifact recognition, manipulation classification, and tampering pinpointing with task-specific metrics.
- Introduction of a Normalized Forensic Index (NFI) metric to quantify balanced forensic capability across multiple tasks rather than peak performance in any single dimension.
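The paper's exact NFI formula is not reproduced in this summary; as an assumption about how such a balanced index could be composed, the sketch below simply averages per-task scores after clipping them to [0, 1]. The equal weighting is hypothetical, not the paper's definition.

```python
def normalized_forensic_index(task_scores: dict[str, float]) -> float:
    """Hypothetical NFI: mean of per-task scores, each clipped to [0, 1].

    `task_scores` maps task names (e.g. forgery-scope accuracy, textual-artifact
    accuracy, manipulation-classification accuracy, localization IoU) to values
    in [0, 1]. The equal weighting is an assumption.
    """
    if not task_scores:
        raise ValueError("At least one task score is required")
    clipped = [min(max(v, 0.0), 1.0) for v in task_scores.values()]
    return sum(clipped) / len(clipped)
```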
Datasets
- AEGIS — 8,210 academic image panels with 20k forensic questions — constructed from 4,000+ open-access papers via PubMed Central
Baselines vs proposed
- GPT-5.1: Normalized Forensic Index = 48.80% vs other MLLMs averaging below 40%
- Expert Detector AIDE: Binary authenticity detection accuracy = 79.54%
- Expert Detector DDA: Pixel-level IoU up to 30.09%, with only 2 MLLMs exceeding 50% Correct Localization Accuracy (CLA)
- Generative model Nano Banana Pro: forgeries detected with accuracy below 30%, exposing adversarial vulnerability
- Multimodal LLM Gemini 3 Pro Preview: Textual Artifact Recognition accuracy = 84.74%
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.28177.

Fig 1: AEGIS investigates whether current models

Fig 2: Hierarchical taxonomy of AEGIS. We organize academic images into seven categories and 39 fine-grained subtypes.
Limitations
- Limited availability of confirmed real-world academic image retraction cases restricts scale and diversity of genuine misconduct samples.
- Some recent state-of-the-art forensic models (e.g., AIGI-Holmes) lack released pretrained weights, limiting baseline coverage.
- Localization remains challenging, with IoU scores of roughly 30%, indicating models struggle to precisely attribute tampering spatially.
- Post-processing perturbations degrade expert models substantially, revealing robustness constraints.
- Few-shot and chain-of-thought prompting strategies show trade-offs, complicating prompt tuning for multi-task forensic reasoning.
- The dataset consists primarily of synthetic forgery simulations, with limited ground truth from confirmed real forgeries, possibly limiting ecological validity.
Open questions / follow-ons
- How can models better integrate dense texture and domain knowledge to improve tampering localization in complex scientific imagery?
- What hybrid architectures effectively combine expert sensor models and multimodal reasoning agents to close the detection-classification-interpretation gap?
- How can real-world academic image forgery cases be systematically collected and incorporated to enhance ecological validity beyond synthetic simulations?
- What self-supervised or continual learning strategies allow forensic models to adapt rapidly to evolving generative models without extensive retraining?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, AEGIS highlights the complexity of domain-specific image forensics beyond typical social media or natural image scenarios, illustrating that even state-of-the-art multimodal models struggle with subtle, localized forgery detection in structured scientific visuals. It suggests that designing bot defenses relying solely on single-dimensional authenticity judgments can be insufficient when forgeries are advanced and diverse. Instead, a multi-faceted approach—integrating detection, reasoning on textual artifacts, manipulation classification, and precise tampering localization—is necessary. Lessons from AEGIS could inform CAPTCHA designs or bot-detection systems requiring fine-grained image authenticity reasoning under adversarial generation conditions. Moreover, the benchmark’s multi-task evaluation protocol and composite forensic index provide useful templates for measuring robustness and interpretability in other domains that rely on subtle image forensics.
Cite
@article{arxiv2604_28177,
  title={AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images},
  author={Bo Zhang and Tzu-Yen Ma and Zichen Tang and Junpeng Ding and Zirui Wang and Yizhuo Zhao and Peilin Gao and Zijie Xi and Zixin Ding and Haiyang Sun and Haocheng Gao and Yuan Liu and Liangjia Wang and Yiling Huang and Yujie Wang and Yuyue Zhang and Ronghui Xi and Yuanze Li and Jiacheng Liu and Zhongjun Yang and Haihong E},
  journal={arXiv preprint arXiv:2604.28177},
  year={2026},
  url={https://arxiv.org/abs/2604.28177}
}