HydraPrompt: An Adaptive and Asymmetric Framework of Vision-Language Models for Synthetic Image Detection

Source: arXiv:2605.26421 · Published 2026-05-26 · By Senyuan Shi, Hao Tan, Zichang Tan, Shuhan Feng, Ajian Liu, Sergio Escalera et al.

TL;DR

This paper addresses the detection of synthetic images produced by diverse generative models, a problem made more urgent by the spread of deepfakes and AI-generated media. Existing Synthetic Image Detection (SID) methods predominantly use vision-language models (VLMs) such as CLIP with fixed textual prompts as category centers, but they adapt poorly to new, unseen forgery types. The authors propose HydraPrompt, an asymmetric prompting framework that adapts the fake category center to fine-grained cues in each image, with the aim of generalizing better across diverse synthetic images.

HydraPrompt introduces two key innovations: the Asymmetric Prompt Adapter (APA), which uses a single static prompt set to represent the authentic class but constructs sample-adaptive prompts for the fake class based on shallow visual features; and a Conditional Supervised Contrastive (CSC) loss that pulls real image features together while dispersing fake image features to preserve fine-grained distinctions. Experiments on three challenging SID benchmarks (UniversalFakeDetect, Chameleon, WildRF) demonstrate substantial improvements in accuracy and average precision over state-of-the-art methods, particularly in out-of-distribution scenarios with unseen forgery types. Ablation studies confirm the importance of asymmetric prompts and shallow-layer image features in the APA module as well as the supervised contrastive loss in improving representation separability.

Overall, HydraPrompt is a conceptually simple framework that models the intrinsic asymmetry between real and fake image distributions in synthetic image detection, and the reported results indicate improved robustness and generalization for VLM-based SID models.

Key findings

HydraPrompt achieves 98.0% accuracy / 99.6% average precision on UniversalFakeDetect GAN datasets, outperforming FatFormer by 1.9% / 3.1% and C2P-CLIP by 2.7% / 2.9%.
On diffusion model datasets from UniversalFakeDetect, HydraPrompt reaches 95.9% accuracy / 99.5% AP, surpassing FatFormer by 2.1% / 0.7%.
HydraPrompt obtains 61.3% and 69.7% accuracy on the Chameleon dataset under ProGAN and SD v1.4 training, improving on the previous best by 2.4 and 7.1 points, respectively.
On WildRF online data, HydraPrompt obtains 95.9% accuracy, exceeding the prior state of the art by 6.5 points on average, with gains of up to 11.5 points on the Twitter subset.
The APA module performs best with shallow-layer (layer 1) image-encoder features, reaching a 98.2% average (Acc/AP) versus 94.7%-96.6% with layer 12/24 features.
The CSC objective compacts real image features and disperses fake features, improving the separability of unseen forgeries, as visualized in t-SNE plots.
Static prompts for authentic images generalize well across diverse real-world datasets (LOKI, MSCOCO), which supports the assumption that real images follow a consistent distribution.
HydraPrompt’s computational overhead (FLOPs and latency) is substantially lower than FatFormer while achieving superior accuracy.

Threat model

The adversary is assumed to be an entity capable of generating synthetic images using various generative models, creating diverse and unseen forgery types at inference time. The adversary attempts to evade detection by producing images with subtle, fine-grained manipulation cues. The detection system does not assume prior knowledge of the specific generative models used for forgery at test time and must generalize to out-of-distribution synthetic images. The adversary cannot modify or interfere with the detection model or its learned textual prompts at inference.

Methodology — deep read

Threat Model & Assumptions: The adversary generates synthetic images using various generative models (GANs, Latent Diffusion Models, etc.). The detector has no prior knowledge of specific forgery methods used at inference time and must generalize to unseen forgeries. The model assumes access to labeled real and synthetic images from some generators for training but must robustly distinguish future manipulated content. The adversary cannot manipulate the detector's architecture or its learned prompts at test time.
Data: Training uses ProGAN-generated images with real/fake labels (the Chameleon results additionally report a setting trained on SD v1.4 images). Evaluation is conducted on multiple benchmarks: UniversalFakeDetect (with multiple GAN and diffusion subsets), the Chameleon dataset with post-processed images, and WildRF sets collected from social platforms (Reddit, Twitter, Facebook). Real images come from varied sources such as LOKI and MSCOCO. Preprocessing resizes images to 256x256 and crops them to 224x224 for input to the CLIP ViT-L/14 encoder; no data augmentation is used during training.
Architecture and Algorithm:

The backbone vision-language model is CLIP ViT-L/14, frozen except for the LoRA modules.
Asymmetric Prompt Adapter (APA) takes shallow-layer image features (1st transformer block) as fine-grained visual cues.
Visual cues undergo average pooling and projection via a lightweight 2-layer adapter (ReLU nonlinearity), producing Z̃.
Authentic (real) class prompt is static: a single set of learnable prompts concatenated with the fixed context "A real image".
Fake class prompt is dynamic: it concatenates a set of learnable prompts + visual cues (Z̃) + the context "A fake image" to produce sample-adaptive textual embeddings.
Text encoder converts concatenated prompt tokens to embeddings serving as category centers T_r (real) and T_f (fake).
During inference, image features z_i are compared against T_r and T_f via cosine similarity logits to classify.
Conditional Supervised Contrastive (CSC) loss encourages real features to cluster tightly (real samples form positive pairs with one another) while fake features remain dispersed (each fake sample is treated as an individual instance).
A cross-modal alignment term further aligns pairwise image/text features within same categories.
Training objective combines cross-entropy loss on classification logits with CSC and alignment losses, weighted by λ1 and λ2.
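
The asymmetric prompt construction and the cosine-similarity classification described above can be sketched as follows. This is a minimal, standard-library illustration and not the authors' implementation: the toy width `D`, the random weights, the token counts, the logit scale of 100, and in particular `toy_text_encoder` (a stand-in that simply averages tokens in place of the CLIP text encoder) are all assumptions made for the example.

```python
import math
import random

random.seed(0)
D = 8  # toy embedding width; the real CLIP ViT-L/14 widths are much larger


def rand_matrix(rows, cols):
    """Random stand-in for learned weights or token embeddings."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def adapter(cue, w1, w2):
    """Two-layer projection with a ReLU between the layers."""
    hidden = [max(0.0, h) for h in matvec(w1, cue)]
    return matvec(w2, hidden)


def toy_text_encoder(prompt_tokens):
    """Hypothetical stand-in for the CLIP text encoder (just averages tokens)."""
    return mean_pool(prompt_tokens)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


# Hypothetical parameters: learnable prompt tokens, fixed context tokens, adapter weights.
PROMPT_REAL, PROMPT_FAKE = rand_matrix(4, D), rand_matrix(4, D)
CTX_REAL, CTX_FAKE = rand_matrix(3, D), rand_matrix(3, D)  # "A real image" / "A fake image"
W1, W2 = rand_matrix(D, D), rand_matrix(D, D)


def classify(shallow_tokens, image_feature, scale=100.0):
    """Return (p_fake, T_r, T_f) for one image; scale=100 is an assumed logit scale."""
    cue = adapter(mean_pool(shallow_tokens), W1, W2)           # sample-specific visual cue
    t_real = toy_text_encoder(PROMPT_REAL + CTX_REAL)          # static real center
    t_fake = toy_text_encoder(PROMPT_FAKE + [cue] + CTX_FAKE)  # adaptive fake center
    logits = [scale * cosine(image_feature, t_real),
              scale * cosine(image_feature, t_fake)]
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]         # stable softmax
    return exps[1] / sum(exps), t_real, t_fake
```

In this sketch the real center is identical for every image, while the fake center changes with the shallow-layer tokens of each input, which is the asymmetry the summary describes.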

Training Regime:

Trained for 10 epochs on an RTX 4090 GPU.
Batch size 1000, with a memory bank to increase the number of negative samples.
Learning rate 4e-4, decayed by a cosine scheduler.
LoRA (low-rank adaptation) applied with rank 6 and alpha 6 on all transformer MLPs for parameter-efficient tuning.
Loss weights λ1 = 1, λ2 = 1.25.
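
To make the hyperparameters above concrete, the following sketch computes a per-epoch cosine learning-rate schedule, the LoRA scaling factor, and the weighted objective. Several details are assumptions not stated in the summary: the schedule decays to zero with no warmup and is stepped per epoch, the LoRA update is scaled by alpha / rank (the common convention), and λ1 is paired with the CSC term while λ2 is paired with the alignment term.

```python
import math

BASE_LR = 4e-4               # from the training regime
EPOCHS = 10                  # from the training regime
LORA_RANK, LORA_ALPHA = 6, 6
LAMBDA_1, LAMBDA_2 = 1.0, 1.25


def cosine_lr(epoch, base_lr=BASE_LR, total=EPOCHS, min_lr=0.0):
    """Cosine decay from base_lr to min_lr (no warmup; min_lr=0 is assumed)."""
    progress = min(max(epoch / total, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def lora_scale(rank=LORA_RANK, alpha=LORA_ALPHA):
    """Common LoRA convention: the low-rank update is multiplied by alpha / rank."""
    return alpha / rank


def total_loss(ce, csc, align, lam1=LAMBDA_1, lam2=LAMBDA_2):
    """Assumed pairing: lambda_1 weights the CSC term, lambda_2 the alignment term."""
    return ce + lam1 * csc + lam2 * align


for epoch in range(EPOCHS + 1):
    print(f"epoch {epoch:2d}  lr = {cosine_lr(epoch):.2e}")
```

Under the alpha / rank convention, rank 6 with alpha 6 gives a scaling factor of 1, so the low-rank update is added to the frozen weights without extra rescaling.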

Evaluation Protocol:

Metrics: Accuracy (threshold 0.5) and Average Precision (AP) on real-vs-fake binary classification.
Baselines compared include state-of-the-art SID methods like FatFormer, C2P-CLIP, Effort, RINE.
Held-out unseen generators used for rigorous OOD performance testing.
Ablations to measure impact of APA using different visual cue layers, CSC loss, and static vs adaptive prompts.
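
The two metrics in the protocol can be sketched in a few lines. The code below assumes that fake is the positive class (label 1), that a score at or above the threshold counts as fake, and that scores contain no ties; how the paper averages results across benchmark subsets is not covered here.

```python
def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label (1 = fake)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy scores (probability of "fake") and ground-truth labels.
scores = [0.9, 0.8, 0.3, 0.6]
labels = [1, 0, 0, 1]
print(accuracy_at_threshold(scores, labels))  # 0.75
print(average_precision(scores, labels))      # (1/1 + 2/3) / 2
```

Accuracy depends on the fixed 0.5 threshold, whereas AP depends only on the ranking of scores, which is why the two metrics can move differently on out-of-distribution data.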

Reproducibility:

Model based on public CLIP ViT-L/14.
Code and checkpoints availability not explicitly mentioned in text.
Used publicly available SID benchmarks (UniversalFakeDetect, Chameleon, WildRF).

End-to-end example: Given an input image, the model extracts shallow visual features (layer 1) from the CLIP image encoder, then pools and projects them into fine-grained cues. For the real class, static learnable prompts produce the category center embedding. For the fake class, the cues are concatenated with learnable prompts to create a sample-adaptive fake embedding. The image feature is compared with both centers via cosine similarity to assign a real or fake label. During training, backpropagation updates the learnable prompts, the adapter, and the LoRA modules under the cross-entropy, CSC, and alignment losses, which sharpen the separation between real and fake features. At inference, the fake prompt is rebuilt for each sample, which is intended to improve robustness to unseen forgery types.

Technical innovations

Asymmetric Prompt Adapter (APA) that uses static prompts for real images and sample-adaptive prompts for fake images conditioned on fine-grained shallow-layer visual cues.
Conditional Supervised Contrastive (CSC) loss that clusters real image features while dispersing synthetic image features, enhancing the compactness of the real class and the separability between real and fake features.
Dynamic construction of fake category textual embeddings by concatenating learnable prompts with image-driven visual cues to enable adaptive decision boundaries for synthetic image detection.
Integration of lightweight LoRA modules for parameter-efficient tuning of VLMs within the proposed asymmetric adaptive prompting framework.
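
One way to read the CSC idea is sketched below. The summary does not give the paper's formula, so this is an assumed formulation and not the authors' loss: real anchors receive a standard supervised-contrastive term whose positives are the other real samples, while fake anchors receive only a repulsion term (the log-sum-exp of their similarities to all other samples), which pushes them apart. The temperature `tau=0.1` is also an assumption.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def csc_loss(features, labels, tau=0.1):
    """Assumed conditional contrastive loss. labels: 0 = real, 1 = fake.

    Expects at least two samples; returns the mean per-anchor loss.
    """
    feats = [normalize(f) for f in features]
    n = len(feats)
    total = 0.0
    for i in range(n):
        # Temperature-scaled cosine similarity to every other sample.
        sims = {a: sum(x * y for x, y in zip(feats[i], feats[a])) / tau
                for a in range(n) if a != i}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        if labels[i] == 0:
            # Real anchor: pull toward the other real samples.
            positives = [a for a in sims if labels[a] == 0]
            if positives:
                total += sum(log_denom - sims[p] for p in positives) / len(positives)
        else:
            # Fake anchor: no positives, only repulsion from everything else.
            total += log_denom
    return total / n


tight_real = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
scattered_real = [[1.0, 0.0], [0.0, -1.0], [0.0, 1.0], [-1.0, 0.0]]
batch_labels = [0, 0, 1, 1]
print(csc_loss(tight_real, batch_labels), csc_loss(scattered_real, batch_labels))
```

In the toy batch, the loss is lower when the two real features coincide than when they point in different directions, and it rises sharply if the two fake features collapse onto each other, which matches the compact-real, dispersed-fake behavior described above.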

Datasets

UniversalFakeDetect — 20 subsets of synthetic images from multiple GAN and diffusion models — public benchmark
Chameleon — real and synthetic images with wide post-processing artifacts — public benchmark
WildRF — synthetic image detection dataset from social media sources (Facebook, Reddit, Twitter) — public benchmark
LOKI — diverse proprietary real images including medical and remote sensing — private dataset
MSCOCO — general real-world image dataset — public benchmark

Baselines vs proposed

FatFormer: Accuracy = 96.5% / AP = 99.2% (GAN subset UniversalFakeDetect) vs HydraPrompt: 98.0% / 99.6%
C2P-CLIP: Accuracy = 95.3% / AP = 99.6% vs HydraPrompt: 98.0% / 99.6%
Effort: Accuracy = 97.4% / AP = 99.5% vs HydraPrompt: 98.0% / 99.6%
FatFormer: Accuracy = 93.8% / AP = 98.5% (Diffusion subset UniversalFakeDetect) vs HydraPrompt: 95.9% / 99.5%
Best previous on Chameleon (ProGAN training): 58.9% acc vs HydraPrompt: 61.3% acc
Best previous on Chameleon (SD v1.4 training): 62.6% acc vs HydraPrompt: 69.7% acc
Best previous on WildRF: 89.4% acc vs HydraPrompt: 95.9% acc

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.26421.

Fig 1

Fig 1: (a) t-SNE [45] visualizations of real and fake images

Fig 2

Fig 2: Analyses on the proposed APA and CSC. (a) Effectiveness of APA: z_i T_r and z_i T_f are compared for classification during

Limitations

Evaluation covers representative SID benchmarks but is limited to currently public datasets; real-world forgery types may continue to evolve.
Dynamic fake prompts rely on shallow features from one VLM backbone (CLIP ViT-L/14); generalizability to other architectures or modalities is not shown.
No explicit adversarial robustness tests are reported; for example, intentional attempts to fool the adaptive prompt mechanism are not evaluated.
Training uses no data augmentation; potential impact of augmentations or domain adaptation techniques remains unexplored.
Code and pretrained weights availability is unclear, impacting immediate reproducibility.

Open questions / follow-ons

How robust is HydraPrompt to adversarial attacks explicitly crafted to manipulate the sample-adaptive prompts or to exploit the feature space shaped by the CSC loss?
Can the asymmetric adaptive prompting concept extend effectively to other VLM architectures beyond CLIP ViT-L/14 or multi-modal fusion backbones?
How does HydraPrompt perform on synthetic images with extremely subtle or localized manipulations, such as partial-face deepfakes or AI-enhanced real images?
What are the trade-offs between computational cost and detection performance for real-time or resource-limited deployment scenarios?

Why it matters for bot defense

Bot-defense and CAPTCHA practitioners face challenges distinguishing human-generated images from synthetic or AI-generated content, especially as generative models rapidly evolve. HydraPrompt's approach of dynamically adapting fake category prompts to fine-grained image cues is a promising direction for improving detection robustness against unseen forgery types encountered in the wild. Its asymmetric design, with static anchors for real images and adaptive representations for fakes, matches the asymmetric nature of synthetic content distributions and avoids the brittleness of a single static decision boundary.

Applying similar adaptive prompting and contrastive learning principles could strengthen CAPTCHA defenses by improving classification margins and generalization, making it harder for bots to evade detection with novel generative inputs. However, the trade-offs in computational overhead and prompt design complexity need evaluation in latency-sensitive web environments. Integrating HydraPrompt-like modules with existing bot defenses could boost resilience to image-based spoofing, provided the added latency proves acceptable.

Cite

@article{arxiv2605_26421,
  title={ HydraPrompt: An Adaptive and Asymmetric Framework of Vision-Language Models for Synthetic Image Detection },
  author={ Senyuan Shi and Hao Tan and Zichang Tan and Shuhan Feng and Ajian Liu and Sergio Escalera and Jun Wan },
  journal={arXiv preprint arXiv:2605.26421},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.26421}
}
