REVEAL: Reference-Grounded Reasoning for Multimodal Manipulation Detection

Source: arXiv:2605.28459 · Published 2026-05-27 · By Jun Zhou, Bingwen Hu, Yaxiong Wang, Zhedong Zheng, Yongzhen Wang, Yuchen Zhang et al.

TL;DR

This paper addresses multimodal manipulation detection for forged image–text news pairs, where existing methods generalize poorly because they rely on memorized artifact patterns and miss subtle or imperceptible manipulations. The authors recast the problem as a reference-grounded verification task inspired by human comparative reasoning: a query pair is verified by retrieving semantically relevant authentic references and detecting discrepancies against them. The key technical contributions are a large-scale reference library of 170K authentic image–text pairs covering over 40K public figures, an Authenticity Conditioned Cross Attention (ACCA) mechanism that fuses query and reference features to capture fine-grained inconsistencies, and a task-decoupled Mixture-of-Experts (MoE) architecture that jointly performs classification and fine-grained grounding while mitigating task conflicts. Experiments on multiple datasets show that REVEAL outperforms state-of-the-art methods by large margins in binary classification, multi-label classification, image grounding, and text grounding, and that it supports training-free domain adaptation through plug-and-play reference library updates. The work marks a shift from isolated artifact detection to external reference-based verification, with the aim of improving robustness, interpretability, and generalization against evolving misinformation.

Key findings

REVEAL achieves 97.82% AUC and 3.75% EER on the DGM4 dataset, compared with HAMMER's 93.19% AUC and 14.10% EER, and a reported +6.92% average improvement across the multimodal tasks.
On the SAMM dataset, which uses different manipulation pipelines, REVEAL attains 99.83% AUC and 0.65% EER, a 4.77-point reduction from the 5.42% EER of RamDG, the prior best.
In cross-domain generalization tests, where the model is trained on a single news source and evaluated on the others, REVEAL surpasses the baselines HAMMER, HAMMER++, and ASAP by +10.2% ACC and +15.8% IoUm on average.
In zero-shot cross-dataset transfer to MDSM, without fine-tuning, REVEAL improves average ACC by +4.97%, mAP by +11.82%, and mIoU by +6.97% over the next-best baselines.
Ablations show that adding the visual reference stream to the baseline raises AUC from 90.07% to 95.78% (+5.71%) and image grounding IoUm from 79.21% to 84.14% (+4.93%).
Adding the text reference stream further improves text grounding F1 by +6.03%, indicating that visual and textual references are complementary.
Introducing task-aware MoE experts yields additional, consistent gains across classification and grounding metrics by mitigating conflicts in multi-task optimization.
REVEAL enables training-free domain adaptation: the reference library can be updated without tuning model parameters.
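
For readers less familiar with the EER figures quoted above, the following is a minimal sketch of how an equal error rate can be approximated from detector scores. It assumes that a higher score means "more likely manipulated"; the function name `equal_error_rate` and the toy scores are hypothetical illustrations, not taken from the paper or its code.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: higher means "more likely manipulated" (assumption of this sketch).
    labels: 1 = manipulated, 0 = authentic.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus one above the maximum.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    for t in thresholds:
        # False positive rate: authentic samples flagged as manipulated.
        fpr = sum(s >= t for s in negatives) / len(negatives)
        # False negative rate: manipulated samples scored below the threshold.
        fnr = sum(s < t for s in positives) / len(positives)
        gap = abs(fpr - fnr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Toy data (hypothetical): one manipulated sample scores below one authentic sample.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```

A lower EER is better, which is why the findings above report it as a reduction rather than a gain.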

Threat model

The threat model assumes an adversary who produces forged image–text pairs to spread misinformation, including coordinated cross-modal manipulations that may evade traditional artifact-based detectors. The adversary can manipulate visual or textual content so that intrinsic artifacts are minimal or imperceptible. The defender can retrieve authentic reference pairs from a large-scale external library to perform comparative verification. The model assumes that the adversary cannot manipulate or poison this reference library, compromise the retrieval index, or modify stored references.

Methodology — deep read

The authors reformulate multimodal manipulation detection as a reference-grounded verification problem. The input is a query image–text pair (I, T) suspected of manipulation. Instead of predicting authenticity solely from internal artifacts, the model retrieves authentic reference pairs (Iref, Tref) from an external gallery of 170K authentic news image–text pairs covering 40K public figures. References are ranked by cosine similarity between visual embeddings from a visual encoder, which is optimized during training with a Visual Retrieval Contrastive loss (L_VRC) that pulls authentic pairs closer in embedding space. The gallery is dynamic during training (a memory bank) and static at inference (built offline and indexed with FAISS for efficiency).

The core detection network first encodes the query and the retrieved references. The Authenticity Conditioned Cross Attention (ACCA) module then operates on both the textual and visual streams. In the textual stream, query tokens attend to the tokens of the retrieved reference caption, producing discrepancy-aware textual features; a Reference-Guided Attention Supervision loss (L_RGAS) regularizes this attention, encouraging alignment for authentic tokens and broader context attention for manipulated tokens. In the visual stream, element-wise subtraction models query-reference discrepancies in feature space, and a subsequent transformation yields manipulation-aware visual representations. The textual and visual discrepancy features are then fused by cross-attention to support joint reasoning.

To address global manipulation detection and fine-grained multimodal grounding jointly, the authors design a task-decoupled Mixture-of-Experts (MoE) architecture. Each task (binary classification BIC, multi-label classification MLC, image manipulation grounding IMG, text manipulation grounding TMG) has its own pool of feed-forward experts that specialize in task-specific feature patterns and are kept isolated to mitigate optimization conflicts.
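
The retrieval and visual-discrepancy steps described above can be sketched as follows. This is a toy, brute-force stand-in written in plain Python: the paper indexes its gallery with FAISS, whereas this sketch scans a small dictionary. The names `retrieve_top_k`, `visual_discrepancy`, and the three-dimensional embeddings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_emb, gallery, k=2):
    """Return the ids of the k gallery entries most similar to the query, best first."""
    scored = [(cosine_similarity(query_emb, emb), ref_id)
              for ref_id, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ref_id for _, ref_id in scored[:k]]


def visual_discrepancy(query_feat, ref_feat):
    """Element-wise subtraction of reference features from query features."""
    return [q - r for q, r in zip(query_feat, ref_feat)]


# Hypothetical gallery of authentic visual embeddings keyed by reference id.
gallery = {
    "ref_a": [1.0, 0.0, 0.0],
    "ref_b": [0.9, 0.1, 0.0],
    "ref_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

top = retrieve_top_k(query, gallery, k=2)
diff = visual_discrepancy(query, gallery[top[0]])
print(top, diff)
```

In the method as summarized, the difference vector would next pass through a learned transformation; that step is omitted here because the summary does not specify it.
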
The MoE uses task-aware routing: the router is conditioned on the input features concatenated with learnable task embeddings and weights the top-K experts for each task in the forward pass. For TMG, a reference-enhanced teacher network directly processes the concatenated query and reference texts and supplies fine-grained token-level supervision via distillation, improving sensitivity to subtle text manipulations.

The overall training objective combines the task-specific losses (cross-entropy for classification, L1 and Generalized IoU for image grounding, class-weighted cross-entropy for text grounding) with the retrieval contrastive loss, the attention guidance loss, and the distillation loss, balanced by hyperparameters α and β.

Training uses standard mini-batches on the DGM4 partitions. To avoid data leakage, each training sample retrieves references only from the authentic pairs in the training partition. Retrieval quality improves over the course of training through the contrastive loss. The model uses a multimodal encoder backbone similar to BEiT-3 to produce unified image-text embeddings. At inference, the offline gallery is queried with the query image embedding to obtain authentic references, and the fused discrepancy features pass through the task-adaptive MoE to yield authenticity predictions and localization heatmaps.

Evaluation covers multiple datasets, including DGM4, SAMM, and MDSM. Cross-domain and zero-shot cross-dataset generalization tests assess robustness. Ablations isolate the impact of the visual reference stream, the text reference stream, and the MoE architecture by progressively enabling each component on the artifact-only baseline. Training-dynamics plots show faster convergence with the MoE. The code and the large-scale reference database are publicly released to facilitate reproducibility.
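
The task-aware top-K routing described above can be illustrated with a small plain-Python sketch. It models a single expert pool with a linear router over the concatenation of input features and a task embedding; the router weights, the toy experts, and the names `route_top_k` and `moe_forward` are all hypothetical, and the paper's actual router and per-task pools may differ.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(features, task_embedding, router_weights, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    router_input = features + task_embedding  # list concatenation
    logits = [sum(w * x for w, x in zip(row, router_input))
              for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize over the selected experts only.
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(features, task_embedding, router_weights, experts, k=2):
    """Weighted sum of the outputs of the top-k experts."""
    routing = route_top_k(features, task_embedding, router_weights, k)
    out = [0.0] * len(features)
    for idx, weight in routing:
        expert_out = experts[idx](features)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, routing


# Three toy experts standing in for feed-forward networks (hypothetical).
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
# One router row per expert; columns cover 2 features + 2 task-embedding dims.
router_weights = [
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
]
features = [1.0, 2.0]
task_a = [1.0, 0.0]  # hypothetical embedding for one task
task_b = [0.0, 1.0]  # hypothetical embedding for another task

out_a, routing_a = moe_forward(features, task_a, router_weights, experts)
out_b, routing_b = moe_forward(features, task_b, router_weights, experts)
print([i for i, _ in routing_a], [i for i, _ in routing_b])
```

With identical input features, the two task embeddings select different experts, which is the behavior that conditioning the router on the task is meant to provide.
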
Overall, the methodology realizes a new paradigm of retrieval-augmented multimodal verification that (1) mitigates brittle artifact memorization by grounding in external authentic references, (2) models fine-grained query-reference inconsistencies via cross-attention fusion, and (3) jointly optimizes global detection and local grounding with task-specific adaptive experts.
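
The image grounding objective mentioned in this section pairs an L1 term with Generalized IoU. The sketch below shows the standard GIoU definition for axis-aligned boxes in (x1, y1, x2, y2) form, assuming non-degenerate boxes. The function names and the unit loss weights are hypothetical; the summary does not state how the paper weights the two terms.

```python
def generalized_iou(box_a, box_b):
    """GIoU of two non-degenerate axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle, clamped to zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    # Penalize the share of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


def grounding_loss(pred, target, l1_weight=1.0, giou_weight=1.0):
    """L1 distance on coordinates plus (1 - GIoU); the weights are placeholders."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - generalized_iou(pred, target))


pred_box = (0.0, 0.0, 2.0, 2.0)
true_box = (1.0, 1.0, 3.0, 3.0)
print(generalized_iou(pred_box, true_box), grounding_loss(pred_box, true_box))
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes, because the enclosing-box penalty still varies with the distance between them.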

Technical innovations

Formulating multimodal manipulation detection as a reference-grounded verification task comparing query pairs against retrieved authentic references to expose query-reference inconsistencies.
Constructing a large-scale universal reference library of 170K authentic news image–text pairs spanning over 40K public figures for retrieval-augmented verification.
Authenticity Conditioned Cross Attention (ACCA) module that fuses query and reference features using cross-modal attention with reference-guided attention supervision to yield discrepancy-aware representations.
Task-decoupled Mixture-of-Experts (MoE) architecture with task-aware routing to handle heterogeneous objectives of global binary/multilabel classification and fine-grained image/text grounding, reducing task interference.
Reference-enhanced teacher expert for text manipulation grounding that distills fine-grained token-level discrepancy knowledge into the student to improve sensitivity to subtle semantic manipulations.

Datasets

DGM4 — size unspecified — public benchmark for multimodal manipulation detection with binary classification, multi-label classification, image and text grounding annotations
SAMM — size unspecified — dataset with different manipulation pipelines and original SAMM image reference gallery
MDSM — size unspecified — cross-dataset for zero-shot transfer evaluation
VisualNews — 170K authentic news image–text pairs — constructed reference library used for retrieval

Baselines vs proposed

HAMMER: binary classification AUC = 93.19%, EER = 14.10%, ACC = 86.39% vs REVEAL: AUC = 97.82%, EER = 3.75%, ACC = 93.18% on DGM4
RamDG: EER = 5.42% vs REVEAL: EER = 0.65% on SAMM dataset
ASAP: average accuracy = 79.06% vs REVEAL (Our-Top2): 87.11% on cross-domain generalization
HAMMER++: average cross-dataset ACC = 63.39% vs REVEAL: 71.71% on MDSM zero-shot transfer
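
As a quick arithmetic check, the gaps implied by the comparison lines above can be computed directly. The numbers below are copied from those lines; the dictionary keys and the helper name `improvement` are hypothetical labels for this sketch. Only per-metric gaps are computed, so averaged figures reported elsewhere in this summary are not reproduced here.

```python
# (baseline, REVEAL, higher_is_better), copied from the comparison lines above.
rows = {
    "DGM4 AUC vs HAMMER": (93.19, 97.82, True),
    "DGM4 EER vs HAMMER": (14.10, 3.75, False),
    "DGM4 ACC vs HAMMER": (86.39, 93.18, True),
    "SAMM EER vs RamDG": (5.42, 0.65, False),
    "Cross-domain ACC vs ASAP": (79.06, 87.11, True),
    "MDSM ACC vs HAMMER++": (63.39, 71.71, True),
}


def improvement(baseline, proposed, higher_is_better):
    """Signed improvement in percentage points (positive means REVEAL is better)."""
    delta = proposed - baseline if higher_is_better else baseline - proposed
    return round(delta, 2)


gains = {name: improvement(*values) for name, values in rows.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.2f} points")
```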

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.28459.

Fig 1: Artifact-centric vs. Reference-grounded

Fig 2: Overview of the proposed REVEAL framework. (a) Primary Pipeline: (a-a) Reference Retrieval

Figs 3-8 (page 1).

Limitations

The method relies on availability and coverage of a large-scale authentic reference library, which may be domain-specific or incomplete, limiting applicability to novel or obscure public figures or events.
Evaluation focuses on specific news image–text pairs datasets; generalization to other multimodal manipulation domains (e.g., social media memes, video-text pairs) remains untested.
Although training-free adaptation via reference updates is demonstrated, dynamic scenarios with rapidly evolving content may require continual curation of references.
The retrieval module depends on visual embeddings; failures in retrieval quality impact verification performance, especially under challenging domain shifts or adversarial manipulations.
The model is evaluated under standard benchmarks lacking detailed adversarial threat analysis or robustness against strong adaptive attackers explicitly targeting the retrieval-augmented verification pipeline.

Open questions / follow-ons

How can the reference library be automatically updated and curated in real time to handle fast-emerging events or figures?
Can the framework extend to video–text or audio–text multimodal manipulation detection requiring temporal grounding?
How robust is the retrieval-augmented verification under adaptive adversarial attacks targeting both the retrieval process and discrepancy modeling?
What is the impact of noisy or partially incorrect reference data on the verification accuracy and interpretability?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, REVEAL offers a promising approach to detecting sophisticated disinformation built on coordinated image–text manipulations: it grounds detection in external authentic evidence rather than relying solely on intrinsic artifact patterns. The retrieval-augmented verification framework could inspire systems that verify suspicious content by dynamically cross-referencing it against trusted source databases, improving robustness to evasive or imperceptible manipulations. Training-free adaptation through reference set updates can help maintain efficacy against evolving attack vectors without costly retraining. The dual-stream discrepancy modeling and expert routing also offer modular design patterns applicable to multimodal bot detection pipelines. Applying the approach, however, requires constructing and maintaining a comprehensive, trustworthy reference library for the application domain and securing the retrieval mechanism against adversarial interference.

Cite

bibtex

@article{arxiv2605_28459,
  title={ REVEAL: Reference-Grounded Reasoning for Multimodal Manipulation Detection },
  author={ Jun Zhou and Bingwen Hu and Yaxiong Wang and Zhedong Zheng and Yongzhen Wang and Yuchen Zhang and Ping Liu },
  journal={arXiv preprint arXiv:2605.28459},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.28459}
}
