FRAME: Forensic Routing and Adaptive Multi-path Evidence Fusion for Image Manipulation Detection

Source: arXiv:2605.12826 · Published 2026-05-12 · By Kaixiang Zhao, Tianrun Yu, Aoxu Zhang, Junhao Su, Porter Jenkins, Amanda Hughes

TL;DR

FRAME tackles robust detection and localization of image manipulations produced by sophisticated editing tools and generative AI. Rather than relying on a single detector, it draws on a pool of diverse forensic algorithms. Prior methods either rely on handcrafted cues, which generalize poorly and yield fragmented evidence, or on end-to-end deep models, which are opaque and trained on large datasets.

FRAME (Forensic Routing and Adaptive Multi-path Evidence Fusion) treats combinations of existing forensic algorithms as candidate analysis paths within a supernet-inspired modular structure. For each input image and suspected manipulation type, a learned graph neural network selector scores the candidate paths and picks the most informative ones; their complementary outputs are then fused into a single heatmap for detection and localization. This design targets three weaknesses of prior work: uneven evidence reliability across algorithms, noisy outputs, and fixed fusion rules.

On multiple public manipulation benchmarks, FRAME outperforms fixed and heuristic fusion baselines built from the same algorithms by large margins (e.g. +0.079 F1 on CASIA v1 localization). It also reports state-of-the-art pixel-level localization and image-level detection against recent deep learning methods, despite training only on a single dataset (CASIA v2) and using a lightweight selector and fusion module. The authors add a theoretical analysis showing that, under stated assumptions, learned adaptive selection strictly improves over uniform fusion and over the best single forensic method. An open-source implementation supports reproducibility and practical adoption.

Key findings

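On CASIA v1, FRAME improves detection AUC by +0.067, F1 score by +0.079, and mean IoU by +0.065 over a fusion baseline built from the same algorithms. The AUC gap matches uniform fusion in the baseline figures listed in this summary (0.682 vs 0.749); against XGB-Ensemble, the best learned ensemble, the listed AUC margin is +0.007 (0.742 vs 0.749).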
FRAME achieves the highest detection AUC on CASIA v1, Coverage, and RealisticTampering datasets compared to four recent deep learning baselines (TruFor, MMFusion, ManTraNet, CAT-Net).
FRAME outperforms the strongest deep baseline TruFor by +0.017 detection AUC and +0.013 F1 score on CASIA v1 localization, with similar gains on Coverage and RealisticTampering datasets.
The adaptive path selector is implemented as a lightweight 3-layer GraphSAGE network trained with MSE loss to predict path performance scores.
FRAME samples K=50 candidate forensic paths and fuses the top k=5 selected path outputs using learned fusion weights, improving over top-1 selection and uniform fusion ablations.
Theoretical analysis proves that, under bounded-loss and no-tie assumptions, a sufficiently accurate learned selector strictly outperforms, in expectation, both uniform fusion and the best fixed single-algorithm baseline.
FRAME maintains interpretability by preserving individual forensic module contributions while fusing outputs.
FRAME builds on diverse forensic algorithms from the pyIFD toolkit (e.g., ELA, DCT, PRNU) and uses them as-is, without retraining.

Threat model

The adversary is a forger who alters the semantic or pixel-level content of an image, possibly using unknown editing or generative methods, and who may apply post-processing such as compression. FRAME assumes no prior knowledge of the attacker’s exact method and instead relies on the outputs of multiple forensic algorithms as evidence. The working assumption is that the attacker cannot hide every cue these diverse algorithms detect, that is, cannot evade all of the tools at once. The base forensic tools are used as-is, without retraining. Given this mixed evidence, FRAME aims to detect and localize manipulations robustly.

Methodology — deep read

FRAME addresses the image manipulation detection problem by modeling it as a context-adaptive path selection and evidence fusion task over a pool of diverse forensic algorithms.

Threat model and assumptions: FRAME assumes an adversary who may have manipulated images using unknown methods possibly with compression or post-processing, but the system does not rely on retraining forensic algorithms or require knowledge of the manipulation method upfront. The system uses image-level and manipulation type metadata as context.
Data: Training uses the CASIA v2 dataset with 8,831 images for training and 1,918 images for validation. Labels include pixel-level manipulation masks. Evaluation is performed on four external datasets (CASIA v1: 1,754 images, Coverage: 200 images, Columbia: 363 images for image-level detection only, RealisticTampering: 440 images) for generalization testing. The datasets cover a variety of splicing, copy-move, and realistic tampering scenarios.
Architecture/algorithm: The forensic supernet is defined as a directed acyclic graph of modular forensic algorithms from pyIFD (e.g., Error Level Analysis, Discrete Cosine Transform analysis, Noise-based methods). Each forensic module produces an output heatmap or score.
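The supernet idea can be pictured with a small sketch. The module names, the edge layout, and the `sample_path` and `sample_candidates` helpers below are illustrative assumptions, not the paper's actual graph or the pyIFD API; the sketch only shows how candidate paths could be drawn as walks through a directed acyclic graph of forensic modules.

```python
import random

# Hypothetical supernet: each key is a forensic module, each value lists the
# modules that may follow it. Names and edges are illustrative only.
SUPERNET = {
    "input": ["ELA", "DCT", "PRNU"],
    "ELA": ["DCT", "output"],
    "DCT": ["PRNU", "output"],
    "PRNU": ["output"],
    "output": [],
}

def sample_path(graph, rng, start="input", end="output"):
    """Random walk from start to end; returns the modules visited in order."""
    path, node = [], start
    while node != end:
        node = rng.choice(graph[node])
        if node != end:
            path.append(node)
    return tuple(path)

def sample_candidates(graph, k, seed=0):
    """Draw k candidate paths (duplicates allowed) with a fixed seed."""
    rng = random.Random(seed)
    return [sample_path(graph, rng) for _ in range(k)]

# K = 50 matches the number of candidates reported in this summary.
candidates = sample_candidates(SUPERNET, k=50)
```

Because the graph is acyclic and every module eventually reaches `output`, each walk terminates. A toy graph this small yields many duplicate paths; a real pool would be larger.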

Forensic analysis paths are subgraphs representing subsets of these modules combined. FRAME samples candidate paths from the supernet.
A learned selector implemented as a 3-layer GraphSAGE graph neural network takes as input a graph embedding of each path combined with image features and manipulation metadata. It is trained to predict the performance score (e.g., pixel-level F1 or IoU) of the path on validation images.
Candidate paths are scored by the selector, and the top-k paths (k=5 in main experiments) are executed on the test image, producing forensic outputs.
The outputs from selected paths are fused via weighted linear combination with learned fusion weights to produce a final heatmap for detection and localization.
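To make the selector step more concrete, here is a minimal NumPy sketch of a three-layer GraphSAGE-style scorer with mean aggregation. Only the layer count comes from the summary. The feature sizes, the mean pooling over nodes, the linear readout, the undirected toy graph, and the random untrained weights are all assumptions for illustration, and `sage_layer` and `score_path` are hypothetical names.

```python
import numpy as np

def sage_layer(h, adj, w_self, w_neigh):
    """One GraphSAGE-style layer with mean aggregation.

    h: (n_nodes, d_in) node features; adj: (n_nodes, n_nodes) 0/1 adjacency.
    """
    deg = adj.sum(axis=1, keepdims=True)
    neigh_mean = (adj @ h) / np.maximum(deg, 1)  # isolated nodes get zeros
    return np.maximum(h @ w_self + neigh_mean @ w_neigh, 0.0)  # ReLU

def score_path(h, adj, context, params):
    """Apply the layers, mean-pool the nodes, append context, linear readout."""
    for w_self, w_neigh in params["layers"]:
        h = sage_layer(h, adj, w_self, w_neigh)
    pooled = np.concatenate([h.mean(axis=0), context])
    return float(pooled @ params["readout"])

rng = np.random.default_rng(0)
d, hidden, ctx_dim = 4, 8, 3          # assumed sizes
dims = [d, hidden, hidden, hidden]    # three layers
params = {
    "layers": [(rng.normal(size=(a, b)), rng.normal(size=(a, b)))
               for a, b in zip(dims, dims[1:])],
    "readout": rng.normal(size=hidden + ctx_dim),
}

# Toy path graph with three chained modules: 0 - 1 - 2.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
h0 = rng.normal(size=(3, d))          # stand-in module features
context = np.array([1.0, 0.0, 0.0])   # e.g. one-hot manipulation type
predicted_score = score_path(h0, adj, context, params)
```

With random weights the score is meaningless; in the described setup the weights would be fit by regressing on observed path performance.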

Training regime: To create data for training the selector, K=50 candidate paths are sampled for each training image. Each path is executed, and its output is scored against ground truth masks to generate supervised scores. The selector is trained to minimize mean squared error between predicted and observed path scores on the validation set from CASIA v2.
Evaluation protocol: Evaluation uses pixel-level F1, mean IoU for localization, and image-level AUC and accuracy for detection. Different baselines are compared, including handcrafted single forensic methods, uniform and heuristic fusions of handcrafted cues, two learned ensemble baselines (Random Forest and Gradient Boosted Trees) trained on the same pyIFD outputs, and recent deep learning state-of-the-art detectors. External test sets are held out to measure generalization. Ablations test components like learned selector versus fusion, and top-1 versus top-k fusion.
Reproducibility: The authors release code and pretrained weights publicly. The forensic module pool uses existing pyIFD algorithms without retraining them. The selector and fusion modules are lightweight and trained only on the CASIA v2 split, which keeps reproduction straightforward.
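The training targets and evaluation metrics described above can be sketched in plain Python. This is a simplified illustration on flattened toy masks, not the authors' evaluation code: the 0.5 threshold, the toy heatmaps, the stand-in selector predictions, and the convention of returning 0.0 for empty denominators are all assumptions.

```python
def confusion(pred, truth):
    """Counts of true positives, false positives, and false negatives."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn

def pixel_f1(pred, truth):
    tp, fp, fn = confusion(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention when empty

def pixel_iou(pred, truth):
    tp, fp, fn = confusion(pred, truth)
    union = tp + fp + fn
    return tp / union if union else 0.0      # assumed convention when empty

def image_auc(scores, labels):
    """Chance a manipulated image outscores an authentic one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mse(predicted, observed):
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

def binarize(heatmap, threshold=0.5):
    return [v >= threshold for v in heatmap]

# Toy ground-truth mask and per-path heatmaps, flattened to four pixels.
truth = [1, 1, 0, 0]
path_heatmaps = {
    ("ELA",): [0.9, 0.8, 0.1, 0.2],
    ("DCT",): [0.9, 0.2, 0.7, 0.1],
    ("ELA", "DCT"): [0.1, 0.2, 0.9, 0.8],
}

# Supervised targets for the selector: one observed score per executed path.
observed_f1 = [pixel_f1(binarize(h), truth) for h in path_heatmaps.values()]
predicted_f1 = [0.9, 0.5, 0.1]  # stand-in for selector outputs
selector_loss = mse(predicted_f1, observed_f1)

iou_dct = pixel_iou(binarize(path_heatmaps[("DCT",)]), truth)
auc = image_auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
```

Mean IoU over a dataset is then the average of `pixel_iou` across images.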

Concrete example: For a test image from CASIA v1, FRAME samples 50 candidate forensic paths (e.g., ELA+DCT, PRNU+Ghost modules). The selector scores each path based on its learned model. The top 5 scoring paths are executed to produce heatmaps highlighting suspected manipulated regions. These heatmaps are fused with learned weights to form a final manipulation localization mask. This final output is evaluated against ground truth to compute pixel-level metrics. Across the dataset, this pipeline consistently improves over baselines that apply fixed or uniform combinations of modules.
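The walk-through above maps onto a short NumPy sketch. It uses random numbers in place of real selector scores and module heatmaps, and it reuses the predicted scores as fusion weights purely for illustration, whereas the paper learns those weights. The `select_top_k` and `fuse` helpers, the weight normalization, and the 0.5 threshold are assumptions.

```python
import numpy as np

def select_top_k(paths, scores, k):
    """Return the k paths with the highest predicted scores, best first."""
    order = np.argsort(scores)[::-1][:k]
    return [paths[i] for i in order], order

def fuse(heatmaps, weights):
    """Weighted linear combination of heatmaps; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.stack(heatmaps), axes=1)

rng = np.random.default_rng(0)
paths = [f"path_{i}" for i in range(50)]           # K = 50 candidates
predicted = rng.random(50)                          # stand-in selector scores
chosen, idx = select_top_k(paths, predicted, k=5)   # k = 5
heatmaps = [rng.random((8, 8)) for _ in chosen]     # stand-in path outputs
fused = fuse(heatmaps, weights=predicted[idx])      # weights are an assumption
mask = fused >= 0.5                                 # threshold is an assumption
```

Only the five selected paths need to be executed at test time, which is where routing saves work compared with running every candidate.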

Technical innovations

Framing multi-algorithm forensic analysis as a supernet-inspired modular system where analysis corresponds to adaptive path selection over heterogeneous forensic methods, enabling context-dependent routing.
Using a Graph Neural Network (GraphSAGE) predictor to score forensic analysis paths prior to execution, learning to estimate their image- and manipulation-specific effectiveness.
Developing an adaptive fusion mechanism that aggregates outputs from multiple selected forensic paths with learned weights, balancing complementary evidence while preserving interpretability.
Providing a theoretical framework with provable guarantees that under natural conditions, learned adaptive path selection strictly outperforms uniform fusion and best single algorithm baselines.
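The intuition behind the guarantee can be checked on a toy loss table. The sketch below does not reproduce the paper's proof or its assumptions. It only illustrates that choosing the best path per image can never do worse on average than committing to one fixed path, and that the best fixed path can never do worse than the average over all paths. The loss values are made up, and the average over paths is a crude stand-in for uniform fusion.

```python
# Rows are images, columns are candidate paths; entries are losses in [0, 1].
# The numbers are made up for illustration.
losses = [
    [0.2, 0.6, 0.5],
    [0.7, 0.1, 0.5],
    [0.6, 0.5, 0.2],
]
n_paths = len(losses[0])

def mean(xs):
    return sum(xs) / len(xs)

# Best single path when the same path must be used for every image.
best_fixed = min(mean([row[j] for row in losses]) for j in range(n_paths))

# Average loss across all paths, a rough proxy for uniform fusion.
uniform = mean([mean(row) for row in losses])

# Perfect per-image selection (an oracle selector).
oracle = mean([min(row) for row in losses])

print(f"oracle={oracle:.3f} best_fixed={best_fixed:.3f} uniform={uniform:.3f}")
```

A learned selector sits somewhere between the oracle and the fixed baselines; the paper's result concerns how accurate it must be for the improvement to be strict.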

Datasets

CASIA v2 — 8,831 training + 1,918 validation images — publicly available image manipulation benchmark
CASIA v1 — 1,754 images — publicly available, test set for benchmarks
Coverage — 200 images — focused on copy-move forgeries, publicly available
Columbia — 363 images — publicly available, used for image-level detection evaluation only
RealisticTampering — 440 images — publicly available, realistic splicing across cameras

Baselines vs proposed

Best single pyIFD module: detection AUC roughly 0.10 below FRAME on CASIA v1
Uniform fusion of all pyIFD modules: detection AUC of 0.682 vs FRAME 0.749 on CASIA v1
Heuristic selection + uniform fusion: detection AUC 0.710 vs FRAME 0.749 on CASIA v1
RF-Ensemble (Random Forest on pyIFD outputs): detection AUC 0.734 vs FRAME 0.749 on CASIA v1
XGB-Ensemble (Gradient Boosting on pyIFD outputs): detection AUC 0.742 vs FRAME 0.749 on CASIA v1
TruFor [12]: detection AUC 0.732 on CASIA v1 vs FRAME 0.749
MMFusion [23]: detection AUC 0.715 on CASIA v1 vs FRAME 0.749
ManTraNet [28] and CAT-Net [19]: lower detection/localization metrics on CASIA v1 compared to FRAME
Ablation top-1 selection only: F1 and IoU lower by ~0.025 compared to full FRAME with top-5 fusion
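The CASIA v1 detection AUC margins implied by the figures listed above can be recomputed with a few lines of Python. The values are copied from this list; the script adds no new results.

```python
FRAME_AUC = 0.749

# Detection AUC on CASIA v1, as listed above.
baseline_auc = {
    "Uniform fusion": 0.682,
    "Heuristic selection + uniform fusion": 0.710,
    "RF-Ensemble": 0.734,
    "XGB-Ensemble": 0.742,
    "TruFor": 0.732,
    "MMFusion": 0.715,
}

# FRAME's margin over each baseline, rounded to three decimals.
margins = {name: round(FRAME_AUC - auc, 3) for name, auc in baseline_auc.items()}

for name, gap in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{name}: +{gap:.3f}")
```

The smallest margin is over XGB-Ensemble (+0.007) and the largest over uniform fusion (+0.067).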

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.12826.

Fig 1: Overview of FRAME. Given an input image, FRAME samples candidate paths from a forensic supernet, scores them with a …

Figs 2-8 (page 4).

Limitations

FRAME relies on a fixed pool of existing forensic algorithms; it does not improve or retrain these base modules.
The lightweight selector and fusion modules are trained on only one dataset (CASIA v2), which may limit adaptation to very novel manipulations or domains.
Evaluation focuses on typical image manipulations; robustness against adversarial forgeries or increasingly sophisticated generative manipulations is not fully tested.
The theoretical guarantees assume bounded loss and sufficient selector accuracy, which may be challenging in distribution shift or complex real-world scenarios.
Computational cost includes running multiple forensic algorithms per image, potentially limiting real-time or large-scale deployment.
Current fusion is linear weighted; nonlinear or more complex fusion strategies are unexplored.

Open questions / follow-ons

How well does the adaptive selection and fusion generalize to novel or adversarial manipulation techniques beyond current datasets?
Can learning-based approaches be integrated into the forensic supernet pool to augment or replace handcrafted forensic algorithms without loss of interpretability?
Would nonlinear or dynamic fusion mechanisms further improve evidence integration beyond the current weighted linear fusion?
How can FRAME be scaled efficiently for real-time or large-scale deployment, given the computational cost of executing multiple forensic modules?

Why it matters for bot defense

FRAME’s approach is highly relevant to bot-defense and CAPTCHA systems that require reliable detection of image authenticity or manipulation. Its framework for adaptively routing forensic signals based on input context, combined with evidence fusion, can inform defenses that need to verify media integrity under adversarial conditions. The modular design allows practitioners to incorporate diverse image forensics methods instead of relying on a single detector, addressing heterogeneous attack patterns. The theoretical grounding provides confidence that adaptive selection can outperform fixed aggregation methods commonly used. However, given the computational overhead, practical CAPTCHA implementations may need lightweight or approximated variants. Also, since FRAME emphasizes interpretability by preserving individual module contributions, this can help forensic analysts understand suspicious inputs more transparently than black-box deep detectors. Overall, FRAME highlights the utility of adaptive, multi-path fusion frameworks for robust image manipulation detection in security-critical downstream applications like bot defense.

Cite

bibtex

@article{arxiv2605_12826,
  title={FRAME: Forensic Routing and Adaptive Multi-path Evidence Fusion for Image Manipulation Detection},
  author={Kaixiang Zhao and Tianrun Yu and Aoxu Zhang and Junhao Su and Porter Jenkins and Amanda Hughes},
  journal={arXiv preprint arXiv:2605.12826},
  year={2026},
  url={https://arxiv.org/abs/2605.12826}
}
