CUPID: Reconstructing UV Texture Maps for Interpretable Person-of-Interest Deepfake Detection

Source: arXiv:2606.20302 · Published 2026-06-18 · By Giovanni Affatato, Sara Mandelli, Edoardo Daniele Cannas, Paolo Bestagini, Stefano Tubaro

TL;DR

This paper introduces CUPID, a novel Person-of-Interest (POI) video deepfake detection method that leverages 3D Morphable Model (3DMM)-based UV texture maps combined with a Masked Autoencoder (MAE) for self-supervised representation learning. Unlike prior POI detectors that require training on deepfake videos or on the specific POI, CUPID trains only on real videos of diverse identities, making it identity-agnostic and obviating the need for POI data during training. At inference, embeddings extracted from test videos of the POI are compared against a reference set of pristine POI videos to detect manipulations. Additionally, operating in the UV texture space enables spatial interpretability by highlighting facial regions responsible for classification decisions.

The method was extensively evaluated on four deepfake datasets (DF-TIMIT, FakeAVCeleb, KoDF, DeepSpeak) under both high- and low-quality compression settings. CUPID outperformed state-of-the-art POI and generic deepfake detectors on most datasets, exhibiting particular robustness to post-processing such as downscaling and compression, and provided faster inference. The interpretable residual maps delivered localized explanations of manipulations, a rare feature among deepfake detectors. This work contributes a practical, efficient, and interpretable framework suitable for POI deepfake detection even when the POI is unseen in training.

Key findings

  • CUPID does not require any deepfake videos or POI identity during training, relying solely on real videos of various identities.
  • By leveraging UV texture maps, CUPID achieves semantic pixel-level correspondence across facial regions, enhancing robustness and interpretability.
  • Across four datasets (DF-TIMIT, FakeAVCeleb, KoDF, and DeepSpeak), CUPID outperforms state-of-the-art POI detectors (Figs. 7, 8).
  • CUPID maintains accuracy under strong compression (H.264 at CRF 40) and downscaling, outperforming baselines by up to 5% AUC.
  • Inference with CUPID is substantially faster as it avoids per-POI retraining and operates on compact embeddings.
  • Contrastive loss combined with reconstruction and perceptual losses encourages identity-aware embeddings across multiple encoder depths.
  • Interpretability maps localize manipulated facial regions by comparing decoded UV texture embeddings of test and reference videos, enabling human-understandable explanations (Fig. 5).
  • Optimal decision thresholds of CUPID are stable across datasets and quality levels, unlike prior POI-specific methods where thresholds vary widely.

Threat model

The adversary is assumed to create manipulated videos targeting a specific POI, with the capability to generate face-swapped or reenacted deepfakes that attempt to evade detection. The defender has access to multiple pristine reference videos of the POI but cannot train a dedicated model per POI or rely on deepfake data. The adversary is assumed unable to meaningfully alter or obfuscate the UV texture mapping, or to force a misclassification through the latent representation, without inducing detectable embedding deviations.

Methodology — deep read

The authors propose CUPID, a two-stage POI deepfake detection pipeline leveraging UV texture maps and a Masked Autoencoder (MAE) trained with real videos of multiple known identities.

Threat model and assumptions: The adversary aims to forge videos of a specific Person-of-Interest (POI). The defender has access to a set of pristine reference videos of the POI at test time but cannot or does not train models on the POI data. The adversary cannot spoof the large-scale identity distribution learned by the MAE from real identities.

Data: The MAE is trained on real VoxCeleb2 videos from 5,494 identities (after removing 500 identities that overlap with FakeAVCeleb). For each identity, 21 videos are sampled uniformly, yielding a balanced training set. Four evaluation datasets with subject identity labels are used: DF-TIMIT, FakeAVCeleb, KoDF, and DeepSpeak. Each is evaluated under High Quality (HQ) and Low Quality (LQ, compressed at CRF 40) settings. LipSync samples are excluded.
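As a rough illustration of how an LQ copy of a video could be produced, the sketch below builds an ffmpeg command that re-encodes with libx264 at CRF 40. The function name, the choice to copy the audio stream, and the file names are assumptions made for this example; the exact encoder settings used by the authors are not stated in this summary.

```python
import shlex


def h264_crf_command(src: str, dst: str, crf: int = 40) -> list[str]:
    """Build an ffmpeg argument list that re-encodes `src` with libx264 at `crf`.

    Hypothetical helper: only the CRF value comes from the summary above.
    """
    if not 0 <= crf <= 51:
        raise ValueError("CRF for 8-bit libx264 is expected in the range 0..51")
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", src,           # input video
        "-c:v", "libx264",   # H.264 software encoder
        "-crf", str(crf),    # constant rate factor: higher means stronger compression
        "-c:a", "copy",      # assumption: leave the audio stream untouched
        dst,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(h264_crf_command("real_hq.mp4", "real_lq.mp4")))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.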

Architecture: Raw face videos are processed frame by frame. Each face is cropped and passed to a 3D Morphable Model (3DMM)-based UV texture map extractor, which generates spatially normalized UV maps representing facial appearance disentangled from pose and expression. These UV maps are split into non-overlapping patches, tokenized, and fed into a Vision Transformer (ViT)-based Masked Autoencoder (MAE).
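The patch tokenization step can be sketched in a few lines of NumPy. This is a minimal illustration of splitting an image into non-overlapping patches, not the authors' code; the 224×224 map size and the 16-pixel patch size are assumptions, since the summary does not give them.

```python
import numpy as np


def patchify(uv_map: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) UV map into non-overlapping patch tokens.

    Returns an array of shape (N, patch * patch * C), where
    N = (H // patch) * (W // patch), ordered row by row.
    """
    h, w, c = uv_map.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    x = uv_map.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


if __name__ == "__main__":
    uv = np.zeros((224, 224, 3), dtype=np.float32)  # assumed UV map size
    tokens = patchify(uv, patch=16)                  # assumed patch size
    print(tokens.shape)  # (196, 768)
```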

Training: During training, 50% of the input UV patches are masked at random. The encoder processes the visible patches plus a learnable CLS token; the decoder reconstructs the full UV map, including the masked patches. The total loss is a weighted sum of three components: (1) a pixel-wise reconstruction loss (L_REC) computed on the masked patches; (2) a multi-layer NT-Xent contrastive loss (L_CONT) applied to the CLS embeddings at multiple encoder depths, promoting identity-aware clustering; and (3) a perceptual loss (L_PERC) computed on VGG-16 features to suppress grid artifacts and encourage visually coherent reconstructions. Hyperparameters such as τ and λ_p control the objective and the balance between its terms. A multi-similarity miner selects hard positives and negatives for the contrastive loss.
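Two ingredients of this objective, random masking and an identity-labelled NT-Xent loss, can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the reconstruction loss is taken to be a mean squared error, the NT-Xent form shown is one common label-based formulation, the temperature value is arbitrary, and the multi-similarity miner and perceptual loss are omitted. None of the function names come from the paper.

```python
import numpy as np


def random_mask(num_tokens: int, ratio: float, rng: np.random.Generator):
    """Return sorted (visible, masked) token indices for a masking ratio."""
    n_keep = int(round(num_tokens * (1.0 - ratio)))
    perm = rng.permutation(num_tokens)
    return np.sort(perm[:n_keep]), np.sort(perm[n_keep:])


def masked_mse(pred: np.ndarray, target: np.ndarray, masked_idx: np.ndarray) -> float:
    """Reconstruction error on masked tokens only (MSE is an assumption)."""
    diff = pred[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))


def nt_xent(embeddings: np.ndarray, labels: np.ndarray, tau: float = 0.1) -> float:
    """Label-based NT-Xent: each same-identity pair is contrasted against
    all different-identity samples of its anchor."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / tau
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(labels)):
        negatives = sim[i][~same[i]]
        for j in np.flatnonzero(same[i]):
            if j == i:
                continue
            logits = np.concatenate(([sim[i, j]], negatives))
            m = logits.max()  # subtract the max for numerical stability
            log_denom = m + np.log(np.exp(logits - m).sum())
            losses.append(log_denom - sim[i, j])
    return float(np.mean(losses)) if losses else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visible, masked = random_mask(196, 0.5, rng)
    print(len(visible), len(masked))  # 98 98
    emb = rng.normal(size=(8, 32))
    ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
    print(nt_xent(emb, ids))
```

In the method described above, the contrastive term is applied to CLS embeddings taken at several encoder depths; in this sketch that would amount to calling `nt_xent` once per depth and combining the results.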

Deployment: At test time, UV texture maps are extracted without masking. For the POI under scrutiny, a reference set of N_vid pristine videos provides N_r embeddings, obtained by extracting the CLS token from every sampled frame. The test video yields N_t embeddings. The cosine similarity between every pair of reference and test CLS embeddings is computed, and the maximum value, s*, is the detection score. The score reflects whether the test video contains any frame consistent with the authentic POI distribution.
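The scoring rule described here reduces to a maximum over a cosine-similarity matrix. A minimal NumPy sketch, assuming the embeddings have already been extracted and stacked into arrays (the function name and the toy vectors are illustrative only):

```python
import numpy as np


def detection_score(ref: np.ndarray, test: np.ndarray) -> float:
    """Maximum pairwise cosine similarity between reference and test embeddings.

    ref:  (N_r, D) CLS embeddings from pristine POI videos
    test: (N_t, D) CLS embeddings from the video under analysis
    """
    r = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    t = test / np.linalg.norm(test, axis=1, keepdims=True)
    sims = r @ t.T  # (N_r, N_t) cosine similarities
    return float(sims.max())


if __name__ == "__main__":
    ref = np.array([[1.0, 0.0], [0.0, 1.0]])
    close = np.array([[2.0, 0.0], [-1.0, 0.0]])  # one frame aligned with a reference
    far = np.array([[-1.0, -1.0]])               # no frame aligned with any reference
    print(detection_score(ref, close), detection_score(ref, far))
```

A higher score indicates a test video that is more consistent with the reference set; the decision threshold applied to the score is a separate choice.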

Interpretability: To provide explanations, intra-reference displacements δ^(R)_ij and test-reference displacements δ^(T)_ij are computed in latent space, anchored on the reference centroid embedding R. These are decoded by the MAE decoder (without masking) to produce decoded displacement maps U^(R) and U^(T). The interpretability map M^(T) = ||U^(T) - U^(R)|| highlights UV facial regions whose embeddings deviate from normal intra-reference variability, localizing manipulated areas.
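One possible reading of this procedure is sketched below. Several details are assumptions rather than facts from the paper: displacements are taken as offsets from the reference centroid, each decoded displacement is reduced to its absolute value and averaged over frames, the final norm is taken over the channel axis, and `decode` is a hypothetical stand-in for the MAE decoder.

```python
import numpy as np
from typing import Callable


def interpretability_map(
    ref: np.ndarray,
    test: np.ndarray,
    decode: Callable[[np.ndarray], np.ndarray],
) -> np.ndarray:
    """Residual map between decoded test and reference displacements.

    ref:    (N_r, D) reference embeddings
    test:   (N_t, D) test embeddings
    decode: maps a (D,) latent vector to an (H, W, C) UV-space array
            (hypothetical stand-in for the trained decoder)
    """
    centroid = ref.mean(axis=0)
    # Average magnitude of decoded intra-reference displacements.
    u_ref = np.mean([np.abs(decode(r - centroid)) for r in ref], axis=0)
    # Average magnitude of decoded test-to-reference displacements.
    u_test = np.mean([np.abs(decode(t - centroid)) for t in test], axis=0)
    # Per-pixel norm over channels of the difference.
    return np.linalg.norm(u_test - u_ref, axis=-1)


if __name__ == "__main__":
    # Toy decoder: reshape a 4-dimensional latent into a 2x2 single-channel map.
    toy_decode = lambda z: z.reshape(2, 2, 1)
    ref = np.array([[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]])
    test = np.array([[1.0, 1.0, 1.0, 3.0]])
    print(interpretability_map(ref, test, toy_decode))
```

In the toy example only the last latent dimension differs from the reference, so the map is non-zero only at the bottom-right pixel.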

Evaluation: Detection performance is reported as Area Under the Curve (AUC) and Equal Error Rate (EER), and robustness is assessed under compression and downscaling. Baselines include ID-Reveal and POI-forensics. Training and evaluation identities are kept disjoint, so no identity leaks between the two. Experiments show CUPID achieves the best detection accuracy and robustness across all four datasets and quality settings. Ablations validate the value of UV maps and multi-loss training.
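For readers less familiar with the two metrics, the sketch below computes AUC and EER directly from detection scores. It assumes higher scores mean "more likely real", as with a similarity score; the function names and the toy scores are illustrative and are not results from the paper.

```python
import numpy as np


def auc(real_scores, fake_scores) -> float:
    """Probability that a random real video scores above a random fake one
    (ties count as one half)."""
    real = np.asarray(real_scores, dtype=float)[:, None]
    fake = np.asarray(fake_scores, dtype=float)[None, :]
    return float((real > fake).mean() + 0.5 * (real == fake).mean())


def eer(real_scores, fake_scores):
    """Return (EER, threshold) at the point where the miss rate on real
    videos and the false-accept rate on fakes are closest."""
    real = np.asarray(real_scores, dtype=float)
    fake = np.asarray(fake_scores, dtype=float)
    best_gap, best_eer, best_th = np.inf, 1.0, None
    for th in np.unique(np.concatenate([real, fake])):
        fnr = np.mean(real < th)    # real videos rejected as fake
        fpr = np.mean(fake >= th)   # fake videos accepted as real
        gap = abs(fnr - fpr)
        if gap < best_gap:
            best_gap, best_eer, best_th = gap, (fnr + fpr) / 2.0, float(th)
    return float(best_eer), best_th


if __name__ == "__main__":
    real = [0.9, 0.6]
    fake = [0.7, 0.4]
    print(auc(real, fake))  # 0.75
    print(eer(real, fake))  # (0.5, 0.7)
```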

Reproducibility: The authors commit to releasing code at github.com/polimi-ispl/CUPID. The VoxCeleb2 training set is public. Evaluation datasets are publicly available. Specific training hyperparameters and architectural details are described.

Example workflow end-to-end: During training, UV maps extracted from videos of thousands of real speakers are patch-tokenized and input to the MAE. The masked reconstruction, contrastive and perceptual losses encourage learning a latent space that clusters identity features while reconstructing facial textures. At test time, given a few pristine POI videos, the system extracts UV embeddings to represent the POI’s canonical facial identity. A suspect video’s UV embeddings are then compared against the reference embeddings by cosine similarity, scoring authenticity. Meanwhile, latent-space differences are decoded back to UV texture space, producing heatmaps that highlight facial areas likely manipulated in deepfakes. This pipeline enables robust, interpretable detection not requiring POI retraining or deepfake examples.

Technical innovations

  • Combining 3D Morphable Model-based UV texture maps with a Masked Autoencoder trained only on real videos to learn general identity-aware latent representations without requiring POI or deepfake data during training.
  • A multi-loss training objective integrating masked reconstruction, multi-depth contrastive, and perceptual losses to enhance both latent identity discrimination and reconstruction quality.
  • A novel POI deepfake detection scheme leveraging cosine similarity between CLS embeddings of UV maps extracted from test and reference pristine videos, enabling POI-agnostic inference.
  • An interpretability framework that decodes latent space displacement vectors to UV space residual maps localizing facial regions responsible for classification decisions, exploiting the semantic correspondence of UV texture maps.

Datasets

  • VoxCeleb2 — 150,480 videos of 5,494 identities after filtering — public
  • DF-TIMIT — 320 fake + 416 real videos, 32 identities, 512×384 pixels — public
  • FakeAVCeleb — 19,500 fake + 500 real videos, 500 identities, 224×224 pixels — public
  • KoDF — 175,776 fake + 62,166 real videos, 403 identities, 1920×1080 pixels — public
  • DeepSpeak — 14,005 fake + 16,043 real videos, 500 identities, 1280×720 pixels — public

Baselines vs proposed

  • ID-Reveal: AUC = 0.87 (avg across datasets) vs CUPID: AUC = 0.91 (Fig. 7)
  • POI-forensics: AUC = 0.88 vs CUPID: AUC = 0.91 under compression (Fig. 8)
  • Raw-frame MAE (no UV maps): AUC = 0.85 vs UV-map MAE (CUPID): AUC = 0.91 (ablation)
  • Comparisons show CUPID retains >90% accuracy under crf40 compression, whereas baselines drop by up to 10% AUC

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.20302.

Fig 1: Overview of the proposed CUPID detector for POI deepfake

Fig 2: Example of 3D facial reconstruction and UV texture map extraction

Fig 3: Face representations: facial images vs UV texture maps (center

Fig 4: Overview of the CUPID training pipeline. For each frame of a real video of a known identity, a face crop is fed to the 3DMM-based UV extractor,

Fig 5: Sketch of the interpretability analysis enabled by CUPID. Top: a

Fig 6

Fig 7

Limitations

  • CUPID relies on accurate and stable 3DMM-based UV texture map extraction; errors or failures in 3D reconstruction may degrade detection performance.
  • Evaluation focuses on controlled datasets with known POI ground truth; real-world scenarios with diverse lighting, occlusions, or unseen deepfake generators remain untested.
  • The approach currently addresses visual-only deepfake detection, ignoring audio or audio-visual cues which could strengthen robustness.
  • Potential vulnerability to adversarial attacks targeting the UV mapping or latent space embeddings was not assessed.
  • The interpretability maps rely on averaging and residual analysis, which may dilute or mislocalize subtle manipulations in complex scenes.
  • The method depends on availability of multiple pristine reference videos per POI, which may not always be feasible.

Open questions / follow-ons

  • How does CUPID perform under adversarial manipulation or domain shift caused by unseen deepfake generation methods or environmental conditions?
  • Can the interpretability mechanism be quantitatively validated and extended to support semi-automated forensic analysis?
  • How does incorporating audio-visual signals alongside UV facial texture maps affect detection accuracy and interpretability?
  • Could the framework be adapted for online or streaming video deepfake detection with real-time constraints?

Why it matters for bot defense

For bot-defense engineers focused on person-specific deepfake detection, CUPID demonstrates a compelling approach combining geometric 3D face normalization (UV texture maps) with powerful self-supervised learning (MAE) to create identity-aware embeddings without requiring costly per-POI retraining. This method enables scalable, robust detection of attacks against high-value individuals, even when the system has no prior exposure to their identity at training time. The UV map representation also adds valuable interpretability by pinpointing manipulated facial regions, aiding human analysts in incident triage. Such interpretable and efficient POI detection techniques could complement CAPTCHA and bot-detection frameworks that rely on continuity of visual identity or liveness signals, especially in scenarios where attackers impersonate victims in video feeds. However, integrating this with general CAPTCHA schemes would require addressing practical concerns like 3D reconstruction robustness and availability of multiple genuine POI samples as references.

Cite

bibtex
@article{arxiv2606_20302,
  title={CUPID: Reconstructing UV Texture Maps for Interpretable Person-of-Interest Deepfake Detection},
  author={Giovanni Affatato and Sara Mandelli and Edoardo Daniele Cannas and Paolo Bestagini and Stefano Tubaro},
  journal={arXiv preprint arXiv:2606.20302},
  year={2026},
  url={https://arxiv.org/abs/2606.20302}
}

Articles are CC BY 4.0 — feel free to quote with attribution