ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection

Source: arXiv:2606.04706 · Published 2026-06-03 · By Xiaojing Chen, Xinyu Lu, Changtao Miao, Yunfeng Diao

TL;DR

This paper addresses the challenging problem of detecting AI-generated videos, which have become increasingly realistic and threaten misinformation and content trust. The authors investigate reconstruction error, derived from a pretrained video VAE (WF-VAE), as a discriminative forensic signal, based on the finding that real and AI-generated videos exhibit distinctive temporal patterns of frame-wise reconstruction errors. They propose ReConFuse, a novel framework that fuses these low-level reconstruction error cues with high-level semantic tokens extracted by a pretrained visual semantic encoder (XCLIP), aligning them patch-wise and modeling their temporal evolution via a Mamba-based sequence model for video-level classification. Extensive experiments on two large-scale AI-generated video detection benchmarks, GenVideo and GenBuster, demonstrate strong performance and generalization compared to state-of-the-art baselines. Ablations confirm the critical contributions of reconstruction error cues, semantic alignment, patch-level fusion, and temporal modeling.

Key findings

Using WF-VAE reconstruction error as a forensic cue improves video-level detection accuracy from 79.56% to 88.33% and F1-score from 77.14% to 88.18% compared to semantic-only baseline on GenVideo.
ReConFuse outperforms DeMamba on GenVideo one-to-many setting with average accuracy 85.56% vs 79.56% and F1-score 85.03% vs 71.75%, while achieving comparable AUROC.
On GenVideo many-to-many setting, ReConFuse achieves accuracy 91.93% vs 90.17%, F1-score 91.49% vs 89.52%, and AUROC 95.82% vs 95.40%, outperforming strong baselines.
On GenBuster benchmark, ReConFuse improves average F1-score from 90.40% (DeMamba) to 91.23% and AUROC from 99.25% to 99.37%.
Temporal modeling with Mamba module outperforms mean pooling, Transformer, LSTM, ResNet50, and 1D CNN in accuracy and F1-score, validating design choice.
Patch-level alignment and gated fusion of reconstruction errors with semantic tokens is necessary to suppress ambiguous reconstruction artifacts and boost discriminative power.
Signed reconstruction error (preserving direction) provides stronger detection signals than absolute or squared error formulations.
Different VAEs yield different reconstruction error patterns; WF-VAE produces temporally consistent, smaller reconstruction errors better suited to capture generation traces.

Threat model

The adversary is an entity generating forged or AI-synthesized videos with unknown internal generative process aiming to evade detection. They may modify video content or quality but cannot directly manipulate the internal reconstruction prior used by the detector (WF-VAE). The defender observes only the final video frames without side information or embedded watermarks. The adversary's goal is to produce videos whose reconstruction error and semantic feature temporal patterns appear realistic to evade ReConFuse detection.

Methodology — deep read

The threat model assumes an adversary producing AI-generated videos attempting to evade detection. The defender has black-box access to the video but no internal knowledge of the generation process. The key assumption is that real and generated videos differ in how well they fit a pretrained video reconstruction prior.

The approach begins with reconstructing input videos using a pretrained Wavelet-Driven Energy Flow Video VAE (WF-VAE), which serves as a learned reconstruction prior trained to compress and reconstruct video frames. Frame-wise signed reconstruction errors (original frame minus reconstructed frame) are computed to reveal subtle low-level distribution discrepancies left by generation.

To provide semantic context to these error cues, the method extracts patch-level semantic token features from frames using the pretrained XCLIP vision encoder, which produces rich spatial semantic embeddings. Reconstruction error maps are resized and embedded into patch-level tokens aligned spatially with the semantic tokens, preserving locality.

The core fusion module applies a gated mechanism that learns a per-patch gating factor (sigmoid) to selectively inject reconstruction error tokens into the semantic tokens, mitigating ambiguity caused by content-dependent reconstruction variations (e.g., motion blur, textures).

This fused sequence of semantic-error tokens across T video frames is input into a Mamba-based temporal modeling module, which efficiently captures temporal evolution and cross-frame dependencies of reconstruction anomalies relevant to forgery.

The resulting temporal representation is concatenated with a global semantic context vector and passed through a linear layer followed by sigmoid to predict video-level real/fake probability. The model is trained end-to-end with binary cross-entropy loss on labeled real and generated videos.

Experiments evaluate on two benchmarks: GenVideo (one-to-many and many-to-many splits with 10k+ real and generated videos per train/test sets) and GenBuster (commercial generator videos). Metrics used include accuracy, F1-score, recall, and AUROC with fixed threshold evaluation. Baselines include DeMamba, DIRE, AEROBLADE, and D3.

Ablation studies systematically remove or replace components (reconstruction error, semantic fusion, patch alignment, temporal module) to demonstrate their individual contributions. Variants of reconstruction priors (other VAEs) and error formulations (signed, absolute, squared) are also compared.

No mention is made of publicly released code or pretrained weights; datasets are existing AI-generated video detection benchmarks: GenVideo and GenBuster.

A concrete example: given an input video, ReConFuse reconstructs it with WF-VAE to get frame-wise error maps, resizes errors to semantic token grids, computes gated fusion, feeds token sequences to Mamba temporal module, then outputs a real/fake score. This pipeline is trained and evaluated on large-scale labeled video benchmarks with multiple generative models.

Technical innovations

Leveraging pretrained WF-VAE video reconstruction errors as temporally evolving forensic cues for AI-generated video detection.
Gated patch-level fusion of reconstruction error tokens with semantically rich XCLIP visual tokens to disambiguate error origin.
Using the Mamba temporal sequence module to efficiently model temporal evolution of fused semantic-error features for video-level classification.
Demonstration that signed reconstruction errors contain richer forensic information than magnitude-only errors in this context.

Datasets

GenVideo — 10,000+ real and AI-generated videos for training and testing — public benchmark
GenBuster — Subsets from 8 commercial AI video generators — public benchmark

Baselines vs proposed

DeMamba: Accuracy = 79.56% vs ReConFuse: 85.56% (one-to-many GenVideo)
DeMamba: F1-score = 71.75% vs ReConFuse: 85.03% (one-to-many GenVideo)
DeMamba: Accuracy = 90.17% vs ReConFuse: 91.93% (many-to-many GenVideo)
DeMamba: F1-score = 89.52% vs ReConFuse: 91.49% (many-to-many GenVideo)
DeMamba: F1-score = 90.40% vs ReConFuse: 91.23% (GenBuster)
DeMamba: AUROC = 99.25% vs ReConFuse: 99.37% (GenBuster)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.04706.

Fig 1

Fig 1: Reconstruction behavior comparison between real and AI-generated videos

Fig 2

Fig 2: Overview of the proposed ReConFuse framework. The input video is recon-

Fig 3

Fig 3 (page 3).

Fig 4

Fig 4 (page 3).

Fig 5

Fig 5 (page 3).

Fig 6

Fig 6 (page 3).

Fig 7

Fig 7 (page 3).

Fig 8

Fig 8 (page 3).

Limitations

No adversarial attack evaluations to test robustness against adaptive forgers.
Details on computational costs and inference latency not provided, relevant for deployment scenarios.
Evaluation limited to existing benchmarks with moderate diversity; generalization to novel unseen generators remains uncertain.
Code and pretrained model release status unspecified, potentially hindering reproducibility by practitioners.
The approach relies on pretrained WF-VAE and XCLIP backbones, which may not generalize outside their training distributions.
Method mainly focuses on spatial-temporal coherence via reconstruction error but does not incorporate multimodal cues like audio or metadata.

Open questions / follow-ons

How robust is reconstruction-error guided detection against adaptive forgeries that explicitly minimize reconstruction discrepancies?
Can multimodal signals (e.g., audio, metadata) be fused with reconstruction-semantic cues to enhance detection robustness?
How does the method scale and perform on longer videos or higher-resolution inputs beyond benchmark clips?
What are the trade-offs between different video VAE reconstruction priors for forensic cue extraction and generalization?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work highlights the utility of video reconstruction errors as a complementary forensic signal beyond conventional spatial-temporal artifact analysis for detecting AI-generated video content. Integrating learned reconstruction priors with semantic context and temporal modeling can yield stronger, temporally consistent detection cues to identify synthetic video deepfakes or manipulation attempts. However, deploying such methods in real-world bot-defense or anti-spam pipelines requires careful consideration of computational overhead and robustness against adversaries aware of reconstruction-based detection. The patch-level fusion and efficient temporal modeling approach could inspire similar hybrid embeddings for multimodal bot detection tasks. Overall, this research suggests a promising direction to enhance bot-defense systems with learned reconstruction priors to detect increasingly sophisticated video generation or alteration attacks.

Cite

bibtex

@article{arxiv2606_04706,
  title={ ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection },
  author={ Xiaojing Chen and Xinyu Lu and Changtao Miao and Yunfeng Diao },
  journal={arXiv preprint arXiv:2606.04706},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.04706}
}

ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​