When Seeing Is Not Believing -- A Benchmark for Search-Grounded Video Misinformation Detection

Source: arXiv:2606.04098 · Published 2026-06-02 · By Tao Yu, Yujia Yang, Shenghua Chai, Zhang Jinshuai, Haopeng Jin, Hao Wang et al.

TL;DR

This paper addresses the growing challenge of detecting video misinformation that operates at the semantic and evidential level, where visual content may be authentic but is misleading through editorial manipulations or AI-generated alterations. Prior forensic and deepfake detection methods mostly rely on low-level artifacts and cannot verify misinformation when the falsehood depends on external event evidence outside the video itself. To fill this gap, the authors propose EVID-Bench, a novel benchmark consisting of 222 rigorously verified manipulated videos spanning nine manipulation types across AI generation, single-source editing, and multi-source editing categories. These manipulations distort event semantics via identity swaps, temporal reordering, contextual fabrications, and synthetic inserts, making them undetectable by leading multimodal models through visual inspection alone.

The benchmark task formalizes search-grounded video misinformation detection: systems must iteratively retrieve external videos from the open web and perform contrastive reasoning to identify specific misinformation points and provide detailed explanations. Nine state-of-the-art multimodal models evaluated using a retrieval-augmented verification baseline achieve at best 61.43% point-level and only 43.24% video-level accuracy, with AI-generated manipulations proving especially challenging (<20% accuracy). Error analyses reveal that models often fixate on irrelevant salient anchors, misclassify manipulation types, fail to identify synthetic content, and stop reasoning prematurely. These results highlight fundamental limitations of current visual-only or shallow retrieval methods, underscoring the complexity of verifying semantic-level video misinformation through external evidence retrieval and multi-video reasoning.

Key findings

EVID-Bench contains 222 videos covering 9 manipulation types across 3 categories (AI generation, single-source editing, multi-source editing) and 6 real-world topics.
All samples are verified to be undetectable by frontier multimodal models (e.g., Gemini-3.1-Pro, GPT-5.5) via visual inspection alone.
Best-performing model (GPT-5.5) achieves 61.43% point-level accuracy but only 43.24% video-level accuracy, indicating difficulty in fully explaining manipulations.
AI-generated manipulations yield the lowest detection: GPT-5.5 achieves 11.11%–16.67% point-level and near-zero video-level accuracy on identity swap, synthetic insertion, and object manipulation tasks.
Multi-source editing tasks have higher point-level accuracy (up to 99%) but still low video-level accuracy (~40–55%).
Models frequently fixate on salient but irrelevant anchors like video titles or logos, leading to reasoning errors.
Models fail to recognize synthetic insertions as AI-generated content, often mistaking them for real footage from other sources.
Iterative retrieval and verification pipelines often stop prematurely once a related source is found, without fully verifying all manipulated points.

Threat model

The adversary produces semantically manipulative videos by selectively editing, reordering, or splicing authentic footage, or by inserting AI-generated content. The goal is to mislead viewers by altering the event narrative without leaving detectable visual artifacts in individual frames. Manipulations may span multiple sources or rely on temporal distortions. The adversary is not assumed to be able to prevent web search for related videos; the evidence needed for verification lies outside the single input video and must be retrieved externally.

Methodology — deep read

The paper formalizes search-grounded video misinformation detection: the input is a potentially manipulated video, and the system must retrieve corroborating videos from the open web and identify false information through cross-video comparison. The adversary's manipulations are semantic (for example, editorial splicing or AI-generated inserts) and leave no detectable low-level pixel artifacts, so the evidence of manipulation lies outside the input video and must be retrieved externally.

The dataset comprises 222 videos covering 9 manipulation types in three categories: AI generation (identity swaps using FaceFusion, synthetic insertions with Seedance 2.0, object manipulations), single-source editing (selective omission, causal inversion, manipulative montage), and multi-source editing (narrative fabrication, magnitude manipulation, contextual fabrication). Nine professional editors crafted the editorial manipulations. All samples were quality-checked by computer science researchers and then filtered with Gemini-3.1-Pro and GPT-5.5 to confirm that frontier models cannot detect the manipulations by visual inspection alone.

The benchmark covers six real-world topic domains to reflect typical video misinformation contexts. Each sample is annotated with multiple misinformation points (3–5 per video), with ground truth explanations specifying the manipulation.
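
The annotation scheme described above can be pictured as a small record type. The following is only an illustrative sketch: the class names, field names, and snake_case identifiers are assumptions made for this example and are not the released dataset schema. The category and type names come from the taxonomy listed above, and the 3 to 5 points-per-video range comes from the text.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Three categories and nine manipulation types as listed in the summary.
# The snake_case spellings are an assumption made for this sketch.
TAXONOMY: Dict[str, Tuple[str, ...]] = {
    "ai_generation": (
        "identity_swap", "synthetic_insertion", "object_manipulation"),
    "single_source_editing": (
        "selective_omission", "causal_inversion", "manipulative_montage"),
    "multi_source_editing": (
        "narrative_fabrication", "magnitude_manipulation",
        "contextual_fabrication"),
}


@dataclass(frozen=True)
class MisinfoPoint:
    """One annotated misinformation point (hypothetical field names)."""
    description: str   # what the video falsely conveys
    explanation: str   # ground-truth explanation of the manipulation


@dataclass(frozen=True)
class Sample:
    """One benchmark video with its annotations (hypothetical schema)."""
    video_id: str
    topic: str                      # one of the six topic domains
    category: str                   # key of TAXONOMY
    manipulation_type: str          # value under TAXONOMY[category]
    points: Tuple[MisinfoPoint, ...]

    def __post_init__(self) -> None:
        # Reject type/category pairs that are not in the taxonomy.
        if self.manipulation_type not in TAXONOMY.get(self.category, ()):
            raise ValueError("manipulation type does not match category")
        # The summary states 3 to 5 misinformation points per video.
        if not 3 <= len(self.points) <= 5:
            raise ValueError("expected 3 to 5 misinformation points")
```

Validating at construction time keeps malformed annotations out of any later scoring code, which matters because video-level accuracy depends on the exact number of points per video.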

The baseline model pipeline has two stages: (1) Chain-of-Thought (CoT) analysis by a video-language model extracting frame-level observations, narrative reasoning, entity recognition, and anomaly detection to formulate a retrieval plan and generate initial search queries; (2) An iterative DeepSearch loop that repeatedly queries the web for candidate videos, applies a coarse frame-based relevance filter followed by fine-grained cross-video contrastive analysis to extract precise forgery points, then judges evidence sufficiency to continue or stop. Retrieval focuses on authentic anchors, avoiding synthetic elements.
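
The control flow of the second stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name and parameter here (`deep_search`, `search`, `is_relevant`, `contrast`, `is_sufficient`, `refine`, `max_rounds`) is hypothetical, and the model and web-search calls are passed in as plain callables so the loop itself stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class SearchState:
    """Bookkeeping for the loop (hypothetical structure)."""
    queries: List[str]
    forgery_points: List[str] = field(default_factory=list)
    seen: Set[str] = field(default_factory=set)
    rounds: int = 0


def deep_search(
    initial_queries: Iterable[str],
    search: Callable[[str], List[str]],          # query -> candidate video ids
    is_relevant: Callable[[str], bool],          # coarse frame-based filter
    contrast: Callable[[str], List[str]],        # fine cross-video comparison
    is_sufficient: Callable[[List[str]], bool],  # evidence sufficiency judge
    refine: Callable[[List[str], List[str]], List[str]],  # next queries
    max_rounds: int = 5,                         # assumed cap, not from paper
) -> SearchState:
    state = SearchState(queries=list(initial_queries))
    while state.rounds < max_rounds and state.queries:
        state.rounds += 1
        # 1. Query the web and keep only candidates not seen before.
        candidates = []
        for query in state.queries:
            for video_id in search(query):
                if video_id not in state.seen:
                    state.seen.add(video_id)
                    candidates.append(video_id)
        # 2. Coarse filter, then fine-grained contrastive analysis.
        for video_id in candidates:
            if not is_relevant(video_id):
                continue
            for point in contrast(video_id):
                if point not in state.forgery_points:
                    state.forgery_points.append(point)
        # 3. Stop once the evidence is judged sufficient, else refine.
        if is_sufficient(state.forgery_points):
            break
        state.queries = refine(state.queries, state.forgery_points)
    return state
```

In this framing, the premature-stopping failure reported in the key findings corresponds to `is_sufficient` returning true after the first related source is found, before every manipulated point has been checked.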

No training details are given because the evaluation uses prompted video-language models without fine-tuning. Nine frontier multimodal models (GPT-5.5, GPT-5.4, Claude Opus 4.6, Gemini-3.1-Pro, Qwen-3.5-plus, etc.) are evaluated using this retrieval-augmented baseline.

The evaluation metrics are point-level accuracy (the fraction of ground-truth misinformation points correctly identified) and video-level accuracy (a video counts as correct only if all of its points are correctly identified, which makes this the stricter metric). Model outputs are natural-language descriptions, judged for semantic equivalence to the ground truth by majority vote among three LLM judges (GPT-5.4, Gemini-3.1-Pro, Claude Sonnet 4.6).
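
The two metrics and the judge vote can be written down in a few lines. This is a sketch under assumptions: the function names are invented for illustration, and point-level accuracy is pooled over all points in the benchmark (micro-averaged), which the summary does not specify. A per-video average would be an equally plausible reading.

```python
from typing import List, Sequence, Tuple


def majority_vote(votes: Sequence[bool]) -> bool:
    """True if strictly more than half of the judges accept the output."""
    return sum(votes) * 2 > len(votes)


def score(videos: List[List[Sequence[bool]]]) -> Tuple[float, float]:
    """Return (point-level accuracy, video-level accuracy).

    videos[i][j] holds the judges' votes for ground-truth point j of
    video i. Assumes at least one video and at least one point overall.
    """
    total_points = 0
    hit_points = 0
    full_videos = 0
    for points in videos:
        verdicts = [majority_vote(v) for v in points]
        total_points += len(verdicts)
        hit_points += sum(verdicts)
        # A video is correct only if every one of its points is correct.
        full_videos += all(verdicts)
    return hit_points / total_points, full_videos / len(videos)
```

Because a single missed point fails the whole video, video-level accuracy can never exceed the share of videos with all points found, which is why it sits well below point-level accuracy in the reported results.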

No cross-validation is used; every model is evaluated on the full benchmark. Code and data are publicly released on GitHub and Hugging Face, but retrieval depends on live web search APIs, which limits strict reproducibility over time. Figure 4 gives a concrete example: iterative retrieval guided by CoT frame reasoning identifies a multi-source splicing forgery in which an educational documentary clip was inserted into unrelated footage, with queries and evidence refined step by step until the explanation of the manipulation is judged sufficient.

Overall, the methodology closely mimics human fact-checking practices by integrating frame-level video understanding with reasoning-driven iterative search and cross-video verification, a novel setting not addressed by existing video forensic or multimodal misinformation benchmarks.

Technical innovations

Formalization of search-grounded video misinformation detection requiring iterative external video retrieval and cross-video semantic contrastive reasoning, beyond visual artifact detection.
EVID-Bench dataset construction with rigorously verified, hard-to-detect video manipulation samples spanning 9 fine-grained forgery types and 6 topical domains.
A retrieval-augmented verification pipeline employing chain-of-thought video-language model analysis to generate retrieval plans and an iterative DeepSearch loop for coarse-to-fine evidence filtering and sufficiency judgment.
Use of multiple state-of-the-art LLM-based evaluators to robustly assess natural language manipulation explanations against ground truth.

Datasets

EVID-Bench: 222 videos, curated from professional edits and AI generation, verified by humans and frontier models, publicly available on Hugging Face

Baselines vs proposed

GPT-5.5: point-level accuracy = 61.43% vs best previous models scoring below 60%
GPT-5.5: video-level accuracy = 43.24% vs Gemini-3.1-Pro at 36.90%
AI Generation (Identity Swap): GPT-5.5 point-level accuracy = 11.11%, video-level accuracy = 4.17%
Multi-Source Editing (Narrative Fabrication): GPT-5.5 point-level accuracy = 82.24%, video-level accuracy = 51.85%

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.04098.

Fig 1: Overview of EVID-Bench construction and the search-grounded video misinformation detection pipeline.

Fig 2: Taxonomy of search-grounded video misinformation in EVID-Bench covering AI Generation, Single-Source

Limitations

Benchmark size of 222 videos limits statistical power for fine-grained analysis across all manipulation subtypes.
Evaluation depends on external web search APIs, so retrieval results, and therefore measured performance, may vary over time, which affects reproducibility.
Quality assurance relies on current frontier models; future stronger models may detect these manipulations visually without retrieval, reducing benchmark difficulty.
LLM-based evaluation of natural language outputs may introduce semantic matching biases compared to human judgments.
The pipeline is evaluated on closed test sets without adversarial or adaptive attacks to test robustness.
No training or fine-tuning of multimodal models specific to EVID-Bench tasks is described, so the reported results may not reflect the upper bound of achievable performance.
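
To make the first limitation concrete, a confidence interval shows how much uncertainty a sample of this size carries. The sketch below uses the standard Wilson score interval. The counts are inferred rather than reported: 43.24% of 222 videos corresponds to 96 videos, and 4.17% is consistent with 1 of 24 videos, so treat both counts as assumptions of this example.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Overall video-level accuracy: 96 of 222 is an inferred count.
overall = wilson_interval(96, 222)
# A single manipulation type: 1 of 24 is an inferred count.
subtype = wilson_interval(1, 24)
```

Under these assumptions the overall interval spans roughly 0.37 to 0.50, and the interval for a single subtype is much wider relative to its point estimate, which is why fine-grained comparisons across the nine types should be read cautiously.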

Open questions / follow-ons

How can retrieval strategies and query generation be improved to better escape semantic error clusters and retrieve comprehensive evidence?
What model architectures or training paradigms could better integrate multimodal temporal reasoning with cross-video contrastive verification?
Can detection of AI-generated inserts be enhanced by improved synthetic content recognition beyond current VLM capabilities?
How can sufficiency judgments be improved so that pipelines avoid premature stopping and fully explain complex, multi-point manipulations?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work highlights a class of video misinformation that traditional visual artifact detectors cannot catch on their own: detection requires external evidence retrieval and reasoning. CAPTCHAs or defense mechanisms that rely solely on visual cues or single-content verification will be insufficient against such semantic-level manipulations. Systems instead need retrieval-augmented verification pipelines that reason over multi-source evidence to flag misinformation. Because current multimodal models struggle even on a curated benchmark, defense systems should plan for iterative search, query refinement, and cross-video comparison. The error analysis also points to pitfalls to design against, such as fixation on irrelevant features and premature stopping. Overall, EVID-Bench provides a demanding test bed for video misinformation detection under adversarial editing tactics.

Cite

bibtex

@article{arxiv2606_04098,
  title={When Seeing Is Not Believing -- A Benchmark for Search-Grounded Video Misinformation Detection},
  author={Tao Yu and Yujia Yang and Shenghua Chai and Zhang Jinshuai and Haopeng Jin and Hao Wang and Minghui Zhang and Zhongtian Luo and Yuchen Long and Xinlong Chen and Jiabing Yang and Zhaolu Kang and Yuxuan Zhou and Zhengyu Man and Xinming Wang and Hongzhu Yi and Zheqi He and Xi Yang and Yan Huang and Liang Wang},
  journal={arXiv preprint arXiv:2606.04098},
  year={2026},
  url={https://arxiv.org/abs/2606.04098}
}
