Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes

Source: arXiv:2605.04797 · Published 2026-05-06 · By Michael Soprano, Andrea Cioci, Stefano Mizzaro

TL;DR

This paper asks a practical question for misinformation screening: if you give non-expert crowd workers short audiovisual clips, how reliably can they tell authentic media from deepfakes, and if they think a clip is manipulated, can they also identify whether the alteration is in audio, video, or both? The authors study this under a matched protocol on two datasets with different characteristics: AV-Deepfake1M and the Trusted Media Challenge (TMC). They collect 10 independent judgments per sampled video, using a two-step interface that first asks for authenticity and then, only when manipulation is suspected, asks for manipulation type and a timestamp.

The main result is nuanced. Crowds are fairly conservative: authentic clips are rarely mislabeled as manipulated, but many manipulated clips are missed entirely, so the bottleneck is false negatives rather than false positives. Aggregating multiple judgments improves the screening signal for authenticity, but not enough to recover manipulations that most workers consistently fail to notice. The second major finding is that modality attribution is much harder than binary authenticity detection: even when workers correctly flag a clip as manipulated, they often cannot tell whether the problem is audio, video, or both, with joint audio-video manipulations being especially difficult. The authors conclude that crowdsourcing can be useful as a scalable authenticity screen, but not yet as a reliable tool for fine-grained manipulation localization.

Key findings

They sampled 48 videos per dataset (12 authentic, 36 manipulated) from AV-Deepfake1M and TMC and collected 960 judgments total, with 10 independent judgments per video.
For authenticity detection, majority vote on AV-Deepfake1M reached accuracy 0.438 with FPR 0.083 and FNR 0.722; on TMC it reached accuracy 0.646 with FPR 0.083 and FNR 0.444.
Dempster-Shafer aggregation improved TMC authenticity accuracy from 0.646 to 0.708 and reduced FNR from 0.444 to 0.333, but on AV-Deepfake1M accuracy only rose from 0.438 to 0.458 while FPR increased from 0.083 to 0.250.
Across individual judgments, missed manipulations dominated errors: on manipulated videos, workers selected the Real option in 54.7% of AV-Deepfake1M judgments and 36.7% of TMC judgments.
False positives on authentic videos were less frequent than misses and identical across both datasets: in 22.5% of individual judgments on authentic clips, workers labeled the clip as manipulated.
The two datasets differed strongly in crowd agreement and difficulty: the paper reports a large gap in both accuracy and agreement, with AV-Deepfake1M performing worse than TMC under the same protocol.
Manipulation type identification was substantially noisier than authenticity detection, and joint audio-video cases were particularly hard to recognize; the paper states that even when manipulation was detected, attribution to audio, video, or both remained unreliable.
Timestamp reports could still converge: when workers did flag a manipulation, their timestamps often clustered around a plausible segment. The text available for this summary does not claim high timestamp accuracy against a ground-truth temporal boundary.
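The timestamp-consistency measures named in the methodology (median normalized timestamp, IQR, and a ±5% agreement window, computed only when a clip has at least three timestamps) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the choice to express the window as 5% of clip duration, and the example values are all assumptions.

```python
from statistics import median, quantiles

def timestamp_consistency(timestamps_s, duration_s, window=0.05, min_reports=3):
    """Summarize worker timestamps (in seconds) for one manipulated clip.

    Returns None when fewer than `min_reports` timestamps are available.
    Assumption: the agreement window is a fraction of the clip duration.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if len(timestamps_s) < min_reports:
        return None
    # Normalize to [0, 1] so clips of different lengths are comparable.
    norm = sorted(t / duration_s for t in timestamps_s)
    med = median(norm)
    q1, _, q3 = quantiles(norm, n=4, method="inclusive")
    # Share of reports that fall inside the window around the median.
    agreement = sum(abs(t - med) <= window for t in norm) / len(norm)
    return {"median": med, "iqr": q3 - q1, "agreement": agreement}

# Hypothetical reports for a 10-second clip: three close together, one outlier.
summary = timestamp_consistency([2.0, 2.1, 2.2, 5.0], duration_s=10.0)
```

Normalizing by duration matters here because the two datasets differ greatly in clip length, so a fixed window in seconds would not be comparable across them.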

Threat model

The adversary is a content producer who can manipulate audiovisual media by altering audio, video, or both, and whose goal is to produce a clip that crowd workers will judge as authentic. The workers are non-experts operating without external verification aids, and the task assumes they only inspect the clip in a browser-based interface. The study does not assume the adversary can change the experimental interface, tamper with the dataset labels, or adapt to each worker in real time; it evaluates human judgment under standard crowdsourcing conditions rather than an interactive attacker model.

Methodology — deep read

Threat model and assumptions: this is not a detector-training paper; it is a human-subjects study of whether crowds can serve as a screening signal for audiovisual deepfakes. The “adversary” is implicitly the producer of manipulated audiovisual content, who can alter audio, video, or both. The study assumes workers only see the clip in a browser interface, with no external verification tools, no reverse-search, and no ground-truth labels. Workers are asked to judge authenticity first, then manipulation type and a timestamp only if they believe the content is manipulated. The task is framed as misinformation-oriented authenticity assessment, not forensic attribution with expert tooling.

Data provenance and sampling: the authors use two multimodal benchmark datasets, AV-Deepfake1M and the Trusted Media Challenge (TMC). They rely on dataset-provided ground truth and do not manually synthesize or edit any media. From each dataset they draw a stratified sample of 12 clips per condition across four labels: authentic, audio-only manipulation, video-only manipulation, and joint audio-video manipulation, yielding 48 clips per dataset and 96 total. They restrict candidate pools to browser-playable clips containing both audio and video tracks, which matters because the interface needs both modalities. The sampled clips are intentionally balanced across conditions, so the evaluation is not prevalence-weighted; the authors explicitly note that this does not reflect the full datasets’ natural class distribution. The sampled clips differ in length: AV-Deepfake1M clips have median duration 7.0 s (IQR 6.0–9.0), while TMC clips are much longer, with median duration 55.9 s (IQR 49.9–59.2).
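The balanced draw described above can be illustrated with a short sketch. The condition names, the `stratified_sample` helper, the seed, and the toy candidate pool are assumptions made for illustration; the summary does not describe the authors' sampling code.

```python
import random
from collections import defaultdict

CONDITIONS = ("authentic", "audio_only", "video_only", "audio_video")

def stratified_sample(pool, per_condition=12, seed=0):
    """Draw `per_condition` clips from each condition.

    pool: iterable of (clip_id, condition) pairs, already filtered to
    browser-playable clips that contain both audio and video tracks.
    """
    by_condition = defaultdict(list)
    for clip_id, condition in pool:
        by_condition[condition].append(clip_id)
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    sample = {}
    for condition in CONDITIONS:
        candidates = sorted(by_condition[condition])
        if len(candidates) < per_condition:
            raise ValueError(f"not enough clips for condition {condition!r}")
        sample[condition] = rng.sample(candidates, per_condition)
    return sample

# Hypothetical pool with 50 candidate clips per condition.
pool = [(f"{c}_{i:03d}", c) for c in CONDITIONS for i in range(50)]
sample = stratified_sample(pool)
n_total = sum(len(v) for v in sample.values())
n_manipulated = n_total - len(sample["authentic"])
```

With 12 clips per condition the sample contains 12 authentic and 36 manipulated clips, which is why the resulting error rates are not calibrated to any real-world prevalence.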

Task design and labeling interface: the experiment is run on Prolific in two final studies, one per dataset, after an AV-Deepfake1M pilot. Workers complete a short demographics questionnaire, then watch a sequence of videos. For each clip they first choose authentic vs manipulated; if manipulated, they then choose one of three manipulation types: audio-only, video-only, or audio-video. They also mark an approximate timestamp with centisecond resolution (mm:ss.cc) for where the manipulation appears. This creates four judgment-level outcomes overall: authentic, audio-only, video-only, and audio-video. The pilot used 75 completed work units (8 videos each, 600 judgments) and led to an important design change: because longer assignments increased cognitive load and missed manipulations, the final study was shortened to 4 videos per work unit.
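As a rough sketch of how the two-step response maps onto the four outcomes and how a mm:ss.cc timestamp can be converted to seconds, consider the following. The function names and label strings are hypothetical, and the accepted timestamp pattern is an assumption about the input format rather than the interface's actual validation rule.

```python
import re

MANIPULATION_TYPES = ("audio_only", "video_only", "audio_video")

# Assumed pattern: one or two minute digits, two second digits, two centisecond digits.
_TIMESTAMP = re.compile(r"^(\d{1,2}):([0-5]\d)\.(\d{2})$")

def judgment_outcome(is_manipulated, manipulation_type=None):
    """Collapse the two-step answer into one of four judgment-level outcomes."""
    if not is_manipulated:
        return "authentic"  # step two is never shown in this case
    if manipulation_type not in MANIPULATION_TYPES:
        raise ValueError(f"unknown manipulation type: {manipulation_type!r}")
    return manipulation_type

def parse_timestamp(text):
    """Convert 'mm:ss.cc' to seconds as a float."""
    match = _TIMESTAMP.match(text.strip())
    if match is None:
        raise ValueError(f"expected mm:ss.cc, got {text!r}")
    minutes, seconds, centis = (int(g) for g in match.groups())
    return minutes * 60 + seconds + centis / 100

outcome = judgment_outcome(True, "audio_video")
seconds = parse_timestamp("00:07.25")
```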

Aggregation, training/operational regime, and evaluation example: there is no machine training here, but there is a worker reliability weighting scheme. The final deployment used 120 completed work units per task, 480 judgments per dataset, and 10 judgments per clip from distinct workers. For authenticity aggregation they compare majority vote to Dempster-Shafer (DS). Majority vote simply counts labels, with ties broken in favor of authentic to keep aggregation conservative against false positives. DS is more elaborate: each worker’s authenticity response is weighted by a reliability score computed within the task via leave-one-out on the worker’s other authenticity answers; low-performing workers are lower-bounded at chance so they do not contribute misleading evidence. The per-worker evidence is fused with Dempster’s rule, transformed to probabilities via pignistic transformation, and then converted back to a final label. For evaluation, the authors treat manipulated as the positive class and report precision, recall, FPR, FNR, F1, and accuracy at the video level after aggregation. They also measure inter-worker consistency with Krippendorff’s alpha, majority agreement, and pairwise agreement, and they analyze timestamp consistency only on manipulated ground-truth clips that receive at least three timestamp annotations, using the median normalized timestamp, IQR, and a ±5% agreement window around the median.
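The two aggregation rules can be contrasted in a small sketch. The majority vote follows the description above directly. The Dempster-Shafer part is only one plausible reading: the summary does not give the exact mass assignment, so the mapping from reliability r to committed mass (2r - 1, which makes a chance-level worker contribute no evidence), the 0.95 cap that avoids total conflict, and the tie handling of the final label are all assumptions.

```python
def majority_vote(votes):
    """votes: list of bools, True = 'manipulated'. Ties go to authentic (False)."""
    flagged = sum(votes)
    return flagged > len(votes) - flagged

def worker_mass(vote, reliability, cap=0.95):
    """Mass function over {M: manipulated, A: authentic, U: uncommitted}."""
    r = min(max(reliability, 0.5), cap)  # floor at chance; cap is an assumed safeguard
    support = 2.0 * r - 1.0              # assumed mapping: chance level gives zero support
    return {"M": support if vote else 0.0,
            "A": 0.0 if vote else support,
            "U": 1.0 - support}

def dempster(m1, m2):
    """Dempster's rule of combination on the two-element frame."""
    conflict = m1["M"] * m2["A"] + m1["A"] * m2["M"]
    norm = 1.0 - conflict
    return {
        "M": (m1["M"] * m2["M"] + m1["M"] * m2["U"] + m1["U"] * m2["M"]) / norm,
        "A": (m1["A"] * m2["A"] + m1["A"] * m2["U"] + m1["U"] * m2["A"]) / norm,
        "U": (m1["U"] * m2["U"]) / norm,
    }

def ds_vote(votes, reliabilities):
    """Fuse all workers, apply the pignistic transform, return True if manipulated."""
    fused = {"M": 0.0, "A": 0.0, "U": 1.0}  # vacuous prior
    for vote, reliability in zip(votes, reliabilities):
        fused = dempster(fused, worker_mass(vote, reliability))
    bet_manipulated = fused["M"] + fused["U"] / 2.0  # split uncommitted mass evenly
    return bet_manipulated > 0.5

# Hypothetical clip: six unreliable workers say authentic, four reliable ones say manipulated.
votes = [False] * 6 + [True] * 4
reliabilities = [0.4] * 6 + [0.9] * 4
mv_label = majority_vote(votes)
ds_label = ds_vote(votes, reliabilities)
```

In this toy case the majority vote keeps the clip as authentic while the weighted fusion flags it, which illustrates how reliability weighting can recover missed manipulations and, by the same mechanism, raise false alarms.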

Statistics and reproducibility: since the per-video sample size is limited and distributions are non-normal, the paper uses nonparametric inference. AV-Deepfake1M vs TMC comparisons use Mann-Whitney U tests; majority vote vs DS on paired video outcomes uses McNemar’s test; across manipulation types they use Kruskal-Wallis with Holm-Bonferroni post hoc tests. Multiple-comparison control is handled with Bonferroni for predefined families and Holm-Bonferroni for follow-up pairwise tests. Effect sizes are summarized with Cliff’s delta and bootstrap confidence intervals. The authors report that the studies were implemented with Crowd_Frame and release all data and the full task configuration on OSF (DOI 10.17605/OSF.IO/9RJ28). The truncated source does not provide the full timestamp-localization results or all figure-level numeric details beyond those in Table 3 and the text excerpt, so any deeper claims about timestamp accuracy would need the full paper.
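Two of the statistics named above are simple enough to sketch in plain Python. Whether the authors used the exact binomial form of McNemar's test is not stated, so that choice is an assumption, as are the function names. The discordant counts in the example are also not from the paper: they are the smallest split consistent with the reported TMC rates (at least one authentic clip that majority vote got right and DS got wrong, and at least four manipulated clips where the reverse holds).

```python
from math import comb

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(x > y for x in xs for y in ys)
    less = sum(x < y for x in xs for y in ys)
    return (greater - less) / (len(xs) * len(ys))

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: videos correct under method 1 only; c: videos correct under method 2 only.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Assumed minimal discordant split for TMC (majority vote vs DS); not reported data.
p_value = mcnemar_exact(1, 4)
# Toy per-video scores for two groups; purely illustrative.
delta = cliffs_delta([1, 2, 3], [4, 5, 6])
```

With so few discordant videos the paired test has little power, which is consistent with the caution about small per-dataset samples in the limitations.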

Technical innovations

A matched two-dataset Prolific protocol for audiovisual deepfake judgment that uses the same interface, label space, and per-video replication across AV-Deepfake1M and TMC.
A two-step crowd label design that separates binary authenticity from manipulation type and timestamp localization, enabling analysis of where human errors occur.
A comparison of majority vote and Dempster-Shafer aggregation for authenticity screening, with worker-specific reliability estimated by leave-one-out within each task.
A video-level analysis that treats missed manipulations, false positives, manipulation-type attribution, and timestamp consistency as distinct phenomena rather than collapsing them into one accuracy number.
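The leave-one-out reliability estimate mentioned in the list above might look roughly like this. It is a sketch under stated assumptions: correctness is judged against dataset ground truth, the score for each judgment is the worker's accuracy on their other clips in the same work unit, and the floor is 0.5. The summary does not confirm any of these details.

```python
def loo_reliability(correct_flags, floor=0.5):
    """Per-judgment reliability for one worker.

    correct_flags: list of bools, one per clip the worker judged, True when the
    authenticity answer matched ground truth (assumed reference).
    For each judgment, reliability is the accuracy on the OTHER judgments,
    lower-bounded at `floor` so weak workers never fall below chance.
    """
    n = len(correct_flags)
    total_correct = sum(correct_flags)
    scores = []
    for flag in correct_flags:
        others = n - 1
        accuracy = (total_correct - flag) / others if others else floor
        scores.append(max(accuracy, floor))
    return scores

# Hypothetical work unit of four clips: the worker got three right.
scores = loo_reliability([True, True, False, True])
```

With only four clips per work unit, each score rests on three answers, so the estimate is coarse; that granularity is a property of this sketch's assumptions, not a reported result.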

Datasets

AV-Deepfake1M — 48 sampled videos used in this study (from 1,146,760 total) — public benchmark
Trusted Media Challenge (TMC) — 48 sampled videos used in this study (from 6,943 total) — public benchmark

Baselines vs proposed

AV-Deepfake1M majority vote: accuracy = 0.438 vs proposed (Dempster-Shafer) = 0.458
AV-Deepfake1M majority vote: FPR = 0.083 vs proposed (Dempster-Shafer) = 0.250
AV-Deepfake1M majority vote: FNR = 0.722 vs proposed (Dempster-Shafer) = 0.639
TMC majority vote: accuracy = 0.646 vs proposed (Dempster-Shafer) = 0.708
TMC majority vote: FPR = 0.083 vs proposed (Dempster-Shafer) = 0.167
TMC majority vote: FNR = 0.444 vs proposed (Dempster-Shafer) = 0.333
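The rates above are mutually consistent with whole-video counts, which can be checked with a few lines of arithmetic. The confusion counts below are not quoted from the paper; they are reconstructed from the rounded rates, assuming 12 authentic and 36 manipulated clips per dataset and treating manipulated as the positive class.

```python
def video_level_metrics(tp, fn, fp, tn):
    """Standard binary metrics with 'manipulated' as the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
        "f1": f1,
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Reconstructed counts (assumption): each row sums to 36 manipulated + 12 authentic.
COUNTS = {
    ("AV-Deepfake1M", "majority"): dict(tp=10, fn=26, fp=1, tn=11),
    ("AV-Deepfake1M", "ds"):       dict(tp=13, fn=23, fp=3, tn=9),
    ("TMC", "majority"):           dict(tp=20, fn=16, fp=1, tn=11),
    ("TMC", "ds"):                 dict(tp=24, fn=12, fp=2, tn=10),
}

metrics = {key: video_level_metrics(**counts) for key, counts in COUNTS.items()}
```

Read this way, DS on AV-Deepfake1M recovers three additional manipulated clips but adds two false alarms among only twelve authentic clips, so the net accuracy gain is a single video.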

Limitations

The evaluation uses only 48 clips per dataset, so per-dataset estimates have high variance and may not generalize to the full benchmark distributions.
The sampled clips are condition-balanced rather than prevalence-matched, so reported accuracy and error rates are not deployment-calibrated to real-world class priors.
The study uses Prolific workers who are mostly US-based and English-speaking; results may differ with other populations or with non-English media.
The source excerpt does not provide complete quantitative timestamp-localization results, which limits how strongly that part of the contribution can be assessed from the provided text alone.
A crowd can only assess visible/audible cues exposed by the interface; no evidence is provided on performance under adversarial prompting, deceptive metadata, or auxiliary fact-checking tools.
The DS aggregation trades false negatives for false positives on AV-Deepfake1M, so improved sensitivity comes with a material cost in false alarms.
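The point about condition-balanced sampling can be made concrete by recomputing precision under a different prevalence. The sketch below applies Bayes' rule to the TMC majority-vote rates (recall 20/36 and FPR 1/12, reconstructed from the reported 0.444 FNR and 0.083 FPR); the 1% prevalence is a hypothetical deployment prior, not a figure from the paper.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision (PPV) of a 'manipulated' flag under a given class prior."""
    flagged_true = tpr * prevalence
    flagged_false = fpr * (1.0 - prevalence)
    total = flagged_true + flagged_false
    return flagged_true / total if total else 0.0

TPR_TMC = 20 / 36  # 1 - FNR for TMC majority vote (reconstructed)
FPR_TMC = 1 / 12   # reported as 0.083

in_sample = precision_at_prevalence(TPR_TMC, FPR_TMC, prevalence=0.75)   # study mix: 36 of 48
rare_fakes = precision_at_prevalence(TPR_TMC, FPR_TMC, prevalence=0.01)  # hypothetical prior
```

In the balanced study sample a flag is correct about 95% of the time, but if only 1 in 100 clips were manipulated the same rates would yield a precision of roughly 6%, assuming worker behavior stays unchanged at that prevalence.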

Open questions / follow-ons

How much can timestamp localization improve if workers are given better temporal aids, such as scrubbing controls, channel-specific replay, or forced localization annotations on every manipulated clip?
Would calibration feedback, example-based training, or adversarially selected clip difficulty improve manipulation-type attribution, especially for joint audio-video forgeries?
Can a hybrid pipeline use crowd authenticity judgments to triage content for automated forensic models without amplifying the false-positive trade-off seen under Dempster-Shafer on AV-Deepfake1M?
How stable are these findings under different prevalence priors, different languages, or shorter-form social media clips where cues may be even more compressed?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, the key takeaway is that crowdsourced human judgment is usable as a coarse screening layer, but only for the binary question of “something seems off,” not for precise forensic attribution. If you are designing a moderation or trust pipeline, this suggests crowd flags are a reasonable way to prioritize media for deeper review, but an unflagged clip is weak evidence of authenticity, because many manipulated clips are missed. Crowds also will not reliably tell you whether the issue is in audio, video, or both. The conservative behavior observed here matters operationally: after aggregation, authentic media is rarely flagged as fake, so a crowd flag may have decent precision but poor recall on subtler manipulations.

In practice, that means a bot-defense system should treat crowd labels as a weak, noisy signal to be fused with other evidence, not as a standalone detector. The large gap between authenticity detection and manipulation-type attribution implies that workflows needing fine-grained labels will need either expert review or machine assistance. The DS-vs-majority comparison is especially relevant: reliability weighting can increase sensitivity, but at the cost of more false positives, so any deployment would need an explicit thresholding policy and a clear tolerance for review load.

Cite

@article{arxiv2605_04797,
  title={Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes},
  author={Michael Soprano and Andrea Cioci and Stefano Mizzaro},
  journal={arXiv preprint arXiv:2605.04797},
  year={2026},
  url={https://arxiv.org/abs/2605.04797}
}
