CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

Source: arXiv:2605.30241 · Published 2026-05-28 · By Sahajpreet Singh, Insyirah Mujtahid, Min-Yen Kan, Kokil Jaidka

TL;DR

CommunityFact introduces a dynamic, multilingual, and multi-domain benchmark for misinformation detection that better reflects real-world, ongoing misinformation verification challenges on social media. Unlike static fact-checking datasets, CommunityFact draws from X's Community Notes to build a refreshable corpus of 15,992 standalone factual claims in five languages (English, Spanish, French, Japanese, Portuguese) across politics and finance domains. Claims are labeled true or false based on crowd-curated Community Notes corrections. The benchmark supports fine-grained temporal, linguistic, and domain-specific evaluation slices.

The authors evaluate ten large language models (LLMs) under four inference-time settings: closed-input, explicit reasoning, open web search, and evidence-guided web search using Community Notes URLs. The results reveal wide performance gaps. Closed-input verification is difficult even for large multilingual LLMs, while access to web retrieval substantially improves macro-F1. Moreover, two of the three web-enabled models (GPT-5-nano and Grok-4.3) improve significantly when their retrieval is steered toward the credible, crowd-curated sources cited in Community Notes. The third, Gemini-2.5-Flash, barely shifts its retrieval toward those sources and shows a small, non-significant drop under evidence guidance. This suggests that source selection policy is a crucial factor in verification quality for web-enabled LLMs. CommunityFact is released with an automatic pipeline enabling continuous updates, supporting future research in claim-conditioned retrieval and multilingual misinformation detection.

Key findings

The CommunityFact benchmark contains 15,992 standalone claims with 9,970 TRUE and 6,022 FALSE examples across 5 languages and 2 domains (politics, finance).
Closed-input instruction-tuned LLMs achieve modest macro-F1 scores on the temporal test split: the top model, Aya-Expanse-32B, reaches 69.93, while smaller or less-tuned models fall below 40 (Table 2).
Explicit reasoning at inference time (+Thinking) lowers macro-F1 for both Qwen3-14B (62.83 → 58.92) and Qwen3-32B (65.68 → 61.15), indicating that reasoning without evidence is unreliable.
Open web search substantially improves macro-F1: GPT-5-nano rises from 38.09 to 75.59, Gemini-2.5-Flash from 68.01 to 78.13, and Grok-4.3 from 53.85 to 83.80 (Table 2).
Adding evidence-guided web search using Community Notes URLs improves GPT-5-nano (+1.48 p.p., p<0.05) and Grok-4.3 (+1.97 p.p., p<0.001) but decreases Gemini-2.5-Flash by 0.67 p.p. (n.s.).
Source domain hit-ratio with human-vetted Community Notes sources rises from 0.62 to 0.94 for Grok-4.3 and 0.13 to 0.56 for GPT-5-nano under evidence guidance, but remains near zero for Gemini-2.5-Flash (Table 3).
The two improving models follow opposite retrieval strategies: GPT-5-nano expands retrieval (1.0 → 1.4 URLs/claim), while Grok-4.3 prunes over-broad retrieval (21.9 → 13.9 URLs/claim). Both end up better aligned with human-cited sources.
Significant variation in verification performance exists across language–domain slices (e.g., GPT-5-nano scores 59.45 on French finance vs. 83.62 on Japanese finance).
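The source-domain hit ratio reported above can be illustrated with a minimal Python sketch. The definition used here (the fraction of note-cited domains that also appear among the model's retrieved domains) is an assumption made for illustration, and the paper's exact formula may differ. The function names and example URLs are hypothetical.

```python
from urllib.parse import urlparse


def source_domain(url: str) -> str:
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def domain_hit_ratio(retrieved_urls: list[str], note_urls: list[str]) -> float:
    """Fraction of note-cited domains that also appear among retrieved domains.

    Assumed definition for illustration only; not taken from the paper.
    """
    note_domains = {source_domain(u) for u in note_urls}
    if not note_domains:
        return 0.0
    retrieved_domains = {source_domain(u) for u in retrieved_urls}
    return len(note_domains & retrieved_domains) / len(note_domains)


# Hypothetical URLs: one of the two note-cited domains is also retrieved.
retrieved = ["https://www.reuters.com/world/x", "https://example.org/post"]
cited = ["https://reuters.com/article/y", "https://www.bbc.co.uk/news/z"]
ratio = domain_hit_ratio(retrieved, cited)
```

Comparing at the domain level rather than the exact URL level is what lets a model count as aligned when it reaches the same outlet through a different article. URLs without a scheme would need normalizing first, since `urlparse` leaves their host empty.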

Threat model

An adversary disseminates misinformation claims on social media, aiming to evade detection or correction by automated fact-checking systems. The adversary may rely on rapidly evolving events, language diversity, and local contexts that challenge static or outdated verification datasets. The defenders are large language models, optionally equipped with web retrieval, that must classify new claims as true or false. The adversary is assumed to have no control over the evidence or the community-generated corrections used for verification.

Methodology — deep read

Threat Model & Assumptions: The adversary is a misinformation spreader disseminating false claims on social media platforms (specifically X), attempting to evade detection and factual correction. The defense focuses on automated large language models (LLMs) that verify claims dynamically, assuming access or lack thereof to external evidence such as web search results. The adversary cannot directly control Community Notes or the web content used for verification.
Data: CommunityFact is constructed from X’s Community Notes, a crowdsourced fact-checking mechanism where volunteer contributors attach helpful context and evidence URLs to potentially misleading posts. The authors use data from notes created in 2025, with a lag for rating stabilization until March 2026. Starting from 11,545 post–note pairs across five major languages, they extract 15,992 standalone factual claims labeled TRUE or FALSE depending on whether the notes support or refute the claim. Claims are independently verified through human audits incorporating native annotators' majority voting. The dataset is split temporally within each language-domain group to create training (12,414 examples) and test (3,578 examples) sets, preventing source leakage and simulating real-world verification of future claims.
Architecture / Algorithm: The study benchmarks ten large multilingual language models covering instruction-following models (e.g., Aya-Expanse-8B, EuroLLM-22B), reasoning-enhanced open-weight models (Qwen3-14B/32B), and web-enabled LLMs (GPT-5-nano, Gemini-2.5-Flash, Grok-4.3). The web-enabled models run live web searches at inference time, optionally guided by the evidence URLs cited in Community Notes. Prompts contain the claim text, its timestamp, and any retrieved evidence, and ask for a binary TRUE/FALSE judgement; the template is standardized across languages. The reasoning ("+Thinking") condition prompts the model to produce explicit chain-of-thought before answering.
Training Regime: The LLMs are evaluated zero-shot with fixed checkpoints or APIs, without task-specific fine-tuning on CommunityFact. Open-weight models use deterministic decoding; API parameters are set according to each model's capabilities. The CommunityFact claim-extraction pipeline employs GPT-5.53 with a medium reasoning effort prompt, plus a self-refinement step that audits claim quality and label correctness.
Evaluation Protocol: The primary metric is macro-F1 score over TRUE and FALSE labels, computed overall and sliced by language and domain. Performance differences across capability tiers and inference settings (instruction-only, +Thinking, +Web-search, +Evidence-guided web-search) are analyzed. Statistical significance for the impact of evidence-guided search is assessed via paired significance tests on overall results. Human audits validate claim quality, label accuracy, and domain assignment. Source domain citation overlap is computed to quantify alignment between LLM retrievals and human-curated Community Notes sources.
Reproducibility: The dataset and benchmark construction pipeline are released publicly (https://github.com/sahajps/CommunityFact, https://hf.co/datasets/sahajps/CommunityFact). Exact model weights for proprietary APIs are not released. The construction pipeline can produce updated future snapshots from new Community Notes archives. Complete prompt templates and validation protocols are available in appendices.
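The temporal split described under Data can be sketched as follows. This is a minimal illustration, not the released pipeline: the `Claim` fields, the `temporal_split` helper, and the `test_fraction` default of 0.2 are assumptions (the reported split puts roughly 22% of claims in test), and the sketch does not handle the source-leakage constraint the authors mention.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    text: str
    language: str
    domain: str
    created: date  # hypothetical field: date of the underlying note


def temporal_split(claims, test_fraction=0.2):
    """Within each (language, domain) group, send the newest claims to test."""
    groups = defaultdict(list)
    for claim in claims:
        groups[(claim.language, claim.domain)].append(claim)

    train, test = [], []
    for group in groups.values():
        group.sort(key=lambda c: c.created)  # oldest first
        n_test = round(len(group) * test_fraction)
        cut = len(group) - n_test
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Toy data: five claims in each of two language-domain groups.
claims = [Claim(f"c{i}", "en", "politics", date(2025, 1, i + 1)) for i in range(5)]
claims += [Claim(f"j{i}", "ja", "finance", date(2025, 2, i + 1)) for i in range(5)]
train, test = temporal_split(claims)
```

Splitting inside each group, rather than with one global cutoff date, keeps every language-domain slice represented in the test set even when slices differ in size or activity period.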

Concrete Example: For a given post-note pair in English about a recent political claim in 2025, the claim extraction step reformulates the assertion into a standalone claim independent of original post text or media. The claim is labeled FALSE if the note refutes the assertion. The claim, timestamp, label, language and domain metadata, plus community-cited URLs form a benchmark example. During evaluation, a web-enabled LLM receives the claim text and timestamp plus evidence URLs to retrieve relevant supporting documents dynamically and produces a TRUE/FALSE label. The output is scored against the ground truth, contributing to aggregate and slice-level macro-F1 statistics.
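The example above can be made concrete with a small Python sketch of one benchmark record and the two web-search prompt variants. The field names, the prompt wording, and the `parse_verdict` helper are all hypothetical; the released dataset schema and the prompt templates in the paper's appendices may look different.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkExample:
    claim: str
    timestamp: str   # ISO date string (assumed format)
    label: str       # "TRUE" or "FALSE"; never shown to the model
    language: str
    domain: str
    note_urls: list[str] = field(default_factory=list)


def build_prompt(example: BenchmarkExample, evidence_guided: bool = False) -> str:
    """Build an illustrative verification prompt; the wording is an assumption."""
    lines = [
        "Decide whether the following claim is TRUE or FALSE.",
        f"Claim: {example.claim}",
        f"Claim date: {example.timestamp}",
    ]
    if evidence_guided and example.note_urls:
        lines.append("Consider these sources when searching:")
        lines.extend(f"- {url}" for url in example.note_urls)
    lines.append("Answer with exactly one word: TRUE or FALSE.")
    return "\n".join(lines)


def parse_verdict(model_output: str):
    """Map raw model text to 'TRUE' or 'FALSE'; None if neither or both appear."""
    text = model_output.upper()
    has_true, has_false = "TRUE" in text, "FALSE" in text
    if has_true == has_false:
        return None
    return "TRUE" if has_true else "FALSE"


# Placeholder record; the claim text and URL are invented for illustration.
example = BenchmarkExample(
    claim="Placeholder claim text.",
    timestamp="2025-06-01",
    label="FALSE",
    language="en",
    domain="politics",
    note_urls=["https://example.org/source"],
)
open_prompt = build_prompt(example)
guided_prompt = build_prompt(example, evidence_guided=True)
```

The only difference between the two prompts is the list of note-cited URLs, which mirrors the contrast between open and evidence-guided web search. Treating an unparseable answer as `None` keeps scoring explicit about outputs that commit to neither label.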

Technical innovations

A dynamic misinformation benchmark constructed from continuously updated public Community Notes, enabling refreshable evaluation aligned with real-world social-media misinformation.
Multilingual, multi-domain claim extraction methodology converting noisy post-note pairs into standalone, text-verifiable claims with note-grounded binary labels, supporting fine-grained slice evaluations.
Capability-stratified evaluation of LLMs (instruction-only, reasoning-enabled, open web search, evidence-guided web search) on temporally held-out claims, revealing systemic verification challenges and benefits of retrieval.
Demonstration and quantification of web-enabled LLMs' misalignment with human-curated evidence ecosystems, together with a mechanism that improves verification by steering retrieval toward crowd-vetted source domains through either retrieval expansion or pruning.

Datasets

CommunityFact — 15,992 claims — constructed from X's Community Notes public archive (2025 data, released 2026)

Baselines vs proposed

Aya-Expanse-32B (instruction-following): macro-F1 = 69.93 vs web-enabled Grok-4.3: 83.80
Qwen3-14B (reasoning): macro-F1 = 62.83 without +Thinking vs 58.92 with +Thinking (performance decreased)
GPT-5-nano: macro-F1 = 38.09 closed-input vs 75.59 with +Web-search
Gemini-2.5-Flash: macro-F1 = 68.01 closed-input vs 78.13 with +Web-search
Grok-4.3: macro-F1 = 53.85 closed-input vs 83.80 with +Web-search
Evidence-guided web-search vs open web-search: GPT-5-nano +1.48 p.p. (p<0.05), Grok-4.3 +1.97 p.p. (p<0.001), Gemini-2.5-Flash −0.67 p.p. (n.s.)
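The comparisons above rest on macro-F1 over the TRUE and FALSE labels and on a paired significance test. The sketch below computes macro-F1 from scratch and pairs it with a bootstrap test. The paper does not say which paired test it uses, so the paired bootstrap here is a stand-in chosen for illustration, and all function names are hypothetical. Scores are on a 0 to 1 scale, whereas the summary reports them multiplied by 100.

```python
import random


def f1_for_label(gold, pred, label):
    """F1 for one label, returning 0.0 when the label never occurs."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, labels=("TRUE", "FALSE")):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


def paired_bootstrap(gold, pred_a, pred_b, n_resamples=2000, seed=0):
    """Return (observed delta, share of resamples where B does not beat A).

    Both systems are scored on the same resampled claims, which is what
    makes the test paired.
    """
    rng = random.Random(seed)
    n = len(gold)
    observed = macro_f1(gold, pred_b) - macro_f1(gold, pred_a)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if macro_f1(g, b) - macro_f1(g, a) <= 0:
            not_better += 1
    return observed, not_better / n_resamples


# Toy labels: system A makes one error, system B makes none.
gold = ["TRUE", "TRUE", "FALSE", "FALSE"]
pred_a = ["TRUE", "FALSE", "FALSE", "FALSE"]
pred_b = list(gold)
score_a = macro_f1(gold, pred_a)
delta, p_value = paired_bootstrap(gold, pred_a, pred_b)
```

Macro-F1 weights TRUE and FALSE equally, which matters here because the benchmark has more TRUE than FALSE claims (9,970 vs 6,022) and plain accuracy would reward a bias toward TRUE.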

Limitations

CommunityFact coverage reflects distribution of claims and languages in Community Notes, leading to long-tail language and regional representation bias.
Labels derive from Community Notes rather than independent expert fact-checks, introducing potential label noise or interpretive disagreements.
Current claims are text-only and exclude multi-modal misinformation involving images, videos, or other media modalities.
Performance variability across language and domain slices suggests incomplete generalization and potential gaps for lower-resource languages or specialized domains.
The evidence-guided web-search is evaluated as a retrieval alignment method at inference; no training or fine-tuning of models on CommunityFact claim-evidence pairs was performed.
Closed-input evaluation may conflate limited parametric knowledge with temporal split difficulty; models trained on older data are at a natural disadvantage.

Open questions / follow-ons

How can claim-conditioned source suggesters be effectively learned from Community Notes data to dynamically guide retrieval for novel, unseen claims?
Can richer label taxonomies beyond binary true/false better capture interpretive ambiguity and subtle factual nuances in social media misinformation?
How can the benchmark and evaluation protocols be extended to multimodal misinformation verification involving images, videos, or combined modalities?
What are robust methods to improve LLM retrieval alignment to human-curated sources across low-resource languages and underrepresented domains?

Why it matters for bot defense

This study highlights the critical role of evidence access and source selection in automated misinformation detection with LLMs, which is directly relevant for bot defense systems that utilize natural language understanding to assess the factuality or intent behind user-generated content. Captchas and bot defenses could incorporate multilingual, temporally fresh datasets like CommunityFact to train or benchmark subsystems intended to detect misinformation-laden inputs. The demonstrated importance of retrieval from aligned, credible sources suggests that bot defenses relying on LLMs should incorporate mechanisms to prioritize community-verified or authoritative evidence rather than generic web content to reduce false positives or negatives. Moreover, the finding that explicit reasoning alone does not improve performance without retrieval underscores that reasoning modules should be tightly coupled with verified evidence rather than used in isolation. Finally, the temporal evaluation and multilingual support provided by CommunityFact support robust detection of misinformation bots operating across languages and evolving topics, a crucial feature for scalable bot mitigation.

Cite

@article{arxiv2605_30241,
  title={CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild},
  author={Sahajpreet Singh and Insyirah Mujtahid and Min-Yen Kan and Kokil Jaidka},
  journal={arXiv preprint arXiv:2605.30241},
  year={2026},
  url={https://arxiv.org/abs/2605.30241}
}
