Identifying social bots via heterogeneous motifs based on Naïve Bayes model

Source: arXiv:2512.22759 · Published 2025-12-28 · By Yijun Ran, Jingjing Xiao, Xiao-Ke Xu

TL;DR

This paper addresses the critical problem of detecting social bots on social media platforms by leveraging the heterogeneity of local network structures through a theoretically grounded motif-based framework. Unlike prior topology-based methods that treat network motifs homogeneously or rely on intuition, the authors propose refining traditional homogeneous motifs into heterogeneous motifs by integrating node-label information (human vs bot). This captures the distinct neighborhood preferences around nodes, which better differentiate bots from humans in social networks. The approach uses a Naïve Bayes probabilistic model to quantify the contribution of each motif's node pairs to the likelihood of a node being a bot, providing interpretability and a nonparametric foundation. They also derive a mathematical upper bound called the maximum capability to evaluate each motif's detection potential and guide feature selection. Extensive experiments on four large public social bot datasets demonstrate that heterogeneous motifs outperform homogeneous motifs and existing state-of-the-art baselines across multiple metrics. Selecting a subset of motifs with the highest maximum capabilities achieves comparable detection results to using all motifs, balancing effectiveness and complexity. Overall, the work offers a principled, interpretable, and empirically validated framework that advances social bot detection by explicitly modeling heterogeneous local network patterns.

Key findings

  • Refining 30 homogeneous 3-node motifs into 114 heterogeneous motifs by incorporating node-label information significantly improves detection accuracy across four datasets.
  • Second-order homogeneous motifs (e.g., M30) outperform first-order and closed motifs in capturing complex structural patterns useful for bot detection.
  • The proposed Naïve Bayes likelihood score combining node pair contributions within heterogeneous motifs achieves better AUC, Precision, Recall, and F1 scores than all evaluated baselines.
  • Using all 114 heterogeneous motifs as features yields the best detection performance; however, selecting motifs with top maximum capability values achieves comparable results with fewer features.
  • Their method surpasses state-of-the-art baselines including BotRGCN, RGT, Ising, and Botometer on publicly available datasets Cresci-15, MGTAB, TwiBot-20, and TwiBot-22 over five metrics.
  • Maximum capability provides a theoretical upper bound on each motif's contribution, guiding effective feature selection for bot detection.
  • The Naïve Bayes framework enables a nonparametric, interpretable quantification of motif contributions, unlike heuristic or deep learning-based methods.
  • The time complexity of motif discovery using G-Tries is manageable for 3-node motifs with O(114*3!) operations for heterogeneous motifs.
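The jump from 30 to 114 motif types can be sanity-checked by brute force. The sketch below is one reading that reproduces both counts, not the authors' enumeration. It assumes the target node is held fixed, each of the three node pairs is in one of four edge states, and only the two non-target nodes carry a human/bot label. All names (`STATES`, `canonical`, `count_types`) are hypothetical.

```python
from itertools import product

# Assumed encoding of a node pair: 0 = no edge, 1 = forward edge,
# 2 = backward edge, 3 = edges in both directions.
STATES = (0, 1, 2, 3)
REVERSE = {0: 0, 1: 2, 2: 1, 3: 3}  # pair state after swapping its endpoints
LABELS = ("human", "bot")


def is_connected(tu, tv, uv):
    """Three nodes are weakly connected when at least two pairs share an edge."""
    return sum(state != 0 for state in (tu, tv, uv)) >= 2


def canonical(tu, tv, uv, lu, lv):
    """Canonical form with target t fixed; the other nodes u and v may be swapped."""
    as_given = (tu, tv, uv, lu, lv)
    swapped = (tv, tu, REVERSE[uv], lv, lu)
    return min(as_given, swapped)


def count_types(labelled):
    """Count distinct target-centred 3-node motif types, with or without labels."""
    label_choices = LABELS if labelled else ("",)
    seen = set()
    for tu, tv, uv in product(STATES, repeat=3):
        if not is_connected(tu, tv, uv):
            continue
        for lu, lv in product(label_choices, repeat=2):
            seen.add(canonical(tu, tv, uv, lu, lv))
    return len(seen)


homogeneous_types = count_types(labelled=False)
heterogeneous_types = count_types(labelled=True)
print(homogeneous_types, heterogeneous_types)  # prints: 30 114
```

Under these assumptions the enumeration yields 30 position-sensitive motif types without labels and 114 once the two non-target nodes are labelled, which agrees with the counts quoted above.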

Threat model

The adversary consists of malicious social bots designed to impersonate humans on social media by manipulating follower/following relationships. The attacker can create and link multiple bot accounts but cannot arbitrarily alter the global graph structure outside their control. The defender has access to labeled training data indicating bot and human nodes, and the goal is to identify bots based solely on network topology and node connections without relying on content or metadata. The adversary is assumed not to directly manipulate or obfuscate heterogeneous motif structures beyond normal operations.

Methodology — deep read

  1. Threat model & assumptions: The adversary is a social bot operator creating automated accounts intended to mimic humans on social media networks. The attacker cannot directly manipulate the network topology beyond their bots' connections, and node labels indicating bot or human are partially known for training. The method assumes the graph is directed and does not contain self-loops or multiple edges between nodes.

  2. Data provenance and preprocessing: Four publicly available Twitter bot detection datasets are used: Cresci-15 (1,741 nodes, 6,214 edges), MGTAB (~9.4k nodes, ~426k edges), TwiBot-20 (~206k nodes, ~227k edges), and TwiBot-22 (~694k nodes, ~3.7M edges). Each dataset provides ground truth labels for bot and human nodes. The authors build directed follower/following heterogeneous graphs that incorporate node labels, and they remove self-loops and duplicated edges.

  3. Architecture & algorithm: The core concept involves two motif types: homogeneous motifs, defined as sets of 3-node directed subgraphs classified into 30 unique motif types sensitive to the position of the target node, and heterogeneous motifs, which refine homogeneous motifs by integrating the bot/human labels of neighboring nodes, expanding motif types to 114.

The Naïve Bayes model assumes that the contributions of the node pairs forming these motifs with the target node are conditionally independent. The likelihood ratio of the node being a bot rather than a human is then the product of the contributions from all node pairs in its motifs. Each contribution is estimated empirically as the ratio of motif counts formed by bots to those formed by humans for that node pair, with add-one smoothing to avoid division by zero.

The framework calculates a likelihood score for each motif and sums log-likelihoods to get a final score per node. These scores are used as features in an XGBoost classifier for supervised bot detection.

  4. Training regime: The XGBoost classifier is trained with hyperparameter tuning via grid search and evaluated using ten-fold cross-validation with balanced folds. The number of training iterations and the random seeds are not specified.

  5. Evaluation protocol: The model is assessed on each dataset using five metrics: Accuracy, Precision, Recall, F1 score, and AUC. Baselines include unsupervised graph-based methods (Ising, SybilWalk, SybilSCAR), traditional feature-based supervised methods (Botometer, FP, ARG), and advanced deep learning approaches (BotRGCN, DeeProBot, RGT, T5 transformer). Ablation studies compare homogeneous and heterogeneous motifs and analyze the impact of motif order and node position.

  6. Reproducibility: The paper does not explicitly mention code or data releases beyond using public datasets. Details on fixed random seeds or frozen weights are not provided. Motif discovery uses the G-Trie data structure with known complexity.
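The five reported metrics can be computed from labels and classifier scores as in the following minimal sketch. It is plain Python rather than the authors' evaluation code; the 0.5 decision threshold, the function name `classification_metrics`, and the toy inputs are assumptions made for illustration.

```python
def classification_metrics(y_true, scores, threshold=0.5):
    """Return Accuracy, Precision, Recall, F1 and AUC for binary labels (1 = bot)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # AUC as the probability that a random bot outscores a random human
    # (ties count one half); it is undefined when either class is missing.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if pos and neg:
        wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
                   for p in pos for n in neg)
        auc = wins / (len(pos) * len(neg))
    else:
        auc = float("nan")

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "auc": auc}


# Toy example: three bots and three humans with made-up scores.
metrics = classification_metrics([1, 1, 1, 0, 0, 0],
                                 [0.9, 0.8, 0.4, 0.6, 0.3, 0.1])
print(metrics)
```

The pairwise AUC loop is quadratic in the number of nodes, so it suits small checks only; a rank-based formulation would be preferable at the scale of TwiBot-22.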

Concrete Example: For a target node A, the method identifies the heterogeneous motifs that involve A and node pairs such as (B,C), (B,D), and (C,E). For each pair, it calculates the contribution ratio from the empirical counts of that motif among bots and humans. Multiplying these contributions according to Eq. (11) gives a likelihood score for A, which feeds into XGBoost for the final bot classification. For example, in heterogeneous motif Y1 the pair (B,C), which forms the motif, provides one contribution, while the pair (D,E), which does not form it, contributes differently. This difference is what captures the heterogeneity of neighborhood influence.
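The scoring step can be illustrated with a simplified sketch. It does not reproduce the paper's Eq. (11): contributions are collapsed to one smoothed bot/human count ratio per motif type, rather than one per node pair, and the helper names (`fit_contributions`, `log_likelihood_score`) and the toy counts are hypothetical.

```python
import math
from collections import Counter


def fit_contributions(motif_counts_by_node, labels):
    """Smoothed bot/human count ratio for every motif type seen in training."""
    bot_totals, human_totals = Counter(), Counter()
    for node, counts in motif_counts_by_node.items():
        totals = bot_totals if labels[node] == "bot" else human_totals
        totals.update(counts)  # adds this node's motif counts to the class total
    motif_types = set(bot_totals) | set(human_totals)
    # Add-one smoothing keeps the ratio finite when either count is zero.
    return {m: (bot_totals[m] + 1) / (human_totals[m] + 1) for m in motif_types}


def log_likelihood_score(counts, contributions):
    """Sum of log-contributions, i.e. the log of the product of contributions."""
    # Assumption: motif types unseen in training are neutral (ratio 1).
    return sum(n * math.log(contributions.get(m, 1.0))
               for m, n in counts.items())


# Toy training data: motif counts around one labelled bot and one labelled human.
train_counts = {
    "bot_1": {"Y1": 3},
    "human_1": {"Y1": 1, "Y2": 2},
}
train_labels = {"bot_1": "bot", "human_1": "human"}
contributions = fit_contributions(train_counts, train_labels)

score_bot_like = log_likelihood_score({"Y1": 2}, contributions)
score_human_like = log_likelihood_score({"Y2": 1}, contributions)
print(contributions, score_bot_like, score_human_like)
```

In this toy setting a positive score means the node's motifs are more typical of bots and a negative score means they are more typical of humans. Working in log space replaces the product with a sum, which avoids numerical underflow for nodes with many motifs.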

Technical innovations

  • Refinement of homogeneous 3-node motifs into heterogeneous motifs by explicitly incorporating node-label (bot/human) information to capture neighborhood preference heterogeneity.
  • Development of a Naïve Bayes theoretical framework to systematically quantify the contribution of different node pairs within heterogeneous motifs to social bot detection likelihood.
  • Mathematical derivation and introduction of the maximum capability metric as an upper bound to evaluate and select motifs based on their potential detection effectiveness.
  • Integration of motif-based Naïve Bayes likelihood scores as features in an XGBoost classifier for accurate, interpretable, and nonparametric social bot classification.
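To make the refinement step concrete, the following sketch tallies label-aware 3-node motifs around one target node. It is an illustrative reading rather than the authors' G-Trie implementation: the edge-state encoding, the signature tuple, and the function names `pair_state` and `motif_counts` are assumptions, and the labels of the two non-target nodes are assumed to be known.

```python
from collections import Counter
from itertools import combinations

REVERSE = {0: 0, 1: 2, 2: 1, 3: 3}  # pair state after swapping its endpoints


def pair_state(edges, a, b):
    """0 = no edge, 1 = a->b only, 2 = b->a only, 3 = both directions."""
    return (1 if (a, b) in edges else 0) + (2 if (b, a) in edges else 0)


def motif_counts(edges, labels, target):
    """Count heterogeneous 3-node motif signatures that contain `target`."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Any connected triple containing the target lies within two hops of it.
    reach = set(neighbours.get(target, set()))
    for n in list(reach):
        reach |= neighbours[n]
    reach.discard(target)

    counts = Counter()
    for u, v in combinations(sorted(reach), 2):
        tu = pair_state(edges, target, u)
        tv = pair_state(edges, target, v)
        uv = pair_state(edges, u, v)
        if sum(s != 0 for s in (tu, tv, uv)) < 2:
            continue  # the three nodes are not weakly connected
        as_given = (tu, tv, uv, labels[u], labels[v])
        swapped = (tv, tu, REVERSE[uv], labels[v], labels[u])
        counts[min(as_given, swapped)] += 1  # order-independent signature
    return counts


# Toy graph: A->B, B->C, C->A, A->D with made-up labels.
edges = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")}
labels = {"A": "human", "B": "bot", "C": "human", "D": "bot"}
counts_for_a = motif_counts(edges, labels, "A")
print(counts_for_a)
```

Dropping the two label entries from the signature gives the homogeneous version of the same count, so the two feature sets can be compared on identical triples.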

Datasets

  • Cresci-15 — 1,741 nodes, 6,214 edges — publicly available Twitter bot dataset
  • MGTAB — 9,443 nodes, 425,863 edges — publicly available Twitter bot dataset
  • TwiBot-20 — 205,730 nodes, 227,477 edges — publicly available Twitter bot dataset
  • TwiBot-22 — 693,761 nodes, 3,711,903 edges — publicly available Twitter bot dataset
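The preprocessing described for these datasets (a directed follower/following graph with self-loops and duplicated edges removed) might look like the sketch below. The edge-list input format, the follower-to-followee edge direction, and the function name `build_directed_graph` are assumptions; the actual dataset files are laid out differently.

```python
def build_directed_graph(raw_edges):
    """Deduplicate a directed edge list and drop self-loops.

    `raw_edges` is assumed to be an iterable of (source, destination) pairs,
    read here as "source follows destination".
    """
    edges = set()
    for src, dst in raw_edges:
        if src == dst:
            continue  # remove self-loops
        edges.add((src, dst))  # the set removes duplicated edges

    successors, predecessors = {}, {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
        predecessors.setdefault(dst, set()).add(src)
    return edges, successors, predecessors


raw = [("a", "b"), ("a", "b"), ("b", "a"), ("c", "c"), ("b", "c")]
clean_edges, successors, predecessors = build_directed_graph(raw)
print(sorted(clean_edges))
```

Reciprocal edges such as a->b and b->a are kept as two distinct edges, since direction matters for the motif types.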

Baselines vs proposed

  • Ising: Lower AUC than the proposed method on all datasets (exact values are not given here).
  • SybilWalk: Trails the heterogeneous motif method by several percentage points in F1 and AUC on multiple datasets.
  • SybilSCAR: Consistently worse than the heterogeneous motif framework in Precision and Recall.
  • Botometer: A supervised baseline built on metadata features; it scores lower on the detection metrics than the heterogeneous motif + XGBoost approach.
  • DeeProBot, BotRGCN, RGT: State-of-the-art deep learning and GNN baselines; heterogeneous motif method achieves better AUC and F1 scores across Cresci-15, TwiBot-20, and TwiBot-22.
  • Homogeneous motifs (All): Lower detection performance compared to heterogeneous motifs (All), confirming the added value of node-label integration.
  • Motif selection by maximum capability: Top selected motifs achieve similar performance to using all 114 without significant metric loss.

Limitations

  • Motif discovery and enumeration are computationally feasible only because the method restricts motifs to size 3; scalability to larger motifs is not demonstrated.
  • The assumption of conditional independence in the Naïve Bayes model may oversimplify complex dependencies between node pairs.
  • The empirical estimation of node pair contributions depends on labeled data distributions that may not generalize well to new or evolving social bots.
  • No adversarial robustness evaluation against adaptive bot strategies designed to evade motif-based detection is provided.
  • The approach focuses only on follower/following topologies; content and temporal dynamics are not integrated, which may limit detection of sophisticated bots.
  • No code or model weight release is mentioned, which may impede reproducibility and adoption.

Open questions / follow-ons

  • Can the motif-based Naïve Bayes framework be extended to incorporate temporal dynamics or behavioral features alongside topology?
  • How robust is the heterogeneous motif approach against adversarial manipulations explicitly designed to mimic human neighborhood preferences?
  • What is the computational feasibility and detection value of scaling motifs beyond 3 nodes to capture higher-order network patterns?
  • Can this framework be adapted to other types of networks and bot detection scenarios beyond Twitter-like follower structures?

Why it matters for bot defense

This work is relevant to bot-defense engineers working on social media platforms: it shows that local heterogeneous network patterns can provide strong signals for distinguishing bots from humans. Unlike content or metadata-based methods, this motif-based approach formalizes and interprets topological features for bot detection, and the maximum capability metric gives a theoretical upper bound on what each motif can contribute. CAPTCHAs and other interaction-based human verification mechanisms might benefit from integrating network-structural signals as continuous background features, potentially improving detection accuracy and interpretability. Additionally, selecting a small subset of motifs by maximum capability could reduce complexity in live systems.

However, since the method focuses on static follower/following graphs and does not consider active user interactions or behavioral signals, CAPTCHAs operating on session-level or real-time inputs should consider combining this topological approach with more dynamic defenses. The interpretability and theoretical foundations may also help security teams audit and explain bot detection results to reduce false positives, a critical requirement in customer-facing bot defense.

Cite

@article{arxiv2512_22759,
  title={Identifying social bots via heterogeneous motifs based on Naïve Bayes model},
  author={Yijun Ran and Jingjing Xiao and Xiao-Ke Xu},
  journal={arXiv preprint arXiv:2512.22759},
  year={2025},
  url={https://arxiv.org/abs/2512.22759}
}

Articles are CC BY 4.0 — feel free to quote with attribution