Human, AI, and Hybrid Ensembles for Detection of Adaptive, RL-based Social Bots
Source: arXiv:2603.23796 · Published 2026-03-25 · By Valerio La Gatta, Nathan Subrahmanian, Kaitlyn Wang, Larry Birnbaum, V. S. Subrahmanian
TL;DR
This paper studies a question that most bot-detection work ignores: what happens when the bots are not static, but reinforcement-learning agents that adapt during the campaign? The authors use data from a five-day, IRB-approved social-media experiment in which 225 MTurk participants interacted on a synthetic platform (DartPost) infiltrated by four RL-trained covert social influence operations (CSIOs), each pushing a different political topic. The central comparison is humans versus automated detectors versus hybrid ensembles that combine human reports with AI scores.
The main result is that humans alone are not consistently strong at detecting these adaptive bots, and several intuitive predictors fail or reverse. Older participants outperformed younger ones and native English speakers outperformed non-native speakers, but participants with higher formal education surprisingly performed worse than those with less. There was no measurable learning effect over the five days. At the same time, active engagement with bots and some forms of bot exposure correlated with better detection, and simple collective-intelligence signals were informative: accounts reported multiple times were much more likely to be bots. The strongest practical takeaway is that hybrid systems that fuse human reports with AI predictions beat either source alone, though the paper’s own excerpt does not provide the full numeric comparison table for those hybrid methods.
Key findings
- Across demographics, age reversed the expected pattern: participants 35+ achieved F1 = 0.570 vs 0.344 for those under 35, with FDR-corrected p = 0.045.
- Native English speakers substantially outperformed non-native speakers on bot detection: F1 = 0.516 vs 0.153, FDR-corrected p = 0.009.
- Education also reversed intuition: bachelor’s-or-below participants scored F1 = 0.542 vs 0.068 for master’s+ participants, FDR-corrected p = 0.028.
- Twitter usage was not a significant predictor (F1 = 0.496 for users vs 0.350 for non-users, p = 0.646), and Reddit usage also was not significant (F1 = 0.277 for users vs 0.538 for non-users, p = 0.208).
- No learning effect was detected over the five-day study: day-specific F1 regression slope β = 3.10×10^-4 (p = 0.987), and cumulative F1 slope β = 5.44×10^-2 (p = 0.379).
- Active engagement with bots correlated positively with detection: BER_follow predicted F1 with β = 0.005 (p = 0.003) and BER_like with β = 0.006 (p = 7.1×10^-5).
- Bot exposure via likes also correlated positively with detection: BXR_like predicted F1 with β = 0.007 (p = 0.017), but BXR_follow was not significant (p > 0.70).
- Report frequency was a strong proxy for bot likelihood: P(bot | k=1 reports) = 0.467, P(bot | k=2) = 0.632, and P(bot | k=4) = 0.857, compared with the base rate P(bot) = 0.262.
Threat model
The adversary is an RL-driven social-bot operator in a covert social influence operation. It can adapt its posting, liking, following, and account-activation behavior over time, observe platform responses, and optimize for both influence and evasion of an automated detector. The adversary cannot directly see the participants’ private survey responses, and the human subjects do not receive ground-truth feedback on their reports during the experiment. The bots are evaluated in a controlled, IRB-approved synthetic platform rather than on a live public platform.
Methodology — deep read
Threat model and study setting: the paper evaluates human detection of adaptive bots operating in active covert social influence operations, not static sockpuppets. The adversary is an RL-based bot operator that can change posting, liking, following, and account-activation behavior over time to maximize influence while evading an automated detector. The authors explicitly frame the bots as adaptive and the human participants as unaware of which accounts are bots; participants were not given ground-truth feedback about their reports, so any learning would have to come from implicit experience rather than supervision. The experiment is IRB-approved and conducted on DartPost, a controlled platform designed to emulate mainstream social media, rather than on real-world platforms, whose policies ban CSIOs.
Data provenance and composition: the study uses data from Kumar et al. [24], described here as a controlled five-day MTurk experiment with 225 U.S. participants. Of the 225 recruited participants, 86 actively engaged with the platform; 32 participated daily and 54 were sporadic. The platform contained 305 accounts total: 225 human accounts and 80 bot accounts. The bot side consisted of four concurrent RL_CSIO campaigns, each controlling 20 bot accounts, and the campaigns targeted four topics: U.S. COVID-19 response adequacy, a 2% wealth tax above $50 million, Medicare for all who want it, and GMO food safety. The excerpt says the authors will release additional data upon publication, but it does not state that code or frozen weights are already public. Kumar et al. [24] originally released only a subset of the data; this paper draws on the broader experiment logs. Preprocessing is mostly behavioral aggregation: the authors derive per-user bot-detection labels from participant reports and compute engagement/exposure ratios from follow/like interactions. For demographic analysis, they stratify participants by gender, age (<35 vs >=35), English proficiency, education, and platform use. For collective-intelligence analysis, they count how many times each account was reported, then estimate conditional bot probabilities by report count.
Model / algorithmic setup: the RL bot controller is taken from [24], so the contribution here is not a new bot generator but an evaluation of its detectability. The RL_CSIO agent uses an actor-critic architecture with three main pieces: a graph convolutional network encoder for the current social-network state; an actor that outputs action probabilities for each controlled bot; and a critic that estimates value for state-action pairs. The state includes graph topology plus account features such as follower counts, posting behavior, sentiment, role, centrality, blocked status, and activity level. The action space includes posting with chosen sentiment, following users, liking posts, remaining inactive, and activating dormant reserve accounts. The reward function is multi-objective: activation rewards for gaining followers, infection rewards for shifting neighbors’ sentiment, termination rewards for full conversion, and penalties for suspension by the detector. The excerpt does not restate the hyperparameters, optimizer, replay scheme, episode lengths, or seed policy for the RL agent, so those details remain inherited from [24] rather than newly specified here.
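To make the inherited architecture concrete, here is a minimal PyTorch sketch of an actor-critic controller with a graph-convolutional encoder, mirroring the components listed above. All layer sizes, the dense-adjacency GCN, and every name are illustrative assumptions; the excerpt does not reproduce the actual RL_CSIO implementation from [24].

```python
# Minimal sketch of the actor-critic controller described above. Dimensions,
# names, and the dense-adjacency GCN are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph convolution over a dense, row-normalized adjacency."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: (N, N) adjacency with self-loops; x: (N, in_dim) node features
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return F.relu(self.lin((adj / deg) @ x))

class ActorCritic(nn.Module):
    def __init__(self, feat_dim: int, hidden: int, n_actions: int):
        super().__init__()
        self.enc1 = GCNLayer(feat_dim, hidden)
        self.enc2 = GCNLayer(hidden, hidden)
        self.actor = nn.Linear(hidden, n_actions)  # per-bot action logits
        self.critic = nn.Linear(hidden, 1)         # per-bot value estimate

    def forward(self, x, adj, bot_idx):
        h = self.enc2(self.enc1(x, adj), adj)      # (N, hidden) node embeddings
        h_bots = h[bot_idx]                        # embeddings of controlled bots
        return F.softmax(self.actor(h_bots), dim=-1), self.critic(h_bots).squeeze(-1)

# Toy forward pass: 305 accounts, 8 behavioral features per account, and 5
# action types (post, follow, like, stay inactive, activate reserve account).
x = torch.randn(305, 8)
adj = (torch.rand(305, 305) < 0.02).float()
adj.fill_diagonal_(1.0)
probs, values = ActorCritic(8, 64, 5)(x, adj, torch.arange(5))
```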
Training and deployment regime: the bots were trained in synthetic environments populated with simulated accounts before deployment, again following [24]. Those simulated users had probabilistic behavior models for content creation, engagement, and inactivity; sentiment drift was updated with exponentially weighted moving averages after exposure, approximating gradual opinion change. During deployment, each CSIO began with five active bot accounts plus access to 15 reserve accounts for replacement after suspensions. The network topology was initialized with realistic follower counts, averaging roughly 3–6 connections per account. For baseline automated detection, the study uses the random-forest bot detector from [15], which periodically scanned all 305 accounts using behavioral features such as posting frequency, follower growth patterns, and network characteristics. The paper excerpt does not provide training details for the human-report aggregation models beyond the fact that several weighting and retraining schemes were tested; it also does not give hyperparameters for the AI detectors beyond identifying them as state-of-the-art traditional ML and LLM-based methods.
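The EWMA update itself is not written out in the excerpt; a standard form consistent with the description, with the smoothing weight $\alpha$ unspecified in the provided text, would be

$$s_u^{(t+1)} = (1-\alpha)\,s_u^{(t)} + \alpha\,\bar{x}_u^{(t)}, \qquad 0 < \alpha < 1,$$

where $s_u^{(t)}$ is simulated user $u$'s sentiment at step $t$ and $\bar{x}_u^{(t)}$ is the mean sentiment of the content $u$ was exposed to at that step.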
Evaluation protocol and statistics: the paper tests 13 hypotheses spanning demographics, temporal learning, engagement/exposure, collective intelligence, and hybridization. Human performance is reported primarily as precision, recall, and F1 on the bot class, with Table 1 giving demographic F1 scores and FDR-corrected permutation-test p-values. Temporal trends are shown in Fig. 1 under two scoring schemes: day-specific (an account counts as a bot if reported on that day) and cumulative (an account counts as a bot if reported on any day up to that day). Engagement and exposure effects are visualized in Fig. 2 and quantified with linear regressions using BER and BXR predictors. Collective intelligence is analyzed in Fig. 3 via the conditional bot probability given report count, with the baseline bot prevalence explicitly set at 0.262 (80/305). The authors state that all p-values are adjusted using Benjamini–Hochberg FDR correction and that statistical significance is assessed at conventional thresholds. They also say they compare human ensembles against three AI detectors spanning traditional machine learning and LLMs, and they study aggregation strategies plus retraining protocols that leverage human supervision, but the excerpt cuts off before the full numeric results for those comparisons appear. As a concrete end-to-end example: a participant who repeatedly likes or follows bot accounts may end up inspecting bot profiles more closely, which the authors capture as higher BER and BXR_like; such participants show higher F1 on the bot class, and the same reports, aggregated across users, push P(bot | k reports) sharply above the 0.262 base rate.
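For reference, a minimal sketch of the statistical machinery the protocol describes, a two-group permutation test followed by Benjamini–Hochberg adjustment, is below; the statistic, permutation count, and grouping are placeholders rather than the authors' exact procedure.

```python
# Sketch of a two-group permutation test on F1 differences with
# Benjamini-Hochberg FDR adjustment. Statistic and permutation count
# are placeholders, not the authors' exact procedure.
import numpy as np

def perm_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([scores_a, scores_b])
    observed = abs(scores_a.mean() - scores_b.mean())
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(pooled[:len(scores_a)].mean() - pooled[len(scores_a):].mean())
        hits += diff >= observed
    return (hits + 1) / (n_perm + 1)

def benjamini_hochberg(pvals):
    """BH step-up adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = p[order] * m / np.arange(1, m + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1].clip(max=1.0)  # enforce monotonicity
    out = np.empty(m)
    out[order] = adj
    return out
```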
Reproducibility and reporting: the paper explicitly notes that additional data will be released upon publication, implying the current arXiv version is not yet accompanied by a complete public release. The excerpt does not mention code availability, frozen model checkpoints, or exact random seeds. Statistical testing is more rigorous than in many social-bot papers: permutation tests and FDR correction are used throughout, and Fig. 1 is summarized with regression slopes rather than eyeballed trends. However, because the excerpt is truncated, the exact implementation details of the hybrid ensemble methods, the retraining protocol, and the three AI baselines are not fully recoverable from the provided text.
Technical innovations
- First controlled evaluation of human detection against RL-based adaptive bots in active CSIOs, rather than static bot accounts.
- Use of report-frequency aggregation as a collective-intelligence signal, showing monotonic increases in P(bot | k reports) from 0.467 at one report to 0.857 at four reports.
- Hybrid human-AI ensemble design that combines participant reports with detector scores and retraining from human supervision, extending prior static-bot workflows.
- Operationalization of engagement and exposure via BER and BXR to distinguish voluntary interaction with bots from passive receipt of bot activity (an illustrative sketch follows this list).
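The excerpt names BER and BXR without spelling out their formulas. A plausible reading, used purely for illustration, treats BER_like as the share of a participant's outgoing likes that target bot accounts and BXR_like as the share of incoming likes that come from bots:

```python
# Illustrative bot-engagement (BER) and bot-exposure (BXR) ratios for likes.
# The paper's exact definitions are not in the excerpt; this sketch assumes
# BER_like = fraction of a user's outgoing likes that target bot accounts,
# BXR_like = fraction of likes the user receives that come from bot accounts.
def like_ratios(likes, bot_ids, user):
    """likes: list of (liker, target) pairs; returns (BER_like, BXR_like)."""
    given = [t for u, t in likes if u == user]      # likes the user gave
    received = [u for u, t in likes if t == user]   # likes the user received
    ber = sum(t in bot_ids for t in given) / len(given) if given else 0.0
    bxr = sum(u in bot_ids for u in received) / len(received) if received else 0.0
    return ber, bxr

# Follow-based BER/BXR would be computed the same way over (follower, followee) pairs.
```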
Datasets
- Kumar et al. five-day DartPost CSIO experiment — 305 accounts (225 humans, 80 bots) and 225 recruited MTurk participants — controlled IRB-approved synthetic social-media study on DartPost.
Baselines vs proposed
- Human participants vs base-rate bot prevalence: precision/recall/F1 reported per subgroup; no single overall benchmark value is provided in the excerpt.
- Random-forest detector from [15]: periodically scanned 305 accounts using behavioral features; no numeric metric is given in the excerpt.
- Three AI detectors (traditional ML + LLMs): compared against human ensemble performance; exact metrics are not included in the excerpt.
- Report-count calibration vs the overall base rate P(bot) = 0.262: P(bot | k=0) = 0.167, P(bot | k=1) = 0.467, P(bot | k=2) = 0.632, P(bot | k=4) = 0.857 (a minimal reproduction sketch follows this list).
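These calibration numbers are straightforward to reproduce from raw report logs; the sketch below shows the aggregation under assumed input shapes (the paper's log format is not in the excerpt).

```python
# Estimate P(bot | k reports) from report logs, as in Fig. 3. The input
# shapes are assumptions; the aggregation follows the paper's description.
from collections import Counter

def p_bot_given_k(report_counts: dict, bot_ids: set, all_accounts: set) -> dict:
    """report_counts maps account id -> number of times that account was reported."""
    total = Counter()  # k -> number of accounts reported exactly k times
    bots = Counter()   # k -> how many of those accounts are bots
    for acct in all_accounts:
        k = report_counts.get(acct, 0)
        total[k] += 1
        bots[k] += acct in bot_ids
    return {k: bots[k] / total[k] for k in sorted(total)}
```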
Limitations
- The excerpt truncates the hybrid-ensemble section, so the exact metrics for human+AI combinations and retraining protocols are not visible here.
- The human study has a relatively small effective engaged sample: only 86 of 225 recruited participants actively used the platform, and subgroup sizes are sometimes very small (e.g., female N=21; master's-or-above N=14).
- The experiment is synthetic and platform-controlled (DartPost), so ecological validity relative to real social platforms is limited even if the setting is realistic.
- The bots are trained to evade an automated detector, not explicitly to evade human judgment; the human-detection results may change under a human-aware adversary.
- The paper reports many significance tests across hypotheses; although FDR correction is used, the study still risks multiple-comparison and subgroup-instability issues.
- The excerpt does not specify the full hyperparameters, seeds, or exact implementations of the compared AI models, which limits exact reproducibility from the provided text alone.
Open questions / follow-ons
- Would the same human-vs-AI-vs-hybrid ranking hold if bots were optimized against human suspicion rather than an automated detector?
- Which specific human signals contribute most to hybrid gains: frequency of reports, reporter reliability weighting, or temporal recency?
- Can the engagement/exposure effects in Fig. 2 be turned into active interventions, such as interface changes that increase careful inspection without overburdening users?
- How stable are these findings across different topics, platform designs, and languages beyond the U.S.-English MTurk sample?
Why it matters for bot defense
For a bot-defense engineer, the core lesson is that adaptive adversaries can make pure classifier-based detection brittle, and human judgment still contains useful signal when it is aggregated carefully. The report-frequency results in Fig. 3 suggest a practical triage strategy: human reports are not evenly informative, but repeated independent reports can be a strong risk feature for escalation, review, or secondary-model retraining.
The hybrid result is especially relevant operationally: if AI scores and human reports are complementary, a production system should treat them as different sensors rather than substitutes. The paper also warns against assuming that more user experience, education, or platform familiarity automatically improves detection. That implies bot-defense teams should test reviewer pools empirically, not by intuition, and should expect performance shifts when adversaries adapt in the wild.
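As an entirely illustrative rendering of that "different sensors" idea (not the paper's hybrid method), a triage model could fuse the detector score and the report count as separate features:

```python
# Illustrative sensor fusion: treat the AI detector score and the human
# report count as separate features of a logistic triage model. This is a
# sketch of the operational idea, not the paper's hybrid ensemble.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 305
is_bot = rng.random(n) < 0.262                      # base rate from the paper
ai_score = np.clip(0.5 * is_bot + rng.normal(0.3, 0.2, n), 0, 1)  # synthetic detector output
report_k = rng.poisson(1.5 * is_bot + 0.3)          # bots attract more reports

X = np.column_stack([ai_score, report_k])
clf = LogisticRegression().fit(X, is_bot)
# Escalate accounts whose fused bot probability exceeds a review threshold.
review_queue = np.where(clf.predict_proba(X)[:, 1] > 0.5)[0]
```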
Cite
@article{arxiv2603_23796,
  title={Human, AI, and Hybrid Ensembles for Detection of Adaptive, RL-based Social Bots},
  author={Valerio La Gatta and Nathan Subrahmanian and Kaitlyn Wang and Larry Birnbaum and V. S. Subrahmanian},
  journal={arXiv preprint arXiv:2603.23796},
  year={2026},
  url={https://arxiv.org/abs/2603.23796}
}