Gender Bias in YouTube Exposure: Allocative and Structural Inequalities in Political Information Environments

Source: arXiv:2604.27479 · Published 2026-04-30 · By Jipeng Tan, Weifeng Zhang, Ye Wu, Jialin Guo, Yong Min

TL;DR

This paper investigates whether YouTube's recommendation algorithm produces gendered asymmetries in political information exposure. Rather than studying real users — where prior preferences are confounded with algorithmic effects — the authors construct 160 controlled social bots (80 male-coded, 80 female-coded) that differ only in their behavioral category signals (e.g., Sports/Gaming vs. Howto & Style), while holding all other conditions constant. This allows them to isolate the algorithm's response to gendered behavioral cues, independent of any pre-existing political preference differences.

The study examines bias along two complementary axes. 'Allocative bias' concerns what content is recommended — specifically differences in issue distribution (using the Comparative Agendas Project taxonomy), ideological lean (Left/Right/Least Biased), and political entity visibility. 'Structural bias' concerns how the recommendation space is topologically organized — whether male-coded and female-coded profiles end up in distinct network communities with different density, modularity, and clustering characteristics. Time-series analysis further tracks whether these structural positions reinforce themselves over time through the exposure→click→re-exposure feedback loop.

The authors find statistically significant allocative differences: male-coded profiles receive more content related to macroeconomic and defense issues, while female-coded profiles receive more content on social and health policy. Structural analysis reveals that the co-exposure network organizes into clearly segregated communities largely aligned with gender-coded profile groups, and lagged regression models show that community-level exposure structures in stage t predict click and re-exposure outcomes in stage t+1 more strongly than out-group structures. A supplementary collaborative-filtering simulation demonstrates that even a weak gender-homophily bias parameter (β) in an otherwise standard cosine-similarity recommender is sufficient to reproduce the observed content and structural differentiation from random initial conditions.

Key findings

LLM-assisted coding (GPT-4o mini) of political video content achieved issue classification accuracy of 89.4% (Cohen's κ = 0.81) and ideology classification accuracy of 93.1% (Cohen's κ = 0.76), with entity F1 = 88.9%, validating the annotation pipeline.
Male-coded profiles received statistically significantly higher proportions of political content related to macroeconomic affairs and national defense/security, while female-coded profiles were directed more toward social policy and health-related political content, as measured across the final 50 recommendation steps of 150-step trajectories.
Co-exposure network modularity (Louvain algorithm, threshold θ=20) was higher in the final-50-step window than the first-50-step window, indicating that structural segregation between male-coded and female-coded communities intensifies over time rather than being a static artifact of initialization.
Lagged regression models show that within-community exposure issue vectors at stage t are more predictive of click issue composition at stage t+1 than within-group-outside-community or out-group vectors, suggesting community boundaries actively channel subsequent behavior.
The collaborative-filtering simulation demonstrates that introducing even a weak gender-homophily weight β > 0 into cosine-similarity-based recommendations is sufficient to generate both allocative content divergence and structural network segregation from random (uniform) initial political preference vectors — no pre-seeded political differences are required.
Shannon entropy analysis of content diversity shows statistically significant differences between male-coded and female-coded profiles in both overall category entropy and structural diversity (three-class: interest/political/other), with results robust across alternative terminal windows (last 30, 40, 50, 60, 70 steps).
Entity extraction from video titles reveals asymmetric visibility of political actors: male-coded profiles show higher co-occurrence of entities associated with economic and foreign-policy figures, while female-coded profiles show higher prevalence of entities associated with social/healthcare policy, reinforcing the issue-distribution findings.
Jaccard overlap analysis at the video-ID level confirms that accounts within the same detected community share more specific video IDs than accounts in the same profile group but outside the community, indicating that community structure — not just gender coding — is the proximate organizing mechanism.

Threat model

The implicit adversary is the platform recommendation system itself, modeled as an entity that ingests user behavioral signals and outputs ranked content. The 'adversarial capability' is the algorithm's ability to read gendered behavioral cues (category watch history distribution, dwell time, click patterns) and route users into structurally differentiated information environments without explicit demographic knowledge. The algorithm is assumed to have full access to within-session behavioral signals but not to ground-truth demographic identity (consistent with YouTube's stated data practices). What the algorithm cannot do, by the experimental design's assumptions, is distinguish between bots and real users based on the controlled behavioral signals — a significant assumption that the authors acknowledge but cannot fully validate. The study does not consider adversarial users attempting to manipulate the algorithm, nor does it model a human adversary exploiting the bias for disinformation purposes; the 'threat' is systemic and structural rather than intentional.

Methodology — deep read

Threat model and assumptions: The adversary in this context is the platform's recommendation algorithm itself, conceptualized as an entity that may amplify social inequalities. The authors assume YouTube's algorithm responds to behavioral signals (watch history category distribution, dwell time, click patterns) and that gendered behavioral categories as defined by prior audience demographic research (Thelwall & Foster 2021; OpenSlate data) serve as valid proxies for gender-coded behavioral cues. Crucially, the design assumes that holding all non-behavioral features constant (account creation time, operating environment, viewing protocol) isolates the algorithm's differential treatment. The study does not claim to reverse-engineer YouTube's algorithm; rather, it treats the algorithm as a black box and measures its outputs.

Bot construction and behavioral profile operationalization: 160 accounts were created — 80 male-coded, 80 female-coded. Male-coded profiles were assigned behavioral interest categories of Autos & Vehicles, Sports, and Gaming; female-coded profiles were assigned Howto & Style and People & Blogs. These category assignments are grounded in published YouTube audience gender-skew data. All accounts also shared a 'dynamic preference' category of News & Politics. All other account parameters (creation timestamp, geographic/IP environment, interaction protocol) were held identical across groups. This is the core experimental control: any observed recommendation differences must therefore be attributable to the behavioral category signal rather than confounding account-level features.

Training phase: Each account watched 10 pre-selected 'seed videos' at an 80:20 ratio — 80% from the gendered interest category, 20% from News & Politics. Video completion rate was enforced at >80%, with a maximum dwell time cap of 2,700 seconds per video to avoid triggering anomaly detection. This phase was designed to activate YouTube's user profiling mechanism and seed the collaborative-filtering model with a differentiated behavioral history before the measurement phase begins.

Testing phase and data collection: Each account then executed 150 consecutive interaction steps along the platform's recommendation stream. At each step, the automated program recorded all recommended videos and clicked according to a fixed 80/20 rule (80% gendered-interest, 20% News & Politics). Inter-click time intervals were sampled from a power-law distribution (α = −1) to approximate realistic bursty browsing behavior. Data collected per step: recommended video metadata (title, category, channel), clicked video IDs, and dwell time. The full 150-step trajectories, as well as the final 50 steps (the primary analytical window), were retained for analysis. Robustness checks used alternative windows of last 30, 40, 60, and 70 steps.
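The bursty timing described above can be sketched with inverse-transform sampling. This is a minimal illustration, not the authors' scheduler: a density proportional to x^(-1) is only normalizable on a bounded interval, so the bounds `x_min` and `x_max` below are hypothetical values chosen for the example, and the function name is likewise an assumption.

```python
import random

def sample_inter_click_seconds(rng, x_min=5.0, x_max=2700.0):
    """Draw one waiting time from p(x) proportional to 1/x on [x_min, x_max].

    The CDF is ln(x / x_min) / ln(x_max / x_min), so inverting it gives
    x = x_min * (x_max / x_min) ** u for u uniform on [0, 1).
    x_min and x_max are hypothetical bounds, not values from the paper.
    """
    u = rng.random()
    return x_min * (x_max / x_min) ** u

# One waiting time per step of a 150-step trajectory, seeded for repeatability.
rng = random.Random(0)
waits = [sample_inter_click_seconds(rng) for _ in range(150)]
```

Because the density is flat in log-space, most draws are short and a few are very long, which is the bursty pattern the protocol aims for.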

Allocative bias measurement: Political video identification used YouTube's native category tag (News & Politics). For the 50-step analytical window, three measurements were computed per account: (1) proportion of political content (simple frequency); (2) Shannon entropy over all YouTube categories for overall diversity; (3) Shannon entropy over three aggregated classes (interest/political/other) for structural diversity. Political videos were then coded using GPT-4o mini for issue (CAP taxonomy, 21 categories), ideology (Left/Right/Least Biased), and named entity extraction from titles. A 10% random sample was manually coded to validate LLM outputs, yielding κ = 0.81 for issues and κ = 0.76 for ideology. Each account was then represented as an issue-distribution vector, and cosine similarity was used to measure within-group vs. cross-group similarity. Group-level differences are reported as statistically significant, but the specific tests and test statistics (t-tests, Mann-Whitney, etc.) are not enumerated in the truncated text, which is a mild transparency gap.
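The per-account measures in this step are standard quantities and can be sketched as follows. The function names, the use of log base 2, and the toy inputs are assumptions for illustration; the summary does not state which logarithm base the authors used.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits, an assumed base) of a list of category labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def issue_vector(issue_labels, taxonomy):
    """Share of an account's political videos falling in each taxonomy issue."""
    counts = Counter(issue_labels)
    total = len(issue_labels)
    return [counts[issue] / total if total else 0.0 for issue in taxonomy]

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy example with a hypothetical three-issue taxonomy.
taxonomy = ["macroeconomics", "defense", "health"]
account_a = issue_vector(["macroeconomics", "defense", "defense"], taxonomy)
account_b = issue_vector(["health", "health", "macroeconomics"], taxonomy)
similarity = cosine_similarity(account_a, account_b)
```

Averaging `cosine_similarity` over same-group pairs and over cross-group pairs gives the within-group vs. cross-group comparison described above.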

Structural bias measurement: A co-exposure network was constructed with accounts as nodes and edge weights equal to the number of shared recommended video IDs (Equation 3). Edges were retained only when weight exceeded threshold θ = 20 (robustness checks at θ = 10, 15, 25, 30). Network density (Equation 4), weighted clustering coefficient (Equation 5), and weighted modularity via the Louvain algorithm (Equation 6) were computed. Community detection outputs were then used to classify each account into a community, and the dominant issue/ideological composition of each community was profiled. To test dynamic reinforcement, the 150 steps were divided into consecutive stages, and lagged regression models (Equations 8 and 9) regressed stage t+1 click issue shares on stage t exposure issue shares at four structural levels (self, within-community, within-group-outside-community, out-group), and vice versa for exposure on prior clicks, with issue, stage, and group fixed effects. Jaccard overlap (Equation 7) at video-ID level supplemented the issue-vector cosine analysis.
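The network construction above can be sketched in plain Python. This is an illustration under stated assumptions rather than the paper's Equations 3 to 7: density is computed here on the unweighted thresholded graph, and `weighted_modularity` only scores a partition that is supplied to it using the standard Newman formula, whereas Louvain searches for the partition that maximizes that score. All function names and the toy data are hypothetical.

```python
from itertools import combinations

def co_exposure_edges(recs, theta=20):
    """recs maps account -> set of recommended video IDs.
    Edge weight is the number of shared IDs; keep edges with weight > theta."""
    edges = {}
    for a, b in combinations(sorted(recs), 2):
        weight = len(recs[a] & recs[b])
        if weight > theta:
            edges[(a, b)] = weight
    return edges

def density(n_nodes, edges):
    """Fraction of possible account pairs that are connected."""
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible if possible else 0.0

def jaccard(a, b):
    """Jaccard overlap of two video-ID sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def weighted_modularity(edges, community):
    """Newman weighted modularity Q for a given node -> community mapping."""
    m = sum(edges.values())
    if m == 0:
        return 0.0
    within = 0.0
    comm_strength = {}
    for (a, b), w in edges.items():
        if community[a] == community[b]:
            within += w
        # Each edge adds its weight to the strength of both endpoints.
        for node in (a, b):
            c = community[node]
            comm_strength[c] = comm_strength.get(c, 0.0) + w
    return within / m - sum((s / (2 * m)) ** 2 for s in comm_strength.values())

# Toy data: two accounts per group, with a low threshold to suit the tiny sets.
recs = {
    "m1": {"v1", "v2", "v3"}, "m2": {"v1", "v2", "v3"},
    "f1": {"v7", "v8", "v9"}, "f2": {"v7", "v8", "v9", "v1"},
}
community = {"m1": "male", "m2": "male", "f1": "female", "f2": "female"}
edges = co_exposure_edges(recs, theta=2)
q = weighted_modularity(edges, community)
```

Comparing `jaccard` for pairs inside a detected community against pairs in the same profile group but outside it mirrors the video-ID overlap check described above.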

Collaborative-filtering simulation (Study 2): A population of agents with gender labels was initialized with 21-dimensional issue-salience vectors drawn uniformly at random and normalized to sum to 1 — explicitly removing any initial political preference differences between gender groups. At each time step, the system computed a composite similarity score between agent pairs as cosine similarity plus β·δ(gender_i, gender_j), where β is the gender-homophily weight (Equation 11). Content was then recommended via softmax sampling over similarity scores, and agents' state vectors were updated to incorporate recommended content (self-reinforcement). The simulation was run across a range of β values to identify the threshold at which allocative and structural divergence emerges. This is explicitly framed as a mechanistic illustration, not a claim about YouTube's actual implementation.
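A rough sketch of this kind of simulation is given below. Only the composite score (cosine similarity plus β when two agents share a gender label), the uniform 21-dimensional initialization, and softmax sampling come from the description above. The update rule, the `rate` and `temperature` parameters, the population size, and the choice to recommend one issue drawn from the sampled peer's vector are all assumptions made to keep the example runnable, not the paper's specification.

```python
import math
import random

N_ISSUES = 21  # matches the 21 issue dimensions described above

def random_issue_vector(rng):
    """Uniform random salience vector, normalized to sum to 1."""
    raw = [rng.random() for _ in range(N_ISSUES)]
    total = sum(raw)
    return [x / total for x in raw]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def composite_similarity(i, j, vectors, genders, beta):
    """Cosine similarity plus beta when the two agents share a gender label."""
    same = 1.0 if genders[i] == genders[j] else 0.0
    return cosine(vectors[i], vectors[j]) + beta * same

def softmax_choice(rng, scores, temperature):
    """Sample an index with probability proportional to exp(score / temperature)."""
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

def simulate(n_agents=40, beta=0.1, steps=50, rate=0.1, temperature=0.1, seed=0):
    """Hypothetical update loop: each agent drifts toward content from a sampled peer."""
    rng = random.Random(seed)
    genders = ["m" if i < n_agents // 2 else "f" for i in range(n_agents)]
    vectors = [random_issue_vector(rng) for _ in range(n_agents)]
    for _ in range(steps):
        for i in range(n_agents):
            others = [j for j in range(n_agents) if j != i]
            scores = [composite_similarity(i, j, vectors, genders, beta) for j in others]
            peer = others[softmax_choice(rng, scores, temperature)]
            # Recommend one issue drawn from the peer's salience vector.
            issue = rng.choices(range(N_ISSUES), weights=vectors[peer], k=1)[0]
            # Self-reinforcement: shrink the old vector and add mass to that issue,
            # which keeps the vector normalized to 1.
            updated = [(1 - rate) * x for x in vectors[i]]
            updated[issue] += rate
            vectors[i] = updated
    return genders, vectors

genders, vectors = simulate(n_agents=10, beta=0.2, steps=5, seed=1)
```

Sweeping `beta` from 0 upward and comparing within-group against cross-group similarity of the final vectors would be one way to look for the divergence the paper describes. Whether and where it appears in this toy version depends on the assumed parameters.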

Reproducibility concerns: The paper does not mention public code release or a frozen dataset. The bots were run on live YouTube infrastructure, meaning the experiment is not trivially reproducible (platform behavior may have changed; account creation at scale may violate ToS). The GPT-4o mini prompts used for coding are not provided in the truncated text, which limits replication of the annotation pipeline.

Technical innovations

Two-dimensional audit framework distinguishing allocative bias (content distribution differences) from structural bias (network topology differences) in recommendation systems, extending prior work that focused exclusively on output-level content disparities.
Co-exposure network construction using shared video-ID overlap as edge weights (Equation 3), enabling social network analysis of recommendation trajectories across accounts — a methodological departure from single-account or aggregate content audits.
Lagged regression framework (Equations 8–9) with four-level structural reference vectors (self, community, within-group-outside-community, out-group) to empirically test whether community-level exposure at one stage predicts clicks and re-exposures at the next stage, rather than merely correlating with them in a single cross-section.
Collaborative-filtering simulation showing that a single weak gender-homophily parameter β superimposed on standard cosine-similarity recommendations is sufficient to generate both content-level and structural divergence from randomized initial conditions, providing a minimal mechanistic explanation without requiring access to YouTube's internals.
LLM-assisted hybrid coding pipeline using GPT-4o mini with 10% manual validation for multi-label political content classification (CAP issues, ideology, named entities) at scale across bot-generated recommendation trajectories, achieving κ = 0.81 for issues, κ = 0.76 for ideology, and F1 = 88.9% for entities.
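The κ values quoted for the annotation pipeline are Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch of the standard formula follows; the function name and the toy labels are illustrative and are not taken from the paper's validation sample.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each rater's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy example: hypothetical human vs. LLM ideology codes for four videos.
human = ["Left", "Left", "Right", "Right"]
model = ["Left", "Left", "Right", "Left"]
kappa = cohens_kappa(human, model)  # p_o = 0.75, p_e = 0.5
```

Raw accuracy and kappa can diverge sharply when one label dominates, which is why the summary reports both.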

Datasets

YouTube recommendation trajectories (bot-generated) — 160 accounts × 150 steps = 24,000 recommendation events — non-public, collected by authors via controlled bot experiment on live YouTube platform
Seed video corpus — 10 videos per account × 160 accounts, drawn from predefined gender-coded and News & Politics categories — non-public, curated by authors
Comparative Agendas Project (CAP) issue taxonomy — 21 core issue categories — public codebook used as annotation schema (https://www.comparativeagendas.net/)

Baselines vs proposed

Male-coded vs. female-coded profiles (issue distribution cosine similarity): within-group similarity > cross-group similarity — exact values not extractable from truncated text
Co-exposure network modularity, first-50 steps vs. last-50 steps: modularity increases over time indicating structural segregation strengthens — exact Q values not reported in truncated text
Collaborative-filtering simulation, β=0 (no gender homophily) vs. β>0 (weak homophily): β=0 produces no group-level divergence; β>0 sufficient to reproduce observed content and structural differentiation — specific β threshold value not extractable from truncated text
Lagged regression: within-community exposure vector (β2 coefficient) > out-group exposure vector (β4 coefficient) in predicting subsequent click issue composition — exact coefficient magnitudes not extractable from truncated text

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.27479.

Fig 1 (page 1).

Limitations

Bot behavioral profiles are constructed from category-level gender skew data (Thelwall & Foster 2021; OpenSlate), which reflects population-level averages and may not represent the diversity of real male or female users; the mapping from category preferences to 'gender coding' is a simplification that conflates gender identity with platform behavior patterns.
The experiment runs on live YouTube infrastructure with no code or dataset release mentioned, making direct replication impossible; platform algorithm updates between the experiment date and publication would make re-running the exact experiment non-comparable.
The 160-account sample (80 per group) is modest for network analysis; community detection via Louvain is sensitive to resolution parameters and network sparsity. The paper varies the edge threshold θ (10 to 30) but does not report confidence intervals or permutation tests on modularity scores, so it remains unclear how far the observed community structure depends on threshold choice.
Causal identification is limited: while the controlled design removes pre-existing preference confounds, the 80/20 click rule is deterministic and may not reflect real user decision variance; real users exhibit heterogeneous click propensities that could amplify or dampen the observed biases.
The GPT-4o mini annotation prompts are not provided in the available text, and inter-rater reliability is assessed only on a 10% random sample; it is unclear whether the sample was stratified by issue category or profile group, leaving open the possibility of systematic annotation errors in underrepresented categories.
The collaborative-filtering simulation is a stylized model (21-dimensional uniform initialization, simple softmax) that demonstrates sufficiency of a weak homophily signal but does not constitute evidence that this is YouTube's actual mechanism; alternative mechanisms (e.g., engagement-rate optimization, advertiser targeting) could produce identical observed patterns.
The study is US/English-language-centric by design (YouTube political content, CAP taxonomy) and does not test whether findings generalize to other language contexts or platform regions where gender-category associations may differ.

Open questions / follow-ons

Would the observed structural bias persist or intensify with longer interaction horizons (e.g., 500+ steps), and is there an equilibrium at which gender-coded communities fully decouple in the co-exposure network?
Do other demographic behavioral proxies (age-coded, race-coded via content category associations) produce analogous structural segregation, and do multiple intersecting behavioral signals compound or attenuate each other's effects in the recommendation network?
Can algorithmic interventions — such as diversity-constrained re-ranking or cross-community exposure injection — break the feedback loop identified in the lagged regression models without degrading engagement metrics, and at what β threshold does the homophily signal become too weak to sustain structural segregation?
The simulation uses a single β parameter for gender homophily; real collaborative filtering systems have heterogeneous user-item embeddings — how does the embedding geometry (dimensionality, initialization, update rule) interact with even weak demographic signals to produce or suppress structural bias at scale?

Why it matters for bot defense

For bot-defense engineers, this paper is a double-edged data point. On one hand, it suggests that carefully constructed social bots with controlled behavioral profiles can complete 150-step interaction trajectories on YouTube with power-law-distributed inter-click timing: the bots activated YouTube's profiling mechanism and received differentiated recommendation outputs. The authors acknowledge they cannot fully validate that the platform treated the bots as real users, so this is indirect evidence that behavioral mimicry (realistic dwell times, category-consistent click ratios, bursty timing distributions) can be enough to avoid session-level bot detection, at least for research-scale deployments. Bot-defense teams should note that the detection signal here would need to come from cross-account behavioral correlation (all 160 bots share identical non-behavioral parameters and were presumably created in a coordinated fashion) rather than from single-session anomaly detection.

From a platform integrity perspective, the finding that a weak gender-homophily signal in collaborative filtering can generate measurable structural segregation has implications for how recommendation audits should be designed. Standard A/B testing on individual accounts will miss structural effects that only manifest at the network level across accounts. Bot-defense infrastructure that monitors for coordinated inauthentic behavior (CIB) could, in principle, also be repurposed to detect whether recommendation outputs are clustering in ways that suggest demographic steering — though this would require access to cross-account recommendation logs that are typically siloed. The paper's co-exposure network methodology is directly applicable as an audit tool that bot-defense and trust-and-safety teams could run internally on anonymized recommendation logs to surface structural bias before it becomes a regulatory or reputational liability.

Cite

@article{arxiv2604_27479,
  title={Gender Bias in YouTube Exposure: Allocative and Structural Inequalities in Political Information Environments},
  author={Jipeng Tan and Weifeng Zhang and Ye Wu and Jialin Guo and Yong Min},
  journal={arXiv preprint arXiv:2604.27479},
  year={2026},
  url={https://arxiv.org/abs/2604.27479}
}
