Exposed: Shedding Blacklight on Online Privacy

Source: arXiv:2512.24041 · Published 2025-12-30 · By Lucas Shen, Gaurav Sood

TL;DR

This paper addresses a persistent gap in the online tracking literature: nearly all prior work measures tracking prevalence from a website-centric perspective (what fraction of the top-N sites deploy tracker X?), which systematically underestimates how much individual users actually encounter those trackers during real browsing. The authors correct this by combining a month-long passive metering dataset of 1,132 representative U.S. adults (6.3M web visits, June 2022, sourced from YouGov/RealityMine) with domain-level privacy audits from Blacklight, a tool that programmatically simulates fresh visits and detects seven tracking technologies. This user-centric approach lets them estimate both cumulative and rate-normalized exposure and decompose exposure by demographic group.

The core empirical results are stark. Ad trackers and third-party cookies are effectively universal: 99.6% and 99.4% of users encountered at least one within the month, respectively, and the median user hit ~10K ad trackers and ~12K third-party cookies. More invasive techniques — session recording, keylogging, and canvas fingerprinting — are rarer per visit (~3–6% of visits) but still reach 85–92% of users at least once within the month, with half the sample encountering them within 48 hours. A single organization, overwhelmingly Google (dominant tracker for 89% of users), can observe a median of 54% of any given user's entire browsing history.

Demographic disparities in exposure are real but modest and largely mediated by browsing volume. Education gaps in cumulative exposure largely disappear after rate-normalizing: more educated users browse more, rather than browsing more heavily tracked sites, which explains their higher raw counts. Rate-level disparities by age (older users face higher per-visit tracker rates) and by race (Asian users visit systematically less-tracked sites once volume is controlled) persist, suggesting that site choice (what people browse, not just how much) carries demographic structure with privacy consequences.

Key findings

Ad trackers and third-party cookies are near-universal: 99.6% and 99.4% of 1,132 users encountered at least one in the one-month observation window; 99.1% encountered more than ten of either type (Table 2).
Invasive tracking (session recording, keylogging, canvas fingerprinting) reaches 85–92% of users at least once per month despite appearing on only ~3–6% of individual visits; over 65% of users encountered all three at least ten times (Table 2).
Exposure is rapid: 50% of users encounter an ad tracker or third-party cookie within 12 hours of the start of measurement; within 48 hours, ~79% have hit ad trackers and ~52% have hit canvas fingerprinting (Figure 1).
A single organization — Google in 89% of cases — can observe a median of 54% of a user's entire browsing history (mean 55%, σ=0.16); at the 75th percentile that share is 66% (Figure 3d). Google's reach spans 99.6% of the sample (1,129 of 1,132 users).
Users are tracked by a median of 242 distinct organizations, but exposure is highly concentrated: the median Gini coefficient across organizations' tracking shares is 0.73 (Figure 3b).
Education gaps in cumulative exposure (e.g., college-educated users encounter ~16,090 more third-party cookies, p<0.01) largely vanish after rate-normalization, indicating they stem from more time online rather than visiting more heavily tracked sites (Tables 4–5).
Age disparities persist even after rate-normalization: users 65+ encounter ~2.81 more ad trackers per visit (p<0.001) and ~3.07 more third-party cookie instances per visit than the 18–24 reference group (Table 5), and the age gradient is monotonically increasing (Figure 2).
Demographic variables collectively explain less than 8% of variance in exposure across all models (R² ≤ 0.07, Tables 4–5), underscoring that individual browsing habits dominate demographic predictors.

Threat model

The implicit adversary is the commercial tracking ecosystem: third-party ad networks, analytics providers, data brokers, and large platform operators (Google, Meta, Microsoft, Amazon) who embed JavaScript and pixel tags across first-party websites to collect behavioral data without direct user consent or awareness. Adversary capabilities include: persistent cross-site tracking via third-party cookies; stateless fingerprinting (canvas, WebGL) that survives cookie deletion; session recording and keylogging scripts that exfiltrate fine-grained interaction data; and organizational consolidation that lets a single entity (Google) observe a median 54% of a user's browsing history by aggregating signals across its owned and partnered properties. The adversary is assumed to have read access to any page that embeds their scripts. What the adversary cannot do (within scope of this paper) is access traffic on domains where their scripts are absent, decrypt HTTPS payloads beyond what the embedded script sees, or track activity outside the browser (e.g., mobile apps, offline behavior). The paper does not model active adversaries who might attack the metering panel or Blacklight infrastructure.

Methodology — deep read

Threat model and assumptions: The paper is not a security-attack paper per se; its implicit adversary is the tracking ecosystem itself, meaning the third-party ad networks, analytics providers, and data brokers who embed scripts across millions of sites. The authors assume that Blacklight's detection of a tracking technology on a domain is a valid proxy for that technology being active during a real user visit. A key acknowledged limitation is that Blacklight simulates a clean, stateless browser visit, which may not capture tracker behavior conditional on a returning or logged-in user. Missing data (46.75% of unique domains had no successful Blacklight result) is assumed to be missing completely at random (MCAR), an assumption the authors flag but do not formally test.

Data — browsing panel: YouGov maintains a U.S. adult panel recruited via matched sampling from a synthetic representative frame (Rivers and Bailey 2009). 1,200 panelists volunteered to install RealityMine passive metering software in exchange for rewards, yielding domain-level logs with anonymized URLs and visit timestamps for June 2022. After removing 65 individuals with no activity, one with missing URL metadata, and two with no Blacklight-linked data, the analytic sample is n=1,132. The final dataset covers 6,297,382 visits to 64,074 unique domains. Demographics (gender, race in five categories, education in four levels, age in five bins) were self-reported as part of YouGov panel intake and match the 2022 Current Population Survey on gender, race, education, age, and region. Importantly, the panel is representative of U.S. adults, which is a significant upgrade over the 'top-N websites' sampling frame used in most prior tracking studies.

Data — tracking audits: All 64,074 unique domains were submitted to Blacklight for on-demand scanning. Blacklight uses headless browser automation, network request interception, and behavioral script analysis to detect: (1) ad trackers (via DuckDuckGo's 'Ad Motivated Tracking' blocklist), (2) third-party cookies (via Set-Cookie header analysis), (3) Facebook Pixel, (4) Google Analytics, (5) session recording scripts (behavior + known URL lists), (6) keylogging (by injecting known strings into form fields and monitoring network exfiltration of those strings), and (7) canvas fingerprinting (via canvas element inspection and pixel-level script output analysis). Successful scans covered 34,078 domains (53.2% of unique domains, 75.7% of visits). The higher visit coverage (76%) relative to domain coverage (53%) reflects that heavily visited domains were disproportionately among those successfully scanned. This selection pattern limits the share of browsing affected by missing audits, but it also indicates that missingness is related to domain popularity rather than completely random; it is not formally modeled.
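The gap between domain coverage and visit coverage can be illustrated with a minimal sketch. The log format (one domain string per visit), the function name, and the toy domains are assumptions made for illustration, not the paper's code or data.

```python
from collections import Counter


def coverage(visits, scanned_domains):
    """Return (domain_coverage, visit_coverage) as fractions.

    visits: iterable of domain strings, one entry per visit (hypothetical format).
    scanned_domains: set of domains with a successful audit result.
    """
    counts = Counter(visits)
    if not counts:
        raise ValueError("no visits to summarize")
    # Share of unique domains that have an audit result.
    domains_hit = sum(1 for d in counts if d in scanned_domains)
    # Share of visits that land on a domain with an audit result.
    visits_hit = sum(n for d, n in counts.items() if d in scanned_domains)
    return domains_hit / len(counts), visits_hit / sum(counts.values())


# Toy log: one heavily visited scanned domain plus three rarely visited
# domains, only one of which was scanned successfully.
toy_visits = ["news.example"] * 7 + ["a.example", "b.example", "c.example"]
toy_scanned = {"news.example", "a.example"}
domain_cov, visit_cov = coverage(toy_visits, toy_scanned)
print(f"domain coverage={domain_cov:.2f}, visit coverage={visit_cov:.2f}")
```

In the toy log half of the domains are scanned but four fifths of the visits are covered, the same direction as the 53.2% versus 75.7% split reported above.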

Exposure construction: Two complementary metrics are computed per user per tracker type. Cumulative Exposure (Equation 1) sums the binary tracker-presence indicator for each domain across all visits to that domain. Exposure Rate (Equation 2) divides cumulative exposure by total visit count, giving a per-visit average. This decomposition is methodologically important: it allows the authors to separate 'more exposure because more browsing' from 'more exposure because browsing more-tracked sites.' A concrete example: a user making 5,000 visits to domains averaging 1 third-party cookie each has the same Exposure Rate as a user making 500 visits to similar domains, but 10× higher Cumulative Exposure. The authors also compute an organization-level tracking share (Equation 4): for each user, the fraction of their total visits for which organization j is present as a third-party, using DuckDuckGo Tracker Radar to map ~38,000 third-party domains to ~19,000 parent organizations. A duration-weighted variant (Equation 5) is computed as a robustness check and yields similar results (Section D, not fully reproduced in the truncated text).
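The two exposure metrics can be sketched in a few lines, reproducing the 5,000-visit versus 500-visit example. This is an illustrative sketch, not the authors' implementation: the function name and data layout are hypothetical, the per-domain value may be a 0/1 indicator or a count (the per-visit differences above 1 reported elsewhere in this summary suggest counts for some tracker types), and skipping unaudited domains in the denominator is an assumption.

```python
from collections import Counter


def exposure(visits, tracker_value):
    """Return (cumulative_exposure, exposure_rate) for one user and one tracker type.

    visits: iterable of domain strings, one entry per visit.
    tracker_value: dict mapping domain -> audit value (0/1 indicator or a count).
    Visits to domains missing from tracker_value are skipped (assumption).
    """
    counts = Counter(d for d in visits if d in tracker_value)
    total_visits = sum(counts.values())
    # Cumulative exposure: the domain-level value summed over every visit.
    cumulative = sum(tracker_value[d] * n for d, n in counts.items())
    # Exposure rate: cumulative exposure per visit.
    rate = cumulative / total_visits if total_visits else 0.0
    return cumulative, rate


audit = {"site.example": 1}            # hypothetical audit result
heavy_user = ["site.example"] * 5000   # browses ten times as much
light_user = ["site.example"] * 500

heavy_cum, heavy_rate = exposure(heavy_user, audit)
light_cum, light_rate = exposure(light_user, audit)
print(heavy_cum, light_cum, heavy_rate, light_rate)
```

Both users have the same rate, while the cumulative figure differs by a factor of ten, which is the volume versus site-choice separation the decomposition is meant to provide.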

Demographic modeling: OLS regressions (Equation 6) with Huber-White robust standard errors are run separately for each of the seven tracker types plus the top-organization share, with all demographics as indicator variables (reference: male, White, high school or below, age 18–24). Two outcome versions (cumulative and rate) are estimated for each, yielding 16 regression models. Statistical significance is reported at the p<0.1, p<0.05, and p<0.01 levels, with a Bonferroni correction applied for multiple comparisons across 12 demographic predictors (threshold p < 0.05/12 ≈ 0.0042); the authors report that all coefficients significant at p<0.01 in the unadjusted models survive this correction.
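The Bonferroni threshold follows directly from the predictor count. The sketch below assumes a family-wise alpha of 0.05, which is consistent with the reported threshold; the predictor names and p-values are hypothetical placeholders, not results from the paper.

```python
ALPHA = 0.05  # assumed family-wise error rate

# Indicator predictors after dropping one reference category per variable:
# gender (2 categories -> 1), race (5 -> 4), education (4 -> 3), age (5 -> 4).
N_PREDICTORS = 1 + 4 + 3 + 4


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def survives_correction(p_values, alpha, n_tests):
    """Map each predictor name to True if its p-value is below the threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return {name: p < threshold for name, p in p_values.items()}


threshold = bonferroni_threshold(ALPHA, N_PREDICTORS)

# Hypothetical p-values for illustration only.
example_p = {"age_65_plus": 0.0004, "college": 0.008, "female": 0.03}
flags = survives_correction(example_p, ALPHA, N_PREDICTORS)
print(f"threshold={threshold:.5f}", flags)
```

In this toy input a coefficient with p=0.008 is significant at the unadjusted 0.01 level but falls short of the corrected threshold, which is why the check has to be made coefficient by coefficient.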

Time-to-first-encounter analysis: Using visit timestamps, the authors compute the empirical CDF of time until first encounter with each tracking technology, anchored at 6 PM on May 31 (the measurement start, chosen so that at least 50 users have logged activity). The result is the survival-curve-style Figure 1; its tabulated values show, for example, that 50.1% of users encountered an ad tracker within 12 hours and 79.1% within 48 hours.
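A minimal sketch of the time-to-first-encounter calculation follows. The data layout (a list of timestamp and domain pairs per user), the function names, and the toy panel are hypothetical; the anchor year is taken as 2022 to match the browsing window, and time zones are ignored, both of which are assumptions.

```python
from datetime import datetime, timedelta

START = datetime(2022, 5, 31, 18, 0)  # 6 PM on May 31, the anchor named in the text


def hours_to_first_encounter(user_visits, tracker_present, start=START):
    """Hours from start to the user's first visit to a flagged domain, or None.

    user_visits: list of (timestamp, domain) pairs.
    tracker_present: dict mapping domain -> 0/1 audit indicator.
    """
    hits = [ts for ts, d in user_visits if ts >= start and tracker_present.get(d, 0)]
    if not hits:
        return None
    return (min(hits) - start).total_seconds() / 3600


def share_encountered_within(first_hours, horizon_hours):
    """Empirical CDF at one horizon; never-exposed users stay in the denominator."""
    if not first_hours:
        raise ValueError("empty panel")
    reached = sum(1 for h in first_hours if h is not None and h <= horizon_hours)
    return reached / len(first_hours)


toy_audit = {"tracked.example": 1, "clean.example": 0}
toy_panel = {
    "u1": [(START + timedelta(hours=6), "tracked.example")],
    "u2": [(START + timedelta(hours=2), "clean.example"),
           (START + timedelta(hours=30), "tracked.example")],
    "u3": [(START + timedelta(hours=60), "tracked.example")],
    "u4": [(START + timedelta(hours=1), "clean.example")],
}
first_hours = [hours_to_first_encounter(v, toy_audit) for v in toy_panel.values()]
share_12h = share_encountered_within(first_hours, 12)
share_48h = share_encountered_within(first_hours, 48)
print(first_hours, share_12h, share_48h)
```

Keeping never-exposed users in the denominator means the curve plateaus below 100% for technologies that some users never meet during the window.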

Reproducibility: Replication materials, including code, are publicly available at https://github.com/themains/private_blacklight. The browsing panel data is proprietary to YouGov and not released, and the paper does not state whether the Blacklight scan results for the 34,078 domains are archived. The Blacklight scans were run at a single point in time, so the tracking state of a domain may have changed between the June 2022 visit and the scan date; this temporal mismatch is not quantified.

Technical innovations

User-centric rather than site-centric measurement of tracker exposure: prior work (e.g., Englehardt and Narayanan 2016 OpenWPM studies) measured what fraction of the top-1M sites deploy a tracker; this paper instead aggregates domain-level audit results weighted by actual visit frequency from a representative human panel, producing individual-level exposure distributions.
Decomposition of exposure into cumulative vs. rate-normalized metrics (Equations 1–2) to separate volume-driven from site-choice-driven surveillance risk — a distinction prior survey-based and site-crawl studies cannot make.
Organization-level tracking share computation (Equations 3–4) using DuckDuckGo Tracker Radar to map 38,000+ third-party domains to 19,000+ parent organizations, enabling measurement of browsing-history visibility to consolidated entities rather than individual tracker domains.
Linkage of a nationally representative passive behavioral panel (YouGov/RealityMine) to automated privacy audits (Blacklight), bridging the gap between behavioral data with no tracking metadata and crawler studies with no real-user behavioral signal.
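The organization-level tracking share and its concentration can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the mapping from domain to the set of organizations present as third parties is a hypothetical structure standing in for the Tracker Radar linkage, all visits are counted in the denominator, and the Gini coefficient uses the standard sorted-values formula, which may differ from the authors' exact estimator.

```python
def org_tracking_shares(visits, domain_orgs):
    """Share of a user's visits on which each organization is present.

    visits: list of domain strings, one entry per visit.
    domain_orgs: dict mapping domain -> set of organization names observed
                 as third parties on that domain (hypothetical structure).
    """
    total = len(visits)
    if total == 0:
        raise ValueError("no visits")
    seen = {}
    for domain in visits:
        for org in domain_orgs.get(domain, ()):
            seen[org] = seen.get(org, 0) + 1
    return {org: n / total for org, n in seen.items()}


def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = sum_i (2i - n - 1) * x_i / (n * sum(x)), with x sorted ascending, i from 1.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


toy_visits = ["a.example"] * 6 + ["b.example"] * 3 + ["c.example"]
toy_orgs = {
    "a.example": {"OrgA", "OrgB"},
    "b.example": {"OrgA"},
    "c.example": set(),
}
shares = org_tracking_shares(toy_visits, toy_orgs)
top_org = max(shares, key=shares.get)
concentration = gini(shares.values())
print(top_org, shares[top_org], round(concentration, 3))
```

Because several organizations can be present on the same visit, the shares do not sum to one; the top organization's share is read as the fraction of the user's browsing that a single entity could observe.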

Datasets

YouGov/RealityMine browsing panel — 6,297,382 visits, 64,074 unique domains, 1,132 U.S. adults, June 2022 — proprietary, not publicly released
Blacklight domain scans — 34,078 domains (53.2% of visited domains) — generated by The Markup's Blacklight tool; scan timing relative to June 2022 browsing period is unspecified
DuckDuckGo Tracker Radar — 38,000+ third-party domains mapped to 19,000+ organizations — publicly available at github.com/duckduckgo/tracker-radar

Baselines vs proposed

Site-centric prior work (implied baseline, e.g., Englehardt & Narayanan 2016): tracker prevalence measured as % of crawled sites — user-centric approach shows 99.6% of users encounter ad trackers vs. site-level rates that systematically underestimate individual exposure (exact prior number not quoted in text)
Cumulative exposure — Ad Trackers: mean=27,407 encounters/month (median=9,738) across 1,132 users
Cumulative exposure — Third-Party Cookies: mean=32,325 (median=11,757)
Cumulative exposure — Canvas Fingerprinting: 91.7% of users encounter at least one instance; mean=320 cumulative encounters
Cumulative exposure — Keylogging: 84.9% encounter at least one; mean=309
Cumulative exposure — Session Recording: 89.7% encounter at least one; mean=155
Top-organization tracking share (Google): median=54% of user's total browsing history observable by a single org; Google is dominant tracker for 1,007 of 1,132 users (89%)
Number of tracking organizations per user: median=242, range 1–618

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2512.24041.

Fig 1 (page 26).
Fig 2 (page 26).
Fig 3 (page 26).
Fig 4 (page 27).
Fig 5 (page 27).
Fig 6 (page 27).
Fig 7 (page 28).
Fig 8 (page 28).

Limitations

Temporal mismatch: Blacklight scans are run at a single point in time distinct from the June 2022 browsing period; tracking scripts on domains may have changed between visit and scan, introducing measurement error of unknown direction and magnitude.
MCAR assumption for missing Blacklight results: 46.75% of unique domains (covering ~24.3% of visits) returned no usable Blacklight scan. The assumption that these are missing completely at random is stated but not tested; if small/niche domains are less likely to scan successfully and also less likely to use advanced trackers, exposure estimates are biased upward for common trackers and estimates for rare trackers are uncertain.
Blacklight simulates a fresh, stateless, unauthenticated browser visit; it cannot capture tracking behavior that activates only for logged-in, returning, or profiled users — potentially underestimating keylogging and session recording on sites that deploy these selectively.
Single-month, single-device panel: the June 2022 window may not represent annual browsing patterns (seasonality), and RealityMine meters only one device per panelist, missing cross-device tracking and mobile app behavior.
Low R² values (≤0.07) across all demographic models indicate that measured demographics are poor predictors of individual tracking exposure; the paper interprets this as 'browsing habits dominate,' but omitted variable bias (e.g., browser type, ad-blocker use, VPN use) is not controlled for and could confound demographic coefficients.
No adversarial or evasion analysis: Blacklight's detection methods rely on known tracker lists and behavioral heuristics; trackers that obfuscate their scripts, use first-party proxying, or employ CNAME cloaking would be missed, so the exposure estimates are likely lower bounds with respect to evasive tracking.
Sample limited to U.S. adults who voluntarily installed metering software; self-selection into the panel (privacy-conscious users may opt out) could skew exposure estimates relative to the general population, and results do not generalize to non-U.S. populations with different regulatory environments.

Open questions / follow-ons

How much does Blacklight's stateless-visit simulation underestimate tracker activation rates for authenticated or returning users, and can a credentialed-crawl methodology close this gap?
What fraction of the undetected tracking is attributable to CNAME cloaking and first-party script proxying — techniques known to evade blocklist-based detection — and how does including these change demographic exposure estimates?
Do privacy-protective behaviors (ad blockers, privacy-focused browsers, VPNs) interact with demographic variables to explain residual age and race disparities in per-visit exposure rates, and if so, is differential privacy-tool adoption itself a digital-divide phenomenon?
Given that Google dominates tracking share for 89% of users, how do changes in third-party cookie deprecation (Chrome's Privacy Sandbox rollout post-2022) empirically shift the distribution of dominant-organization tracking shares across the same user population?

Why it matters for bot defense

For bot-defense and CAPTCHA engineers, this paper is directly relevant as an empirical baseline for understanding what tracking signals are available in realistic human-user sessions. The finding that canvas fingerprinting is present on 6% of visits (reaching 91.7% of users over a month) and that per-visit fingerprinting rates are higher for older users and women is operationally significant: it means that canvas-based signals are broadly deployed and routinely present in organic sessions, but also that demographic skews in fingerprint-heavy site visits could introduce bias in behavioral classifiers trained on fingerprint features. If a bot-defense system uses canvas fingerprinting as a high-signal authenticity indicator, the demographic variation in baseline exposure documented here suggests the signal distribution is not uniform across user populations, which has implications for false-positive rates across age and gender cohorts.

The organizational concentration finding — a single entity observing a median 54% of browsing history — is also relevant to cross-domain bot-detection architectures that aggregate signals from multiple publisher properties. The paper confirms empirically that such cross-site linkage is feasible at scale and already practiced commercially, which validates threat models in which sophisticated bots attempt to exploit or evade organization-level tracking graphs. The 48-hour exposure windows (Figure 1) suggest that any 'fresh user' heuristic based on tracker encounter history will have high false-negative rates against bots that cycle identities faster than the typical organic user's first-encounter timeline for invasive trackers.

Cite

bibtex

@article{arxiv2512_24041,
  title={Exposed: Shedding Blacklight on Online Privacy},
  author={Lucas Shen and Gaurav Sood},
  journal={arXiv preprint arXiv:2512.24041},
  year={2025},
  url={https://arxiv.org/abs/2512.24041}
}
