Detecting Bot Detection: Prevalence, Techniques, and Implications for Web Measurement Research

Source: arXiv:2606.14525 · Published 2026-06-12 · By Ralf Gundelach, Michael Mühlhauser, Dominik Herrmann

TL;DR

This paper investigates the overlooked impact of bot detection and blocking on large-scale web measurement research that relies heavily on automated browser frameworks. While prior work has identified the presence of bot detection scripts, this study quantifies actual blocking-induced sample loss and its prevalence across popular websites. Through a literature survey of 81 recent top-tier security, privacy, and web measurement papers, the authors find that 83% omit any discussion or quantification of bot detection blocking, revealing a serious transparency gap that threatens the validity of empirical web measurements.

To fill this gap, the authors run a large-scale measurement study on the top 10,000 websites from the Tranco Top 1M list, crawling each site in four browser configurations (Chromium and Firefox, each headless and headed) for a total of 40,000 page visits. They find that 82% of observed blocking is attributable to bot detection, predominantly from CDN providers with integrated bot detection services such as Cloudflare (37% block rate) and Akamai (26%). Chromium headless suffers a particularly high block rate of 15%, roughly double the 7% average of the other three configurations. By instrumenting and intercepting JavaScript API accesses, the paper develops a taxonomy of detection techniques, and a header spoofing experiment shows that header-level signals alone account for 75% of Chromium-headless-only blocks. JavaScript probing for automation artifacts is nonetheless more widespread than block rates suggest: 46% of crawled sites probe such properties, indicating that much probing goes unnoticed in block-rate statistics and that blocking is selective.

The results reveal systematic, provider-correlated sample loss in automated crawling experiments that almost all prior work fails to measure or report. This introduces a meaningful threat to the integrity and generalizability of web measurement studies relying on automated crawlers. The paper’s dataset, code, and literature coding are publicly available, facilitating further research.

Key findings

83% of 81 surveyed top-tier web measurement papers omit any discussion of bot detection blocking (C7b).
The Chromium headless condition experiences a 15% soft block rate vs. a 7% average across the other three browser/display configurations.
82% of blocking events are attributable to bot detection: 59% confirmed by vendor headers, 23% inferred from condition-dependent blocking.
Cloudflare and Akamai dominate bot detection-driven blocking at 37% and 26% block rates respectively on their hosted sites in the top 10K.
Header spoofing experiments show 75% of Chromium-headless-only blocks arise solely from HTTP header signals.
46% of crawled sites probe JavaScript properties associated with automation detection, indicating widespread environment probing beyond blocking.
Among 1 million domains in Tranco, 40% of resolved A records belong to Cloudflare infrastructure, which includes default bot detection on free and paid tiers.
Only 5% of surveyed papers explicitly quantify or discuss bot detection-related blocking in published web measurement studies.

Threat model

The adversary is a website operator deploying automated bot detection infrastructure to distinguish and block automated browser crawlers used by researchers or scrapers. The adversary can inspect HTTP headers, probe the in-browser JavaScript environment (e.g., fingerprint automation-specific artifacts), and return blocking responses. The adversary cannot tamper with the research crawler’s internal instrumentation or affect the crawl outside standard web interactions. The model also excludes more advanced countermeasures, such as adversarial JavaScript or dynamic puzzle challenges, beyond standard blocking techniques.

Methodology — deep read

The authors first fix the setting: automated research crawlers visit popular websites and may be blocked by site operators through bot detection. The adversary is the website or server deploying bot detection infrastructure, which probes browser artifacts and network metadata to identify and block automated browsers. The crawlers use no mitigation or stealth beyond what the default automation frameworks provide.

Data provenance involves the Tranco Top 1 Million domain list snapshot from February 25, 2026. From this list, they sample the top 10,000 domains, deterministically shuffled with a documented seed to ensure reproducibility. Each domain was visited under four distinct automated browser configurations: Chromium headless, Chromium headed, Firefox headless, and Firefox headed, producing 40,000 total page visits collected between February 27 and March 3, 2026.
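The sampling step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the seed value, the function name `build_crawl_plan`, and the condition labels are assumptions made for the example (the paper documents a seed, but its value is not given in this summary).

```python
import random
from itertools import product

# Placeholder seed: the paper documents its own seed, which is not reproduced here.
SHUFFLE_SEED = 12345
BROWSERS = ("chromium", "firefox")
MODES = ("headless", "headed")


def build_crawl_plan(ranked_domains, top_n=10_000, seed=SHUFFLE_SEED):
    """Return (domain, browser, mode) visits for the top_n domains.

    The domain order is shuffled with a dedicated, seeded RNG so that the
    same input list and seed always yield the same crawl order.
    """
    domains = list(ranked_domains[:top_n])
    random.Random(seed).shuffle(domains)  # does not touch the global RNG state
    return [(d, b, m) for d in domains for b, m in product(BROWSERS, MODES)]
```

With 10,000 input domains and the four browser/mode combinations, the plan contains 40,000 visits, matching the total reported above.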

To detect bot detection activity, the authors developed novel JavaScript instrumentation at page load that uses recursive proxies and prototype patching to intercept all property accesses and function calls on the window object and related prototypes, across 132 JavaScript properties linked to automation detection or fingerprinting. Accesses to these properties were recorded with full property paths, operation types, timestamps, and contextual script metadata. They further deployed honeypot bait properties injected invisibly to detect attempts to probe automation-specific signals.

The crawl used Playwright 1.57.0 with Chromium 143.0 and Firefox 144.0, running with default Playwright automation settings and unmodified User-Agent strings. IP addresses were rotated via batch VM snapshots on Hetzner Cloud infrastructure in Nuremberg, Germany, to randomize addresses and limit confounds from IP reputation. The crawl protocol was: navigate to the page with a 30 s timeout, wait until no new JS property access had occurred for 2 s (capped at 30 s), then capture a screenshot and end the scan.
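The quiescence wait in this protocol (stop once no new property access has been seen for 2 s, or after 30 s at most) can be sketched as a polling loop. This is a minimal sketch under assumptions: `wait_for_quiescence`, the polling interval, and the `last_access_time` callback are hypothetical names, and the authors' implementation may differ. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_for_quiescence(
    last_access_time: Callable[[], float],
    quiet_s: float = 2.0,
    max_s: float = 30.0,
    poll_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "quiet" once no access was logged for quiet_s seconds,
    or "timeout" if max_s seconds elapse first.

    last_access_time() must return the clock reading of the most recent
    recorded property access (any earlier value if none has occurred).
    """
    start = clock()
    while True:
        now = clock()
        if now - start >= max_s:
            return "timeout"
        # Before the first access is seen, measure quiet time from the start.
        last = max(last_access_time(), start)
        if now - last >= quiet_s:
            return "quiet"
        sleep(poll_s)
```

A page that never touches the instrumented properties ends the wait after 2 s, while a page that probes continuously is cut off at the 30 s cap.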

Blocking detection was performed by analyzing HTTP response statuses for soft block codes (403, 429, 503) and by correlating detected automation signals with blocking outcomes. Vendor confirmation of blocking was based on embedded headers identifying Cloudflare, Akamai, or other services. Header spoofing experiments were conducted on March 3, 2026, altering User-Agent headers on Chromium headless to determine the causal role of header-level signals in triggering blocks.
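The status-code and vendor-header checks described above can be sketched as follows. The status set comes from the text; the specific header names (`Server: cloudflare`, `cf-mitigated`, `Server: AkamaiGHost`) are assumptions chosen for illustration, since this summary does not list the headers the authors matched on.

```python
from typing import Mapping, Optional

SOFT_BLOCK_STATUSES = frozenset({403, 429, 503})


def detect_vendor(headers: Mapping[str, str]) -> Optional[str]:
    """Guess the bot detection vendor from response headers.

    The header names below are illustrative assumptions, not the
    authors' documented matching rules.
    """
    h = {k.lower(): v.lower() for k, v in headers.items()}
    server = h.get("server", "")
    if "cf-mitigated" in h or "cloudflare" in server:
        return "cloudflare"
    if "akamaighost" in server:
        return "akamai"
    return None


def classify_response(status: int, headers: Mapping[str, str]) -> str:
    """Label a response as not blocked, vendor-confirmed, or unattributed."""
    if status not in SOFT_BLOCK_STATUSES:
        return "not_blocked"
    vendor = detect_vendor(headers)
    return f"blocked:{vendor}" if vendor else "blocked:unattributed"
```

A block without a recognizable vendor header stays unattributed at this stage; attributing it to bot detection then depends on whether the block varies by browser condition.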

Evaluation used paired within-site comparisons across the four conditions, controlling for IP and temporal effects. Reported metrics include overall block rates per condition, the fraction of blocks attributable to bot detection, the prevalence of specific JS detection signal accesses, and the distribution across CDN providers. The literature survey coded 81 papers on 12 reproducibility and bot detection criteria with high inter-rater agreement (Cohen’s κ=0.90).
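A paired within-site comparison can be sketched as a per-condition block rate plus a per-site label that flags condition-dependent blocking, the signal used to infer bot detection when no vendor header is present. The data layout, condition labels, and function names here are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Mapping

CONDITIONS = (
    "chromium-headless",
    "chromium-headed",
    "firefox-headless",
    "firefox-headed",
)


def block_rates(results: Mapping[str, Mapping[str, bool]]) -> Dict[str, float]:
    """results[site][condition] is True when that visit was blocked."""
    if not results:
        return {cond: 0.0 for cond in CONDITIONS}
    blocked: Counter = Counter()
    for per_condition in results.values():
        for cond in CONDITIONS:
            blocked[cond] += bool(per_condition[cond])
    return {cond: blocked[cond] / len(results) for cond in CONDITIONS}


def site_pattern(per_condition: Mapping[str, bool]) -> str:
    """Classify one site by how its blocking varies across conditions."""
    outcomes = [bool(per_condition[cond]) for cond in CONDITIONS]
    if not any(outcomes):
        return "never_blocked"
    if all(outcomes):
        return "always_blocked"
    # Blocked in some configurations only: the block depends on the client.
    return "condition_dependent"
```

Because every site is visited under all four conditions, a site that blocks only some of them points to the client configuration rather than the site's general availability.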

The authors release code, datasets, and literature coding notebooks publicly but exclude site screenshots for copyright reasons. The detailed instrumentation approach enables causal attribution of blocking to specific signals and vendor services, supporting robust inferences about blocking prevalence and its impact on measurement validity.

One example end-to-end: For each site under Chromium headless, the injected instrumentation proxies intercept all JS accesses before page scripts run, recording every detected bot-detection probe such as accesses to navigator.webdriver or window._selenium. When the site returns a 403 block with Cloudflare headers, the experiment identifies this as bot detection-induced blocking related to the Chromium headless automation fingerprint. The spoofing experiment then modifies the User-Agent header to verify that this header-level signal alone triggers the block, confirming the causal mechanism.
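The spoofing step in this example can be sketched as two small functions: one that rewrites the User-Agent and one that computes how many headless-only blocks disappear afterwards. This is a sketch under assumptions: it presumes the telltale token is `HeadlessChrome` in the User-Agent string, which this summary does not confirm as the authors' exact modification, and both function names are hypothetical.

```python
from typing import Set


def spoof_user_agent(user_agent: str) -> str:
    """Replace the headless marker with the regular Chrome product token.

    Assumption: the headless build announces itself as "HeadlessChrome".
    """
    return user_agent.replace("HeadlessChrome/", "Chrome/")


def header_only_fraction(headless_only_blocked: Set[str],
                         still_blocked_after_spoof: Set[str]) -> float:
    """Fraction of headless-only blocks that vanish once headers are spoofed.

    A block that disappears after changing only the header is attributed
    to header-level signals.
    """
    if not headless_only_blocked:
        return 0.0
    resolved = headless_only_blocked - still_blocked_after_spoof
    return len(resolved) / len(headless_only_blocked)
```

If three of four sites that blocked only Chromium headless stop blocking after the rewrite, the function returns 0.75, which is the shape of the 75% figure reported in the paper.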

Technical innovations

Comprehensive JavaScript instrumentation using recursive proxies and prototype patching to intercept all property accesses related to bot detection without relying on predefined API lists.
Taxonomy of bot detection signals categorized into three confidence tiers based on specificity to automation detection (no legitimate use, strong indicator, general fingerprinting).
Empirical measurement of bot detection impact on crawling focusing on blocking outcomes rather than mere detection script presence, across 10,000 sites and four browser configurations.
A causal header spoofing experiment isolating the contribution of HTTP header signals to blocking, establishing that 75% of Chromium-headless-only blocks arise from header-level signals.

Datasets

Tranco Top 1 Million (Feb 25, 2026 snapshot) — 1,000,000 domains — public
Top 10,000 subset of Tranco Top 1M for active crawling experiment — 10,000 domains — public

Baselines vs proposed

Chromium headless block rate: 15% vs other conditions average block rate: 7%
Cloudflare-hosted sites block rate: 37% vs overall average block rate of sites under study
Akamai-hosted sites block rate: 26% vs providers without default-on bot detection range: 0–16%
Web measurement papers reporting blocking: 5% vs omitting: 83%

Limitations

The study was conducted from a single datacenter IP range (Hetzner, Germany), which introduces a potential confound from IP reputation; results from residential IPs or other regions might differ.
Only four browser configurations (Chromium/Firefox, headless/headed) were tested, leaving out other browsers or evasion techniques.
No active evasion or stealth techniques were applied beyond default Playwright automation; results represent baseline detectability.
No site interactions beyond the homepage load were performed, so bot detection triggered by user interaction or later navigation is not captured.
Screenshots are excluded from public release for copyright reasons, limiting some qualitative analyses.
The impact of bot detection on downstream measurement outcomes (e.g., content differences, data bias) is not quantified and left for future work.

Open questions / follow-ons

What is the quantitative downstream effect of bot detection blocking on the validity and representativeness of specific web measurement outcomes?
How effective are various evasion or stealth techniques in bypassing prevalent CDN-integrated bot detection services in practice?
How do bot detection blocking rates and techniques vary across different geographic vantage points and residential IP addresses vs datacenter IPs?
Can dynamic crawler interaction models (beyond static homepage loads) evade or reduce bot detection-induced blocking?

Why it matters for bot defense

For bot-defense engineers and CAPTCHA practitioners, this study highlights a systemic, yet under-acknowledged, issue: bot detection creates large-scale, provider-correlated sample loss that silently biases web measurement and crawling efforts. Understanding that major CDN providers such as Cloudflare and Akamai dominate blocking, mostly triggered by header-level signals and pervasive JavaScript environment probing, can inform the design of defensive infrastructures with clearer signals and measurement transparency.

Practitioners building or evaluating bot detection systems should consider not only whether detection is present but also the real-world impact of blocking on legitimate automated clients. The study’s taxonomy of detection signals and its measured block rates offer a basis for evaluating the collateral impact of detection policies on benign crawlers. Furthermore, the demonstrated gap in reporting blocking and evasion techniques in academic measurement research argues for better standardized documentation and reproducibility practices, which bot-defense teams can help establish to improve community-wide rigor.

Cite

@article{arxiv2606_14525,
  title={Detecting Bot Detection: Prevalence, Techniques, and Implications for Web Measurement Research},
  author={Ralf Gundelach and Michael Mühlhauser and Dominik Herrmann},
  journal={arXiv preprint arXiv:2606.14525},
  year={2026},
  url={https://arxiv.org/abs/2606.14525}
}
