Shy Guys: A Light-Weight Approach to Detecting Robots on Websites
Source: arXiv:2603.28546 · Published 2026-03-30 · By Rémi Van Boxem, Tom Barbette, Cristel Pelsser, Ramin Sadre
TL;DR
This paper addresses a practical gap in bot defense: many websites need something cheaper and less intrusive than CAPTCHAs, but more effective than simple user-agent blacklists. The authors’ core idea is to combine two passive signals already present in ordinary server logs: whether a client requests the site favicon, and whether the user-agent string looks like a real modern browser versus a crawler, headless client, or forged string. The premise is that legitimate browsers usually fetch favicons and follow recognizable user-agent conventions, while many scrapers and automated agents do neither.
The main result is that this lightweight heuristic stack passes most human traffic through unchallenged while catching a meaningful fraction of bots. On a multi-source dataset of 4,594,072 requests and 54,945 unique user-agent strings, they report 67.6–67.7% bot detection with a 3.0% false-positive rate, far above the comparison baselines they cite (Cloudflare bot management 8.4%, CrawlerDetect 18.1%, Matomo DeviceDetector 12.0%, known-bots IP lists 18.0%). The paper’s strongest contribution is not perfect bot separation but a low-friction first-pass filter that routes only ambiguous traffic to active challenges.
Key findings
- The evaluation corpus contains 4,594,072 requests and 54,945 unique user-agent strings collected from honeypots and third-party partner sites hosted across multiple regions (Japan, USA, Europe, etc.).
- The proposed combined heuristic achieves 67.6% bot true positive rate with 3.0% false-positive rate in the human-vs-bot comparison table (Table III).
- Cloudflare Bot Management is reported at 8.4% bot detection on the same comparison set, versus 67.6% for the proposed method (Table III).
- CrawlerDetect reaches 18.1% bot detection, Matomo DeviceDetector 12.0%, and known-bots IP lists 18.0%, all substantially below the proposed method (Table III).
- Among 54,945 unique user-agents, the user-agent analysis flags 373 as explicit bots using regex + bot lists, and those 373 account for about 25.6% of total requests.
- For stealthy bot detection, 51,268 unique user-agents (93% of all unique UAs) are flagged by the user-agent-coherence rules; 48,646 of those are flagged by the deprecated-browser heuristic and 27,496 by the deprecated-OS heuristic.
- In the favicon validation study, daily unique IPs requesting the favicon and daily unique authenticated IPs issuing POSTs to /course/ show no significant difference (paired t-test t(10) = -0.981, p = 0.350, Cohen’s d = -0.3) and strong correlation (Pearson r = 0.87, p < 0.001).
- Only 351 of 3,185 unique IPs in the bot/honeypot dataset (11%) requested robots.txt, and only 264 (8%) requested the favicon, suggesting most automated clients did not behave like normal browsers.
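The favicon-validation statistics above (paired t-test, Cohen's d, Pearson r) can be reproduced in outline with SciPy on any pair of daily count series. The two eleven-day series below are invented placeholders standing in for the paper's favicon-IP and authenticated-POST-IP counts, not its actual data:

```python
# Sketch of the favicon-validation statistics on two daily count series.
# The numbers below are invented placeholders, not the paper's data.
import numpy as np
from scipy import stats

favicon_ips = np.array([120, 135, 128, 140, 150, 145, 132, 138, 142, 149, 151])
auth_post_ips = np.array([118, 140, 130, 138, 155, 150, 130, 140, 145, 152, 155])

# Paired t-test across the 11 daily observations
t_stat, p_value = stats.ttest_rel(favicon_ips, auth_post_ips)

# Cohen's d for paired samples: mean difference / std of the differences
diff = favicon_ips - auth_post_ips
cohens_d = diff.mean() / diff.std(ddof=1)

# Pearson correlation between the two series
r, r_p = stats.pearsonr(favicon_ips, auth_post_ips)

print(f"t({len(diff) - 1}) = {t_stat:.3f}, p = {p_value:.3f}")
print(f"Cohen's d = {cohens_d:.2f}, Pearson r = {r:.2f}")
```

A non-significant mean difference together with a high r is what the paper treats as evidence that favicon requests track real users.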
Threat model
The adversary is an automated web client that may spoof its user-agent, ignore or selectively obey robots.txt, use headless browsers, and route traffic through ordinary web infrastructure to resemble humans. The defender assumes access only to standard server logs and no client-side interaction. The attacker is not assumed to have perfect mimicry of contemporary browser user-agent conventions across time, nor to perfectly reproduce favicon-fetch behavior; however, the paper acknowledges that stronger adversaries can evade some rules, so the method is intended as a first-line filter rather than a complete solution.
Methodology — deep read
Threat model and assumptions: the attacker is an automated client, often a crawler or scraper, that may spoof its user-agent string and may or may not obey robots.txt. The authors explicitly assume the defender is limited to standard web server logs and wants a passive first-pass detector, not a full browser challenge. They also acknowledge a practical asymmetry: false positives are more costly than false negatives, so the system is tuned to preserve legitimate users’ experience and send only uncertain traffic to stronger checks. Their implicit adversary can vary UA strings and can use headless browsers or libraries, but the method assumes many such clients still leak weak signals in logs.
Data and preprocessing: the paper uses multiple log sources, collected starting in July 2024 and spanning several years of traffic in total. The main evaluation corpus has 4,594,072 requests and 54,945 unique user-agent strings. Sources include two honeypots accessible by domain name, IPv4, and IPv6, plus third-party partner logs including an LMS. The raw logs come in Caddy, Apache httpd, NGINX, and HAProxy formats; they are anonymized with Crypto-PAn before parsing into CSV. The authors state that the logs contain only HTTP headers and metadata, with no enrichment from external identity data. For the favicon study, the ground truth is an estimate of authenticated users based on LMS POSTs to /course/: authenticated users receive 200 responses, unauthenticated users 404, so daily counts of unique IPs issuing such POSTs serve as a proxy for real users. To reduce cache-induced false negatives for favicon requests, the favicon URL is rotated daily via /favicon.ico?v=DATE.
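The daily cache-busting favicon URL is simple to sketch. Appending the current date as a query parameter forces browsers to refetch the favicon once per day, so each day's log contains a fresh favicon request per active client; the helper name below is illustrative, not from the paper:

```python
# Sketch of the daily cache-busting favicon URL described above.
# A new query string each day defeats the browser's favicon cache,
# so every active client reappears in that day's favicon log.
from datetime import date

def daily_favicon_url() -> str:
    return f"/favicon.ico?v={date.today().isoformat()}"

# Emitted into every page's <head>, e.g.:
# <link rel="icon" href="/favicon.ico?v=2026-03-30">
print(daily_favicon_url())
```

This trades a tiny amount of extra favicon traffic for a per-day, per-client presence signal in the logs.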
Architecture / algorithm: the method is actually two independent heuristics. First, favicon-based detection treats a favicon request as evidence of a likely non-bot, since many scraping bots using headless browsers do not render pages and therefore omit favicon retrieval. The authors explicitly discuss a known failure mode: if the browser caches the favicon and later changes IP, the server may not see a new request, causing a false positive; they mitigate this in the experiment by forcing daily refetches. Second, user-agent analysis is split into “good bots” and “stealthy bots.” Good bots are identified by searching the UA string for bot-related keywords using the regex rules from the Python user-agents library (“py-ua”) and by checking membership in ai-robots.txt (“robots.json”). Stealthy bots are flagged if the UA does not start with Mozilla/5.0, if it contains browser/OS versions that have been deprecated for roughly two years, or if it violates the reduced-UA conventions modern browsers should use. Algorithm 1 is a simple OR-composition: explicit bot keywords/list membership, then non-Mozilla prefix, then deprecated version check, otherwise human.
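The OR-composition of Algorithm 1 can be sketched as a short classifier. The keyword regex, the version threshold, and the Chrome-only version parse below are simplified stand-ins for the paper's py-ua regexes, ai-robots.txt list, and full deprecation table:

```python
# Sketch of Algorithm 1's OR-composition: explicit bot keywords/list
# membership, then non-Mozilla prefix, then deprecated-version check,
# otherwise human. Keyword list and threshold are illustrative.
import re

BOT_KEYWORDS = re.compile(r"(bot|crawler|spider|scraper|curl|wget)", re.I)
MIN_PLAUSIBLE_CHROME = 110  # stand-in for "deprecated ~two years"

def classify_ua(ua: str) -> str:
    # Step 1: explicit bots (keyword regex / curated list membership)
    if BOT_KEYWORDS.search(ua):
        return "bot (explicit)"
    # Step 2: every modern browser sends a Mozilla/5.0 prefix
    if not ua.startswith("Mozilla/5.0"):
        return "bot (stealthy: non-Mozilla prefix)"
    # Step 3: long-deprecated browser versions are implausible for humans
    m = re.search(r"Chrome/(\d+)", ua)
    if m and int(m.group(1)) < MIN_PLAUSIBLE_CHROME:
        return "bot (stealthy: deprecated version)"
    return "human"

print(classify_ua("Googlebot/2.1 (+http://www.google.com/bot.html)"))
print(classify_ua("Mozilla/5.0 (X11; Linux x86_64) Chrome/58.0.3029.110"))
```

Note the ordering matters: a self-identifying crawler is caught at step 1 even if it also carries a Mozilla/5.0 prefix.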
Training regime and implementation details: there is no learned model, so there are no epochs, batch sizes, optimizer settings, or random seeds to report; the system is rule-based and operational rather than statistical. The implementation details that matter are the supported log formats, the anonymization step, and the daily favicon URL rotation. Where the paper becomes empirical, it works through concrete examples: for instance, a UA like Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0) is treated as highly suspicious because IE 8 is long obsolete, yet that exact string accounts for 93,610 requests (2% of the total). Similarly, the paper shows that many claimed Android, Firefox, and Chrome versions cluster around obsolete releases that would be unusual for real users today.
Evaluation protocol and metrics: the authors validate the favicon heuristic against LMS traffic by comparing daily unique favicon-requesting IPs with daily unique authenticated POSTing IPs. They report a paired t-test, Cohen’s d, and Pearson correlation across 11 daily observations (Figure 2). For user-agent analysis, they separately evaluate the explicit-bot detector (regex + curated list) and the stealth-bot coherence rules. They compare their method against Cloudflare’s certified-bot directory, CrawlerDetect, Matomo DeviceDetector, and known-bots IP lists using a combined ground-truth dataset built from the LMS (human traffic) and honeypot (bot traffic). In that comparison (Table III), metrics are row-normalized by class: bot-class TP/FN and human-class FP/TN. They also report overlap counts among detection sources (Table II) and user-agent distributions for Android, Firefox, and Chrome version claims (Figures 3 and 4). One useful concrete example is the LMS favicon experiment: over the 11-day observation window, the daily favicon-IP count tracks the daily authenticated POST-IP count closely enough that the authors argue favicon retrieval is a reasonable noninvasive proxy for real users.
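The row-normalized, per-class metrics of Table III reduce to four ratios over a confusion matrix: bot-class TPR/FNR normalized by ground-truth bots (honeypot traffic) and human-class FPR/TNR normalized by ground-truth humans (LMS traffic). The counts below are invented to mirror the headline numbers' shape, not Table III's actual cells:

```python
# Sketch of the row-normalized per-class metrics used in the paper's
# comparison table. The counts are invented, not Table III's.
def class_rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    bot_total = tp + fn      # all ground-truth bots (honeypot traffic)
    human_total = fp + tn    # all ground-truth humans (LMS traffic)
    return {
        "bot_tpr": tp / bot_total,      # bots correctly flagged
        "bot_fnr": fn / bot_total,      # bots missed
        "human_fpr": fp / human_total,  # humans wrongly flagged
        "human_tnr": tn / human_total,  # humans correctly passed
    }

rates = class_rates(tp=676, fn=324, fp=30, tn=970)
print(rates)
```

Row normalization matters here because the merged dataset's bot and human classes come from different sources with very different volumes; per-class rates keep the two comparable.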
Reproducibility: the paper indicates the use of standard log formats and open-source components like py-ua and ai-robots.txt, but the excerpt provided does not include a code release, frozen evaluation dataset release, or exact implementation repository. The data themselves appear to be partly third-party and anonymized, which likely limits public reproducibility. The authors do provide enough detail to reimplement the rule set, but not enough in the excerpt to fully reconstruct the merged labeled dataset or the exact time windows for every source.
Technical innovations
- Combines favicon presence with user-agent coherence rules to build a passive first-pass bot filter using only standard server logs, rather than JavaScript challenges or CAPTCHAs.
- Uses a two-stage UA detector: explicit bot identification via regex/list membership, then stealth-bot detection via Mozilla/5.0 prefix checks and deprecated browser/OS/version heuristics.
- Validates favicon requests as a proxy for legitimate authenticated users by comparing them against daily LMS POST activity and showing strong correlation without significant mean difference.
- Shows that many bots can be surfaced by user-agent incoherence even when they do not self-identify, outperforming list-based and off-the-shelf detectors on the merged evaluation set.
Datasets
- Honeypot logs — part of 4,594,072 total requests — self-hosted honeypots accessible via domain name, IPv4, and IPv6
- LMS logs — part of 4,594,072 total requests — third-party university learning-management-system logs
- Partner website logs — part of 4,594,072 total requests — websites hosted in Japan, USA, Europe, and elsewhere
Baselines vs proposed
- Our method: bot detection = 67.6% vs Cloudflare Bot Management = 8.4% (Table III)
- Our method: false-positive rate = 3.0% vs Cloudflare Bot Management = 0.1% (Table III)
- CrawlerDetect: bot detection = 18.1% vs proposed = 67.6% (Table III)
- Matomo DeviceDetector: bot detection = 12.0% vs proposed = 67.6% (Table III)
- Known-bots IP list: bot detection = 18.0% vs proposed = 67.6% (Table III)
Limitations
- The method is heuristic-based and likely easier to evade than behavior-based or challenge-based systems; a sophisticated attacker can mimic Mozilla-style strings and modern version patterns.
- The 3.0% false-positive rate is low but nonzero, and the paper explicitly notes that misclassifying a real user is worse than missing a bot in this deployment model.
- The favicon validation relies on a specific LMS proxy ground truth and daily URL rotation; in sites where favicon caching or IP hopping is different, performance may differ.
- The combined human-vs-bot ground truth is constructed from two sources (LMS humans, honeypot bots) rather than a single fully labeled dataset, which can bias evaluation toward cleaner separation than real production traffic.
- The excerpt does not provide a public code release or full dataset release, so exact reproducibility appears limited from the text provided.
- The “deprecated browser/OS” heuristic may age poorly as user-agent reduction, long-term support policies, and enterprise device fleets evolve.
Open questions / follow-ons
- How well do the favicon and UA-coherence heuristics hold under modern browser user-agent reduction and rapidly changing device/browser fleets over time?
- Can the passive rules be combined with TLS fingerprints or session-level signals without materially increasing privacy risk?
- What is the evasion cost for attackers who intentionally mimic Mozilla/5.0, use fresh browser versions, and request favicons like humans?
- How should thresholds be adapted for sites with heavy NAT, mobile IP churn, or aggressive CDN caching where favicon/IP correlation is weaker?
Why it matters for bot defense
For a bot-defense engineer, the useful takeaway is that a cheap passive prefilter can substantially reduce how often you need to challenge traffic. This paper suggests that favicon presence and user-agent coherence are good triage signals: clear humans early, route obvious bots to blocking, and reserve CAPTCHAs or JS challenges for the smaller ambiguous remainder. In practice, that means you could implement this inside log-processing or edge-request classification without instrumenting the client, which is attractive for sites that want to minimize friction.
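The triage flow described above can be sketched as a small routing function: passive signals first, active challenges only for the ambiguous remainder. The signal names, the dataclass, and the routing thresholds are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the passive-first triage flow: block self-identified bots,
# pass clients that clear both passive checks, challenge the rest.
# Signal names and routing rules are illustrative, not the paper's.
from dataclasses import dataclass

@dataclass
class RequestSignals:
    fetched_favicon: bool   # seen in the daily favicon-request log
    ua_flagged_bot: bool    # explicit keyword/list hit
    ua_incoherent: bool     # stealthy-bot coherence rule hit

def triage(sig: RequestSignals) -> str:
    if sig.ua_flagged_bot:
        return "block"      # self-identified bot: no challenge needed
    if sig.fetched_favicon and not sig.ua_incoherent:
        return "allow"      # clears both passive checks: likely human
    return "challenge"      # ambiguous: escalate to CAPTCHA/JS check

print(triage(RequestSignals(fetched_favicon=True,
                            ua_flagged_bot=False,
                            ua_incoherent=False)))
```

The point of this shape is that the expensive, friction-inducing branch is only reached by the traffic the cheap signals cannot decide.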
The caveat is that this is not a robust standalone bot classifier. It works best as a front-end to stricter controls, and it will need periodic maintenance as browser UA conventions change and as sophisticated crawlers adapt. If you already operate a challenge system, this paper argues for using passive heuristics to reduce challenge volume and preserve UX, not to replace active defenses entirely.
Cite
@article{arxiv2603_28546,
  title={Shy Guys: A Light-Weight Approach to Detecting Robots on Websites},
  author={Rémi Van Boxem and Tom Barbette and Cristel Pelsser and Ramin Sadre},
  journal={arXiv preprint arXiv:2603.28546},
  year={2026},
  url={https://arxiv.org/abs/2603.28546}
}