From Third-Party to First-Party: Measuring and Protecting Against Modern Web Tracking Mechanisms

Source: arXiv:2606.16720 · Published 2026-06-15 · By Christian Böttger, Tareq Khouja, Norbert Pohlmann, Nurullah Demir, Tobias Urban

TL;DR

This paper addresses the shift in web user tracking from traditional third-party techniques to first-party tracking (FPT) and server-side tracking (SST). These newer mechanisms relocate tracking code into a website's first-party context or backend server, which obscures data flows and makes common client-side detection tools and URL-based blocklists less effective. The authors develop a provider-agnostic methodology that combines large-scale web crawl data, first-party cookie analysis, and JavaScript behavioral analysis to detect and characterize modern tracking deployments at scale.

By analyzing 758,960 pages from 25,000 sites, using heuristic identification of long-lived, unique first-party cookies and similarity clustering of tracking-related JavaScript, they find that over 54% of sites now deploy FPT or SST. The tracking ecosystem is heavily centralized around a few major players such as Google, which operate code across both third-party and first-party contexts. The authors also show that widely used filter lists fail to block the majority of first-party tracking requests. By deriving 181 new blocking rules from statistical patterns in first-party tracking URLs, they block 63% more requests than those lists, with only minor page breakage. The work thus provides both an empirical baseline and practical defense mechanisms for the bot-defense and privacy communities.

Key findings

  • Over 54% of the 13,187 analyzed websites deploy first-party tracking (FPT) or server-side tracking (SST) techniques, as indicated by persistent first-party cookies.
  • Identified 6,249 distinct cookie names potentially holding tracking user IDs after applying novel heuristic filters based on longevity, length, uniqueness, and value similarity.
  • Collected and clustered over 6 million JavaScript files, of which those interacting with FPT cookies reduce to 26,606 distinct FPT-related scripts (by SimHash); 45% were delivered in a first-party context and 58% in a third-party context, with 3.5% served identically in both contexts.
  • Generated 24,914 clusters of similar FPT scripts linked to 492 distinct tracking entities, with just 10 clusters accounting for 44% of script URLs, showing a heavy-tailed ecosystem centralized around a few dominant vendors like Google.
  • The attribution method assigned 37% of clusters to known trackers using the WhoTracksMe database, covering 76% of all tracking script URLs and highlighting substantial tracking without visible third-party scripts.
  • Current widely used filter lists (EasyList, EasyPrivacy) are largely ineffective against first-party tracking; proposed new blocking rules based on first-party tracking query parameter patterns block 63% more requests than traditional lists with only minor page breakage.
  • Script similarity analysis across regions (EU and US) showed a core common set of tracking scripts globally deployed alongside regional variants.
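The concentration finding above (10 clusters accounting for 44% of script URLs) is a top-k share over cluster sizes. A minimal sketch of that arithmetic follows; the function name and the sample sizes are illustrative assumptions, not values from the paper.

```python
def top_k_share(cluster_sizes: list[int], k: int) -> float:
    """Fraction of all script URLs that fall into the k largest clusters.

    cluster_sizes holds one entry per cluster: the number of script URLs
    assigned to it. The input data here is hypothetical.
    """
    total = sum(cluster_sizes)
    if total == 0:
        return 0.0
    largest = sorted(cluster_sizes, reverse=True)[:k]
    return sum(largest) / total


if __name__ == "__main__":
    # Toy distribution: two large clusters and a tail of small ones.
    sizes = [50, 30, 10, 5, 5]
    print(f"top-2 share: {top_k_share(sizes, 2):.0%}")
```

A share close to 1 for small k indicates a heavy-tailed ecosystem dominated by a few vendors.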

Threat model

The adversary is a web tracking provider or site operator who relocates traditional third-party tracking logic into first-party contexts or server-side components to evade client-side browser privacy controls and URL-based blocklists. The tracker can control JavaScript delivered to sites and can set or access first-party cookies holding unique IDs. However, activity inside the adversary's backend servers cannot be observed by the measurement, and the adversary cannot prevent the client from running blocking tools or inspecting cookies. The adversary aims to remain covert and bypass traditional third-party tracking defenses.

Methodology — deep read

The authors start from the threat model of a web tracker that relocates tracking logic from third-party domains into first-party contexts or servers in order to evade browser privacy protections and blocklists. They assume that trackers can set first-party cookies and deliver JavaScript tracking code that is indistinguishable from site-owned scripts, and that the measurement has no visibility into backend servers.

Data was collected at scale using the MultiCrawl measurement framework built on OpenWPM, visiting 25,000 sites sampled from the Tranco list (the top 5k plus random samples from other rank ranges). The crawler visited 25 pages per site, resulting in 758,960 pages and over 3 TB of data, including HTTP requests, responses, cookie headers, and downloaded JavaScript files. To capture realistic first-party tracking under consent, the crawler used a modified Consent-O-Matic extension to automatically accept all cookies.
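The site selection described above (a fixed top slice plus random samples from lower rank ranges) can be sketched as a simple stratified sample. The bucket boundaries, per-bucket counts, and function name below are assumptions for illustration; the summary does not state the ranges the authors used.

```python
import random


def sample_sites(ranked: list[str], top_n: int,
                 buckets: list[tuple[int, int]], per_bucket: int,
                 seed: int = 0) -> list[str]:
    """Take the first top_n sites, then draw per_bucket sites at random
    from each (start, end) rank slice. Slices are 0-based, end-exclusive,
    and assumed not to overlap with each other or with the top slice."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = list(ranked[:top_n])
    for start, end in buckets:
        pool = ranked[start:end]
        sample.extend(rng.sample(pool, min(per_bucket, len(pool))))
    return sample


if __name__ == "__main__":
    # Hypothetical ranked list standing in for a Tranco download.
    ranked = [f"site{i}.example" for i in range(1, 101)]
    chosen = sample_sites(ranked, top_n=5,
                          buckets=[(5, 50), (50, 100)], per_bucket=10)
    print(len(chosen), chosen[:5])
```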

Preprocessing focused on first-party cookies, excluding third-party cookies outright. A rule-based heuristic adapted from prior third-party tracking studies identified potential tracking cookies by requiring (1) longevity >90 days (non-session), (2) minimum length ≥8 bytes, (3) uniqueness of values across crawl profiles within 25% length similarity, and (4) value similarity less than 60% by Ratcliff/Obershelp string metric. This heuristic was validated manually on 100 sampled cookies and against the CookieGraph dataset, achieving about 70% overlap while excluding session cookies misleadingly used for tracking. Known trackers were identified by matching to EasyList, EasyPrivacy, and WhoTracksMe databases.
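The four criteria can be sketched as a pairwise check on the same cookie observed in two independent crawl profiles. This is a minimal sketch under stated assumptions, not the authors' implementation: the `CookieObservation` type, the convention that session cookies have a lifetime of 0, and the use of Python's `difflib.SequenceMatcher` (a Ratcliff/Obershelp-style matcher) are all choices made here for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

MIN_LIFETIME_DAYS = 90   # criterion (1): lifetime must exceed 90 days
MIN_LENGTH = 8           # criterion (2): value at least 8 bytes long
MAX_LENGTH_DIFF = 0.25   # criterion (3): lengths within 25% of each other
MAX_SIMILARITY = 0.60    # criterion (4): similarity below 60%


@dataclass
class CookieObservation:
    name: str
    value: str
    lifetime_days: float  # assumed convention: 0 for session cookies


def is_candidate_id(a: CookieObservation, b: CookieObservation) -> bool:
    """Return True if the cookie looks like a per-profile tracking ID."""
    if min(a.lifetime_days, b.lifetime_days) <= MIN_LIFETIME_DAYS:
        return False  # session or short-lived cookie
    if min(len(a.value), len(b.value)) < MIN_LENGTH:
        return False  # too short to hold a unique identifier
    if a.value == b.value:
        return False  # same value in both profiles, so not user-specific
    longer = max(len(a.value), len(b.value))
    if abs(len(a.value) - len(b.value)) / longer > MAX_LENGTH_DIFF:
        return False  # lengths differ too much to be the same ID format
    similarity = SequenceMatcher(None, a.value, b.value, autojunk=False).ratio()
    return similarity < MAX_SIMILARITY


if __name__ == "__main__":
    first = CookieObservation("_uid", "a1b2c3d4e5f6g7h8", 365)
    second = CookieObservation("_uid", "z9y8x7w6v5u4t3s2", 365)
    print(is_candidate_id(first, second))
```

Values that share a long common prefix, such as a counter appended to a fixed string, fail criterion (4) and are not flagged.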

For the script analysis, 6.28 million JavaScript files were collected, of which 639,947 interacted with the identified potential first-party tracking cookies. The analysis used SimHash fingerprinting (64 bits, Hamming distance ≤8) to cluster scripts by near-duplicate code similarity, which produced 24,914 clusters. Clusters were linked to tracking entities (e.g., Google, Facebook) by matching the source URLs of scripts served from third-party contexts against the WhoTracksMe database.
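The near-duplicate clustering step can be illustrated with a small standard-library sketch: a 64-bit SimHash over token shingles, followed by grouping of fingerprints whose Hamming distance is at most 8. The tokenizer, shingle size, hash function (BLAKE2b), and the quadratic single-link grouping are assumptions chosen for brevity; an index structure would be needed at the scale reported here.

```python
import hashlib
import re

BITS = 64          # fingerprint width stated in the text
MAX_DISTANCE = 8   # Hamming threshold stated in the text


def simhash(text: str, shingle_size: int = 4) -> int:
    """Compute a 64-bit SimHash over word-token shingles."""
    tokens = re.findall(r"\w+", text)
    count = max(1, len(tokens) - shingle_size + 1)
    shingles = [" ".join(tokens[i:i + shingle_size]) for i in range(count)]
    weights = [0] * BITS
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(BITS):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    # A bit is set in the fingerprint when most shingles set it.
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def cluster(fingerprints: dict[str, int],
            max_distance: int = MAX_DISTANCE) -> list[set[str]]:
    """Single-link grouping via union-find; O(n^2) pairwise comparison."""
    parent = {name: name for name in fingerprints}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    names = list(fingerprints)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(fingerprints[a], fingerprints[b]) <= max_distance:
                parent[find(a)] = find(b)
    groups: dict[str, set[str]] = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

Because SimHash maps similar inputs to nearby fingerprints, minified copies of the same tracker that differ only in an embedded site ID tend to land in the same cluster.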

To defend against FPT, the authors mined statistical patterns in the query parameters of first-party tracking request URLs and used them to generate 181 new ad-blocking rules compatible with the EasyList format. These rules were evaluated on held-out crawls by comparing block rates and page breakage against traditional filter lists.
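One plausible shape for such parameter mining is sketched below: count how often each query parameter name appears in tracking versus non-tracking requests, and emit a filter for names that are frequent and almost exclusive to tracking. The thresholds, the function names, and the emitted rule shape (`^name=` with the `~third-party` option from Adblock Plus filter syntax) are assumptions; the summary does not describe the exact statistics or the form of the 181 rules.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")  # skip names needing escaping


def param_names(url: str) -> set[str]:
    """Distinct query parameter names in a URL."""
    query = urlsplit(url).query
    return {name for name, _ in parse_qsl(query, keep_blank_values=True)}


def mine_rules(tracking_urls: list[str], benign_urls: list[str],
               min_support: int = 2, min_precision: float = 0.9) -> list[str]:
    """Emit one filter per parameter name that is common in tracking
    requests and rare in benign ones (hypothetical thresholds)."""
    tracking = Counter(n for u in tracking_urls for n in param_names(u))
    benign = Counter(n for u in benign_urls for n in param_names(u))
    rules = []
    for name, hits in sorted(tracking.items()):
        precision = hits / (hits + benign[name])
        if (hits >= min_support and precision >= min_precision
                and SAFE_NAME.fullmatch(name)):
            # "^" matches a separator such as "?" or "&";
            # "~third-party" limits the filter to first-party requests.
            rules.append(f"^{name}=$~third-party")
    return rules


if __name__ == "__main__":
    tracking = ["https://shop.example/collect?fp_uid=abc&page=1",
                "https://news.example/t?fp_uid=def&page=2"]
    benign = ["https://shop.example/list?page=3",
              "https://shop.example/list?page=4"]
    print(mine_rules(tracking, benign))
```

Requiring high precision is what keeps page breakage low: a generic name such as `page` appears in benign requests too and is therefore not turned into a rule.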

No training or learning in the ML sense was applied; the approach was heuristic and measurement-driven, supported by manual validation and grounded in prior literature on cookie classification. Evaluation metrics included the prevalence of tracking cookies, script reuse distributions, cluster attribution rates, intersections across regions, and the effectiveness of blocking rules, measured as percentage request reduction and subjective page breakage impact. Measurement runs from two geographic locations (EU and US) were used to check that the findings are not specific to a single vantage point.
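The cross-region intersection metric can be expressed with plain set operations over script fingerprints. This is a minimal sketch; the function name, the dictionary layout, and the use of the Jaccard index as the summary statistic are assumptions rather than details taken from the paper.

```python
def regional_overlap(eu: set[int], us: set[int]) -> dict[str, object]:
    """Split script fingerprints into a shared core and regional variants."""
    core = eu & us
    union = eu | us
    return {
        "core": core,            # scripts seen from both vantage points
        "eu_only": eu - us,      # EU-specific variants
        "us_only": us - eu,      # US-specific variants
        "jaccard": len(core) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical fingerprints standing in for SimHash values.
    result = regional_overlap({1, 2, 3}, {2, 3, 4})
    print(result["jaccard"], result["eu_only"], result["us_only"])
```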

A concrete end-to-end example: the crawler visits a site that uses FPT and collects a first-party cookie with a lifespan above 90 days and a unique value pattern. JavaScript loaded on that site accesses the cookie. The script's SimHash is computed and clustered with other near-duplicates across the dataset. The cluster is linked to Google because WhoTracksMe attributes the third-party source URLs of scripts in the cluster to Google. Blocking rules derived from the query parameters seen in this cluster's requests then block these otherwise undetected first-party tracking requests using EasyList-compatible rule syntax. This shows how detection and protection combine at scale.

Technical innovations

  • Provider-independent heuristic and methodology to detect first-party tracking cookies using a combination of longevity, length, uniqueness, and similarity filtering adapted from third-party tracking literature.
  • Large-scale code similarity clustering of JavaScript tracking scripts using SimHash and SimHashIndex to identify shared tracking infrastructure across first- and third-party contexts.
  • Attribution of clustered scripts to known tracking providers by linking third-party script URLs in otherwise opaque first-party served code using WhoTracksMe.
  • Statistical mining of first-party tracking query parameter patterns to generate 181 new ad-blocking rules that substantially increase blocking coverage of FPT requests compared to traditional filter lists.
  • Demonstration that first-party tracking scripts often exactly replicate third-party code, signifying migration of traditional third-party tracking into first-party contexts.

Datasets

  • Tranco site sample — 25,000 sites (top 5k plus samples from other rank ranges) with 25 pages each — publicly available Tranco list (Apr 2025)
  • JS Scripts Collected — 6,280,920 JavaScript files from measurement crawl — proprietary raw crawl data
  • Cookie Data — 477,231 first-party cookies observed during crawl — proprietary raw crawl data
  • WhoTracksMe database — tracking entity mappings — publicly available trackerdb dataset
  • EasyList & EasyPrivacy — ad-blocking filter lists — publicly available

Baselines vs proposed

  • EasyList & EasyPrivacy blocking: baseline block rate for first-party tracking requests (exact number not stated)
  • Proposed blocking rules: block 63% more first-party tracking requests than EasyList/EasyPrivacy baseline
  • Clustering: 37% of script clusters attributed to known trackers, covering 76% of tracking script URLs
  • Script similarity: 3.5% of tracking JS scripts identical in first- and third-party context, showing migration
  • Manual heuristic validation: 70% overlap with CookieGraph dataset; heuristic identified 0.14% of cookies as tracking vs 0.09% by Cookiepedia
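Note that "63% more" is a relative increase over the baseline's blocked-request count, not an absolute block rate of 63%. Since the baseline number is not stated, the sketch below uses hypothetical counts purely to show the arithmetic.

```python
def relative_increase(baseline_blocked: int, proposed_blocked: int) -> float:
    """Relative gain in blocked requests over the baseline filter lists."""
    if baseline_blocked <= 0:
        raise ValueError("baseline must block at least one request")
    return (proposed_blocked - baseline_blocked) / baseline_blocked


if __name__ == "__main__":
    # Hypothetical counts: baseline lists block 1,000 requests,
    # baseline plus new rules block 1,630.
    print(f"{relative_increase(1000, 1630):.0%} more requests blocked")
```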

Limitations

  • Measurement only captures client-side indicators of first-party tracking; server-side communication is unobservable.
  • Heuristic cookie classification risks false positives/negatives due to lack of ground truth for many cookies.
  • Attribution relies on WhoTracksMe, which is incomplete and may miss some trackers or misattribute clusters.
  • Blocking rule evaluation does not include extensive user experience or broken-site testing beyond limited page breakage reports.
  • No analysis of evolution over time or impact of different privacy regulations (e.g., GDPR) on FPT deployment.
  • Analysis focused on desktop Firefox configured to block third-party cookies; results may differ on other browsers or configurations.

Open questions / follow-ons

  • How do different browser privacy features (e.g., Intelligent Tracking Prevention in Safari) impact first-party tracking prevalence and detection?
  • Can machine learning or dynamic behavioral analysis improve identification of first-party tracking beyond static heuristics and script similarity?
  • What are the longitudinal trends in migration from third-party to first-party tracking over time and across regulatory regions?
  • How can user experience be preserved while blocking increasingly stealthy first-party tracking techniques, especially on complex modern web apps?

Why it matters for bot defense

For bot-defense and CAPTCHA engineers, this work highlights a critical evolution in the tracking landscape toward first-party and server-side mechanisms that evade classical URL-based and third-party-focused defenses. Since bots and fraudulent actors increasingly exploit tracking signals for behavior analysis, relying solely on third-party cookie or script detection is insufficient. The heuristic identifies long-lived first-party cookies that serve as persistent identifiers, and the similarity clustering shows that large trackers like Google distribute identical tracking scripts within first-party domains, which makes detection harder.

The proposed blocking rules derived from query parameter analysis illustrate a practical path forward to disrupt these opaque tracking flows without extensive page breakage. Bot-defense tools should consider incorporating similar first-party-aware heuristics and signature detection to monitor and potentially block suspicious tracking cookies or scripts that contribute to fingerprinting and user profiling. Moreover, understanding that first-party tracking can obscure client-server data flows informs CAPTCHA design to rely less on client-exposed signals alone and incorporate server-side validations. This paper provides empirical baselines and methodological blueprints for adapting defenses to this shifting threat landscape.

Cite

bibtex
@article{arxiv2606_16720,
  title={From Third-Party to First-Party: Measuring and Protecting Against Modern Web Tracking Mechanisms},
  author={Christian Böttger and Tareq Khouja and Norbert Pohlmann and Nurullah Demir and Tobias Urban},
  journal={arXiv preprint arXiv:2606.16720},
  year={2026},
  url={https://arxiv.org/abs/2606.16720}
}

Articles are CC BY 4.0 — feel free to quote with attribution