Fingerprinting SDKs for Mobile Apps and Where to Find Them: Understanding the Market for Device Fingerprinting

Source: arXiv:2506.22639 · Published 2025-06-27 · By Michael A. Specter, Mihai Christodorescu, Abbie Farr, Bo Ma, Robin Lassonde, Xiaoyang Xu et al.

TL;DR

This paper addresses the pervasive practice of device fingerprinting in the mobile app ecosystem, focusing on the role of third-party SDKs embedded in Android apps. Prior research has largely concentrated on fingerprinting on the web or on the presence of tracking in general; this work is distinguished by its scale and its focus on the SDK market enabling fingerprinting in native mobile apps. Leveraging a massive dataset of 228,598 SDKs from Maven repositories and 178,054 popular Android apps from the Google Play store, the authors combine manual and automated static analysis to detect fingerprinting behaviors through the exfiltration of device signals. They build a seed set of 14 self-declared fingerprinting SDKs, identify over 500 distinct signals collected, and then use static taint analysis to find 723 SDK families exhibiting similar fingerprinting-like behavior.

Key findings reveal that only 30.56% of fingerprinting behaviors originate from ad-related SDKs, with nearly a quarter coming from SDKs with unclear or unknown purposes. Security and authentication SDKs account for only 11.7%. The signals used for fingerprinting vary widely: 504 unique signals were detected, with a very sparse distribution such that only 2% of APIs are used by more than 75% of fingerprinting SDKs. This diversity complicates detection and enforcement efforts that rely on permissions or API monitoring. The authors' manual labeling of SDK purposes, combined with unsupervised clustering of SDK fingerprinting behaviors, shows poor separability of ad SDKs from others, suggesting that existing industry anti-tracking policies focused mainly on advertising are too narrow in scope. Furthermore, fingerprinting SDKs are disproportionately popular and appear across multiple app categories, emphasizing the breadth of the phenomenon.

Key findings

The seed set of 14 self-declared fingerprinting SDKs collectively exfiltrate over 500 distinct signals, with each SDK collecting on average 75.5 unique signals (Section 4.1).
Only 30.56% of fingerprinting behaviors are linked to advertising SDKs; 23.92% are from SDKs of unclear purpose, and 11.7% from security and authentication SDKs (Section 4.2).
A total of 723 likely fingerprinting SDK families were identified, spanning 14,178 SDK versions, by static taint analysis using a 500+ signal set (Section 4.2).
The set of exfiltrated fingerprinting APIs is highly sparse: only 2% of individual APIs are used by more than 75% of fingerprinting SDKs (Figure 2).
Cosine similarity between fingerprinting SDKs in terms of fingerprinting signals is low, with only two SDK pairs exceeding 0.5 similarity, limiting effectiveness of naive detection via signal overlap (Figure 3).
Among fingerprinting SDKs, 72% collect coarse-grained location, 71.6% collect fine-grained location, and 86.29% collect at least one of the two (Section 4.2).
Fingerprinting SDKs have approximately 10x greater install counts versus non-fingerprinting SDKs, indicating their high market penetration (Section 4.3).
t-SNE visualization of SDK fingerprinting behavior reveals multiple small clusters cutting across SDK purpose categories, indicating that ad SDKs and security SDKs cannot be easily separated based on fingerprinting signals alone (Figure 5).
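
The sparsity finding above (only 2% of APIs used by more than 75% of fingerprinting SDKs) comes down to a per-API prevalence count. The following is a minimal sketch of that computation, not the authors' code. The SDK names, signal names, and the `toy_sdks` data are hypothetical, and the 0.75 cutoff simply mirrors the figure quoted in the summary.

```python
from collections import Counter


def api_prevalence(sdk_signals):
    """Return, for each API, the fraction of SDKs that exfiltrate it."""
    counts = Counter(api for apis in sdk_signals.values() for api in set(apis))
    return {api: n / len(sdk_signals) for api, n in counts.items()}


def share_of_widely_used_apis(sdk_signals, cutoff=0.75):
    """Fraction of all observed APIs that more than `cutoff` of SDKs use."""
    prevalence = api_prevalence(sdk_signals)
    widely_used = [api for api, p in prevalence.items() if p > cutoff]
    return len(widely_used) / len(prevalence)


# Hypothetical toy data: SDK name -> set of exfiltrated signal names.
toy_sdks = {
    "sdk_a": {"device_model", "device_brand", "android_id"},
    "sdk_b": {"device_model", "network_operator"},
    "sdk_c": {"device_model", "device_brand", "wifi_ssid"},
    "sdk_d": {"device_model", "sensor_vendor"},
}

share = share_of_widely_used_apis(toy_sdks)
print(f"{share:.1%} of observed APIs are used by more than 75% of SDKs")
```

In the toy data only one of the six observed signals clears the cutoff, which is the shape of distribution the summary describes: a small shared core and a long tail of rarely shared signals.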

Threat model

The adversary is a third-party SDK provider embedded inside Android applications who exfiltrates device signals to fingerprint user devices for tracking or fraud detection. The SDK is assumed to run inside the trusted application context with the permissions granted to the host app. The adversary is assumed not to evade static bytecode analysis through extreme obfuscation or code loading that falls outside the scope of the analysis pipeline. The threat model does not address runtime side-channel fingerprinting or out-of-band correlation. It focuses on detectable data flows from device APIs to network sinks as captured by static taint analysis.

Methodology — deep read

Threat Model & Assumptions: The study models adversaries as SDK providers who seek to fingerprint devices to track or identify users. The analysis focuses on static data flows within SDK code to detect signal exfiltration and does not try to assign intent or motivation. It takes a conservative approach, labeling an SDK as fingerprinting-like only if it exfiltrates at least as many signals as the self-declared fingerprinters. The threat model explicitly excludes runtime or stealthy side-channel methods and covers only API-based fingerprinting that is detectable in SDK code.
Data: The team collected 3,025,417 APKs from the Google Play Store over 18 months, filtering to 178,054 apps with over 10,000 active devices to ensure relevance. For SDKs, a custom crawler extracted 228,598 unique SDKs from nine Maven repositories, filtering out pre-release versions. Metadata from Maven and developer websites supplemented labeling.
Architecture & Analysis Pipeline: The study establishes a pipeline with these steps:

Identification of a seed set of 14 SDKs that openly declare fingerprinting capabilities in their marketing or documentation.
Manual reverse engineering of these SDKs to enumerate the specific device signals exfiltrated; over 500 unique signals were identified.
Development of an automated static taint analysis suite for Android bytecode, performing interprocedural, context-, field-, and object-sensitive taint tracking to detect exfiltration of fingerprinting signals.
Application of this taint analysis over all SDKs to detect those that exfiltrate signals comparable to the seed set, defining an extended set of 723 likely fingerprinting SDKs.
Manual labeling of these SDKs by expert coders via a collaborative codebook process to assign market categories: Advertising, Analytics, Security & Authentication, Tools/Other, or Unclear/Not Found.
SDK-to-application matching using a fine-grained code similarity method that compares identifiers, opcode frequencies, framework APIs, and strings to detect SDK presence inside APKs with controlled false-positive tuning.
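
The taint-analysis step in the pipeline above can be illustrated with a toy worklist propagation over a data-flow graph. This is a deliberately simplified sketch: the engine described in the paper is interprocedural and context-, field-, and object-sensitive, none of which is modeled here. The node names, the `flows` graph, and the source and sink labels are all hypothetical.

```python
from collections import deque


def propagate_taint(flows, sources, sinks):
    """Propagate signal labels from sources along data-flow edges.

    flows:   dict mapping a node to the nodes its value flows into
    sources: dict mapping a source node to the signal label it produces
    sinks:   set of nodes that send data off the device
    Returns a dict mapping each sink to the set of signal labels reaching it.
    """
    taint = {node: {label} for node, label in sources.items()}
    worklist = deque(sources)
    while worklist:
        node = worklist.popleft()
        for succ in flows.get(node, ()):
            reached = taint.setdefault(succ, set())
            new_labels = taint[node] - reached
            if new_labels:  # revisit a successor only when its taint set grows
                reached |= new_labels
                worklist.append(succ)
    return {sink: taint.get(sink, set()) for sink in sinks}


# Hypothetical data-flow graph for one SDK.
flows = {
    "read_model": ["tmp1"],
    "read_android_id": ["tmp2"],
    "tmp1": ["payload"],
    "tmp2": ["payload"],
    "payload": ["network_send"],
    "read_locale": ["local_log"],  # never reaches the network sink
}
sources = {
    "read_model": "device_model",
    "read_android_id": "android_id",
    "read_locale": "locale",
}
sinks = {"network_send"}

exfiltrated = propagate_taint(flows, sources, sinks)
print(exfiltrated)
```

Only signals that reach a network sink count as exfiltrated, so the locale read that ends in a local log does not contribute to the SDK's signal count in this sketch.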

Training Regime: No machine learning model is trained. The study is primarily static program analysis, with the taint analysis implemented and run at large scale. Parameters for the t-SNE visualization are provided, but no ML optimization is described beyond reaching consensus on labels.
Evaluation Protocol: The study evaluates the similarity and distribution of fingerprinting signals across SDK families. It reports inter-rater reliability (Krippendorff’s alpha = 0.804) for labeling and relies on measures such as cosine similarity, t-SNE visualization, and API usage prevalence. The evaluation is mostly descriptive, carried out over the large-scale dataset of apps and SDKs. No adversarial examples or dynamic analysis experiments are reported.
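
For readers unfamiliar with the reliability statistic quoted above, here is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It is a generic textbook-style implementation, not the authors' procedure, and the two coders' labels below are hypothetical toy data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of label lists, one per rated item; None marks a missing label.
    """
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # items with fewer than two labels are not pairable
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    # Alpha is undefined when only one category occurs; return 1.0 by convention here.
    return 1.0 if expected == 0 else 1 - observed / expected


# Hypothetical labels from two coders for ten SDKs.
coder_1 = ["Ads", "Sec", "Ads", "Ads", "Ads", "Ads", "Ads", "Ads", "Sec", "Ads"]
coder_2 = ["Sec", "Sec", "Sec", "Ads", "Ads", "Sec", "Ads", "Ads", "Ads", "Ads"]

alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(coder_1, coder_2)])
print(round(alpha, 3))
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance. The toy coders disagree on four of ten items and land close to 0.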
Reproducibility: The paper does not mention releasing code, datasets, or analysis tools. The SDK and app datasets are proprietary or collected through external repositories that may not be publicly accessible. The methodology is described in detail but full replication depends on access to large codebases and specialized analysis tools.

Concrete Example: The Forter SDK in the seed set was manually reverse engineered to identify the 69 device-property signals it exfiltrates. The taint analysis engine uses these as metadata tags to identify other SDKs that exfiltrate overlapping or larger signal sets. SDKs meeting the minimum signal threshold (20 signals, the count of the lowest seed set SDK) were included in the extended fingerprinting set. These were then manually labeled and matched to 3,000+ apps with over 100 million installs using code similarity methods. The distribution and usage patterns were assessed at scale.
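
The thresholding rule in this example reduces to a short filter: take the smallest signal count in the seed set and keep every candidate that meets it. The sketch below assumes per-SDK signal sets are already available from the taint analysis. All SDK names and the candidate data are hypothetical; only the counts 69 and 20 echo figures from the summary.

```python
def extended_set(seed_signal_counts, candidate_signals):
    """Return candidates exfiltrating at least as many signals as the weakest seed SDK."""
    threshold = min(seed_signal_counts.values())
    return sorted(
        name
        for name, signals in candidate_signals.items()
        if len(signals) >= threshold
    )


# Hypothetical seed counts; 69 and 20 mirror the numbers quoted in the text.
seed_signal_counts = {"seed_a": 69, "seed_b": 20, "seed_c": 120}

# Hypothetical candidates: SDK name -> set of exfiltrated signal names.
candidate_signals = {
    "cand_x": {f"sig{i}" for i in range(35)},
    "cand_y": {f"sig{i}" for i in range(12)},
    "cand_z": {f"sig{i}" for i in range(20)},
}

selected = extended_set(seed_signal_counts, candidate_signals)
print(selected)
```

A count-based threshold is simple and conservative, but it treats every signal as equally identifying, which is the trade-off noted under Limitations.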

Technical innovations

A large-scale static taint analysis pipeline capable of detecting fingerprinting signal exfiltration across 228,000 SDKs and 178,000 Android apps.
Manual creation of a seed set of self-identified fingerprinting SDKs with over 500 unique signals, used as ground truth for automated detection.
Combining manual coding techniques from HCI research to robustly label unlabeled SDKs by market use case based on sparse metadata and code.
Use of a fine-grained code similarity metric aggregating system API calls, opcode frequencies, and string constants to detect SDK presence inside large APK datasets.
Demonstration that fingerprinting SDK signal distributions are highly sparse and diverse, challenging simple detection approaches based on API monitoring or permission enforcement.
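
The code similarity metric used for SDK-to-app matching is described only at a high level: it aggregates framework API calls, opcode frequencies, and string constants. One plausible way to combine such features is sketched below. The Jaccard and cosine components, the equal weights, and the feature dictionaries are all assumptions made for illustration and are not taken from the paper.

```python
import math


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def freq_cosine(p, q):
    """Cosine similarity of two sparse frequency dictionaries."""
    dot = sum(value * q.get(key, 0) for key, value in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def code_similarity(x, y, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted mix of API overlap, string overlap, and opcode-frequency similarity."""
    parts = (
        jaccard(x["apis"], y["apis"]),
        jaccard(x["strings"], y["strings"]),
        freq_cosine(x["opcodes"], y["opcodes"]),
    )
    return sum(w * part for w, part in zip(weights, parts))


# Hypothetical feature summaries for an SDK and two code regions found in an APK.
sdk = {
    "apis": {"getDeviceId", "getSensorList"},
    "strings": {"fp_v2", "https://collector.example"},
    "opcodes": {"invoke-virtual": 40, "const-string": 12},
}
renamed_copy = dict(sdk)  # identical features, e.g. after identifier renaming only
unrelated = {
    "apis": {"setText"},
    "strings": {"hello"},
    "opcodes": {"iput": 5},
}

print(code_similarity(sdk, renamed_copy), code_similarity(sdk, unrelated))
```

A match would then be declared when the score exceeds a cutoff tuned to keep false positives under control, as the summary's mention of controlled false-positive tuning suggests.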

Datasets

SDK dataset — 228,598 SDKs — collected by crawling 9 large Maven repositories (e.g., Maven Central, JCenter, Google)
Android app dataset — 178,054 APKs — filtered popular apps from Google Play Store with >10,000 active installs, collected Jan 2023 to May 2024
Seed set — 14 self-declared fingerprinting SDKs — manually identified from SDK documentation and reverse engineering

Baselines vs proposed

Signals collected: seed set SDKs average 75.5 signals each, while the inclusion threshold for extended set SDKs is a minimum of 20 signals.
Cosine similarity between seed set SDKs: the maximum pairwise similarity is just over 0.5 and the majority of pairs fall below 0.4, indicating poor overlap.
Install counts: fingerprinting SDKs have roughly 10x the installs of non-fingerprinting SDKs (exact install numbers not specified).
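
To make the pairwise comparison above concrete, the following sketch computes cosine similarity between SDKs treated as binary signal vectors, where the formula reduces to the overlap size divided by the geometric mean of the two set sizes. Whether the paper uses binary or weighted vectors is not stated in this summary, so the binary representation is an assumption, and the SDK and signal names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two signal sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def pairs_above(sdk_signals, threshold=0.5):
    """Return (sdk, sdk, similarity) for every pair strictly above the threshold."""
    return [
        (x, y, cosine_similarity(sdk_signals[x], sdk_signals[y]))
        for x, y in combinations(sorted(sdk_signals), 2)
        if cosine_similarity(sdk_signals[x], sdk_signals[y]) > threshold
    ]


# Hypothetical toy data: SDK name -> set of exfiltrated signal names.
toy_sdks = {
    "sdk_a": {"s1", "s2", "s3", "s4"},
    "sdk_b": {"s3", "s4", "s5", "s6"},
    "sdk_c": {"s1", "s2", "s3", "s7"},
}

similar_pairs = pairs_above(toy_sdks)
print(similar_pairs)
```

When most pairs fall below the threshold, as reported for the seed set, a detector built on overlap with any single known SDK's signal list would be expected to miss many others.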

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2506.22639.

Fig 2: Map of signals collected by known fingerprinting SDKs in the Seed Set, with one dot per API exfiltrated. The top plot shows

Fig 3: Cosine similarity between Seed Set SDKs, each repre-

Fig 8: Static Taint Analysis – Identify Sources and Sinks

Fig 9: Static Taint Analysis – Taint Propagation

Fig 10: CoFlow Analysis

Fig 11: Fingerprinting detection using CoFlow. There are 2

Fig 12: Features of SDK code are not uniformly distributed

Limitations

Static analysis may miss fingerprinting via dynamic code loading, polymorphic code, or side channels such as timing attacks.
No runtime or behavioral evaluation was performed; static taint results may over- or under-represent fingerprinting capabilities.
SDK labeling relies on manual metadata analysis; 23.92% of fingerprinting SDKs remain categorized as unclear, meaning their functionality is unknown.
The study focuses exclusively on the Android ecosystem and publicly available Maven repositories; iOS and non-Maven SDKs are not covered.
There is no adversarial evaluation of SDK evasion or obfuscation techniques against the detection pipeline.
The fingerprinting detection threshold is based on signal count rather than entropy, so it may not capture subtler fingerprinting that needs fewer signals.
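
The last limitation can be made concrete with Shannon entropy, a common way to estimate how identifying a signal is. The sketch below is illustrative only and is not part of the paper's pipeline. The device values are hypothetical, and a real estimate would need values observed across a large device population.

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical values of three signals observed on eight devices.
observations = {
    "device_model": ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8"],
    "timezone": ["UTC", "UTC", "UTC", "UTC", "CET", "CET", "CET", "CET"],
    "os_name": ["android"] * 8,
}

entropy = {name: shannon_entropy_bits(vals) for name, vals in observations.items()}
for name, bits in entropy.items():
    print(f"{name}: {bits:.2f} bits")
```

In this toy population a single high-entropy signal separates all eight devices, while a signal that never varies contributes nothing. An SDK collecting a handful of high-entropy signals could therefore fall below a count-based threshold and still fingerprint effectively.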

Open questions / follow-ons

How effective would dynamic analysis or combined static-dynamic approaches be at detecting stealthy fingerprinting behaviors beyond static taint analysis?
What privacy-preserving alternatives exist for SDKs requiring fingerprinting signals for legitimate anti-fraud, analytics, or security reasons?
Can richer automated classification methods reliably distinguish advertising SDKs from security or analytics SDKs based on fingerprinting behavior?
How do fingerprinting SDK prevalence and behavior compare across platform ecosystems, e.g. iOS vs Android?

Why it matters for bot defense

This work provides bot-defense engineers and CAPTCHA practitioners with a detailed understanding of the third-party SDK ecosystem that enables device fingerprinting within mobile apps. Since fingerprinting often underpins risk assessment and bot detection, comprehending which SDKs are likely to fingerprint devices, their market prevalence, and the diversity of signals collected is essential for designing effective defensive strategies. The paper highlights challenges in relying solely on permissions or API monitoring to prevent fingerprinting due to signal sparsity and diversity, emphasizing a need for more nuanced detection and policy enforcement mechanisms. Practitioners should be mindful that focusing anti-fingerprinting efforts only on ad SDKs will miss a substantial fraction of SDK-based fingerprinting occurring in diverse app categories and SDK types, including poorly documented or unclear SDK code. The data on high install counts of fingerprinting SDKs underscores the scale of exposure users have to fingerprinting and motivates integrating SDK behavior profiling into comprehensive bot-defense tooling. However, since no dynamic or real-time evaluation was performed, practitioners should consider complementing static approaches with runtime monitoring for robust detection in production systems.

Cite

bibtex

@article{arxiv2506_22639,
  title={Fingerprinting SDKs for Mobile Apps and Where to Find Them: Understanding the Market for Device Fingerprinting},
  author={Michael A. Specter and Mihai Christodorescu and Abbie Farr and Bo Ma and Robin Lassonde and Xiaoyang Xu and Xiang Pan and Fengguo Wei and Saswat Anand and Dave Kleidermacher},
  journal={arXiv preprint arXiv:2506.22639},
  year={2025},
  url={https://arxiv.org/abs/2506.22639}
}
