To Unpack or Not to Unpack: Living with Packers to Enable Dynamic Analysis of Android Apps
Source: arXiv:2509.16340 · Published 2025-09-19 · By Mohammad Hossein Asghari, Lianying Zhao
TL;DR
This paper tackles a practical failure mode in Android security analysis: modern commercial packers increasingly break the usual workflow of unpacking an app and then instrumenting the recovered code with tools like Frida or Xposed. The authors argue that unpacking is no longer a reliable general solution because commercial packers have evolved anti-unpacking and anti-instrumentation defenses, and because many apps now rely on native code or on-demand loading that makes dumped Dex incomplete or unusable. Their alternative is Purifire, a kernel-level evasion engine built on eBPF that tries to keep the packed app intact while selectively hiding or neutralizing the analyst’s presence so dynamic analysis can proceed in situ.
The paper combines two pieces of work: a prevalence study over 12,341 real-world apps and a systems paper on a new bypass framework. The prevalence study shows that packing is not niche in the Chinese app ecosystem, where a majority of apps (58.8%) are packed, and that unpacking tools are widely ineffective against the modern packers they sampled. Purifire’s evaluation is more qualitative than benchmark-style, but the authors report that it can bypass common anti-analysis checks and meaningfully improve the yield of prior runtime analysis projects, such as detecting more device fingerprints and other items that were previously suppressed by packer defenses.
Key findings
- On the full 12,341-app corpus, 4,735 apps were detected as packed (38.4% overall); the effect is heavily skewed by market, with 4,652/7,913 Chinese apps (58.8%) packed versus 83/4,428 non-Chinese apps (roughly 2%).
- The authors say this is the first large-scale prevalence study of Android packers at this scale, and they split the corpus into 7,913 Chinese apps from 360 App Store and 4,428 non-Chinese apps from APKPure.
- Among 70 apps available in both stores, 41 shared the same package name but behaved differently in runtime anti-analysis tests, suggesting region-specific variants or version skew matters for packer study.
- Youpk failed to dump Dex for 86% of 120 Ijiami-packed apps even after extending the wait beyond the recommended duration.
- DexHunter, tested on a second group of 20 packed apps, failed entirely on the selected modern samples.
- BlackDex crashed or was otherwise unable to handle modern Dex release patterns on the second app group, while frida-dexdump was detected by anti-Frida checks and terminated before completion.
- Purifire is reported to bypass packer anti-analysis checks and to increase the number of detected artifacts in prior studies, including device fingerprints, though the paper excerpt does not provide the exact deltas for those downstream studies.
Threat model
The adversary is the packer embedded in the Android app, including its anti-debugging, anti-instrumentation, anti-emulation, anti-unpacking, and environment-check logic. The packer may inspect files, processes, memory maps, timing, network state, Java framework APIs, or native-level signals to detect analysis tools such as Frida, Xposed, JDB, ptrace-based debuggers, or unpackers. What the adversary cannot do, in the paper’s assumed model, is directly control the analyst’s kernel-level eBPF enforcement layer or see through the kernel-resident evasion rules without triggering the same OS-level abstractions they rely on; the analyst also lacks source code and packer internals.
Methodology — deep read
Threat model and assumptions: the paper assumes a security analyst who has only the packed APK and a device/runtime environment, but not source code or packer internals. The adversary is the packer/service embedded in the app, which may include anti-debugging, anti-instrumentation, anti-emulation, file/process/memory checks, and packer-specific unpacker fingerprints. The packer’s goal is to detect analysis and either terminate, degrade, or hide behavior; the analyst’s goal is to observe runtime behavior without fully unpacking the application. The paper also treats app developers and users as stakeholders: developers may legitimately want anti-tamper protection, while users and analysts may want visibility into suspicious or privacy-invasive behavior. Purifire is explicitly positioned as a “live with the packer” approach rather than a defeating/unpacking approach.
Data provenance and collection: the prevalence study uses 12,341 apps collected between June and August 2024 from two sources: 7,913 apps from the 360 App Store and 4,428 apps from APKPure. The authors split Chinese vs non-Chinese apps because the distribution ecosystems and protection practices differ substantially. Collection was automated in two phases: Selenium scraped download URLs from both sites, and File Centipede handled APK downloading. They note 70 apps existed in both sources, and 41 of those had the same package name but differed in runtime behavior, which they interpret as either version differences or market-specific variants. For categorization, they translated Chinese category names and mapped them onto 29 predefined non-Chinese categories; they explicitly call this a simple mapping and point to more sophisticated categorization as future work.
Architecture / algorithm: the paper has two technical parts. First, it builds a taxonomy of anti-analysis techniques, organized by underlying detection principle rather than just surface category. Table 1 groups checks into file-based, activity-based, memory-based, timer-based, network-based, Java/framework, and misc/native-level principles, with examples such as TracerPid, /proc/self/maps, frida-server paths, JDWP checks, TLS pinning, ptrace, and signal tricks. That taxonomy is not just descriptive; it is intended to guide evasion-rule design. Second, Purifire uses eBPF as a kernel-resident enforcement and observability layer. The key idea is that eBPF can see and influence events in the kernel/userspace boundary with low visibility to the target app, allowing the analyst to define rules that make anti-analysis probes return benign results or otherwise fail open. The paper says the implementation is shipped as a statically linked eBPF program for arm64 Android using Aya-Rust and eBPF features like maps, BTF, kprobes/uprobes, and CO-RE. From the text provided, the exact rule language, interception points, and how each anti-analysis principle maps to specific eBPF actions are not fully detailed, so that part remains somewhat opaque in the excerpt.
Training regime and concrete example: this is not a machine-learning paper, so there is no training loop, optimizer, or epoch schedule. The closest thing to a concrete end-to-end workflow is the unpacker evaluation and the runtime anti-analysis evaluation. For unpackers, they randomly selected 120 Ijiami-packed apps for one group and 20 arbitrary packed apps for another. They then tested ART-based unpackers such as Youpk and DexHunter, and memory-based unpackers such as BlackDex, frida-dexdump, and KissKiss. For example, on the 120-app Ijiami subset, Youpk tried to hook Dex loading points such as DexClassLoader/BaseDexClassLoader, but still failed to dump Dex on 86% of apps. This matters because the dumped output, even when obtained, was often incomplete or dependent on native interactions, so the recovered Dex would not support the analyst’s original goal of dynamic runtime observation. For runtime anti-analysis, they used Frida in both attach and spawn modes and JDB to probe whether the packed apps would detect or react to instrumentation/debugging. The excerpt indicates they also tested the practical impact on previous academic projects that depended on runtime hooking, but the detailed per-project experimental design and metrics are not fully shown here.
Evaluation protocol and reproducibility: evaluation is split into prevalence, unpacker robustness, and Purifire-assisted runtime analysis. The prevalence work uses detector disagreement resolution between APKiD and NP-Manager, with manual inspection of shared libraries and packer-name mapping when labels conflicted. The unpacker study is intentionally limited: they did not test every app in the 12,341-app corpus because the process was time-consuming; instead, they sampled 120 and 20 apps for deeper unpacker experiments. Metrics are mostly prevalence counts, unpacking success/failure, and qualitative ability to bypass anti-analysis checks; the excerpt does not mention formal statistical tests, confidence intervals, or cross-validation. Reproducibility is partial: the paper describes the tools, dataset sources, and general methodology, but the excerpt does not confirm whether the dataset, code, or Purifire rules are publicly released. The paper’s strongest reproducibility anchor is its taxonomy and the enumerated packer/unpacker tool list in Tables 1 and 2, but exact implementation details for Purifire remain under-specified in the provided text.
Technical innovations
- A taxonomy of Android anti-analysis techniques organized by detection principle, meant to translate directly into evasion rules rather than just classify symptoms.
- Purifire, a kernel-level eBPF-based evasion framework that aims to preserve the packed app intact while selectively defeating anti-analysis checks at runtime.
- A large-scale prevalence study of commercial Android packers across 12,341 real-world apps, split across Chinese and non-Chinese app stores to reflect ecosystem differences.
- An empirical demonstration that older Android unpackers no longer reliably recover usable apps against modern commercial packers, especially those with dynamic loading and anti-instrumentation defenses.
Datasets
- 360 App Store corpus — 7,913 apps — scraped June-August 2024
- APKPure corpus — 4,428 apps — scraped June-August 2024
- Combined prevalence dataset — 12,341 apps total — assembled from 360 App Store and APKPure
- Ijiami-packed evaluation subset — 120 apps — sampled from the dataset
- Arbitrary packed-app evaluation subset — 20 apps — sampled from the dataset
Baselines vs proposed
- Packed-app prevalence (APKiD and NP-Manager combined): 4,735/12,341 apps packed (38.4% overall) vs the proposed analysis, which shows the overall rate is driven almost entirely by market skew
- Chinese app packer prevalence: 4,652/7,913 = 58.8% packed vs non-Chinese apps: 83/4,428 = 2% packed
- Youpk: Dex dump success = 14% on 120 Ijiami-packed apps vs the proposed alternative of analyzing such apps in place via Purifire
- DexHunter: success = 0% on the selected 20 modern packed apps vs proposed runtime evasion with Purifire
- BlackDex: unable to handle modern Dex release patterns / crashed on the second app group vs proposed kernel-level evasion that avoids unpacking
- frida-dexdump: app terminated before completion due to anti-Frida detection vs proposed low-profile eBPF-based bypass
Limitations
- The unpacker evaluation is sampled, not exhaustive: only 120 Ijiami-packed apps and 20 arbitrary packed apps were tested, so the failure rates may not generalize perfectly to all packers or app types.
- The paper excerpt does not provide exact Purifire performance numbers for bypass success, latency, overhead, or false-positive/false-negative rates of the evasion rules.
- The prevalence analysis depends on APKiD and NP-Manager, both of which have database and labeling limitations; the authors had to manually reconcile packer-name mappings.
- The dataset is biased toward two app stores and toward a June-August 2024 collection window, so prevalence could drift over time or differ on Google Play or niche stores.
- The excerpt does not describe a formal adversarial evaluation against packer updates, so the durability of Purifire against future packer changes remains uncertain.
- The paper focuses on anti-analysis visibility rather than on whether the resulting observations are semantically complete; bypassing a check does not guarantee full behavioral coverage.
Open questions / follow-ons
- How resilient is an eBPF-based evasion layer when packers start probing kernel behavior, syscall timing, or eBPF-related artifacts more aggressively?
- What is the overhead of Purifire in real interactive analysis sessions, especially on lower-end arm64 devices, and how does it affect timing-sensitive apps?
- Can the anti-analysis-principle taxonomy be turned into a portable rule compiler that emits safe eBPF policies automatically from analyst intent?
- How well does the approach generalize to apps with heavier native code, multi-stage loaders, or packers that migrate checks into proprietary kernels or trusted execution environments?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, the paper is a reminder that client-side instrumentation is increasingly brittle against modern protection layers. If your workflow relies on Frida-style hooks, runtime property overrides, or debugger attachment to understand challenge flows, packers can suppress or distort exactly the signals you care about. The practical lesson is to separate visibility from unpacking: in some cases, kernel-level observability or device-side policy enforcement may be a better fit than trying to reconstruct a clean binary first.
It also has a defensive angle. The taxonomy of anti-analysis principles is useful for understanding how apps may hide bot logic, risk scoring, or anti-abuse gates from auditors. If you are evaluating an app that ships with commercial protection, you should expect checks against emulators, overlays, JDWP, Frida, rooted environments, and timing anomalies. Purifire’s design suggests a path for analysts to study those checks in place, but it does not remove the need to validate that the observed behavior is still representative once defenses are neutralized. In other words, it improves access, not trust.
Cite
@article{arxiv2509_16340,
  title={To Unpack or Not to Unpack: Living with Packers to Enable Dynamic Analysis of Android Apps},
  author={Mohammad Hossein Asghari and Lianying Zhao},
  journal={arXiv preprint arXiv:2509.16340},
  year={2025},
  url={https://arxiv.org/abs/2509.16340}
}