Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs
Source: arXiv:2605.06305 · Published 2026-05-07 · By Thomas Cory, Axel Küpper
TL;DR
This paper tackles the chronic labeled-data problem in automated HTTP traffic privacy audits, where existing PII detectors (SVMs, BERT-style classifiers) must be retrained whenever a label taxonomy changes and require expensive manual ground-truth labeling. The authors propose two complementary LLM-based systems: (1) a multi-stage annotation pipeline that accepts any PII taxonomy at runtime—without retraining—and identifies which PII types are present in an HTTP message body before extracting verbatim value spans; and (2) a paired synthetic HTTP traffic generator that produces plausible request/response pairs with pre-labeled PII values so that the annotation pipeline can be evaluated and few-shot-prompted without touching real user data.
The annotation pipeline decomposes the task into a classifier stage (which labels are present?), a targeted annotator stage (where exactly are those values?), and a reviewer stage (catch missed values and boundary errors). Each stage is wrapped in a shared harness that injects retrieved examples via RAG and validates/normalises LLM outputs against the taxonomy before passing them downstream. The synthetic generator is taxonomy-driven and LLM-agnostic, producing scenario templates first, then concrete HTTP requests, then optional responses, with per-stage validation enforcing verbatim value occurrence.
The authors evaluate across three taxonomies, AI4Privacy (53 classes), mHealth (38 classes), and PlayStore (38 classes), using entirely synthetic datasets that were manually validated. Results show high label-level F1 at the classifier stage (0.9860, 0.9858, and 0.9109 respectively) and strong instance-level fuzzy F1 for the two-stage annotator before review (0.9776, 0.9703, and 0.8625). The PlayStore taxonomy consistently scores lower, which the authors attribute to its coarser, category-level granularity making unambiguous value-level extraction harder. The reviewer stage does not improve results on the two-stage path; it slightly lowers fuzzy F1 for all three taxonomies (to 0.9747, 0.9647, and 0.8554), a finding the paper discusses but does not fully resolve.
Key findings
- Label-level F1 for the classifier stage reaches 0.9860 (AI4Privacy, 53 classes), 0.9858 (mHealth, 38 classes), and 0.9109 (PlayStore, 38 classes) on synthetic HTTP request corpora, with PlayStore's lower score attributed to coarser, category-level label granularity.
- Two-stage annotator (classify-then-extract) outperforms single-stage annotator on fuzzy instance-level F1 for AI4Privacy (0.9776 vs 0.9720) and mHealth (0.9703 vs 0.9593), but for PlayStore it is marginally lower (0.8625 vs 0.8633 fuzzy, 0.8012 vs 0.8061 exact), suggesting decomposition helps most when the taxonomy is fine-grained.
- The reviewer stage does not improve performance on the two-stage path: fuzzy F1 drops from 0.9776 to 0.9747 for AI4Privacy, from 0.9703 to 0.9647 for mHealth, and from 0.8625 to 0.8554 for PlayStore (PlayStore exact F1 falls from 0.8012 to 0.7973), indicating that the correction mechanism introduces slightly more errors than it fixes.
- Exact-match instance-level F1 is lower than fuzzy F1 across all taxonomies and configurations, with a marked gap for AI4Privacy (two-stage annotator: 0.9438 exact vs 0.9776 fuzzy) and PlayStore (0.8012 vs 0.8625) and a small one for mHealth (0.9658 vs 0.9703). This suggests that boundary selection and minor normalization artifacts, rather than label misidentification, are the dominant remaining error source.
- The synthetic generator achieves full label coverage across all three taxonomies using 35 scenarios (AI4Privacy), 25 scenarios (mHealth), and 29 scenarios (PlayStore), each capped at 100 HTTP requests per scenario, with manual validation of scenario definitions, requests, and annotations.
- RAG-based example injection serves three empirically described purposes: anchoring label vocabulary, enabling lightweight domain adaptation without fine-tuning, and mitigating cold-start for novel labels—though quantitative ablations on RAG contribution are not reported separately in the truncated text.
- Output validation with cosine-similarity-based label correction (equation 2, threshold-gated) resolves benign surface deviations such as capitalisation and hyphenation differences without requiring LLM re-queries, preventing error propagation across pipeline stages.
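The threshold-gated label correction in the last bullet can be illustrated with a small sketch. This is not the paper's equation 2: the summary gives neither the embedding model nor the threshold, so the character-trigram vectors, the function name `correct_label`, and the default threshold of 0.6 are all assumptions chosen only to keep the example dependency-free.

```python
from collections import Counter
from math import sqrt
from typing import Optional


def _trigrams(label: str) -> Counter:
    """Character-trigram counts of a label after lowercasing and dropping
    non-alphanumeric separators (assumed stand-in for a learned embedding)."""
    core = "".join(ch for ch in label.lower() if ch.isalnum())
    padded = f"  {core} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def correct_label(predicted: str, taxonomy: list[str],
                  threshold: float = 0.6) -> Optional[str]:
    """Map an LLM-produced label onto the taxonomy, or return None.

    Exact matches pass through. Otherwise the nearest taxonomy label is
    accepted only if its similarity reaches the (hypothetical) threshold;
    None signals "discard or re-query" to the caller.
    """
    if not taxonomy:
        return None
    if predicted in taxonomy:
        return predicted
    vec = _trigrams(predicted)
    best = max(taxonomy, key=lambda lab: _cosine(vec, _trigrams(lab)))
    return best if _cosine(vec, _trigrams(best)) >= threshold else None
```

With a taxonomy of `["EMAIL_ADDRESS", "AGE", "PHONE_NUMBER"]`, a surface variant such as `email-address` maps back to `EMAIL_ADDRESS`, while an unrelated string falls below the threshold and is rejected rather than silently coerced.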
Threat model
n/a: this is a privacy-audit tooling paper, not a security attack or defense paper. The implicit operational context is a privacy auditor with passive access to outbound HTTP traffic from web or mobile applications, seeking to detect whether PII is transmitted in compliance with GDPR/CCPA and similar regulations. The 'adversary' is the application's own data practices, not a malicious external actor. There is no model of evasion, adversarial inputs, or attacker knowledge. The pipeline deliberately avoids inferring PII that is not literally present in the payload; this is a design-time constraint rather than a security guarantee.
Methodology — deep read
The threat model and task framing assume a privacy auditor who intercepts outbound HTTP traffic from web or mobile applications and wants to detect explicitly transmitted PII values in message bodies. The adversary is not a malicious actor but rather the application itself inadvertently leaking data. The pipeline assumes the taxonomy is known at audit time (provided at runtime via prompt) but may differ across audits. There is no adversarial evasion modeled—the paper is not a security attack/defense paper but a labeling-assistance tool.
Data provenance is entirely synthetic. For each of three taxonomies, the authors ran their own synthetic HTTP generator to produce labeled corpora. Taxonomy 1 (AI4Privacy) uses the AI4Privacy pii-masking-200k label set as adapted by Manietti and Elia, 53 classes. Taxonomy 2 (mHealth) follows Cory et al. (2024)'s PHI scheme for mHealth apps, 38 classes. Taxonomy 3 (PlayStore) mirrors the Google Play Data Safety disclosure categories, 38 classes. Corpora were manually validated by the authors for structural well-formedness, taxonomy compliance, verbatim value occurrence, and semantic plausibility of label assignments. Datasets were split 80/20 with stratification enforcing full label coverage and approximate label-distribution preservation in both partitions. Exact corpus sizes are not stated in the truncated text. The 20% split serves as the test set; the 80% split's examples feed the RAG database for few-shot retrieval during inference.
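Two of the validation criteria named above, verbatim value occurrence and full label coverage, are mechanical enough to sketch in code. The function names and the `(label, value)` tuple layout are assumptions for illustration; the summary does not describe how the authors implemented these checks.

```python
Annotation = tuple[str, str]  # (label, value); layout is an assumption


def non_verbatim(body: str, annotations: list[Annotation]) -> list[Annotation]:
    """Return annotations whose value is empty or does not occur
    character-for-character in the message body."""
    return [(lab, val) for lab, val in annotations if not val or val not in body]


def uncovered_labels(taxonomy: set[str],
                     corpus: list[list[Annotation]]) -> set[str]:
    """Return taxonomy labels that no message in the corpus is annotated with."""
    seen = {lab for annotations in corpus for lab, _ in annotations}
    return taxonomy - seen
```

A corpus passes the coverage criterion when `uncovered_labels` returns an empty set for each partition of the split; a message passes the verbatim criterion when `non_verbatim` returns an empty list.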
The annotation pipeline has three LLM-powered stages, each wrapped by a shared harness. Stage 1 (Classifier): given HTTP message body m and taxonomy L, predicts a subset of labels Lm ⊆ L that have at least one explicitly present value in m—it does not infer PII from key names with empty values. Stage 2 (Annotator): given m and Lm, performs targeted NER over the normalized body, producing (data_type, value) tuples. Two configurations are evaluated: two-stage (classifier output feeds annotator) versus single-stage (annotator receives raw body only and does joint detect+extract). Stage 3 (Reviewer): receives original body plus the annotation set and performs targeted correction—adding missed values, correcting label assignments, adjusting value boundaries. Critically, the reviewer does not re-annotate from scratch; it conditions on prior output.
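The control flow of the two configurations can be sketched as plain function composition. In this sketch each stage is an arbitrary callable standing in for an LLM call; the type aliases, the `_valid` helper, and the rule-based `toy_classify` / `toy_extract` stand-ins are hypothetical and exist only to make the example runnable without a model.

```python
import json
from typing import Callable, Optional

Annotation = tuple[str, str]  # (data_type, value)
Classify = Callable[[str, list[str]], list[str]]
Extract = Callable[[str, list[str]], list[Annotation]]
Review = Callable[[str, list[Annotation]], list[Annotation]]


def _valid(annotations: list[Annotation], allowed: list[str]) -> list[Annotation]:
    """Harness-style check: drop annotations whose label is not allowed."""
    return [(lab, val) for lab, val in annotations if lab in allowed]


def annotate_two_stage(body: str, taxonomy: list[str], classify: Classify,
                       extract: Extract, review: Optional[Review] = None) -> list[Annotation]:
    """Annotate(m) = Extract(m, Classify(m)), optionally followed by review."""
    present = [lab for lab in classify(body, taxonomy) if lab in taxonomy]
    annotations = _valid(extract(body, present), present)
    if review is not None:
        # The reviewer conditions on prior output instead of starting over.
        annotations = _valid(review(body, annotations), taxonomy)
    return annotations


def annotate_single_stage(body: str, taxonomy: list[str], extract: Extract,
                          review: Optional[Review] = None) -> list[Annotation]:
    """Joint detect+extract: the annotator sees the whole taxonomy."""
    annotations = _valid(extract(body, taxonomy), taxonomy)
    if review is not None:
        annotations = _valid(review(body, annotations), taxonomy)
    return annotations


# Hypothetical rule-based stand-ins for the LLM stages (JSON bodies only).
_KEY_TO_LABEL = {"user_email": "EMAIL_ADDRESS", "age": "AGE"}


def toy_classify(body: str, taxonomy: list[str]) -> list[str]:
    fields = json.loads(body)
    # Keys with empty values are skipped: no PII is inferred from key names.
    return [lab for key, lab in _KEY_TO_LABEL.items()
            if lab in taxonomy and fields.get(key) not in (None, "")]


def toy_extract(body: str, labels: list[str]) -> list[Annotation]:
    fields = json.loads(body)
    return [(lab, str(fields[key])) for key, lab in _KEY_TO_LABEL.items()
            if lab in labels and fields.get(key) not in (None, "")]
```

Passing the stages in as callables keeps the taxonomy a runtime argument, which mirrors the taxonomy-agnostic design: swapping label schemes changes only the `taxonomy` list, not the code.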
The shared harness injects retrieved examples via two complementary RAG signals: structural similarity (SBERT embeddings over message bodies, cosine top-k from a vector DB of annotated examples) and label coverage (greedy selection ensuring each predicted label has at least one demonstration). Output validation checks JSON well-formedness, schema conformance, and taxonomy compliance. Invalid label strings are corrected via cosine nearest-neighbor to valid labels (equation 2) only when similarity exceeds a threshold; otherwise the annotation is discarded or flagged for re-query. Pre-processing normalizes message bodies via URL decoding (RFC 3986 percent-decoding in the style of Python's unquote_plus, which also maps '+' to a space), HTML entity resolution, and optional Base64 decoding with a conservative syntactic guard. Both original and normalized representations are retained for span mapping.
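The pre-processing steps can be sketched with the Python standard library. The summary does not specify the Base64 guard, so the one below (minimum length, length divisible by four, strict alphabet, UTF-8 decodable, printable result) is an assumed reading of "conservative syntactic guard"; the names `Normalized`, `maybe_b64decode`, and `normalize_body` are hypothetical.

```python
import base64
import binascii
import html
import re
from typing import NamedTuple, Optional
from urllib.parse import unquote_plus

_B64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


class Normalized(NamedTuple):
    raw: str   # original body, kept for span mapping
    text: str  # normalized body handed to the LLM stages


def maybe_b64decode(text: str, min_len: int = 16) -> Optional[str]:
    """Decode only if the whole string looks like Base64 and yields
    printable UTF-8 text; otherwise return None (assumed guard)."""
    candidate = text.strip()
    if (len(candidate) < min_len or len(candidate) % 4 != 0
            or not _B64_RE.fullmatch(candidate)):
        return None
    try:
        decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    printable = all(ch.isprintable() or ch in "\r\n\t" for ch in decoded)
    return decoded if printable else None


def normalize_body(raw: str) -> Normalized:
    """URL-decode, resolve HTML entities, then try guarded Base64 decoding."""
    text = html.unescape(unquote_plus(raw))
    decoded = maybe_b64decode(text)
    return Normalized(raw=raw, text=decoded if decoded is not None else text)
```

One caveat worth knowing before reusing this sketch: `html.unescape` also rewrites a few legacy entity names that lack a trailing semicolon, so a form field such as `&copy=` or `&not=` in the body would be altered. A production normalizer would likely need to guard against that.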
The evaluation protocol uses label-level multi-label presence F1 (set-based TP/FP/FN over unique labels per message) and instance-level F1 under both exact matching (label and value string identical) and fuzzy matching (SequenceMatcher similarity ≥ 0.8 with greedy one-to-one assignment, following Cory et al. 2026). Results are reported at four pipeline checkpoints: after classifier, after single-stage annotator, after two-stage annotator, and after reviewer for each path. This allows attribution of errors to classification, extraction, or review stages. No statistical significance tests are reported. No cross-validation is described; a single 80/20 split is used per taxonomy. The specific LLM model(s) used are not stated in the truncated text—this is a notable gap for reproducibility.
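The matching rules above translate into a short scoring sketch. Several details are assumptions because the summary does not state them: counts are pooled per message by the caller (micro-averaging), fuzzy candidates must share the same label, ties are broken by input order, and an empty gold and prediction set scores 1.0.

```python
from difflib import SequenceMatcher

Annotation = tuple[str, str]  # (label, value)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 1.0 when there is nothing to find."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def label_counts(gold: list[Annotation], pred: list[Annotation]) -> tuple[int, int, int]:
    """Set-based TP/FP/FN over the unique labels of one message."""
    g = {lab for lab, _ in gold}
    p = {lab for lab, _ in pred}
    return len(g & p), len(p - g), len(g - p)


def instance_counts(gold: list[Annotation], pred: list[Annotation],
                    fuzzy: bool = True, threshold: float = 0.8) -> tuple[int, int, int]:
    """TP/FP/FN with greedy one-to-one assignment of predictions to gold."""
    candidates = []
    for gi, (glab, gval) in enumerate(gold):
        for pi, (plab, pval) in enumerate(pred):
            if glab != plab:
                continue
            if gval == pval:
                sim = 1.0
            elif fuzzy:
                sim = SequenceMatcher(None, gval, pval).ratio()
            else:
                sim = 0.0
            if sim >= threshold:
                candidates.append((sim, gi, pi))
    # Highest similarity first; each gold and predicted item is used once.
    candidates.sort(key=lambda c: -c[0])
    used_gold, used_pred = set(), set()
    for _, gi, pi in candidates:
        if gi not in used_gold and pi not in used_pred:
            used_gold.add(gi)
            used_pred.add(pi)
    tp = len(used_gold)
    return tp, len(pred) - tp, len(gold) - tp
```

In this sketch a prediction that differs from the gold value only by surrounding quotes counts as a fuzzy match but an exact miss, which is the kind of boundary error that separates the two instance-level scores.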
To walk through a concrete end-to-end example: an HTTP POST request body like '{"user_email": "[email protected]", "age": 29}' is first URL-decoded and HTML-entity-resolved (no-ops here). The classifier receives this normalized body plus the full AI4Privacy taxonomy (53 labels) and predicts Lm = {EMAIL_ADDRESS, AGE}. The annotator then receives the body and only these two labels, extracting {(EMAIL_ADDRESS, '[email protected]'), (AGE, '29')}. The reviewer receives both the body and this annotation set, and may adjust it, for instance by correcting a boundary if the annotator included surrounding quotes. After each stage, output validation checks that EMAIL_ADDRESS and AGE are valid taxonomy strings before the result is passed downstream.
Reproducibility is limited: no model names, frozen weights, or code repository are mentioned in the available text; the synthetic datasets are generated rather than distributed, and full regeneration requires the generator plus manual validation.
Technical innovations
- A taxonomy-agnostic, multi-stage LLM annotation pipeline for HTTP traffic that switches label schemes at runtime by supplying the taxonomy in the prompt, in contrast to prior fixed-taxonomy detectors such as ReCon and BERT-based classifiers, which require retraining per taxonomy.
- Decomposed classify-then-extract architecture (Annotate(m) = Extract(m, Classify(m))) that conditions value extraction on a predicted label subset, reducing ambiguity in short identifiers and numeric values common in HTTP payloads—unlike GPT-NER's single-pass generation approach.
- A dual-signal RAG strategy combining structural similarity retrieval (SBERT embeddings over message bodies) with greedy label-coverage retrieval to ensure every predicted PII type has at least one demonstration, enabling cold-start adaptation for novel labels without fine-tuning.
- A cosine-nearest-neighbor output validator (equation 2) that corrects benign label surface deviations (capitalisation, hyphenation, minor paraphrase) against the target taxonomy at inference time, preventing error propagation without LLM re-queries.
- An LLM-based synthetic HTTP traffic generator that is taxonomy-driven and LLM-agnostic, producing scenario templates, concrete requests, and optional conditioned responses with verbatim-verified PII annotations to bootstrap labeled datasets under labelled-data scarcity—distinct from SPY (Savkin et al.) which targets PII detection benchmarks rather than network-traffic annotation pipelines.
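The dual-signal retrieval listed above can be sketched as a top-k similarity pass followed by a greedy coverage pass. Embeddings are taken as precomputed vectors so the example needs no model; the `Example` record, the `select_examples` name, and the tie-breaking rule (prefer the more similar example) are assumptions, and the paper's exact selection procedure may differ.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Example:
    body: str
    labels: frozenset[str]
    embedding: tuple[float, ...]  # e.g. a precomputed sentence embedding


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_examples(query_embedding: tuple[float, ...], predicted_labels: set[str],
                    store: list[Example], k: int = 2) -> list[Example]:
    """Pick few-shot demonstrations for one message.

    Signal 1: the k stored examples most similar to the query body.
    Signal 2: greedily add examples until every predicted label has a
    demonstration, or no remaining example covers a missing label.
    """
    ranked = sorted(store, key=lambda ex: cosine(query_embedding, ex.embedding),
                    reverse=True)
    selected = ranked[:k]
    remaining = ranked[k:]
    uncovered = set(predicted_labels)
    for ex in selected:
        uncovered -= ex.labels
    while uncovered and remaining:
        # max() returns the first maximum, so ties go to the more similar
        # example because `remaining` is already sorted by similarity.
        best = max(remaining, key=lambda ex: len(ex.labels & uncovered))
        if not best.labels & uncovered:
            break  # cold-start label: nothing in the store demonstrates it
        selected.append(best)
        remaining.remove(best)
        uncovered -= best.labels
    return selected
```

The coverage pass matters mainly for rare labels: a purely similarity-ranked top-k can return several near-duplicate demonstrations while leaving a predicted label without any example.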
Datasets
- AI4Privacy synthetic HTTP corpus — size not stated in available text — generated by authors' synthetic pipeline, manually validated, derived from AI4Privacy pii-masking-200k taxonomy (Manietti and Elia) with 53 classes, 35 scenarios
- mHealth synthetic HTTP corpus — size not stated in available text — generated by authors' synthetic pipeline, manually validated, derived from Cory et al. (2024) PHI taxonomy for mHealth apps with 38 classes, 25 scenarios
- PlayStore synthetic HTTP corpus — size not stated in available text — generated by authors' synthetic pipeline, manually validated, derived from Google Play Data Safety disclosure taxonomy with 38 classes, 29 scenarios
Baselines vs proposed
- Single-stage Annotator (AI4Privacy): F1_label=0.9857, F1_fuzzy=0.9720, F1_exact=0.9377 vs Two-stage Annotator: F1_label=0.9888, F1_fuzzy=0.9776, F1_exact=0.9438
- Single-stage Annotator (mHealth): F1_label=0.9843, F1_fuzzy=0.9593, F1_exact=0.9549 vs Two-stage Annotator: F1_label=0.9880, F1_fuzzy=0.9703, F1_exact=0.9658
- Single-stage Annotator (PlayStore): F1_label=0.9091, F1_fuzzy=0.8633, F1_exact=0.8061 vs Two-stage Annotator: F1_label=0.9184, F1_fuzzy=0.8625, F1_exact=0.8012
- Reviewer (two-stage, AI4Privacy): F1_fuzzy=0.9747, F1_exact=0.9400 vs Two-stage Annotator pre-review: F1_fuzzy=0.9776, F1_exact=0.9438 (reviewer slightly degrades performance)
- Reviewer (two-stage, mHealth): F1_fuzzy=0.9647, F1_exact=0.9597 vs Two-stage Annotator pre-review: F1_fuzzy=0.9703, F1_exact=0.9658 (reviewer slightly degrades performance)
- Reviewer (two-stage, PlayStore): F1_fuzzy=0.8554, F1_exact=0.7973 vs Two-stage Annotator pre-review: F1_fuzzy=0.8625, F1_exact=0.8012 (reviewer slightly degrades performance)
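The size of the reviewer effect is easier to judge as deltas. The short sketch below recomputes them from the figures listed above; the dictionary layout and the function name are illustrative only, and no numbers beyond those in the list are introduced.

```python
# (pre-review, post-review) F1 on the two-stage path, as listed above.
REPORTED = {
    "AI4Privacy": {"fuzzy": (0.9776, 0.9747), "exact": (0.9438, 0.9400)},
    "mHealth": {"fuzzy": (0.9703, 0.9647), "exact": (0.9658, 0.9597)},
    "PlayStore": {"fuzzy": (0.8625, 0.8554), "exact": (0.8012, 0.7973)},
}


def reviewer_deltas(reported: dict) -> dict:
    """Post-review minus pre-review F1, rounded to four decimals."""
    return {
        taxonomy: {metric: round(post - pre, 4) for metric, (pre, post) in scores.items()}
        for taxonomy, scores in reported.items()
    }


if __name__ == "__main__":
    for taxonomy, deltas in reviewer_deltas(REPORTED).items():
        print(taxonomy, deltas)
```

All six deltas are negative and each is smaller than 0.01 in magnitude, so on these figures the reviewer's effect is consistently in one direction but small.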
Limitations
- All evaluation data is synthetic and manually validated by the paper's own authors—there is no held-out real HTTP traffic corpus, so generalization to actual application traffic with encoding diversity, obfuscation, and non-standard serialization formats is undemonstrated.
- The specific LLM(s) used for all pipeline stages are not identified in the available text, making direct reproducibility and cost estimation impossible; results may be highly model-dependent and could degrade significantly on smaller or open-weight models.
- The reviewer stage consistently degrades performance across all three taxonomies on the two-stage path (e.g., AI4Privacy fuzzy F1 drops from 0.9776 to 0.9747), but the paper does not ablate why—whether this is prompt design, over-correction behavior, or confirmation bias—leaving the stage's utility unclear.
- No adversarial evaluation is conducted: applications that intentionally obfuscate PII (encoding, tokenization, splitting values across fields) are outside scope, which is a meaningful gap for real privacy audit deployments.
- Exact corpus sizes (number of HTTP requests per taxonomy) are not reported in the available text, making it difficult to assess statistical power of the evaluation or the cost of manual validation.
- RAG contribution is not ablated quantitatively in isolation—the paper describes its empirical value qualitatively but does not report F1 with RAG disabled, making it impossible to measure how much the retrieval mechanism contributes versus the base LLM capability.
- Full label coverage is reported for the three evaluated taxonomies, but the paper notes that 'perfect coverage is not always attainable for arbitrary taxonomies'; no per-label frequency figures are given, so it is unclear how many labels in each taxonomy were underrepresented in the evaluation.
Open questions / follow-ons
- Does the pipeline's performance hold on real, non-synthetic HTTP traffic with production-level encoding variability, compressed payloads, and non-standard serialization? A systematic comparison against a held-out real-traffic corpus is the most critical missing validation.
- What is the performance floor when the LLM is replaced with a smaller or locally deployable open-weight model (e.g., Llama-3, Mistral)? The current results are tied to an unspecified model, and cost/latency tradeoffs for production-scale traffic inspection are unexplored.
- Can the reviewer stage be redesigned, or removed, without performance loss? Its negative effect on the two-stage path across all three taxonomies suggests the current post-annotation correction mechanism introduces more errors than it resolves; understanding whether this is a prompt design issue or a fundamental limitation of self-correction in this task is important.
- How does the pipeline behave under intentional PII obfuscation (e.g., split values, encoded identifiers, salted hashes, tokenized card numbers)? This is the most practically relevant distribution shift for real privacy audits and is entirely unaddressed.
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, the most direct relevance is as a labeling-infrastructure primitive rather than a detection model. Teams that instrument client-side signals or intercept traffic for fraud/bot analysis routinely face the same labeled-data scarcity problem: every time the signal taxonomy evolves (new device fingerprint fields, new behavioral features, new risk categories), retraining supervised classifiers requires fresh labeled data. The pipeline described here demonstrates that an LLM can be prompted with a new taxonomy at runtime and produce high-quality annotations on HTTP payloads without retraining, which could accelerate the label-generation cycle for network-level bot signal corpora.
The synthetic traffic generator is also worth attention. Bot-defense teams often cannot share real traffic logs for internal tooling development due to privacy or legal constraints. An analogous LLM-driven generator scoped to bot-behavioral HTTP patterns (automation headers, WebDriver artifacts, headless browser fingerprints) could produce labeled synthetic traffic for training anomaly detectors or testing detection pipelines. The paper's finding that PlayStore's coarser taxonomy degrades performance suggests that practitioners with high-level categorical signal taxonomies (e.g., 'automated traffic' vs 'human traffic') may see weaker results than those with fine-grained behavioral taxonomies—a useful calibration signal before investing in this approach. The reviewer stage's consistent performance degradation is a practical caution: adding a self-correction loop does not reliably improve output quality and may not be worth the additional inference cost in latency-sensitive pipelines.
Cite
@article{arxiv2605_06305,
title={Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs},
author={Thomas Cory and Axel Küpper},
journal={arXiv preprint arXiv:2605.06305},
year={2026},
url={https://arxiv.org/abs/2605.06305}
}