FedAttr: Towards Privacy-preserving Client-Level Attribution in Federated LLM Fine-tuning

Source: arXiv:2605.06596 · Published 2026-05-07 · By Su Zhang, Junfeng Guo, Heng Huang

TL;DR

This paper addresses client-level data attribution in federated fine-tuning of large language models (LLMs), where multiple clients collaboratively fine-tune a shared model and secure aggregation (SA) keeps each client’s individual update hidden from the server. Existing watermark radioactivity tests can detect whether a global model was trained on watermarked data, but they cannot identify which clients used the watermarked documents, because SA hides individual updates. FedAttr introduces a protocol that identifies watermarked clients while preserving SA’s privacy guarantees and FL training performance. It estimates each client’s update from paired subset SA queries that differ in the presence of the target client, scores the estimates with differential watermark detection to cancel the bias from watermark signal accumulated in the global model, and combines per-round scores via Stouffer’s method to amplify the signal across rounds.

The authors provide theoretical guarantees proving that the estimator is unbiased with bounded variance, that error probabilities decay exponentially with the number of rounds, and that mutual information leakage is bounded by O(d*/N) per round, where d* is the effective dimension and N is the subset size. Empirically, FedAttr achieves a 100% true positive rate and a 0% false positive rate in client attribution across two watermark types and two FL algorithms under SA constraints, outperforming all baselines by at least 44.4% in TPR or 19.1% in FPR, with only a 6.3% overhead to FL training time. Ablations confirm robustness to protocol parameters, with some TPR degradation only under severe non-IID heterogeneity. This work bridges watermark-based data ownership protection and privacy-preserving federated learning.

Key findings

FedAttr achieves 100% true positive rate (TPR) and 0% false positive rate (FPR) in client-level watermark attribution under secure aggregation, across two watermark families and two FL aggregation strategies (FedIT, FLoRA).
It outperforms baselines by at least 44.4% in TPR or 19.1% in FPR, despite baselines violating SA privacy assumptions or having high FPR (e.g., direct scoring yields 57% FPR).
The per-round client update estimator is unbiased with variance bounded by non-target client update covariance, shown in Theorems 1 and 2.
False negative and false positive attribution error rates decay exponentially with number of communication rounds T, according to Theorem 3, enabled by Stouffer score aggregation.
Mutual information leakage per round about client updates is bounded by O(d*/N), preserving privacy despite multiple SA subset queries (Theorem 4).
Analysis shows differential scoring reduces accumulated watermark bias in global model, dropping FPR from 57% to near zero.
FedAttr overhead is only 6.3% relative to federated training time (including SA queries and watermark scoring), with computation overlapped on the server side.
Ablation studies confirm robustness for varying number of watermarked clients, subset sizes, query counts, LoRA ranks, datasets, and FL non-IID heterogeneity, with some TPR degradation only under severe heterogeneity.
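To make the exponential-decay claim concrete, here is a back-of-the-envelope version under a simplifying assumption that is not taken from the paper: the per-round differential scores z_t^i of a watermarked client are independent unit-variance Gaussians with mean μ > 0. The symbols μ and Q (the standard normal tail function) are introduced here for illustration, γ denotes the decision threshold, and the conditions and constants of Theorem 3 may differ. The sketch covers only the false-negative side.

```latex
Z_i = \frac{1}{\sqrt{T}} \sum_{t=1}^{T} z_t^i
\quad\Longrightarrow\quad
Z_i \sim \mathcal{N}\!\left(\mu\sqrt{T},\, 1\right)

\Pr[\text{miss}] = \Pr[Z_i < \gamma]
  = Q\!\left(\mu\sqrt{T} - \gamma\right)
  \le \exp\!\left(-\tfrac{1}{2}\left(\mu\sqrt{T} - \gamma\right)^{2}\right)
  \quad \text{for } \mu\sqrt{T} \ge \gamma
```

The exponent grows linearly in T, so under this toy model even a small per-round mean shift is eventually detected once enough rounds are combined.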

Threat model

The setting is honest-but-curious: clients faithfully follow the federated learning protocol, but some may have trained on watermarked documents in violation of license terms. The server and corpus owner may attempt to infer which clients used watermarked data, but secure aggregation prevents them from accessing individual client updates directly. The server sees only sums of client updates over subsets that meet the SA threshold N_sa, so what it learns about any single client is limited to a bounded mutual information leakage. The corpus owner holds the watermark detection key but does not see client updates or queries. The threat model excludes malicious or Byzantine clients and assumes no collusion to reveal updates.
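The server-side view described above can be pictured with a toy interface that only ever returns subset sums and refuses subsets smaller than the threshold. This is a minimal sketch, not a real secure aggregation protocol: it performs no cryptography, and the class name `ThresholdedAggregator` and its methods are hypothetical names chosen for illustration.

```python
import numpy as np


class ThresholdedAggregator:
    """Toy stand-in for secure aggregation (hypothetical, no cryptography).

    The caller can only obtain sums over subsets of clients, and only when
    the subset has at least `n_sa` members.
    """

    def __init__(self, updates, n_sa):
        # Per-client updates, one row per client; treated as hidden state.
        self._updates = np.asarray(updates, dtype=float)
        self.n_sa = n_sa

    def subset_sum(self, members):
        members = sorted(set(members))
        if len(members) < self.n_sa:
            raise ValueError(
                f"subset of size {len(members)} is below the SA threshold {self.n_sa}"
            )
        return self._updates[members].sum(axis=0)


# Usage: four clients with 4-dimensional updates and a threshold of 3.
agg = ThresholdedAggregator(np.eye(4), n_sa=3)
total = agg.subset_sum([0, 1, 2])  # allowed: three members
# agg.subset_sum([0]) would raise ValueError: a single update is never exposed.
```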

Methodology — deep read

Threat model and assumptions: The setting is an honest-but-curious FL system with K clients, a server that coordinates federated fine-tuning over T rounds using secure aggregation (SA), and a corpus owner who holds the watermark detection key. An unknown subset of clients trains on watermarked documents in violation of data license terms. All clients follow the protocol faithfully, but the server and corpus owner may be curious and try to infer which clients used watermarked data, within the privacy guarantees of SA. The server sees only secure aggregates (subset sums) of client updates, never individual updates. The corpus owner receives only estimates from the server; it never sees client updates directly and does not know the queries.
Data and experimental setup: Experiments use LLaMA 3B fine-tuned on UltraChat200k, partitioned IID across K=10 clients, with T=5 communication rounds. Watermarks from two families are used: KGW (green-token bias watermark) and Fictitious Knowledge (QA-based watermark). Two FL aggregation strategies (FedIT and FLoRA) are evaluated. Evaluation metrics are TPR, FPR, and statistical significance (p-values) with baselines including global model watermark test, direct oracle with plaintext updates, FLDetector and FLForensics.
FedAttr architecture / algorithm: FedAttr proceeds in three main steps. (i) Client-level update estimation: in each round the server generates M paired subset queries for each client i, one subset including client i and one excluding it, with both subsets large enough (size N or N+1) to satisfy the SA threshold. The difference of the corresponding subset sums, averaged over the M pairs, yields an unbiased estimator Δ̂_t^i of client i’s update with bounded variance. A rejection sampling condition rejects subsets that provide poor masking. (ii) Differential scoring: the corpus owner applies the watermark detector SCORE(w; P_t) to the estimated client model (the global model plus Δ̂_t^i) and subtracts the global model’s score to remove accumulated watermark bias. The resulting differential score z_t^i reflects the per-round contribution of client i’s update to the watermark. (iii) Cross-round aggregation: the differential scores z_t^i are combined via Stouffer’s method to amplify the weak per-round signals, and the aggregated statistic Z_i is thresholded to decide whether client i is watermarked.
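Step (i) can be sketched in a few lines of NumPy. The construction below is an assumption made for illustration, not the paper’s exact scheme: each pair uses U = {i} ∪ A and V = B, where A and B are independent uniform size-N subsets of the other clients, so the non-target contributions cancel in expectation. The rejection rule (skip pairs with A = B, which would expose the update exactly) is a simplified placeholder for the paper’s masking condition, and `subset_sum` stands in for one SA query.

```python
import numpy as np


def subset_sum(updates, members):
    """Stand-in for one SA query: returns only the sum over `members`."""
    return updates[sorted(members)].sum(axis=0)


def estimate_client_update(updates, target, n, m, rng):
    """Average of m paired subset-sum differences for client `target`.

    Assumed construction: U = {target} | A and V = B, with A and B drawn
    uniformly (size n) from the non-target clients.
    """
    others = [j for j in range(updates.shape[0]) if j != target]
    if not 0 < n < len(others):
        raise ValueError("need 0 < n < number of non-target clients")
    diffs = []
    while len(diffs) < m:
        a = set(rng.choice(others, size=n, replace=False).tolist())
        b = set(rng.choice(others, size=n, replace=False).tolist())
        if a == b:
            # Placeholder rejection rule: identical masks would reveal
            # the target update exactly, so this pair is discarded.
            continue
        diffs.append(subset_sum(updates, a | {target}) - subset_sum(updates, b))
    return np.mean(diffs, axis=0)


# Usage with the summary's default sizes (K=10, N=5, M=5) on random data.
rng = np.random.default_rng(0)
updates = rng.normal(size=(10, 4))
delta_hat = estimate_client_update(updates, target=3, n=5, m=5, rng=rng)
```

Because A and B have the same size and the same distribution, their sums have equal expectation, which is what makes the difference an unbiased estimate of the target update; the leftover noise comes entirely from the non-target clients.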
Training regime: FL fine-tuning runs for T=5 rounds with default parameters (M=5 paired queries, N=5 subset size, K=10 clients). LoRA adapters are used for computational efficiency. Results are averaged over 3 random seeds. Training hardware is not specified.
Evaluation protocol: Metrics are TPR (the fraction of watermarked clients correctly flagged) and FPR (the fraction of benign clients falsely flagged). Baselines use either plaintext updates or the global model only. Statistical significance (p-values) is derived from the Stouffer scores. Ablations vary the number of watermarked clients r, subset size N, query count M, watermark ratio, LoRA rank, dataset, and non-IID heterogeneity. Scalability is also tested by increasing K from 10 to 100.
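The two metrics are simple set ratios. The helper below is a minimal sketch; the function name `attribution_rates` and the example client sets are illustrative and not taken from the paper.

```python
def attribution_rates(flagged, watermarked, num_clients):
    """Return (TPR, FPR) for a set of flagged client indices.

    TPR = flagged watermarked clients / all watermarked clients
    FPR = flagged benign clients / all benign clients
    """
    flagged, watermarked = set(flagged), set(watermarked)
    benign = set(range(num_clients)) - watermarked
    tpr = len(flagged & watermarked) / len(watermarked) if watermarked else float("nan")
    fpr = len(flagged & benign) / len(benign) if benign else float("nan")
    return tpr, fpr


# Illustrative case: 10 clients, clients 0-2 watermarked.
perfect = attribution_rates({0, 1, 2}, {0, 1, 2}, num_clients=10)
noisy = attribution_rates({0, 5}, {0, 1, 2}, num_clients=10)
```

In the illustrative case, flagging exactly clients 0 to 2 gives a TPR of 1 and an FPR of 0, while flagging clients 0 and 5 gives a TPR of 1/3 and an FPR of 1/7.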
Reproducibility: A code or dataset release is not explicitly mentioned, but many details and hyperparameters are described. The watermark families come from published prior work. Full theoretical proofs are in the appendix. Empirical results are averaged over 3 seeds and reported with standard deviations.

Concrete example: Consider round t and client i. The server samples M subset pairs: each U in U_i^N (including i) and each V in V_i^N (excluding i). After checking that the masking coefficient sums satisfy the rejection criterion, it queries the SA interface for the sums of client updates over each subset U and V. It averages the differences S_t(U) - S_t(V) to form the estimator Δ̂_t^i. The corpus owner scores w_{t-1} + Δ̂_t^i against w_{t-1} using the watermark detector on prompt set P_t and produces the differential score z_t^i. After T=5 rounds, the owner aggregates {z_t^i} by Stouffer’s formula and decides whether client i is watermarked using the threshold γ=4. The procedure thus attributes watermark use from unbiased update estimates while preserving SA’s privacy guarantees.
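The corpus owner’s side of this example can be sketched as follows. Everything here is a simplified assumption: the linear `toy_score` detector and the `key` vector are hypothetical stand-ins for a real watermark detector, the per-round scores are treated as already normalized z-scores, and the strict `>` comparison against γ is a guess at the decision rule.

```python
import math

import numpy as np


def differential_score(score, w_prev, delta_hat, prompts):
    """Score of the estimated client model minus the global model's score."""
    return score(w_prev + delta_hat, prompts) - score(w_prev, prompts)


def stouffer(z_scores):
    """Stouffer combination: sum of z-scores divided by sqrt(count)."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def is_watermarked(z_scores, gamma=4.0):
    """Flag the client when the combined statistic exceeds gamma."""
    return stouffer(z_scores) > gamma


# Hypothetical linear detector: projection onto a secret key direction.
key = np.array([1.0, 0.0])


def toy_score(w, prompts):
    return float(w @ key)


# The global model already carries accumulated watermark signal (10.0 along key).
w_prev = np.array([10.0, 3.0])
benign_delta = np.array([0.0, 0.5])
marked_delta = np.array([2.0, 0.5])

direct_benign = toy_score(w_prev + benign_delta, None)               # high despite a benign update
z_benign = differential_score(toy_score, w_prev, benign_delta, None)  # bias cancels
z_marked = differential_score(toy_score, w_prev, marked_delta, None)
```

With a per-round score of 2.0 over five rounds, the combined statistic is 10/√5 ≈ 4.47, which clears γ=4 even though no single round does. The benign client’s direct score is inflated by the global model’s accumulated signal, while its differential score is zero, which is the bias cancellation the text describes.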

Technical innovations

Paired-subset-difference mechanism that constructs unbiased estimators of individual client updates from secure aggregation subset sums alone, without access to any plaintext individual update.
Differential scoring method subtracting global model watermark scores from estimated client updates to cancel accumulated watermark bias and isolate each client's signal.
Cross-round Stouffer combination of per-round differential scores that accumulates weak watermark signals and drives attribution error rates exponentially down with number of rounds.
Theoretical analysis establishing unbiasedness, variance bounds, exponential error decay, and an O(d*/N) bound on mutual information leakage under multi-query SA.
A rejection sampling condition on subset query design ensuring sufficient masking noise to protect privacy without compromising estimator quality.

Datasets

UltraChat200K — 200,000 conversations — publicly described in Ding et al. (2023)
LLM pretraining corpora for watermark generation (KGW and Fictitious Knowledge) are from prior published watermarking works (Kirchenbauer et al., 2023; Cui et al., 2025).

Baselines vs proposed

Global model test (no client attribution): TPR = 55.6-100%, FPR = 57.1% across watermarks and FL algorithms
Direct (oracle with plaintext updates, violates SA): TPR = 55.6-100%, FPR = 57.1-71.4%
FLDetector (Zhang et al. 2022, requires plaintext updates): TPR = 0%, FPR = 14.3-23.8%
FLForensics (Jia et al. 2024, requires plaintext updates): TPR = 0-100%, FPR = 0-23.8%
FedAttr (ours, preserves SA): TPR = 100%, FPR = 0%, p < 10^-6; achieves perfect attribution under privacy constraints
FedAttr outperforms all baselines by at least +44.4% TPR or -19.1% FPR while preserving secure aggregation.

Limitations

Experiments are conducted on a relatively small number of clients (K=10), with scaling up to 100 clients explored only in the appendix; scalability to very large federations is not thoroughly demonstrated.
Non-IID data heterogeneity settings cause some decrease in attribution TPR (e.g., down to 67% at α=0.1), indicating sensitivity to strong client data distribution shifts.
Assumes honest-but-curious clients that do not deviate from protocol; robustness to adversarial or Byzantine clients not analyzed.
Privacy guarantees are quantified only as mutual information leakage bounds; no formal differential privacy guarantee is given.
Requires multiple SA queries per client per round (M=5) increasing communication and computational overhead modestly.
Watermark detectors must be available as black-box scoring functions; reliance on existing watermark methods limits adaptation to new watermark types.
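For a rough sense of the extra query load, the count below assumes that every subset pair costs two SA queries, that no subset sum is reused across clients, and that rejected samples are discarded before any query is issued. These are assumptions of this sketch; the paper reports the overhead only as a share of training time.

```python
def sa_queries(num_clients, rounds, pairs_per_client, queries_per_pair=2):
    """Total SA subset queries if each client gets its own pairs every round."""
    return num_clients * rounds * pairs_per_client * queries_per_pair


# Default setup from the summary: K=10 clients, T=5 rounds, M=5 pairs.
total_queries = sa_queries(num_clients=10, rounds=5, pairs_per_client=5)
per_round = sa_queries(num_clients=10, rounds=1, pairs_per_client=5)
```

Under these assumptions the default setup issues 100 subset queries per round and 500 over the whole run, and the count grows linearly in each of K, T, and M.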

Open questions / follow-ons

How does FedAttr perform under stronger non-IID heterogeneity or asynchronous client participation scenarios, which are common in practical FL settings?
Can the privacy guarantees be strengthened from mutual information bounds to formal differential privacy with quantifiable privacy budgets?
How effective and efficient is FedAttr against malicious clients that attempt to evade watermark attribution or manipulate subset queries?
Can the FedAttr approach be generalized to other kinds of data ownership proofs beyond watermark radioactivity, or to non-text modalities?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, FedAttr provides a novel privacy-preserving forensic attribution method that can precisely identify which individual clients in a federated setup used proprietary or license-restricted data via watermark detection despite secure aggregation protections. This advances traceability and accountability in collaborative model training without sacrificing user privacy, critical in real-world compliance scenarios. The paired subset query and differential scoring techniques can inspire new approaches for client behavior attribution under privacy constraints in federated or distributed systems vulnerable to misuse. Moreover, the rigorous theoretical privacy analysis provides confidence in deploying attribution in privacy-sensitive environments.

Practitioners can consider FedAttr’s methodology as a blueprint for integrating client-level forensic signals into federated bot-detection models or CAPTCHA systems, where understanding participant contribution to model updates is crucial without exposing individual data. The low overhead and robustness to protocol parameters highlight its practicality for large-scale deployments that balance privacy and transparency.

Cite


@article{arxiv2605_06596,
  title={FedAttr: Towards Privacy-preserving Client-Level Attribution in Federated LLM Fine-tuning},
  author={Su Zhang and Junfeng Guo and Heng Huang},
  journal={arXiv preprint arXiv:2605.06596},
  year={2026},
  url={https://arxiv.org/abs/2605.06596}
}
