PassREfinder-FL: Privacy-Preserving Credential Stuffing Risk Prediction via Graph-Based Federated Learning for Representing Password Reuse between Websites

Source: arXiv:2510.16083 · Published 2025-10-17 · By Jaehan Kim, Minkyoo Song, Minjae Seo, Youngjin Jin, Seungwon Shin, Jinwoo Kim

TL;DR

PassREfinder-FL tackles a practical problem that sits between pure detection and pure policy enforcement: how to predict which website pairs are likely to share reused passwords, so administrators can estimate credential-stuffing risk before an attack happens. The authors frame this as a graph link-prediction problem over websites, where an edge means the rate of cross-site password reuse exceeds a threshold. The main novelty is not the graph idea alone, but combining it with federated learning so multiple website administrators can collaborate without sharing raw passwords, usernames, or local graphs.

The paper’s core result is that a graph-based federated model can learn these relations at scale from breached-credential data. On a dataset of 360 million breached accounts from 22,378 websites, the FL version reports F1 = 0.9153. In their ablations, the proposed GNN design outperforms other state-of-the-art GNN variants by roughly 4–11%, and the authors argue that the output probabilities can be interpreted as actionable risk scores for website pairs. The work is positioned as an extension of their earlier PASSREFINDER system, replacing heuristic hidden relations with collaborative training across administrators.

Key findings

  • On a real-world dataset of 360 million breached accounts spanning 22,378 websites, PassREfinder-FL reports F1 = 0.9153 in the federated-learning setting.
  • The paper says the FL-based GNN improves over other state-of-the-art GNN variants by 4–11% in ablation comparisons.
  • Unlike the earlier PASSREFINDER design, the FL version removes the need to share password-reuse information or usernames across administrators.
  • The model treats password-reuse risk as a link-prediction problem on a website graph, where an edge is positive if the password reuse rate exceeds the threshold τ.
  • Five feature modalities are used per website: location, category, content, URL, and security posture; the authors claim these jointly improve prediction versus variants that use fewer modalities.
  • The system is designed to ingest new websites by extracting public website information and adding them as graph nodes, instead of requiring a fixed closed-world site set.
  • The predicted edge probabilities are validated as risk scores, meaning higher scores correspond to higher expected password-reuse likelihood between websites.

Threat model

The threat model centers on website administrators who do not fully trust one another and must not reveal user secrets, together with the privacy risk that collaborative modeling itself creates. The system assumes only public website information can be used for features, and that participants exchange model updates or node embeddings rather than raw passwords, usernames, or local graphs. It does not assume the attacker can directly access the protected credentials through the model pipeline, but the excerpt also does not analyze stronger FL attackers such as malicious clients, gradient inversion, or poisoning beyond the collaborative setting.

Methodology — deep read

Threat model and framing: the adversary is not a cryptanalytic attacker but the operational reality of untrusted website administrators and the privacy risks created when those administrators try to collaborate. The paper assumes administrators want to predict cross-admin password-reuse relations, but cannot safely share sensitive raw data such as passwords, usernames, or even detailed local graph information. The task is explicitly proactive: rather than detecting a credential-stuffing attempt after login, the system predicts which website pairs are risky because users are likely to reuse passwords across them. The label is defined by a binary threshold τ on the observed password-reuse rate between two websites; if the reuse rate exceeds τ, the pair is treated as a positive password-reuse relation.
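
The thresholded labeling rule can be sketched as follows. This is a minimal illustration under assumptions: the function names, the exact way reuse pairs are counted, and the default τ = 0.1 are hypothetical, since the excerpt does not specify them.

```python
def reuse_rate(shared_users: int, reusing_users: int) -> float:
    """Fraction of users common to both sites who reuse the same password.
    How 'shared' and 'reusing' users are counted is an assumption here."""
    return reusing_users / shared_users if shared_users else 0.0


def label_edge(shared_users: int, reusing_users: int, tau: float = 0.1) -> int:
    """Positive edge (1) if the observed reuse rate exceeds the threshold tau;
    the default tau value is illustrative, not from the paper."""
    return int(reuse_rate(shared_users, reusing_users) > tau)
```

With this rule, a site pair where 20 of 100 shared users reuse a password would be a positive edge at τ = 0.1, while a pair with a 5% reuse rate would not.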

Data provenance and labeling: the evaluation uses a large breached-credential corpus containing 360 million breached accounts from 22,378 websites. The excerpt does not fully specify how many individual password-reuse pairs were labeled positive/negative, how the train/validation/test split was done, or whether splits were by edge, by website, or by administrator; those details are important because leakage across related websites could inflate link-prediction performance. What is clear from the paper text is that each administrator builds a local password-reuse graph from the websites they manage, with edges derived from their own observed reuse statistics. Cross-admin edges are the prediction target. Preprocessing is feature-specific: IP addresses are converted to 32-bit binary vectors; website categories are normalized and collapsed into 20 classes; content is represented using embeddings from a pre-trained XLM-RoBERTa model; URLs are encoded with a character-level LSTM; and security posture is derived from public scans (Shodan, CVE/CVSS, HTTPS checks). The paper emphasizes that all website attributes are collected from public web-analytics or scanning services rather than directly from user secrets.
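
Of the feature-specific preprocessing steps above, the IP encoding is concrete enough to sketch. The snippet below converts an IPv4 address into a 32-bit binary vector; the function name is ours, and the excerpt does not say how IPv6 or multi-homed sites are handled.

```python
import ipaddress


def ip_to_bits(ip: str) -> list[int]:
    """Encode an IPv4 address as a 32-bit binary vector (most significant
    bit first), matching the paper's described location feature. IPv4 only;
    IPv6 handling is not covered in the excerpt."""
    value = int(ipaddress.IPv4Address(ip))
    return [(value >> (31 - i)) & 1 for i in range(32)]
```

For example, `ip_to_bits("192.168.0.1")` yields a vector whose first octet is `1,1,0,0,0,0,0,0` (192 in binary), giving the GNN a fixed-width numeric input per website.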

Architecture and algorithm: the model is a graph neural network over websites, with password-reuse relations represented as undirected edges. Each website node has a five-modality feature set. The modalities are processed separately so heterogeneous inputs do not have to be squeezed into a single flat vector too early. The paper’s description indicates a multimodal feature encoder, neighborhood aggregation via a GNN, and a final edge classifier that concatenates the two endpoint node embeddings h_u and h_v into h_uv = σ(W_f·(h_u || h_v)), then feeds that into a feed-forward network to predict whether a cross-admin password-reuse edge exists. The GNN design follows their earlier PASSREFINDER work, but the FL version removes the old heuristic “hidden relation” mechanism and instead relies on collaborative training from multiple administrators’ local graphs. The text repeatedly contrasts this with previous approaches that depended on sharing username mappings or using privacy-sensitive protocols. One concrete end-to-end example from the design: an administrator can scan a target website’s public IP, category, HTML text, URL structure, and security posture; these are embedded into a node vector; local GNN message passing computes a representation for that website; then, during inference, two administrators exchange only node embeddings, concatenate them, and classify whether the pair likely has a password-reuse relation.
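
The edge classifier described above, h_uv = σ(W_f·(h_u || h_v)) followed by a feed-forward network, can be sketched numerically. This is a shape-level illustration only: the real model learns W_f and the classifier weights, the feed-forward network is reduced here to a single layer, and all dimensions are placeholders.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def edge_embedding(h_u: np.ndarray, h_v: np.ndarray, W_f: np.ndarray) -> np.ndarray:
    """h_uv = sigma(W_f . (h_u || h_v)): concatenate the two endpoint node
    embeddings and project through a learned weight matrix W_f."""
    return sigmoid(W_f @ np.concatenate([h_u, h_v]))


def predict_edge(h_uv: np.ndarray, w_out: np.ndarray, b_out: float) -> float:
    """Single-layer stand-in for the paper's feed-forward edge classifier;
    returns the predicted probability of a password-reuse relation."""
    return float(sigmoid(w_out @ h_uv + b_out))
```

Note that only h_u and h_v cross the administrator boundary at inference time, which is the privacy argument the paper makes: embeddings, not raw graphs or credentials, are exchanged.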

Training regime and federated learning: the paper’s description is higher-level than a full reproducibility recipe in the excerpt. It states that each administrator trains a local GNN on its own password-reuse graph, sends updated weights or gradients to a central server, the server aggregates them into a global GNN, and the aggregated model is redistributed for the next round. The authors do not specify the number of FL rounds, local epochs, optimizer choice, batch size, learning rate, hidden dimension sizes, dropout, or random seed strategy in the provided text. Likewise, the content embedding relies on a pre-trained XLM-RoBERTa model, but the excerpt does not say whether it was fine-tuned end-to-end or frozen. The evaluation section also suggests there are ablations across GNN variants and feature designs, and that neighborhood attention contributes less under FL than in the original PASSREFINDER framework, while multimodality consistently helps. However, the exact experimental protocol behind the reported 4–11% gains is not fully visible in the excerpt.
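
The server-side aggregation step can be sketched as a FedAvg-style weighted average. This is an assumption: the excerpt says weights or gradients are aggregated into a global model but does not name the rule, so the size-weighted averaging below is one plausible instantiation.

```python
import numpy as np


def fedavg(client_weights: list[dict[str, np.ndarray]],
           client_sizes: list[int]) -> dict[str, np.ndarray]:
    """FedAvg-style aggregation (assumed, not confirmed by the excerpt):
    each administrator's parameter update is weighted by the size of its
    local password-reuse graph, then summed into the global model."""
    total = sum(client_sizes)
    keys = client_weights[0].keys()
    return {k: sum((n / total) * w[k]
                   for w, n in zip(client_weights, client_sizes))
            for k in keys}
```

The aggregated dictionary would then be redistributed to all administrators for the next local training round, as the paper describes.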

Evaluation and reproducibility: the reported headline metric is F1 = 0.9153 in the FL setting. The paper also claims that its FL-based GNN beats other state-of-the-art GNN models by 4–11% in an ablation study, and that the output probabilities correlate with expected password-reuse rates, enabling risk scoring. The excerpt mentions ranking metrics in a figure caption, so the authors appear to evaluate edge-ranking quality as well, but the exact metrics are not shown here. Reproducibility is partially limited by the fact that the underlying breached-account dataset is likely non-public or at least sensitive, and the excerpt does not mention code release, frozen checkpoints, or a public benchmark split. If one wanted to reproduce the core claim, the critical unknowns to resolve would be the exact positive/negative edge construction, the per-admin partitioning strategy, and the FL aggregation details. Those choices matter because graph leakage, especially across highly similar websites or shared categories, can substantially change link-prediction results.

Technical innovations

  • Recasts credential-stuffing risk prediction as website-level link prediction over a password-reuse graph, rather than as login-time detection or password-policy enforcement.
  • Introduces password reuse relations as explicit graph edges between websites, using a tunable reuse-rate threshold τ to define labels.
  • Combines multimodal website feature extraction with GNN message passing to model heterogeneous signals such as location, category, content, URL, and security posture.
  • Extends the original PASSREFINDER idea with federated learning so multiple administrators can train collaboratively without sharing passwords, usernames, or local graphs.
  • Uses predicted edge probabilities as calibrated risk scores for cross-site password reuse likelihood, not just binary classification.

Datasets

  • 360 million breached accounts — 22,378 websites — real-world breached credential dataset (source not specified in excerpt)

Baselines vs proposed

  • PASSREFINDER / variants with hidden-relation policies: F1 = not stated vs proposed FL setting: 0.9153
  • Other state-of-the-art GNN models: performance = not stated vs proposed FL-based GNN: +4% to +11% improvement (metric not fully specified in excerpt)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2510.16083.

Fig 1

Fig 1: Graph representation of password reuse relations.

Fig 6

Fig 6: Evaluation using ranking metrics (caption truncated in the excerpt).

Fig 7

Fig 7: Risk scores averaged over edges (caption truncated in the excerpt).

Fig 4

Fig 4 (page 18).

Fig 5

Fig 5 (page 18).


Fig 8

Fig 8 (page 18).

Limitations

  • The excerpt does not provide full experimental details such as train/validation/test splits, FL rounds, optimizer, batch size, or model hyperparameters, which limits reproducibility.
  • The dataset is breach-derived and likely highly imbalanced and temporally skewed; the excerpt does not show a temporal holdout or true future-site generalization test.
  • It is unclear whether evaluation prevents leakage between related websites, shared subdomains, or near-duplicate site families, which could inflate link-prediction performance.
  • The paper argues that gradients/weights are privacy-preserving, but the excerpt does not discuss gradient inversion, membership inference, or other FL leakage attacks.
  • The risk score interpretation is plausible, but the calibration procedure for converting probabilities into actionable thresholds is not fully specified in the excerpt.
  • The method relies on public website metadata and third-party scanners; missing, stale, or noisy public data could reduce accuracy on less visible sites.

Open questions / follow-ons

  • How robust is the model to temporal drift, where password-reuse behavior changes after breaches, policy changes, or major security incidents?
  • Can the FL setup remain private and accurate under malicious-client poisoning or gradient leakage attacks?
  • How well does the approach generalize to newly registered or low-visibility websites with sparse public metadata?
  • Can the predicted risk scores be calibrated into an operational threshold that yields actionable and low-false-positive mitigation policies for administrators?

Why it matters for bot defense

For bot-defense and credential-stuffing mitigation, this paper is useful as a site-pair risk-ranking tool rather than a direct detection engine. A defender could use the predicted website-to-website risk scores to prioritize which partner sites, login flows, or breached-credential alerts deserve stricter step-up authentication, rate limiting, or anomaly monitoring. The main operational appeal is that it tries to infer cross-site reuse risk without requiring raw credential sharing between organizations.

For CAPTCHA practitioners specifically, the result suggests a way to allocate friction more selectively. If a site pair is predicted to have high password-reuse likelihood, you might tighten bot challenges or trigger additional verification only when login traffic or account-creation flows overlap with that risk profile. The flip side is that this is still a graph-inference model trained on breach-derived data, so it should be treated as a prioritization signal, not ground truth. In practice, a defender would want to test whether these risk scores actually improve downstream outcomes like reduced stuffing success rate, fewer false positives, or better challenge placement under distribution shift.
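
As a concrete illustration of using the scores as a prioritization signal, a defender might map predicted pair risk to friction tiers. The cutoffs below are purely illustrative assumptions, not from the paper, and should be tuned against observed downstream outcomes such as stuffing success rate.

```python
def challenge_level(risk_score: float, *, high: float = 0.8, mid: float = 0.5) -> str:
    """Map a predicted site-pair risk score to a friction tier.
    The thresholds (0.8, 0.5) are hypothetical examples; in practice they
    would be calibrated against measured mitigation outcomes."""
    if risk_score >= high:
        return "step-up-auth"
    if risk_score >= mid:
        return "captcha"
    return "none"
```

A login flow overlapping a pair scored 0.9 would get step-up authentication, a 0.6 pair a bot challenge, and low-risk pairs no added friction.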

Cite

bibtex
@article{arxiv2510_16083,
  title={PassREfinder-FL: Privacy-Preserving Credential Stuffing Risk Prediction via Graph-Based Federated Learning for Representing Password Reuse between Websites},
  author={Jaehan Kim and Minkyoo Song and Minjae Seo and Youngjin Jin and Seungwon Shin and Jinwoo Kim},
  journal={arXiv preprint arXiv:2510.16083},
  year={2025},
  url={https://arxiv.org/abs/2510.16083}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution