Spatio-Temporal Directed Graph Learning for Account Takeover Fraud Detection

Source: arXiv:2509.20339 · Published 2025-09-24 · By Mohsen Nayebi Kerdabadi, William Andrew Byron, Xin Sun, Amirfarrokh Iranitalab

TL;DR

This paper addresses the challenge of detecting Account Takeover (ATO) fraud in consumer banking, where attackers gain unauthorized access to customer accounts to conduct high-risk transactions. Existing production fraud detection systems rely largely on tabular models like XGBoost, which treat each session independently and fail to capture relational and temporal dependencies common in coordinated fraud rings. The authors propose ATLAS, a novel framework that formulates ATO detection as a spatio-temporal node classification problem on a directed session graph that respects causal, time-based ordering. By linking sessions via shared entities (account, device, IP) and imposing time-window and recency constraints to enforce causality and control graph size, ATLAS enables a Graph Neural Network (GraphSAGE) to perform time-respecting message passing and serve-time consistent lagged label propagation without leakage.

They operationalize this approach at scale on a dataset of over 100 million sessions and 1 billion edges from a high-risk digital banking product at Capital One. Experimentally, ATLAS improves ROC AUC over the XGBoost baseline by 6.38% in relative terms on Segment 1 (5.8% overall) while reducing customer friction by over 50%, improving the capture-friction tradeoff that is critical to fraud detection. This demonstrates that exploiting intrinsic spatio-temporal graph structure through carefully designed graph construction and inductive GNN modeling can measurably improve both fraud capture and user experience under production constraints of latency and class imbalance.

Key findings

ATLAS improves ROC AUC by +6.38% relative to the XGBoost baseline on Segment 1 of the Capital One dataset, from 78.88 to 83.92 (Table 1).
Overall AUC improvement by GNN with label propagation is +5.8% relative to XGBoost (84.46 vs 79.83).
Increasing recency cap K from 1 to 10 yields steady ROC AUC improvements, showing more historical context benefits detection.
Extending temporal window T from 1 to 120 days consistently improves ROC AUC, highlighting value of longer causal context.
Incorporating serve-time consistent lagged labels from prior connected sessions (non-anticipative label propagation) boosts performance.
Using GraphSAGE with attention and relational aggregation shows limited gain over simpler homogeneous GraphSAGE with label propagation.
ATLAS reduces customer friction by more than 50% alongside improved fraud capture, addressing the capture-friction tradeoff effectively.
Inductive GraphSAGE with neighbor sampling enables scaling to 100M+ nodes and ~1B edges while enforcing strict latency constraints.

Threat model

The adversary is a fraudster who attempts to gain unauthorized access to legitimate consumer bank accounts and execute fraudulent high-risk transactions, often reusing identifiers (account, device, IP) across coordinated sessions. The adversary can coordinate attacks across multiple sessions but cannot predict future legitimate session labels or influence the data used for training and scoring; both are constructed to be non-anticipative and leakage-free at serve time. Attackers are assumed to have no knowledge of the internal graph construction or model parameters, and the defender must detect their evolving patterns within constrained latency budgets.

Methodology — deep read

Threat Model & Assumptions: The adversary is a fraudster who gains unauthorized access to legitimate consumer bank accounts via credential stuffing, phishing, or device/IP spoofing, aiming to execute high-risk transactions (HRTs). They may coordinate attacks across multiple entities (accounts/devices/IPs), forming 'fraud rings.' The defender must detect fraud rapidly (<250ms latency) with high recall and minimal false positives to avoid customer friction. The model assumes only labels adjudicated and known by serve time are available, forbidding leakage or anticipation of future fraud outcomes.
Data: The dataset consists of tens of millions of sessions involving HRTs from a Capital One digital product across two anonymized segments. The label is binary (fraud or not). Data is split chronologically into 8 months of training, 2 months of validation, and 5 months of testing to assess temporal generalization. Numerical features are standardized. Exact descriptive statistics are confidential. The final graph has over 100M nodes and about 1 billion edges, with sessions keyed by (account, device, IP, timestamp).
Architecture & Algorithm: The problem is reformulated as spatio-temporal node classification over a time-respecting directed acyclic graph where nodes are individual sessions. Edges are created from past to future sessions sharing an entity (account/device/IP) if within a sliding time window T and capped to the K most recent predecessors per session to control neighborhood size and latency. Lag-aware label propagation aggregates historical neighbor labels which are adjudicated and known before serve time, producing features such as counts and rates of nearby known fraud neighbors. The node feature input combines curated session-level tabular features with these aggregated lagged label features.
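
The edge rule described above (shared entity, past to future, within T days, capped at the K most recent predecessors) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `build_edges`, the session dictionary layout, and the choice to apply the cap K separately per entity type are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import timedelta

def build_edges(sessions, window_days, k):
    """Return directed edges (src_id, dst_id, edge_type), always past -> future.

    sessions: list of dicts with keys 'id', 'ts' (datetime), 'account',
    'device', 'ip'. Hypothetical layout, chosen only for this sketch.
    """
    window = timedelta(days=window_days)
    history = defaultdict(list)  # (edge_type, entity value) -> [(ts, id)], time-ordered
    edges = []
    for s in sorted(sessions, key=lambda row: row["ts"]):
        for etype in ("account", "device", "ip"):
            key = (etype, s[etype])
            # Keep predecessors that are strictly earlier and inside the window T.
            preds = [
                (ts, sid) for ts, sid in history[key]
                if timedelta(0) < s["ts"] - ts <= window
            ]
            # Recency cap: only the k most recent predecessors (assumed per edge type).
            for _, sid in preds[max(0, len(preds) - k):]:
                edges.append((sid, s["id"], etype))
            history[key].append((s["ts"], s["id"]))
    return edges
```

Because every edge points from an earlier session to a later one, the resulting graph is acyclic by construction, which is what makes message passing time-respecting.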

The GNN encoder is based on GraphSAGE (Hamilton et al. 2017) with mini-batch neighbor sampling respecting the same T and K constraints to avoid training-serving skew. Variants include homogeneous (single aggregator), relational (per edge-type aggregation fused by learned transforms), and time-aware attention-based aggregators incorporating binned time gaps and edge-type embeddings. Final node embeddings go through a logistic layer for fraud probability output. The loss is weighted binary cross-entropy to account for extreme class imbalance.
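
To make the weighted loss concrete, here is a small pure-Python sketch of weighted binary cross-entropy. The summary does not state how the paper sets the class weights, so the single `pos_weight` multiplier on the fraud class is an assumption, as are the function name and the clipping constant.

```python
import math

def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross-entropy with the positive (fraud) class up-weighted.

    probs: predicted fraud probabilities in (0, 1).
    labels: 0/1 ground-truth labels.
    pos_weight: multiplier on the loss of positive examples (assumed scheme).
    """
    eps = 1e-12  # clip probabilities to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

With `pos_weight` greater than 1, a missed fraud session costs more than a false alarm on a legitimate one, which counteracts the tendency of an imbalanced dataset to push every score toward zero.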

Training Regime: Model training uses neighbor sampling with fanouts per GNN layer (commonly 2-3 layers). Details on epochs, batch size, optimizer, hyperparameters and hardware are not explicitly reported. Chronological splits prevent knowledge leakage. Feature standardization is based only on training set statistics. The system uses dynamic out-of-core loading through PyTorch Geometric’s NeighborLoader to scale efficiently.
Evaluation Protocol: Metrics are ROC AUC on chronological holdout test sets for two product segments. Baselines are well-tuned XGBoost models trained with the same tabular features. Ablations include GNN without label propagation and varying T and K parameters. Improvements are reported as relative ROC AUC increments over XGBoost. No adversarial or distribution shift testing is reported.
Reproducibility: Due to regulatory and privacy constraints, the dataset cannot be published, and code release status is unclear. However, sufficient architectural and methodological details are provided to conceptually reproduce results with a similar dataset.
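
The leakage controls mentioned under Training Regime (chronological splits, standardization fitted on training data only) can be illustrated as follows. This is a sketch under assumed names: `chronological_split`, `fit_standardizer`, and the row layout with a `ts` field are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev

def chronological_split(rows, train_end, val_end):
    """Split rows by timestamp so validation and test lie strictly after training."""
    train = [r for r in rows if r["ts"] < train_end]
    val = [r for r in rows if train_end <= r["ts"] < val_end]
    test = [r for r in rows if r["ts"] >= val_end]
    return train, val, test

def fit_standardizer(train_rows, feature):
    """Fit mean and standard deviation on training rows only; return a scaler."""
    values = [r[feature] for r in train_rows]
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against a constant feature
    return lambda x: (x - mu) / sigma
```

The scaler returned by `fit_standardizer` is then applied unchanged to validation and test rows, so no statistic from the future influences how earlier or later sessions are encoded.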

Example end-to-end: For a target session v at serve time t_v, graph construction gathers up to the K most recent past sessions u that share an account, device, or IP with v, fall within the last T days, and were adjudicated before t_v. From these neighbors it computes aggregate label features (nlab_v, nfraud_v, the fraud rate r_v, and an any-fraud flag a_v), appends them to the session features x_v to form the node input h(0)_v, then applies GraphSAGE layers with neighbor sampling constrained by T and K to produce the embedding h(L)_v. A logistic classifier outputs the fraud score s_v; the model is trained with weighted cross-entropy loss over millions of past sessions.
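
The lagged label features in this walkthrough can be sketched as below. The interpretation of nlab_v as the number of neighbors with a known label, the zero default for the rate when no labels are known, and the field names `label` and `adjudicated_at` are assumptions made for illustration.

```python
def lagged_label_features(neighbors, serve_time):
    """Aggregate only labels that were already adjudicated before serve time.

    neighbors: list of dicts with 'label' (0 or 1) and 'adjudicated_at'
    (a comparable timestamp, or None if not yet adjudicated).
    """
    known = [
        n["label"] for n in neighbors
        if n["adjudicated_at"] is not None and n["adjudicated_at"] < serve_time
    ]
    n_lab = len(known)      # neighbors with a label known at serve time
    n_fraud = sum(known)    # of those, how many were fraud
    return {
        "n_lab": n_lab,
        "n_fraud": n_fraud,
        "rate": n_fraud / n_lab if n_lab else 0.0,  # assumed default when no labels
        "any_fraud": 1 if n_fraud > 0 else 0,
    }
```

Filtering on the adjudication time rather than the session time is the point of the sketch: a neighbor that occurred in the past but was confirmed as fraud only after the target session contributes nothing, which keeps training features identical to what would be available in production.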

Technical innovations

Reformulating ATO detection as spatio-temporal node classification on a time-respecting directed acyclic session graph with causal edges based on shared entities and temporal constraints.
Serve-time consistent lag-aware label propagation aggregating only adjudicated past neighbor fraud labels, ensuring non-anticipative, leakage-free features.
Operationalizing scalable inductive GraphSAGE with neighbor sampling under time-window T and recency cap K for stable graph neighborhoods, enabling training and inference on 100M+ node graphs within latency budgets.
Incorporation of typed edges (account/device/IP) with relational GraphSAGE variants and optional time-aware attention aggregators focusing message passing on meaningful temporal relationships.
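
As a rough picture of the relational variant, the sketch below averages neighbor features separately per edge type and concatenates the results with the node's own features. It is a simplification: the learned transforms that fuse the per-type messages, the nonlinearity, and the attention variant are omitted, and the function names are hypothetical.

```python
def mean_vector(vectors, dim):
    """Element-wise mean of equal-length vectors; zeros if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def relational_aggregate(self_feat, neighbors_by_type,
                         edge_types=("account", "device", "ip")):
    """Concatenate self features with one mean-pooled message per edge type.

    neighbors_by_type: dict mapping edge type -> list of neighbor feature vectors.
    In a full model a learned linear layer would follow this concatenation.
    """
    dim = len(self_feat)
    out = list(self_feat)
    for etype in edge_types:
        out.extend(mean_vector(neighbors_by_type.get(etype, []), dim))
    return out
```

Keeping one slot per edge type lets a downstream layer weight, for example, a shared device differently from a shared IP, whereas the homogeneous variant pools all neighbors into a single mean.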

Datasets

Capital One high-risk digital product sessions — 100M+ nodes, ~1B edges — proprietary, not public

Baselines vs proposed

XGBoost baseline: overall ROC AUC = 79.83 vs GNN + Label Propagation: 84.46 (+5.8%)
XGBoost Segment 1: ROC AUC = 78.88 vs GNN + Label Propagation: 83.92 (+6.38%)
XGBoost Segment 2: ROC AUC = 82.45 vs GNN + Label Propagation: 85.45 (+3.63%)
GNN (no label propagation): overall ROC AUC = 82.27 (+3.06% vs XGBoost)
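
The percentages above are relative gains, (proposed - baseline) / baseline, not differences in AUC points. A short check using only the AUC values listed in this section is sketched below; the function name is hypothetical, and recomputed values can differ from the reported ones in the last digit because the listed AUCs are themselves rounded.

```python
def relative_gain_pct(baseline_auc, proposed_auc):
    """Relative improvement of proposed over baseline, in percent."""
    return 100.0 * (proposed_auc - baseline_auc) / baseline_auc

# (baseline, proposed) AUC pairs as listed in this section
comparisons = {
    "overall, GNN + label propagation": (79.83, 84.46),
    "segment 1, GNN + label propagation": (78.88, 83.92),
    "segment 2, GNN + label propagation": (82.45, 85.45),
    "overall, GNN without label propagation": (79.83, 82.27),
}

for name, (base, prop) in comparisons.items():
    print(f"{name}: {relative_gain_pct(base, prop):+.2f}%")
```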

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2509.20339.

Fig 1: ATLAS graph formulation. Nodes are HRT sessions keyed by (account_id, device_id,

Fig 2: Effect of K (left) and T (right) on ROC AUC for Segment 1; similar trends hold for

Limitations

Confidential dataset prevents public code or data release, limiting reproducibility and external validation.
Lack of reported adversarial robustness testing against adaptive fraud attackers or attack simulations.
No detailed latency or throughput benchmarks provided for production inference beyond stated compliance.
Model evaluation limited to chronological splits on one institution's product segments; generalization across institutions or products untested.
GNN architectural variants show limited gains beyond simpler homogeneous GraphSAGE, suggesting room to explore more advanced temporal models.
Potential concept drift and changing fraud patterns beyond the datasets’ time range are not explicitly addressed.

Open questions / follow-ons

How does ATLAS perform under adversarial scenario modeling or adaptive attacker strategies evading graph-based detection?
Can temporal graph neural networks with dynamic edge updating or finer-grained time encoding further improve latency-accuracy tradeoffs?
What are the impacts of concept drift over longer time periods beyond the study’s scope, and can online or continual learning techniques extend ATLAS?
How transferable is the graph construction and modeling approach to other fraud domains or different financial institutions with disparate data?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this paper illustrates the power of explicitly modeling relational and temporal structure in behavioral data to detect automated and coordinated malicious activity at scale with strong latency constraints. The ATLAS approach shows that incorporating time-respecting directed graphs and non-anticipative label propagation can capture subtle cross-session fraud patterns missed by traditional tabular models. While CAPTCHAs primarily target behavioral anomaly signals per session or device, augmenting detection pipelines with spatio-temporal graph representations could improve detection of coordinated bot rings or replay attacks in fraud-sensitive applications. The latency-conscious design and neighbor sampling strategies also provide a useful reference for deploying graph models in real-time risk scoring systems.

Cite

bibtex

@article{arxiv2509_20339,
  title={Spatio-Temporal Directed Graph Learning for Account Takeover Fraud Detection},
  author={Mohsen Nayebi Kerdabadi and William Andrew Byron and Xin Sun and Amirfarrokh Iranitalab},
  journal={arXiv preprint arXiv:2509.20339},
  year={2025},
  url={https://arxiv.org/abs/2509.20339}
}
