A GPU Accelerated Temporal Window-Based Random Walk Sampler

Source: arXiv:2605.16182 · Published 2026-05-15 · By Md Ashfaq Salehin, George Parisis, Luc Berthouze

TL;DR

This paper addresses the challenge of generating temporal random walks that strictly preserve causal (temporal) ordering on large-scale, evolving graphs. Such temporal random walks are essential for dynamic network analysis and temporal node embeddings, but current systems are either offline batch-oriented or do not enforce causality, and fail to scale to streaming billion-edge datasets. The authors propose Tempest, a GPU-accelerated streaming engine with a novel dual-index data layout and a hierarchical cooperative scheduler that dynamically dispatches walks by grouping those sharing the same node and step into thread-, warp-, or block-sized tasks. Tempest supports efficient sliding-window maintenance to bound memory, enforces strict temporal causality without synchronization overhead, and provides constant-time temporal bias samplers. Experimental results demonstrate that Tempest sustains real-time ingestion and temporal random walk sampling throughput on a real Alibaba microservices dataset with 81 billion timestamped edges, outperforming prior CPU-based temporal walk engines by up to two orders of magnitude. Detailed ablations show the scheduler’s benefits and the tradeoffs of sampler design, making Tempest a practically deployable system for large-scale streaming temporal graph analysis.

Key findings

Tempest sustains processing of 81 billion edges in 1.42 hours total (1.11 hours ingestion, 0.31 hours sampling) on NVIDIA A40 GPU, processing batches of ≈12M edges in under 760 ms total compared to batch arrival every 3 minutes.
GPU-accelerated ingestion is 76× faster than Raphtory and 84× CPU backend on Konect-Delicious dataset (301M edges), with ingestion time 0.9 seconds at this scale.
Walk sampling throughput with hierarchical cooperative scheduler is up to 9% higher than full-walk baseline on Konect-Delicious, and 5–7% better than scheduler without shared memory optimization.
Per-hop temporal neighbor sampling achieves constant time when using the proposed index-based inverse CDF samplers under uniform timestamp gaps.
Hierarchical walk dispatch reduces redundant global memory accesses by grouping walks converging on nodes and preloading node adjacency metadata into shared memory at warp/block granularity.
Sliding-window eviction and batch-wise bulk index reconstruction keep memory bounded by active window size (e.g., 120M active edges for Alibaba) with index build cost linear in window size.
Tuning the solo-to-warp walk population threshold (W_warp) at 4 maximizes throughput across multiple datasets.
Walk generation runtime remains flat across edges ranging from 1K to 301M, demonstrating scalability of cooperative scheduling and sampler designs.

Threat model

The adversary is an environment producing large streams of timestamped edges representing temporal graph updates. The system assumes edges arrive in time order (monotonic timestamps) and does not tolerate out-of-order or late-arriving edges. The adversary cannot cause retraction or arbitrary mutation of past edges once ingested. The core security assumption is strict preservation of temporal causality in random walks, thus ensuring no future-timestamp edges are selected before earlier edges in a walk.

Methodology — deep read

The authors design Tempest, a GPU-native engine for temporal random walk sampling on streaming temporal graphs, combining several innovations to handle causality preservation, large-scale streaming ingestion, and GPU efficiencies.

Threat model and assumptions focus on temporal random walks which require causality preservation: the walk can only progress to edges with strictly increasing timestamps. The adversary or system must enforce temporal ordering at each hop, given streaming batches of timestamped edges.

Data: evaluation uses large public temporal graph datasets including Konect-Delicious (301M edges) and Alibaba Microservices logs with 81B temporal edges. Datasets have node counts from tens of thousands to tens of millions. Edges arrive continuously in time-ordered batches.

Algorithm design begins with a shared edge store representing active edges in the sliding window, organized with a dual-index data structure: a timestamp-grouped global view for start edge sampling and eviction, and a node + timestamp-grouped view that clusters edges by source node and ascending timestamps to allow per-hop temporal neighborhood lookups. Both index views are offset arrays over the shared edge array, rebuilt in bulk after each batch arrival.

The walk sampling scheduler exploits hierarchical cooperative scheduling: it dynamically groups walks at the same node and step into units, then dispatches these at granularities (single thread, warp of 32 threads, or thread block of 256 threads), based on current walk population (W) and size of adjacency metadata (G). This grouping enables shared memory preloading of adjacency metadata, reducing redundant global memory access and improving GPU efficiency. Large hubs with >8192 walks are split among multiple blocks.

Temporal bias sampling is implemented with weight-based sampling over the weighted neighborhood edges using inverse CDF sampling. Under uniform timestamp gaps, closed-form O(1) samplers enable per-hop sampling constant time, instead of O(log n) binary searches.

Streaming ingestion maintains a fixed-size sliding window of edges defined by duration Δ, holding only edges whose timestamps fall inside the current window. Incoming batches are merged and sorted, evicting edges outside the window. The dual-index is reconstructed after each batch in parallel on GPU using radix sorting and prefix sums, avoiding incremental updates and synchronization overhead.

Training regime does not apply, but evaluation runs sampling experiments with walk length 80-100, 10-20 walks per node, on NVIDIA A40 GPU with CUDA 12.6. Parameters like block dimension (default 256) and solo-to-warp threshold (default 4) are tuned empirically.

Evaluation metrics include ingestion time, walk sampling throughput (million steps per second), GPU resource occupancy, and scalability with dataset size. Baselines include CPU implementation of Tempest and prior CPU temporal walk engines TEA and TEA+. Several ablation studies isolate effects of scheduler dispatch units and memory usage.

Results show linear scalability of ingestion and walk sampling time with edge counts and dataset sizes, confirming sustainable real-time streaming capabilities. Scheduler cooperation brings a consistent 5–9% speedup over naive full-walk baseline. Sensitivity to scheduler parameters and window sizes is explored. Source code is released as open source.

Technical innovations

A GPU-native dual-index edge structure with two logical views (timestamp-grouped and node+timestamp-grouped) over a shared edge array enabling O(1) temporal neighborhood lookups.
Hierarchical cooperative scheduling that dynamically dispatches temporal walks grouped by node and step to thread-, warp-, or block-sized cooperative GPU kernels, optimizing shared memory reuse and reducing global memory traffic.
Batch-based streaming ingestion with sliding-window eviction implemented entirely on GPU with radix sorting and prefix sum scans to rebuild dual indices without synchronization or incremental updates.
Closed-form constant-time samplers for temporal bias functions (uniform, linear, exponential) under uniform timestamp gaps, replacing binary search with direct inversion in temporal random walks.

Datasets

Alibaba Microservices — 81B edges, 68K nodes — public [18]
Konect-Delicious — 301M edges, 33.7M nodes — public [17]
TGBL-Coin — 22.8M edges, 638.5K nodes — public [13]
TGBL-Flight — 67M edges, 18K nodes — public [13]
Konect-Growth — 39M edges, 1.8M nodes — public [17]
TGBL-Review — 4.8M edges, 352K nodes — public [13]

Baselines vs proposed

TEA+ (CPU temporal walk engine): ingestion throughput significantly lower, unable to handle streaming billion-edge scale compared to Tempest GPU (81B edges in 1.42 hours vs TEA's out-of-core batch processing).
Raphtory (incremental temporal graph engine): Tempest GPU ingests 301M edges 76× faster (0.9 s vs ~70 s) on Konect-Delicious dataset.
Tempest CPU backend: GPU ingestion 84× faster on Konect-Delicious at 301M edges.
Scheduler variants on TGBL-Coin: full-walk baseline = 59.4 M steps/s; cooperative with global memory preloading = 60.6 M steps/s; cooperative with shared memory preloading = 63.8 M steps/s (7.4% better than baseline).
Walk sampling runtime remains essentially flat (±5%) from 1K to 301M edges, showing good scalability of GPU scheduler and sampler.
Starting walk sampling throughput 2.5–2.9× faster on GPU backend vs CPU backend across first- and second-order pickers.

Limitations

The system assumes strictly monotonic timestamps within streaming batches and drops late-arriving edges; it does not handle out-of-order or retracted updates.
Evaluation focuses on temporal random walk generation throughput but does not include downstream embedding training or end-to-end ML task performance.
No adversarial evaluation or robustness tests under adversarial streams or highly skewed timestamp distributions.
Memory usage depends on active window size which is fixed a priori; dynamic window adjustment or memory-pressure adaptation are not explored.
The scheduler relies on heuristics (W thresholds) tuned on benchmark datasets; performance on more diverse graph types is untested.
Evaluation at Alibaba scale does not include detailed profiling of kernel-level bottlenecks or power/energy consumption.

Open questions / follow-ons

How to extend Tempest to support out-of-order or late-arriving edges with retraction or correction semantics while preserving temporal causality?
Can the cooperative scheduling framework be generalized to heterogeneous hardware with multiple GPUs or distributed settings?
How do different temporal bias functions and possibly learned dynamic biases impact scalability and embedding quality downstream?
What are the tradeoffs when integrating temporal random walks generated by Tempest into full temporal graph neural network training pipelines?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, the ability to efficiently generate temporal random walks on streaming temporal interaction data enables scalable embedding and anomaly detection tasks that account for temporal ordering. Tempest’s GPU-accelerated streaming approach addresses key bottlenecks in strict causality enforcement at massive scale, critical for real-time detection of evolving bots or malicious activity. The hierarchical cooperative scheduling approach can inspire design of efficient GPU-based sampling primitives within behavioral models that depend on temporal interaction paths. Moreover, Tempest’s sliding-window bounded-memory design aligns with practical deployment needs where only recent interactions matter. However, practitioners should consider the limitation that the system requires monotonic timestamped inputs without correction or adversarial manipulations, which are common in security-sensitive settings. The open question remains how to adapt such temporal walkers for adversarial scenarios or noisy timestamps which are frequent in bot detection logs.

Cite

bibtex

@article{arxiv2605_16182,
  title={ A GPU Accelerated Temporal Window-Based Random Walk Sampler },
  author={ Md Ashfaq Salehin and George Parisis and Luc Berthouze },
  journal={arXiv preprint arXiv:2605.16182},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.16182}
}

A GPU Accelerated Temporal Window-Based Random Walk Sampler ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​