Guarded Epoch Bloom Filters for Sliding-Window Membership

Source: arXiv:2606.19210 · Published 2026-06-17 · By Faruk Alpay, Levent Sarioglu

TL;DR

This paper addresses the problem of approximate membership queries (AMQs) over streams where recent-window semantics are needed—that is, answering if an item appeared in the last W insertions, rather than ever before. Existing approaches like counting Bloom filters support deletions but require per-cell counters and deletion queues, while stable Bloom filters use random decay but may lose live items. The authors propose the Guarded Epoch Bloom Filter (GEBF), which partitions a fixed memory budget into r+1 segments (epochs). Insertions go only to the current segment, and at each epoch boundary, the next segment is cleared and becomes current. One additional guard segment preserves a deterministic invariant that every key inserted within the last W items is retained without false negatives and bounds stale positives to one extra epoch beyond W. The authors provide a rigorous construction, prove live coverage and staleness bounds, derive false-positive approximations, and propose a blocked variant that improves memory locality.

Extensive experiments on 225 synthetic streaming configurations and a real timestamped web-server access-log (213k entries, 2.3k keys) compare GEBF to counting Bloom filters and stable Bloom filters under fixed memory budgets. At 14 bits per live item, the GEBF with r=8 achieves median synthetic false-positive rates around 0.02225 compared to 0.191 for a 4-bit counting Bloom filter baseline, while preserving zero measured false negatives. On real logs, GEBF achieves nearly zero false positives and false negatives, outperforming stable filters that exhibit significant live-key false negatives. The design eliminates per-key deletion metadata and counters, reducing complexity in exchange for bounded stale positives. Results validate the deterministic live-window guarantee and favorable false-positive tradeoffs, with the main operational cost coming from queries probing multiple segments.

Key findings

Guarded epoch Bloom filter reduces median synthetic false-positive rate from 0.191 (4-bit counting Bloom) to 0.02225 at 14 bits per live item, while keeping zero live-key false negatives (Table 1).
On a timestamp-ordered real web access-log, GEBF achieves false-positive rate 0 and zero false negatives, outperforming stable Bloom filters with 0.0047 live-key false negatives.
The stale positive rate (positives on expired keys) remains bounded and controlled by the single guard epoch, e.g., 0.1469 at r=8 in the synthetic streams.
Queries require examining r+1 segments, with query throughput dropping from 46.5 Mq/s (counting BF) to 5.24 Mq/s (r=8 GEBF) on synthetic streams at 14 bits per item due to multi-segment probes.
False-positive probability can be approximated as pGE(m,W,r,k) ≈ 1 - (1 - (1 - e^{-kℓ/s})^k)^{r+1}, capturing the tradeoff between epoch granularity and bit allocation.
Staleness of positives is deterministically bounded to at most W + ℓ insertions due to guard epoch retention.
Blocked variant confines queries to one block per segment to improve cache locality, trading off some FPR increase (0.0708 vs 0.0222) for similar false-negative guarantees.
Counting Bloom filters require extra memory for 4-bit counters, reducing addressable cells; GEBF uses full memory budget on membership bits improving false-positive performance.

Threat model

The paper considers the adversary model typical for approximate membership data structures: the adversary cannot manipulate insertions or the data structure internals, only queries arbitrary keys. The guarantees focus on deterministic no-false-negative coverage of keys inserted within the last W positions and allow bounded retention of stale keys for up to one extra epoch. There is no consideration of adversarial hash collision or active tampering attacks.

Methodology — deep read

The authors start with a threat model focusing on approximate membership queries over large data streams where recent entries must be deterministically remembered within a window W, but bounded stale positives are allowable. Adversaries do not manipulate the data structure, so this is a data-structural guarantee rather than an adversarial security model.

Data for evaluation includes synthetic streams (120,000 insertions with a 20,000-item live window) under three distributions (uniform, Zipf-like, bursty) with multiple seeds and memory budgets. Queries include 20,000 live positives, 20,000 fresh negatives, and 20,000 expired keys for staleness measurement. The real dataset is a publicly available web-server access log (Organization X) with 213,182 entries and 2,350 distinct 64-bit keys derived from client-IP, method, and URL. Query sampling mimics the synthetic setup.

The GEBF architecture partitions the total m-bit budget into r+1 equal-sized bit arrays called segments, each a classic Bloom filter with k hash functions (double hashing used for efficiency). Insertions are always made into the current 'active' segment. After every ℓ = ceil(W / r) insertions, the active segment index cycles forward, clearing the next segment's bits to start fresh. The extra guard segment preserves live coverage for W insertions plus up to one extra epoch ℓ worth of staleness. Queries check all r+1 segments and return true if any contain the queried key.

The authors analyze and prove that all keys in the last W insertions survive in at least one uncleared segment (lemma 1), and staleness is bounded by W + ℓ insertions (proposition 1). They derive a closed-form approximation for false positive probability assuming uniform hashing and independent segments.

For efficiency, a blocked variant uses the first hash to select a data block and confines the k-probes to that block to improve cache locality without affecting the correctness invariant.

Training is not applicable as this is a data-structure design. Evaluation includes extensive benchmarking across synthetic and real workloads, measuring false-positive rates, false negatives on live keys, stale positives, and query throughput. Baselines are a 4-bit counting Bloom filter with exact deletion queue, and stable Bloom filters with random decay. The authors run large-scale validation runs on an NVIDIA B200 GPU, testing 200 million insertions and 20 million queries to stress throughput and correctness.

Reproducibility: The authors provide full implementation, experiment scripts, and dataset preprocessing as open artifacts allowing replication of all reported results.

Technical innovations

Introduction of guarded epochs: partitioning Bloom filter bits into r+1 rotating segments with one extra guard segment that guarantees deterministic no-false-negative live-window coverage with bounded stale positives.
Closed-form analytic approximation of false-positive probability capturing the tradeoff between epoch count r, bits per segment, and stromg window guarantee.
Blocked design variant that confines probing to a single data block per epoch segment to improve cache locality while maintaining correctness invariants.
The use of epoch rotation and segment clearing as a coarse expiration mechanism replacing expensive per-key deletion metadata or counters.

Datasets

Synthetic streaming configurations — 120,000 insertions with 20,000-item live window — publicly generated via artifact
Organization X Apache web server access log — 213,182 entries, ~2,350 distinct keys — publicly available on Zenodo per paper

Baselines vs proposed

Counting Bloom Filter (4-bit counters): median synthetic FPR = 0.1910 vs GEBF (r=8) = 0.02225
Stable Bloom Filter: median synthetic live-key FNR = 0.19 vs GEBF (r=8) = 0.0
Counting BF real log FPR = 0.0002 vs GEBF (r=8) = 0
Stable BF real log live-key FNR = 0.0047 vs GEBF (r=8) = 0
Blocked GEBF (r=8) synthetic FPR = 0.0708 vs unblocked GEBF (r=8) = 0.0222
Query throughput synthetic at 14 bits per item: counting BF = 46.5 Mq/s vs GEBF (r=8) = 5.24 Mq/s

Limitations

GEBF only approximates sliding-window membership with bounded staleness; it is not suitable where exact expiration is required.
Query costs are higher due to r+1 segment probes per query, reducing throughput compared to counting Bloom filters.
Experimental evaluation uses one real dataset from a web-server log and synthetic distributions but lacks broader trace or real-world system benchmarks.
The stale positive rate, while bounded, can still be non-negligible depending on r and memory budget; not suitable for systems intolerant to such false positives.
The false-positive approximation assumes uniform hashing and independence—real hash functions and skew might affect accuracy.
No adversarial or dynamic adversary evaluation; threat model is purely structural without attack scenarios.

Open questions / follow-ons

How does GEBF performance and false-positive behavior hold under adversarial inputs or adaptive query workloads?
Can the guard segment and epoch parameters be dynamically tuned in response to workload changes to optimize tradeoffs?
What are the practical latency and throughput implications in real-time high-throughput systems or when combined with parallelization?
How does GEBF integrate or interact with privacy or security constraints in streaming data retention?

Why it matters for bot defense

For bot-defense or CAPTCHA-like systems performing streaming approximate membership queries (e.g., tracking recent IPs or query patterns), the Guarded Epoch Bloom Filter provides a promising alternative to counting Bloom filters to maintain a recent-window view without complex deletion bookkeeping or counters. The deterministic no-false-negative guarantee for the recent window ensures bot signatures or suspicious tokens inserted in a sliding window remain detectable, while the bounded stale positives represent a known, limited false positive overhead.

Practitioners can leverage GEBF to reduce per-key metadata and improve memory efficiency, particularly when exact expiration is not critical but low false negatives are. The explicit false-positive bounds and easy parameterization via epoch count r enable operator control over stale retention limits, crucial in environments where the risk from stale positives can be tolerated. The increased query cost due to multiple segment probes must be weighed against improved false-positive rates and implementation simplicity. Overall, GEBF complements the landscape of sliding-window AMQs by providing a middle ground between exact deletion and randomized aging approaches.

Cite

bibtex

@article{arxiv2606_19210,
  title={ Guarded Epoch Bloom Filters for Sliding-Window Membership },
  author={ Faruk Alpay and Levent Sarioglu },
  journal={arXiv preprint arXiv:2606.19210},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.19210}
}

Guarded Epoch Bloom Filters for Sliding-Window Membership ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​