
Sublinear Sketches for Approximate Nearest Neighbor and Kernel Density Estimation

Source: arXiv:2510.23039 · Published 2025-10-27 · By Ved Danait, Srijan Das, Sujoy Bhore

TL;DR

This paper asks whether ANN and KDE sketches can be made truly sublinear in streaming settings without giving up useful approximation guarantees. The answer is yes, but under different assumptions for the two problems. For ANN, the authors revisit the classic Indyk–Motwani LSH construction and show that if the input stream is modeled as arising from a Poisson point process, then it is enough to keep only a sublinear sample of the stream, yielding a sketch of size O(n^{1+ρ-η}) and sublinear query time. They also extend the ANN result to batch queries and a restricted turnstile setting where adversarial deletions per local neighborhood are bounded.

For A-KDE, the paper targets the sliding-window model, where old points expire and standard RACE-style sketches do not directly apply. The main idea is to combine RACE with Exponential Histograms inside each cell so that counts can expire with the window while preserving an approximate kernel-density estimate. This produces, according to the paper, the first theoretical sublinear sketch guarantee for sliding-window A-KDE, with space O(RW · 1/(sqrt(1+ε)-1) · log^2 N). Experiments on ANN datasets (sift1m, fashion-mnist, and synthetic syn-32) and KDE datasets (News Headlines, ROSIS Hyperspectral Images, plus synthetic Gaussian mixtures) show low empirical error and modest memory use, though the theoretical claims depend on distributional assumptions and the empirical section is relatively modest in scope.

Key findings

  • For streaming ANN, the sketch stores only O(n^{1-η}) sampled points, yet the overall space bound is O(n^{1+ρ-η}/p1) words; this is the paper’s claimed sublinear trade-off under Poisson-distributed inputs.
  • The ANN query succeeds with failure probability at most 1/(3n^η) + (e^{mp} + e - 1)/e^{mp+1} in Theorem 3.1, where m is the Poisson mean for points in an r-ball and p = n^{-η}.
  • In the ANN analysis, setting k = ceil(log_{1/p2} n) makes the expected number of far collisions per hash table at most n^{-η}, enabling a Markov bound of 1/(3n^η) for the bad-collision event E2.
  • The paper claims sublinear ANN memory for all η ≥ 0.5 when ε = 0.5, and says that for sufficiently large ε there is a threshold η* above which the sketch stays sublinear without hurting performance.
  • On sift1m, the proposed ANN method beats the Johnson–Lindenstrauss baseline once ε is roughly above 0.7–0.8; on fashion-mnist, the crossover is around ε ≈ 0.9.
  • For fixed workloads, the ANN sketch has higher query throughput than the JL baseline across the reported datasets and parameter settings.
  • For sliding-window A-KDE, the proposed sketch size is O(RW · 1/(sqrt(1+ε)-1) · log^2 N), which is explicitly larger than plain RACE by a log^2 N factor but is the first theoretical guarantee the authors claim for this model.
  • In A-KDE experiments, the empirical relative error is reported to be much smaller than the worst-case theoretical bound, and accuracy is comparable to RACE on News Headlines, ROSIS Hyperspectral Images, and synthetic data.
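To see how these quantities interact, a quick back-of-the-envelope calculation helps; the LSH sensitivities p1, p2 and the other values below are illustrative assumptions, not numbers from the paper:

```python
import math

# Illustrative parameters (assumed): an (r, cr, p1, p2)-sensitive LSH family.
n, p1, p2, eta = 10**6, 0.8, 0.5, 0.5

rho = math.log(1 / p1) / math.log(1 / p2)      # LSH exponent rho
k = math.ceil(math.log(n) / math.log(1 / p2))  # concatenation length per table
L = n**rho / p1                                # number of hash tables

sampled_points = n ** (1 - eta)                # points retained after sampling
space_words = n ** (1 + rho - eta) / p1        # claimed overall space bound

print(f"rho={rho:.3f}, k={k}, L={L:.0f}")
print(f"sampled points ~ {sampled_points:.0f}, space ~ {space_words:.0f} words")
print(f"sublinear in n? {space_words < n}")
```

With these values the sketch stores about a thousand points and roughly 10^5 words of table entries against a stream of 10^6 points, which is the sublinear trade-off the bullet points describe.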

Threat model

For ANN, the adversary is an online stream generator that may be arbitrary in the worst case, but the theoretical guarantees assume the stream locally behaves like a Poisson point process and that the number of points in any radius-r neighborhood is sufficiently large on average. For the turnstile variant, an adversary can insert and delete points, but deletions are limited to at most d points per r-ball; the method cannot guarantee success if the adversary can wipe out all near neighbors in a local region. For A-KDE, there is no adversarial searcher model in the security sense; the challenge is streaming expiration, not malicious queries, so a security-style threat model is not central.

Methodology — deep read

The ANN part is framed in a streaming similarity-search model: a point set P of size at most n arrives online, updates are insertions in the main streaming setting, and queries ask for a (c,r)-approximate neighbor. The authors explicitly note that a worst-case adversary can force near-linear storage, so they assume a natural distributional model instead: the number of points inside any radius-r ball is modeled as Poisson with mean m, and they require m ≥ C n^η for some constant C > 0. The ANN sketch uses an LSH family H with sensitivity parameters (r, cr, p1, p2) and the standard ρ = log(1/p1)/log(1/p2) exponent.

The algorithm samples each arriving point with probability n^{-η}, then inserts retained points into L = n^ρ / p1 hash tables using concatenated hashes g_j. At query time, it retrieves all points from buckets matching the query under each g_j, deduplicates candidates, and returns the closest candidate if it lies within cr; otherwise it outputs NULL.

The key proof decomposes success into two events: E1, that at least one retained near point collides with the query in some table, and E2, that the total number of far-point collisions across all tables stays below 3L so the candidate set is not swamped. E2 is bounded using the fact that with k = ceil(log_{1/p2} n), a far point collides in one table with probability at most p2^k ≤ 1/n, so the expected number of far collisions after subsampling is at most L n^{-η}; Markov’s inequality gives failure probability at most 1/(3n^η). E1 is bounded by first using Poisson thinning to show that the chance a radius-r ball still contains a retained point after n^{-η} sampling is 1 - e^{-mp}, then lower bounding the collision success across L tables with standard LSH amplification. The ANN result is therefore not worst-case robust; it is a probabilistic guarantee that leans heavily on the Poisson assumption and on the local density m being large enough to survive subsampling.
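The failure budget from this E1/E2 decomposition can be checked numerically. The snippet below plugs in illustrative values (n, η, and the ball mean m are assumptions); the 1/e term is the standard LSH amplification failure bound with L = n^ρ/p1 tables:

```python
import math

# Illustrative values (assumptions, not from the paper).
n, eta = 10**6, 0.5
m = 10 * n**eta          # Poisson mean per r-ball, satisfying m >= C * n^eta
p = n**-eta              # per-point sampling probability, so m * p = 10

# E2: expected far collisions <= L * n^{-eta}; Markov gives 1/(3 n^eta).
p_E2_fail = 1 / (3 * n**eta)

# E1: the r-ball keeps no sampled point w.p. e^{-mp} (Poisson thinning);
# otherwise LSH amplification across the L tables fails w.p. at most 1/e.
p_no_survivor = math.exp(-m * p)
p_E1_fail = p_no_survivor + (1 - p_no_survivor) / math.e

total_failure = p_E1_fail + p_E2_fail
print(f"P[E2 fails] <= {p_E2_fail:.2e}")
print(f"P[E1 fails] <= {p_E1_fail:.4f}")
print(f"total failure <= {total_failure:.4f}")
```

The total lands just above 1/e, i.e. a constant success probability per query; independent repetitions of the whole structure would drive it down further.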

The data used for ANN experiments is named concretely: sift1m, fashion-mnist, and a synthetic syn-32 dataset designed to emulate a Poisson point process. The paper does not give all preprocessing details in the excerpted text, but the intent is to test the sublinear sketch in realistic high-dimensional retrieval settings and in a synthetic setting aligned with the theory. The reported empirical protocol varies ε and η, compares against a Johnson–Lindenstrauss baseline, and measures both error and query throughput. One concrete end-to-end query path is: a query q arrives, the system hashes q into each of the L tables, fetches all sampled points in the matching buckets, removes duplicates, computes true distances to the candidate set, and returns the nearest candidate if it falls within cr. This is exactly the classical LSH retrieval flow, except that the sketch only contains a random n^{-η} fraction of the input stream, so the candidate set is expected to remain small under the paper’s density assumptions.
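The retrieval flow above fits in a few dozen lines. Everything in this sketch is a simplified stand-in: random-hyperplane (SimHash) hashing substitutes for the paper's abstract (r, cr, p1, p2)-sensitive family, and the default parameters are illustrative, not the paper's settings.

```python
import math
import random

class SubsampledLSH:
    """Minimal streaming LSH sketch in the paper's style: keep each arriving
    point with probability n^{-eta}, insert survivors into L hash tables keyed
    by k concatenated hashes g_j. Hyperplane hashing is a stand-in for a
    generic LSH family; parameters are illustrative."""

    def __init__(self, dim, n, eta=0.5, k=8, L=16, seed=0):
        self.rng = random.Random(seed)
        self.sample_p = n ** -eta
        self.planes = [[[self.rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(k)] for _ in range(L)]
        self.tables = [{} for _ in range(L)]

    def _key(self, j, x):
        # g_j(x): k concatenated hyperplane-sign hashes.
        return tuple(sum(a * b for a, b in zip(h, x)) > 0
                     for h in self.planes[j])

    def insert(self, x):
        if self.rng.random() >= self.sample_p:   # subsample the stream
            return
        for j, table in enumerate(self.tables):
            table.setdefault(self._key(j, x), []).append(tuple(x))

    def query(self, q, cr):
        candidates = set()                       # the set deduplicates
        for j, table in enumerate(self.tables):
            candidates.update(table.get(self._key(j, q), []))
        best = min(candidates, default=None, key=lambda x: math.dist(x, q))
        # Return the nearest candidate if within cr, else NULL.
        return best if best is not None and math.dist(best, q) <= cr else None
```

With eta = 0 the sampling step is a no-op and this reduces to plain multi-table LSH; raising eta shrinks storage toward an n^{1-η}-point sample at the cost of recall, which is exactly the trade-off the theory quantifies.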

The turnstile extension keeps the same ANN machinery but adds a restriction on deletions: the adversary may delete at most d points from any r-ball, with d ≤ mp in the theorem statement. The argument uses a Poisson tail bound and Poisson thinning: after sampling, the number of retained points in a ball remains Poisson with mean mp, and the probability that deletions erase too much local mass is controlled. The proof text in the excerpt is truncated before the full theorem statement is shown, so the exact final failure bound for Theorem 3.3 is not fully visible here; however, the structure is clear enough to see that the same E1/E2 logic is being adapted to a strict-turnstile regime rather than rederived from scratch. Batch queries are handled by treating a batch Q = {q_i}_{i=1}^B as B independent ANN queries and union-bounding the failure probability, with the practical claim that the structure is easy to parallelize.
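The Poisson-thinning fact this argument leans on — independently keeping each of Poisson(m) points with probability p yields a Poisson(mp) count — is easy to sanity-check with a quick simulation; the values of m and p below are arbitrary illustrative choices:

```python
import math
import random

random.seed(7)
m, p, trials = 30.0, 0.2, 50_000   # illustrative: ball mean m, sampling prob p

def poisson(lam):
    # Knuth's multiplicative method; adequate for moderate lam.
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k - 1

thinned = []
for _ in range(trials):
    n_ball = poisson(m)                                   # points in an r-ball
    kept = sum(random.random() < p for _ in range(n_ball))  # keep each w.p. p
    thinned.append(kept)

mean = sum(thinned) / trials
var = sum((x - mean) ** 2 for x in thinned) / trials
print(f"mean={mean:.3f}, var={var:.3f}, target mp={m * p}")
# A Poisson(mp) count has mean == variance == mp, here 6.
```

The sample mean and variance both land near mp, the signature of a Poisson law, which is why the turnstile proof can reason about the retained points in a ball as Poisson(mp) and then apply a tail bound against d deletions.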

For A-KDE, the problem is defined in the sliding-window model: only the most recent N updates matter, and the goal is to approximate the kernel density at query x over those active points. The authors build on RACE, which hashes each point into an L x R counter array and estimates KDE by averaging the counts of the query’s buckets. The weakness of vanilla RACE in this setting is that it has no mechanism for expiration, so it cannot directly forget old points when the window advances. The proposed fix is to place an Exponential Histogram inside each RACE cell so that each bucket counter tracks only the last N updates and old contributions can be dropped with bounded error. The resulting sketch counts how many active-window points hash to the same bucket as the query while preserving approximate multiplicative accuracy, at the cost of an extra log^2 N factor in space. The paper also says this construction extends naturally to batch updates, where the window is defined over N batches rather than N individual items. In the experiments, the authors test synthetic Gaussian mixtures (50 Monte Carlo trials, 10,000 points each, 200 dimensions, 10 Gaussian distributions) and two real datasets, News Headlines and ROSIS Hyperspectral Images. They report that empirical relative error is much lower than the worst-case bound and that their A-KDE accuracy is comparable to RACE on all three datasets. Reproducibility details such as code release, random seed strategy, or exact hyperparameter tables are not present in the provided excerpt, so those cannot be verified here.
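A minimal sketch of the expiring-counter idea, assuming the standard Exponential Histogram of Datar et al. (at most k+1 buckets per size, merging beyond that, giving roughly 1/k relative error); the paper's construction places a structure like this inside each RACE cell in place of a plain integer counter:

```python
from collections import deque

class ExpHistogram:
    """Approximate count of 1-updates in the last `window` ticks via an
    Exponential Histogram (a stand-in for the per-cell structure the paper
    describes; k trades space for accuracy)."""

    def __init__(self, window, k=8):
        self.window, self.k = window, k
        self.buckets = deque()   # [timestamp, size], newest on the left
        self.t = 0

    def add(self, bit=1):
        self.t += 1
        # Expire buckets whose newest element has left the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.appendleft([self.t, 1])
        # Cascade merges: allow at most k+1 buckets of any one size.
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.k + 1:
                break
            i, j = same[-1], same[-2]        # the two oldest of this size
            self.buckets[j][1] *= 2          # merged bucket keeps newer stamp
            del self.buckets[i]
            size *= 2

    def count(self):
        # Exact except the oldest bucket, which contributes half its size.
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2
```

In the combined structure, one plausible wiring is for each cell of the L × R RACE array to hold one such histogram, calling add(1) when a point hashes to it and add(0) on every other stream update, so stale mass expires as the window slides; averaging the query's bucket counts then estimates windowed kernel density as in plain RACE.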

Technical innovations

  • Uses Poisson-point-process assumptions to justify sublinear sampling for streaming ANN, rather than relying on worst-case guarantees that would force near-linear storage.
  • Combines standard LSH amplification with uniform sampling to get a streaming ANN sketch that stores only O(n^{1-η}) points yet still answers queries with sublinear work.
  • Extends ANN to a restricted turnstile model by bounding deletions per r-ball and reusing Poisson thinning/tail arguments.
  • Integrates Exponential Histograms into each RACE cell to make approximate KDE work in the sliding-window model, where ordinary RACE cannot expire stale points.
  • Provides a batch-query extension for both ANN and A-KDE by parallelizing independent query evaluations over the same sketch.

Datasets

  • sift1m — not specified in excerpt — public benchmark dataset
  • fashion-mnist — not specified in excerpt — public benchmark dataset
  • syn-32 — synthetic dataset emulating a Poisson point process — generated by authors
  • News Headlines — not specified in excerpt — source not specified in excerpt
  • ROSIS Hyperspectral Images — not specified in excerpt — source not specified in excerpt
  • synthetic Gaussian mixtures — 50 sets of 10,000 points each, 200 dimensions, 10 multivariate Gaussian distributions — generated by Monte Carlo simulation

Baselines vs proposed

  • Johnson–Lindenstrauss baseline: ANN throughput/error comparison reported; proposed method overtakes it beyond ε ≈ 0.7–0.8 on sift1m
  • Johnson–Lindenstrauss baseline: ANN throughput/error comparison reported; proposed method overtakes it beyond ε ≈ 0.9 on fashion-mnist
  • RACE [CS20]: A-KDE accuracy reported comparable on News Headlines, ROSIS Hyperspectral Images, and synthetic data; no exact numeric delta given in excerpt

Limitations

  • The ANN guarantee depends on a Poisson point-process model and an assumption that each r-ball has mean mass m ≥ C n^η, so the result is not worst-case and may not hold under adversarial or highly clustered data.
  • The turnstile ANN extension is restricted: deletions are bounded per local ball (strict turnstile flavor), so arbitrary local annihilation of near neighbors is not allowed.
  • The excerpt does not provide the full experimental protocol details needed for exact reproduction: seeds, hyperparameter grids, preprocessing steps, and hardware are not fully specified here.
  • The A-KDE theorem pays an extra log^2 N factor in space relative to plain RACE, which may matter in long windows.
  • The paper claims sublinear space/time and low error, but the provided excerpt contains few concrete numeric error bars for the experiments beyond qualitative crossover points and dataset names.
  • No adversarial distribution-shift evaluation is shown; the empirical tests appear to be static datasets or controlled synthetic streams rather than evolving, worst-case streams.

Open questions / follow-ons

  • Can the ANN sampling idea be extended beyond Poisson neighborhoods to weaker stochastic assumptions or to data-dependent cluster models with provable local density bounds?
  • Can the turnstile ANN result be strengthened from bounded-per-ball deletions to fully adversarial deletions without losing sublinear memory?
  • Can sliding-window KDE be made tighter than the log^2 N overhead by using a different expiration structure or by adapting RACE more directly to time decay?
  • How do these sketches behave under real nonstationary streams, where the Poisson mean m or kernel bandwidth effectively drifts over time?

Why it matters for bot defense

For bot defense, the ANN result is mostly relevant as a design pattern: if your system stores large streams of embeddings or feature vectors for retrieval, the paper suggests that sublinear sketches can be practical when the data is locally dense and not adversarially structured. That matters for large-scale candidate retrieval, abuse clustering, or near-duplicate detection where you only need approximate matching and can tolerate a random-sampling front end.

For CAPTCHA or bot-detection pipelines specifically, the A-KDE part is more directly relevant. Sliding-window density estimation is a natural tool for tracking recent behavior over bounded time horizons, such as request-rate patterns, embedding distributions, or session-level anomalies. The paper’s main lesson is that if you need windowed KDE at scale, you can combine hash-based counting with an expiration mechanism instead of recomputing densities from scratch. The caveat is that the guarantee is approximate and the memory savings come with extra logarithmic overhead, so engineers would need to decide whether the added complexity is justified versus simpler windowed counters or sketches already used in their stack.

Cite

```bibtex
@article{arxiv2510_23039,
  title={Sublinear Sketches for Approximate Nearest Neighbor and Kernel Density Estimation},
  author={Ved Danait and Srijan Das and Sujoy Bhore},
  journal={arXiv preprint arXiv:2510.23039},
  year={2025},
  url={https://arxiv.org/abs/2510.23039}
}
```

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution