
Perfect $L_p$ Sampling with Polylogarithmic Update Time

Source: arXiv:2512.00632 · Published 2025-11-29 · By William Swartworth, David P. Woodruff, Samson Zhou

TL;DR

This paper addresses perfect Lp sampling in turnstile data streams: given a vector x ∈ {−poly(n),...,poly(n)}^n defined by turnstile updates, output an index i with probability |x_i|^p/||x||_p^p, up to negligible additive error. Previous work by Jayaram and Woodruff (2018) achieves optimal Õ(log^2 n) bits of memory but at a prohibitively high polynomial update time of at least n^C per stream update, which is impractical for large datasets. The authors present the first perfect Lp sampler for 0<p<2 with the same optimal space complexity but with polylogarithmic update time, an exponential improvement in efficiency. The key technical novelty is an efficient approximation of the sum of reciprocals of powers of truncated exponential random variables, via evaluation of characteristic functions using the Gil-Pelaez inversion formula and numerical integration with the trapezoid rule. This allows them to sample accurately and efficiently from a complicated joint distribution of order statistics and partial sums, enabling a compact sketch-based heavy hitter detection scheme that operates on a duplicated vector with only polylogarithmic overhead. The method does not require a random oracle and works in the general turnstile streaming model. The update time is exponentially faster than that of prior perfect samplers, while maintaining the desired exactness guarantees. This resolves a long-standing open problem in streaming algorithms.

Key findings

  • The proposed perfect Lp sampler achieves polylog(n) update time per stream update, improving exponentially over the previous n^C update time of [JW18].
  • The sampler uses Õ(log^2 n) bits of space, matching the best known space complexity for perfect Lp sampling for 0 < p < 2.
  • It outputs an index i with probability exactly Pr[i^* = i] = |x_i|^p / ||x||_p^p ± 1/n^C, for an arbitrarily large constant C.
  • By duplicating coordinates poly(n) times, the authors remove anti-rank vector dependencies, and they avoid the resulting prohibitive update time via sketching with dense Gaussian CountSketch structures.
  • The approach simulates sums of reciprocals of powers of truncated exponential RVs by approximating their characteristic function and computing the cumulative distribution function via the Gil-Pelaez inversion formula and trapezoid numerical integration in polylogarithmic time.
  • They show that the truncated sums approximate a p/2-stable distribution with total variation error ≤ 1/poly(n) for a large duplication parameter C (see the simulation sketch after this list).
  • A novel statistical test on a reduced-dimensional vector derived from top order statistics and Gaussian sketches enables testing the correctness of the heavy hitter identification.
  • Derandomization is achieved using the PRG of [GKM18], fooling Fourier shapes with seed length Õ(log^2 n), maintaining correctness and space bounds.
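
As a concrete check of the p/2-stable limit flagged above, here is a small Monte Carlo simulation (ours, not from the paper); the values of p, the number of trials, and the duplicate counts m are arbitrary toy choices.

python
import numpy as np

rng = np.random.default_rng(3)
p, trials = 1.0, 2_000

def scaled_sum(m):
    # X_j = E_j^{-2/p} has tail P(X_j > t) ~ t^{-p/2}, a stable index
    # p/2 < 1, so m^{-2/p} * sum_{j<=m} X_j converges in distribution
    # to a one-sided (p/2)-stable law, with no centering needed.
    E = rng.exponential(size=(trials, m))
    return (E ** (-2.0 / p)).sum(axis=1) / m ** (2.0 / p)

# Quantiles agree across different m once the limit kicks in.
qs = [0.25, 0.5, 0.75, 0.9]
print(np.quantile(scaled_sum(500), qs))
print(np.quantile(scaled_sum(5_000), qs))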

Threat model

The adversary is the streaming data source providing turnstile updates to an n-dimensional frequency vector; it may choose the update sequence arbitrarily, even adaptively, but cannot interfere with the internal randomness of the algorithm. The adversary has no direct knowledge of the algorithm's random bits and cannot tamper with the sketching process internally. The threat model requires the algorithm to sample from the exact Lp distribution up to negligible additive error, despite adversarial or arbitrary update sequences.

Methodology — deep read

  1. Threat model & assumptions: The adversary is the data stream itself, which provides turnstile updates (positive or negative increments) to a frequency vector x ∈ R^n. The algorithm must output a sample i∗ with probability exactly proportional to |x_i|^p/||x||_p^p up to negligible additive error 1/n^C for an arbitrarily large constant C. The adversary cannot modify the algorithm's internal randomness or force more than poly(n) updates.

  2. Data: The input is a turnstile stream of updates (i_t, Δ_t), where i_t ∈[n] and Δ_t ∈{−M,...,M} with M = poly(n). The entire frequency vector x ∈ R^n is implicitly defined by x_i = sum_{t: i_t = i} Δ_t. The algorithm must process these updates sequentially and in one pass, maintaining sublinear (polylogarithmic) space and runtime.
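
A minimal model of this input format, with toy values of our choosing:

python
# Turnstile updates (i_t, Δ_t) implicitly define x_i = Σ_{t : i_t = i} Δ_t.
n = 6
stream = [(2, +5), (0, +3), (2, -4), (5, +1), (0, -3)]

x = [0] * n              # the implicit frequency vector
for i, delta in stream:  # one pass, in arrival order
    x[i] += delta        # a streaming algorithm cannot afford this explicit
                         # vector; it maintains only small sketches of it
print(x)                 # [0, 0, 1, 0, 0, 1]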

  3. Architecture / algorithm: The core idea is to simulate a vector z derived by scaling each coordinate x_i by 1/e_i^{1/p}, where e_i are independent exponential random variables. The index i∗ maximizing |z_i| corresponds exactly to the desired Lp sample. To remove dependencies due to large coordinates (anti-rank vector issues), they duplicate each coordinate poly(n) times, forming an expanded vector V of size N = n · n^C. However, explicitly storing V is too expensive.
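
The embedding itself is easy to verify offline. The following sketch (ours; the offline view, not the streaming algorithm) draws the exponentials explicitly and checks the sampling probabilities empirically, with arbitrary toy values for x and p:

python
import numpy as np

# With e_i ~ Exp(1) i.i.d. and z_i = x_i / e_i^{1/p}, the index argmax_i |z_i|
# is distributed exactly as |x_i|^p / ||x||_p^p: since |z_i|^p = |x_i|^p / e_i,
# the argmax is the argmin of the independent exponentials e_i / |x_i|^p,
# which lands on i with probability proportional to the rate |x_i|^p.
rng = np.random.default_rng(1)
x = np.array([4.0, -2.0, 1.0, 0.0, 3.0])
p = 1.0

e = rng.exponential(size=(100_000, x.size))
z = x / e ** (1.0 / p)
idx = np.argmax(np.abs(z), axis=1)

print(np.bincount(idx, minlength=x.size) / idx.size)  # empirical frequencies
print(np.abs(x) ** p / np.sum(np.abs(x) ** p))        # target L_p distribution

This offline simulation stores the entire scaled vector, which is exactly what a streaming algorithm cannot afford.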

Instead, they implement a dense Gaussian CountSketch: a random Gaussian matrix G ∈ R^{k×N} (k = O(log n)) multiplies the frequency vector V, and the sketch G·V is stored and updated incrementally. Estimators for entries of V are obtained via dot products with columns of G. The key challenge is efficiently simulating the norm ||V^i||_2^2 of each duplicated block corresponding to coordinate i, which involves sums of inverse powers of exponentials truncated by the top τ order statistics.
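
The sketch-maintenance step in isolation looks as follows (a toy of ours with an explicit G and small dimensions; in the paper, G has k = O(log n) rows and its Gaussian entries for the N = n · n^C implicit coordinates are simulated rather than stored):

python
import numpy as np

rng = np.random.default_rng(2)
N, k = 1_000, 8
G = rng.normal(size=(k, N))          # explicit here only for illustration

V = np.zeros(N)                      # kept only to check the estimate below
sketch = np.zeros(k)

for i, delta in [(17, +2.0), (999, -1.0), (17, +1.0)]:
    V[i] += delta
    sketch += delta * G[:, i]        # O(k) work per turnstile update

# Estimate V[i] by correlating the sketch with column i of G.
i = 17
est = (G[:, i] @ sketch) / (G[:, i] @ G[:, i])
print(est, V[i])                     # est ≈ 3, plus cross-term noise

The hard part is simulating the Gaussian entries consistently for the huge implicit vector V, which is what the next step addresses.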

The authors approximate the distribution of these sums using characteristic functions computed via the Gil-Pelaez inversion formula, numerically integrated with a trapezoid rule to polylogarithmic precision. This enables sampling from the joint distribution of the maximum and sum of the other terms to generate Gaussian random variables consistent with the sketching process.
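
A minimal, self-contained version of this inversion step (our sanity check on N(0,1), whose characteristic function is known exactly; the paper applies the same formula to the characteristic function of its truncated sums, and our cutoff t_max and step count are arbitrary choices):

python
import numpy as np
from math import erf, sqrt

# Gil-Pelaez inversion: F(x) = 1/2 - (1/pi) * ∫_0^∞ Im[e^{-itx} φ(t)] / t dt,
# evaluated with a uniform trapezoid rule.

def gil_pelaez_cdf(phi, x, t_max=40.0, steps=20_000):
    t = np.linspace(1e-9, t_max, steps)   # skip the removable t = 0 singularity
    f = np.imag(np.exp(-1j * t * x) * phi(t)) / t
    dt = t[1] - t[0]
    integral = dt * (f.sum() - 0.5 * (f[0] + f[-1]))   # trapezoid rule
    return 0.5 - integral / np.pi

phi_normal = lambda t: np.exp(-t ** 2 / 2)   # characteristic function of N(0,1)
for x in (-1.0, 0.0, 1.5):
    print(gil_pelaez_cdf(phi_normal, x), 0.5 * (1 + erf(x / sqrt(2))))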

A statistical test on the recovered estimates of the top two coordinates and a norm approximation controls the error to ensure exact sampling probabilities. The algorithm maintains O(log n) copies (rows) of the CountSketch for high confidence.

  4. Training regime / implementation details: The method is fully streaming; as each update arrives, the algorithm updates the estimates of H_i (top τ order stats) and N_i (Gaussian noise) for the relevant coordinate i using the dense CountSketch rows. Approximation parameters and truncation thresholds are carefully tuned to ensure polylogarithmic worst-case update time with high probability.

  5. Evaluation protocol: The sampler is analyzed theoretically to show exact probability guarantees up to negligible additive error 1/n^C and meets the space and update time bounds. Comparison to prior works' space-time tradeoffs is summarized (Table 1). No experimental evaluation is presented; the claims are theoretical but rigorous.

  6. Reproducibility: The authors provide detailed pseudocode (Algorithm 2), detailed proofs, and refer to implementable numerical methods (Gil-Pelaez inversion + trapezoid integration) with complexity bounds. No open source code or datasets are mentioned; the work is theoretical and mathematical in nature.

Example end-to-end: For a coordinate i, the algorithm generates the limiting distribution of the top τ order statistics H_i of the exponentials, approximates the sum σ_i^2 of the remaining terms conditioned on H_i, samples Gaussian noise N_i ∼ N(0, σ_i^2), and updates the dense CountSketch with Δ · N_i for a stream update Δ to x_i. At stream end, the algorithm recovers estimates of each H_i, runs its statistical test, and outputs the index with the largest estimate if the test passes.
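
A toy version of this per-coordinate pipeline (our simplification: the tail variance is summed outright rather than sampled via the characteristic-function machinery, and p, τ, and the duplicate count are arbitrary small values):

python
import numpy as np

rng = np.random.default_rng(0)
p, tau, m_dup, n = 1.0, 3, 10_000, 5   # toy sizes; the paper duplicates n^C times

def effective_multiplier():
    # Conditioned on its exponentials, coordinate i's duplicated block
    # contributes a Gaussian with variance S_i = sum_j E_j^{-2/p} per sketch
    # row. Split S_i into the top-tau order statistics (H_i) and the tail
    # (sigma_i^2); the paper samples the tail's law in polylog time, while
    # this toy just sums it outright.
    E = np.sort(rng.exponential(size=m_dup))      # smallest E => largest 1/E
    S_top = np.sum(E[:tau] ** (-2.0 / p))         # H_i terms
    sigma2_tail = np.sum(E[tau:] ** (-2.0 / p))   # tail variance sigma_i^2
    return rng.normal(0.0, np.sqrt(S_top + sigma2_tail))

# Fix one multiplier per coordinate per sketch row; each update then costs
# O(1) per row. (The paper regenerates these values pseudorandomly via a PRG
# rather than storing them, which keeps the space at Õ(log^2 n).)
G_eff = np.array([effective_multiplier() for _ in range(n)])
row = 0.0
for i, delta in [(2, +3.0), (0, -1.0), (2, +2.0)]:   # turnstile updates (i, Δ)
    row += delta * G_eff[i]
print(row)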

Technical innovations

  • Efficient polylogarithmic-time approximation of sums of reciprocals of powers of truncated exponential random variables using numerical evaluation of characteristic functions and the Gil-Pelaez inversion formula.
  • Use of dense Gaussian CountSketch with properly simulated Gaussian entries to enable sketching on duplicated coordinate vectors without paying polynomial update time.
  • Reduction of dependence on heavy hitter order statistics by duplicating coordinates poly(n) times and combining top τ order statistics and tail sums into a compact vector representation for statistical testing.
  • Derandomization of the sampling algorithm using the PRG of Gopalan, Kane, Meka (2018) targeting Fourier shapes, reducing randomness without increasing space or hurting guarantees.

Baselines vs proposed

  • Jayaram and Woodruff (2018): Update time ≥ n^C per stream update vs Proposed: poly(log n) update time.
  • Jayaram and Woodruff (2018): Space = Õ(log^2 n) bits vs Proposed: Same Õ(log^2 n) bits of space complexity.
  • Prior perfect Lp samplers for p ∈ (0,1) in insertion-only streams with O(1) update time [PW25] require random oracles, while the proposed sampler works in the turnstile model without random oracles.

Limitations

  • The algorithm requires duplicating the input vector coordinates poly(n) times to remove anti-rank dependencies, increasing the effective dimension to N = n · n^C and potentially affecting constants hidden by polylog factors.
  • The complexity analysis relies on high probability bounds, and although worst-case update time is polylogarithmic with high probability, the paper does not explore practical runtime constants or empirical performance.
  • The distribution approximations and numerical integration assume ideal numerical precision and stability; the paper does not address floating point errors or implementation details.
  • The derandomization approach depends on advanced PRG constructions, which may add overhead and complexity in practice.
  • The method is limited to the range p ∈ (0, 2); for p ≥ 2, known space lower bounds still apply and are not addressed.

Open questions / follow-ons

  • Can the update time constants and overhead be reduced further to achieve practical implementations?
  • Is it possible to extend perfect Lp sampling to p ≥ 2 with similar polylogarithmic update time and near-optimal space?
  • Can the dependence on duplication parameter n^C be removed or reduced while maintaining perfect sampling guarantees?
  • Are there other function families G(·) besides |·|^p for which perfect streaming samplers with polylog update time are possible?

Why it matters for bot defense

Perfect Lp sampling is a fundamental primitive in streaming algorithms related to detecting heavy hitters and important features in large-scale data streams. In the context of bot-defense and CAPTCHA systems, efficient and exact sampling from data streams could be used to identify anomalous or heavy traffic patterns, or to efficiently select representative challenges or users based on dynamic interaction frequency distributions. The exponential speedup in update time provided by this work allows such streaming samplers to scale to massive streams without the computational bottleneck faced previously. Moreover, the turnstile model applies to both increases and decreases in user activity or signals, accommodating more complex real-world scenarios. The technical approach—leveraging approximation of complicated heavy-tailed distributions with tight numerical methods combined with advanced sketching and derandomization—could inspire new bot detection mechanisms based on exact probabilistic sampling with strong guarantees, potentially improving robustness and detection fidelity in adversarial environments.

Cite

bibtex
@article{arxiv2512_00632,
  title={Perfect $L_p$ Sampling with Polylogarithmic Update Time},
  author={Swartworth, William and Woodruff, David P. and Zhou, Samson},
  journal={arXiv preprint arXiv:2512.00632},
  year={2025},
  url={https://arxiv.org/abs/2512.00632}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution