Index-Assisted Stratified Sampling for Online Aggregation
Source: arXiv:2604.28141 · Published 2026-04-30 · By Yunnan Yu, Zhuoyue Zhao
TL;DR
This paper tackles a very specific bottleneck in sampling-based approximate query processing: index-assisted online aggregation can give low-latency answers, but for skewed data or selective predicates, uniform sampling often needs too many samples to reach a target confidence interval. The authors’ core idea is to adapt stratified sampling to the realities of index-assisted sampling trees, where per-sample cost is not just a function of sample count but also of how far down the tree each stratum lies. They call the resulting system OptiAQP and frame it as a two-phase process: take an initial sample to estimate variance and candidate split points, then use those estimates to choose strata and allocate samples more efficiently for the remainder of the query.
What is new here is not stratified sampling itself, but the cost model and optimization strategy tailored to index-based sampling. They derive a modified Neyman allocation that minimizes total sampling cost when strata sit at different per-sample heights in the index tree, and they prove that finer stratification weakly decreases the sampling term of the objective, yielding a V-shaped trade-off against preprocessing cost. On the implementation side, they build four stratification strategies with different overhead/effectiveness trade-offs and integrate the system into PostgreSQL on top of an aggregate B-tree sampler. The reported result is that the approach consistently improves query latency, with the paper claiming up to 3x speedup over index-assisted uniform sampling and up to 98708x over classic scan-based stratified sampling in extreme cases.
Key findings
- The total cost objective is not just sample count; with index-assisted stratified sampling it becomes c0·k + Σ(n_i·h_i), where h_i is the LCA height for stratum i, so strata near the leaves are cheaper to sample than a flat sample over the whole range.
- The modified Neyman allocation replaces the classic n_i ∝ σ_i rule with n_i ∝ σ_i/√h_i, and the minimized cost becomes c0·k + Z_δ^2/ε^2 · (Σ_i σ_i√h_i)^2 (see the sketch after this list).
- Theorem 3.3 shows finer stratification cannot increase the sampling term (Σ σ_i√h_i)^2, so splitting a stratum always weakly decreases future sampling cost, though preprocessing cost grows with k.
- OptiAQP uses a two-phase execution plan: phase 0 draws an initial uniform sample of size n0 and estimates per-stratum variance and height, and phase 1 continues with optimized stratified sampling until the target (ε, δ) confidence interval is met.
- The Greedy strategy starts from strata induced by the left-most/right-most index paths and then recursively splits the highest-variance stratum; the paper says this has medium optimization overhead and is limited to single-table aggregation.
- The paper reports that OptiAQP achieves up to 3x speedup over index-assisted uniform sampling and up to 98708x speedup over classic scan-based stratified sampling in extreme cases.
- The authors state the implementation is in PostgreSQL using an aggregate B-tree (AB-tree) sampler and that the source code/data/artifacts are available at the anonymous 4open.science repository.
- For online aggregation, the system enforces a minimum of 30 samples per stratum when the allocated sample size is smaller, to keep the CLT-based confidence interval approximation valid.
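The allocation bullet above points here: a minimal Python sketch of the height-aware Neyman rule, not the paper's code. It assumes the per-stratum σ_i already fold in the stratum weights, treats h_i as the per-sample LCA height, and applies the 30-sample floor from the last bullet.

```python
import math

def z_score(delta):
    """Two-sided normal quantile for confidence level 1 - delta (bisection on the CDF)."""
    lo, hi = 0.0, 10.0
    target = 1.0 - delta / 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def height_aware_allocation(sigmas, heights, eps, delta, min_per_stratum=30):
    """Allocate n_i proportional to sigma_i / sqrt(h_i), scaled so the CLT half-width
    of the stratified estimator is at most eps with probability 1 - delta."""
    z = z_score(delta)
    s = sum(sig * math.sqrt(h) for sig, h in zip(sigmas, heights))
    scale = (z / eps) ** 2 * s                      # Lagrange scaling factor
    n = [max(min_per_stratum, math.ceil(scale * sig / math.sqrt(h)))
         for sig, h in zip(sigmas, heights)]
    sampling_cost = sum(ni * h for ni, h in zip(n, heights))  # the sum(n_i * h_i) term
    return n, sampling_cost

# Toy example: equal variance, different LCA heights; the shallow (cheap) stratum
# receives more samples and the deep (expensive) stratum fewer.
n, cost = height_aware_allocation(sigmas=[5.0, 5.0], heights=[2, 8], eps=1.0, delta=0.05)
print(n, cost)
```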
Threat model
The relevant adversary is not a malicious attacker but a difficult workload: an online analytics environment with frequently updated data, ad-hoc range queries, skewed value distributions, and selective predicates that make uniform sampling inefficient. The system assumes the sampling index is trustworthy and supports independent range sampling under snapshot isolation; it does not defend against adversarial data poisoning, index corruption, or a hostile query issuer trying to infer internal statistics. The paper’s guarantees are about unbiased estimation and confidence bounds under the sampling assumptions, not about resisting active attacks.
Methodology — deep read
The paper studies single-table approximate aggregation queries of the form SUM(e) under a range predicate x ∈ [L, U) plus an additional filter P_f, in an online aggregation setting where the user specifies a target half-width ε and confidence level 1−δ. The adversary model is not security-oriented; the relevant “threat” is operational: frequently updated data, ad-hoc range queries, and highly skewed value/selectivity distributions can make uniform sampling too slow to satisfy a confidence bound. The system assumes an index-assisted sampling backend (they focus on an aggregate B-tree / AB-tree) that can draw independent samples from a range without scanning the base table, and it assumes snapshot-style consistency so that samples come from a latest stable view. The user is assumed to know the target confidence bound, and the system is allowed to spend some initial budget n0 to estimate statistics before optimizing the remaining sampling plan.
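As a rough sketch of the confidence-interval check in this setting (not the paper's implementation), assume the aggregate B-tree reports N, the number of keys in [L, U), and that each sampled record contributes e(t) if it passes P_f and 0 otherwise; sampling stops once the CLT half-width of the scaled sample mean drops below ε.

```python
import math

def sum_estimate_and_halfwidth(sample_values, range_size, z):
    """CLT-based SUM(e) estimate and half-width from a uniform sample of a key range.

    sample_values: per-record contributions e(t) for sampled records (0 if P_f fails).
    range_size:    number of records with x in [L, U), assumed available from the index.
    z:             normal quantile for confidence 1 - delta (e.g. 1.96 for delta = 0.05).
    Requires at least two samples for the variance estimate.
    """
    n = len(sample_values)
    mean = sum(sample_values) / n
    var = sum((v - mean) ** 2 for v in sample_values) / (n - 1)
    estimate = range_size * mean
    halfwidth = z * range_size * math.sqrt(var / n)
    return estimate, halfwidth

# Online-aggregation loop sketch: keep drawing index-assisted samples and re-checking
# the half-width until halfwidth <= eps, reporting (estimate, halfwidth) periodically.
```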
Data and workload details are only partially specified in the excerpt, but the paper clearly evaluates on multiple real-world and synthetic datasets. The running motivating example is the US airline on-time performance dataset, using a cancellation-count query over date ranges with low baseline cancellation rates but occasional spikes (e.g., after a snowstorm). The paper also mentions a synthetic benchmark based on skewed TPC-H, and Table/Figure references indicate experiments with randomly generated query ranges and parameter sweeps for CostOpt and initial sample size n0. The excerpt does not enumerate all dataset sizes, exact train/test splits, or preprocessing beyond what is inherent to the sampling index. There is no supervised learning, so there is no labeling scheme; the key measurements are query execution time, speedup, and confidence-interval behavior.
Algorithmically, the paper first formalizes index-assisted sampling in an aggregate B-tree. A sample from a range is obtained by doing one preprocessing descent to locate the left and right boundaries of the range and then sampling leaf records by weight-guided descent. The key observation is that the per-sample cost depends on the height of the lowest common ancestor (LCA) of the range’s boundary paths, not just on the total table size. If a stratum lies in a smaller subtree, samples from that stratum start lower in the tree and are cheaper. They model total cost as c = c0·k + Σ_i n_i·h_i, where c0 is a preprocessing factor per stratum, k is the number of strata, n_i is the sample count allocated to stratum i, and h_i is the effective per-sample cost/height. With fixed strata, they derive a modified Neyman allocation via Lagrange multipliers: n_i is proportional to σ_i/√h_i, not just σ_i, and the resulting minimum cost is c0·k + Z_δ^2/ε^2 · (Σ_i σ_i√h_i)^2. A concrete end-to-end example in the paper is the flight-cancellation query: if a 20-day range contains a sharp spike around 9/11, the overall variance of the full-range estimator is high; splitting the range into strata around the spike reduces each stratum’s variance and lowers the LCA heights, so fewer total samples are needed for the same confidence interval.
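For reference, here is a compact reconstruction of the Lagrange-multiplier step described above, in the summary's notation (σ_i folds in the stratum weights, and the constraint fixes the estimator variance at the level implied by the (ε, δ) target); it follows the formulas quoted here rather than the paper's derivation verbatim.

```latex
% Height-aware Neyman allocation, reconstructed from the quoted cost model.
\begin{aligned}
&\text{minimize}\quad c_0 k + \sum_i n_i h_i
\qquad\text{subject to}\qquad \sum_i \frac{\sigma_i^2}{n_i} = \frac{\varepsilon^2}{Z_\delta^2},\\[2pt]
&\frac{\partial}{\partial n_i}\Big(\sum_j n_j h_j + \lambda \sum_j \frac{\sigma_j^2}{n_j}\Big) = 0
\;\Longrightarrow\; h_i = \lambda\,\frac{\sigma_i^2}{n_i^2}
\;\Longrightarrow\; n_i = \sqrt{\lambda}\,\frac{\sigma_i}{\sqrt{h_i}},\\[2pt]
&\text{the constraint then gives}\quad \sqrt{\lambda} = \frac{Z_\delta^2}{\varepsilon^2}\sum_j \sigma_j \sqrt{h_j},
\qquad
\min\ \text{cost} = c_0 k + \frac{Z_\delta^2}{\varepsilon^2}\Big(\sum_i \sigma_i \sqrt{h_i}\Big)^2.
\end{aligned}
```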
The optimization framework is two-phase. Phase 0 draws an initial uniform sample S0 of size n0 using the existing index-assisted sampler, then uses those samples to estimate per-stratum variance, candidate boundaries, and LCA heights. If that initial sample is already sufficient to meet the confidence target, the system returns without stratifying further. Otherwise, the system enters phase 1 and repeatedly draws stratified samples using the optimized partition. The output estimator combines phase-0 and phase-1 estimators using a weighted average formula, and the confidence interval is updated accordingly. For each iteration, the system computes the next batch size and per-stratum allocation with the modified Neyman rule; it also enforces a minimum of 30 samples per stratum for CLT validity. The excerpt suggests periodic output during execution, not just a final answer.
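The excerpt does not spell out the weighted-average formula, so the sketch below uses standard inverse-variance weighting of two unbiased estimators as a plausible stand-in for the phase-0/phase-1 combination; the weights are an assumption, not the paper's stated choice.

```python
import math

def combine_estimates(est0, var0, est1, var1, z):
    """Combine two unbiased estimators (phase 0 and phase 1) by inverse-variance
    weighting; an assumed form, since the excerpt omits the exact formula."""
    w0, w1 = 1.0 / var0, 1.0 / var1
    est = (w0 * est0 + w1 * est1) / (w0 + w1)
    var = 1.0 / (w0 + w1)
    return est, z * math.sqrt(var)   # combined estimate and CLT half-width
```

Inverse-variance weighting keeps the combined estimator unbiased and never widens the interval relative to either phase alone, which is consistent with the stated goal of reusing the phase-0 work rather than discarding it.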
For stratification itself, they implement four strategies. Greedy is a top-down heuristic that starts from strata naturally induced by the left/right boundary paths in the AB-tree, then splits the stratum with the highest estimator variance into its child subtrees until the marginal cost improvement falls below a threshold τ or the initial budget is exhausted. CostOpt is a dynamic-programming-style optimizer over the candidate boundary lattice; it aims for better cost reduction at higher optimization overhead. SizeOpt and Equal are baseline strategies based on prior stratified-sampling ideas; the paper characterizes them as having lower optimization overhead but widely varying effectiveness. The evaluation protocol compares these strategies against index-assisted uniform sampling and classic scan-based stratified sampling, measuring execution time/speedup over several datasets and query families. The excerpt does not provide statistical significance tests, exact seeds, or full hyperparameter tables, so reproducibility on those points is unclear, but the authors do state that source code, data, and artifacts are available through the anonymous 4open.science link.
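To make the Greedy control flow concrete, here is a schematic Python sketch, not the PostgreSQL implementation. The stratum objects and the helpers child_subtrees and estimate_stats are hypothetical stand-ins for AB-tree navigation and phase-0 statistics, and the cost function follows the summary's model c0·k + Z_δ²/ε²·(Σ σ_i√h_i)².

```python
import math

def greedy_stratify(initial_strata, child_subtrees, estimate_stats,
                    c0, tau, max_strata, z2_over_eps2=1.0):
    """Greedy stratification sketch: repeatedly split the highest-variance stratum
    into its child subtrees while the modelled cost drops by at least tau.

    child_subtrees(s) -> list of child strata (hypothetical AB-tree navigation helper)
    estimate_stats(s) -> {"sigma": ..., "height": ...} estimated from the phase-0 sample
    """
    def modelled_cost(strata):
        term = sum(estimate_stats(s)["sigma"] * math.sqrt(estimate_stats(s)["height"])
                   for s in strata)
        return c0 * len(strata) + z2_over_eps2 * term ** 2

    strata = list(initial_strata)
    while len(strata) < max_strata:
        victim = max(strata, key=lambda s: estimate_stats(s)["sigma"])  # highest variance
        children = child_subtrees(victim)
        if not children:
            break  # cannot split a single leaf subtree any further
        candidate = [s for s in strata if s is not victim] + list(children)
        if modelled_cost(strata) - modelled_cost(candidate) < tau:
            break  # marginal improvement below threshold tau
        strata = candidate
    return strata
```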
Technical innovations
- A sampling-cost model for index-assisted stratification that accounts for tree height/LCA depth, rather than treating all samples as equal-cost.
- A modified Neyman allocation that minimizes c0·k + Σ n_i h_i subject to the confidence-interval constraint, yielding n_i ∝ σ_i/√h_i instead of the classic variance-only allocation.
- A proof that finer stratification weakly decreases the sampling term (Σ σ_i√h_i)^2, establishing a monotone trade-off between stratification granularity and preprocessing cost.
- A two-phase online-aggregation framework that uses initial samples both to produce answers and to estimate future stratification statistics without requiring a priori data statistics.
- Four concrete stratification optimizers, including a tree-structure-guided Greedy heuristic and a more exhaustive CostOpt dynamic-programming approach.
Datasets
- US airline on-time performance dataset — size not specified in excerpt — public dataset
- skewed TPC-H synthetic benchmark — size not specified in excerpt — synthetic benchmark generated by authors
- randomly generated query ranges — size not specified in excerpt — synthetic workload used in evaluation figures
Baselines vs proposed
- Index-assisted uniform sampling (baseline) vs OptiAQP: up to 3x speedup (paper claim)
- Classic scan-based stratified sampling (baseline) vs OptiAQP: up to 98708x speedup in extreme cases (paper claim)
- SizeOpt: qualitative result only in excerpt; reported as varying a lot and generally worse than Greedy/CostOpt
- Equal: qualitative result only in excerpt; reported as having the smallest overhead but highly variable effectiveness
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.28141.

- Fig 15: Speedup over uniform with randomly generated query ranges
- Fig 16: Varying parameters for CostOpt
- Fig 17: Varying parameters
- Fig 18: Execution time
- Fig 19: Varying the initial sample size n0 for CostOpt
Limitations
- The excerpt does not provide the full experimental table, so many concrete dataset sizes, exact query counts, and per-figure numeric results are not visible here.
- The optimal-stratification proof assumes access to full statistics and candidate boundaries; the practical system only approximates these from an initial sample, so optimality is not guaranteed in deployment.
- Greedy is explicitly limited to the single-table aggregation setting discussed in the paper; the authors note it does not directly extend to joins or other query forms without additional work.
- The CLT-based confidence interval machinery requires a minimum sample count per stratum and asymptotic normality assumptions, which can be shaky for tiny strata or heavy-tailed data.
- The paper’s biggest speedup claims are described as extreme cases; without the full benchmark table, it is hard to tell how often those extremes occur versus typical cases.
- Reproducibility is improved by an artifact link, but the excerpt does not confirm whether code, data, and scripts are fully self-contained or whether all baselines are exactly reproducible.
Open questions / follow-ons
- Can the height-aware cost model be generalized to joins, group-by, or spatial queries without losing the monotonic stratification property?
- How robust is the phase-0 estimated stratification when the initial sample is very small or the distribution is highly non-stationary under updates?
- Can the optimizer be adapted to multivariate predicates or correlated dimensions where 1-D range strata are a poor fit?
- What is the best way to combine this with adaptive stopping rules so that stratification overhead is only paid when the expected savings are large enough?
Why it matters for bot defense
For bot-defense practitioners, this paper is mainly relevant as an example of how much performance depends on matching the sampling strategy to the underlying data-access path. The analogous lesson for CAPTCHA or bot-detection telemetry is that if your verification or scoring pipeline samples events, users, or sessions, a flat random sample may be wasteful when the data are skewed or the access path has heterogeneous cost. A stratified design can reduce latency if you can estimate which partitions are variance-heavy or expensive to inspect, but the paper also shows the downside: you need an initial measurement phase, an optimizer, and a model that is good enough to justify the extra complexity. In practice, a bot-defense team would translate this into careful partitioning of traffic or sessions by cost/variance drivers, then only apply more expensive checks where they are statistically likely to matter.
Cite
@article{arxiv2604_28141,
  title={Index-Assisted Stratified Sampling for Online Aggregation},
  author={Yunnan Yu and Zhuoyue Zhao},
  journal={arXiv preprint arXiv:2604.28141},
  year={2026},
  url={https://arxiv.org/abs/2604.28141}
}