MLSkip: Data Skipping for ML Filters via Lightweight Metadata

Source: arXiv:2606.03946 · Published 2026-06-02 · By Mihail Stoian, Mark Gerarts, Pascal Ginter, Andreas Zimmerer, Jan Van den Bussche, Andreas Kipf

TL;DR

This paper addresses data skipping for machine learning (ML) filters embedded in database query predicates, a setting where traditional metadata-based pruning does not apply directly. An ML filter, especially one backed by a model such as a neural network with ReLU activations, is not a simple comparison on a column, so min-max statistics on numerical columns cannot be matched against it the way they are for ordinary predicates. The authors present what they describe as the first study of data skipping for ML filters: they combine Parquet's default min-max column metadata with neural network verification methods to safely prune row groups that cannot qualify.

The main technical contribution is the MLSkip framework, which uses existing lightweight metadata (min-max ranges) to encode input constraints and verify if the ML model output can satisfy the filter predicate. They also propose enhanced metadata in the form of a size-bounded 2D convex hull over column pairs to improve pruning precision. Using TPC-H and TPC-DS benchmarks with trained ReLU networks on 2-4 features, they demonstrate average pruning effectiveness of 27.4% with min-max metadata, boosted to 38.31% with bounded convex hull metadata, at a modest metadata overhead (≤45 bytes per column pair and row group). This translates into an end-to-end query speedup of 1.07× over PyTorch inference in DuckDB, showing practical benefits for database ML filter execution.

Key findings

Parquet's default min-max metadata enables data skipping for ReLU ML filters with an average pruning effectiveness of 27.4% on filters with selectivity ≤0.1% in TPC-H and TPC-DS benchmarks (scale factor 1).
Enhanced metadata using a bounded 2D convex hull increases pruning effectiveness to 38.31%, about 10.9 percentage points higher than min-max alone.
ConvexHull metadata (without a bound on the number of vertices) achieves 39.3% pruning, but at a prohibitive metadata size of up to 304 bytes per column pair and row group.
BoundedConvexHull metadata has a size bounded by 45 bytes per row group and column pair, controlling storage overhead effectively.
End-to-end query speedup across all filters with two-hidden-layer ReLU networks is 1.07× over PyTorch in DuckDB, and it correlates positively with the row group pruning ratio (see Fig. 4).
Pruning time per row group verification is faster with Marabou than with ML-QL; bounded convex hull metadata decreases verification time compared to unbounded convex hulls.
Row group size affects pruning effectiveness: pruning on 10K row groups is less effective than on 1K row groups due to coarser metadata granularity (Fig. 2).
Building bounded convex hull metadata takes at most about 31.74 ms per row group and column pair, more than min-max (12.09-20.09 ms) but deemed acceptable.
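
To put the per-entry sizes above in perspective, the following back-of-the-envelope sketch multiplies the reported bounds (45 and 304 bytes per column pair and row group) by the number of row groups. The function name, the one-million-row table, and the single column pair are illustrative assumptions, not figures from the paper.

```python
import math


def metadata_overhead_bytes(num_rows: int, row_group_size: int,
                            column_pairs: int, bytes_per_entry: int) -> int:
    """Upper bound on metadata size: one entry per row group and column pair."""
    row_groups = math.ceil(num_rows / row_group_size)
    return row_groups * column_pairs * bytes_per_entry


# Hypothetical table: 1,000,000 rows and one indexed column pair.
for size in (1_000, 10_000):
    bounded = metadata_overhead_bytes(1_000_000, size, 1, 45)
    unbounded = metadata_overhead_bytes(1_000_000, size, 1, 304)
    print(f"row groups of {size}: bounded <= {bounded} B, unbounded <= {unbounded} B")
```

Under these assumptions the bounded variant stays in the tens of kilobytes even at the finer 1K row group size, while the worst case for the unbounded hull is roughly 6.8 times larger.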

Threat model

The setting is a benign but resource-constrained query engine executing SQL queries with embedded ML filters, here ReLU neural networks that the engine otherwise treats as black boxes. The objective is to safely prune row groups that provably cannot satisfy the ML predicate, reducing computation and I/O. The verifier makes no assumptions about how the model behaves, but pruning must be sound: a row group that contains qualifying data must never be skipped, whereas failing to skip a non-qualifying row group only costs performance. There is no explicitly malicious adversary; the difficulty is that exact evaluation of the filter predicate is costly. The verifier assumes that the min-max and enhanced metadata were built from data snapshots and are reliable, so metadata tampering is out of scope.

Methodology — deep read

The setting is a database system executing SQL queries whose predicates call ML models, specifically feed-forward ReLU neural networks with 1-2 hidden layers. Each model takes 2-4 input columns and produces a scalar output that is compared against a threshold (e.g., NN output > 0.9). The goal is to skip a row group, avoiding its disk I/O and inference, whenever the model output provably cannot satisfy the filter predicate given the row group's data ranges.

Data provenance involves tables from the industry-standard TPC-H and TPC-DS benchmarks at scale factor 1. Models are trained on the first 2,000 rows of these datasets using 10 regression templates proposed by OpenAI GPT5.5 to learn column prediction from a subset of 2-4 input columns. The trained NNs have 1 or 2 hidden layers with 32 nodes each (up to 1,317 parameters total). A total of 1,376 ML filter predicates with varying selectivities are generated by applying output range conditions on the trained models.

The main algorithmic technique leverages Parquet's built-in metadata: min and max value statistics per column per row group. The input constraints correspond to a multi-dimensional rectangle (Cartesian product of min-max intervals). The core idea is then to use neural network verification tools to check whether the NN output can satisfy the filter constraint for any input within that rectangle. Two verification tools are benchmarked: Marabou (a mature SMT solver for ReLU NNs) and ML-QL (a SQL-query-language-based approximate verifier). If verification proves no output in the target range, the row group can be skipped.
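
The check described above can be illustrated with a minimal sketch. Note the substitution: instead of Marabou or ML-QL, it uses interval bound propagation, a simpler technique that is sound but looser than an exact verifier, so it may keep row groups that an exact verifier would skip. The toy weights, the function names, and the example boxes are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]
# One layer is (weights[out][in], biases[out]).
Layer = Tuple[List[List[float]], List[float]]


def affine_bounds(box: Sequence[Interval], weights, biases) -> List[Interval]:
    """Propagate a box through y = W x + b with interval arithmetic."""
    out = []
    for row, b in zip(weights, biases):
        lo = hi = b
        for w, (x_lo, x_hi) in zip(row, box):
            # A non-negative weight maps lower to lower; a negative one swaps them.
            if w >= 0:
                lo += w * x_lo
                hi += w * x_hi
            else:
                lo += w * x_hi
                hi += w * x_lo
        out.append((lo, hi))
    return out


def output_bounds(box: Sequence[Interval], layers: Sequence[Layer]) -> Interval:
    """Sound (possibly loose) bounds on a scalar-output ReLU network over a box."""
    bounds = list(box)
    for i, (weights, biases) in enumerate(layers):
        bounds = affine_bounds(bounds, weights, biases)
        if i < len(layers) - 1:  # ReLU on hidden layers only
            bounds = [(max(0.0, lo), max(0.0, hi)) for lo, hi in bounds]
    return bounds[0]


def can_skip(box: Sequence[Interval], layers: Sequence[Layer], threshold: float) -> bool:
    """True only if no input in the box can satisfy `model(x) > threshold`."""
    _, upper = output_bounds(box, layers)
    return upper <= threshold


# Toy network with made-up weights: 2 inputs -> 2 hidden ReLU units -> 1 output.
TOY_LAYERS: List[Layer] = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0]], [0.0]),
]

# Each box is the Cartesian product of per-column [min, max] for one row group.
print(can_skip([(0.0, 0.5), (0.0, 1.0)], TOY_LAYERS, 0.9))  # output bounded by 0.5
print(can_skip([(0.0, 1.0), (0.0, 1.0)], TOY_LAYERS, 0.9))  # output may reach 1.0
```

The design point carries over to any verifier: the decision to skip depends only on a proven upper bound over the whole rectangle, never on sampled points.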

To improve on min-max metadata, the authors introduce enhanced metadata structures capturing tighter input bounds: a vanilla 2D convex hull around column pair data points and a bounded convex hull formed by partitioning the input rectangle into a grid and approximating occupied cells with a convex polygon. This limits the number of vertices and metadata size, allowing faster verification.
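
One way to picture the bounded variant is sketched below: points are snapped to a coarse grid over the row group's min-max rectangle, and the hull is computed over the corners of the occupied cells, so every vertex is a small integer pair. The paper's exact construction and byte encoding are not reproduced here; the grid resolution, function names, and sample points are assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point] = []
    upper: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def bounded_hull(points: Sequence[Point], grid: int = 4) -> List[Tuple[int, int]]:
    """Hull over the corners of occupied grid cells, in integer grid coordinates."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)

    def cell(v: float, lo: float, hi: float) -> int:
        if hi == lo:  # degenerate column: a single cell
            return 0
        return min(grid - 1, int((v - lo) / (hi - lo) * grid))

    corners = set()
    for x, y in points:
        cx, cy = cell(x, x_min, x_max), cell(y, y_min, y_max)
        # Covering the whole cell keeps the polygon a superset of the data.
        corners.update({(cx, cy), (cx + 1, cy), (cx, cy + 1), (cx + 1, cy + 1)})
    return convex_hull(list(corners))


# Hypothetical (population_density, housing_cost) values for one row group.
sample = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0), (2.0, 3.0)]
print(bounded_hull(sample))
```

Because the vertices lie on a fixed lattice, their number depends on the grid resolution and not on the number of rows, which is what keeps the metadata size bounded. The polygon over-approximates the data, so any verification result derived from it remains safe for pruning.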

The training regime involves standard supervised learning on 2k samples with small batch sizes, but the paper does not delve deeply into training details or hyperparameters. Verification runtime and metadata build overhead are measured on an Intel Xeon Gold 5318Y CPU with 24 cores and 128GB RAM under Ubuntu 24.04.

Evaluation metrics include pruning effectiveness (percentage of prunable row groups actually pruned), metadata size and build time overhead per row group, verification runtime per row group, and overall end-to-end query speedup running PyTorch inference in DuckDB. They evaluate pruning efficacy for different filter selectivity buckets (from essentially 0% up to 10%) and different row group sizes (1,000 vs 10,000 rows). Baselines are min-max metadata with Marabou and ML-QL verifiers.
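
The pruning effectiveness metric can be written down as a short sketch. It assumes our reading of the definition, namely that a row group is "prunable" when it contains no qualifying row; the function and argument names are hypothetical.

```python
from typing import Sequence


def pruning_effectiveness(has_qualifying_row: Sequence[bool],
                          skipped: Sequence[bool]) -> float:
    """Percentage of prunable row groups (no qualifying row) that were skipped."""
    if len(has_qualifying_row) != len(skipped):
        raise ValueError("inputs must describe the same row groups")
    if any(q and s for q, s in zip(has_qualifying_row, skipped)):
        raise ValueError("unsound pruning: a qualifying row group was skipped")
    prunable = [s for q, s in zip(has_qualifying_row, skipped) if not q]
    if not prunable:
        return 0.0
    return 100.0 * sum(prunable) / len(prunable)


# Five row groups: the last one holds a qualifying row, two of the other four are skipped.
print(pruning_effectiveness([False, False, False, False, True],
                            [True, False, False, True, False]))
```

Under this definition, 100% means the metadata and verifier together skip every row group that an oracle could skip, so the metric isolates the precision of the metadata from the selectivity of the filter.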

One concrete example: for a filter sanity_score_nn(population_density, housing_cost) > 0.9, the min-max ranges of these two columns form the input constraints, and Marabou is asked whether any input vector in that rectangle yields an output > 0.9. If Marabou proves that no such input exists, the row group is pruned from processing; otherwise it is scanned as usual.
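
The direction of the decision matters for safety, so here is a small sketch of how verifier outcomes might map to skip decisions. The `Outcome` enum and helper names are hypothetical and do not mirror the API of Marabou or ML-QL.

```python
from enum import Enum
from typing import Dict, List


class Outcome(Enum):
    UNSAT = "unsat"      # proven: no input in the region satisfies the filter
    SAT = "sat"          # some input in the region satisfies the filter
    UNKNOWN = "unknown"  # timeout or inconclusive result


def should_skip(outcome: Outcome) -> bool:
    """Only a proof of unsatisfiability justifies skipping a row group."""
    return outcome is Outcome.UNSAT


def row_groups_to_scan(outcomes: Dict[int, Outcome]) -> List[int]:
    """Ids of the row groups that still need model inference."""
    return sorted(rg for rg, o in outcomes.items() if not should_skip(o))


print(row_groups_to_scan({0: Outcome.UNSAT, 1: Outcome.SAT,
                          2: Outcome.UNKNOWN, 3: Outcome.UNSAT}))
```

A SAT answer only says that some point of the rectangle qualifies, not that a stored row does, which is why tighter metadata such as a convex hull can turn additional SAT results into UNSAT ones.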

The authors release code and datasets at https://github.com/mlskip/mlskip. ML-QL currently supports only single-hidden-layer models and is slower than Marabou. Exact verification yields safe pruning; where approximation is used to keep verification practical, pruning stays safe but can miss some row groups that could have been skipped.

Overall, the methodology combines neural network verification, which formally reasons about the piecewise linear functions computed by ReLU networks, with simple but informative data statistics to enable practical pruning of data at query time.

Technical innovations

First use of neural network verification tools (Marabou, ML-QL) combined with simple min-max metadata to enable safe pruning of data row groups for ML filter predicates in databases.
Proposal of bounded convex hull metadata: a size-bounded, piecewise linear geometric abstraction over pairs of columns that tightens the input constraints for verification and fits within 45 bytes per row group and column pair.
Demonstration that Parquet's default min-max metadata, although crude, already enables meaningful pruning for ReLU networks at low selectivity, contradicting prior assumptions that ML filters cannot benefit from data skipping.
Integration of a geometric representation of ReLU networks (ML-QL) into verification queries expressed as SQL, leveraging database query optimization for verification at query time.

Datasets

TPC-H scale factor 1 — ~1 million rows total, public industry benchmark
TPC-DS scale factor 1 — ~1 million rows total, public industry benchmark

Baselines vs proposed

Marabou with min-max metadata: pruning effectiveness = 27.4% on filters ≤0.1% selectivity vs MLSkip bounded convex hull metadata: pruning effectiveness = 38.31%
ConvexHull metadata (unbounded): pruning effectiveness = 39.3% but metadata size up to 304 bytes, vs bounded convex hull metadata: max 45 bytes per row group and column pair
ML-QL single hidden layer verification time is ~2x slower than Marabou on min-max metadata; batching reduces this gap
End-to-end speedup over PyTorch in DuckDB: 1.07× when using bounded convex hull metadata and 2 hidden layer models

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.03946.

Fig 1: Row group metadata variants: (i) vanilla min-max

Limitations

Verification currently practical only with models up to 2 hidden layers on low-dimensional inputs (2-4 features); scalability to deeper networks or larger feature sets is unclear.
ML-QL verification tool is limited to single hidden layer networks and exhibits slow compilation times (~0.8s per model), limiting applicability.
Metadata construction for convex hull variants is about 2-3x slower than for min-max statistics, which could affect ingestion pipelines.
Verification is conservative/approximate; pruning is safe but may miss pruning opportunities (false negatives), especially for complex models and high selectivity filters.
Row group pruning effectiveness decreases notably as row group sizes increase (10K vs 1K), limiting benefit for coarser granularity storage layouts.
The approach currently applies primarily to numerical features; verification and metadata design for categorical or string ML filters remain future work.

Open questions / follow-ons

How to scale the verification techniques and metadata designs to deeper networks, larger input feature dimensions, or more complex ML models like Transformers?
Can exact verification methods replace the current approximations to improve pruning accuracy without excessive runtime cost?
How to optimize metadata selection dynamically based on query workload to minimize overhead yet maximize pruning?
How to extend data skipping and verification approaches for ML filters operating on high-cardinality categorical or unstructured data types?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, MLSkip suggests a way to accelerate ML-based predicate evaluation in database and query processing contexts by pruning irrelevant data upfront. The work targets database engines that use ML filters, not adversarial filtering or challenge-response techniques, but its underlying principle of combining lightweight metadata with formal model verification may inspire similar pruning or gating mechanisms in bot-detection pipelines. For example, in CAPTCHA or bot-detection analytics that apply ML scoring to user data, analogous metadata and verification strategies may reduce unnecessary model evaluations.

Furthermore, as ML filters grow prevalent in semantic query engines or AI-enriched security systems, MLSkip’s methods provide a framework to integrate safe, scalable skipping of irrelevant inputs before expensive model evaluation, improving throughput and latency. However, this work does not address adversarial robustness or manipulation directly, so practitioners should carefully consider threat models before applying similar verification-driven pruning in security-sensitive authentication flows.

Cite

bibtex

@article{arxiv2606_03946,
  title={MLSkip: Data Skipping for ML Filters via Lightweight Metadata},
  author={Mihail Stoian and Mark Gerarts and Pascal Ginter and Andreas Zimmerer and Jan Van den Bussche and Andreas Kipf},
  journal={arXiv preprint arXiv:2606.03946},
  year={2026},
  url={https://arxiv.org/abs/2606.03946}
}
