Don't Get Your Kroneckers in a Twist: Gaussian Processes on High-Dimensional Incomplete Grids
Source: arXiv:2605.08036 · Published 2026-05-08 · By Mads Greisen Højlund, August Smart Lykke-Møller, Henry Moss, Ove Christiansen
TL;DR
CUTS-GPR addresses the long-standing dual bottleneck of Gaussian process regression (GPR): the cubic cost in the number of training points N and the exponential cost in dimensionality D. Classical exact GPR costs O(N³) in compute and O(N²) in memory, and even Kronecker-structured methods that achieve near-linear scaling in N still require complete Cartesian product grids whose size grows exponentially with D. CUTS-GPR breaks both barriers simultaneously by pairing an additive kernel with a structured incomplete grid — specifically one whose mode combination range is 'closed under taking subsets' (CUTS), a property that guarantees the multi-index set is also 'closed under decrements' (CUD). This CUD property, formalized by Holzmüller and Pflüger, is the algebraic key that lets the authors factor the chopped kernel matrix bK = bL * cM * bU, where each factor is either extremely sparse or admits a fast recursive multiplication, yielding a kernel matrix-vector product (MVP) with complexity O(n^α * N̂) — linear in N̂ for fixed cut level α and grid density n, and polynomial O(D^α) in dimensionality.
The methodological core is a factorization of the complete-grid additive kernel matrix K into a product L * M * U (where L and U are lower/upper triangular Kronecker products built from the 1D lower/upper bidiagonal matrices, and M absorbs the base kernel evaluations). Because L and U are triangular and the incomplete-grid index set is CUD, Theorem 4.2 from Holzmüller and Pflüger guarantees that chopping commutes with the triangular factors. The resulting chopped factors bL, cM, bU are each computable in O(α * N̂), O(n * α * N̂), and O(α * N̂) respectively, giving an overall MVP in O(n * α * N̂). An exact GPR workflow — Conjugate Gradient solves for weights, stochastic trace estimation for the log-determinant, Adam-based hyperparameter optimization via the marginal log-likelihood — is then built on top of this fast MVP, with no approximation to the kernel matrix itself.
The method is validated on 24-dimensional potential energy surfaces (PESs) of ten organic molecules, using N̂ = 447,265 training points and 1,604,576 test points. Full GPR including hyperparameter optimization completes in 1.6–3.6 wall-clock hours on 36 CPU cores. Compared to the only available approximate competitor (SVGP with OAK kernel in GPyTorch/BoTorch), CUTS-GPR achieves lower MAX, RMSE, and MAE on every individual molecule while being comparably fast or faster, which the authors attribute to treating the kernel exactly rather than through variational inducing-point approximation.
Key findings
- Kernel MVP complexity is empirically O(D^2.06–2.08) for α=2 and O(D^4.21–4.29) for α=4 (theoretical asymptote O(D^α)), verified over D ranging from 16 to 16,384 with N̂ reaching several billion, completing in under 3 minutes on a single CPU core (Fig 2a).
- N̂-scaling of the MVP is empirically O(N̂^1.04) — essentially linear — confirmed by power-law fit for α=2, n=10 across N̂ from ~10³ to ~10⁹ (Fig 2b).
- Full GPR on 24-dimensional PES datasets (N̂=447,265 training, N̂*=1,604,576 test, α=3, OAK kernel ω=3) completes hyperparameter optimization in 1.62–3.58 h wall time (mean 2.16 h) on 36 cores of an AMD EPYC 9565, with predictions for all test points taking under 2 minutes.
- CUTS-GPR outperforms SVGP (up to 2,500 inducing points, ω=2 and ω=3) on all three error metrics (range-normalized MAX, RMSE, MAE) averaged over ten molecules; the gap is largest for MAX, where SVGP shows clear diminishing returns with additional inducing points (Fig 3).
- Stochastic trace estimates for most hyperparameter gradients achieve a standard error of the mean (SEM) below 1% using only 35 reparameterized Rademacher probe vectors and preconditioner rank k=10; exceptions are order variances σ²_0 (SEM 518%) and σ²_1 (SEM 37.6%), which the authors argue do not impair convergence due to Adam's robustness.
- Hyperparameter optimization convergence requires ~100 Adam iterations for 9 of 10 molecules; thioacetone is the outlier at 176 iterations (Fig 2c).
- The incomplete grid with maximum cut level α and n grid points per dimension has size N̂ = Σ_{k=0}^{α} C(D,k)(n-1)^k ≈ O(D^α * n^α / α!), growing polynomially rather than exponentially in D, enabling practical sampling in D=24 with α=3 at N̂≈447K.
- The quadratic term over all D length-scale hyperparameters is computed in O(D^α) rather than the naïve O(D^{α+1}) by reusing the factorized form of bK (Appendix G.5), maintaining consistent polynomial D-scaling throughout the entire training loop.
Methodology — deep read
THREAT MODEL AND ASSUMPTIONS: This is not a security paper; the 'adversary' is computational intractability. The key assumptions are: (1) training inputs lie on an incomplete grid whose mode-combination range (MCR) is closed under taking subsets (CUTS), equivalently the multi-index set is closed under decrements (CUD); (2) the kernel is additive in the form of eq. (9) — a weighted sum of products of 1D base kernels over subsets of dimensions up to interaction order ω; (3) noise is Gaussian and homoskedastic (fixed σ²=10⁻³ in the PES experiments, not optimized). The method is numerically exact under these assumptions — no low-rank approximation of the kernel matrix is made.
DATA: The PES application uses 24-dimensional potential energy surfaces of 10 small organic molecules (butadiene, DMSO, ethylamine, ethylene glycol, nitroethane, propanal, pyrazine, pyrrole, thioacetone, vinylformamide; listed in Table M1). Grid points per dimension are given in Table M2. The coarse grid (n=7 points per dimension) defines the training set via a cut-level-3 incomplete grid, yielding N̂=447,265 training points. The fine grid (n=11) defines the universe; training points are removed, leaving N̂*=1,604,576 test points. The datasets are released at an anonymized repository (https://anonymous.4open.science/r/24D_3M_PES_DATA-F32E). Both inputs and outputs are standardized to zero mean and unit variance before training. No cross-validation is reported; a single held-out test set is used for each molecule.
ARCHITECTURE / ALGORITHM — THE CORE CONTRIBUTION: The kernel is an orthogonal additive kernel (OAK, following Lu et al. 2022) with maximum interaction order ω=α=3 and 1D base kernels each parameterized by a length scale ℓ^(m). The OAK centering step (described in Appendix M.2) orthogonalizes the 1D kernels with respect to the marginal measure of the training data, ensuring model identifiability. The key novel component is the factorization bK = bL * cM * bU, derived in Section 5.1. Starting from the complete-grid additive kernel matrix K = L * M * U (eq. 18), where L = ⊗_m L^(m) and U = ⊗_m U^(m) are Kronecker products of lower/upper bidiagonal matrices arising from factoring the all-ones matrix J^(m) = L^(m) R^(m) U^(m), and M is the 'middle' factor absorbing base kernel evaluations, the authors apply Theorem 4.2 (Holzmüller & Pflüger 2021) three times: since L is lower triangular and U is upper triangular, and the index set is CUD, chopping (row/column selection to the incomplete grid) commutes with these factors. The resulting bL and bU are extremely sparse — each factor touches only a small subvector of length proportional to α — giving MVP cost O(α * N̂). The chopped middle factor cM is a sum of terms each containing a projection operator R[¬m] that forces most entries to zero; the resulting sparsity gives MVP cost O(n * α * N̂). The total MVP is O(n * α * N̂), which for fixed n and α is O(N̂) = O(D^α). An end-to-end concrete example: for thioacetone (D=24, α=3, n=7), N̂=447,265 and the MVP involves applying bU (cost ~O(3 * 447265)), then cM (cost ~O(7 * 3 * 447265)), then bL (cost ~O(3 * 447265)). This single MVP is then the building block for the conjugate gradient (CG) solver used to compute weights α = C⁻¹y and for stochastic log-determinant and trace estimation.
TRAINING REGIME: Hyperparameters are the order variances {σ²_k, k=0..3}, per-dimension length scales {ℓ^(m), m=1..24}, and noise σ² (fixed). Optimization uses Adam (Kingma & Ba 2015) with learning rate 0.1 and gradient norm convergence threshold ε_opt=10⁻³. Initial values are specified in Table M3 (not reproduced in the truncated text). The marginal log-likelihood gradient (eq. 4) requires a linear solve (CG with relative residual tolerance ε_CG=10⁻³, preconditioner rank k=10), a log-determinant estimate (35 reparameterized Rademacher probe vectors via stochastic trace estimation), and the quadratic term (computed via the factorized MVP to maintain O(D^α) D-cost). All CUTS-GPR runs use 36 cores of an AMD EPYC 9565 in standard double precision. No random seed strategy is documented.
EVALUATION PROTOCOL: The primary metrics are range-normalized MAX (maximum absolute error), RMSE, and MAE, averaged over the ten molecules. Figure 3 plots MAX and RMSE vs. number of SVGP inducing points (500, 1000, 1500, 2000, 2500) for SVGP with ω=2 and ω=3, compared to a single CUTS-GPR point (ω=3). Per-molecule breakdowns are in Appendix M.10. Learning curves (gradient norm vs. Adam iteration) are shown in Fig 2c. There is no statistical significance testing, no cross-validation, no distribution-shift evaluation (train/test are on the same functional but at different grid densities), and no ablation over α, ω, or n beyond the complexity benchmarks. The complexity benchmarks (Figs 2a, 2b) sweep D from 16 to 16,384 and N̂ from ~10³ to ~10⁹ with power-law fits.
REPRODUCIBILITY: PES datasets are released at the anonymized repository above. The paper is a preprint (arXiv:2605.08036v1, May 2026). Code is not yet explicitly described as released in the truncated text, though Appendix K describes the code structure. The SVGP baseline uses publicly available GPyTorch/BoTorch with a locally modified version to support ω=3 (not released). Full appendices (L, M) contain raw timings and detailed experimental settings.
Technical innovations
- The core contribution is a numerically exact kernel MVP with O(n^α * N̂) complexity for additive kernels on CUD incomplete grids, achieved by factorizing the complete-grid kernel matrix K = LMU and proving (via Holzmüller & Pflüger's Theorem 4.2) that chopping to the incomplete grid commutes with the triangular factors, yielding bK = bL * cM * bU where each factor is highly structured-sparse; prior Kronecker-based exact methods (Saatci 2011, Wilson et al. 2014, Gilboa et al. 2015) required complete grids with exponential N-scaling in D.
- Introduction of the CUTS (closed under taking subsets) / CUD (closed under decrements) incomplete grid framework for GPR, generalizing sparse-grid ideas from vibrational quantum chemistry (Holzmüller & Pflüger 2021) to machine learning regression, with grid size scaling as O(D^α * n^α) rather than O(n^D).
- A O(D^α)-cost computation of all D length-scale quadratic gradient terms (eq. 4) by exploiting the factorized bK form to avoid the naïve O(D^{α+1}) dot-product accumulation (Appendix G.5), maintaining consistent polynomial D-scaling throughout hyperparameter optimization.
- Practical demonstration that OAK centering (orthogonalization of 1D base kernels) and stochastic trace estimation with only 35 Rademacher probe vectors suffice for stable Adam convergence in exact 24-dimensional GPR, extending the OAK framework of Lu et al. (ICML 2022) from approximate (SVGP) to exact inference on structured grids.
- Identification of the 'accidental missing data' extension: a second layer of chopping (adding a ΓΓ̃ wrapper around the incomplete-grid MVP) can handle data points missing from the CUTS grid without restructuring the core algorithm, broadening applicability beyond strict grid datasets.
Datasets
- 24D PES of 10 organic molecules (butadiene, DMSO, ethylamine, ethylene glycol, nitroethane, propanal, pyrazine, pyrrole, thioacetone, vinylformamide) — training N̂=447,265 (coarse grid n=7, α=3), test N̂*=1,604,576 (fine grid n=11 minus training points) — released at https://anonymous.4open.science/r/24D_3M_PES_DATA-F32E (non-public at submission, anonymized repository)
- Synthetic scalability benchmarks — up to ~10⁹ data points, D up to 16,384, α=2,3,4, n=5,10,20 — generated algorithmically by the authors, no external source
Baselines vs proposed
- SVGP ω=2 (GPyTorch/BoTorch, 2500 inducing points): range-normalized MAX ≈ 0.8 (read from Fig 3) vs CUTS-GPR ω=3: range-normalized MAX ≈ 0.25 (approximate, averaged over 10 molecules)
- SVGP ω=3 (locally modified BoTorch, 2500 inducing points): range-normalized MAX ≈ 0.65 (approximate, Fig 3) vs CUTS-GPR ω=3: range-normalized MAX ≈ 0.25
- SVGP ω=2 (2500 inducing points): range-normalized RMSE ≈ 0.010 (approximate, Fig 3) vs CUTS-GPR ω=3: range-normalized RMSE ≈ 0.003
- SVGP ω=3 (2500 inducing points): range-normalized RMSE ≈ 0.008 (approximate, Fig 3) vs CUTS-GPR ω=3: range-normalized RMSE ≈ 0.003
- SVGP fastest runs (ω=2): wall time ≈ 40 min vs CUTS-GPR (ω=3): wall time 1.62–3.58 h (mean 2.16 h) — note CUTS-GPR is exact while SVGP is approximate
- SVGP ω=3: wall time > CUTS-GPR ω=3 in all cases (Appendix M.8, exact values not in truncated text)
Limitations
- Strict data-format requirement: training data must lie on a CUD incomplete grid; arbitrary scattered data cannot be used directly, though the authors acknowledge interpolation onto a grid (KISS-GP style) as a potential workaround with unclear accuracy implications.
- Predictive variance computation is identified by the authors as a current bottleneck for large test sets; only predictive means are computed in the PES experiments, with variance deferred to future work (planned LOVE integration).
- No adversarial or out-of-distribution evaluation: train and test sets differ only in grid density (n=7 vs n=11) for the same molecules; generalization to unseen molecular geometries or different functional forms is not assessed.
- Single application domain: all large-scale experiments are on computational chemistry PES data, which has a very regular grid structure by construction. Applicability to other high-dimensional scientific or ML datasets with less natural grid structure is asserted but not demonstrated.
- No comparison against other exact or approximate high-dimensional GP methods besides SVGP (e.g., SKIP, Ishida et al.'s hierarchical additive GP for complete grids, or deep kernel learning); the authors note that few methods can handle additive kernels with large N and D simultaneously, but the baseline set is narrow.
- Statistical rigor is limited: no confidence intervals or significance tests on RMSE/MAX comparisons across the 10 molecules; no ablation over key hyperparameters (cut level α, interaction order ω, preconditioner rank k, number of probe vectors) in the PES experiments.
- The empirical D-complexity exponents (e.g., 4.21–4.29 for α=4) exceed the theoretical asymptote of 4, and the authors acknowledge the asymptotic regime is not fully reached for α=4, leaving uncertainty about practical scaling at higher α values.
Open questions / follow-ons
- How does CUTS-GPR perform when the true function has significant high-order interaction structure (ω > α), and what is the modeling bias introduced by truncating at a given cut level α in such cases?
- Can the CUD-based MVP be extended to handle non-grid (scattered) data efficiently via adaptive interpolation onto an incomplete grid, and what approximation error does such interpolation introduce compared to exact SVGP?
- What is the practical upper limit on D and α before the O(D^α) polynomial scaling becomes the binding constraint rather than the O(N̂) linear term, and how does this interact with the number of hyperparameters (D length scales + α+1 order variances)?
- Can LOVE-based variance estimation (planned future work) be integrated without breaking the CUD structure, and what preconditioning strategies are needed to keep variance estimates accurate at the scale of millions of test points?
Why it matters for bot defense
At first glance, this paper is purely a computational mathematics contribution to GP scaling and has no direct application to bot defense or CAPTCHA. However, there are two indirect angles worth noting for a bot-defense ML engineer. First, CUTS-GPR enables exact Bayesian uncertainty quantification at scale in high-dimensional structured spaces. Bot-defense systems that rely on behavioral feature vectors (mouse dynamics, keystroke timing, sensor telemetry) sampled on quasi-regular grids — for example, binned time-series features across many behavioral channels — could in principle benefit from exact GP posteriors for anomaly scoring with calibrated confidence intervals, something that current approximate methods (sparse GPs, deep ensembles) handle poorly. If behavioral feature spaces are naturally low-interaction (i.e., most of the signal is in 1D and 2D marginals), an additive kernel with small ω would be well-specified and CUTS-GPR would be applicable.
Second, and more practically, CUTS-GPR's contribution to the Bayesian optimization (BO) literature is relevant: BO is increasingly used for hyperparameter tuning of CAPTCHA challenge difficulty and bot-detection threshold calibration in high-dimensional config spaces. Exact GP surrogates in BO yield better acquisition function quality than SVGP approximations, and CUTS-GPR's ability to handle D=24 exactly at N≈450K could make it competitive for BO in moderately high-dimensional tuning problems where evaluations are cheap enough to generate grid-structured data. A bot-defense engineer should treat this primarily as infrastructure-layer research — it does not change detection algorithms but could improve the reliability of any GP-based surrogate model used in the system, especially in offline analysis of behavioral feature spaces.
Cite
@article{arxiv2605_08036,
title={ Don't Get Your Kroneckers in a Twist: Gaussian Processes on High-Dimensional Incomplete Grids },
author={ Mads Greisen Højlund and August Smart Lykke-Møller and Henry Moss and Ove Christiansen },
journal={arXiv preprint arXiv:2605.08036},
year={ 2026 },
url={https://arxiv.org/abs/2605.08036}
}