Discovering a well-conditioned analytic continuation problem via dictionary learning

Source: arXiv:2606.19205 · Published 2026-06-17 · By Thomas Chuna, Phil-Alexander Hofmann, Alexander Benedix-Robles, Tobias Dornheim

TL;DR

The paper addresses the longstanding analytic continuation (AC) problem encountered in quantum Monte Carlo (QMC) simulations, where one seeks to reconstruct dynamic frequency-dependent spectral functions from imaginary-time correlation data. This inverse problem is notoriously ill-conditioned and ill-posed, causing instability and ambiguity in extracted spectral features. The authors propose a novel Regularized Stochastic Optimization Method (RSOM) that formulates analytic continuation as a dictionary learning problem. RSOM jointly discovers a sparse dictionary of Gaussian kernels parameterized by centers and widths, which encodes the spectral function compactly. This transforms the AC inverse problem from a high-dimensional, ill-conditioned space to a low-dimensional, well-conditioned one.

The method bridges gaps between traditional stochastic and regularized AC approaches and parameterized functional models by learning an adaptive representation rather than fixing the kernel dictionary a priori. RSOM uses nested optimization: an inner non-negative least squares (NNLS) fitting of coefficients given a dictionary and an outer stochastic Metropolis-Hastings sampling to optimize the dictionary parameters and dimensionality. Evaluations on a diverse set of synthetic benchmarks from physics (ρ-meson spectrum, double Gaussian, skewed Gaussian, and Gaussian smeared sinusoid) demonstrate RSOM achieves competitive or superior recovery of spectral features compared to state-of-the-art entropic regularization methods. Finally, application to authentic QMC data from the finite-temperature electron gas validates RSOM’s stability and physical plausibility, producing a well-conditioned discovered inverse problem as measured by low condition numbers. This work reveals dictionary learning as a promising new paradigm for analytic continuation, potentially addressing ill-conditioning by data-driven sparse parameterization.

Key findings

RSOM reduces the dimension of the inverse problem from the usual N_ω (frequency grid size) to a sparse Nc-dimensional space, where Nc ≪ N_ω.
For synthetic ρ-meson spectra, RSOM’s recovered peak positions and shapes are closer to the true solution than conventional entropic methods (Bryan’s algorithm and dual Newton optimizers).
On the double Gaussian test problem, RSOM greatly outperforms entropic methods in reconstructing multi-peak structures despite the ill-conditioning induced by Laplace smearing.
RSOM performs comparably with and sometimes better than established entropic approaches on skewed Gaussian and Gaussian smeared sinusoid problems, showing appropriate flexibility.
For authentic QMC electron gas data, RSOM produces stable spectral estimates, especially at large wave numbers q where entropic least squares with strong priors exhibit instability and overfitting (mean reduced chi-square ~10^-1 vs ~10^-6 for entropic).
The discovered regression matrix R = A·K has significantly lower condition numbers (e.g., near 1.0 to few hundreds) compared to the typical ill-conditioned problem where κ(A)=∞, showing RSOM learns well-conditioned inverse problems (Table 1).
Dimensionality scans reveal parsimony (few Gaussians Nc) is sufficient to fit data, indicating that sparsity functions as an effective regularizer without explicit prior bias.
RSOM requires no training data unlike neural network parameterizations and avoids biases introduced by fixed Bayesian priors by fitting the dictionary directly to observed data.

Threat model

The adversary is the intrinsic ill-conditioning and noise amplification in the analytic continuation inverse problem, which corrupt solutions and destabilize inversion. The method assumes only access to noisy imaginary-time correlation data and the known kernel transform, and cannot query additional prior information. The adversary is thus an algorithmic and statistical challenge of ill-posed inversion rather than an active malicious attacker.

Methodology — deep read

Threat Model & Assumptions: The problem adversary corresponds to the ill-posed nature of the AC inverse problem where noise and limited data cause exponentially ill-conditioned inversion. The method assumes access only to imaginary-time correlation data F(τ) and the kernel matrix A(τ,ω). The adversary is the ill-conditioning itself, not a malicious actor.
Data: -- Synthetic data sets include the ρ-meson spectral function from lattice QCD, double Gaussian sum, skewed Gaussian, and Gaussian smeared sinusoid signals characterized by known parameters. Noiseless signals are generated by applying known Laplace or generalized transforms, then Gaussian noise is added scaled by signal amplitude. -- Authentic data is imaginary-time correlation functions from finite-temperature electron gas QMC simulations with no exact spectral solution.
Architecture / Algorithm: -- The spectral solution S(ω) is parameterized as a non-negative linear combination of Nc Gaussian kernels: S(ω) = Σ_{j=1}^{Nc} c_j K_j(ω; μ_j, σ_j). -- The forward transform matrix R = A·K maps spectral coefficients c to correlation data F. -- The inner problem optimizes coefficients c with non-negative least squares to minimize χ² goodness-of-fit. -- The outer problem optimizes the dictionary size Nc and kernel parameters θ = (μ_j, σ_j) via stochastic gradient-informed Metropolis-Hastings sampling. -- A dimensionality scan identifies the smallest Nc where fitting is sufficient, enforcing sparsity (L0 norm) without explicitly solving NP-hard sparsity optimizations.
Training Regime: -- No classical training epochs; instead, stochastic outer optimization seeks minima over dictionary parameters. -- NNLS solves inner optimization efficiently. -- Multiple runs sample model space robustly; averaging accepted solutions produces final estimates.
Evaluation Protocol: -- Performance is evaluated by comparing recovered spectral functions to true known solutions. -- Metrics include visual reconstruction quality, chi-square goodness-of-fit in τ-space, and condition number κ(R) of the learned inverse problem. -- Baselines include Bryan’s MEM approximate optimizer, exact dual Newton optimizer, Backus-Gilbert methods, and entropic least squares with Bayesian priors. -- Cross-validation against held-out noise realizations confirms robustness.
Reproducibility: -- Code and examples for RSOM are publicly available. -- Datasets include both public synthetic benchmarks and closed authentic QMC data previously published.

Example: For the ρ-meson test, a noiseless spectral function composed of a Lorentzian peak plus continuum is Laplace transformed into imaginary-time data. RSOM learns Nc ~ 1 Gaussian kernel(s) with optimized location and width; NNLS fits coefficients; recovered spectrum is compared to ground truth and to entropic methods, showing improved peak resolution and lower reconstruction error. Condition numbers of R remain low, confirming well-conditioning.

Technical innovations

Formulating analytic continuation as a data-driven dictionary learning problem, discovering an adaptive sparse kernel dictionary rather than fixing it a priori.
Using a nested stochastic optimization combining non-negative least squares inner solves with gradient-informed Metropolis-Hastings outer sampling to jointly learn dictionary dimensionality and kernel parameters.
Applying dimensionality scans to impose parsimony (L0-type sparsity) as an implicit regularizer, balancing accuracy and conditioning without explicit L1 or Bayesian priors.
Demonstrating that learned Gaussian kernel dictionaries lead to well-conditioned inverse problems with dramatically lower condition numbers than standard discretized Laplace transforms.

Datasets

ρ-meson spectrum synthetic test — 100 noisy samples — generated from known analytical Lorentzian + continuum function
Double Gaussian synthetic test — 100 noisy samples — generated from sum of two Gaussians
Skewed Gaussian synthetic test — 100 noisy samples — constructed from PDF*CDF skewed Gaussian
Gaussian smeared sinusoidal synthetic test — 100 noisy samples — sinusoidal function with Gaussian blur
Electron gas QMC ITCF data — authentic quantum Monte Carlo simulations — proprietary dataset from prior literature [53]

Baselines vs proposed

Bryan MEM approximate optimizer: reduced chi-square χ² = 1.25 (double Gaussian), RSOM χ² = 0.92
Dual Newton entropic optimizer: reduced chi-square χ² = 1.52 (double Gaussian), RSOM χ² = 0.92
Backus-Gilbert method: reduced chi-square χ² = 4.42 (Gaussian smearing), RSOM χ² = 0.92
Gaussian-BG variant: reduced chi-square χ² = 2.57 (Gaussian smearing), RSOM χ² = 0.92
On ρ-meson problem, RSOM recovers peak positions more accurately compared to entropic methods (Fig. 1)
Electron gas QMC data: RSOM average reduced chi-square ~0.1 vs ~10^-6 for entropic, indicating less overfitting but improved stability at large wavenumber q

Limitations

RSOM’s stochastic optimization may be computationally expensive compared to purely deterministic solvers, especially for high Nc scans.
The method currently is limited to Gaussian kernel parameterizations; other kernel types might better capture certain spectral features (e.g., heavy-tailed Lorentzians).
Skewed or highly asymmetric spectral functions require multiple Gaussians; fitting accuracy can degrade compared to symmetric peaks.
Evaluation focuses on moderate noise levels; performance under high noise or adversarial noise conditions is not evaluated.
No formal theoretical guarantees on global optimality or convergence speed of the stochastic outer loop are provided.
Application to larger-scale QMC data or real-time extensions remain to be demonstrated.

Open questions / follow-ons

How to scale RSOM’s dictionary learning to larger frequency grids (N_ω) and higher dimensional QMC problems in practice?
Can other kernel forms or learned nonlinear dictionaries improve reconstruction over Gaussian mixtures for more complex spectral shapes?
What are theoretical convergence properties and statistical guarantees of the nested stochastic-dictionary optimization used?
How robust is RSOM to different noise models, including non-Gaussian or correlated noise typical in physical experiments?

Why it matters for bot defense

Analytic continuation embodies a classic ill-posed inverse problem characterized by severe ill-conditioning and noise sensitivity, analogous to challenges in bot-detection tasks where model inversion suffers from ambiguity. RSOM’s approach to learn a sparse, adaptive dictionary to reduce the dimensionality and improve conditioning while fitting observed data directly is a compelling strategy that CAPTCHA and bot detection engineers can adapt. Specifically, the nested stochastic optimization that discovers interpretable low-dimensional encodings could inspire designs of feature extraction or representation learning modules with inherent regularization via parsimony.

Moreover, RSOM’s demonstration that jointly learning basis functions from data leads to well-conditioned inverse models suggests that CAPTCHA solutions might benefit from dictionary or manifold learning reformulations rather than relying on fixed heuristics or black-box parameterizations. While direct analytic continuation is a physics-specific problem, the general principle—casting ill-conditioned inversion as a dictionary discovery problem coupled with sparse coding—provides a valuable conceptual lens for bot-defense researchers tackling analogous ill-posed classification or anomaly detection tasks under adversarial conditions.

Cite

bibtex

@article{arxiv2606_19205,
  title={ Discovering a well-conditioned analytic continuation problem via dictionary learning },
  author={ Thomas Chuna and Phil-Alexander Hofmann and Alexander Benedix-Robles and Tobias Dornheim },
  journal={arXiv preprint arXiv:2606.19205},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.19205}
}

Discovering a well-conditioned analytic continuation problem via dictionary learning ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​