Valid Inference with Synthetic Data via Task Exchangeability

Source: arXiv:2606.13629 · Published 2026-06-11 · By Lezhi Tan, Tijana Zrnic

TL;DR

This paper addresses how to perform statistically valid inference when only synthetic data are available for a target scientific task. Synthetic data, such as LLM-generated samples or other generative-model outputs, are increasingly used in the social sciences, AI evaluation, and biology. They can be biased, noisy, or otherwise misaligned with the real data distribution, so classical inference procedures that treat synthetic data as real may no longer be valid. The authors introduce a statistical framework built on "task exchangeability": historical tasks that have both real and synthetic data are used to calibrate the discrepancy between synthetic-data and real-data inference on a new, unseen task. Assuming the target task is exchangeable with these historical tasks in a formal sense, the framework uses the historical real–synthetic gaps to widen naive synthetic-data confidence intervals so that they carry rigorous coverage guarantees. The paper develops algorithms for scalar and multidimensional estimands, along with extensions beyond exact exchangeability and to finite-sample estimands.

Empirically, the approach is demonstrated on data from the American National Election Studies (ANES) survey. There, the naive synthetic-only intervals severely underestimate uncertainty and fail to cover the true values, while the task-exchangeability intervals attain the prescribed coverage by accounting for synthetic bias estimated from previous, related survey waves. The framework applies broadly to scientific domains that rely on synthetic data and fills an inferential gap in the growing practice of synthetic-data-driven research.

Key findings

Under task exchangeability, confidence intervals for target estimands constructed by inflating synthetic-data intervals using historical real–synthetic gaps achieve coverage ≥ 1 - (α1 + α2 + α3) (Theorem 1).
Empirical evaluation on 33 ANES feeling-thermometer tasks shows that naive synthetic-only intervals are too narrow and substantially biased, while task-exchangeability intervals attain coverage at the nominal level (Fig. 1).
A weighted variant allows relaxing exact exchangeability assumptions, with coverage degrading gracefully bounded by total variation distances between weighted historical and target tasks (Theorem 2).
Algorithms extend to multidimensional inference targets via convex confidence regions and nested families, still retaining rigorous finite-sample guarantees (Theorem 3).
Inference targeting the finite-sample task estimand (not population-level) admits a simpler exact coverage procedure requiring only observed real and synthetic data differences from historical tasks (Algorithm 4, Theorem 4).
Bias-corrected point estimators, computed by adding the average historical real–synthetic estimation gap to the synthetic estimate, provide unbiased estimates of the finite-sample target under task exchangeability.
The method requires no real data from the target task but relies on having multiple historical tasks with paired real and synthetic data to learn calibration corrections.
Coverage guarantees degrade only additively with three error levels α1, α2, α3 associated with synthetic-data inference, gap estimation, and quantile calibration, respectively.
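The finite-sample procedure and the bias-corrected estimator listed above can be illustrated with a minimal sketch. It is one plausible reading, not the paper's Algorithm 4: it assumes a scalar mean estimand, uses the absolute observed gap as the calibration score, and takes a conformal-style order statistic as the quantile. The function names `conformal_quantile` and `finite_sample_interval` are hypothetical.

```python
import math
from statistics import mean


def conformal_quantile(scores, alpha):
    """Return the k-th smallest score, k = ceil((1 - alpha) * (T + 1)).

    If k exceeds the number of historical tasks T, no finite bound is
    available at this level and the quantile is +inf.
    """
    T = len(scores)
    k = math.ceil((1 - alpha) * (T + 1))
    if k > T:
        return math.inf
    return sorted(scores)[k - 1]


def finite_sample_interval(hist_real, hist_synth, target_synth,
                           alpha=0.1, estimator=mean):
    """Interval and bias-corrected point estimate for the finite-sample target.

    Assumption (not from the paper): the score for task j is the absolute
    difference between the real-data and synthetic-data estimates.
    """
    # Observed real-minus-synthetic gaps on the historical tasks.
    gaps = [estimator(r) - estimator(s) for r, s in zip(hist_real, hist_synth)]
    q = conformal_quantile([abs(g) for g in gaps], alpha)
    centre = estimator(target_synth)
    # Shift the synthetic estimate by the average historical gap.
    bias_corrected = centre + mean(gaps)
    return (centre - q, centre + q), bias_corrected
```

With few historical tasks the quantile is infinite at small α, which matches the stated requirement for multiple historical tasks: with T tasks, a finite interval needs α ≥ 1/(T + 1) in this sketch.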

Methodology — deep read

Threat model & assumptions: No adversary is modeled, since this is a statistical inference framework rather than a security mechanism. The core assumption is that the target task and the historical tasks are drawn exchangeably from a meta-distribution over tasks and datasets (Assumption 1). Each task is defined by a functional θ and an underlying data distribution P, and its estimand is θ(P). Each historical task j has a real dataset Sj and a synthetic dataset ˜Sj. All synthetic datasets come from the same black-box synthetic data generator G, instantiated per task. The key difficulty is that no real data are available for the target task, only synthetic data. The goal is to infer the target estimand θ∗, defined on the real distribution P∗.
Data: The methodology requires T historical tasks, indexed j = 1, ..., T, each with real data Sj (size nj) and synthetic data ˜Sj (size N). For the target task T∗, indexed T+1, only synthetic data ˜S are drawn. The authors demonstrate the method on 33 tasks from the ANES survey data, using GPT-based synthetic respondents to generate the synthetic data.
Architecture / algorithm: The main algorithm (Algorithm 1) proceeds as follows:

Draw synthetic samples for the current task and each historical task.
Compute a confidence interval [˜L, ˜U] for the synthetic-data estimand ˜θ, that is, the target functional evaluated on the synthetic distribution.
For each historical task j, compute confidence intervals [ˆ∆Lj, ˆ∆Uj] for the real–synthetic gap ∆j = θj(Pj) − θj(˜Pj) using paired real and synthetic data.
Use the empirical distribution of the historical gap intervals to compute a quantile ˆ∆ that, with high probability, bounds the magnitude of the unknown gap of the target task (task T+1).
Expand the synthetic confidence interval by ˆ∆ to obtain a valid interval for θ∗. The synthetic-data interval and the gap intervals come from classical inference procedures (Assumption 2): single-distribution confidence intervals and two-sample gap confidence intervals, respectively.
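The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' Algorithm 1: the estimand is a mean, both kinds of interval use a normal approximation, the score for each historical task is the largest absolute gap consistent with its gap interval, and the quantile is a conformal-style order statistic. All function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def mean_ci(x, alpha):
    """Normal-approximation confidence interval for the mean of one sample."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(variance(x) / len(x))
    m = mean(x)
    return m - half, m + half


def gap_ci(real, synth, alpha):
    """Normal-approximation interval for mean(real) - mean(synth)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(variance(real) / len(real) + variance(synth) / len(synth))
    d = mean(real) - mean(synth)
    return d - z * se, d + z * se


def calibrated_interval(hist_real, hist_synth, target_synth,
                        a1=0.05, a2=0.05, a3=0.05):
    """Synthetic-data interval inflated by a quantile of historical gap bounds."""
    # Interval for the synthetic-data estimand of the target task.
    lo, hi = mean_ci(target_synth, a1)
    # Score per historical task: worst-case |gap| inside its gap interval
    # (an assumed choice of score, made for illustration).
    scores = []
    for r, s in zip(hist_real, hist_synth):
        d_lo, d_hi = gap_ci(r, s, a2)
        scores.append(max(abs(d_lo), abs(d_hi)))
    # Conservative empirical quantile; +inf when too few historical tasks.
    T = len(scores)
    k = math.ceil((1 - a3) * (T + 1))
    delta_hat = sorted(scores)[k - 1] if k <= T else math.inf
    # Expand the synthetic interval on both sides.
    return lo - delta_hat, hi + delta_hat
```

The three levels `a1`, `a2`, and `a3` mirror the error budget α1, α2, α3 described in the text. How the paper allocates α2 across historical tasks is not stated here, so the per-task use of `a2` is an assumption.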

Training regime: Not applicable as this is a statistical inference framework, not a learned model.
Evaluation protocol: The coverage is theoretically guaranteed under exchangeability assumptions, with error budget split across synthetic-data inference (α1), gap estimation (α2), and quantile calibration (α3). Empirical evaluation uses held-out ANES survey waves as target tasks and preceding waves as historical tasks. The authors compare naive synthetic-only intervals versus task-exchangeability intervals in coverage of true (real) estimands.
Reproducibility: The authors provide detailed algorithms and theoretical proofs. The ANES data is public. Synthetic samples are generated by prompting GPT models. Code or weights are not explicitly mentioned, so full exact reproducibility depends on access to the synthetic data generator and historical data.

End-to-end example: Suppose we want to infer the average opinion θ∗ in a survey subgroup for which we have no real respondents, only synthetic ones generated by a model. We compute ˜θ from the synthetic data and form a standard confidence interval [˜L, ˜U]. From previous survey waves (the historical tasks), we compute confidence intervals [ˆ∆Lj, ˆ∆Uj] for the gaps ∆j between the real and synthetic estimands. A conservative empirical quantile of these historical gaps then bounds the unknown gap of the target task (task T+1). Expanding [˜L, ˜U] by this quantile yields an interval that covers the true average opinion with high probability under task exchangeability. In the ANES experiments, this expansion accounts for synthetic-data bias and variance that the naive interval ignores.
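The reason the expansion restores coverage can be written as a short union-bound argument. This is a sketch, assuming symmetric inflation by ˆ∆ and assuming the gap-calibration step fails with probability at most α2 + α3, which is how the error budget reads; the paper's proof of Theorem 1 may be organized differently. The symbol \tilde P^* for the target task's synthetic distribution is introduced here for illustration.

```latex
% Decompose the target estimand into its synthetic counterpart plus the gap.
\theta^* \;=\; \tilde\theta + \Delta_{T+1},
\qquad
\Delta_{T+1} \;=\; \theta^*(P^*) - \theta^*(\tilde P^*).

% If both the synthetic interval and the gap bound hold, the expanded
% interval contains the target.
\{\tilde\theta \in [\tilde L, \tilde U]\} \cap \{|\Delta_{T+1}| \le \hat\Delta\}
\;\subseteq\;
\{\theta^* \in [\tilde L - \hat\Delta,\; \tilde U + \hat\Delta]\}.

% Union bound over the two failure events.
\Pr\!\big(\theta^* \notin [\tilde L - \hat\Delta,\; \tilde U + \hat\Delta]\big)
\;\le\;
\underbrace{\Pr\!\big(\tilde\theta \notin [\tilde L, \tilde U]\big)}_{\le\, \alpha_1}
\;+\;
\underbrace{\Pr\!\big(|\Delta_{T+1}| > \hat\Delta\big)}_{\le\, \alpha_2 + \alpha_3}
\;\le\; \alpha_1 + \alpha_2 + \alpha_3 .
```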

Technical innovations

Introduction of task exchangeability, a formal statistical assumption relating a current target task to historical tasks to enable valid calibration of synthetic data errors without real target data.
Algorithmic framework combining synthetic-data inference with empirical real-synthetic gap calibration from historical tasks to form confidence intervals with formal coverage guarantees.
Extension to approximate exchangeability via weighted quantile calibration with coverage degradation explicitly bounded by total variation distances.
Multidimensional extension using nested convex confidence regions and Minkowski sums to handle vector-valued target parameters with rigorous guarantees.
Simplified exact coverage procedure targeting finite-sample estimands using only observed gap estimates without requiring confidence intervals on gaps.
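For the approximate-exchangeability extension, a weighted quantile replaces the plain empirical quantile. The sketch below shows a standard weighted conformal quantile, offered as an illustration rather than the paper's exact weighting scheme: the historical-task weights and `target_weight` are assumed inputs, and the target task's own weight is placed on +∞. The function name is hypothetical.

```python
import math


def weighted_conformal_quantile(scores, weights, target_weight, alpha):
    """Smallest score whose normalised cumulative weight reaches 1 - alpha.

    The target task's weight sits on +inf, so the result is +inf when the
    historical tasks alone cannot reach the 1 - alpha level.
    """
    total = sum(weights) + target_weight
    cum = 0.0
    # Walk through the scores in increasing order, accumulating weight.
    for score, w in sorted(zip(scores, weights)):
        cum += w / total
        if cum >= 1 - alpha:
            return score
    return math.inf
```

With equal weights this reduces to the ⌈(1 − α)(T + 1)⌉-th smallest score, the unweighted case. Upweighting historical tasks judged closer to the target moves the quantile toward their scores.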

Datasets

American National Election Studies (ANES) feeling thermometer tasks — 33 historical tasks from prior survey waves — public data source

Baselines vs proposed

Naive synthetic-only intervals: coverage well below the nominal level, with intervals that are too narrow. Proposed task-exchangeability intervals: valid coverage at the prescribed 85% level (Fig. 1).

Limitations

Strong dependence on the task exchangeability assumption, which may not hold perfectly in real-world settings and requires carefully chosen historical tasks.
Requires access to multiple historical tasks with both real and synthetic data; in domains lacking sufficient historical tasks, the method may fail or be overly conservative.
The synthetic data distribution ˜P is known only through samples; if the generator's distribution shifts over time, the exchangeability assumption degrades.
No explicit adversarial robustness analysis; framework assumes synthetic data does not adversarially manipulate inference.
The method relies on classical confidence intervals which may be less accurate for complex or high-dimensional estimands.
Reproducibility depends on access and stability of the synthetic data generator G and historical datasets, which may be proprietary or restricted.

Open questions / follow-ons

How to automatically select or weight historical tasks to optimize relevance under unknown or partial exchangeability?
Can task exchangeability be empirically tested or quantified before applying the method?
Extension to dynamic or evolving synthetic data generators that change distribution over time.
Incorporating private or restricted historical data scenarios where full real data may not be accessible.

Why it matters for bot defense

While not directly about bot-detection or CAPTCHAs, this work provides a rigorous framework for valid statistical inference when only synthetic data are available for a target task, a scenario increasingly relevant in AI evaluation contexts where synthetic samples or LLM-generated assessments are used. Bot-defense engineers can leverage the task exchangeability principle to calibrate and quantify uncertainty in metrics derived from synthetic data outputs of AI models. This approach helps avoid naive overconfidence in synthetic-only evaluations, enabling more reliable and principled interpretation of synthetic data within automated evaluation pipelines. Its extensions to approximate exchangeability and multidimensional outputs are particularly pertinent for complex evaluation metrics common in AI and bot-defense research. Understanding and applying these principles can improve the scientific rigor of synthetic-data-based assessments, including those used to benchmark or monitor automated systems.

Cite

@article{arxiv2606_13629,
  title={Valid Inference with Synthetic Data via Task Exchangeability},
  author={Lezhi Tan and Tijana Zrnic},
  journal={arXiv preprint arXiv:2606.13629},
  year={2026},
  url={https://arxiv.org/abs/2606.13629}
}
