Geometry-Aware Dataset Condensation for Diffusion Model Training
Source: arXiv:2606.05883 · Published 2026-06-04 · By Xiao Cui, Yulei Qin, Mo Zhu, Wengang Zhou, Hongsheng Li, Houqiang Li
TL;DR
This paper addresses the problem of dataset condensation specifically tailored for diffusion model training. Existing condensation approaches either synthesize low-fidelity samples that degrade diffusion training or rely on subset selection methods that fail to preserve the intrinsic geometric and distributional structure needed for effective likelihood-based diffusion objectives. To overcome these challenges, the authors reinterpret real subset selection as a geometry-aware distribution alignment task, employing one-sided partial optimal transport (POT) to selectively align a small subset with the full data distribution while allowing unmatched mass in low-density regions. Complementary lightweight feature-statistics and semantic consistency regularizers are incorporated to ensure distributional fidelity. A novel two-stage discrete optimization strategy—a greedy selection followed by swap-based refinement—efficiently optimizes this objective.
Extensive experiments on ImageNet subsets across three data budgets (10K, 50K, 100K) and multiple diffusion model variants (DiT-L/2, SiT-L/2) demonstrate that their method consistently achieves superior generative performance, measured by FID, Inception Score, Precision, and Recall, compared to state-of-the-art real subset selection baselines including D2C. The approach generalizes robustly across image resolutions (256×256 and 512×512) and diffusion formulations, evidencing geometry-aware subset selection as an effective, data-centric means to accelerate and improve diffusion model training. Open-source code accompanies the work.
Key findings
- The proposed one-sided partial optimal transport (POT) formulation enables selective alignment of compact subsets to the core high-density regions of the full data distribution, improving geometric fidelity in condensed data.
- Using geometry-aware POT plus mean-variance and confidence regularization, the method reduces FID by about 0.8 points compared to D2C on ImageNet 256×256 with a 0.8% data budget (3.43 vs 4.20) after 100K training iterations (Table 1 & 2).
- At a 4.0% data budget on ImageNet 256×256, their method achieves an FID of 11.01, improving on D2C’s 14.81 and other baselines (Table 3), so the advantage over baselines holds at a larger budget.
- The two-stage optimization (greedy selection + swap refinement) corrects early suboptimal choices, yielding subsets with superior distributional coverage and diffusion training performance.
- Their approach generalizes across diffusion variants, improving FID from 15.01 (D2C) to 8.83 on SiT-L/2 at 8.0% budget and 100K iterations (Table 5 & 6).
- The method is robust to higher image resolutions, reducing FID from 14.8 (D2C) to 6.17 on ImageNet 512×512 at 0.8% budget (Table 4).
- Distributional fidelity regularizers stabilize condensation by preserving global feature mean and variance and semantic reliability, critical for diffusion likelihood objectives.
- Experiments fixing training iterations show that geometry-aware condensation accelerates convergence by better focusing training on a consistent core manifold.
Threat model
n/a — This paper addresses data efficiency and condensation for diffusion model training rather than security adversaries. The 'threat' can be interpreted as the challenge of preserving diffusion training fidelity despite aggressive dataset condensation under limited sample budgets.
Methodology — deep read
Threat model and assumptions: The paper targets data efficiency for diffusion model training. There is no adversary in the security sense; the difficulty is preserving a faithful data distribution under extreme subsampling. The goal is to select a small subset that preserves the geometry of the full dataset in a learned representation space, for diffusion training governed by likelihood objectives.
Data provenance, size, labels, splits: Experiments use ImageNet-1K (Russakovsky et al., 2015) with subsets of 10K (0.8%), 50K (4%), and 100K (8%) samples. Images are center-cropped and resized to 256×256 or 512×512 following DiT/ADM preprocessing pipelines.
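The budget percentages above can be checked with a little arithmetic. The sketch below assumes the standard ImageNet-1K training split of 1,281,167 images over 1,000 classes (figures not quoted in this summary) and an equal per-class allocation, which is an illustrative assumption; `budget_summary` is a hypothetical helper name.

```python
# Well-known ImageNet-1K training-split figures (assumed, not quoted in this summary).
IMAGENET_TRAIN_SIZE = 1_281_167
NUM_CLASSES = 1_000


def budget_summary(subset_size: int) -> tuple[float, int]:
    """Return (percentage of the full training set, images per class).

    Assumes the budget is split evenly across classes, which is an
    illustrative simplification.
    """
    percent = 100.0 * subset_size / IMAGENET_TRAIN_SIZE
    per_class = subset_size // NUM_CLASSES
    return percent, per_class


for n in (10_000, 50_000, 100_000):
    pct, per_class = budget_summary(n)
    print(f"{n:>7} images -> {pct:.2f}% of the training set, {per_class} per class")
```

Under these assumptions the three budgets come out slightly below the rounded 0.8%, 4%, and 8% figures, and the smallest budget leaves only 10 images per class.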
Architecture/algorithm:
- Feature extraction produces embeddings for all candidates.
- Formulates subset selection as discrete optimization of alignment loss combining (a) one-sided partial optimal transport (POT) cost, allowing unmatched target mass to focus on high-density regions, (b) mean-variance feature statistics regularizer, and (c) confidence regularization enforcing semantic consistency.
- Uses the dummy-source formulation with entropic Sinkhorn iterations to efficiently compute POT transport plans and costs.
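The dummy-source idea in the last bullet can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a squared Euclidean cost, uniform weights, and a zero-cost dummy source row that absorbs whatever target mass is left unmatched. The function name `one_sided_pot_cost` and the parameters `matched_mass`, `eps`, and `n_iters` are hypothetical; how `matched_mass` relates to the paper's capacity coefficient κ is not specified in this summary.

```python
import numpy as np


def one_sided_pot_cost(subset_feats, full_feats, matched_mass=0.8, eps=0.05, n_iters=200):
    """Entropic one-sided partial OT between a subset and the full set.

    The subset (source) ships all of its mass; the full set (target) may
    leave `1 - matched_mass` unmatched, which a zero-cost dummy source absorbs.
    Returns (transport cost over real sources, augmented plan).
    """
    m, n = len(subset_feats), len(full_feats)
    # Pairwise squared Euclidean costs, shape (m, n).
    diff = subset_feats[:, None, :] - full_feats[None, :, :]
    cost = (diff ** 2).sum(axis=-1)

    # Augment with a dummy source row at zero cost (assumed convention).
    cost_aug = np.vstack([cost, np.zeros((1, n))])
    a = np.concatenate([np.full(m, matched_mass / m), [1.0 - matched_mass]])
    b = np.full(n, 1.0 / n)

    # Scale the entropic temperature by the largest cost for stability.
    scale = max(float(cost.max()), 1e-12)
    kernel = np.exp(-cost_aug / (eps * scale))

    # Standard Sinkhorn scaling iterations on the balanced, augmented problem.
    u = np.ones(m + 1)
    v = np.ones(n)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)

    plan = u[:, None] * kernel * v[None, :]
    # Only mass shipped from real subset points contributes to the cost.
    return float((plan[:m] * cost).sum()), plan
```

Because the dummy row costs nothing, target points that are expensive to reach from the subset tend to be assigned to it, which is one way to realize "unmatched mass" in sparse regions. A subset placed inside the data cloud should therefore score lower than the same subset shifted far away.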
Training regime:
- Subsets constructed per class to maintain balance.
- Two-stage discrete solver:
  - Stage I: Geometry-guided greedy construction that adds, at each step, the candidate yielding the lowest total loss.
  - Stage II: Swap-based refinement that evaluates pairwise swaps between selected and unselected points and accepts those that reduce the total loss, until a local optimum is reached.
- Optimization hyperparameters: Entropic regularization weight, POT capacity coefficient (κ), and weighting parameters α and β balance the loss terms.
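The two-stage solver can be illustrated with a generic sketch. It takes any set-valued loss as a callable, so it does not reproduce the paper's objective; `greedy_select`, `swap_refine`, `max_passes`, and the toy mean-matching loss are all hypothetical names and choices made for illustration.

```python
from typing import Callable, List, Tuple

LossFn = Callable[[List[int]], float]


def greedy_select(n_candidates: int, budget: int, loss_fn: LossFn) -> List[int]:
    """Stage I: repeatedly add the candidate that gives the lowest loss."""
    selected: List[int] = []
    remaining = list(range(n_candidates))
    while len(selected) < budget:
        best = min(remaining, key=lambda j: loss_fn(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


def swap_refine(selected: List[int], n_candidates: int, loss_fn: LossFn,
                max_passes: int = 10) -> Tuple[List[int], float]:
    """Stage II: accept any selected/unselected swap that strictly lowers the loss."""
    selected = list(selected)
    current = loss_fn(selected)
    for _ in range(max_passes):
        improved = False
        for pos in range(len(selected)):
            chosen = set(selected)
            for j in range(n_candidates):
                if j in chosen:
                    continue
                trial = selected[:pos] + [j] + selected[pos + 1:]
                trial_loss = loss_fn(trial)
                if trial_loss < current:
                    chosen.discard(selected[pos])
                    chosen.add(j)
                    selected, current = trial, trial_loss
                    improved = True
        if not improved:  # local optimum: no single swap helps
            break
    return selected, current


# Toy example: match the mean of a 1-D dataset with three points.
values = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
target_mean = sum(values) / len(values)


def toy_loss(idx: List[int]) -> float:
    subset_mean = sum(values[i] for i in idx) / len(idx)
    return (subset_mean - target_mean) ** 2


greedy = greedy_select(len(values), 3, toy_loss)
refined, refined_loss = swap_refine(greedy, len(values), toy_loss)
print("greedy:", greedy, "loss:", toy_loss(greedy))
print("refined:", refined, "loss:", refined_loss)
```

In this toy case the greedy pass commits early to a point that later turns out to be suboptimal, and the swap pass replaces it, which mirrors the motivation given for Stage II.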
Evaluation protocol:
- Diffusion models DiT-L/2 and SiT-L/2 are trained from scratch for 100K iterations on selected subsets.
- Metrics: FID (image fidelity/diversity), Inception Score (semantic clarity), and Precision/Recall (fidelity and coverage).
- Baselines: Random, Herding, K-Center, CCS, DQ, and D2C.
- All methods are evaluated at a fixed number of training iterations, so differences reflect the quality of the selected subset.
- Tests conducted across multiple data budgets (0.8%, 4%, 8%), image resolutions (256×256, 512×512), and diffusion variants for robustness.
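For readers unfamiliar with FID, the metric listed above is the Fréchet distance between two Gaussians fitted to feature sets: ||μ_a − μ_b||² + Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^{1/2}). The sketch below computes that quantity for arbitrary feature arrays with at least two dimensions. It is not the paper's evaluation code; standard FID uses Inception-v3 features, which are not extracted here, and `frechet_distance` is a hypothetical name.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n, d) feature arrays, d >= 2."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (tiny numerical noise is clipped).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = float(((mu_a - mu_b) ** 2).sum())
    cov_term = float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
    return mean_term + cov_term
```

Two identical feature sets give a distance of approximately zero, and shifting one set by a constant vector adds exactly the squared norm of that shift, since the covariances are unchanged.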
Reproducibility:
- Code released at https://github.com/2018cx/GADC.
- Uses publicly available ImageNet-1K dataset.
- Pseudocode and convergence analyses provided in appendices.
End-to-end example:
- For a 10K subset from ImageNet, extract features from all ~1.28 million candidate images.
- Initialize an empty subset per class.
- Greedily add samples one at a time, each time choosing the candidate that gives the lowest combined POT + statistics + confidence loss.
- Refine the subset with swaps that further lower the loss.
- Train diffusion model DiT-L/2 on condensed 10K selection for 100K steps.
- Evaluate generated samples via FID, IS, Precision, Recall to confirm fidelity and distribution coverage improvements over baselines.
- Repeat for larger subsets and resolutions.
Technical innovations
- Reformulating real subset selection as a one-sided partial optimal transport (POT) alignment problem with a capacity constraint to focus on core data manifold for better diffusion training fidelity.
- Introducing lightweight statistical (mean-variance) and semantic confidence regularization to complement geometric alignment for improved distributional fidelity.
- Developing a two-stage discrete optimization algorithm combining geometry-guided greedy construction and swap-based refinement to optimize discrete subset selection efficiently.
- Applying dummy-source augmented cost matrices and entropic Sinkhorn iterations for scalable, differentiable one-sided POT computation on large datasets.
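The two regularizers and the weighted objective can be sketched as below. The exact functional forms are not given in this summary, so everything here is an assumption: a squared gap between per-dimension means and variances for the statistics term, a negative log-probability of the true class (from some pretrained classifier) for the confidence term, and α and β assigned to those two terms in that order. All function names are hypothetical.

```python
import numpy as np


def mean_variance_penalty(subset_feats: np.ndarray, full_feats: np.ndarray) -> float:
    """Squared gap between per-dimension means and variances (assumed form)."""
    mean_gap = subset_feats.mean(axis=0) - full_feats.mean(axis=0)
    var_gap = subset_feats.var(axis=0) - full_feats.var(axis=0)
    return float((mean_gap ** 2).sum() + (var_gap ** 2).sum())


def confidence_penalty(true_class_probs: np.ndarray) -> float:
    """Mean negative log-probability of the true class (assumed form).

    `true_class_probs` holds, for each selected sample, the probability a
    classifier assigns to its label; low values flag ambiguous samples.
    """
    clipped = np.clip(true_class_probs, 1e-12, 1.0)
    return float(-np.log(clipped).mean())


def total_loss(pot_cost: float, subset_feats: np.ndarray, full_feats: np.ndarray,
               true_class_probs: np.ndarray, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted sum of the geometric, statistical, and semantic terms."""
    stats = mean_variance_penalty(subset_feats, full_feats)
    conf = confidence_penalty(true_class_probs)
    return pot_cost + alpha * stats + beta * conf
```

Both penalties need only running sums over the selected features, which fits the "lightweight" description: they add little cost on top of the transport term when a candidate is added or swapped.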
Datasets
- ImageNet-1K — ~1.28 million images — publicly available
Baselines vs proposed
- Random (0.8% data budget, 10K subset, DiT-L/2, 100K iterations): FID = 35.86 vs proposed: 3.43
- K-Center (0.8% budget): FID = 50.77 vs proposed: 3.43
- Herding (0.8% budget): FID = 40.75 vs proposed: 3.43
- D2C (0.8% budget): FID = 4.20 vs proposed: 3.43
- D2C (4.0% budget): FID = 14.81 vs proposed: 11.01
- D2C (8.0% budget): FID = 22.55 vs proposed: 17.09
- At 512×512 resolution (0.8% budget): D2C FID = 14.8 vs proposed: 6.17
- SiT-L/2 (8.0% budget): D2C FID = 15.01 vs proposed: 8.83
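The D2C comparisons above are easier to read as absolute and relative reductions. The snippet below only restates numbers from this list; the dictionary keys are informal labels and `fid_gain` is a hypothetical helper.

```python
# (D2C FID, proposed FID) pairs copied from the list above.
d2c_vs_proposed = {
    "0.8% budget": (4.20, 3.43),
    "4.0% budget": (14.81, 11.01),
    "8.0% budget": (22.55, 17.09),
    "512x512, 0.8% budget": (14.8, 6.17),
    "SiT-L/2, 8.0% budget": (15.01, 8.83),
}


def fid_gain(baseline: float, proposed: float) -> tuple[float, float]:
    """Return (absolute FID reduction, relative reduction in percent)."""
    absolute = baseline - proposed
    return absolute, 100.0 * absolute / baseline


for setting, (baseline, proposed) in d2c_vs_proposed.items():
    absolute, relative = fid_gain(baseline, proposed)
    print(f"{setting:<22} -{absolute:.2f} FID ({relative:.1f}% lower than D2C)")
```

Read this way, the smallest absolute gap is at the 0.8% budget on 256×256, while the 512×512 and SiT-L/2 settings show the largest relative reductions.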
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.05883.

Fig 1: Comparison of dataset condensation methods. (a) Syn-

Limitations
- Experiments are limited to ImageNet-1K; transferability to other domains or modalities is not studied.
- No adversarial or robustness evaluation against distribution shifts beyond the training data domain.
- Optimization involves greedy and swap heuristics, which may not yield globally optimal subsets in combinatorial space.
- Evaluation is fixed at 100K training iterations; longer training schedules, as used in real-world settings, are not fully explored.
- The approach depends on quality and representativeness of learned feature embeddings, which may vary with encoder choice.
- Data synthesis methods were not the main focus; integration or hybrid approaches could be explored further.
Open questions / follow-ons
- How does the method perform on datasets beyond ImageNet, particularly those with different data modalities or distributions?
- Can the geometry-aware condensation framework be extended to dynamic or streaming data selection scenarios?
- What are the theoretical guarantees or bounds for optimality in the discrete subset optimization with one-sided POT?
- How do different feature extraction backbones or self-supervised embeddings affect condensation quality and diffusion training?
Why it matters for bot defense
For bot-defense or CAPTCHA systems considering generative approaches or adversarial data modeling, this paper demonstrates how to effectively condense large datasets into smaller representative subsets that preserve critical distributional structure for diffusion model training. This has direct implications for improving training efficiency without sacrificing model fidelity, especially when large-scale high-quality data is costly to collect or store. The geometry-aware distribution alignment approach offers a principled way to select real data points that support robust generative behavior, which could be leveraged in scenarios requiring high-fidelity synthetic data generation for bot detection or CAPTCHA challenge creation. Moreover, the discrete optimization framework can inspire dataset pruning/curation strategies where maintenance of generative model quality is crucial despite stringent compute or storage constraints.
Cite
@article{arxiv2606_05883,
title={Geometry-Aware Dataset Condensation for Diffusion Model Training},
author={Xiao Cui and Yulei Qin and Mo Zhu and Wengang Zhou and Hongsheng Li and Houqiang Li},
journal={arXiv preprint arXiv:2606.05883},
year={2026},
url={https://arxiv.org/abs/2606.05883}
}