UIFace: Unleashing Inherent Model Capabilities to Enhance Intra-Class Diversity in Synthetic Face Recognition

Source: arXiv:2502.19803 · Published 2025-02-27 · By Xiao Lin, Yuge Huang, Jianqing Xu, Yuxi Mi, Shuigeng Zhou, Shouhong Ding

TL;DR

This paper addresses the challenge of generating synthetic face datasets with high intra-class diversity while preserving identity, to train face recognition (FR) models without relying on privacy-sensitive real data. Existing synthetic face generation methods conditioned on identity context often overfit these contexts, producing images with low intra-class variation and thus reducing FR training effectiveness. UIFace introduces a diffusion-based framework that leverages both identity-conditioned and unconditional generation by using a learnable empty context alongside identity contexts. A novel two-stage sampling strategy is proposed that first leverages the empty context to increase diversity and then conditions on identity to restore identity-relevant details. An attention injection module further guides the identity-conditioned generation using attention maps from the empty-context branch.

The method is trained on the CASIA-WebFace dataset and evaluated by training FR models on UIFace-generated synthetic datasets and testing on common benchmarks. Experimental results demonstrate UIFace significantly outperforms prior synthetic-based approaches in face verification accuracy across LFW, CFP-FP, CPLFW, AgeDB, and CALFW datasets, even when using less training data or fewer synthetic identities. When scaling up synthetic identities, UIFace reaches performance comparable to models trained on real data. The approach enhances intra-class diversity (measured by LPIPS and Improved Recall metrics) while balancing identity consistency.

Overall, UIFace advances synthetic face data generation towards practical privacy-preserving FR training with diverse and identity-consistent samples.

Key findings

  • UIFace achieves 93.27% average face verification accuracy on five real benchmarks with only 0.5M synthetic images (10k IDs × 50 images each), surpassing the previous best synthetic method CemiFace (92.30%), though still below the real CASIA-WebFace baseline (95.05%).
  • Increasing synthetic dataset size to 1.5M images (30k IDs × 50) further boosts average accuracy to 94.54%, outperforming state-of-the-art Arc2Face (93.14%) and nearing real data performance.
  • Intra-class diversity metrics LPIPS and Improved Recall increase by over 5% compared to baseline diffusion models without two-stage sampling and attention injection (LPIPS from 0.5270 to 0.5592, Improved Recall from 53.99 to 71.96).
  • Two-stage sampling with adaptive partitioning improves average accuracy from 91.53% (baseline) to 92.70%, and further with attention injection to 93.27%.
  • Adaptive determination of the two-stage partition boundary per sample outperforms fixed boundaries by ~0.6% mean accuracy.
  • Attention injection from empty context maps yields higher-quality images and better identity preservation compared to naive attention map replacement (as validated qualitatively and quantitatively).
  • UIFace improvements generalize across different diffusion sampling algorithms, improving accuracy by ~1.7% over baseline DDIM and FPNDM samplers.
  • Results confirm that early sampling steps restore diverse attributes (pose, illumination) while later steps restore identity, motivating the two-stage design.

Threat model

n/a — This paper focuses on synthetic data generation for face recognition training to address privacy and dataset limitations rather than adversarial threats.

Methodology — deep read

  1. Threat model and assumptions: The study assumes no adversary, as the work focuses on synthetic data generation to avoid the privacy issues inherent in real face datasets. The goal is to enhance synthetic data quality for training FR models. Identity contexts are extracted by a pretrained FR model and used to condition sampling.

  2. Data: Training uses CASIA-WebFace, a public dataset of 500k images over ~10.5k celebrities, with natural variations. The synthetic datasets generated vary in size from 0.5M to 1.5M images across 10k to 30k synthetic identities. Testing is done on five standard real datasets (LFW, CFP-FP, CPLFW, AgeDB, CALFW) to evaluate model generalization.

  3. Architecture/Algorithm: The core is a Latent Diffusion Model (Rombach et al. 2022) with a UNet backbone. Inputs are latent representations of faces. Conditions are identity contexts extracted via a pretrained FR model, and an additional learnable empty context is introduced to condition sampling without explicit identity. During training, 20% of conditions are replaced randomly by the empty context, encouraging the model to learn identity-agnostic generation.
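The 20% context-replacement scheme during training can be sketched as follows; `apply_context_dropout` and its argument names are illustrative, not the authors' code:

```python
import random

def apply_context_dropout(identity_contexts, empty_context, p_drop=0.2, rng=None):
    """During training, replace each identity context with the shared
    learnable empty context with probability p_drop (20% in the paper),
    so the model also learns identity-agnostic generation."""
    rng = rng or random.Random()
    return [empty_context if rng.random() < p_drop else c
            for c in identity_contexts]
```

This is the standard classifier-free-guidance training trick: the same network learns both the conditional and the (empty-context) unconditional score.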

A novel two-stage sampling strategy is proposed for inference: in early timesteps (high t), sampling is conditioned on the empty context to promote intra-class diversity in identity-irrelevant features. In later timesteps (low t), sampling switches to the given identity context to restore identity details. The boundary timestep t0 is adaptively determined per sample based on temporal changes in cross-attention maps between UNet features and contexts.
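A minimal sketch of the two-stage schedule with an adaptive boundary. Here `denoise_step` and `attn_change` are placeholders for the UNet denoising update and the measured temporal change in cross-attention maps; the threshold-based switching rule is an assumption about how the adaptive criterion might look, not the paper's exact formula:

```python
def two_stage_sample(x_T, T, denoise_step, empty_ctx, id_ctx,
                     attn_change, tau=0.3):
    """Stage 1 (high t): denoise under the empty context for diversity.
    Stage 2 (low t): once the change in cross-attention maps drops
    below tau, switch to the identity context to restore identity."""
    x, ctx, t0 = x_T, empty_ctx, None
    for t in range(T, 0, -1):
        if ctx is empty_ctx and attn_change(t) < tau:
            ctx, t0 = id_ctx, t          # adaptive per-sample boundary
        x = denoise_step(x, t, ctx)
    return x, t0
```

Because `attn_change` is evaluated per sample, different images switch at different timesteps, which is what the ablation credits with the ~0.6% gain over a fixed boundary.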

An attention injection module is further introduced, whereby cross-attention maps from the empty-context generation are normalized and injected into the identity-conditioned branch during the later stage, augmenting diversity while maintaining identity. Self-attention maps are replaced directly, as they affect identity-irrelevant attributes.
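Schematically, the injection can be pictured as below. The row-normalization and the outright self-attention replacement follow the description above, but the paper's exact normalization is not reproduced here, so treat this as an assumption-laden sketch:

```python
def normalize_rows(attn_map):
    """Renormalize each row of an attention map to sum to 1."""
    return [[v / (sum(row) or 1.0) for v in row] for row in attn_map]

def guide_identity_branch(id_attn, empty_attn):
    """Sketch of attention injection: cross-attention maps from the
    empty-context branch are row-normalized and injected in place of
    the identity branch's cross maps, while self-attention maps are
    replaced directly (no normalization), since they mainly carry
    identity-irrelevant attributes."""
    return {
        "cross": normalize_rows(empty_attn["cross"]),  # normalized injection
        "self": empty_attn["self"],                    # direct replacement
    }
```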

  4. Training regime: The diffusion model is trained for 250k iterations with batch size 64 using the Adam optimizer (lr=1e-4) and a classifier-free guidance scale of w=1. FR models are trained on the generated synthetic datasets with an IR50 backbone and ArcFace loss for 40 epochs.
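Collected as a configuration sketch for quick reference; the values are as reported above, but the dict keys are illustrative, not the authors' actual configuration schema:

```python
# Hyperparameters reported in the summary; key names are illustrative.
DIFFUSION_TRAINING = dict(
    iterations=250_000, batch_size=64,
    optimizer="Adam", lr=1e-4, cfg_guidance_scale=1.0,
)
FR_TRAINING = dict(
    backbone="IR50", loss="ArcFace", epochs=40,
)
```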

  5. Evaluation protocol: Face verification accuracy is evaluated on the standard benchmark datasets. Diversity of synthetic data is measured by LPIPS and Improved Recall. Ablations evaluate components such as fixed vs. adaptive stage partitioning and attention injection. Comparisons are made against recent baselines including SynFace, DigiFace, DCFace, IDiff-Face, CemiFace, and Arc2Face.
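As a toy illustration of the Improved Recall metric used for diversity (Kynkäänniemi et al.'s k-NN manifold estimate), sketched here on small 2-D feature lists rather than real FR embeddings:

```python
import math

def improved_recall(real_feats, fake_feats, k=3):
    """Fraction of real samples that fall inside the generated manifold,
    where the manifold is the union of k-NN balls around each generated
    feature (Improved Precision & Recall, simplified sketch)."""
    radii = []
    for i, f in enumerate(fake_feats):
        dists = sorted(math.dist(f, g)
                       for j, g in enumerate(fake_feats) if j != i)
        radii.append(dists[k - 1])          # distance to k-th neighbour
    covered = sum(
        any(math.dist(r, f) <= rad for f, rad in zip(fake_feats, radii))
        for r in real_feats
    )
    return covered / len(real_feats)
```

Higher recall means the synthetic set covers more of the real feature distribution, which is why it serves as an intra-class diversity proxy here.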

  6. Reproducibility: The authors plan a public code release. The referenced datasets and pretrained FR models are public. Some dataset-building steps (filtering similar synthetic identities) follow prior work (Kim et al. 2023). Detailed hyperparameters and architectures are provided.

Concrete example: For one synthetic face identity, the model first samples using empty context up to timestep t0 (e.g. identified adaptively per sample) to generate diverse pose and illumination variations, then from t0 to 0 it conditions on the specific identity embedding, injecting attention maps from empty context generation to guide the denoising. The resulting image preserves identity but exhibits diverse intra-class attributes. This synthetic data is used to train FR backbones, which demonstrate improved accuracy on real benchmarks.

Technical innovations

  • Introduction of a learnable empty context in conditional diffusion to enable intra-class diverse unconditional generation alongside identity-conditioned generation.
  • Novel two-stage diffusion sampling strategy that conditions on empty context in early steps and identity context in later steps to balance diversity and identity preservation.
  • Adaptive partitioning algorithm that dynamically determines the boundary between the two sampling stages per image based on temporal changes in cross-attention maps.
  • Attention injection module that utilizes normalized cross-attention maps from empty-context generation to guide identity-conditioned denoising at later sampling steps, enhancing diversity without degrading identity.

Datasets

  • CASIA-WebFace — ~0.5M images over 10,575 identities — public
  • LFW, CFP-FP, CPLFW, AgeDB, CALFW — standard FR benchmarking datasets — public

Baselines vs proposed

  • CASIA-Real (0.5M images): Average accuracy = 95.05% vs UIFace (0.5M images): 93.27%
  • SynFace (0.5M): 74.75% vs UIFace (0.5M): 93.27%
  • DigiFace (0.5M): 83.45% vs UIFace (0.5M): 93.27%
  • DCFace (0.5M): 89.56% vs UIFace (0.5M): 93.27%
  • CemiFace (0.5M): 92.30% vs UIFace (0.5M): 93.27%
  • Arc2Face (0.5M): 91.73% vs UIFace (0.5M): 93.27%
  • CemiFace (1.0M): 93.07% vs UIFace (1.0M): 94.06%
  • Arc2Face (1.2M): 93.14% vs UIFace (1.0M): 94.06%
  • Baseline diffusion model (single-stage): Average accuracy = 91.53% vs two-stage adaptive + attention: 93.27%
  • Baseline DDIM sampler: 91.53% vs UIFace DDIM: 93.27%
  • Baseline FPNDM sampler: 92.00% vs UIFace FPNDM: 93.59%

Limitations

  • The diffusion model is trained only on CASIA-WebFace; larger or more diverse real datasets are not tested for either generation or FR evaluation.
  • While intra-class diversity is improved, there remains an inherent tradeoff with identity consistency; perfect preservation is not guaranteed.
  • Attention injection method depends on heuristics for normalization; impact on identity could vary for different model architectures.
  • No adversarial or security-specific threat evaluation to test synthetic data robustness against spoofing or attack scenarios.
  • Performance and diversity improvements are demonstrated mostly on standard closed-set benchmarks; open-set or deployment scenarios are not explored.
  • Method relies on pretrained identity extractor; quality depends on this upstream FR model, which may bias generation.

Open questions / follow-ons

  • How well does the two-stage sampling method generalize when applied to other generative models beyond latent diffusion UNets?
  • Can the adaptive partitioning and attention injection modules be optimized further to dynamically trade off diversity and identity preservation for different applications?
  • How does UIFace-generated data perform in open-set FR problems or under domain shifts compared to real data?
  • Can the framework be extended to condition on other attributes (e.g. expression, pose) explicitly to further control intra-class variations?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, UIFace provides useful insights on leveraging generative latent diffusion models to produce synthetic face datasets with enhanced intra-class diversity without identity leakage. This can help generate large-scale training data for robust face-based authentication and attack detection systems without privacy concerns of real face datasets. The two-stage sampling technique balancing diversity and identity features may inspire synthetic data generation for other sensitive biometric modalities. The proposed attention injection and adaptive partitioning strategies highlight novel mechanisms to improve generative diversity without sacrificing critical identity consistency, which is a common challenge for synthetic biometric sample production. However, caution is warranted as no analysis of adversarial robustness or spoofing resistance is presented. Overall, UIFace contributes an important step toward realistic, scalable synthetic biometric data generation, key for advancing privacy-aware face recognition and bot-defense solutions relying on face biometrics.

Cite

bibtex
@article{arxiv2502_19803,
  title={UIFace: Unleashing Inherent Model Capabilities to Enhance Intra-Class Diversity in Synthetic Face Recognition},
  author={Xiao Lin and Yuge Huang and Jianqing Xu and Yuxi Mi and Shuigeng Zhou and Shouhong Ding},
  journal={arXiv preprint arXiv:2502.19803},
  year={2025},
  url={https://arxiv.org/abs/2502.19803}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution