
Vec2Face+ for Face Dataset Generation

Source: arXiv:2507.17192 · Published 2025-07-23 · By Haiyu Wu, Jaskirat Singh, Sicong Tian, Liang Zheng, Kevin W. Bowyer

TL;DR

Vec2Face+ is a synthetic face-dataset generation method aimed at face recognition training, with the central claim that the remaining gap between synthetic and real training data is not just about attribute diversity or inter-class separation, but also about preserving identity consistency within each synthetic identity folder. The paper revisits prior synthetic datasets and shows that, even when attribute variation is comparable to or higher than that of CASIA-WebFace, synthetic sets can still underperform because images grouped as one identity are often not consistently recognized as the same person by a face recognition (FR) model.

The method extends the earlier Vec2Face paradigm by generating images directly from FR feature vectors, then adds three generation strategies: sampling vectors that are sufficiently dissimilar to create well-separated identities, AttrOP for iterative attribute control, and a LoRA-based pose-control path that can produce large head-pose variation much faster and with better identity preservation than AttrOP. Empirically, the authors report that VFace10K reaches 93.89% average accuracy on five standard real-world test sets, and that scaling to VFace100K and VFace300K yields 94.88% and 94.93%, both above CASIA-WebFace at 94.79% on the same five-test-set average. They also report stronger results on seven test sets for VFace10K, but note important weaknesses: synthetic identities still struggle badly on identical-twins verification and show larger demographic bias than real-data training.

Key findings

  • VFace10K reaches 93.89% average accuracy on five standard real-world test sets (LFW, CFP-FP, AgeDB-30, CALFW, CPLFW), which the paper says is 1.59 percentage points above the second-best synthetic training set of the same size.
  • VFace100K and VFace300K reach 94.88% and 94.93% average accuracy on the same five test sets, exceeding CASIA-WebFace at 94.79% for the first reported synthetic-dataset win over that real training set on average accuracy.
  • The paper attributes the residual synthetic-vs-real gap to intra-class identity inconsistency rather than attribute variation: Fig. 3 shows attribute variation is roughly on par with CASIA-WebFace, while Fig. 6 shows synthetic datasets have lower intra-class consistency.
  • HSFace10K/VFace (Vec2Face-family) achieve inter-class separability comparable to CASIA-WebFace in Fig. 4, yet still trail it by 2.79 percentage points in average accuracy, implying separability alone is insufficient.
  • LoRA pose control generates 200K images in under 30 minutes on an NVIDIA L40S, versus more than 20 hours with AttrOP for the same scale, i.e. over 40x faster.
  • Dropping the patch-based discriminator from Vec2Face reduces training time by 20% while the authors report no degradation in image quality.
  • Only 1 out of 11 synthetic datasets beats random guessing (50%) on identical-twins verification, despite some of those datasets performing well on standard FR benchmarks.
  • Models trained on synthetic identities exhibit larger demographic disparity than models trained on real identities, according to the paper’s additional bias evaluation set.

Methodology — deep read

The adversary model is implicit because this is primarily a dataset-generation / FR-training paper rather than a direct attack paper. The system assumes an offline dataset construction setting: the generator can use pretrained face-recognition, pose, quality, expression, and attribute estimators; the goal is to synthesize labeled face images that train a downstream FR model. There is no explicit attacker with query access to the generator in the main framing, but the design is motivated by privacy-preserving substitution for web-scraped real faces. The key assumption is that continuous FR embeddings encode identity well enough that if the generator learns to map from feature space to images faithfully, then sampled vectors can serve as identity seeds and nearby perturbed vectors can preserve identity while varying attributes.

Data provenance mixes public real-image datasets used to train the generator, estimate attributes, and evaluate. Figs. 3–6 analyze five synthetic datasets (DigiFace, SFace, IDiff-Face, DCFace, and HSFace10K from Vec2Face) alongside CASIA-WebFace. For the attribute analysis, the authors extract labels for nine facial attributes with pretrained tools: brightness via BiSeNet face parsing plus the Face Skin Brightness metric, quality via MagFace feature magnitude, age via a 0–100 age classifier, yaw via img2pose, expression via DDAMFN trained on AffectNet, beard/baldness/mustache via LogicNet trained on FH41K, and eyeglasses via an eyeglasses classifier. Images whose segmented skin covers less than 20% of the image area are discarded for the brightness measurement (dropping under 7% of images), and female images are filtered out when studying beard/baldness/mustache to avoid confounding. The exact training split for Vec2Face+ itself is not fully spelled out in the provided text, but the paper states that the system generates VFace datasets at multiple scales (VFace10K, VFace100K, VFace300K) and mentions 4M/12M-image variants in the abstract and intro. The main generation process samples identity vectors from a Gaussian and perturbed vectors from Gaussian noise with sigma values 0.3, 0.5, and 0.7; resulting images are accepted only if FR-similarity and quality constraints are met.

Architecturally, Vec2Face+ has two components: a main image-generation model and a pose-control model. The main model expands a 512-D FR feature vector through two linear layers into a 49×768 token map (batched shape (N, 49, 768)) so it matches a ViT-B-like token layout, which a 4-layer decoder maps back to a 112×112×3 image. The core novelty relative to Vec2Face is the feature masked autoencoder (fMAE): instead of masking patches of pixels, it randomly drops entire rows of the feature map before encoding, with the masked-row proportion drawn from a truncated distribution over [0.5, 1.0] with mean 0.75. After encoding, the model fills in the condition at the masked positions. Training supervision is reconstruction-based: a pixel L2 reconstruction loss, an FR-embedding cosine-distance loss between generated and ground-truth images, and an LPIPS-style perceptual loss with a VGG backbone, combined as L_total = L_rec + L_id + 0.2·L_lpips. A notable simplification is that Vec2Face+ removes the patch-based discriminator used in Vec2Face: the authors found it only marginally improved image quality while adding substantial compute, and its removal cuts training time by 20%.
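To make the fMAE idea concrete, here is a minimal PyTorch sketch of the feature expansion, row masking, and loss composition. It follows the shapes and loss weights stated above (512 → 49×768 → 112×112×3, weight 0.2 on LPIPS), but the hidden layer width, the mask-ratio spread, and all names (FMAEFrontEnd, vec2face_plus_loss, lpips_fn) are our own illustrative choices, not the authors' code.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FMAEFrontEnd(nn.Module):
    """Sketch of the fMAE input path: expand a 512-D FR vector to a
    49x768 token map, then mask whole rows rather than pixel patches."""

    def __init__(self, fr_dim=512, n_tokens=49, token_dim=768):
        super().__init__()
        # Two linear layers, as the paper describes; the hidden width is assumed.
        hidden = 2048
        self.fc1 = nn.Linear(fr_dim, hidden)
        self.fc2 = nn.Linear(hidden, n_tokens * token_dim)
        self.n_tokens, self.token_dim = n_tokens, token_dim
        self.mask_token = nn.Parameter(torch.zeros(token_dim))

    def forward(self, fr_vec):
        b = fr_vec.size(0)
        x = self.fc2(F.gelu(self.fc1(fr_vec))).view(b, self.n_tokens, self.token_dim)
        # Masked-row proportion: mean 0.75 on [0.5, 1.0] per the paper
        # (clamping approximates the truncation; the 0.1 std is assumed).
        ratio = torch.empty(b, device=x.device).normal_(0.75, 0.1).clamp_(0.5, 1.0)
        keep = torch.rand(b, self.n_tokens, device=x.device) >= ratio[:, None]
        # Replace masked rows with a learned mask token before encoding.
        return torch.where(keep.unsqueeze(-1), x, self.mask_token.expand_as(x))

def vec2face_plus_loss(img_rec, img_gt, emb_rec, emb_gt, lpips_fn):
    """Total loss stated in the paper: L_rec + L_id + 0.2 * L_lpips."""
    l_rec = F.mse_loss(img_rec, img_gt)                       # pixel L2
    l_id = 1.0 - F.cosine_similarity(emb_rec, emb_gt).mean()  # FR cosine distance
    return l_rec + l_id + 0.2 * lpips_fn(img_rec, img_gt)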

The pose-control branch is a separate fine-tuning path built around a frozen main model, a 4-layer CNN that processes a face-landmark image, and LoRA adapters. The landmark image serves as an explicit pose condition inside fMAE, while LoRA adapts the weights to generate large head-pose changes. This is positioned as a replacement for or complement to AttrOP, which is a post-training gradient-descent search over latent vectors. AttrOP takes an identity vector vid, an initial perturbed vector vim, and iteratively updates vim to minimize a sum of identity loss, quality loss, and pose loss using differentiable estimators MFR, Mquality, and Mpose. In concrete terms, the generator makes an image from the current vector, measures FR similarity to the target identity, estimates quality, measures pose, then backpropagates to refine the vector. The authors say this works but becomes slow and can degrade identity for extreme poses, with the example that generating 200K images via AttrOP takes more than 20 hours on an NVIDIA L40S. LoRA pose control instead produces 200K images in under 30 minutes by conditioning on five facial landmark points, which is both much faster and more identity-preserving.
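As a sketch of the AttrOP loop described above: the latent vector, not the generator, is optimized. The loss weights, learning rate, optimizer choice, and exact quality/pose loss forms below are assumptions; G, M_fr, M_quality, and M_pose stand in for the differentiable generator and estimators the paper names.

python
import torch

def attrop(v_id, v_im, G, M_fr, M_quality, M_pose, target_yaw,
           steps=30, lr=0.01, w_id=1.0, w_q=1.0, w_pose=1.0):
    """Refine a perturbed vector so the generated image keeps identity v_id
    while matching target_yaw (at most 30 gradient iterations, per the paper)."""
    v = v_im.clone().requires_grad_(True)
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(steps):
        img = G(v)  # generate an image from the current vector
        # Identity: FR embedding of the image should match the identity vector.
        l_id = 1.0 - torch.cosine_similarity(M_fr(img), v_id, dim=-1).mean()
        # Quality: hinge pushing the MagFace-style score above a floor (assumed form).
        l_q = torch.relu(24.0 - M_quality(img)).mean()
        # Pose: absolute yaw error against the target angle.
        l_pose = (M_pose(img) - target_yaw).abs().mean()
        loss = w_id * l_id + w_q * l_q + w_pose * l_pose
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v.detach()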

For evaluation, the paper uses two layers of protocol. First, it performs dataset diagnostics: intra-class attribute variation, inter-class separability, and intra-class identity consistency across synthetic and real datasets. Inter-class separability is defined as the fraction of identities whose average embedding has cosine similarity below 0.4 against all other identities, following the Vec2Face prior work. Intra-class consistency is the average pairwise cosine similarity among images within each identity folder, again using a pretrained FR model. These diagnostics are backed by the visual examples in Fig. 5 and the summary plots in Fig. 3–6. Second, it trains face recognition models on the synthesized datasets and reports average accuracy on standard benchmarks. The main reported benchmark set is LFW, CFP-FP, AgeDB-30, CALFW, and CPLFW, where VFace datasets are compared against other synthetic sets and real-data baselines such as CASIA-WebFace and WebFace4M. The intro also states that eight additional test sets are used to probe attribute variation, similar-looking-persons challenges, and demographic bias, and that VFace10K is state of the art on seven real-world test sets, but the exact per-dataset numbers for those extra sets are not included in the provided text. No statistical significance tests or cross-validation procedure are described in the excerpt. The code is released on GitHub, which supports reproducibility, but the paper text provided does not specify frozen pretrained weights, exact random seeds, or full training hyperparameters beyond lambda = 0.2, pose-angle sets, similarity thresholds, and the sigma values used for perturbation.
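The two dataset diagnostics are easy to restate in code. The sketch below assumes precomputed FR embeddings; the 0.4 threshold and both definitions follow the paper, while the function names and NumPy plumbing are ours. Note the separability check requires an identity to stay below the threshold against all other identities, hence the max over each row.

python
import numpy as np

def inter_class_separability(id_means, thresh=0.4):
    """Fraction of identities whose mean embedding has cosine similarity
    below `thresh` to every other identity's mean (paper's definition)."""
    m = id_means / np.linalg.norm(id_means, axis=1, keepdims=True)
    sims = m @ m.T
    np.fill_diagonal(sims, -1.0)            # ignore self-similarity
    return float((sims.max(axis=1) < thresh).mean())

def intra_class_consistency(embs_per_id):
    """Average pairwise cosine similarity among images in each identity folder."""
    scores = []
    for e in embs_per_id:                   # e: (n_images, 512) FR embeddings
        e = e / np.linalg.norm(e, axis=1, keepdims=True)
        iu = np.triu_indices(len(e), k=1)   # unique image pairs only
        scores.append((e @ e.T)[iu].mean())
    return float(np.mean(scores))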

One concrete end-to-end example from the paper is the generation of the VFace300K family. The system samples an identity vector vid from N(0,1) in 512-D, uses the main Vec2Face+ generator G to synthesize an initial image, and keeps it only if the FR embedding similarity to vid is above 0.9 and the MagFace quality score is above 26. Then, for that identity, it samples 50 perturbed vectors with 40% from N(0,0.3), 40% from N(0,0.5), and 20% from N(0,0.7), normalizes them, and generates more images if similarity to the identity feature is above 0.7 and quality exceeds 24. This yields a mostly frontal base set. To inject stronger attribute diversity, the authors then run AttrOP with target yaw values in {30°, 40°, 50°, 60°, 70°, 80°}, using at most 30 gradient iterations per sample, and/or use the LoRA pose-control path to produce profile views directly from landmark conditions. The result is a larger dataset with multiple intra-class views and attributes, intended to be cleaner than prior synthetic sets because the identity space is controlled before image synthesis rather than after.
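The acceptance logic in that pipeline reduces to two rejection-sampling loops. The thresholds (0.9/26 for identity seeds, 0.7/24 for intra-class images) and the 40/40/20 sigma mixture come from the paper; how the noise is combined with the identity vector, and the helpers G, fr_embed, and magface_quality, are our assumptions.

python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sample_identity(G, fr_embed, magface_quality, dim=512):
    """Rejection-sample an identity seed: accept v_id only if the generated
    image matches it (cos > 0.9) at sufficient quality (> 26)."""
    while True:
        v_id = np.random.randn(dim)
        img = G(v_id)
        if cosine(fr_embed(img), v_id) > 0.9 and magface_quality(img) > 26:
            return v_id, img

def sample_intra_class(G, fr_embed, magface_quality, v_id, n=50):
    """Draw n perturbed vectors with the 40/40/20 sigma mixture and keep
    images with cos > 0.7 to the identity and quality > 24."""
    sigmas = np.random.choice([0.3, 0.5, 0.7], size=n, p=[0.4, 0.4, 0.2])
    images = []
    for s in sigmas:
        v = v_id + np.random.randn(len(v_id)) * s  # additive noise is assumed
        v = v / np.linalg.norm(v)                  # normalized, as described
        img = G(v)
        if cosine(fr_embed(img), v_id) > 0.7 and magface_quality(img) > 24:
            images.append(img)
    return images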

Technical innovations

  • Feature-space image generation from continuous FR embeddings, rather than discrete labels or diffusion/GAN noise-plus-condition pipelines, to enable open-ended synthetic identities.
  • Feature masked autoencoding (fMAE), which masks rows in a 49×768 identity feature map and reconstructs images under identity-aware supervision.
  • LoRA-based explicit pose control using five-point face-landmark conditions, replacing iterative latent search for large-pose synthesis.
  • A dataset construction strategy that separately addresses inter-class separability, general attribute diversity, and identity consistency, instead of optimizing only variation.
  • An empirical diagnosis that intra-class identity consistency, not just attribute spread, is a key missing factor in synthetic face training data.

Datasets

  • CASIA-WebFace — 494,414 images / 10,575 identities — public
  • DigiFace — size not stated in the provided text — public
  • SFace — size not stated in the provided text — public
  • IDiff-Face — size not stated in the provided text — public
  • DCFace — size not stated in the provided text — public
  • HSFace10K / Vec2Face — 10,000 identities — synthetic (authors’ generation)
  • VFace10K — 10,000 identities — synthetic (authors’ generation)
  • VFace100K — 100,000 identities — synthetic (authors’ generation)
  • VFace300K — 300,000 identities — synthetic (authors’ generation)
  • WebFace4M — approximately 4.2M images — public
  • FH41K — 41,000 images — public
  • AffectNet — size not stated in the provided text — public

Baselines vs proposed

  • Second-best synthetic 10K-scale baseline: average accuracy on LFW/CFP-FP/AgeDB-30/CALFW/CPLFW not stated directly; the reported 1.59-point margin implies ≈92.30% vs proposed VFace10K = 93.89%
  • CASIA-WebFace: average accuracy on five standard test sets = 94.79% vs proposed VFace100K = 94.88%
  • CASIA-WebFace: average accuracy on five standard test sets = 94.79% vs proposed VFace300K = 94.93%
  • HSFace10K / Vec2Face: inter-class separability = comparable to CASIA-WebFace vs proposed VFace10K = comparable to CASIA-WebFace (exact value not stated in text excerpt)
  • AttrOP generation: 200K images in >20 hours on NVIDIA L40S vs LoRA pose control = 200K images in <30 minutes
  • Vec2Face (prior): training time = baseline vs Vec2Face+ = 20% reduction with discriminator removed

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2507.17192.

  • Fig. 2: GAN-based and diffusion-based methods combine a Gaussian noise …
  • Fig. 4: Comparing synthetic and real FR training sets on inter-class separability.
  • Fig. 5: Visual examples supporting the dataset diagnostics (see Methodology).
  • Fig. 6: Comparing six datasets on intra-class identity consistency.
  • Fig. 10: Examples of head pose control, leveraging LoRA [27] parameter-efficient fine-tuning.

Limitations

  • The provided text does not give full training hyperparameters, seed strategy, or exact dataset splits, so exact reproduction details are incomplete in the excerpt.
  • The strongest reported headline numbers are average accuracies; the excerpt gives less detail on per-dataset behavior across the five core test sets and the eight additional ones.
  • Synthetic identities still perform poorly on identical-twins verification: only 1 of 11 synthetic datasets exceeds 50%, which suggests the identity model is still weak on near-duplicate discrimination.
  • The paper reports larger demographic bias for synthetic-trained models than for real-trained models, but the excerpt does not identify which demographic groups or how large the gaps are.
  • AttrOP is computationally heavy and can degrade identity for extreme pose, which is why the authors needed LoRA pose control.
  • Several analysis tools are themselves pretrained classifiers (age, pose, quality, expression, beard, glasses), so attribute statistics are only as reliable as those external estimators.

Open questions / follow-ons

  • Can identity consistency be improved further without sacrificing the inter-class separability that Vec2Face-style sampling provides?
  • What training objective would directly optimize twin-verification performance, rather than only standard FR accuracy on common benchmarks?
  • How sensitive are the reported gains to the choice of pretrained FR model used both for synthesis guidance and for evaluation?
  • Can the demographic bias observed in synthetic-trained models be reduced by explicitly balancing synthetic generation across demographic attributes?

Why it matters for bot defense

For a bot-defense or CAPTCHA practitioner, the paper is relevant because it shows synthetic face data can now train FR models that are competitive with real-data training on mainstream benchmarks, which lowers the barrier for building or adapting biometric classifiers without scraping more user data. That matters on both sides: defenders may use synthetic data to bootstrap privacy-preserving liveness or face-match systems, while attackers could use it to generate training data for impersonation, bias testing, or evasion research.

The more operationally interesting lesson is that standard accuracy benchmarks are not enough. The paper’s twin-verification failure and demographic disparity results suggest that synthetic face datasets can look strong on LFW-style metrics while still being brittle in near-duplicate and fairness-sensitive settings. For CAPTCHA/bot-defense systems that use face similarity, age, pose, or demographic cues, this means synthetic data may help with coverage and privacy, but it should be tested against adversarially similar faces, twins, and subgroup shifts before being trusted in production.

Cite

bibtex
@article{arxiv2507_17192,
  title={Vec2Face+ for Face Dataset Generation},
  author={Haiyu Wu and Jaskirat Singh and Sicong Tian and Liang Zheng and Kevin W. Bowyer},
  journal={arXiv preprint arXiv:2507.17192},
  year={2025},
  url={https://arxiv.org/abs/2507.17192}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution