
SynPAIN: A Synthetic Dataset of Pain and Non-Pain Facial Expressions

Source: arXiv:2507.19673 · Published 2025-07-25 · By Babak Taati, Muhammad Muzammil, Yasamin Zarghami, Abhishek Moturu, Amirhossein Kazerouni, Hailey Reimer et al.

TL;DR

SynPAIN tackles a very specific but important data problem in automated pain assessment: most public facial-pain datasets are small, skew younger, and underrepresent the older adults and demographic diversity needed for clinical deployment in dementia care. The paper’s main idea is to use commercial generative AI tools to build a large, publicly shareable synthetic dataset of paired neutral and expressive faces, with explicit labels for age, gender, ethnicity/race, and expression type. The key design choice is that each identity comes as a neutral/expressive pair, which makes the dataset directly useful for pairwise pain detectors like PwCT that compare an expression against an individualized baseline.

What is new is not just synthetic faces per se, but the combination of demographic balancing, clinically motivated pain prompt engineering, and evaluation targeted at bias detection and augmentation. The authors validate the synthetic pain images using off-the-shelf facial action unit analysis and show that pain images score higher on PSPI than neutral and non-pain images; they then use the dataset to expose demographic performance gaps in existing pain models and to test augmentation. The headline downstream result is that adding age-matched SynPAIN data to training improves average precision by 7.0% on real clinical data, while also revealing substantial age, gender, and ethnicity/race disparities that were hard to see in smaller datasets.

Key findings

  • SynPAIN contains 10,710 facial expression images organized as 5,355 neutral/expressive pairs across 5 ethnicity/race groups, 2 age groups (20–35, 75+), and 2 genders.
  • Face quality is high by two independent checks: DSL-FIQA averaged 0.868 overall (neutral 0.865, expressive 0.871) and Py-Feat face-detection confidence was 0.999 for neutral vs 0.998 for expressive images.
  • FaceReader AU detection failed far more often on pain images (12.4%) than on neutral (1.2%) or non-pain expressions (1.9%); by ethnicity/race, failure was 29.9% for Black pain images vs 3.3% for East Asian pain images and 7.9% for Middle Eastern pain images.
  • Computed PSPI increased monotonically across conditions: neutral mean 2.9, non-pain mean 4.3, pain mean 6.7; all pairwise differences were significant with Mann-Whitney U tests at p < 10^-5.
  • Identity consistency was strong for matched pairs: neutral/expressive images of the same identity had median cosine similarity 0.72, versus 0.19 for non-matching identities (Mann-Whitney U p < 0.001, Cohen’s d = 2.45).
  • Within-dataset 5-fold CV with the pairwise pain model showed age bias: AUROC 0.755 on young vs 0.692 on old; gender was nearly balanced overall at 0.723 for men vs 0.728 for women.
  • Age-stratified training exposed poor cross-age transfer: training only on young faces gave 0.752 AUROC on young vs 0.695 on old, while older-only training was worse overall, including 0.643 AUROC on old faces.
  • Age-matched synthetic augmentation improved real-data pain detection average precision by 7.0% on the UofR clinical dataset (the paper states AP improvement; the excerpted text does not provide the exact before/after AP values).

Threat model

The relevant adversary is distribution shift and demographic underrepresentation rather than an active attacker: pain-detection systems may be deployed on older adults with dementia, but most available training data come from younger or less diverse cohorts. The model is assumed to see facial images of people whose age, gender, and ethnicity/race may differ from the training set, and the practitioner cannot assume self-report labels or exhaustive clinical supervision. The paper does not model spoofing, adversarial examples, or malicious manipulation of the synthetic data.

Methodology — deep read

This is a dataset-bias and generalization setting rather than an adversarial security one. The implicit adversary is distribution shift: models trained on younger, narrower, or less diverse facial datasets may fail on older adults, on one gender more than the other, or on underrepresented ethnic/racial groups. The authors assume a practitioner wants a pain detector that generalizes to older adults with dementia, where self-report is limited and facial behavior carries the signal. They also assume that synthetic faces can support privacy-preserving sharing and augmentation, but they do not claim that the synthetic images are a perfect substitute for real clinical data.

Data generation is fully synthetic and is built around paired images per identity: one neutral baseline and one expressive image, where the expressive image is either pain or non-pain. The dataset totals 10,710 images, which equals 5,355 pairs. The demographic attributes are balanced as much as possible across two age bands (young 20–35, old 75+), two genders (male, female), and five ethnicity/race groups (Black, White, Middle Eastern, South Asian, East Asian). The paper says they used commercial generative AI tools after qualitatively evaluating multiple options; Ideogram 2.0 was selected for image generation quality, and RunwayML Gen-4 Alpha was used to generate 5-second, 24 fps videos for 40 identities that transition from neutral to expressive faces. The prompts varied clothing, background, hair/facial hair, demographic descriptors, and expression wording. For pain, prompts were intentionally tied to clinically relevant facial-action descriptions, including PSPI-related cues such as brow lowering and cheek raising, and PACSLAC-II-related descriptors such as groaning. After generation, they manually excluded a small number of images with profile views, artifacts, or unrealistic expressions, which explains why the final counts are only approximately balanced rather than perfectly balanced.
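The demographic balancing above can be made concrete. Since the excerpt does not release the actual generation prompts, the sketch below is purely illustrative: the demographic axes match the paper, but every template string and the `build_prompts` helper are hypothetical.

```python
from itertools import product

# Demographic axes described in the paper; the prompt wording below is
# invented for illustration -- the real prompts are not in the excerpt.
AGES = ["20-35 year old", "75+ year old"]
GENDERS = ["man", "woman"]
ETHNICITIES = ["Black", "White", "Middle Eastern", "South Asian", "East Asian"]
CONDITIONS = {
    "neutral": "a neutral, relaxed facial expression",
    # PSPI-inspired cues: brow lowering, cheek raising, eyelid tightening
    "pain": "a facial expression of pain with lowered brows, raised cheeks, "
            "and tightened eyelids",
}

def build_prompts(age, gender, ethnicity):
    """Return a {condition: prompt} pair for one synthetic identity (hypothetical)."""
    base = f"photorealistic portrait of a {age} {ethnicity} {gender}, frontal view"
    return {cond: f"{base}, {desc}" for cond, desc in CONDITIONS.items()}

pairs = [build_prompts(a, g, e) for a, g, e in product(AGES, GENDERS, ETHNICITIES)]
print(len(pairs))  # 2 ages x 2 genders x 5 groups = 20 demographic cells
```

Each cell would then be populated with many identities and the neutral/expressive prompt pair submitted per identity, which is how the paired structure of the dataset arises.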

Architecturally, the dataset itself is the contribution, but the validation pipeline matters. For image quality, they use DSL-FIQA, a face-image-quality model that outputs scores in [0,1], and Py-Feat face detection confidence plus estimated head pose (roll, pitch, yaw) to check frontal alignment. For pain-expression validity, they rely on FaceReader v9.1, a commercial AU detector, to estimate the PSPI pain score from AUs AU4, AU6, AU7, AU9, AU10, and AU43. PSPI is computed as AU4 + max(AU6, AU7) + max(AU9, AU10) + AU43. This is important: the authors are not training a new AU model; they are using an off-the-shelf clinical proxy to ask whether generated pain faces elicit the expected action-unit patterns. They also use Py-Feat face encodings to measure identity preservation: cosine similarity between neutral and expressive pairs of the same identity versus between unrelated identities, and then they extend that to look for effective diversity within demographic subgroups.
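The PSPI computation is simple enough to write out directly; the sketch below assumes per-AU intensity estimates have already been produced by a detector such as FaceReader, and the example values are toy numbers:

```python
def pspi(au):
    """Prkachin-Solomon Pain Intensity from AU intensities.

    `au` maps AU numbers to intensity estimates; AU43 (eyes closed)
    is typically a binary presence score.
    """
    return (au[4]                 # brow lowering
            + max(au[6], au[7])   # cheek raising / lid tightening
            + max(au[9], au[10])  # nose wrinkling / upper-lip raising
            + au[43])             # eye closure

# Toy intensities for a pain-like face (not values from the paper).
example = {4: 2.0, 6: 1.5, 7: 1.0, 9: 0.5, 10: 1.2, 43: 1.0}
print(pspi(example))  # 2.0 + 1.5 + 1.2 + 1.0 = 5.7
```

Because the max terms collapse paired AUs, a face can reach the same PSPI through different AU combinations, which is one reason the authors inspect the full AU distributions and not only the summary score.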

Training and evaluation of the pain detector follow the pairwise pain model from prior work [31], which compares an expressive face to a neutral baseline. The paper says the pairwise model is trained from scratch with hyperparameters configured according to PainControl [61], but the excerpt does not provide the optimizer, learning-rate schedule, batch size, number of epochs, random-seed strategy, or exact backbone details. What is clear is that within-dataset experiments use 5-fold cross-validation with 60/20/20 train/validation/test splits. The authors also train stratified by age, gender, and ethnicity/race to see how performance transfers across groups. For external evaluation, they test the released PwCT checkpoint (trained on UNBC-McMaster + UofR; a UNBC-McMaster-only variant is compared in passing) on SynPAIN demographic subsets. Finally, for augmentation they add only the 2,895 old identities from SynPAIN to the real training data, specifically to match the target older-adult population, and evaluate on the UofR clinical dataset. The excerpt states that this improves average precision by 7.0% but does not give the absolute AP values.
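A rotating 5-fold scheme that yields 60/20/20 train/validation/test per fold can be sketched as follows. The paper does not spell out the rotation logic or whether splits are made per identity, so both choices here are assumptions; splitting by identity keeps each neutral/expressive pair inside a single split and avoids leakage.

```python
import numpy as np
from sklearn.model_selection import KFold

# One entry per neutral/expressive identity pair (5,355 pairs in SynPAIN).
identity_ids = np.arange(5355)
folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(identity_ids))

for k in range(5):
    test_idx = folds[k][1]               # fold k as test (20%)
    val_idx = folds[(k + 1) % 5][1]      # next fold as validation (20%)
    train_idx = np.setdiff1d(            # remaining three folds as train (60%)
        identity_ids, np.concatenate([test_idx, val_idx]))
    print(len(train_idx), len(val_idx), len(test_idx))  # 3213 1071 1071 each fold
```

Both images of a pair would then be assigned to whichever split their identity landed in, so the pairwise model never sees a test identity's neutral baseline during training.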

A concrete end-to-end example: suppose the model sees an old woman identity generated by Ideogram with a neutral portrait and a paired pain portrait. The neutral image is checked for quality and frontal alignment, then the expressive image is checked to ensure it is not a profile or artifact. FaceReader is run on both. If the pain image contains AU4/AU6/AU7/AU9/AU10/AU43 patterns more strongly than the neutral image, the PSPI computed from the detected AUs rises; in the paper’s aggregate results, pain faces average PSPI 6.7 vs 4.3 for non-pain and 2.9 for neutral. The pairwise detector can then be trained to compare the expressive face to its neutral baseline, learning pain-related change rather than absolute face appearance. In evaluation, this same pairing structure is used to probe whether the detector is less accurate on old faces, on men, or on underrepresented ethnic/racial groups. The paper then uses the synthetic old faces to augment real older-adult data and reports improved AP on the held-out real clinical benchmark.
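The identity-consistency check in that pipeline reduces to cosine similarity between face embeddings. The sketch below substitutes random vectors for Py-Feat encodings, modeling the paired expressive embedding as a perturbation of the neutral one:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two face-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for Py-Feat face encodings: the matched
# expressive face is a small perturbation of its neutral baseline,
# while the other identity is an independent random vector.
rng = np.random.default_rng(0)
neutral = rng.normal(size=512)
expressive_same = neutral + 0.5 * rng.normal(size=512)   # same identity
other_identity = rng.normal(size=512)                    # different identity

print(cosine_similarity(neutral, expressive_same) >
      cosine_similarity(neutral, other_identity))  # True
```

In the paper's real measurements the analogous gap is large (median 0.72 for matched pairs vs 0.19 for non-matching identities), which is what licenses treating each pair as one preserved identity.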

Reproducibility is mixed but reasonably strong for a synthetic-data paper. The dataset is publicly available at a DOI, and the paper explicitly identifies the main generation tools used: Ideogram 2.0 and RunwayML Gen-4 Alpha. However, because the data are generated through commercial APIs and the paper relies on those services’ proprietary models, full bitwise reproducibility is unlikely. The excerpt does not mention release of generation prompts, code, or frozen checkpoints for the synthetic pipeline, although the dataset itself is public and the downstream PwCT checkpoint appears to be released by the authors of the prior work. Several evaluation components are also dependent on commercial or third-party tools (FaceReader, DSL-FIQA, Py-Feat), which may complicate exact replication across labs.

Technical innovations

  • A publicly available synthetic pain dataset with paired neutral/expressive images and explicit labels for age, gender, and ethnicity/race, designed specifically for older-adult pain detection.
  • A prompt-engineered generation pipeline that attempts to synthesize clinically meaningful pain expressions rather than merely transferring pain from real source faces.
  • A multi-pronged validation strategy combining face-image quality scoring, AU-based PSPI estimation, identity-consistency analysis, and demographic diversity analysis.
  • Use of synthetic data both to surface algorithmic bias in an existing pairwise pain detector and to improve real-world performance through age-matched augmentation.

Datasets

  • SynPAIN — 10,710 images (5,355 neutral/expressive pairs) — generated with commercial tools; publicly available via DOI 10.5683/SP3/WCXMAP
  • UNBC-McMaster Shoulder Pain Expression Archive Database — 48,398 frames from 25 participants — public
  • BioVid Heat Pain Database — 90 participants — public
  • UofR dataset — 102 older adult participants (95 manually annotated for PSPI/PACSLAC-II) — non-public due to ethical considerations

Baselines vs proposed

  • FaceReader AU detection on SynPAIN: overall failure rate = 4.2%, i.e. AUs were successfully extracted for 95.8% of images
  • Neutral vs non-pain vs pain PSPI (FaceReader-derived): mean PSPI = 2.9 vs 4.3 vs 6.7
  • Same-identity vs different-identity Py-Feat cosine similarity: median = 0.72 vs 0.19
  • Within-dataset 5-fold CV AUROC (pairwise pain model): young = 0.755 vs old = 0.692
  • Within-dataset 5-fold CV AUROC (pairwise pain model): man = 0.723 vs woman = 0.728
  • Pretrained PwCT on SynPAIN: AUROC = 0.696, AP = 0.710, F1 = 0.696
  • Pretrained PwCT by age: young AUROC = 0.729 vs old AUROC = 0.663
  • Pretrained PwCT by gender: man AUROC = 0.670 vs woman AUROC = 0.749
  • Augmentation on real clinical data: age-matched synthetic data increases average precision by 7.0%

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2507.19673.

Fig 1

Fig 1: Sample image pairs from the SynPAIN dataset. Top row: Non-pain; bottom two rows: Pain; left three columns: Old; rightmost column: Young.

Fig 2

Fig 2: Sample frames from a 5-second video showing the progression from a neutral expression to a facial expression of pain.

Fig 3

Fig 3: Distribution of PSPI scores calculated from FaceReader AU detection.

Figs 4–8

Figs 4–8 appear on page 3 of the paper; their captions were not extracted.

Limitations

  • The synthetic images are validated indirectly with AU detectors and face-embedding similarity, not by human clinical raters or patient-ground-truth pain reports.
  • Generation uses proprietary commercial tools, so the exact synthetic pipeline is not fully reproducible end-to-end outside those services.
  • The excerpt does not provide full training hyperparameters, backbone details, or seed strategy for the pairwise model experiments.
  • Some demographic groups may still have limited effective diversity; the paper itself finds higher within-group identity similarity among women, especially East Asian women.
  • The paper demonstrates bias and augmentation on one main pairwise pain detector and one clinical dataset, so generalization to other architectures or care settings remains untested.
  • The augmentation result is summarized as a 7.0% AP gain, but the excerpt does not include the absolute AP values or confidence intervals.

Open questions / follow-ons

  • How much of the observed bias comes from the generator itself versus the downstream pairwise pain model, and can those two sources be disentangled with a stronger factorial study?
  • Would human raters or clinicians agree that the synthetic pain faces look pain-like across all demographics, especially in the older adult subgroup?
  • Can synthetic data help for video-based temporal pain recognition, not just static pairwise comparison, given that pain is often expressed as a dynamic change?
  • Would the augmentation gains hold for other architectures and other clinical datasets beyond UofR/PwCT?

Why it matters for bot defense

For bot-defense practitioners, the main lesson is methodological rather than domain-specific: demographic coverage and synthetic augmentation can expose failure modes that are invisible on small or narrow datasets. If you are building face-based risk signals, liveness checks, or accessibility-sensitive verification flows, SynPAIN is a reminder that performance claims can look balanced on paper while still hiding large subgroup gaps once you test at larger scale or on underrepresented cohorts.

The paper also shows a practical pattern worth borrowing: generate controlled synthetic data with known attribute balances, validate it with independent tooling, then use it both for bias audits and for targeted augmentation. In a CAPTCHA or face-risk setting, that same workflow could help stress-test whether a model overfits age, skin tone, or expression artifacts. The caveat is that synthetic data should be treated as a diagnostic and augmentation tool, not a substitute for real evaluation on the actual deployment population.
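That audit-then-augment workflow bottoms out in computing a metric per subgroup. A minimal sketch of such a bias audit, with synthetic labels and scores (the noise levels are chosen only to mimic an age gap; none of these numbers come from the paper):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auroc(y_true, y_score, groups):
    """AUROC per demographic subgroup -- the audit pattern the paper applies."""
    return {g: roc_auc_score(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)}

# Toy audit: scores are deliberately noisier for the "old" subgroup.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=2000)                       # binary pain labels
groups = np.where(np.arange(2000) < 1000, "young", "old")
noise = np.where(groups == "young", 0.8, 1.6)           # more noise for "old"
scores = y + noise * rng.normal(size=2000)              # model scores

per_group = subgroup_auroc(y, scores, groups)
print(per_group)  # the "young" AUROC comes out higher than "old"
```

The same helper extends to gender or ethnicity/race columns, and comparing the per-group dictionaries before and after targeted augmentation is exactly the kind of check the paper runs with its age-matched synthetic data.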

Cite

bibtex
@article{arxiv2507_19673,
  title={SynPAIN: A Synthetic Dataset of Pain and Non-Pain Facial Expressions},
  author={Babak Taati and Muhammad Muzammil and Yasamin Zarghami and Abhishek Moturu and Amirhossein Kazerouni and Hailey Reimer and Alex Mihailidis and Thomas Hadjistavropoulos},
  journal={arXiv preprint arXiv:2507.19673},
  year={2025},
  url={https://arxiv.org/abs/2507.19673}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution