Is Visual Realism Enough? Evaluating Gait Biometric Fidelity in Generative AI Human Animation
Source: arXiv:2512.19275 · Published 2025-12-22 · By Ivan DeAndres-Tame, Chengwei Ye, Ruben Tolosana, Ruben Vera-Rodriguez, Shiqi Yu
TL;DR
This paper asks a simple but important question for bot-defense and biometric security: if a generative human-animation model makes a video look convincing, does it also preserve the subject’s gait identity? The authors argue that visual realism and biometric fidelity are not the same thing, so they evaluate four state-of-the-art animation systems—MagicAnimate, AnimateAnyone, UniAnimate, and MTVCrafter—on whether they can reconstruct gait from a driving video and whether they can transfer that gait to a different visual identity.
The main result is that visual quality is often decent, but identification fidelity is much weaker and highly modality-dependent. Appearance-heavy generators can look good and preserve clothing/texture well, yet fail to preserve the motion cues that gait systems need. Conversely, the strongest motion-preserving system in their study is MagicAnimate, which uses DensePose guidance. Their identity-transfer experiment is especially revealing: when motion and appearance are intentionally disentangled, RGB-based gait recognizers largely track the reference identity rather than the driving subject, indicating that many current gait pipelines behave more like re-identification models than true gait-motion recognizers.
Key findings
- On SUSTech1K baseline restoration, MagicAnimate achieved the best pixel metrics: SSIM 0.82, PSNR 19.51, LPIPS 0.17, outperforming AnimateAnyone (0.77/16.50/0.26) and UniAnimate (0.80/17.73/0.22).
- On the harder CCGR-MINI in-the-wild restoration setting, UniAnimate had the best visual scores among generated videos: SSIM 0.74, PSNR 16.02, LPIPS 0.28; MagicAnimate dropped to 0.72/15.41/0.28 and AnimateAnyone to 0.70/15.15/0.31.
- In baseline gait restoration, AnimateAnyone achieved the highest BigGait (RGB) Rank-1 at 87.50%, but fell to 26.70% on DeepGaitV2 silhouettes and 32.10% on SkeletonGait++, evidence that high texture fidelity did not translate to motion fidelity.
- MagicAnimate was consistently strongest on structural gait evaluators in restoration: baseline Rank-1 of 58.90% on DeepGaitV2, 69.80% on SkeletonGait++, and 54.51% on SkeletonGait, all higher than the other generators on those modalities.
- Under pose-offset restoration, MagicAnimate remained strongest on structural metrics: 56.70% Rank-1 on DeepGaitV2, 64.06% on SkeletonGait++, and 50.15% on SkeletonGait, while MTVCrafter dropped to 9.99% on SkeletonGait.
- In the in-the-wild CCGR-MINI setting, all generators collapsed sharply; the best generated Rank-1 values were only 12.00% on BigGait (AnimateAnyone) and 10.61% on SkeletonGait++ (UniAnimate), far below real-data baselines (~95% in controlled settings, ~21–23% in the wild).
- In identity transfer, BigGait Rank-1 against the driving identity was near-zero for all generators (e.g., MagicAnimate 0.60%, AnimateAnyone 0.20%, UniAnimate 0.40%, MTVCrafter 0.20%), showing that motion identity was largely not preserved when appearance was swapped.
- The reverse identity-transfer check showed strong reference-image leakage: on BigGait, AnimateAnyone reached 86.10% Rank-1 against the reference identity, essentially matching its gait-restoration baseline performance and supporting the authors’ claim that RGB gait recognition is heavily appearance-driven.
Threat model
The implicit adversary is a generative human-animation system that receives a reference image plus driving motion and may produce a synthetic video that either preserves or obscures gait identity. The study assumes the generator does not have access to the evaluation identities during training and that the attack surface is limited to synthesis fidelity, not active model inversion or white-box optimization against the gait recognizers. The gait evaluators are treated as fixed black-box consumers of generated clips; the generators are not adapted to exploit them. The paper does not address adversaries who can post-process outputs, target specific recognizers, or train on the same identities.
Methodology — deep read
The threat setting is not an adversarial attack benchmark in the classic sense; it is a fidelity test for GenAI human-animation systems under controlled and zero-shot conditions. The adversary, implicitly, is the generator itself: can it preserve gait identity across synthesis, and can a downstream gait recognizer recover the driving subject’s identity after appearance has been changed? The authors assume the evaluated animation models have access to a reference image plus a driving video or pose stream, and that the test identities are unseen by the animation models. They are not testing a malicious adaptive attacker who actively optimizes against gait recognizers; rather, they test whether current models can accidentally or intentionally produce a biometrically faithful synthetic person. For the gait evaluators, the assumption is that each model should react to the generated video only through the modality it consumes: RGB, silhouette, or skeleton.
Data comes from two public gait datasets: SUSTech1K, described as a large-scale controlled-environment dataset with multiple viewpoints, and CCGR-MINI, which stresses cross-condition recognition with harder changes such as clothing and carrying objects. The paper selects 250 identities from each dataset, and for each identity generates four videos per GenAI model, yielding 16,000 generated videos total, or 1,000 per GenAI method per dataset. The excerpt states that none of the test identities or sequences are included in the training data of the evaluated animation models, so evaluation is on strictly unseen data. For Task 1 (Gait Restoration), they use the first frame of the source video as the reference image and the remaining frames as the driving motion; Scenario B introduces pose misalignment between the reference and driving frames; Scenario C reuses the same first-frame/remaining-frames protocol but on CCGR-MINI. For Task 2 (Identity Transfer), the reference image and the driving motion come from different identities, with no pixel-aligned ground truth.
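To make the two conditioning protocols concrete, here is a minimal sketch (our own, not from the paper; the dataclass and function names are hypothetical) of how the restoration and identity-transfer inputs could be assembled from raw frame lists:

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class AnimationInput:
    """Conditioning pair handed to a human-animation generator."""
    reference_image: np.ndarray        # single RGB frame supplying appearance
    driving_frames: List[np.ndarray]   # frame sequence supplying motion


def restoration_pair(video: List[np.ndarray]) -> AnimationInput:
    """Task 1 (gait restoration): the first frame is the reference and the
    remaining frames drive the motion, so a perfect generator reproduces the
    same subject walking the same way and pixel metrics stay meaningful."""
    return AnimationInput(reference_image=video[0], driving_frames=video[1:])


def identity_transfer_pair(appearance_video: List[np.ndarray],
                           motion_video: List[np.ndarray]) -> AnimationInput:
    """Task 2 (identity transfer): appearance and motion come from two
    different identities, so there is no pixel-aligned ground truth and
    fidelity has to be judged by downstream gait recognition instead."""
    return AnimationInput(reference_image=appearance_video[0],
                          driving_frames=motion_video)
```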
The architecture side is a two-stage pipeline: first generate the animation, then evaluate gait identity. They test four state-of-the-art animation models spanning different motion-conditioning schemes. MagicAnimate uses DensePose guidance on a video diffusion backbone with an appearance encoder and video fusion; AnimateAnyone uses a 2D skeleton pose guider and ReferenceNet; UniAnimate uses 17-keypoint DWPose-style guidance plus a unified noise input to support long-range temporal coherence; MTVCrafter uses SMPL joints and 4D motion tokens in a Video DiT backbone. On the evaluation side, they implement four gait recognizers within OpenGait and force them onto a shared GaitBase embedding head to reduce confounds from different projection heads. The four recognizers are DeepGaitV2 (silhouette-based appearance baseline), SkeletonGait (skeleton-only), SkeletonGait++ (skeleton plus silhouettes), and BigGait (RGB with DINOv2 features). This alignment is important: any differences in results are intended to come from the generated inputs, not from a different final classifier.
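The shared-head idea can be illustrated with a small, hypothetical wrapper. The real OpenGait/GaitBase head is more elaborate (part-based pooling, separate FC layers), so this is only a sketch of the confound-control logic, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedHeadRecognizer(nn.Module):
    """Hypothetical wrapper: each gait backbone (silhouette-, skeleton-, or
    RGB-based) is forced through the same projection head, so retrieval
    differences reflect what the generated videos contain rather than
    which classifier head sits on top."""

    def __init__(self, backbone: nn.Module, feat_dim: int, embed_dim: int = 256):
        super().__init__()
        self.backbone = backbone                           # e.g. a DeepGaitV2 or SkeletonGait trunk
        self.shared_head = nn.Linear(feat_dim, embed_dim)  # stand-in for the GaitBase-style head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                 # (B, feat_dim) pooled sequence features
        emb = self.shared_head(feats)
        return F.normalize(emb, dim=-1)          # unit-norm embedding for cosine retrieval
```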
The authors do not re-train the animation models; they use pretrained off-the-shelf checkpoints from the cited projects. The excerpt reports no new training of the generators or gait models, no optimizer choices, no epoch counts, and no random-seed protocol, so the experimental work is evaluation-only. For preprocessing, the paper masks foreground person regions when computing SSIM/PSNR/LPIPS so that background hallucinations do not dominate reconstruction scores. The generated videos are then converted into each gait modality expected by the downstream recognizers: silhouettes, skeleton maps, or RGB clips. That modality conversion is central to the paper's logic because it lets the authors ask whether the same synthetic video contains both appearance and motion evidence, or only one of them.
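A minimal sketch of foreground-masked fidelity scoring, assuming per-frame person masks are available; the authors' exact masking procedure and their LPIPS handling (typically done with the `lpips` package on normalized tensors) may differ:

```python
import numpy as np
from skimage.metrics import structural_similarity


def masked_visual_fidelity(generated: np.ndarray,
                           reference: np.ndarray,
                           person_mask: np.ndarray) -> dict:
    """Score visual fidelity only over the foreground person region.

    generated/reference: HxWx3 uint8 frames; person_mask: HxW boolean mask.
    Background pixels are zeroed before SSIM and excluded from the PSNR
    error term so hallucinated backgrounds do not dominate the score.
    """
    mask3 = person_mask[..., None]
    gen_fg = np.where(mask3, generated, 0).astype(np.float64)
    ref_fg = np.where(mask3, reference, 0).astype(np.float64)

    # SSIM over the masked frames (a simplification: windows straddling the
    # mask boundary still mix in some zeroed background).
    ssim = structural_similarity(gen_fg, ref_fg, channel_axis=-1, data_range=255)

    # PSNR from the MSE over foreground pixels only.
    diff = (gen_fg - ref_fg)[person_mask]
    mse = float(np.mean(diff ** 2))
    psnr = float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
    return {"ssim": float(ssim), "psnr": psnr}
```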
Evaluation is split into two tasks. Task 1 measures gait restoration quality with three scenarios: Baseline, Off-Pose, and In-the-Wild. For these, the authors report SSIM, PSNR, and LPIPS for visual fidelity, then Rank-1, Rank-5, and mAP for gait recognition using each of the four evaluators. Task 2 is a zero-shot identity-transfer stress test. Because there is no ground-truth frame pair for the transferred identity, they compare the generated output against the reference identity and separately against the driving identity. The key diagnostic is whether identity scores come from the person whose appearance was supplied or from the person whose motion was supplied. One concrete example from Table 4: AnimateAnyone on BigGait gets 86.10% Rank-1 against the reference identity but only 0.20% against the driving motion, which is the paper’s clearest evidence that the model preserves appearance much better than behavioral identity. The excerpt does not mention statistical significance tests, cross-validation, or confidence intervals, and no code-release or frozen-weight details are provided here beyond links to the original model repositories.
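The retrieval metrics follow standard gallery/probe identification. A small sketch of Rank-1 scoring over L2-normalised embeddings (our own illustration, not the paper's code) also shows how the reference-vs-driving diagnostic amounts to relabelling the same probes:

```python
import numpy as np


def rank1_accuracy(probe_emb: np.ndarray, probe_labels: np.ndarray,
                   gallery_emb: np.ndarray, gallery_labels: np.ndarray) -> float:
    """Rank-1 identification: each probe embedding retrieves its most similar
    gallery embedding (cosine similarity on L2-normalised vectors) and scores
    a hit if that gallery item carries the probe's identity label."""
    p = probe_emb / np.linalg.norm(probe_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = p @ g.T                                   # (num_probes, num_gallery)
    top1 = gallery_labels[np.argmax(sims, axis=1)]   # best-matching identity per probe
    return float(np.mean(top1 == probe_labels))


# Identity-transfer diagnostic: the same generated probes are scored twice,
# once labelled with the reference (appearance) identity and once with the
# driving (motion) identity; whichever labelling scores higher reveals which
# signal the recognizer is actually using.
# r1_ref = rank1_accuracy(gen_emb, reference_ids, gallery_emb, gallery_ids)
# r1_drv = rank1_accuracy(gen_emb, driving_ids, gallery_emb, gallery_ids)
```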
Technical innovations
- They benchmark four modern human-animation generators specifically on biometric fidelity, not just visual quality, using gait recognition as the downstream test.
- They introduce a two-task evaluation design: gait restoration under increasing difficulty and zero-shot identity transfer with reference-vs-driving validation.
- They compare generated videos across multiple gait modalities (RGB, silhouettes, skeletons) to separate appearance preservation from motion preservation.
- They show that dense structural guidance (DensePose/SMPL-style) can improve gait fidelity relative to sparse 2D keypoints, but only partially under real-world shifts.
Datasets
- SUSTech1K — 250 identities selected; 1,000 generated videos per GenAI method and dataset (4 per identity) — public dataset
- CCGR-MINI — 250 identities selected; 1,000 generated videos per GenAI method and dataset (4 per identity) — public dataset
Baselines vs proposed
- MagicAnimate vs AnimateAnyone on SUSTech1K baseline visual quality: SSIM 0.82 vs 0.77; PSNR 19.51 vs 16.50; LPIPS 0.17 vs 0.26
- UniAnimate vs MagicAnimate on CCGR-MINI visual quality: SSIM 0.74 vs 0.72; PSNR 16.02 vs 15.41; LPIPS 0.28 vs 0.28
- AnimateAnyone vs MagicAnimate on baseline BigGait Rank-1: 87.50% vs 75.80%
- MagicAnimate vs AnimateAnyone on baseline DeepGaitV2 Rank-1: 58.90% vs 26.70%
- MagicAnimate vs UniAnimate on baseline SkeletonGait++ Rank-1: 69.80% vs 38.90%
- Real Data vs best generated on in-the-wild BigGait Rank-1: 22.60% vs 12.00%
- AnimateAnyone vs MagicAnimate on identity-transfer BigGait, driving-motion Rank-1: 0.20% vs 0.60%
- AnimateAnyone vs MagicAnimate on identity-transfer BigGait, reference-image Rank-1: 86.10% vs 61.00%
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2512.19275.

Fig 1: Motivation. We examine whether synthetic videos generated through state-of-the-art …

Fig 2: Overview of the generation and evaluation pipeline. The pipeline extracts different …
Limitations
- The excerpt provides no statistical significance testing, confidence intervals, or variance across random seeds, so it is hard to judge robustness beyond point estimates.
- The evaluation uses only two datasets and 250 identities per dataset, which is useful but still modest relative to the diversity of real-world gait conditions.
- The paper evaluates pretrained generators rather than training or fine-tuning them for biometric fidelity, so it cannot separate model-inherent limits from checkpoint-specific limits.
- Task 2 has no pixel-aligned ground truth, so it relies entirely on downstream recognition as the proxy for fidelity.
- The generation setups are constrained to specific reference-image plus driving-video protocols; other prompting or conditioning regimes may behave differently.
- The authors focus on gait-recognition models in OpenGait, but do not test whether a different biometric recognizer or temporal model would change the appearance-vs-motion conclusion.
Open questions / follow-ons
- Would training the generator explicitly with a gait-preservation loss improve motion identity without collapsing visual quality?
- How would results change under cross-dataset transfer to more diverse surveillance footage or outdoor walking data?
- Can a truly motion-centric gait recognizer maintain accuracy when appearance is systematically swapped, or are current skeleton pipelines still too brittle?
- Is there a measurable trade-off frontier between identity-preserving appearance and motion-preserving structure for different generator architectures?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, the practical takeaway is that visual realism alone is a weak assurance signal. A synthetic human video may look convincing while still failing to preserve the behavioral cues that some security systems implicitly rely on, and the reverse is also true: a generator can preserve clothing or body shape enough to fool RGB-based re-identification while not reproducing the underlying gait motion. That matters if a defense pipeline uses gait as a secondary liveness or identity signal, because appearance-driven recognizers may be brittle to identity transfer and can be spoofed by synthetic content that carries the right texture but not the right motion.
More broadly, the paper suggests that testing against multiple modalities is essential. If a defender only checks RGB-level realism, they may miss that the motion signature has been lost or fabricated; if they only check silhouettes or skeletons, they may underestimate appearance leakage. A robust evaluation stack for synthetic-human abuse detection should therefore compare appearance-based and motion-based cues separately, ideally under held-out identities and cross-condition shifts similar to the paper’s baseline, pose-offset, and in-the-wild scenarios. For practitioners, the work is a reminder to treat gait recognition as a biometric system with distinct vulnerability surfaces, not as a single monolithic notion of “human-likeness.”
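As a practical illustration (our own sketch, not something proposed in the paper), a defense pipeline could flag clips where an appearance-based matcher and a motion-based matcher disagree strongly about the same claimed identity:

```python
import numpy as np


def flag_identity_mismatch(appearance_score: np.ndarray,
                           motion_score: np.ndarray,
                           gap_threshold: float = 0.4) -> np.ndarray:
    """Flag clips where an appearance-based matcher is confident about a
    claimed identity but a motion-based (skeleton/silhouette) matcher is not.

    Both inputs are per-clip similarity scores against the claimed identity's
    enrolled templates, scaled to [0, 1]. Per the paper's findings, synthetic
    identity-transfer clips tend to carry the right texture but not the right
    gait, so a large appearance-minus-motion gap is a useful review trigger.
    The threshold is a placeholder and would need tuning on real traffic.
    """
    gap = appearance_score - motion_score
    return gap > gap_threshold        # boolean mask of clips to escalate
```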
Cite
@article{arxiv2512_19275,
  title={Is Visual Realism Enough? Evaluating Gait Biometric Fidelity in Generative AI Human Animation},
  author={Ivan DeAndres-Tame and Chengwei Ye and Ruben Tolosana and Ruben Vera-Rodriguez and Shiqi Yu},
  journal={arXiv preprint arXiv:2512.19275},
  year={2025},
  url={https://arxiv.org/abs/2512.19275}
}