MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers
Source: arXiv:2501.03931 · Published 2025-01-07 · By Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, and Jiaya Jia
TL;DR
MagicMirror targets a specific bottleneck in personalized video generation: you want a subject’s identity to stay stable across frames, but you also want the video to move naturally instead of looking like a static face pasted onto a moving background. The paper’s claim is that prior ID-preserving video systems tend to fall into one of two failure modes: either they need per-subject tuning, or they preserve identity by suppressing motion diversity. MagicMirror is positioned as a zero-shot, fine-tuning-free framework built on a Video Diffusion Transformer backbone to balance those two objectives.
What is new is not just “another adapter,” but a three-part design: a dual-branch facial feature extractor splits identity from structural facial information; a lightweight cross-modal adapter adds Conditioned Adaptive Normalization (CAN) to inject face conditions into a full-attention DiT backbone; and a two-stage data/training pipeline uses synthetic identity pairs plus video post-training to compensate for scarce paired ID-video data. The reported results show a favorable tradeoff: on their benchmark, MagicMirror leads in dynamic degree, text alignment, and user preference, matches the strongest baselines on identity similarity, and produces more dynamic facial motion than competing ID-preserving methods.
Key findings
- On the paper’s 40-prompt human-centric benchmark, MagicMirror reaches dynamic degree 0.705, beating ConsisID (0.615), CogVideoX-I2V (0.660), DynamiCrafter (0.455), EasyAnimate-I2V (0.155), and ID-Animator (0.140).
- MagicMirror reports the best text alignment in Table 1 at 0.240, narrowly above ConsisID (0.236) and ahead of CogVideoX-I2V (0.213) and ID-Animator (0.211).
- MagicMirror’s average ID similarity is 0.922 in Table 1, higher than ConsisID (0.913), CogVideoX-I2V (0.901), EasyAnimate-I2V (0.903), and DynamiCrafter (0.896); only ID-Animator is marginally higher at 0.923, but at a far lower dynamic degree (0.140).
- For face motion, MagicMirror achieves FMref = 0.730 and FMinter = 0.610, both better than ConsisID (0.652 / 0.601) and much higher than I2V baselines such as CogVideoX-I2V (0.413 / 0.532).
- In the user study (Table 2), MagicMirror gets the highest scores in visual quality (6.97), text alignment (8.88), dynamic degree (7.02), and ID similarity (6.39).
- Ablation Table 3 shows removing CAN drops ID similarity from 0.911 to 0.886 on the same training scale, indicating the normalization-based conditioning is not just cosmetic.
- Table 3 also shows that the full dual-branch + CAN + pretraining setup outperforms weaker variants: the full model reaches FMinter 0.665 versus 0.568 without CAN and 0.559 without pretraining.
- The paper reports that all training runs used 30K image-pretrain iterations and 5K video fine-tuning iterations on 8 NVIDIA A800 GPUs, which is useful for estimating adaptation cost and reproducibility burden.
Threat model
The implicit threat model is zero-shot identity customization under prompt-conditioned video generation: the system receives a reference face image (or a few images) and a text prompt, and must preserve identity across generated frames without per-subject finetuning. The adversarial difficulty is mainly distributional: the model must resist identity drift, facial collapse, and motion suppression while generating natural video. The paper does not consider an external attacker trying to invert, steal, or evade the model; it also does not define hard constraints on what the model cannot access beyond the reference identity inputs and training corpora.
Methodology — deep read
Threat model and task framing: this is not a security adversary model in the classical sense; the relevant “adversary” is the modeling challenge of generating a video from a single reference face while preserving identity across a changing scene, pose, and motion. The authors assume access to one or more reference images of the target identity plus a text prompt, but not per-identity optimization at test time. The method is designed to work in a zero-shot personalization setting, meaning the model should generalize to unseen identities without finetuning on each person.
Data: the training pipeline is explicitly two-stage. For the image pre-training stage, they use LAION-Face (reported as web-scale real images; exact count not given in the text beyond being a dataset source), SFHQ (120K), and FFHQ (reported as 70K + 132K in Fig. 5, though the text is a bit ambiguous about what those two numbers correspond to in the pipeline). They also synthesize identity-consistent image pairs using PhotoMakerV2, filtering candidate pairs by cosine similarity of facial embeddings: only pairs with cos(q_face^a, q_face^b) > 0.65 are kept. For the video post-training stage, they use Pexels (29K), Mixkit (120K), and a small self-collected web video set, with video captions generated by CogVLM; for image text prompts they use a 29K prompt pool generated via MiniGemini-8B captions. The paper also says they use self-reference images and synthesized image/video pairs to address scarcity of high-quality paired ID-video data. One concrete example end-to-end: an FFHQ face image is paired with a sampled human-caption prompt, PhotoMakerV2 generates a synthetic identity-conditioned image pair, ArcFace detects and embeds the face, and the pair is filtered by embedding similarity before being used in the image pretrain stage; later, a keyframe from a Pexels/Mixkit clip gets a synthesized reference image and is used in the video fine-tuning stage to teach temporal consistency.
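To make the pair-filtering step concrete, here is a minimal sketch; `embed_face` stands in for an ArcFace-style embedder and only the 0.65 threshold comes from the paper, so the rest is illustrative rather than the authors’ code.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two face embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_identity_pairs(pairs, embed_face, threshold=0.65):
    """Keep only synthetic image pairs whose face embeddings agree.

    `pairs` yields (image_a, image_b) tuples; `embed_face` maps an image to
    an identity embedding (e.g., from an ArcFace-style recognizer).
    """
    kept = []
    for img_a, img_b in pairs:
        q_a, q_b = embed_face(img_a), embed_face(img_b)
        if cosine_sim(q_a, q_b) > threshold:  # cos(q_face^a, q_face^b) > 0.65
            kept.append((img_a, img_b))
    return kept
```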
Architecture / algorithm: MagicMirror starts from CogVideoX-5B (a Video Diffusion Transformer) and adds facial-specific modules into alternating DiT layers, specifically even-indexed layers. The facial condition encoder is dual-branch. A CLIP ViT encoder produces dense image features f. One branch, the identity branch, uses ArcFace-derived identity-aware features q_id and an identity perceiver τ_id to extract x_id. The other, the structural branch, uses a learnable query q_face and a face perceiver τ_face to extract x_face, intended to carry fine-grained structural cues needed for realistic facial motion. The identity branch is fused into the text embedding space via a fusion MLP and a token mask m so identity is only injected at identity-relevant tokens (e.g., “man”, “woman”): the adapted text embedding is a masked replacement of those tokens. The structural branch is passed as direct conditioning.
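A minimal sketch of the masked text-token fusion, assuming the fusion MLP operates on concatenated identity and text features and the token mask marks identity-relevant tokens; the module name, shapes, and MLP layout are assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class MaskedIdentityFusion(nn.Module):
    """Inject identity features only at identity-relevant text tokens."""

    def __init__(self, id_dim: int, text_dim: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(id_dim + text_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, text_emb, x_id, token_mask):
        # text_emb: (B, T, D_text); x_id: (B, D_id);
        # token_mask: (B, T) bool, True at tokens like "man" / "woman".
        id_tokens = x_id.unsqueeze(1).expand(-1, text_emb.size(1), -1)
        fused = self.fuse(torch.cat([text_emb, id_tokens], dim=-1))
        # Masked replacement: non-identity tokens keep the original embedding.
        return torch.where(token_mask.unsqueeze(-1), fused, text_emb)
```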
The adapter mechanism is the key DiT-specific novelty. Because CogVideoX uses full self-attention across modalities rather than the simpler U-Net cross-attention setup used by many personalization methods, the paper adds a cross-modal adapter with modality-specific adaptive normalization. For the face modality, CAN predicts shift, scale, and gate parameters from time embedding t and layer index l, with a distribution prior initialized from identity features (their Eq. 7). The normalized feature is transformed as x̄_n = x_{n-1} * (1 + σ_n) + μ_n, then residual attention/FFN is applied with gating γ_n. They also retain an explicit cross-attention path over the face features: the final attention output sums full self-attention and a separate cross-attention over x_face, using trainable key/value projections for the cross path. The authors’ stated motivation is that pure cross-attention is not enough in a full-attention DiT, so the CAN acts as a distribution-level prior while cross-attention provides direct feature guidance.
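A compact sketch of what a CAN block could look like under that description; only the modulation equation and its inputs (timestep, layer index, identity prior) come from the text, while the additive conditioning, the learned layer embedding, and the absence of an explicit pre-normalization are assumptions.

```python
import torch
import torch.nn as nn

class ConditionedAdaptiveNorm(nn.Module):
    """Predict shift/scale/gate for the face modality from the timestep
    embedding, the DiT layer index, and an identity-derived prior."""

    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.layer_emb = nn.Embedding(num_layers, dim)
        # One projection producing shift (mu), scale (sigma), and gate (gamma).
        self.to_msg = nn.Sequential(nn.SiLU(), nn.Linear(dim, 3 * dim))

    def forward(self, x, t_emb, layer_idx, id_prior):
        # x: (B, N, D) face-branch tokens; t_emb, id_prior: (B, D);
        # layer_idx: (B,) long tensor with the current layer index.
        cond = t_emb + self.layer_emb(layer_idx) + id_prior
        mu, sigma, gamma = self.to_msg(cond).chunk(3, dim=-1)
        # x̄_n = x_{n-1} * (1 + σ_n) + μ_n
        x_bar = x * (1 + sigma.unsqueeze(1)) + mu.unsqueeze(1)
        # gamma gates the residual attention/FFN applied downstream:
        # x_n = x_{n-1} + gamma * block(x_bar)
        return x_bar, gamma
```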
Training regime: the model is trained in two stages. Stage 1 is image pre-training for 30K iterations with global batch size 64 and a decayed learning rate starting at 1e-5. Stage 2 is video fine-tuning for 5K iterations with batch size 8, also with decayed learning rate starting at 1e-5. The paper says the training is done on a single node with 8 NVIDIA A800 GPUs. The loss is a sum of diffusion noise prediction loss and an identity-aware cosine term: L = L_noise + λ(1 - cos(q_face, D(x0))). They also follow PhotoMaker and compute denoising loss on the face region for 50% of random training samples. The text does not specify λ, optimizer, EMA, dropout, or seed strategy in the extracted portion, so those details are unclear from the source provided.
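A sketch of that objective under stated assumptions: λ and the mechanics of the 50% face-region trick are not given in the excerpt, so `lam` and the mask handling below are illustrative.

```python
import torch
import torch.nn.functional as F

def training_loss(noise_pred, noise_target, q_face, q_decoded,
                  face_mask=None, lam=0.1, face_region_prob=0.5):
    """L = L_noise + lambda * (1 - cos(q_face, D(x0))).

    `q_face` is the reference identity embedding and `q_decoded` the embedding
    extracted from the decoded prediction D(x0); `lam` is a placeholder value.
    """
    if face_mask is not None and torch.rand(()) < face_region_prob:
        # For ~50% of samples, restrict the denoising loss to the face region
        # (the PhotoMaker-style trick mentioned above).
        noise_pred = noise_pred * face_mask
        noise_target = noise_target * face_mask
    l_noise = F.mse_loss(noise_pred, noise_target)
    l_id = 1.0 - F.cosine_similarity(q_face, q_decoded, dim=-1).mean()
    return l_noise + lam * l_id
```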
Evaluation protocol and reproducibility: evaluation is against DynamiCrafter, EasyAnimate-I2V, CogVideoX-I2V, ID-Animator, and ConsisID. The test set consists of 40 single-character prompts from VBench for demographic diversity and 40 action-specific prompts for motion assessment. Identity references are sampled from 50 identities from PubFig, generating four personalized videos per identity under varied prompts. Metrics include VBench-style dynamic degree, text alignment, Inception Score, motion smoothness via cross-frame optical flow consistency, average ID similarity using facial recognition embeddings, similarity decay across uniformly sampled frames, and two facial motion metrics FMref and FMinter computed from RetinaFace landmarks after alignment. They also run a human study with 173 participants rating motion dynamics, text-motion alignment, video quality, and identity consistency on 1–10 scales. For reproducibility, the authors state that code and model will be publicly available, but in the provided text there is no frozen release yet and several data sources are a mixture of public and self-collected datasets.
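As a rough illustration of two of the identity metrics, assuming the embeddings come from a face-recognition model applied to uniformly sampled frames; the paper’s exact decay definition is not spelled out in the excerpt, so the first-to-last difference below is an assumption.

```python
import numpy as np

def identity_metrics(ref_emb, frame_embs):
    """Average reference-to-frame cosine similarity and first-to-last decay."""
    ref = ref_emb / np.linalg.norm(ref_emb)
    sims = [float(ref @ (e / np.linalg.norm(e))) for e in frame_embs]
    avg_sim = float(np.mean(sims))
    sim_decay = sims[0] - sims[-1]  # positive => identity drifts over the clip
    return avg_sim, sim_decay
```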
Technical innovations
- Dual-branch facial conditioning separates identity preservation from motion-carrying facial structure instead of forcing a single embedding to do both jobs.
- Conditioned Adaptive Normalization (CAN) adapts a full-attention Video DiT to identity conditioning by learning shift/scale/gate parameters from time, layer, and identity priors.
- A masked text-embedding fusion injects identity only at semantically relevant tokens, reducing collateral distortion of the prompt.
- A two-stage training scheme combines synthetic ID pairs for image pretraining with video post-training to compensate for scarce paired ID-video supervision.
Datasets
- LAION-Face — web-scale; exact size not specified in extracted text — public
- SFHQ — 120K — public
- FFHQ — 70K + 132K (as shown in Fig. 5; exact split unclear in extracted text) — public
- Pexels — 29K — public
- Mixkit — 120K — public
- PubFig — 50 identities used for evaluation — public
- Self-collected web videos — small collection; exact size not specified — self-collected
Baselines vs proposed
- DynamiCrafter: dynamic degree = 0.455 vs proposed: 0.705
- EasyAnimate-I2V: text alignment = 0.177 vs proposed: 0.240
- CogVideoX-I2V: Inception Score = 9.85 vs proposed: 10.59
- ID-Animator: average ID similarity = 0.923 vs proposed: 0.922
- ConsisID: dynamic degree = 0.615 vs proposed: 0.705
- ConsisID: overall preference = 6.640 vs proposed: 7.315
- CogVideoX-I2V: FMinter = 0.532 vs proposed: 0.610
- Ablation w/o CAN: ID similarity = 0.886 vs full MagicMirror: 0.911
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2501.03931.

Fig 1: MagicMirror generates text-to-video results given the ID reference image.

Fig 2: MagicMirror generates dynamic facial motion.

Fig 3: Overview of MagicMirror. The framework employs a dual-branch feature extraction system with ID and face perceivers.
Limitations
- The paper explicitly notes difficulty with multiple identities and with preserving attributes beyond the face, such as clothing.
- Evaluation is centered on a single-person, human-centric benchmark; it does not show robust multi-subject scenes or crowded interactions.
- The training data mix includes synthetic pairs and some self-collected video, which may introduce bias or domain leakage that is hard to quantify from the paper text.
- Some reported dataset sizes and training details are partially ambiguous in the extracted text, which makes exact reproduction harder without the appendix/code.
- No held-out attacker or adversarial robustness analysis is reported; the main focus is generation quality and identity consistency, not misuse resistance.
- The paper compares against several baselines, but the fairness of per-baseline tuning budgets and prompt sensitivity is not fully characterized in the excerpt.
Open questions / follow-ons
- Can the dual-branch + CAN design be extended to multiple identities in the same shot without collapsing one subject’s identity?
- How much of the gain comes from the architecture versus the synthetic pair generation pipeline, and does the method still help with fully real paired data?
- Does the masked text-token fusion generalize to non-face attributes such as hairstyle, clothing, or accessories without overconstraining motion?
- How stable is identity preservation under stronger distribution shifts: extreme poses, occlusions, long sequences, or non-photorealistic styles?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, the main relevance is defensive rather than direct: this paper shows that modern video generators can preserve a target face across time while still producing believable motion, which raises the bar for liveness checks and synthetic-video detection. A system that only looks for static face swaps or frame-to-frame identity drift will be less reliable if generators can maintain both identity and motion coherence.
If you’re designing anti-bot or identity-verification pipelines, the practical takeaway is to assume that reference-image-driven video synthesis is improving quickly. That pushes defenses toward multi-signal verification: temporal micro-motion consistency, device and sensor provenance, challenge-response timing, network fingerprints, and adversarially trained detectors that see generated video families rather than just single-frame deepfakes. It also suggests that “looks like the same person across frames” is no longer a strong authenticity signal by itself.
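As a trivial illustration of the multi-signal idea, here is a weighted aggregation sketch; the signal names, weights, and linear combination are invented for the example, not a recommended configuration.

```python
def risk_score(signals: dict, weights: dict) -> float:
    """Combine per-signal suspicion scores in [0, 1] into one weighted risk value.

    Example keys: 'micro_motion_consistency', 'sensor_provenance',
    'challenge_timing', 'network_fingerprint', 'generated_video_detector'.
    """
    total_w = sum(weights.get(k, 0.0) for k in signals)
    if total_w == 0.0:
        return 0.0
    return sum(score * weights.get(k, 0.0) for k, score in signals.items()) / total_w
```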
Cite
@article{arxiv2501_03931,
  title={MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers},
  author={Yuechen Zhang and Yaoyang Liu and Bin Xia and Bohao Peng and Zexin Yan and Eric Lo and Jiaya Jia},
  journal={arXiv preprint arXiv:2501.03931},
  year={2025},
  url={https://arxiv.org/abs/2501.03931}
}