DLADiff: A Dual-Layer Defense Framework against Fine-Tuning and Zero-Shot Customization of Diffusion Models

Source: arXiv:2511.19910 · Published 2025-11-25 · By Jun Jia, Hongyi Miao, Yingjie Zhou, Linhan Cao, Yanwei Jiang, Wangqiu Zhou et al.

TL;DR

DLADiff addresses a gap that most prior anti-customization work leaves open: defenses for diffusion-model fine-tuning existed, but zero-shot identity customization (for example, FaceID-style methods that use a single reference image without weight updates) remained largely undefended. The paper’s core idea is to treat these as two different threat surfaces and apply two different perturbation layers: one optimized to poison fine-tuning via surrogate-model-based adversarial training, and a second, simpler perturbation targeted at fixed identity encoders used by zero-shot pipelines.

The reported results show a clear split in effectiveness by attack type. On fine-tuning defenses, DLADiff improves over prior baselines on identity-degradation metrics while also degrading the visual fidelity of the attacker's generated outputs, and it generalizes across DreamBooth/LoRA and across SD model versions. On zero-shot defenses, the effect is far more dramatic: Table 4 shows ISMpro/ISMgen dropping from roughly 0.95/0.40-0.62 for prior defenses to about 0.07-0.09/0.038-0.084 for DLADiff across FaceID and Instance-ID, indicating that the protected images become far less useful for identity-conditioned generation.

Key findings

  • Table 1: On CelebA-HQ with DreamBooth and prompt "a photo of sks person", DLADiff reduces FDR to 0.201 and ISM to 0.096, versus Anti-diffusion at 0.802 FDR and 0.425 ISM.
  • Table 1: On VGGFace2 with the same prompt, DLADiff reaches 0.608 FDR and 0.263 ISM, compared with Anti-diffusion at 0.824 FDR and 0.318 ISM.
  • Table 2: When optimized against DreamBooth but evaluated under LoRA transfer, DLADiff achieves FDR = 0.728, ISM = 0.199, FID = 182.4, and MOS = 1.37, outperforming Anti-diffusion (FDR = 0.825, ISM = 0.298, MOS = 2.72).
  • Table 3: Under transfer from SD-v2.1 protection to SD-v1.5 fine-tuning, DLADiff obtains FDR = 0.070, ISM = 0.031, FID = 407.8, and NSFWR = 0.733, substantially stronger than Anti-diffusion (FDR = 0.258, ISM = 0.139, NSFWR = 0.493).
  • Table 4: For FaceID zero-shot generation on CelebA-HQ, DLADiff drives ISMpro to 0.090 and ISMgen to 0.039, versus 0.951-0.971 ISMpro and 0.405-0.412 ISMgen for prior defenses.
  • Table 4: For Instance-ID zero-shot generation on VGGFace2, DLADiff reaches ISMpro = 0.077 and ISMgen = 0.058, versus 0.960-0.968 ISMpro and 0.606-0.622 ISMgen for prior defenses.
  • Table 5 ablation: removing DSUR raises ISM from 0.096 to 0.160 and FDR from 0.201 to 0.316; removing ADFT raises ISM to 0.277 and FDR to 0.607, showing both components matter.
  • Table 5 ablation: removing the zero-shot layer collapses zero-shot protection, with FaceID ISMpro/ISMgen returning to 0.974/0.398 and Instance-ID to 0.965/0.618.

Threat model

The attacker is an unauthorized diffusion-model user who acquires protected personal photos and attempts to reproduce the target identity either by fine-tuning a generative model (DreamBooth, LoRA) or by using a zero-shot identity-conditioning method (FaceID, Instance-ID) that relies on a fixed face encoder. The attacker may choose prompts and model variants, but is assumed not to have access to any secret key or ability to remove the published perturbation before using the images. The defender can publish only perturbed images and may use white-box access to surrogate models during perturbation generation, but cannot assume control over the attacker’s downstream training pipeline.

Methodology — deep read

Threat model: the adversary is an unauthorized user who obtains protected portrait images and tries to customize a diffusion model to recreate a specific identity. The paper explicitly covers two adversary classes: (1) fine-tuning attackers who can run DreamBooth or LoRA on a small image set and update model weights, and (2) zero-shot attackers who feed one reference image into a fixed identity encoder (for example, the ArcFace-style encoders used by FaceID and Instance-ID) without changing any weights. The defender can preprocess the published images to add bounded perturbations, but cannot control the attacker's training procedure or inference prompt. The paper assumes the attacker may use different prompts at inference time and may transfer from the surrogate model used during protection to related models (for example, SD-v2.1 to SD-v1.5, or DreamBooth to LoRA).

Data and preprocessing: the evaluation uses the Anti-DreamBooth benchmark, which the paper says contains 50 identities from CelebA-HQ and 50 identities from VGGFace2, with 12-15 high-quality 512×512 portraits per identity. For fine-tuning-defense experiments, the original-resolution images are used; for zero-shot-defense experiments, images are face-aligned and normalized to 112×112. The paper also constructs a clean per-identity set X_clean for the static surrogate model; importantly, this clean set can include the very images that will later be protected. For the zero-shot layer, the method extracts a facial crop x_f′ via face alignment, applies the perturbation in that aligned coordinate system, and then maps it back to the original image space via an inverse affine transformation. The paper mentions adding slight random noise to the affine matrix to improve robustness to landmark-detection variation.
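To make the align-perturb-paste-back step concrete, here is a minimal sketch under stated assumptions: the 5-point ArcFace template, the landmark source, and the jitter scale are my own choices, since the excerpt does not specify the alignment pipeline beyond "face alignment, inverse affine, slight noise on the matrix".

```python
# Hypothetical sketch of the aligned-space perturbation described above.
# Landmark detection, the 112x112 template, and the jitter scale are assumptions.
import cv2
import numpy as np

ARC_TEMPLATE = np.array([  # common 5-point ArcFace template for 112x112 crops
    [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
    [41.5493, 92.3655], [70.7299, 92.2041]], dtype=np.float32)

def protect_face(img: np.ndarray, landmarks: np.ndarray,
                 delta_zs: np.ndarray, jitter: float = 0.01) -> np.ndarray:
    """Apply an aligned-space perturbation delta_zs (112x112x3) to img."""
    # Estimate the similarity transform into the aligned 112x112 frame, with
    # slight random noise on the matrix for robustness to landmark drift.
    M, _ = cv2.estimateAffinePartial2D(landmarks.astype(np.float32), ARC_TEMPLATE)
    M = M + jitter * np.random.randn(*M.shape)
    face = cv2.warpAffine(img, M, (112, 112))
    face_adv = np.clip(face.astype(np.float32) + delta_zs, 0, 255).astype(np.uint8)
    # Map the perturbed crop back into the original image via the inverse affine.
    M_inv = cv2.invertAffineTransform(M)
    h, w = img.shape[:2]
    back = cv2.warpAffine(face_adv, M_inv, (w, h))
    mask = cv2.warpAffine(np.full((112, 112), 255, np.uint8), M_inv, (w, h))
    out = img.copy()
    out[mask > 0] = back[mask > 0]   # composite only the warped face region
    return out
```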

Architecture / algorithm: DLADiff is a two-layer perturbation framework, not a new generative model. The first layer targets fine-tuning-based customization and is the more elaborate part. It introduces a dual-surrogate setup: a static surrogate UNet_s that is first fine-tuned on clean images with DreamBooth-style training using the prompt "a photo of sks person" (the paper says only the UNet weights are updated, for efficiency), and a dynamic surrogate UNet_d initialized from pretrained weights. The key novelty is the DSUR loss, which compares both cross-attention maps M_c and self-attention maps M_s between the static surrogate on clean input x and the dynamic surrogate on perturbed input x + δ_ft. The intuition is that the static fine-tuned surrogate reflects the attention patterns the attacker would eventually learn, so optimizing against both global structure and localized identity features should poison actual customization more effectively than attacks based only on pretrained attention. The ADFT loop alternates three steps: maximize L_att with PGD while freezing both surrogates, maximize L_cond on the dynamic surrogate, then update the dynamic surrogate by minimizing the DreamBooth loss on perturbed images. One epoch is this three-step cycle; the paper does not clearly state the total number of epochs in the excerpted text, only that the process repeats until a preset epoch count is reached.
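To make the alternation concrete, here is a minimal sketch of the ADFT cycle under stated assumptions: L_att, L_cond, and the DreamBooth loss are passed in as callables because the excerpt does not give their exact forms, and the epoch and PGD-step counts are placeholders rather than the paper's values.

```python
# A minimal ADFT sketch under stated assumptions; loss callables close over the
# frozen static surrogate and the trainable dynamic surrogate, respectively.
import torch
from typing import Callable

def adft(x_clean: torch.Tensor,
         att_loss: Callable,          # DSUR attention loss: (x, x_adv) -> scalar
         cond_loss: Callable,         # conditional loss on the dynamic surrogate
         dreambooth_loss: Callable,   # DreamBooth loss on the dynamic surrogate
         dyn_opt: torch.optim.Optimizer,
         eta: float = 7 / 255, sigma: float = 5e-3,
         epochs: int = 10, pgd_steps: int = 5) -> torch.Tensor:
    """Alternating Dynamic Fine-Tuning: returns the perturbation delta_ft."""
    delta = torch.zeros_like(x_clean)
    for _ in range(epochs):
        # Step 1: PGD ascent on L_att with both surrogates frozen.
        for _ in range(pgd_steps):
            delta.requires_grad_(True)
            loss = att_loss(x_clean, x_clean + delta)  # static(x) vs dynamic(x+delta)
            grad, = torch.autograd.grad(loss, delta)
            delta = (delta + sigma * grad.sign()).clamp(-eta, eta).detach()
        # Step 2: PGD ascent on L_cond against the dynamic surrogate.
        for _ in range(pgd_steps):
            delta.requires_grad_(True)
            grad, = torch.autograd.grad(cond_loss(x_clean + delta), delta)
            delta = (delta + sigma * grad.sign()).clamp(-eta, eta).detach()
        # Step 3: update the dynamic surrogate by minimizing the DreamBooth loss
        # on the currently perturbed images (the surrogate "chases" delta).
        dyn_opt.zero_grad()
        dreambooth_loss(x_clean + delta).backward()
        dyn_opt.step()
    return delta
```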

The second layer targets zero-shot customization and is intentionally simpler. After aligning the face, the method defines a cosine-similarity loss over N identity encoders IE_i: the loss is 1 minus the weighted average similarity between the protected face and the original face. The paper says the N encoders are chosen to be distinct so as to improve transferability across different zero-shot methods, but the excerpt does not enumerate the exact encoder set. The perturbation δ_zs is then updated via PGD under an ℓ∞ budget. The result is essentially a multi-encoder adversarial example against face-recognition embeddings, which is then reinserted into the full image.
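A minimal sketch of this layer, assuming the stated loss L_zs = 1 − Σ_i w_i · cos(IE_i(x_f′ + δ_zs), IE_i(x_f′)). The encoder list, weights, and step count are assumptions (the paper does not enumerate them in the excerpt), and pixel-range clamping is omitted for brevity.

```python
# Multi-encoder PGD sketch for the zero-shot layer, under stated assumptions.
import torch
import torch.nn.functional as F

def zero_shot_perturb(face: torch.Tensor, encoders, weights,
                      eta: float = 11 / 255, sigma: float = 8e-4,
                      steps: int = 200) -> torch.Tensor:
    """PGD under an l_inf budget that pushes embeddings away from the original."""
    with torch.no_grad():
        targets = [enc(face) for enc in encoders]  # clean-face embeddings
    delta = torch.zeros_like(face, requires_grad=True)
    for _ in range(steps):
        sim = sum(w * F.cosine_similarity(enc(face + delta), t).mean()
                  for enc, t, w in zip(encoders, targets, weights))
        loss = 1.0 - sim        # L_zs: maximize dissimilarity to the clean identity
        loss.backward()
        with torch.no_grad():
            delta += sigma * delta.grad.sign()
            delta.clamp_(-eta, eta)      # l_inf projection
            delta.grad.zero_()
    return delta.detach()
```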

Training regime and hyperparameters: for the first layer, the perturbation bound is η_ft = 7/255 and the optimization stride is σ_ft = 5×10^-3. For the second layer, η_zs = 11/255 and σ_zs = 8×10^-4. When evaluating protection, the paper fine-tunes a pretrained model on protected images using DreamBooth with batch size 4, 400 iterations, and learning rate 5×10^-6. The surrogate for the fine-tuning defense is SD-v2.1, and transfer is tested on SD-v1.5; for the zero-shot defense, the targets are FaceID (SD-v1.5) and Instance-ID (SDXL). The excerpt does not specify optimizer details beyond the use of Adam in the notation tables, nor does it provide random-seed strategy or wall-clock/hardware information.
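For reference, the reported values collected into one place; the field names are my own, and anything not stated in the excerpt (epoch counts, seeds, hardware) is deliberately left out.

```python
# Hyperparameters as reported in the excerpt; structure and names are assumptions.
DLADIFF_CFG = {
    "layer1_finetune_defense": {"eta": 7 / 255, "sigma": 5e-3},   # PGD bound / stride
    "layer2_zeroshot_defense": {"eta": 11 / 255, "sigma": 8e-4},
    "eval_dreambooth": {"batch_size": 4, "iterations": 400, "lr": 5e-6},
    "surrogate": "SD-v2.1",   # transfer evaluated on SD-v1.5
    "zeroshot_targets": {"FaceID": "SD-v1.5", "Instance-ID": "SDXL"},
}
```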

Evaluation protocol and one concrete example: fine-tuning defense is measured by FDR (face detection rate using RetinaFace), ISM (identity score matching using ArcFace), FID, FIQA, and MOS. Lower FDR/ISM indicate that generated images no longer strongly resemble a valid face or the source identity; higher FID indicates worse image quality, while lower FIQA and MOS suggest degraded quality. In Table 1, on CelebA-HQ with the prompt "a photo of sks person", DLADiff yields FDR 0.201, ISM 0.096, FID 233.7, FIQA 0.225, and MOS 1.57. The comparison baselines include MIST, Anti-DB, DisDiff, and Anti-diffusion. The paper also tests prompt shift with "a dslr portrait of sks person", where performance remains strong but drops somewhat on VGGFace2. Robustness is checked under model-method transfer (DreamBooth protection, LoRA attack; Table 2) and model-version transfer (SD-v2.1 protection, SD-v1.5 attack; Table 3). Zero-shot defense is evaluated with ISMpro and ISMgen on FaceID and Instance-ID; Table 4 shows near-collapse of identity similarity for the protected images. Ablation in Table 5 isolates DSUR, ADFT, and the second layer (Anti-ZS). The paper does not report formal statistical tests, confidence intervals, cross-validation, or held-out-attacker splits in the excerpt.
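Since ISM is the headline metric, here is a sketch of an ISM-style score, assuming ArcFace embeddings as stated; the paper's exact pairing and aggregation rule are not given in the excerpt, so scoring each generated face against its closest reference embedding is an assumption.

```python
# Sketch of an ISM-style identity score; aggregation choice is an assumption.
import torch
import torch.nn.functional as F

def identity_score_matching(gen_embs: torch.Tensor,
                            ref_embs: torch.Tensor) -> float:
    """Mean cosine similarity between generated faces and the real identity."""
    gen = F.normalize(gen_embs, dim=-1)   # (N, D) embeddings of generated faces
    ref = F.normalize(ref_embs, dim=-1)   # (M, D) embeddings of real photos
    sims = gen @ ref.T                    # (N, M) pairwise cosine similarities
    # Score each generated face against its closest reference, then average.
    return sims.max(dim=1).values.mean().item()
```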

Reproducibility: the excerpt does not mention code release, pretrained checkpoints, or frozen protected-image sets. It does provide enough implementation-level detail to reproduce the optimization structure in broad strokes, but some potentially important details are missing: exact number of optimization epochs, the precise set of identity encoders used in the zero-shot loss, and the complete algorithmic pseudocode are only said to appear in supplementary material. That means the method is partially reproducible from the main text, but not fully specified in the excerpt alone.

Technical innovations

  • Dual-Surrogate Models (DSUR): the first layer combines a static fine-tuned surrogate and a dynamic pretrained surrogate so the perturbation is optimized against both the learned attacker state and the current attacker optimization path.
  • Alternating Dynamic Fine-Tuning (ADFT): the paper alternates PGD updates on perturbations with updates to the dynamic surrogate, instead of treating surrogate training as a one-off precomputation.
  • A dedicated zero-shot defense layer: the second perturbation layer directly attacks fixed identity encoders with multi-encoder cosine-similarity PGD, rather than relying on fine-tuning-poisoning alone.
  • The framework explicitly separates defense objectives for weight-updating customization and weight-free customization, rather than assuming one adversarial perturbation transfers across both.

Datasets

  • Anti-DreamBooth benchmark — 100 identities total (50 from CelebA-HQ and 50 from VGGFace2), 12-15 portraits per identity, 512×512 images — constructed from public source datasets by prior work
  • CelebA-HQ — subset of 50 identities, 12-15 high-quality portraits each — public
  • VGGFace2 — subset of 50 identities, 12-15 high-quality portraits each — public

Baselines vs proposed

  • MIST: on CelebA-HQ / DreamBooth, FDR = 0.980 and ISM = 0.516 vs proposed: 0.201 and 0.096
  • Anti-DB: on CelebA-HQ / DreamBooth, FDR = 0.851 and ISM = 0.452 vs proposed: 0.201 and 0.096
  • DisDiff: on CelebA-HQ / DreamBooth, FDR = 0.482 and ISM = 0.241 vs proposed: 0.201 and 0.096
  • Anti-diffusion: on CelebA-HQ / DreamBooth, FDR = 0.802 and ISM = 0.425 vs proposed: 0.201 and 0.096
  • MIST: on LoRA transfer, FDR = 0.988 and ISM = 0.388 vs proposed: 0.728 and 0.199
  • Anti-diffusion: on SD-v1.5 transfer, FDR = 0.258 and ISM = 0.139 vs proposed: 0.070 and 0.031
  • MIST: on FaceID zero-shot (CelebA-HQ), ISMpro = 0.970 and ISMgen = 0.409 vs proposed: 0.090 and 0.039
  • Anti-diffusion: on Instance-ID zero-shot (VGGFace2), ISMpro = 0.968 and ISMgen = 0.620 vs proposed: 0.077 and 0.058

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2511.19910.

Fig 1

Fig 1: The DLADiff framework protects personal photos by si- (caption truncated in extraction).

Figs 2-8: additional figures reproduced from the paper; captions were not extracted.

Limitations

  • The paper evaluates on a relatively small benchmark: 100 identities total, with 12-15 images each, so coverage of pose, age, lighting, ethnicity, and occlusion is limited.
  • Zero-shot defense is tested only on FaceID and Instance-ID; the paper claims generalization but does not show a broader attacker suite such as other identity encoders or emerging adapters.
  • The excerpt does not report runtime, memory cost, or optimization overhead of the two-stage perturbation pipeline, which matters for practical deployment at scale.
  • The main evaluation assumes the attacker uses standard DreamBooth/LoRA settings and the two named zero-shot methods; more adaptive attackers might change prompts, use stronger regularization, or ensemble encoders.
  • There is no clear statistical significance testing or multiple-seed reporting in the excerpted results.
  • The method likely trades off some perceptual quality of the protected public images even before any attack, but the paper does not quantify the human usability of the released photos themselves; FIQA/MOS are reported only on generated outputs.

Open questions / follow-ons

  • How does the dual-layer strategy perform against stronger, adaptive attackers who explicitly optimize to remove or denoise the published perturbations before fine-tuning or encoding?
  • Would the zero-shot layer transfer to other identity encoders beyond ArcFace-derived or CLIP-plus-face-recognition systems, especially newer adapter families?
  • What is the runtime and compute cost of generating both perturbation layers per identity, and can it be amortized for large-scale photo sharing platforms?
  • Can the fine-tuning layer be simplified without losing most of the benefit, or is the static-surrogate pre-fine-tuning step essential?

Why it matters for bot defense

For bot-defense practitioners, DLADiff is relevant less as a CAPTCHA primitive and more as an example of tailoring defenses to distinct attacker pipelines. The key lesson is that different abuse channels may share a surface goal, such as identity theft, but require different perturbation or detection logic depending on whether the attacker updates model weights or relies on a fixed encoder. That maps well to bot-defense engineering: a single generic mitigation often underperforms against heterogeneous automation stacks.

A practical takeaway is to think in terms of attacker state and transferability. DLADiff’s first layer tries to poison an optimizer that changes over time; the second layer attacks a frozen feature extractor. In bot systems, the analogous distinction is between defenses against adaptive scraping/training loops and defenses against fixed classifiers or retrieval embeddings. The paper’s ablations also show that removing any single component sharply weakens protection against the corresponding attack class, which is a reminder to validate on held-out attack implementations, not just one canonical baseline.

Cite

```bibtex
@article{arxiv2511_19910,
  title={DLADiff: A Dual-Layer Defense Framework against Fine-Tuning and Zero-Shot Customization of Diffusion Models},
  author={Jun Jia and Hongyi Miao and Yingjie Zhou and Linhan Cao and Yanwei Jiang and Wangqiu Zhou and Dandan Zhu and Hua Yang and Wei Sun and Xiongkuo Min and Guangtao Zhai},
  journal={arXiv preprint arXiv:2511.19910},
  year={2025},
  url={https://arxiv.org/abs/2511.19910}
}
```

Read the full paper: https://arxiv.org/abs/2511.19910

Articles are CC BY 4.0 — feel free to quote with attribution