To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model

Source: arXiv:2605.14291 · Published 2026-05-14 · By Chengshuai Zhao, Zhen Tan, Dawei Li, Zhiyuan Yu, Huan Liu

TL;DR

This paper addresses unauthorized fine-tuning of Large Vision-Language Models (LVLMs) on scraped multimodal web data, which raises serious copyright and privacy concerns for data owners. Existing mitigations such as watermarking and machine unlearning are reactive: they act only after IP infringement has occurred. To let data owners defend proactively, the authors propose MMGUARD, a data-centric framework that injects imperceptible, optimized perturbations into both images and their associated text, producing "unlearnable examples" that degrade the downstream performance of any LVLM fine-tuned on the protected data. MMGUARD additionally disrupts LVLMs' cross-modal attention binding by creating spurious correlations between the perturbations and the targets, forcing models to learn brittle shortcuts that do not generalize to clean inputs. Using an ensemble optimization strategy, MMGUARD generates protections that transfer across unknown attacker LVLM architectures and fine-tuning recipes. Experiments on six public multimodal datasets and nine open-source LVLMs show consistent degradation in fine-tuned model accuracy while preserving perceptual fidelity, under white-box, gray-box, and black-box threat models.

Key findings

MMGUARD degrades downstream task performance by up to 45% on average across nine open-source LVLMs fine-tuned on protected multimodal data.
Cross-modal binding disruption via Bridge Path Hijack significantly shifts attention patterns; Theorem 5.1 lower-bounds the total variation distance between the protected-training and clean-evaluation attention distributions by 1 − √(η/2).
The image perturbation magnitude is strictly bounded with ∥δ∥∞ ≤ εx ensuring human-imperceptibility while retaining defense effectiveness.
Text protection uses discrete token triggers with length ≤ εt inserted without replacing original tokens, optimized by a gradient-based HotFlip search verified by exact evaluation.
Ensemble optimization over multiple surrogate LVLM architectures drastically improves black-box transferability compared to single-model optimization.
Robustness evaluations show MMGUARD withstands adaptive attackers using data augmentations, fine-tuning strategies, and mixing protected with clean public samples without significant loss of protection efficacy.
Multi-objective bilevel optimization balances the trade-off between training loss minimization, attention routing disruption, stealthiness, and cross-modal consistency.
Ablation studies confirm that coupling image and text perturbations with cross-modal binding disruption outperforms perturbing single modalities or components alone.
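
The Theorem 5.1 finding above is stated in terms of total variation distance between attention distributions. As a minimal sketch of what that quantity measures, the snippet below computes it for two discrete attention rows and evaluates the quoted bound. The attention values and the value of η are made up for illustration; what η measures is defined in the paper and is not reproduced here.

```python
import math

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tv_lower_bound(eta):
    """The bound 1 - sqrt(eta / 2) as quoted from Theorem 5.1."""
    return 1.0 - math.sqrt(eta / 2.0)

# Hypothetical attention of one answer token over four key positions.
attn_protected = [0.90, 0.05, 0.03, 0.02]  # mass concentrated on one position
attn_clean = [0.05, 0.05, 0.45, 0.45]      # mass spread over other positions

tv = total_variation(attn_protected, attn_clean)
bound = tv_lower_bound(0.08)  # eta = 0.08 is an illustrative value only
print(f"TV distance = {tv:.2f}, bound for eta=0.08 = {bound:.2f}")
```

A smaller η pushes the bound toward 1, meaning the two attention distributions are guaranteed to overlap less.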

Threat model

The attacker scrapes multimodal image-text data from the web and, without consent, fine-tunes a pretrained LVLM checkpoint on it to improve downstream task performance. The attacker knows standard fine-tuning recipes and may adapt to defenses by applying transformations, mixing protected data with public data, or altering hyperparameters. The attacker does not reveal or share model weights with the defender. The defender can only modify their own data before public release, using imperceptible perturbations, and cannot interfere with the attacker's training process. The defender may have full, partial, or no knowledge of the attacker's LVLM architecture (white-box, gray-box, and black-box settings, respectively).

Methodology — deep read

Threat Model & Assumptions: The problem involves two parties: a defender who owns multimodal (image-text-response) data and wants to prevent unauthorized fine-tuning on it once released, and an attacker who fine-tunes pretrained LVLM checkpoints on scraped web data to improve downstream task performance. The defender can only modify their own data by adding stealthy perturbations and, in the hardest case, knows nothing about the attacker's model architecture, training recipe, or fine-tuning hyperparameters. The paper considers white-box (the defender knows the attacker's model exactly), gray-box (partial knowledge), and black-box (no knowledge) settings.
Data: The evaluations use six publicly available multimodal datasets (not fully detailed in the truncated text), each made up of triples: an image, a textual input, and a target response. Perturbations are crafted per sample within fixed budgets: εx bounds the ∞-norm of the image perturbation, and εt bounds the length of the textual trigger.
Architecture and Algorithm: The LVLM architecture consists of a visual encoder that processes images into tokens, a text encoder for tokenizing textual inputs, a modality projector aligning image tokens into language embedding space, and an autoregressive decoder language model generating tokens conditioned on both. MMGUARD models the LVLM pipeline with a differentiable approximation of the image processor to enable gradient flow for image perturbation optimization.
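
The differentiable approximation of the image processor can be pictured with a straight-through estimator. The sketch below is an assumption about one common way to build such a surrogate, not the paper's pipeline: the forward pass clamps, quantizes, rescales, and normalizes pixel values, while the backward pass treats rounding and clamping as the identity so that gradients reach the raw pixels. `MEAN`, `STD`, and both function names are hypothetical.

```python
MEAN = 0.5   # hypothetical normalization mean
STD = 0.25   # hypothetical normalization standard deviation

def preprocess_forward(pixels):
    """Clamp to [0, 255], round to 8-bit levels, rescale, then normalize."""
    quantized = [round(min(255.0, max(0.0, p))) for p in pixels]
    return [(q / 255.0 - MEAN) / STD for q in quantized]

def preprocess_backward(grad_out):
    """Straight-through backward pass.

    Rounding has zero gradient almost everywhere, so it is treated as the
    identity here (clamping too, for simplicity). Only the affine
    rescale-and-normalize step contributes a factor of 1 / (255 * STD).
    """
    scale = 1.0 / (255.0 * STD)
    return [g * scale for g in grad_out]

features = preprocess_forward([127.6, 300.0, -4.0])
pixel_grads = preprocess_backward([1.0, 1.0, 1.0])
```

Without some substitute of this kind, the rounding step would block every gradient and the image perturbation could not be optimized end to end.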

The method generates coupled multimodal perturbations: (a) image perturbations δi optimized with projected gradient descent on the LVLM training loss through a differentiable surrogate image processing pipeline; (b) text triggers γi inserted discretely and optimized via a gradient-guided token substitution search (based on HotFlip) with exact loss verification to handle tokenization subtleties.
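
The image branch described above can be sketched as projected gradient descent under an ∞-norm budget. This is a toy illustration, not the authors' implementation: a linear model with squared error stands in for the LVLM training loss, and the signed-gradient step, the step size, and the 8/255 budget are assumptions chosen for the example.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def pgd_minimize(x, grad_fn, eps, step, iters):
    """Find delta with max|delta_i| <= eps that lowers the loss at x + delta."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        perturbed = [xi + di for xi, di in zip(x, delta)]
        grad = grad_fn(perturbed)
        # Signed descent step (descent, because the goal is a lower loss).
        delta = [di - step * sign(gi) for di, gi in zip(delta, grad)]
        # Project back onto the L-infinity ball of radius eps.
        delta = [max(-eps, min(eps, di)) for di in delta]
        # Keep x + delta inside the valid pixel range [0, 1].
        delta = [min(1.0, max(0.0, xi + di)) - xi for xi, di in zip(x, delta)]
    return delta

# Toy stand-in for the training loss: squared error of a linear model.
W, Y = [0.5, -0.25, 1.0], 1.0

def toy_loss(x):
    return (sum(w * xi for w, xi in zip(W, x)) - Y) ** 2

def toy_grad(x):
    residual = sum(w * xi for w, xi in zip(W, x)) - Y
    return [2.0 * residual * w for w in W]

image = [0.2, 0.4, 0.6]
eps = 8 / 255
delta = pgd_minimize(image, toy_grad, eps=eps, step=2 / 255, iters=10)
protected = [xi + di for xi, di in zip(image, delta)]
print(f"loss before = {toy_loss(image):.4f}, after = {toy_loss(protected):.4f}")
```

The projection step is what enforces the imperceptibility budget: however many iterations run, no pixel moves by more than `eps`.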

To prevent LVLM fine-tuning from circumventing protection by leveraging robust cross-modal semantic bindings, MMGUARD introduces cross-modal binding disruption: during training, it steers the model's multi-head attention toward spurious, non-generalizable shortcuts that link the image and text perturbations to the target answer. There are two variants, Bridge Path Hijack (BPH) and Contrastive Routing Shift (CRS). BPH explicitly enforces attention bindings (trigger→perturbation, answer→trigger, answer→perturbation) using a contrastive attention mass loss. CRS encourages the answer tokens' attention patterns to diverge between clean and protected inputs, disrupting learned alignments without prescribing specific paths. The overall objective balances the training loss against the binding disruption loss.
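
To make "attention mass" concrete, the sketch below measures how much of one answer token's attention lands on a chosen set of key positions and turns that into a simple contrastive score. The exact form of the paper's loss is not given in this summary, so `contrastive_mass_loss` is a hypothetical stand-in, and the token layout (six keys, trigger at index 4) is invented for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_mass(attn_row, positions):
    """Share of one query's attention that falls on the given key positions."""
    return sum(attn_row[i] for i in positions)

def contrastive_mass_loss(attn_row, target_positions):
    """Hypothetical contrastive form: off-target mass minus on-target mass.

    Ranges from -1 (all attention on the targets) to +1 (none on them).
    """
    on_target = attention_mass(attn_row, target_positions)
    return (1.0 - on_target) - on_target

TRIGGER = [4]  # hypothetical index of the text trigger among six keys

uniform_row = softmax([0.0] * 6)                       # no binding yet
bound_row = softmax([0.0, 0.0, 0.0, 0.0, 3.0, 0.0])    # answer attends to trigger

loss_before = contrastive_mass_loss(uniform_row, TRIGGER)
loss_after = contrastive_mass_loss(bound_row, TRIGGER)
print(f"loss before = {loss_before:.3f}, after = {loss_after:.3f}")
```

Minimizing a score of this shape pushes the answer→trigger path to dominate; the same helper could be applied to the trigger→perturbation and answer→perturbation paths.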

To improve transferability across unknown attacker LVLMs, MMGUARD optimizes perturbations over an ensemble of surrogate LVLMs. The overall procedure solves a multi-objective bilevel optimization problem balancing imperceptibility constraints, model training loss minimization on protected data, and attention disruption.
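
The ensemble idea can be sketched by averaging gradients from several surrogates before each update, so that one perturbation lowers the loss for all of them at once. This toy version uses three linear models in place of surrogate LVLMs and covers only the perturbation update, not the full bilevel problem; uniform weighting, the signed step, and all numeric values are assumptions.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def make_surrogate(weights, target):
    """Return (loss, grad) closures for a toy linear surrogate model."""
    def residual(x):
        return sum(w * xi for w, xi in zip(weights, x)) - target

    def loss(x):
        return residual(x) ** 2

    def grad(x):
        r = residual(x)
        return [2.0 * r * w for w in weights]

    return loss, grad

def ensemble_grad(grad_fns, x):
    """Uniform average of the surrogates' gradients at x."""
    grads = [g(x) for g in grad_fns]
    return [sum(column) / len(grads) for column in zip(*grads)]

def ensemble_pgd(x, grad_fns, eps, step, iters):
    """Signed descent on the averaged gradient with L-infinity projection."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        perturbed = [xi + di for xi, di in zip(x, delta)]
        grad = ensemble_grad(grad_fns, perturbed)
        delta = [di - step * sign(gi) for di, gi in zip(delta, grad)]
        delta = [max(-eps, min(eps, di)) for di in delta]
    return delta

surrogates = [make_surrogate([1.0, 0.5], 1.0),
              make_surrogate([0.8, -0.2], 1.0),
              make_surrogate([0.6, 0.9], 1.0)]
loss_fns = [loss for loss, _ in surrogates]
grad_fns = [grad for _, grad in surrogates]

def mean_loss(x):
    return sum(f(x) for f in loss_fns) / len(loss_fns)

x = [0.3, 0.3]
delta = ensemble_pgd(x, grad_fns, eps=0.05, step=0.01, iters=10)
x_protected = [xi + di for xi, di in zip(x, delta)]
print(f"mean loss before = {mean_loss(x):.4f}, after = {mean_loss(x_protected):.4f}")
```

Averaging keeps the update from overfitting to any single surrogate's quirks, which is the intuition behind better transfer to an unseen attacker model.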

Training Regime: Optimization uses projected gradient descent for image perturbations and a HotFlip-based discrete search for text triggers. Step sizes, perturbation budgets, and the relative weights of the training and binding disruption losses must be balanced against each other. The truncated text does not specify exact epoch counts, hardware, or the seed strategy.
Evaluation Protocol: MMGUARD is evaluated across six multimodal datasets and nine open-source LVLM architectures, under white-box, gray-box, and black-box threat models. Metrics include downstream task performance (accuracy or task loss) on clean evaluation data after models are fine-tuned on protected training data. Transferability is tested by cross-architecture evaluation. Robustness is tested against adaptive attackers who apply data transformations, mixing with clean data, and altered fine-tuning recipes. Attention visualizations and ablation studies analyze component contributions.
Reproducibility: The authors release code on GitHub (https://github.com/ChengshuaiZhao0/MMGuard). The datasets are publicly available but not specified in detail. Surrogate models and perturbation weights are presumably available via their code.
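
The HotFlip-style trigger search with exact verification can be illustrated on a toy vocabulary. The first-order score for swapping a trigger token is the dot product of the embedding difference with the loss gradient; because that estimate can mislead, the top candidates are then re-scored with the true loss. The vocabulary, embeddings, and quadratic loss below are invented for the example and are not taken from the paper.

```python
# Hypothetical 2-D token embeddings and a toy loss minimized at TARGET.
VOCAB = {"the": [0.0, 0.0], "cat": [1.0, 0.0], "zx": [0.9, 0.8], "qq": [3.0, 3.0]}
TARGET = [1.0, 1.0]

def loss_of(token):
    """Exact toy loss for placing `token` in the trigger slot."""
    return sum((e - t) ** 2 for e, t in zip(VOCAB[token], TARGET))

def grad_at(token):
    """Gradient of the toy loss with respect to the slot's embedding."""
    return [2.0 * (e - t) for e, t in zip(VOCAB[token], TARGET)]

def hotflip_candidates(current, k):
    """Rank replacements by the first-order estimate (e_new - e_old) . grad."""
    e_old, grad = VOCAB[current], grad_at(current)
    scores = {
        tok: sum((en - eo) * g for en, eo, g in zip(emb, e_old, grad))
        for tok, emb in VOCAB.items() if tok != current
    }
    # Most negative estimate first: the largest predicted loss decrease.
    return sorted(scores, key=scores.get)[:k]

def flip_with_verification(current, k):
    """Keep a swap only if its exact loss beats the current token's loss."""
    best, best_loss = current, loss_of(current)
    for tok in hotflip_candidates(current, k):
        exact = loss_of(tok)
        if exact < best_loss:
            best, best_loss = tok, exact
    return best, best_loss

top_estimate = hotflip_candidates("the", 1)[0]
chosen, chosen_loss = flip_with_verification("the", 2)
print(top_estimate, chosen, round(chosen_loss, 2))
```

Here the linear estimate ranks "qq" first even though its exact loss is worse than the starting token's; the verification pass rejects it and keeps "zx" instead.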

Technical innovations

Joint multimodal unlearnable example generation combining human-imperceptible image perturbations with discrete textual trigger insertions to jointly protect both modalities during LVLM fine-tuning.
Cross-modal binding disruption techniques (Bridge Path Hijack and Contrastive Routing Shift) that steer LVLM attention to spurious, non-transferable shortcuts between perturbations and target outputs, with theoretical guarantees on attention pattern shifts.
Differentiable surrogate image processing pipeline enabling end-to-end gradient-based optimization of perturbations despite standard nondifferentiable preprocessing steps.
Ensemble-based bilevel multi-objective optimization framework that integrates training loss minimization, multi-head attention manipulation, and perceptual budgets to enhance robustness and transferability across unknown attacker LVLM architectures.

Datasets

Six publicly available multimodal datasets; names and sizes are not specified in the paper excerpt.

Baselines vs proposed

No protection (baseline) vs MMGUARD: fine-tuning on unprotected data yields typical task performance, while fine-tuning on MMGUARD-protected data degrades downstream task accuracy by up to 45% on nine open-source LVLMs across six datasets
Single-modality unlearnable perturbation (image-only or text-only): less effective compared to joint multimodal MMGUARD protection (exact quantitative delta not provided)
Single-surrogate optimization: lower cross-model transferability and robustness compared to ensemble-based surrogate LVLM optimization
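
The summary does not say whether the 45% figure is an absolute or a relative drop, so the sketch below computes both from a pair of accuracies. The accuracy values are hypothetical and only show how far the two readings can diverge.

```python
def absolute_drop(acc_unprotected, acc_protected):
    """Accuracy points lost when fine-tuning on protected data."""
    return acc_unprotected - acc_protected

def relative_drop(acc_unprotected, acc_protected):
    """Fraction of the unprotected fine-tuned accuracy that is lost."""
    if acc_unprotected <= 0:
        raise ValueError("unprotected accuracy must be positive")
    return (acc_unprotected - acc_protected) / acc_unprotected

# Hypothetical accuracies on a clean evaluation set.
acc_unprotected = 0.80  # fine-tuned on unprotected data
acc_protected = 0.44    # fine-tuned on protected data

abs_points = absolute_drop(acc_unprotected, acc_protected)
rel_fraction = relative_drop(acc_unprotected, acc_protected)
print(f"absolute drop = {abs_points:.2f}, relative drop = {rel_fraction:.0%}")
```

With these numbers a 45% relative drop corresponds to only 36 accuracy points, so the convention matters when comparing against the baselines listed above.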

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.14291.

Fig 1

Fig 1: Protect multimodal data from unauthorized fine-tuning

Fig 2

Fig 2: Overview of MMGUARD. The defender generates protected multimodal examples by coupling image-unlearnable perturbations

Fig 8

Fig 8: Ablation study and parameter sensitivity of MMGUARD-BPH. The leftmost panel reports nine ablation conditions on the

Fig 9

Fig 9: Visualization of the cross-modal binding mechanism

Limitations

The method assumes access to surrogate LVLM architectures for ensemble optimization; effectiveness is lower when no related surrogate is available.
Robustness tested under some adaptive attacker strategies but not against all possible evasion techniques such as powerful data purification or adversarial training.
Text trigger insertion may be detectable or degrade some downstream language tasks sensitive to input form, though this is partially mitigated by short trigger length constraints.
Computational cost and scalability of bilevel optimization with ensemble surrogates may be high for very large datasets.
Real-world effectiveness depends on attacker fine-tuning regimes and preprocessing pipelines, which can vary considerably and are difficult to model fully.

Open questions / follow-ons

How well do MMGUARD perturbations withstand advanced data purification or adversarial training applied by attackers as countermeasures?
Can adaptive attack strategies explicitly targeting cross-modal binding disruptions neutralize this defense?
What is the tradeoff between perceptual quality degradation and robustness when scaling MMGUARD to very large and diverse multimodal datasets?
How does MMGUARD perform on commercially deployed LVLMs with proprietary architectures and preprocessing unknown to the defender?

Why it matters for bot defense

MMGUARD presents a novel and proactive data-level defense approach for protecting multimodal data against unauthorized LVLM fine-tuning by disrupting model learning dynamics and attention mechanisms. For practitioners focused on bot defense and CAPTCHA systems, this technique exemplifies a preemptive strategy that can complement reactive detection and mitigation measures. Its use of imperceptible perturbations that create spurious learning shortcuts without visible degradation aligns with current trends toward subtle and robust anti-scraping or anti-abuse mechanisms. The concept of manipulating cross-modal attention to enforce model misgeneralization also suggests new directions for designing adversarial interventions specific to complex multimodal systems.

However, unlike traditional CAPTCHAs, which focus on real-time, interaction-based human-bot discrimination, MMGUARD targets the earlier data acquisition stage, aiming to protect intellectual property before it is used without authorization in large-scale model training. This distinction underscores that effective defenses against bots and data abuse may require multi-stage, multimodal solutions. MMGUARD's transferability across model architectures is especially important given the opaque and evolving nature of LVLM attackers in the wild. Overall, bot-defense engineers may consider integrating or adapting such unlearnable multimodal data perturbations to proactively raise the cost of, and reduce, downstream unauthorized exploitation of sensitive datasets.

Cite

bibtex

@article{arxiv2605_14291,
  title={To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model},
  author={Chengshuai Zhao and Zhen Tan and Dawei Li and Zhiyuan Yu and Huan Liu},
  journal={arXiv preprint arXiv:2605.14291},
  year={2026},
  url={https://arxiv.org/abs/2605.14291}
}
