Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching

Source: arXiv:2606.03911 · Published 2026-06-02 · By Yoad Tewel, Yuval Atzmon, Gal Chechik, Lior Wolf

TL;DR

This paper addresses the challenge of training high-quality visual editing models without paired source-target data, a setting that matters most for video editing, where paired data is extremely costly or impossible to collect. The authors propose Bootstrap Your Generator (ByG), a framework for training flow-matching models for image and video editing on unpaired data. ByG uses the knowledge already present in a pretrained text-to-image model as supervision and enforces cycle consistency to preserve source content, removing the need for external reward signals or paired data. The key technical contributions are (1) bootstrapping self-generated noisy training targets, (2) a directional prior loss derived from the frozen base model to guide instruction adherence, and (3) a gradient routing mechanism based on straight-through estimation that bridges the gap between noisy training and clean inference conditions. Experiments report state-of-the-art results on data-scarce long-tail style and video editing benchmarks. In human user studies, ByG is preferred over Ditto, a supervised baseline trained on one million paired video examples, in 75.3% of comparisons. It also generalizes to unseen domains and target styles, outperforming supervised and zero-shot baselines there. Ablations indicate that each component matters for balancing instruction following and source preservation. The framework offers a scalable, general approach to instruction-guided editing using unpaired data and existing pretrained generators.

Key findings

On video editing, ByG achieves a 75.3% ± 2.2% user preference win rate vs Ditto (a supervised model trained on one million paired video editing examples).
ByG outperforms Ditto by 85% to 15% on out-of-distribution 3D-CGI videos, demonstrating strong domain generalization (Fig 3).
Quantitative video metrics show ByG improves CLIP directional similarity (edit success) from 0.091 to 0.104 and DINO similarity (source preservation) from 0.536 to 0.718 over Ditto (Table 1).
On a 335-image long-tail style edit benchmark with 12 style classes unseen during training, ByG outperforms supervised baselines FLUX-Kontext and Qwen-Image-Edit in semantic (7.67 vs 6.87/6.86) and overall scores (8.30 vs 7.85/7.75) (Table 2).
On the general GEdit-Bench dataset, ByG matches or exceeds FLUX-Kontext across most edit categories (e.g., motion change 6.88 vs 4.53), but underperforms on subject removal and text change where paired data helps (Table 3).
Ablation shows removing cycle loss or gradient routing improves edit success but harms source preservation, demonstrating the tradeoff between editing and content fidelity (Table 4).
Removing bootstrap-generated noisy inputs degrades both edit success and source preservation drastically (5.5 and 7.0 vs 8.3 and 7.6) (Table 4).
The directional loss stabilizes training by aligning velocity differences with the frozen model’s edit direction, preventing collapse to identity.
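
The CLIP directional similarity cited in the findings above is commonly defined as the cosine similarity between the change in image embeddings (source to edit) and the change in text embeddings (source caption to target caption). The sketch below assumes that common definition; whether the paper uses exactly this form is not confirmed by this summary, and the short vectors are hypothetical stand-ins for real CLIP embeddings.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float], eps: float = 1e-12) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def clip_directional_similarity(
    img_src: Sequence[float],
    img_edit: Sequence[float],
    txt_src: Sequence[float],
    txt_tgt: Sequence[float],
) -> float:
    """Cosine between the image-space shift and the text-space shift."""
    image_shift = [e - s for e, s in zip(img_edit, img_src)]
    text_shift = [t - s for t, s in zip(txt_tgt, txt_src)]
    return cosine(image_shift, text_shift)


# Hypothetical 3-d embeddings (real CLIP embeddings have hundreds of dimensions).
score = clip_directional_similarity(
    img_src=[1.0, 0.0, 0.0],
    img_edit=[1.0, 2.0, 0.0],
    txt_src=[0.0, 0.0, 1.0],
    txt_tgt=[0.0, 1.0, 1.0],
)
print(round(score, 6))
```

A higher score means the edit moved the image in the direction the captions describe; the metric says nothing about how much of the source was preserved, which is why it is paired with DINO similarity.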

Threat model

n/a - This paper focuses on unpaired training methodologies for generative visual editing and does not define an adversarial threat model or adversary capabilities.

Methodology — deep read

Threat Model & Assumptions: The adversary is not explicitly modeled as this is a generative model training paper focused on unpaired supervision rather than security. The method assumes availability of pretrained text-to-image base models and paired source and edited captions describing original and target image/video states, but no paired source-target images or videos. The editing instructions are textual cues describing the desired visual change. The model cannot rely on explicit ground-truth edited images or external reward models.
Data: Unpaired datasets with source images/videos and corresponding captions (p_src) alongside target captions (p_tgt) describing the desired edits are used. For video editing, the Wan2.2 text-to-video model is adapted. Training videos are sampled from VideoUFO and filtered for style labels (cartoon or photo-realistic). The video training set contains approximately 165 cartoon and 163 photorealistic videos, plus held-out sets. Style benchmarks include 335 images from six long-tail styles not seen during training, with 487 edit instructions. The GEdit-Bench dataset serves for general editing evaluation.
Architecture and Algorithm: ByG converts a pretrained text-to-image diffusion model G_t2i into an editing model G_edit conditioned on the source image x, the editing instruction c, and a noisy image input y_t. Because no ground-truth target exists, training generates pseudo-noisy targets ỹ_t by running a frozen exponential moving average (EMA) copy of G_edit through multi-step sampling from noise down to the current timestep t.
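
The bootstrapping step described above can be sketched as a partial Euler integration of a velocity field, stopped at the training timestep instead of running all the way to a clean sample. This is a minimal illustration, not the authors' implementation: it assumes the common rectified-flow convention (t = 1 is pure noise, t = 0 is clean, and one sampling step is y ← y − Δt·v), and `sample_to_timestep`, `ema_update`, and the toy velocity function are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def sample_to_timestep(
    velocity_fn: VelocityFn, noise: Vector, t_target: float, num_steps: int = 8
) -> Vector:
    """Euler-integrate from t=1 (noise) down to t_target and return the state there."""
    if not 0.0 <= t_target <= 1.0:
        raise ValueError("t_target must lie in [0, 1]")
    y = list(noise)
    t = 1.0
    dt = (1.0 - t_target) / num_steps
    for _ in range(num_steps):
        v = velocity_fn(y, t)  # in ByG this role is played by the EMA copy
        y = [yi - dt * vi for yi, vi in zip(y, v)]
        t -= dt
    return y


def ema_update(ema_params: Vector, params: Vector, decay: float = 0.999) -> Vector:
    """Standard EMA update: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Toy check: a velocity field whose straight paths all end at `clean`.
clean = [0.0, 2.0]
noise = [1.0, -1.0]


def toy_velocity(y: Vector, t: float) -> Vector:
    return [(yi - ci) / t for yi, ci in zip(y, clean)]


y_half = sample_to_timestep(toy_velocity, noise, t_target=0.5)
print(y_half)  # lies halfway between `noise` and `clean` on the straight path
```

Stopping at t rather than at 0 gives the trainable model a noisy input that comes from its own (EMA) sampling trajectory, which is what stands in for the missing noised ground-truth target.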

The editing model predicts velocity fields to reverse the noising process. Supervision is provided via two complementary losses:

Prior (Instruction Following) Loss: Aligns the editing model’s predicted velocity difference (v_fwd − v_src) with the directional velocity difference from the frozen base model (v_tgt − v_src), measured with cosine similarity plus an MSE regularizer for stability.
Cycle Consistency Loss: Enforces that forward editing followed by reverse editing under the inverse instruction recovers the source image, teaching the model to preserve source structure.
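
The prior loss above can be illustrated with a few lines of code. This is a sketch under stated assumptions: the summary only says the loss combines cosine similarity with an MSE regularizer and that α balances the two, so the specific mix `alpha * (1 - cos) + (1 - alpha) * mse` and the function name `directional_prior_loss` are hypothetical, and velocities are flattened to plain lists.

```python
import math
from typing import Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def directional_prior_loss(
    v_fwd: Sequence[float],
    v_src: Sequence[float],
    v_tgt: Sequence[float],
    alpha: float = 0.5,
    eps: float = 1e-8,
) -> float:
    """Penalize mismatch between the model's edit direction and the base model's."""
    d_model = [f - s for f, s in zip(v_fwd, v_src)]  # editing model's direction
    d_prior = [t - s for t, s in zip(v_tgt, v_src)]  # frozen base model's direction
    norms = math.sqrt(_dot(d_model, d_model)) * math.sqrt(_dot(d_prior, d_prior))
    cos = _dot(d_model, d_prior) / (norms + eps)
    mse = sum((a - b) ** 2 for a, b in zip(d_model, d_prior)) / len(d_model)
    # Assumed weighting: alpha trades off the cosine term against the MSE term.
    return alpha * (1.0 - cos) + (1.0 - alpha) * mse


aligned = directional_prior_loss(v_fwd=[0.0, 1.0], v_src=[0.0, 0.0], v_tgt=[0.0, 1.0])
orthogonal = directional_prior_loss(
    v_fwd=[1.0, 0.0], v_src=[0.0, 0.0], v_tgt=[0.0, 1.0], alpha=1.0
)
print(aligned, orthogonal)
```

The cosine term only constrains direction, so a prediction that collapses toward v_src (no edit at all) is penalized as soon as the base model's direction is non-zero; the MSE term additionally ties the magnitude to the base model's.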

A gradient routing mechanism based on straight-through estimation handles the reverse pass: the model is conditioned on the clean multi-step EMA output ỹ_0, while gradients are routed through the one-step prediction ŷ. This closes the train-inference gap that would otherwise arise because one-step outputs produced during training are blurry, whereas inference operates on clean inputs.
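
One common way to write a straight-through estimator of this kind uses a stop-gradient operator. The form below is an illustrative assumption rather than the paper's stated equation; sg(·) denotes stop-gradient (identity in the forward pass, zero gradient in the backward pass), θ denotes the trainable weights, and y_cond is a hypothetical name for the input given to the reverse pass.

```latex
\begin{aligned}
y_{\mathrm{cond}} &= \hat{y} + \operatorname{sg}\!\left(\tilde{y}_0 - \hat{y}\right)
  && \text{(definition)} \\
y_{\mathrm{cond}} &= \tilde{y}_0
  && \text{(forward value: the clean multi-step EMA output)} \\
\frac{\partial y_{\mathrm{cond}}}{\partial \theta} &= \frac{\partial \hat{y}}{\partial \theta}
  && \text{(backward pass: gradients flow through the one-step prediction)}
\end{aligned}
```

In an autodiff framework this is typically a one-line expression using the framework's detach or stop-gradient primitive.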

An identity loss further encourages reconstructing the input when the inverse instruction is identical to the input condition.
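
Taken together, the three losses can be written schematically as follows. This is a sketch, not the paper's equations: the squared L2 distance, the additive weighting, and the notation c^{-1} for the inverse instruction are assumptions, and the gradient routing detail is omitted for brevity. The weights λ_prior and λ_id are the balancing hyperparameters named in the training description.

```latex
\begin{aligned}
\hat{y} &= G_{\mathrm{edit}}(x,\, c), \qquad
\hat{x} = G_{\mathrm{edit}}(\hat{y},\, c^{-1}) \\
\mathcal{L}_{\mathrm{cycle}} &= \lVert \hat{x} - x \rVert_2^2 \\
\mathcal{L} &= \mathcal{L}_{\mathrm{cycle}}
  + \lambda_{\mathrm{prior}}\, \mathcal{L}_{\mathrm{prior}}
  + \lambda_{\mathrm{id}}\, \mathcal{L}_{\mathrm{id}}
\end{aligned}
```

The cycle term alone is satisfied by the identity mapping, which is why it needs the prior term to push the forward pass toward an actual edit.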

Training Regime: Each step samples a timestep t, noise ϵ, a source image x, and an instruction c. The frozen EMA model produces ỹ_t, which is fed to the trainable model to obtain velocity predictions. The losses are combined using balancing hyperparameters (e.g., α for the directional vs MSE terms, λ_prior, λ_id). Trainable weights are updated with the Adam optimizer, and the EMA parameters are updated alongside them. Details on epochs, batch size, hardware, and seeds are deferred to the appendix and are not fully disclosed in the main text.
Evaluation Protocol: Evaluation combines automatic metrics with human user studies. Semantic consistency and perceptual quality scores are obtained with VIEScore, which uses a large language model to judge edited outputs against the instructions. Video evaluation includes CLIP directional similarity, DINO frame similarity for source preservation, motion fidelity, and VBench metrics for temporal attributes. User studies collect pairwise preferences from multiple raters and report binomial and sign tests for significance, plus Fleiss’ kappa for inter-rater agreement. Baselines include supervised models trained on large paired datasets, such as Ditto and FLUX-Kontext.
Reproducibility: A code release is not explicitly mentioned. The method depends on pretrained backbones such as Wan2.2 and FLUX.1 and on large data sources, some of which are not publicly available. Hyperparameters and datasets are described in the appendices, but the results cannot be fully reproduced without access to these assets.
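
The pairwise user-study analysis mentioned under Evaluation Protocol can be sketched with an exact two-sided sign test against a 50/50 null. The counts below are hypothetical, chosen only for illustration; the summary does not give the number of comparisons behind the reported win rates, and the function names are not from the paper.

```python
import math
from typing import Tuple


def sign_test_p_value(wins: int, losses: int) -> float:
    """Exact two-sided sign test: is the win/loss split consistent with 50/50?"""
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction under p=0.5.
    upper_tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)


def win_rate_with_stderr(wins: int, losses: int) -> Tuple[float, float]:
    """Win rate and its binomial standard error sqrt(p * (1 - p) / n)."""
    n = wins + losses
    rate = wins / n
    return rate, math.sqrt(rate * (1.0 - rate) / n)


# Hypothetical counts, not taken from the paper.
rate, stderr = win_rate_with_stderr(wins=75, losses=25)
p_value = sign_test_p_value(wins=75, losses=25)
print(f"win rate = {rate:.3f} +/- {stderr:.3f}, p = {p_value:.2e}")
```

Ties are dropped before a sign test, so `wins` and `losses` count only comparisons where raters expressed a preference.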

Concrete Example: For video editing from cartoon to photo-realistic, the EMA copy of the model generates a noisy version of the target at timestep t. The editing model predicts denoising velocities aligned directionally with the frozen base model’s velocity difference for captions describing the source and target. Cycle consistency is enforced by applying the inverse instruction to the edited output to reconstruct the original input. Gradient routing allows clean multi-step conditioned inference while backpropagating gradients through the one-step predictions. This self-supervised loop proceeds without paired video frames, enabling effective editing learned from unpaired captioned videos alone.

Technical innovations

A bootstrapping procedure that uses a frozen EMA copy of the editing model to generate pseudo-noisy target inputs for training without paired ground truth.
A directional prior loss that supervises the editing model by aligning its velocity differences with those extracted from a frozen pretrained text-to-image model, enabling instruction-following guidance from the base model itself.
A novel gradient routing mechanism based on Straight-Through Estimation to bridge the gap between noisy training states and clean inference conditions by conditioning on clean multi-step EMA outputs while backpropagating through one-step predictions.
Integration of cycle consistency with directional prior loss in a unified framework, ensuring both instruction adherence and source content preservation without paired samples.

Datasets

VideoUFO — ~328 unpaired videos (post-filtering) — used for video editing training
Ultravid video subsets — 119 annotated editing tasks across cartoon, photo-realistic, and 3D-CGI styles — constructed for video editing evaluation
Long-tail style benchmark — 335 images with 487 edit instructions across six rare stylization domains — constructed for style-based editing evaluation
GEdit-Bench (English subset) — diverse real-world editing instructions — used for general image editing evaluation

Baselines vs proposed

Ditto (Bai et al., 2025a): user preference win rate = 24.7%, proposed ByG = 75.3% ± 2.2% on video editing (user study)
Ditto CLIP directional similarity = 0.091 ± 0.007, ByG = 0.104 ± 0.005 (edit success, Table 1)
Ditto DINO similarity = 0.536 ± 0.017, ByG = 0.718 ± 0.012 (source preservation, Table 1)
FLUX-Kontext (Labs et al., 2025): Semantic score = 6.87, ByG = 7.67 (style → photorealistic long-tail editing, Table 2)
Qwen-Image-Edit (Wu et al., 2025): Semantic = 6.86, ByG = 7.67 (style → photorealistic long-tail editing, Table 2)
FlowEdit zero-shot baseline Semantic = 4.27, ByG = 7.67 (style → photorealistic editing, Table 2)
On GEdit-Bench, ByG outperforms FLUX-Kontext in motion change (6.88 vs 4.53) and style change (6.95 vs 5.90), but underperforms on subject removal (1.91 vs 6.94) (Table 3)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.03911.

Fig 1: Bootstrap Your Generator. Left: Supervised training requires paired source–target samples to provide explicit editing

Fig 2: Method overview. Top: Supervised training for image editing. Given a source image x, target image y, and editing instruction

Figs 3–8 (page 2).

Limitations

Performance depends strongly on the pretrained base model’s knowledge; if the base model lacks understanding of a domain, ByG cannot reliably edit toward it.
The caption-based supervision provides weaker signals for object removal edits, since target captions omit removed objects without explicitly describing absence, hurting removal performance.
No evaluation on adversarial robustness or security is presented; the method focuses solely on editing quality under standard conditions.
Training details such as exact hyperparameters, epoch counts, and computational resources are not fully disclosed, limiting reproducibility.
The approach assumes access to strong pretrained text-to-image or text-to-video models, which may not always be available or feasible to deploy.

Open questions / follow-ons

How can the method be extended or adapted to explicitly improve object removal and other edit types where captions provide weak supervisory signals?
Can the bootstrapping and gradient routing techniques be generalized to other generative model architectures beyond flow matching, such as diffusion variants or autoregressive models?
What are the robustness properties and failure modes of this approach under distribution shifts or adversarial manipulations at test time?
How does the method scale when leveraging even larger pretrained base models or applied to more complex multi-modal edits involving audio or 3D data?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this paper introduces a scalable framework to train instruction-based visual editing models without paired supervision, significantly reducing data requirements and expanding applicability to video and rare styles. Understanding ByG’s reliance on intrinsic model knowledge rather than external reward models shows a path toward leveraging pretrained generators as self-supervisors for related generative tasks. The gradient routing approach mitigating train-inference mismatch is also broadly relevant for improving stability and realism in conditional generation tasks. However, limitations on object removal and reliance on strong pretrained checkpoints remind practitioners to carefully evaluate domain coverage and robustness for security-critical applications. Overall, ByG’s unpaired framework highlights a promising direction for developing adaptive visual content transformation tools that can dynamically respond to instructions without expensive annotation efforts.

Cite

@article{arxiv2606_03911,
  title={Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching},
  author={Yoad Tewel and Yuval Atzmon and Gal Chechik and Lior Wolf},
  journal={arXiv preprint arXiv:2606.03911},
  year={2026},
  url={https://arxiv.org/abs/2606.03911}
}
