FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

Source: arXiv:2606.20506 · Published 2026-06-18 · By Jinghong Lan, Wei Cheng, Yunuo Chen, Ziqi Ye, Peng Xing, Yixiao Fang et al.

TL;DR

FreeStyle addresses style-content dual-reference image generation, where a synthesized image must preserve the content and structure of a content reference image and adopt the style of a separate style reference image, while also following a guiding text instruction. The task is difficult because content and style information must be disentangled, and because semantic leakage (content from the style reference corrupting the output) must be avoided. A major obstacle has been the lack of large-scale triplet datasets with clean content-style separation and high style diversity.

FreeStyle mines the open-source community's Low-Rank Adaptation (LoRA) weights as compositional anchors for styles and content, extracting a large dataset of Style-Reference and Content-Reference triplets from multiple base models. This yields broad long-tail style coverage and approximately 478k filtered triplets. To combat content leakage, FreeStyle uses a two-stage curriculum with stage-specific disentanglement: first, an attention-level enrichment constraint suppresses over-attention to style tokens during the style-transfer stage; second, a frequency-aware Rotary Positional Embedding (RoPE) modulation reduces patch-level copying from the style reference in dual-reference generation.

The authors also introduce a benchmark with new metrics, including a style-invariant Content Alignment Score (CAS) and a Vision Language Model (VLM) based Verification Score, to evaluate style fidelity, content preservation, instruction following, and leakage rejection. The reported experiments show that FreeStyle strikes a strong empirical balance across these axes and surpasses prior methods in content-style dual-reference image generation.

Key findings

The FreeStyle mining pipeline generates and filters approximately 478k content-style dual-reference triplets spanning the FLUX (273k), Illustrious (172k), and Qwen (33k) base models.
In style-reference generation (Stage 1), the attention-level enrichment constraint caps the style-token attention mass (normalized by token group size) at 0.6 in late denoising steps, reducing semantic leakage.
Entropy-based regularization keeps the normalized entropy of style-reference attention between 0.06 and 0.14, so the distribution is neither collapsed nor uniform, preserving style diversity.
In dual-reference generation (Stage 2), frequency-aware RoPE modulation scales positional embeddings with shf < 1 for high-frequency terms and slf > 1 for low-frequency terms, suppressing the patch-level copying that leaks content from the style reference.
Filtering LoRA combinations yields a success rate of about 0.4 for pairs of stable content and style LoRAs, which keeps dual-reference triplet quality reliable.
Bilateral consistency filtering (DINOv2 content similarity > 0.87 and ONEIG style similarity > 0.92) was used to verify clean style-transfer triplets during Stage 1 data collection.
The benchmark introduces the Content Alignment Score (CAS) and a VLM-based Verification Score that separate style similarity, content preservation, and leakage, allowing quantitative analysis of the trade-offs among them.
Attention maps show that leakage cases exhibit broader, more persistent style-reference attention bands during late denoising steps, whereas leakage-free cases show focused bands (Fig 5).
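
The bilateral consistency filter listed above can be sketched as a pair of threshold checks. This is a minimal illustration, not the authors' code: it assumes the similarity is cosine similarity over precomputed embeddings (the summary does not state the similarity function), and the function names and plain arrays standing in for DINOv2 and ONEIG features are hypothetical.

```python
import numpy as np

CONTENT_THRESHOLD = 0.87  # DINOv2 content-similarity threshold quoted in the summary
STYLE_THRESHOLD = 0.92    # ONEIG style-similarity threshold quoted in the summary


def cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors (assumed metric; not confirmed by the source)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def passes_bilateral_filter(content_src, content_out, style_ref, style_out):
    """Keep a triplet only if BOTH the content check and the style check pass.

    All four arguments are hypothetical precomputed embeddings:
    content_* stand in for DINOv2 features, style_* for ONEIG features.
    """
    content_sim = cosine_similarity(content_src, content_out)
    style_sim = cosine_similarity(style_ref, style_out)
    return content_sim > CONTENT_THRESHOLD and style_sim > STYLE_THRESHOLD
```

Requiring both checks is what makes the filter bilateral: an output that copies the style well but drifts from the source content, or the reverse, is discarded.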

Threat model

The adversary is the style reference input which can leak semantic content (objects, layout) into the generated image undesirably, undermining the preservation of the content reference. The model assumes the attacker cannot corrupt the content reference and that the text instruction is trusted. The threat focuses on suppressing cross-reference leakage without compromising style transfer quality.

Methodology — deep read

The authors approach style-content dual-reference generation with a carefully designed framework consisting of data mining, model training, and benchmarking:

Threat Model & Assumptions: The adversary is an untrusted style reference that may inject semantic content through the style branch, causing unwanted copying into the generated image and violating content fidelity. The model must not let style-reference semantics replace the content structure.
Data Collection & Processing: The authors first gather a large corpus of content images from public websites covering broad categories (animals, humans, scenes, objects). For Stage 1 style-transfer data, they use Nano Banana Pro to stylize content images with ~600 validated style triggers. Outputs are filtered via bilateral consistency checks: content similarity > 0.87 (DINOv2 features) and style similarity > 0.92 (ONEIG encoder). Style references are sampled from different sources to ensure content-style independence. For Stage 2, community LoRA mining is performed by crawling 68.6k LoRA weights from platforms such as Civitai and TensorArt. LoRAs are categorized as style or content via human expert validation on 3x3 preview grids (a LoRA counts as stable if at least 7 of 9 images are consistent). LoRAs are then ranked by aesthetic score and sampled non-uniformly. ComfyUI workflows batch-generate 20+ images per LoRA using a 40k-prompt vocabulary, and the generated images are verified by the Qwen3-VL vision-language model for quality and content/style consistency. Style-content LoRA pairs are filtered via bilateral content and style consistency checks, yielding a combination success rate of ~40%. Finally, target images are synthesized from the combined LoRAs and filtered again for consistency.
Architecture & Training: The base model is a Qwen-Image variant with MMDiT blocks and flow-matching diffusion loss. Inputs include text prompt, content-reference, and style-reference images concatenated as multi-image input. Training is split into two distinct stages:

Stage 1 (style-reference training): Focuses solely on style-transfer data. To reduce semantic leakage from style reference, an attention-level enrichment constraint is introduced. They compute group-wise attention enrichment scores for style-reference tokens normalized by group size at the first transformer block during denoising timesteps, and penalize over-attention beyond 0.6 using a two-sided squared hinge loss weighted toward late denoising steps. To preserve attention distribution diversity over style tokens, an entropy loss constrains normalized entropy of style attention per query between 0.06 and 0.14. The final loss combines flow-matching diffusion loss with these two attention regularizers weighted at 0.1.
Stage 2 (dual-reference training): Combines LoRA-mined dual-reference triplets with style-transfer data. Content reference now absorbs substantial attention, changing leakage dynamics. Instead of attention magnitude, leakage arises from patch-level copying encoded by high-frequency Rotary Positional Embeddings (RoPE). They implement frequency-aware RoPE modulation by scaling style-reference positional embeddings with a smooth frequency-dependent curve that suppresses high-frequency components (<1 scaling) and amplifies low-frequency ones (>1 scaling), parametrized by hyperparameters shf, slf, and a smoothness β. This modulation is fixed for all denoising steps and applied only to the style-reference branch.
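
The frequency-aware RoPE modulation described for Stage 2 can be illustrated with a small NumPy sketch. The summary gives only the shape of the curve (scale below 1 for high frequencies, above 1 for low frequencies, with a smoothness β), so everything concrete here is an assumption: the sigmoid interpolation, the choice to scale the rotation angles, the default values 0.5, 1.2, and 10.0, and all function names are hypothetical.

```python
import numpy as np


def rope_inv_freq(dim, base=10000.0):
    """Standard RoPE inverse frequencies; index 0 is the highest frequency."""
    return base ** (-np.arange(0, dim, 2, dtype=np.float64) / dim)


def frequency_scale(num_bands, s_hf=0.5, s_lf=1.2, beta=10.0):
    """Smooth per-band scale: close to s_hf at the high-frequency end,
    close to s_lf at the low-frequency end.

    The sigmoid form and the default values are assumptions; the summary
    only states s_hf < 1, s_lf > 1, and a smoothness parameter beta.
    """
    u = np.linspace(0.0, 1.0, num_bands)            # 0 = highest freq, 1 = lowest freq
    gate = 1.0 / (1.0 + np.exp(-beta * (0.5 - u)))  # ~1 for high freq, ~0 for low freq
    return s_lf + (s_hf - s_lf) * gate


def style_rope_angles(positions, dim, s_hf=0.5, s_lf=1.2, beta=10.0):
    """Rotation angles for style-reference tokens with modulated frequencies.

    The same modulation is used for every denoising step, and content-reference
    tokens would keep the unscaled rope_inv_freq(dim).
    """
    inv_freq = rope_inv_freq(dim)
    scale = frequency_scale(inv_freq.shape[0], s_hf, s_lf, beta)
    return np.outer(np.asarray(positions, dtype=np.float64), inv_freq * scale)
```

Under these assumptions, damping the high-frequency bands blurs the fine positional signal that would let a query lock onto an exact style-reference patch, while the low-frequency bands that carry coarse layout are kept or slightly amplified.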

Evaluation Protocol: They introduce a benchmark with standardized prompts, fixed reference sets, and multiple open-source base models. Metrics include:

Content Alignment Score (CAS) adapted from CSGO, measuring structural agreement invariant to style,
Style similarity scores,
Aesthetics scoring,
Instruction following,
VLM-based Verification Score that separately quantifies leakage rejection by checking for semantic copying from style reference.

Statistical tests or cross-validation details were not specified. Comparative baselines include prior style-reference and dual-reference methods, though explicit numbers are sparse in the summary.

Reproducibility: Code, weights, dataset, and benchmark will be publicly released, along with ComfyUI workflows. The mined LoRA dataset is partially publicly accessible via community platforms but filtered. The detailed filtering and generation parameters are transparently documented.

Concrete example: In Stage 1, a landscape content image is stylized into, for example, American Tattoo Art style using Nano Banana Pro with the prompt template “transfer into American Tattoo Art style.” The output is filtered by verifying content similarity to the source (DINOv2, threshold 0.87) and style similarity to anchor style images (ONEIG, threshold 0.92). The filtered triplet (content, style reference, stylized target) is then used to train the diffusion model with the attention-level enrichment loss, which caps style-token attention at 0.6 in late denoising steps, improving style fidelity while suppressing semantic leakage.
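
The Stage 1 regularizers used in this example can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the per-query style attention mass is used as a stand-in for the group-size-normalized enrichment score (whose exact definition is not given in this summary), the weighting toward late denoising steps is omitted, and all function names are hypothetical. Only the 0.6 cap, the 0.06 to 0.14 entropy band, and the 0.1 weight come from the text.

```python
import numpy as np


def band_penalty(x, low=None, high=None):
    """Squared hinge: zero inside [low, high], quadratic outside either bound."""
    x = np.asarray(x, dtype=np.float64)
    penalty = np.zeros_like(x)
    if low is not None:
        penalty += np.maximum(low - x, 0.0) ** 2
    if high is not None:
        penalty += np.maximum(x - high, 0.0) ** 2
    return penalty


def normalized_entropy(attn_rows, eps=1e-12):
    """Per-row entropy after renormalizing each row, divided by log(n).

    Assumes at least two columns, otherwise log(n) is zero.
    """
    p = np.asarray(attn_rows, dtype=np.float64)
    p = p / (p.sum(axis=-1, keepdims=True) + eps)
    h = -(p * np.log(p + eps)).sum(axis=-1)
    return h / np.log(p.shape[-1])


def stage1_regularizers(attn, style_mask, cap=0.6, ent_low=0.06, ent_high=0.14):
    """attn: (queries, keys) with rows summing to 1; style_mask: boolean over keys."""
    style_attn = attn[:, style_mask]
    # Assumed stand-in for the enrichment score: attention mass on style keys.
    style_mass = style_attn.sum(axis=-1)
    enrich_loss = band_penalty(style_mass, high=cap).mean()
    entropy_loss = band_penalty(normalized_entropy(style_attn), ent_low, ent_high).mean()
    return enrich_loss, entropy_loss


def total_loss(flow_loss, enrich_loss, entropy_loss, weight=0.1):
    """Flow-matching loss plus both attention regularizers, each weighted by 0.1."""
    return flow_loss + weight * (enrich_loss + entropy_loss)
```

In this sketch a query that puts 0.9 of its attention on style tokens incurs an enrichment penalty of (0.9 - 0.6)^2, while a query at 0.5 incurs none; the entropy term similarly penalizes only values outside the 0.06 to 0.14 band.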

This thorough stepwise approach with community data mining, attention-based regularization, and frequency-aware positional modulation distinguishes FreeStyle’s methodology.

Technical innovations

Mining community LoRA weights as stable, compositional anchors to construct large-scale style-content dual-reference triplets with broad style and content diversity.
Design of a two-stage training curriculum that addresses distinct semantic leakage pathways via stage-specific disentanglement: attention-level enrichment constraint in style-transfer stage and frequency-aware RoPE modulation in dual-reference stage.
Introduction of group-wise attention enrichment scores normalized by token group size to quantitatively capture and constrain disproportionate style-reference attention driving semantic leakage.
A benchmark with style-invariant Content Alignment Score (CAS) and novel VLM-based Verification Score enabling finer-grained evaluation of style fidelity, content preservation, instruction compliance, and leakage rejection in dual-reference generation.

Datasets

FreeStyle Style-Transfer Dataset — 600k+ triplets collected via Nano Banana Pro stylization and bilateral consistency filtering — constructed from public web images and style triggers
FreeStyle Dual-Reference Dataset — approximately 478k curated triplets mined from 68.6k community LoRA weights filtered on three base models (FLUX 273k, Illustrious 172k, Qwen 33k) — mined from open-source LoRA repositories Civitai, TensorArt
Open Images Dataset — used as a taxonomy reference to categorize content vocabulary during community LoRA mining

Baselines vs proposed

Baseline: Nano Banana Pro stylized images with no attention constraint — higher content leakage than proposed attention-level enrichment constrained model.
Baseline: Style-transfer generation without frequency-aware RoPE modulation — demonstrates increased patch-level style content leakage vs proposed dual-stage model.
The Content Alignment Score (CAS, adapted from CSGO) improves moderately from prior methods to FreeStyle; exact numeric values are not specified.
The VLM-based Verification Score shows effective suppression of semantic leakage in FreeStyle compared with variants without disentanglement; details are summarized in Fig 5.

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.20506.

Fig 1

Fig 1: Overview of FreeStyle. 1 We collect community-created style and content LoRAs from multiple platforms and automatically

Limitations

The reliance on community LoRA models may inherit biases or instabilities intrinsic to those models, despite filtering.
The two-stage disentanglement strategy addresses specific leakage modes but may not fully generalize to arbitrary style-content references or unseen domains.
Benchmark evaluation, while more comprehensive, may lack evaluation under strong adversarial content-style leakage attempts or extreme distribution shifts.
Code release timing may limit immediate community reproducibility; closed-source base models (e.g. Qwen-Image) may restrict replication.
Statistical significance or robust cross-validation results were not detailed in the public text.
Filtering success rates of 0.4 for combined LoRA pairs indicate substantial resource consumption and potential dataset gaps.

Open questions / follow-ons

Can the disentanglement techniques generalize to video or multi-frame style-content reference generation settings?
How effective is FreeStyle against adversarially crafted style references designed explicitly to induce leakage?
Could learnable or adaptive RoPE modulation dynamically tuned per input further improve leakage suppression?
What is the trade-off between leakage control and creative stylization diversity at extreme style-content mismatches?

Why it matters for bot defense

From a bot-defense and CAPTCHA perspective, FreeStyle’s work on disentangling content and style references offers useful insight into the challenges of multi-modal conditional image generation. The two-stage disentanglement approach illustrates how semantic leakage can be controlled so that the intended visual input signals are preserved and unintended content is not propagated. Such control could inform techniques for detecting or mitigating adversarially crafted synthetic images used to circumvent visual challenge-response tests or to generate adversarial inputs. The community LoRA mining pipeline also demonstrates a scalable way to build broad, diverse datasets with fine-grained separation of control signals, which could be used to train robust CAPTCHA models that require compositional image understanding. Finally, the benchmark’s content alignment and leakage verification metrics suggest rigorous evaluation protocols that bot-defense systems might adopt to better distinguish genuine user-generated content from synthesized or tampered images.

Cite

bibtex

@article{arxiv2606_20506,
  title={FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining},
  author={Jinghong Lan and Wei Cheng and Yunuo Chen and Ziqi Ye and Peng Xing and Yixiao Fang and Rui Wang and Yufeng Yang and Xuanyang Zhang and Xianfang Zeng and Difan Zou and Gang Yu and Chi Zhang},
  journal={arXiv preprint arXiv:2606.20506},
  year={2026},
  url={https://arxiv.org/abs/2606.20506}
}
