SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation

Source: arXiv:2605.22743 · Published 2026-05-21 · By Javad Parsa, Enis Simsar, Amir Joudaki, Thomas Hofmann, André M. H. Teixeira

TL;DR

This paper addresses the challenge of composing multiple customized concepts in text-to-image diffusion models, a task hindered by representation interference when independently fine-tuned modules are combined. Existing solutions either require expensive fusion at inference time or freeze parts of the adaptation parameters, limiting expressiveness and fidelity. The authors propose SeqLoRA, a bilevel optimization framework that jointly trains both low-rank adapter factors (A_i and B_i) while enforcing pairwise orthogonality constraints among the learned bases to reduce interference. Theoretically, they provide convergence guarantees and a high-probability bound on catastrophic forgetting, showing that learning the LoRA bases reduces residual interference compared to frozen bases.

Empirically, SeqLoRA significantly improves multi-concept composition fidelity, preserving identity and visual quality across up to 101 concepts, while avoiding costly fusion steps required by prior methods. It also outperforms state-of-the-art baselines on multiple similarity and alignment metrics using the CustomConcept101 dataset. The bilevel optimization approach enables scalable, modular continual learning, making SeqLoRA both theoretically grounded and practically effective for large-scale multi-concept personalization in diffusion models.

Key findings

SeqLoRA achieves state-of-the-art identity preservation on 32 concepts with DINO scores of 0.468 ± 0.021 vs 0.463 for Continual Alternating and 0.436 for Mix-of-Show.
SeqLoRA scales to 101 concepts without out-of-memory errors (OOM) seen in Mix-of-Show and LoRACLR, maintaining stable performance across visual fidelity metrics (Fig 3a).
SeqLoRA's bilevel optimization converges monotonically to a critical point of the constrained problem (Theorem 1).
Learning both LoRA factors (A_i and B_i) reduces residual interference energy more effectively than freezing B_i, per a novel high-probability catastrophic forgetting bound (Theorem 2).
Post-hoc fusion times and complexities are avoided: e.g., Mix-of-Show fusion takes ~33 minutes for 32 concepts while SeqLoRA requires no fusion.
Identity preservation forgetting after sequential learning of 32 concepts is low with CLIP-I dropping only 0.008 ± 0.003 on SeqLoRA (Table 1b).
Orthogonality constraints limit capacity to floor(m/r) concepts per layer, e.g., 192 concepts with layer dimension m=768 and rank r=4, supporting scalability beyond experiments.
Bilevel optimization with cross-Hessian terms yields faster convergence and better stability than alternating minimization baselines (Continual Alternating).
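The capacity figure in the findings above is plain integer arithmetic. A minimal sketch, assuming each concept occupies a full rank-r subspace of an m-dimensional layer; the function name is hypothetical and not taken from the paper:

```python
def max_orthogonal_concepts(layer_dim: int, rank: int) -> int:
    """Upper bound on mutually orthogonal rank-`rank` bases that fit in R^layer_dim.

    Assumes every concept uses exactly `rank` linearly independent directions.
    """
    if layer_dim <= 0 or rank <= 0:
        raise ValueError("layer_dim and rank must be positive")
    return layer_dim // rank  # floor(m / r)


# The summary's example: m = 768, r = 4
print(max_orthogonal_concepts(768, 4))  # 192
```

Doubling the rank halves the bound, which is the expressiveness versus capacity trade-off noted under Limitations.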

Threat model

There is no explicit adversary in this setting. The "threat" is representation interference: independently trained concept adapters interfere destructively when composed naïvely, causing attribute entanglement and forgetting, and therefore loss of concept fidelity when multiple personalized LoRA modules are combined. Past concept data is not available while a new concept is trained, and the method enforces orthogonality between adapter bases to prevent overlap in parameter space and so mitigate interference.

Methodology — deep read

Threat Model & Assumptions: The setting assumes a continual learning scenario where concepts arrive sequentially, each with limited local data (reference images per concept), and prior data is discarded. The adversary is implicit in the interference between multiple concept adapters when composed at inference time. The goal is to prevent interference and forgetting without joint retraining of all concepts.

Data: The evaluation uses the public CustomConcept101 dataset with up to 101 personalized concepts. Each concept's reference set pairs images with concept-specific tokens. Splits and preprocessing details align with prior works but are not exhaustively specified, consistent with public benchmarks.

Architecture & Algorithm: The backbone model is Stable Diffusion v1.5. Fine-tuning per concept is performed using Low-Rank Adaptation (LoRA), parametrizing adapted attention weights as a sum of pretrained base plus low-rank residuals W_i = W_0 + A_i B_i^T with rank r. The main novelty is jointly optimizing both factors (A_i, B_i) subject to orthogonality constraints B_j^T B_i = 0 for all previously learned concepts j < i. This orthogonality enforces disentangled subspaces per concept to mitigate interference.
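To make the parametrization and the constraint concrete, here is a small NumPy sketch. The toy sizes, the random matrices, and the shape convention (A_i of shape d_out x r, B_i of shape d_in x r) are assumptions for illustration; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2  # toy sizes, not the paper's

# Stand-in for a frozen pretrained attention weight W_0.
W0 = rng.standard_normal((d_out, d_in))

# Draw an orthonormal d_in x 2r matrix and split its columns, so that
# B1 and B2 span orthogonal subspaces by construction.
Q, _ = np.linalg.qr(rng.standard_normal((d_in, 2 * r)))
B1, B2 = Q[:, :r], Q[:, r:]
A1 = rng.standard_normal((d_out, r))


def adapted_weight(W0, A, B):
    """W_i = W_0 + A_i B_i^T."""
    return W0 + A @ B.T


def constraint_residual(B_prev, B_new):
    """Frobenius norm of B_j^T B_i; zero when the constraint holds."""
    return float(np.linalg.norm(B_prev.T @ B_new))


W1 = adapted_weight(W0, A1, B1)
delta = A1 @ B1.T  # the low-rank residual added to W_0

print(W1.shape)                      # same shape as W_0
print(np.linalg.matrix_rank(delta))  # at most r
print(constraint_residual(B1, B2))   # numerically zero
```

In SeqLoRA the bases are learned rather than drawn at random; the sketch only shows what the constraint B_j^T B_i = 0 asserts once training is done.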

SeqLoRA formulates this as a bilevel constrained optimization problem: the lower level minimizes denoising loss over A_i given B_i, and the upper level optimizes B_i considering its effect on the lower level solution A_i*(B_i). Closed-form projections enforce orthogonality constraints. The algorithm involves iterative gradient-based updates with cross-Hessian terms to incorporate coupling between factors, and orthogonal projection to ensure bases remain orthogonal.
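One standard closed form for such a projection is (I - Q Q^T) B, where Q is an orthonormal basis for the span of the earlier bases. The sketch below uses that form; whether the paper uses exactly this expression is an assumption, and the function name is hypothetical.

```python
import numpy as np


def project_off_previous(B_new, prev_bases):
    """Remove from B_new every component lying in the span of earlier bases.

    Assumes the stacked earlier bases have full column rank.
    """
    if not prev_bases:
        return B_new
    stacked = np.concatenate(prev_bases, axis=1)  # (d, total_rank)
    Q, _ = np.linalg.qr(stacked)                  # orthonormal basis of that span
    return B_new - Q @ (Q.T @ B_new)              # (I - Q Q^T) B_new


rng = np.random.default_rng(0)
d, r = 8, 2
B1 = rng.standard_normal((d, r))
B2 = project_off_previous(rng.standard_normal((d, r)), [B1])
B3 = project_off_previous(rng.standard_normal((d, r)), [B1, B2])

# Pairwise cross-products are numerically zero.
print(np.linalg.norm(B1.T @ B2), np.linalg.norm(B1.T @ B3), np.linalg.norm(B2.T @ B3))
```

Projecting twice gives the same result as projecting once, so the step can be reapplied after every gradient update without drift.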

Training Regime: For each new concept, K=3 bilevel iterations are run, each involving 2 inner gradient steps for A_i optimization, using regularization weight 1e-8. Training runs on 2x NVIDIA A100 GPUs with hyperparameters fixed. Total per-concept training time increases relative to alternating minimization but avoids costly fusion steps.
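The loop structure described above (K=3 outer iterations, 2 inner steps on A_i, regularization weight 1e-8) can be sketched on a toy problem. This is a simplification under loud assumptions: a quadratic least-squares loss stands in for the denoising loss, the upper-level step is first-order and omits the cross-Hessian term the paper uses, and the step sizes and the name `train_concept` are hypothetical.

```python
import numpy as np


def loss(A, B, T, lam):
    """Toy stand-in for the denoising loss: fit A B^T to a target T."""
    return 0.5 * np.linalg.norm(A @ B.T - T) ** 2 + 0.5 * lam * np.linalg.norm(A) ** 2


def train_concept(T, Q_prev, r, K=3, inner_steps=2, lam=1e-8,
                  lr_A=0.5, lr_B=0.01, seed=0):
    rng = np.random.default_rng(seed)
    d_out, d_in = T.shape
    # Projector onto the orthogonal complement of the earlier bases.
    P = np.eye(d_in) - Q_prev @ Q_prev.T
    # Orthonormal starting basis that already satisfies the constraint.
    B, _ = np.linalg.qr(P @ rng.standard_normal((d_in, r)))
    A = np.zeros((d_out, r))
    history = [loss(A, B, T, lam)]
    for _ in range(K):
        # Lower level: gradient steps on A with B held fixed.
        for _ in range(inner_steps):
            A -= lr_A * ((A @ B.T - T) @ B + lam * A)
        # Upper level: first-order gradient in B (cross-Hessian term omitted),
        # followed by projection back onto the feasible subspace.
        grad_B = (A @ B.T - T).T @ A
        B = P @ (B - lr_B * grad_B)
        history.append(loss(A, B, T, lam))
    return A, B, history


rng = np.random.default_rng(42)
T = 0.5 * rng.standard_normal((6, 8))                   # toy target
Q_prev, _ = np.linalg.qr(rng.standard_normal((8, 2)))   # earlier concepts' basis
A, B, history = train_concept(T, Q_prev, r=2)
print(history)                        # loss after each outer iteration
print(np.linalg.norm(Q_prev.T @ B))   # constraint residual
```

Because the new basis starts inside the feasible subspace and every update is projected, the orthogonality constraint holds throughout training rather than only at the end.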

Evaluation Protocol: Metrics include identity preservation (DINO, DINOv2, DINOv3), holistic visual similarity (DreamSim), image-text alignment (CLIP-I, CLIP-T), and human preference scores (HPSv2, HPSv3). Baselines evaluated include Mix-of-Show, Orthogonal Adaptation, LoRACLR, and standard continual alternating minimization. Experiments vary concept count from 8 to 101. Both quantitative metrics and qualitative visualizations were collected.

Reproducibility: Code release is not explicitly confirmed in the text. The experiments rely on public datasets and Stable Diffusion v1.5 base model. Theoretical proofs and algorithm details are fully given. The CustomConcept101 dataset is publicly referenced.

Example: When a new concept arrives, SeqLoRA initializes its LoRA factors and runs the bilevel updates: inner gradient steps on A_i, then an update of B_i that accounts for its effect on A_i. B_i is projected away from the prior bases to enforce orthogonality, and training settles on factor values that minimize the denoising loss subject to that constraint. The resulting adapter can be composed additively with previous adapters for multi-concept generation at inference time, without retraining or fusion.
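A toy check of the additive composition step, assuming orthonormal, mutually orthogonal bases and random factors (all sizes and names are illustrative): an input lying in one concept's subspace produces the same output under the composed weight as under that concept's adapter alone.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, n_concepts = 5, 12, 3, 3

W0 = rng.standard_normal((d_out, d_in))

# Mutually orthogonal bases B_i, taken as column blocks of one orthonormal matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d_in, r * n_concepts)))
bases = [Q[:, i * r:(i + 1) * r] for i in range(n_concepts)]
factors = [rng.standard_normal((d_out, r)) for _ in range(n_concepts)]


def compose(W0, factors, bases):
    """W = W_0 + sum_i A_i B_i^T, with no fusion or retraining step."""
    return W0 + sum(A @ B.T for A, B in zip(factors, bases))


W_all = compose(W0, factors, bases)

# An input that lies entirely in concept 1's subspace.
x = bases[1] @ rng.standard_normal(r)
alone = (W0 + factors[1] @ bases[1].T) @ x
together = W_all @ x
print(np.allclose(alone, together))
```

The other adapters contribute nothing for this input because B_j^T x vanishes whenever x lies in the span of an orthogonal B_i.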

Technical innovations

A bilevel optimization framework for jointly optimizing both LoRA low-rank factors (A_i, B_i) under orthogonality constraints, improving expressiveness without interference.
Closed-form orthogonal projection of newly learned LoRA basis matrices onto the nullspace of previous bases to enforce zero interference subspace composition.
A novel high-probability Hanson-Wright bound modeling layer-wise residual activations as a matrix sub-Gaussian process to theoretically bound catastrophic forgetting.
Efficient implicit computation of cross-Hessian-vector products via nested automatic differentiation to capture coupling between LoRA factors during optimization.
Demonstration that learned adaptive bases outperform frozen random bases by minimizing residual interference aligned with local principal components in feature space.
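The cross-Hessian-vector product mentioned above is the directional derivative of the A-gradient along a perturbation of B. The paper computes it by nested automatic differentiation; the sketch below swaps in a different technique to stay dependency-light: an analytic expression for a toy quadratic loss, checked against a central finite difference. The loss and all function names are assumptions for illustration.

```python
import numpy as np


def grad_A(A, B, T):
    """Gradient of 0.5 * ||A B^T - T||_F^2 with respect to A."""
    return (A @ B.T - T) @ B


def cross_hvp_analytic(A, B, T, V):
    """Directional derivative of grad_A along a perturbation V of B."""
    return A @ V.T @ B + (A @ B.T - T) @ V


def cross_hvp_fd(A, B, T, V, eps=1e-4):
    """Central finite difference of grad_A in direction V."""
    return (grad_A(A, B + eps * V, T) - grad_A(A, B - eps * V, T)) / (2 * eps)


rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))
B = rng.standard_normal((8, 2))
V = rng.standard_normal((8, 2))  # direction in B-space
T = rng.standard_normal((6, 8))

exact = cross_hvp_analytic(A, B, T, V)
approx = cross_hvp_fd(A, B, T, V)
print(np.max(np.abs(exact - approx)))  # small discrepancy, rounding only
```

Neither version forms the full cross-Hessian, which would have (d_out * r) x (d_in * r) entries; only its action on one direction is needed.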

Datasets

CustomConcept101: 101 concepts with per-concept reference image sets; public benchmark for personalized concept generation

Baselines vs proposed

Continual Alternating: DINO = 0.463 ± 0.020 vs SeqLoRA: 0.468 ± 0.021
Mix-of-Show: DINO = 0.436 ± 0.023 vs SeqLoRA: 0.468 ± 0.021
Orthogonal Adaptation: DINO = 0.428 ± 0.023 vs SeqLoRA: 0.468 ± 0.021
LoRACLR: DINO = 0.434 ± 0.023 vs SeqLoRA: 0.468 ± 0.021
At 101 concepts, Mix-of-Show and LoRACLR suffer OOM errors while SeqLoRA maintains stable performance
Fusion time for Mix-of-Show at 32 concepts ~33 mins vs SeqLoRA requires no fusion step
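The DINO margins implied by the numbers above can be computed directly. The sketch below only restates the reported means and spreads; treating the ± values as a comparable spread measure is an assumption, since the summary does not say how they were computed.

```python
# Reported DINO scores at 32 concepts: (mean, reported +/- value).
dino = {
    "Continual Alternating": (0.463, 0.020),
    "Mix-of-Show": (0.436, 0.023),
    "Orthogonal Adaptation": (0.428, 0.023),
    "LoRACLR": (0.434, 0.023),
}
seqlora_mean, seqlora_spread = 0.468, 0.021

# Absolute gap between SeqLoRA and each baseline, rounded to 3 decimals.
gaps = {name: round(seqlora_mean - mean, 3) for name, (mean, _) in dino.items()}

# Whether each gap is smaller than SeqLoRA's own reported spread.
within_spread = {name: gap <= seqlora_spread for name, gap in gaps.items()}

for name in dino:
    print(f"{name}: gap={gaps[name]:.3f}, within SeqLoRA spread={within_spread[name]}")
```

On these figures the margin over Continual Alternating (0.005) is smaller than the reported spread, while the margins over the fusion-based baselines are larger.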

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.22743.

Fig 1: Qualitative comparison of multi-concept image generation across different methods for [caption truncated]

Figs 2 to 8: on page 2 of the paper; no captions recovered.

Limitations

Orthogonality constraint imposes an upper bound on number of concepts per layer (e.g., ~192 with typical layer size and rank), which could limit ultra-large scale applications.
Bilevel optimization introduces additional computational overhead per concept compared to simpler alternating minimization, increasing training time (~660s vs ~78s per concept).
Current experiments focus on Stable Diffusion v1.5; applicability to newer or alternative architectures is not yet demonstrated.
Theoretical results rely on mild but standard assumptions about differentiability and distributions; practical scenarios might violate these.
Concept ordering in the sequential learning setup may affect final performance; robustness to ordering is not explored.
No explicit adversarial or distribution shift robustness evaluation; only standard continual learning scenarios are tested.

Open questions / follow-ons

How robust is SeqLoRA to different orderings of concepts during sequential adaptation?
Can the bilevel optimization framework be extended to newer or larger diffusion architectures beyond Stable Diffusion v1.5?
What are the limits of concept capacity posed by the orthogonality constraints in even larger scale lifelong personalization?
Can adversarial perturbations or domain shifts be incorporated into the model to test robustness of multi-concept composition?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, SeqLoRA’s approach to composing multiple disentangled, high-fidelity concepts via orthogonal low-rank adaptation could inspire improved mechanisms for generating diverse, multi-attribute synthetic images with controlled attribute mixing. Its bilevel optimization methodology and theoretical bounds on forgetting address catastrophic interference challenges common to modular adaptation approaches. Practically, SeqLoRA’s ability to scale to over 100 personalized concepts without costly fusion offers a promising avenue for building modular, federated personalization pipelines where user-specific adaptations must be combined flexibly without retraining or leakage.

The paper sheds light on the fundamental trade-offs between expressiveness and interference in modular generative model tuning, a challenge analogous to maintaining security boundaries between distributed adversarially trained bot-detection modules or CAPTCHAs incorporating multiple factorized challenge components. SeqLoRA’s theoretical insights and efficient projections may guide future customized CAPTCHA generation frameworks balancing multi-concept modularity, robustness, and computational feasibility.

Cite

bibtex

@article{arxiv2605_22743,
  title={SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation},
  author={Javad Parsa and Enis Simsar and Amir Joudaki and Thomas Hofmann and André M. H. Teixeira},
  journal={arXiv preprint arXiv:2605.22743},
  year={2026},
  url={https://arxiv.org/abs/2605.22743}
}
