Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

Source: arXiv:2605.30257 · Published 2026-05-28 · By Ciara Rowles, Reshinth Adithyan, Nikhil Pinnaparaju, Vikram Voleti, Mark Boss

TL;DR

Stable-Layers addresses the challenge of image layer decomposition—separating an image into multiple RGBA layers for editing—without requiring paired ground-truth supervision. The core novelty is a reinforcement learning framework that fine-tunes a pretrained decomposition model using only feedback scores from a vision-language model (VLM), sidestepping the need for synthetic or annotated data. They build on Qwen-Image-Layered as the base model and apply Flow-GRPO reinforcement learning with LoRA adaptation, sampling multiple candidate decompositions per image and scoring them with a two-phase VLM reward that improves within-group score variance for effective policy updates. This approach simultaneously optimizes semantic layer separation, alpha mask cleanliness, background inpainting, feature distribution across layers, and overall content validity.

Experimental results on the Crello dataset and a held-out LAION-Aesthetics benchmark demonstrate that Stable-Layers produces significantly improved decompositions compared to the base model, yielding fewer blank/artifact layers, stronger semantic separation, and lower per-layer reconstruction error. The two-phase VLM scoring procedure is critical to overcoming compressed reward signals that otherwise limit reinforcement learning progress. Unlike previous supervised or synthetic-target methods, Stable-Layers learns directly from unlabeled images and a black-box VLM judge, generalizing the training recipe for editing-oriented image generators.

The work also contributes a stabilization modification to Flow-GRPO's RatioNorm for high-dimensional latent sequences and demonstrates that prompt conditioning and relative score calibration significantly impact training effectiveness. Qualitative examples show cleaner, semantically distinct layers and plausible background inpainting, confirming the VLM reward captures editing-relevant decomposition quality. Overall, Stable-Layers offers a practical framework to leverage rich VLM critique signals to enhance image decomposition without costly paired data.

Key findings

Stable-Layers reduces bad layers (blank or glazed) per decomposition from ~1.65 to ~0.4 on a held-out LAION-Aesthetics set during training (Fig 4).
Per-layer RGB L1 reconstruction error on the Crello test set improves on all layer counts (2-4 layers) compared to base Qwen-Image-Layered; e.g., mean error at 3 layers decreases from 0.0879 to 0.0767 (Table 1).
Layer 0 (background) quality metrics improve substantially, e.g., Layer 0 inpainting quality rises from ~0.38 to ~0.62 on held-out data (Fig 4, Table 2).
Two-phase VLM scoring with relative grid calibration yields higher Layer 0 SSIM (~0.52 vs 0.45) and quality vs scoring without calibration (Table 4).
LoRA adaptation with rank r=16 and α=16 focused on attention and feed-forward layers suffices to fine-tune with RL while freezing other parameters.
Group Relative Policy Optimization (GRPO) with modified RatioNorm restoring O(1) ratio magnitudes stabilizes training with multi-layer packed latents.
Detailed text prompt conditioning mirroring VLM scoring rubric underperforms simpler basic prompts for policy conditioning (Table 3).
Stable-Layers fills all requested output layers robustly, achieving substantially higher feature distribution evenness (~0.73 vs 0.05) compared to LayerD which often returns single-layer decompositions (Table 2).

Methodology — deep read

Threat Model & Assumptions: The adversary is not explicitly modeled, as this is a model refinement approach rather than a defensive mechanism. Assumed is access to a pretrained layer decomposition model (Qwen-Image-Layered) and a fixed proprietary vision-language model (VLM) that can score image decomposition quality. The framework assumes the VLM cannot be tuned or altered but its score distribution may drift over versions.

Data: Training uses only unlabeled images from Fine-T2I, a large-scale filtered dataset of photographs and artworks. Images are resized to 640x640 and normalized to [-1,1]. The model learns to output 2-5 RGBA layers per image during training, sampled uniformly per step. Held-out evaluation uses the Crello layered image dataset with ground-truth PSD layer annotations for quantitative metrics, and a 480-image LAION-Aesthetics subset for qualitative benchmarks.

Architecture/Algorithm: Starting with the pretrained Qwen-Image-Layered model, which combines a 3D RGBA VAE compressing layers spatially with a flow-matching transformer operating on tokenized latent sequences plus prompt conditioning via a text encoder. To fine-tune, they apply Low-Rank Adaptation (LoRA) with rank r=16 and scaling α=16 on attention projection and feed-forward layers, freezing other parameters.

Training utilizes Flow-GRPO, an SDE-augmented variant of flow matching providing tractable per-step log probabilities for reinforcement learning. Group Relative Policy Optimization (GRPO) is employed with a novel sum-and-rescale variant of RatioNorm that maintains O(1) magnitude importance ratios despite high-dimensional packed latent sequences, preventing signal collapse. The policy samples a group G of candidate decompositions via SDE flow matching per image, which are then scored by the VLM reward.

Reward Design: The VLM reward is structured as a two-phase protocol. Phase 1 scores each candidate independently by rubric across five 0-5 criteria: semantic separation, alpha cleanliness, background inpainting, feature distribution, content validity. Scores are normalized to [0,1]. Phase 2 performs relative calibration: candidates are tiled in grids for side-by-side VLM rescoring conditioned on Phase 1 scores, boosting within-group score variance necessary for GRPO advantage normalization and learning.

Training Regime: The replay buffer stores trajectories of SDE samples and log probabilities. For each training step, they sample G candidates, score per the two-phase reward, compute within-group normalized advantages, and update LoRA parameters with a clipped surrogate policy gradient with KL regularization omitted via gradient reweighting. Typical group size G and batch sizes per step are not explicitly detailed. Training runs for 200 steps were used in ablations and longer in production. The base model supports up to 20 layers but training is capped at 5 layers for compute efficiency.

Evaluation: Metrics include per-layer RGB L1 reconstruction error against ground-truth layers with best-match assignment on Crello test set stratified by output layer count, number of bad layers (blank or artifact-heavy), feature distribution evenness, and background inpainting quality on the LAION-Aesthetics set. Qualitative comparisons illustrate semantic separation and background plausibility. Baselines include the base Qwen-Image-Layered checkpoint and Flow-GRPO fine-tuning without grid calibration. The LayerD model is compared to illustrate differences in decomposition strategies.

Reproducibility: The authors defer full architecture details to referenced papers (Qwen-Image-Layered, Flow-GRPO, LoRA). Complete reward prompt rubrics and calibration grid details are provided in appendices. Training hyperparameters and schedules appear in supplemental but code or pretrained weights are not explicitly stated as released. The VLM used is proprietary with fixed snapshot dependency.

Concrete Example: For an input image, the system samples G candidate 2-5 layer decompositions by integrating the SDE flow with the current policy (LoRA-adapted transformer). Each candidate's N layers are composited onto white backgrounds and individually scored against the rubric (semantic separation, alpha cleanliness, etc.) by the VLM. Then the candidates are tile-arranged into a comparison grid and rescored relative to one another, yielding calibrated scores. These scores yield group-relative advantages normalized within G, which weigh the clipped policy gradient update for the LoRA parameters. This procedure iterates, progressively improving decomposition quality metrics such as reduced blank layers and better background inpainting.

Technical innovations

Two-phase VLM scoring protocol combining rubric-based absolute per-sample scores with a relative grid calibration step that restores within-group variance for stable GRPO policy updates.
Modified RatioNorm for GRPO that sums and rescales log-probabilities over spatially packed RGBA latent sequences, maintaining effective O(1) importance ratio variance in high-dimensional flow matching RL.
Application of LoRA adaptation focused on attention and feed-forward layers of a flow matching transformer to enable efficient RL fine-tuning without full model retraining.
Direct reinforcement learning fine-tuning of a layer decomposition model using black-box VLM feedback instead of paired supervision or synthetic targets, enabling training on unlabeled real images.

Datasets

Fine-T2I — large-scale filtered dataset of photographs and artworks — used for unlabeled RL fine-tuning training.
Crello — layered image dataset with ground-truth PSP layers — used for quantitative evaluation of reconstruction error.
LAION-Aesthetics subset (n=480) — held-out set for qualitative evaluation of bad layers, feature distribution, and background inpainting.

Baselines vs proposed

Qwen-Image-Layered (base model): mean per-layer RGB L1 error = 0.0879 (3 layers) vs Stable-Layers: 0.0767
Flow-GRPO without calibration: Layer 0 SSIM ≈ 0.45 vs Stable-Layers with two-phase calibration: ≈ 0.52 (Table 4)
LayerD: feature distribution evenness = 0.0585 vs Stable-Layers: 0.7339 (Table 2)
LayerD: Layer 0 inpainting quality = 0.7136 vs Stable-Layers: 0.6148 (Table 2) reflecting different design tradeoffs

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.30257.

Fig 1

Fig 1: Stable-Layers. We finetune a layer decomposition model using Flow-GRPO and a VLM

Fig 2

Fig 2 (page 1).

Fig 3

Fig 3 (page 1).

Fig 4

Fig 4 (page 1).

Fig 5

Fig 5 (page 1).

Fig 6

Fig 6 (page 1).

Fig 7

Fig 7 (page 1).

Fig 8

Fig 8 (page 1).

Limitations

Relies on a proprietary fixed VLM as reward model, incurring API cost and vulnerability to score drift across model versions.
Evaluation is primarily via automated quantitative metrics and qualitative visual inspection; no direct human user study or editing utility assessment performed.
Training only performed with up to 5 output layers due to compute limitations, while base model supports up to 20; behavior on higher layer counts is not validated.
Unclear how robust the method is to varying VLM architectures or domains outside photographic/artistic images tested.
The grid calibration step introduces extra computation and complexity during training that may limit scalability or speed.
Limited ablation on group size G and impact on reward signal quality and training stability is reported.

Open questions / follow-ons

Can the two-phase VLM reward design be automated or adapted to new criteria without manual rubric construction?
How does stable-layers fine-tuning perform on higher layer count decompositions (e.g., up to 20 layers) and more complex scenes?
What is the robustness of the approach across different VLMs or open-source vision-language models, including score stability over time?
Can the relative grid calibration be replaced by more efficient pairwise preference learning or token-level logit-based rewards to improve scalability?

Why it matters for bot defense

This work demonstrates a compelling technique for applying vision-language model feedback as a reward signal to improve complex image decompositions without explicit supervision. Bot-defense and CAPTCHA engineers may find the two-phase VLM scoring and reinforcement learning approach relevant for designing systems that learn from black-box multi-criteria judges without direct ground-truth labels. The method of creating richer within-group reward variance via relative calibration can inspire more stable and sample-efficient learning in generative or discriminative tasks where absolute scalar reward signals collapse.

While Stable-Layers focuses on image editing decomposition, its general framework—fine-tuning conditional generators using VLM-provided multi-dimensional rubric scores evaluated jointly—can be adapted for CAPTCHA challenges involving layered image generation, semantic disentanglement, and fine-grained visual separation. Understanding how to construct and stabilize VLM-derived feedback signals is crucial for reliable training in adversarial or semi-supervised bot-detection scenarios where labeled paired inputs are scarce or unavailable.

Cite

bibtex

@article{arxiv2605_30257,
  title={ Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning },
  author={ Ciara Rowles and Reshinth Adithyan and Nikhil Pinnaparaju and Vikram Voleti and Mark Boss },
  journal={arXiv preprint arXiv:2605.30257},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.30257}
}

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning ​

TL;DR ​

Key findings ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​