FitVTON: Fit-aware Virtual Try-On via Body-Garment Size Control

Source: arXiv:2606.12012 · Published 2026-06-10 · By Yiqun Ning, Ao Shen, Chenhang He, Lei Zhang

TL;DR

FitVTON addresses a critical gap in virtual try-on (VTON) research where current diffusion-based methods achieve impressive visual realism but fail to produce physically authentic garment fit across diverse body shapes. Unlike prior work that treats try-on primarily as a 2D mask-guided inpainting problem, FitVTON explicitly models the relative garment-body size relationship using parametric garment simulations and structured text prompts encoding fit attributes like tightness and drape. The approach leverages synthetic fitting triplets generated from dynamic cloth-body simulations using GarmentCode and SMPL-X, enabling a flow-matching diffusion model to learn fit-consistent garment deformation patterns. Auxiliary garment and exposed-body mask prediction branches further guide fitting geometry, while a second texture rectification stage adapts the model’s visual realism from simulation to real images. Evaluation on a new real-world benchmark dataset, FittingEffect3K, with a novel vision-language model (VLM)-based fit evaluation protocol and human studies demonstrates that FitVTON achieves more accurate sizing, silhouette preservation, and garment-body alignment than strong academic baselines and a commercial system (Nano Banana), while maintaining competitive image quality.

The key novelty is the use of fit-aware simulation data combined with dual-branch mask supervision to explicitly enforce physically plausible garment fitting behaviors that vary realistically with body shape and pose. This moves beyond mere texture transfer or inpainting-driven appearance realism to authentic try-on that respects underlying garment deformation, tightness, and silhouette changes. The multi-stage training pipeline balances learning geometric fit priors from synthetic data with improving realism on real images without losing fit control. The newly introduced FittingEffect3K dataset and VLM-based fit scoring protocol provide a focused and scalable way to quantitatively assess try-on fitting fidelity.

Key findings

FitVTON achieves an overall fitting consistency score of 3.08 on FittingEffect3K, outperforming Nano Banana (2.82) and other academic baselines (2.30–2.74).
Dual-branch mask supervision improves overall fit score from 2.87 to 3.08, boosting upper-body fit from 2.96 to 3.22, lower-body from 2.79 to 2.99, and dress from 2.81 to 2.90.
Texture rectification reduces FID/KID on DressCode from 11.66/7.72 to 5.21/1.21 and on VITON-HD from 16.15/7.04 to 13.64/4.90, enhancing photorealism without harming fit control.
FitVTON’s use of 78,080 simulation-generated training triplets with 19 garments, 16 body shapes, and 10 poses enables controllable garment-body size variation learning.
The FittingEffect3K real-world evaluation dataset contains 3,350 try-on triplets from 10 participants with varied body shapes and poses, enabling direct fit-oriented assessment.
In human preference studies, FitVTON is selected 33.3% of the time when compared against 5 baselines and a commercial model, which is consistent with its lead on the VLM-based fit scores.
FitVTON with mask supervision consistently improves garment–body alignment, tightness/looseness consistency, silhouette consistency, and reduces local fitting artifacts compared to ablation variants.
FitVTON’s competitive image realism measures (FID=5.21 on DressCode) indicate that improved fit accuracy does not come at the cost of visual quality.

Threat model

The adversary is the uncontrolled variability of real-world human body shapes, garment sizes, and poses, potentially leading to visually plausible but physically inaccurate try-on results. The model must robustly generalize to diverse bodies without explicit mask annotations at inference, and cannot rely on fixed or average fitting assumptions. There is no consideration of malicious adversaries or attempts to spoof fit prompts.

Methodology — deep read

Threat Model & Assumptions: The adversary is an uncontrolled real-world scenario where bodies and garments vary widely. The model must generalize to natural human shapes and poses without explicit expert mask annotations at test time. The system aims to produce authentic garment fitting consistent with physical deformation, not just plausible image inpainting. No assumptions are made about adversarial inputs or attacks.
Data: The primary training data consists of synthetic triplets generated via a multi-step physics-based cloth simulation using GarmentCode (parametric garment patterns) and SMPL-X (parametric body models). The authors simulate 19 reference garments on 16 representative body shapes in 10 poses with the XPBD cloth simulator in NVIDIA Warp, yielding 78,080 fitting triplets with aligned images and masks. The triplets cover garment-body size variations and three wearing styles (one-piece, tucked-in, untucked).

Two real-world datasets, VITON-HD and DressCode, provide additional unpaired real images for texture rectification fine-tuning with approximately 4,000 pseudo triplets.

Architecture and Algorithm: FitVTON builds on the FLUX.1 Kontext multimodal diffusion model, a flow-matching transformer that models image editing as a continuous ODE over latent space. Input is a triplet: instance person image (I_a), garment reference (R_b), and a structured text prompt T_b encoding garment-body size and style.

The VAE encoder encodes images; a text encoder encodes the fit prompt tokens. The latent vectors and prompt tokens are fused in the MMDiT transformer backbone. Training minimizes a conditional flow-matching loss predicting the velocity field driving noisy latent states towards the target latent.
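The training objective described above can be sketched in a few lines. This is a minimal NumPy illustration of a generic conditional flow-matching loss, assuming the common straight-line (rectified-flow) interpolation between the target latent and Gaussian noise; the summary does not state the exact parameterization FitVTON uses, and `velocity_fn`, `cond`, and the toy oracle are hypothetical stand-ins for the MMDiT backbone and its conditioning.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x0, cond, rng):
    """Monte Carlo estimate of a conditional flow-matching loss for one batch.

    x0:          target latents, shape (batch, dim)
    cond:        conditioning, passed through unchanged to velocity_fn
    velocity_fn: callable (x_t, t, cond) -> predicted velocity, same shape as x0
    """
    batch = x0.shape[0]
    noise = rng.standard_normal(x0.shape)           # epsilon ~ N(0, I)
    t = rng.uniform(1e-3, 1.0, size=(batch, 1))     # one timestep per sample
    x_t = (1.0 - t) * x0 + t * noise                # assumed straight-line path
    target = noise - x0                             # time derivative of x_t on that path
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean((pred - target) ** 2))


# Toy check with hypothetical predictors (not the real network).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))

# An oracle that is handed x0 as "conditioning" can recover the exact velocity.
oracle = lambda x_t, t, cond: (x_t - cond) / t
# A predictor that always outputs zero cannot.
zero = lambda x_t, t, cond: np.zeros_like(x_t)

oracle_loss = flow_matching_loss(oracle, x0, x0, np.random.default_rng(1))
zero_loss = flow_matching_loss(zero, x0, x0, np.random.default_rng(1))
```

In the toy check the oracle drives the loss to numerical zero while the zero predictor does not, which is the behaviour one would expect from a velocity-regression objective.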

Critical to FitVTON are two auxiliary U-Net heads that predict timestep-dependent masks for the garment region and the exposed-body region, with targets taken from the simulations. The heads are supervised with a BCE + Dice segmentation loss and exist only during training, where they act as geometric regularizers that help the model learn fit-conditioned shape changes.
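A BCE + Dice objective of the kind named above can be written compactly. The sketch below is a generic NumPy version, not the authors' implementation: the equal weighting of the two terms, the soft Dice formulation, the smoothing constant `eps`, and the plain sum over the two heads are all assumptions.

```python
import numpy as np


def bce_dice_loss(logits, target, dice_weight=1.0, eps=1e-6):
    """Binary cross-entropy plus soft Dice loss for one predicted mask.

    logits: raw mask predictions, any shape
    target: binary ground-truth mask (0/1), same shape
    """
    # Numerically stable BCE on logits: max(z, 0) - z*y + log(1 + exp(-|z|))
    bce = np.mean(
        np.maximum(logits, 0.0) - logits * target + np.log1p(np.exp(-np.abs(logits)))
    )
    probs = 1.0 / (1.0 + np.exp(-logits))
    intersection = np.sum(probs * target)
    dice = 1.0 - (2.0 * intersection + eps) / (np.sum(probs) + np.sum(target) + eps)
    return float(bce + dice_weight * dice)


def dual_branch_mask_loss(garment_logits, garment_mask, body_logits, body_mask):
    """Sum of the two auxiliary-head losses (equal weighting is an assumption)."""
    return bce_dice_loss(garment_logits, garment_mask) + bce_dice_loss(body_logits, body_mask)


# Toy 2x2 masks: confident correct logits versus confident wrong logits.
mask = np.array([[1.0, 1.0], [0.0, 0.0]])
good_logits = 20.0 * (2.0 * mask - 1.0)
bad_logits = -good_logits

good = bce_dice_loss(good_logits, mask)
bad = bce_dice_loss(bad_logits, mask)
```

The BCE term penalizes per-pixel errors, while the Dice term penalizes poor overlap of the whole region, which is why the two are commonly combined for boundary-sensitive masks.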

At inference, these mask heads are removed; the model generates images conditioned on fit prompts only.
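Generation with a flow-matching model amounts to integrating the learned velocity field from noise to data, with no mask heads involved. The following is a minimal sketch using fixed-step Euler integration, assuming the straight-line convention in which t = 1 is pure noise and t = 0 is the clean latent; the step count, the solver, and the toy `oracle` velocity are illustrative assumptions, not details from the paper.

```python
import numpy as np


def euler_sample(velocity_fn, noise, cond, num_steps=20):
    """Integrate dx/dt = v(x, t, cond) from t = 1 (noise) down to t = 0 (data)."""
    x = noise.copy()
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_fn(x, t_cur, cond)
        x = x + (t_next - t_cur) * v   # step size is negative: we move toward t = 0
    return x


# Toy check: a hypothetical oracle that knows the target latent.
rng = np.random.default_rng(0)
target_latent = rng.standard_normal((2, 8))
start_noise = rng.standard_normal((2, 8))

# On the straight path x_t = (1 - t) * x0 + t * eps, the velocity is (x_t - x0) / t.
oracle = lambda x, t, cond: (x - cond) / t

sample = euler_sample(oracle, start_noise, target_latent)
```

With the exact velocity the Euler steps stay on the straight path and land on the target latent; a trained network only approximates this field, so real samplers typically need more care with step counts and schedules.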

Training Regime: Stage I trains the flow-matching model with synthetic simulation triplets using image+text LoRA adapters. Auxiliary mask heads are jointly trained.

Stage II (texture rectification) freezes the text LoRA adapters, preserving fit semantics, and fine-tunes only the image LoRA layers on real unpaired VITON-HD and DressCode samples. Training mixes pseudo triplets generated by the Stage I model with real-image reconstruction pairs. This improves photorealism without weakening fit control.
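The Stage II freezing rule can be expressed as a simple filter over parameter names. The sketch below is framework-free and purely illustrative: the naming scheme (`img_`, `txt_`, `.lora_`) is hypothetical and not taken from the FitVTON code, and a real implementation would set `requires_grad` (or the equivalent) on the matching tensors.

```python
def stage2_trainable(param_names):
    """Map each parameter name to whether it is updated in Stage II.

    Assumed (hypothetical) naming: LoRA weights contain ".lora_", and the
    modality is encoded as an "img_" or "txt_" prefix on the module name.
    """
    plan = {}
    for name in param_names:
        is_lora = ".lora_" in name
        is_image_branch = ".img_" in name
        # Only image-branch LoRA weights train; text LoRA and base weights stay frozen.
        plan[name] = is_lora and is_image_branch
    return plan


names = [
    "blocks.0.img_attn.weight",    # base weight, frozen in both stages
    "blocks.0.img_attn.lora_A",    # image LoRA, trained in Stage II
    "blocks.0.img_attn.lora_B",
    "blocks.0.txt_attn.lora_A",    # text LoRA, frozen to preserve fit semantics
    "blocks.0.txt_attn.lora_B",
]
plan = stage2_trainable(names)
trainable = [n for n, flag in plan.items() if flag]
```

Keeping the text-side adapters fixed is what lets texture adaptation proceed without overwriting the prompt-to-fit mapping learned on simulation data.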

Exact hyperparameters (epochs, batch sizes) are given in the paper's appendix; training is GPU-based.

Evaluation Protocol: The paper introduces the FittingEffect3K dataset, with 3,350 real-world triplets from 10 participants who vary in body shape and pose.

Fit metrics decompose into four fit aspects:

Garment-Body Alignment (GB): correctness of garment placement relative to anatomy.
Tightness/Looseness Consistency (T/L): matching garment ease compared to real try-on.
Silhouette Consistency (SC): preservation of overall garment-induced body silhouette.
Local Fit Artifacts (LF): identification of local fitting defects like wrinkles, pulling.

A GPT-5.2-based vision-language model scores each generated result against the real try-on on every dimension from 1 to 5, judging fit only and ignoring texture.
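The four per-dimension scores can be held in a small record and aggregated. This is a sketch under stated assumptions: the summary does not say how the overall fit score is derived, so the unweighted mean used here is an assumption, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitScores:
    """One VLM judgement of a generated try-on against the real try-on."""
    gb: float  # Garment-Body Alignment
    tl: float  # Tightness/Looseness Consistency
    sc: float  # Silhouette Consistency
    lf: float  # Local Fit Artifacts

    def __post_init__(self):
        for name in ("gb", "tl", "sc", "lf"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in [1, 5], got {value}")

    def overall(self):
        # Assumption: overall score is the unweighted mean of the four aspects.
        return (self.gb + self.tl + self.sc + self.lf) / 4


def dataset_overall(samples):
    """Average the per-sample overall score across a list of FitScores."""
    return sum(s.overall() for s in samples) / len(samples)


samples = [FitScores(4, 3, 3, 2), FitScores(5, 4, 4, 3)]
per_sample = [s.overall() for s in samples]
average = dataset_overall(samples)
```

Validating the 1 to 5 range at construction time catches malformed VLM outputs before they skew a dataset-level average.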

This is complemented by human preference studies and standard image-quality metrics (FID, KID) on public datasets.
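For reference, the two image-quality metrics have the following standard definitions, computed on Inception features of real (r) and generated (g) images. These are the general formulas, not anything specific to this paper; lower is better for both, and the summary does not state which scaling is used for the reported KID values.

```latex
% Frechet Inception Distance between Gaussians fitted to the two feature sets
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

% Kernel Inception Distance: squared MMD with a cubic polynomial kernel,
% where d is the feature dimension
\mathrm{KID} = \mathrm{MMD}^2(X_r, X_g), \qquad
k(x, y) = \left( \tfrac{1}{d}\, x^\top y + 1 \right)^{3}
```

Here \mu and \Sigma are the mean and covariance of the features in each set, and X_r, X_g are the feature sets themselves.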

Reproducibility: The synthetic data is generated from the GarmentCode and SMPL-X parametric models; code and a project page are available. The FittingEffect3K dataset and the evaluation scripts are newly curated for fit assessment, but their availability is unclear.

Example Pipeline: Given a person image, a garment image, and a textual prompt (e.g., "female, slim, medium-short, untucked"), FitVTON encodes the inputs, and the flow-matching model generates a latent try-on result. During training, the garment and body mask heads predict segmentation masks that enforce geometric constraints around fit-critical garment boundaries and exposed body parts. In Stage II, only the image LoRA is fine-tuned on real data, using a mixture of pseudo triplets and real-image reconstruction pairs, which produces realistic textures without impairing fit control. The final output reflects realistic garment drape, tightness, and silhouette matched to the body shape described in the prompt.
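A structured fit prompt like the one in the example can be assembled from typed fields. The sketch below is illustrative only: the field names and their interpretation (gender, body build, garment length, wearing style) are guesses from the example string, not the paper's schema; only the three wearing styles are taken from the data description.

```python
from dataclasses import dataclass

# Wearing styles listed in the simulation data description.
WEARING_STYLES = {"one-piece", "tucked-in", "untucked"}


@dataclass(frozen=True)
class FitPrompt:
    """Hypothetical container for the fields of a fit prompt."""
    gender: str          # assumed meaning of the first token
    build: str           # assumed meaning of the second token
    garment_length: str  # assumed meaning of the third token
    wearing_style: str   # one of WEARING_STYLES

    def __post_init__(self):
        if self.wearing_style not in WEARING_STYLES:
            raise ValueError(f"unknown wearing style: {self.wearing_style!r}")

    def to_text(self):
        # Comma-separated form matching the example prompt's layout.
        return ", ".join(
            [self.gender, self.build, self.garment_length, self.wearing_style]
        )


prompt = FitPrompt("female", "slim", "medium-short", "untucked")
prompt_text = prompt.to_text()
```

Building the prompt from validated fields rather than free text keeps the conditioning vocabulary closed, which matters when the model has only seen a discrete set of fit categories.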

Technical innovations

Encoding garment-body size relationships as structured text prompts guiding diffusion-based virtual try-on to control fit beyond appearance.
Generating large-scale, controllable synthetic try-on triplets from parametric physics-based garment-body simulation covering multiple body shapes, poses, and garment styles.
Dual-branch mask supervision with auxiliary garment and exposed-body mask predictors during training to spatially ground fit-sensitive shape deformation.
Two-stage training separating fit-prompt geometry learning on synthetic data from texture rectification fine-tuning on real unpaired images using modality-specific LoRA adapters.

Datasets

GarmentCodeVTON — 78,080 synthetic triplets — proprietary simulation combining GarmentCode garments and SMPL-X bodies
VITON-HD — thousands of high-resolution real try-on images — public fashion try-on benchmark
DressCode — thousands of fashion try-on images — public dataset
FittingEffect3K — 3,350 real-world fitting triplets — proprietary; human subjects with diverse body types and poses

Baselines vs proposed

CatVTON: overall fit score = 2.30 vs FitVTON: 3.08
Any2AnyTryOn: overall fit score = 2.57 vs FitVTON: 3.08
OmniTry: overall fit score = 2.55 vs FitVTON: 3.08
JCo-MVTON: overall fit score = 2.74 vs FitVTON: 3.08
Nano Banana: overall fit score = 2.82 vs FitVTON: 3.08
FitVTON (w/o mask supervision): overall fit score = 2.87 vs full FitVTON: 3.08
On DressCode FID: JCo-MVTON = 8.84 vs FitVTON = 5.21
On DressCode KID: CatVTON = 1.57 vs FitVTON = 1.21
Removing texture rectification increases FID on VITON-HD from 13.64 to 16.15
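The fit-score margins implied by the comparisons above can be computed directly. The snippet uses only the overall scores listed in this section; the helper name `margins` is illustrative.

```python
FITVTON_OVERALL = 3.08

# Overall fit scores as listed above.
OTHER_OVERALL = {
    "CatVTON": 2.30,
    "Any2AnyTryOn": 2.57,
    "OmniTry": 2.55,
    "JCo-MVTON": 2.74,
    "Nano Banana": 2.82,
    "FitVTON (w/o mask supervision)": 2.87,
}


def margins(ours, others):
    """Return {name: (absolute gain, relative gain in percent)} for each method."""
    return {
        name: (round(ours - score, 2), round(100.0 * (ours - score) / score, 1))
        for name, score in others.items()
    }


gains = margins(FITVTON_OVERALL, OTHER_OVERALL)

# Print methods from the smallest to the largest gap.
for name, (abs_gain, rel_gain) in sorted(gains.items(), key=lambda kv: kv[1][0]):
    print(f"{name}: +{abs_gain:.2f} ({rel_gain:.1f}%)")
```

The smallest gap is to the ablated variant without mask supervision, which isolates the contribution of the auxiliary heads from the rest of the pipeline.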

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.12012.

Fig 1: With garment-body size prompts, Nano Banana [12] produce “neutral fit” results across …

Fig 2: Overview of FitVTON. (Top) Given a person image, a reference garment, and a Garment- …

Limitations

Garment-body size control is coarse: it is based on 16 body prototypes and lacks fine, continuous measurement granularity.
Fit control is limited to discrete categories; it cannot specify exact garment dimensions or centimeter-level ease.
Synthetic simulation data, despite enhancements, may not fully cover real-world fabric and material complexity.
Fit evaluation with VLM scores depends on model calibration and may not capture every nuance of human perception.
Texture rectification relies on unpaired real data and pseudo targets, which may only imperfectly bridge the sim-to-real domain gap.
No explicit adversarial-robustness or out-of-distribution body shape/pose stress tests are reported.

Open questions / follow-ons

How to achieve finer-grained continuous fit control beyond the 16 discrete body prototypes to support centimeter-level garment adjustments?
Can incorporation of more diverse fabric/material simulation improve texture realism and fit behavior under complex deformations?
How robust is the model’s fit control under out-of-distribution poses, heavy occlusions, or unseen garments?
What are the limits and calibration boundaries of VLM-based fit scoring relative to nuanced human perceptual judgments?

Why it matters for bot defense

For bot-defense engineers focused on CAPTCHA, FitVTON’s methodology signals that visual authenticity in image generation can be driven not just by texture fidelity but by physically grounded latent factors such as size and shape interactions. This highlights a broader challenge in synthetic image detectors: confirming structural or physics-consistent features rather than surface-level appearance alone. The approach of leveraging parametric simulations and dual-branch auxiliary tasks to enforce fine-grained geometric constraints may inspire new feature extraction or anomaly detection methods for identifying fake images or bot-generated content with unrealistic physical properties. Furthermore, the use of modality-specific adapters and staged training could inform modular model designs for adaptive generation or detection under variable conditions. Overall, FitVTON underscores the importance of interpretability and explainability tied to physical semantics in generative models, which can aid CAPTCHA systems in discerning human-like consistency versus automated artifacts.

Cite

@article{arxiv2606_12012,
  title={FitVTON: Fit-aware Virtual Try-On via Body-Garment Size Control},
  author={Yiqun Ning and Ao Shen and Chenhang He and Lei Zhang},
  journal={arXiv preprint arXiv:2606.12012},
  year={2026},
  url={https://arxiv.org/abs/2606.12012}
}
