VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits

Source: arXiv:2606.13427 · Published 2026-06-11 · By Hoang-Nguyen Cao, Le-Hoang Bui, Dinh-Khoi Vo, Minh-Triet Tran, Trung-Nghia Le

TL;DR

VietFashion addresses fine-grained, culturally grounded sketch-text composed image retrieval (ST-CIR) for the traditional Vietnamese Ao Dai. Existing retrieval benchmarks focus on Western fashion and support either single-modality queries or a single target per query, which limits their ability to capture cultural semantics and the inherent ambiguity of retrieval. VietFashion combines 650 hand-drawn Ao Dai sketches with curated textual attributes drawn from authentic fashion magazines, paired with over 21,000 photorealistic images synthesized by a pipeline that combines generative models with attribute-driven prompts.

The dataset adopts a multi-target retrieval formulation in which each sketch-text query corresponds to three valid target images. This models real-world design variability and reduces false-negative supervision during training and evaluation. The authors benchmark several state-of-the-art retrieval paradigms, including sketch-only, sketch-text composed, supervised composed, and zero-shot methods, and find significant performance gaps, especially in distinguishing subtle cultural details and compositional semantics. VietFashion thus provides a culturally specific testbed for multimodal retrieval methods that require fine-grained understanding of cultural fashion.

Key findings

  • The dataset contains 650 hand-drawn Ao Dai sketches and over 21,000 generated photorealistic images paired via detailed attribute captions.
  • Each sketch-text query is paired with 3 valid target images, enabling multi-target retrieval to reduce false-negative errors of prior single-target benchmarks.
  • BLIP4CIR achieved the best supervised performance with Recall@1 = 8.77%, Recall@10 = 37.03%, mAP = 0.2483, MRR = 0.3950 on the test split.
  • CLIP4CIR is the closest supervised competitor but trails BLIP4CIR by a wide margin (MRR 0.1908 vs 0.3950), underscoring the difficulty of the task.
  • Zero-shot and textual inversion methods like SEARLE and Pic2Word perform substantially worse than supervised baselines, showing limited generalization.
  • Sketch-only retrieval methods (ZSE-SBIR, S3BIR-DINO) yield the lowest recall and ranking scores, showing textual attribute queries add essential semantic information.
  • Sketch abstraction levels and structural complexities were analyzed, with the dataset roughly balanced across low, medium, and high abstractions to improve generalization.
  • The multi-target design is inspired by CIRCO and helps to model the ambiguity of real-world retrieval where multiple valid variations exist.

Threat model

The paper does not model an adversary. The assumed actor is a retrieval system user who issues hand-drawn sketch and text queries to search for culturally authentic Ao Dai images. The difficulty considered is retrieval ambiguity: several visually similar images can be valid results for one query, which motivates the multi-target ground-truth setup. Malicious actors and attacks are not explicitly modeled; the focus is on realistic design ambiguity and retrieval accuracy rather than robustness to adversarial manipulation.

Methodology — deep read

The authors begin by defining the task setting for cultural fashion retrieval: users supply multimodal queries that combine a hand-drawn sketch with text to find culturally authentic Ao Dai images. No explicit adversary is defined; instead, retrieval ambiguity and false negatives in single-target ground truth motivate the multi-target design.

The dataset construction proceeds in multiple stages. First, a vocabulary of garment attributes (color, fabric, silhouette, neckline, sleeves, embellishments, background, lighting, cultural context) is curated from Vietnamese fashion magazines and archives to ensure cultural authenticity. This rich attribute set allows nuanced textual descriptions.

Next, 650 hand-drawn sketches of Ao Dai garments are collected to represent structural variation with diverse abstraction and complexity levels. These sketches form the structural query modality.

Compositional queries are synthesized by randomly sampling attribute combinations and turning each one into a caption with an instruction-tuned Qwen-2.5 3B LLM. The LLM receives the sampled attributes as a structured list and is asked to produce a neutral, factual caption focused solely on the garment's visual attributes.
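The sampling and prompting step can be sketched as follows. This is a minimal illustration, not the authors' code: the vocabulary values, the prompt wording, and the function names `sample_attributes` and `build_caption_prompt` are all assumptions, and the call to the LLM itself is left out.

```python
import random

# Hypothetical attribute vocabulary. The category names follow the summary,
# but the values are illustrative placeholders, not the paper's vocabulary.
VOCAB = {
    "color": ["red", "white", "emerald green"],
    "fabric": ["silk", "velvet", "chiffon"],
    "neckline": ["high collar", "boat neck"],
    "sleeves": ["long sleeves", "three-quarter sleeves"],
    "embellishments": ["lotus embroidery", "no embellishment"],
}


def sample_attributes(vocab, rng):
    """Pick one value per attribute category at random."""
    return {category: rng.choice(values) for category, values in vocab.items()}


def build_caption_prompt(attributes):
    """Format sampled attributes as a structured list for an instruction-tuned LLM."""
    lines = [f"- {category}: {value}" for category, value in attributes.items()]
    return (
        "Write one neutral, factual caption describing an Ao Dai with these "
        "visual attributes. Mention only the garment.\n" + "\n".join(lines)
    )


rng = random.Random(0)  # fixed seed so the sampled combination is repeatable
attrs = sample_attributes(VOCAB, rng)
prompt = build_caption_prompt(attrs)
print(prompt)
```

Keeping the attribute dictionary next to the generated caption would make it possible to check later that a caption mentions every sampled attribute.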

For target image generation, the authors integrate sketches with attributes via the SANA-ControlNet image synthesis model, producing photorealistic images matching the compositional prompts. For each sketch-caption pair, 3 distinct images are generated varying lighting, textures, and minor details, preserving cultural garment identity.

The final data triples are (Sketch, Caption, {3 target images}), supporting multi-target retrieval scenarios to better encode ambiguity and variability.
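One way to hold such a triple in code is shown below. The class name `STCIRQuery`, its field names, and the identifiers in the example are hypothetical; the released dataset may use a different layout.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class STCIRQuery:
    """One sketch-text query with its three valid target images (hypothetical schema)."""

    sketch_id: str
    caption: str
    target_ids: Tuple[str, str, str]

    def __post_init__(self):
        # Reject tuples that are not exactly three distinct image ids.
        if len(set(self.target_ids)) != 3:
            raise ValueError("expected three distinct target images")

    def is_hit(self, image_id: str) -> bool:
        """A retrieved image counts as correct if it is any of the three targets."""
        return image_id in self.target_ids


q = STCIRQuery(
    sketch_id="sketch_0001",
    caption="A red silk Ao Dai with a high collar.",
    target_ids=("img_a", "img_b", "img_c"),
)
print(q.is_hit("img_b"), q.is_hit("img_z"))
```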

The dataset is split by sketches into train (5,200 queries), validation (650 queries), and test (1,150 queries), avoiding query leakage.
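Splitting by sketch means every query that uses a given sketch lands in the same split. A minimal sketch of that idea on toy data follows; the 80/10/10 fractions, the seed, and the function name `split_by_sketch` are assumptions and do not reproduce the paper's exact query counts.

```python
import random
from collections import defaultdict


def split_by_sketch(queries, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole sketches to splits, then route each query by its sketch.

    queries: list of (query_id, sketch_id) pairs.
    Returns (splits, assignment): split name -> query ids, sketch id -> split name.
    """
    sketch_ids = sorted({sketch_id for _, sketch_id in queries})
    random.Random(seed).shuffle(sketch_ids)
    n_train = round(fractions[0] * len(sketch_ids))
    n_val = round(fractions[1] * len(sketch_ids))
    assignment = {}
    for i, sketch_id in enumerate(sketch_ids):
        if i < n_train:
            assignment[sketch_id] = "train"
        elif i < n_train + n_val:
            assignment[sketch_id] = "val"
        else:
            assignment[sketch_id] = "test"
    splits = defaultdict(list)
    for query_id, sketch_id in queries:
        splits[assignment[sketch_id]].append(query_id)
    return dict(splits), assignment


# Toy data: 20 sketches with 3 captioned queries each.
queries = [(f"q{s}_{j}", f"sketch_{s:02d}") for s in range(20) for j in range(3)]
sketch_of = dict(queries)
splits, assignment = split_by_sketch(queries)

# Leakage check: no sketch may appear in more than one split.
sketches_in = {name: {sketch_of[q] for q in qs} for name, qs in splits.items()}
assert not sketches_in["train"] & sketches_in["val"]
assert not sketches_in["train"] & sketches_in["test"]
assert not sketches_in["val"] & sketches_in["test"]
print({name: len(qs) for name, qs in splits.items()})
```

Splitting at the query level instead would let the same sketch appear in both train and test, so a model could score well by memorizing sketch structure.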

Four retrieval paradigms are benchmarked: (1) Sketch-Based Image Retrieval (SBIR) with ZSE-SBIR and S3BIR-DINO; (2) Sketch-Text Composed Image Retrieval (ST-CIR) with TaskFormer and VaGFeM; (3) supervised composed retrieval using CLIP4CIR and BLIP4CIR, which fuse sketch and text embeddings and are trained on triplet supervision; (4) zero-shot and textual inversion methods including SEARLE variants and Pic2Word, which do not use explicit triplet supervision but adapt foundation models.
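The composed-retrieval idea in paradigm (3) can be illustrated with a toy example: fuse a sketch embedding and a text embedding into one query vector, then rank gallery images by cosine similarity. The fusion here is a plain normalized sum, used only as a stand-in; CLIP4CIR and BLIP4CIR learn their fusion from triplet supervision, and the 3-D embeddings and gallery ids below are made up.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def fuse(sketch_emb, text_emb):
    """Stand-in fusion: element-wise sum of the normalized embeddings."""
    s, t = l2_normalize(sketch_emb), l2_normalize(text_emb)
    return l2_normalize([a + b for a, b in zip(s, t)])


def rank_gallery(query_emb, gallery):
    """Return gallery ids sorted by descending cosine similarity to the query."""
    scores = {
        image_id: sum(q * g for q, g in zip(query_emb, l2_normalize(emb)))
        for image_id, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings: axis 0 stands for sketch structure, axis 1 for text attributes.
sketch_emb = [1.0, 0.0, 0.0]
text_emb = [0.0, 1.0, 0.0]
gallery = {
    "matches_both": [1.0, 1.0, 0.0],
    "matches_sketch_only": [1.0, 0.0, 0.0],
    "matches_neither": [0.0, 0.0, 1.0],
}
ranking = rank_gallery(fuse(sketch_emb, text_emb), gallery)
print(ranking)
```

In this toy setup the image that agrees with both modalities ranks above the one that agrees with the sketch alone, which is the behavior a composed query is meant to produce.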

Training regimes and hyperparameters are not fully detailed in the excerpt but appear conventional for retrieval benchmarks. Evaluation reports Recall@K, mAP, and MRR, all computed against the multi-target ground truth.
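For reference, these metrics can be computed per query against a set of valid targets as sketched below. The summary does not give the paper's exact definitions, so this is one common convention and an assumption: Recall@K is taken as the fraction of valid targets found in the top K, and the reciprocal rank uses the first valid target. Benchmark-level scores would be the mean of each value over all queries.

```python
def recall_at_k(ranking, targets, k):
    """Fraction of the valid targets that appear in the top k results."""
    return len(set(ranking[:k]) & set(targets)) / len(targets)


def average_precision(ranking, targets):
    """Mean of precision values at the rank of each retrieved valid target."""
    hits, total = 0, 0.0
    for rank, image_id in enumerate(ranking, start=1):
        if image_id in targets:
            hits += 1
            total += hits / rank
    return total / len(targets)


def reciprocal_rank(ranking, targets):
    """1 / rank of the first valid target, or 0.0 if none is retrieved."""
    for rank, image_id in enumerate(ranking, start=1):
        if image_id in targets:
            return 1.0 / rank
    return 0.0


# Toy query: three valid targets placed at ranks 2, 4, and 6.
ranking = ["x", "a", "y", "b", "z", "c"]
targets = {"a", "b", "c"}

print(recall_at_k(ranking, targets, 1))     # no target at rank 1
print(recall_at_k(ranking, targets, 5))     # two of three targets in the top 5
print(average_precision(ranking, targets))  # (1/2 + 2/4 + 3/6) / 3
print(reciprocal_rank(ranking, targets))    # first target at rank 2
```

Under this convention Recall@1 cannot exceed 1/3 when each query has three targets, which is worth keeping in mind when reading the reported numbers.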

One example end-to-end: a sketch depicting an Ao Dai silhouette with certain abstract details is paired with a textual caption generated from attribute lists describing fabric, color, collar, sleeves, etc. Using SANA-ControlNet, three photorealistic Ao Dai images are generated matching this structured description. During evaluation, the model retrieves and ranks candidate images given the sketch-text input and accuracy metrics are computed considering all three targets as correct positives.

The dataset and code are publicly released, enabling reproducibility and further research.

Overall, the methodology carefully blends curated cultural knowledge, generative modeling, and multi-modality to produce a challenging, realistic sketch-text retrieval benchmark emphasizing cultural garments.

Technical innovations

  • A novel multi-target retrieval design for sketch-text composed queries in fashion, reducing false-negative supervision and modeling real-world design ambiguity.
  • Integration of a generative pipeline combining Qwen-2.5 LLMs for fine-grained textual captioning with SANA-ControlNet for photorealistic fashion image synthesis.
  • Curation of a culturally specific attribute vocabulary sourced from authentic Vietnamese fashion magazines to ensure precise cultural semantics in textual modifiers.
  • Comprehensive benchmarking across sketch-only, sketch-text composed, supervised, and zero-shot retrieval paradigms on a culturally grounded fashion dataset.

Baselines vs proposed

  • BLIP4CIR: Recall@1 = 0.0877, Recall@10 = 0.3703, mAP = 0.2483, MRR = 0.3950 on the VietFashion test set; all other baselines score significantly lower
  • CLIP4CIR: MRR = 0.1908 vs BLIP4CIR MRR = 0.3950
  • SEARLE-ViT/B zero-shot: performance degraded relative to supervised models (exact metrics not specified)
  • ZSE-SBIR (sketch-only): lowest retrieval metrics, signifying sketch input alone insufficient for challenging cultural garment retrieval

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.13427.

Fig 1: Representative sketch-photo pairs of Ao Dai from our VietFashion dataset.

Fig 2: Overview of the VietFashion dataset construction pipeline.

Fig 3: Examples from the proposed VietFashion dataset.

Fig 4: Triplet ambiguity in CIR.

Figs 5-8 (page 5).

Limitations

  • Synthesized images may not fully capture the photographic diversity and complexity of real-world Ao Dai garments despite their photorealism.
  • The dataset focuses exclusively on a single cultural garment (Vietnamese Ao Dai), limiting generalization to other traditional clothing types or cultures.
  • While multi-target retrieval addresses some ambiguity, ground truth variation remains synthetically generated and may miss subtle visual cues present in real fashion photography.
  • The evaluation does not appear to include explicit adversarial retrieval robustness or distribution shift analyses (e.g., unseen attribute combinations or new sketch styles).
  • Detailed training methodology, including hyperparameter settings and training stability for the generative models, is not fully disclosed, which makes exact reproduction harder.

Open questions / follow-ons

  • How well do models trained on VietFashion generalize to real-world photos or sketches beyond synthetic data?
  • Can the multi-target semantic retrieval framework be extended to other cultural garment datasets and broader fashion domains?
  • What architectural or training modifications improve fine-grained semantic understanding of subtle cultural motifs in multi-modal composed retrieval?
  • How robust are current methods against adversarial or noisy sketches that might be common in practical sketch-based retrieval scenarios?

Why it matters for bot defense

From a bot-defense and CAPTCHA perspective, VietFashion illustrates the complexity of multi-modal (sketch-text) compositional queries in a highly nuanced cultural domain. The multi-target retrieval setting highlights challenges in ambiguous, multi-valid-output retrieval scenarios, emphasizing the importance of handling subtle semantic differences and multi-modal information integration.

Applying VietFashion insights could inspire CAPTCHA systems that leverage fine-grained multi-modal input to verify human intent, such as requiring the interpretation of sketches combined with textual instructions for cultural artifacts. The demonstrated gaps in zero-shot and sketch-only models suggest opportunities to design CAPTCHAs that exploit these weaknesses in automated models by using culturally specific, compositional, and ambiguous queries. However, limitations in dataset diversity and synthetic data caution against overfitting CAPTCHA defenses to narrowly scoped benchmarks.

Cite

bibtex
@article{arxiv2606_13427,
  title={VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits},
  author={Hoang-Nguyen Cao and Le-Hoang Bui and Dinh-Khoi Vo and Minh-Triet Tran and Trung-Nghia Le},
  journal={arXiv preprint arXiv:2606.13427},
  year={2026},
  url={https://arxiv.org/abs/2606.13427}
}

Articles are CC BY 4.0 — feel free to quote with attribution