Revisiting Vehicle Color Recognition in Long-Tailed Surveillance Scenarios

Source: arXiv:2606.13625 · Published 2026-06-11 · By Vinícius Orrú, Bruno H. Foggiatto, Gabriel E. Lima, David Menotti, Rayson Laroca

TL;DR

This paper tackles the problem of vehicle color recognition (VCR) in real-world surveillance settings characterized by severe long-tailed class imbalance among vehicle colors. Accurate VCR is critical when license plates are unreadable, but imbalanced data leads to models biased toward frequent colors like white and black, neglecting rare but operationally important colors. The authors present a comprehensive empirical study using the challenging UFPR-VeSV dataset, combining two synthetic minority-class augmentation methods—text-conditioned image generation (RunDiffusion/Juggernaut-XL) and image-conditioned editing (Gemini 2.0 Flash)—with modern visual representations, imbalance-aware training techniques, constrained color-preserving augmentations, foreground-aware preprocessing, and ensemble fusion.

The integrated approach achieves 94.6% micro accuracy and 79.7% macro accuracy, improving macro accuracy by 8.2 points over previous state-of-the-art methods. Results show that synthetic data combined with weighted cross-entropy loss and learning-rate scheduling significantly improves performance on minority classes. Manual error analysis reveals that many misclassifications arise from inherent visual ambiguities in the surveillance imagery (e.g., low illumination, occlusions), underscoring the practical limits of color-based VCR. The paper demonstrates that domain-aligned synthetic augmentation plus careful imbalance-focused training provide substantial practical gains for long-tailed VCR in surveillance scenarios.

Key findings

The proposed approach achieves 94.6% micro accuracy and 79.7% macro accuracy on UFPR-VeSV, improving macro accuracy by 8.2 percentage points over prior work.
Gemini 2.0 Flash synthetic augmentation yields higher macro accuracy (75.7%) than RunDiffusion (71.5%) despite generating fewer images, which suggests that domain alignment matters more than sheer volume (Table 4).
Weighted cross-entropy loss combined with curated synthetic images stabilizes training and improves minority-class recognition, increasing macro accuracy from 59.3% to over 75%.
Adding color-safe augmentation and foreground-aware preprocessing further improves macro accuracy by ~1.4 percentage points cumulatively (Table 5).
Hard-voting ensemble fusion of three DINOv3-Large variants boosts macro accuracy to 79.7% and macro F1 to 78.8%.
Manual error analysis of 785 errors shows that 58.5% are inherently ambiguous, visually indistinct even to human annotators, highlighting fundamental limits of VCR from surveillance images.
Correctable model errors cluster around gray→black and silver→white confusions, often due to low contrast and shadows.
The dataset is extremely imbalanced (e.g., 7,381 white samples vs. only 34 brown samples), which motivates targeted minority-class augmentation.
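
The hard-voting fusion mentioned in the findings above can be illustrated with a minimal sketch. It assumes each model outputs one label per image and that ties go to the earliest model in the list; the tie-break rule and the function name `hard_vote` are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over per-model label lists.

    predictions: list of lists, one inner list of labels per model,
    all of the same length. Ties are broken in favor of the earliest
    model in the list (an assumption made for this sketch).
    """
    n_samples = len(predictions[0])
    fused = []
    for i in range(n_samples):
        votes = [model_preds[i] for model_preds in predictions]
        counts = Counter(votes)
        best = max(counts.values())
        # The first vote that reaches the top count wins.
        fused.append(next(v for v in votes if counts[v] == best))
    return fused

# Three hypothetical models disagreeing on some images.
model_a = ["white", "gray", "brown"]
model_b = ["white", "black", "gray"]
model_c = ["silver", "black", "brown"]
fused = hard_vote([model_a, model_b, model_c])
```

Hard voting only needs predicted labels, so it works even when the ensemble members produce scores on different scales.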

Threat model

n/a — The paper does not explicitly consider adversarial threats; the focus is on operational challenges in vehicle color recognition from surveillance footage with severe class imbalance and difficult imaging conditions.

Methodology — deep read

Threat Model & Assumptions: The study targets vehicle color recognition from real-world static surveillance imagery under difficult acquisition conditions (motion blur, occlusion, low light). No adversary is defined; the work treats visual ambiguity and class imbalance as operational obstacles to reliable automated recognition. It assumes training on labeled images only, with no adversarial inputs or active attacks.
Data: The primary dataset is UFPR-VeSV, a real-world long-tailed surveillance collection with 24,945 images of 16,297 unique vehicles annotated into 13 color classes plus an 'unknown' category. The color distribution is highly skewed (e.g., 7,381 white vs 34 brown). The dataset features challenging conditions including day/night, viewpoints, weather, motion blur, and occlusions. Data splits follow a stratified 3:1:1 train-validation-test ratio consistently across 10 random folds.
Synthetic Augmentation: Two off-the-shelf generative models create synthetic images for minority classes. RunDiffusion/Juggernaut-XL performs text-conditioned image generation combining vehicle make/model, target color, environment, and viewpoint prompts to yield diverse photorealistic images. Gemini 2.0 Flash conditions on an existing UFPR-VeSV image and textual instructions to recolor the vehicle while preserving pose and background. Generated images undergo manual double-annotator quality control for visual realism and correct color, retaining 67.7% of RunDiffusion and 42.9% of Gemini samples.
Model Architecture & Training: Four supervised baselines (EfficientNet-V2, ResNet-50, Swin-Tiny, ViT-B/16) initialized from ImageNet, plus frozen DINOv3 (ViT-Small/Base/Large) representations feeding an MLP classifier, are evaluated. Optimization uses Adam with early stopping up to 100 epochs. To handle imbalance, weighted cross-entropy (WCE) loss with inverse class frequency weighting is compared against standard cross-entropy (CE). Learning-rate schedulers tested include Cosine Decay (CD), Linear Warmup + Cosine Decay (LWCD), and ReduceLROnPlateau.
Augmentation & Preprocessing: Conservative color-safe augmentations restrict hue/saturation/value shifts to small ranges to avoid corrupting the target color signal. Foreground-aware preprocessing using SAM 2 segments the vehicle mask and blurs the background to reduce context bias.
Evaluation Protocol: Performance is measured using micro accuracy (overall) and macro accuracy/F1 (per-class average), averaging results over 10 folds on held-out real test images only. Synthetic images augment training but are excluded from validation/testing. Ensemble fusion combines top models by hard voting.
Reproducibility: The authors provide the synthetic images and source code at their GitHub repository. UFPR-VeSV is non-public with restricted access. Model checkpoints are not explicitly mentioned. The generation uses off-the-shelf models without fine-tuning, enabling replication.
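
To make the inverse-class-frequency weighting in the training description concrete, here is a minimal sketch. It uses only the two class counts quoted in this summary (white and brown) and omits the other classes; the normalization `total / (num_classes * count)` and the weighted-mean reduction are common conventions assumed here, not confirmed details of the authors' code.

```python
import math

def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {c: total / (k * n) for c, n in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    probs: list of dicts mapping class name to predicted probability.
    labels: list of true class names.
    The sum is normalized by the summed weights of the samples
    (an assumed convention for this sketch).
    """
    num = 0.0
    den = 0.0
    for p, y in zip(probs, labels):
        num += -weights[y] * math.log(p[y])
        den += weights[y]
    return num / den

# Counts quoted in the summary; the remaining classes are left out.
counts = {"white": 7381, "brown": 34}
weights = inverse_frequency_weights(counts)

# A model that always leans toward "white".
probs = [{"white": 0.9, "brown": 0.1}, {"white": 0.9, "brown": 0.1}]
labels = ["white", "brown"]
wce = weighted_cross_entropy(probs, labels, weights)
plain_ce = sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
```

With these counts a brown sample weighs roughly 217 times as much as a white one, so the weighted loss is dominated by the misclassified rare sample. That strong reweighting may explain why the weighted loss is reported as unstable without additional minority-class data.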

Example end-to-end: For the minority class 'brown', with 34 real images, synthetic images were generated with RunDiffusion and Gemini; after manual filtering, the retained images substantially enlarged the minority-class training set. Training on this combined dataset with weighted cross-entropy loss enabled DINOv3-Large to improve macro accuracy on brown and other rare colors, as shown by cross-validated test results and an error analysis that separates inherent ambiguity from correctable model errors.
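
The Linear Warmup + Cosine Decay (LWCD) schedule named in the training description can be sketched as a pure function of the step index. The warmup length, base rate, and floor below are hypothetical values chosen for illustration; the summary does not state the settings the authors used.

```python
import math

def lwcd_lr(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at a given step under linear warmup + cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical settings: 100 steps, 10 of them warmup.
schedule = [lwcd_lr(s, total_steps=100, warmup_steps=10, base_lr=1e-3)
            for s in range(101)]
```

The warmup phase keeps early updates small, which is one plausible reason such a schedule helps when a heavily weighted loss produces large gradients on rare classes.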

Technical innovations

Application of off-the-shelf text- and image-conditioned generative models for targeted minority-class synthetic augmentation in highly imbalanced vehicle color recognition.
Combination of synthetic data with weighted cross-entropy loss and learning-rate warmup schedules to stabilize training and improve minority class recognition.
Introduction of color-safe augmentation constraints and foreground-aware preprocessing to preserve vehicle color signals and reduce background bias.
Demonstration that domain alignment of synthetic images (e.g., Gemini editing) is more impactful than sheer volume of synthetic samples (RunDiffusion) in this application.
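
The idea behind color-safe augmentation can be sketched on a single RGB pixel: jitter hue, saturation, and value only within narrow bounds so the color label stays valid. The bounds below (`max_hue`, `max_sat`, `max_val`) are hypothetical; the summary says only that the shifts are restricted to small ranges.

```python
import colorsys
import random

def color_safe_jitter(rgb, rng, max_hue=0.01, max_sat=0.05, max_val=0.10):
    """Apply a small random HSV shift to one 8-bit RGB pixel.

    All three bounds are illustrative assumptions, expressed as
    fractions of the unit HSV range used by colorsys.
    """
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + rng.uniform(-max_hue, max_hue)) % 1.0  # hue wraps around
    s = min(1.0, max(0.0, s + rng.uniform(-max_sat, max_sat)))
    v = min(1.0, max(0.0, v + rng.uniform(-max_val, max_val)))
    r2, g2, b2 = colorsys.hsv_to_rgb(h, s, v)
    return tuple(round(c * 255) for c in (r2, g2, b2))

rng = random.Random(0)
brownish = (120, 72, 40)
jittered = color_safe_jitter(brownish, rng)
```

In practice the same shift would be applied to every pixel of an image rather than sampled per pixel; the per-pixel form here just keeps the sketch dependency-free.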

Datasets

UFPR-VeSV — 24,945 images — real-world surveillance dataset from Military Police of Paraná, Brazil
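
A stratified 3:1:1 split, as described in the methodology, can be sketched as follows. The function name and rounding rule are assumptions, and the summary does not say whether the authors split by image or by vehicle, so this sketch splits individual samples.

```python
import random
from collections import defaultdict

def stratified_split(labels, seed, ratios=(3, 1, 1)):
    """Split sample indices into train/val/test, class by class.

    Each class is shuffled with a seeded RNG and cut according to
    `ratios`, so every split keeps roughly the same class mix.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    total = sum(ratios)
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * ratios[0] / total)
        n_val = round(len(idxs) * ratios[1] / total)
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to test
    return train, val, test

# Toy label list: a large class and a 34-sample class like 'brown'.
labels = ["white"] * 100 + ["brown"] * 34
folds = [stratified_split(labels, seed) for seed in range(10)]
```

With only 34 brown samples, each fold leaves about 7 brown images for testing, which is worth keeping in mind when reading per-class results.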

Baselines vs proposed

EfficientNet-V2 baseline (original data): micro-accuracy = 93.5%, macro-accuracy = 71.5%, F1 = 73.8%
DINOv3-Large baseline (original data): micro-accuracy = 94.0%, macro-accuracy = 72.5%, F1 = 74.4%
DINOv3-Large + Gemini synthetic + weighted cross-entropy + augmentation + foreground + ensemble: macro-accuracy = 79.7%, micro-accuracy = 94.6%, F1 = 78.8% (8.2 points macro gain over prior)
Weighted cross-entropy alone on the long-tailed data reduces macro accuracy to 59.3%; combined with Gemini synthetic images, it reaches 75.7% (Table 4).
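
The gap between micro and macro accuracy in the numbers above follows from how the two are computed. The sketch below assumes macro accuracy is the unweighted mean of per-class accuracy, which matches the "per-class average" description in this summary; the toy data is invented for illustration.

```python
from collections import defaultdict

def micro_macro_accuracy(y_true, y_pred):
    """Return (micro, macro) accuracy.

    micro: fraction of all samples predicted correctly.
    macro: unweighted mean of per-class accuracy, so a rare class
    counts as much as a frequent one.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    micro = sum(correct.values()) / len(y_true)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro

# Toy case: the frequent class is perfect, the rare class mostly wrong.
y_true = ["white"] * 95 + ["brown"] * 5
y_pred = ["white"] * 95 + ["brown"] + ["black"] * 4
micro, macro = micro_macro_accuracy(y_true, y_pred)
```

In this toy case micro accuracy stays high while macro accuracy drops sharply, the same pattern that makes macro accuracy the more informative metric on a long-tailed dataset.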

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.13625.

Fig 1 (page 1).
Fig 2: Representative images generated with RunDiffusion/Juggernaut-XL [36]. Sam-
Fig 3: Representative images generated with Gemini. Each pair shows the original
Fig 1: Distribution of vehicle colors in UFPR-VeSV [28]. The strong imbalance moti-
Fig 5 (page 5).
Fig 6 (page 5).
Fig 7 (page 5).
Fig 8 (page 5).

Limitations

Remaining errors include a large fraction (~58.5%) that are inherently ambiguous even to human annotators, limiting achievable accuracy.
Synthetic data generation uses off-the-shelf models without domain-specific fine-tuning, which may limit relevance or introduce domain shifts.
The study focuses on static images and single-label color classification; multi-task or temporal data scenarios are not addressed.
The UFPR-VeSV dataset has restricted access, limiting direct reproducibility for some researchers.
Foreground segmentation is evaluated only via background blurring, not end-to-end segmentation refinement.
Robustness to adversarial attacks or distribution shifts (weather, camera types) is not explicitly tested beyond the train/test splits.
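
The background-blurring step referred to in the limitations above can be sketched on a tiny grayscale grid. The paper obtains the vehicle mask from SAM 2; here the mask is simply given as input, and the box blur and `radius` are stand-ins chosen for illustration rather than the authors' actual filter.

```python
def box_blur(img, radius=1):
    """Mean filter over a (2*radius+1) window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out

def blur_background(img, mask, radius=1):
    """Keep foreground pixels (mask == 1) sharp and blur the rest."""
    blurred = box_blur(img, radius)
    return [
        [img[y][x] if mask[y][x] else blurred[y][x]
         for x in range(len(img[0]))]
        for y in range(len(img))
    ]

# 3x3 toy image: one bright "vehicle" pixel in the center.
img = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
result = blur_background(img, mask)
```

Note that this simple version lets foreground intensity bleed into the blurred background near the mask edge; a more careful variant would blur only background pixels.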

Open questions / follow-ons

How does synthetic minority-class augmentation impact multi-task vehicle identification combining color, make, model, and type?
Can uncertainty-aware or human-in-the-loop frameworks better handle the large fraction of ambiguous cases identified?
How would domain-specific fine-tuning or conditional generation improve synthetic data quality and domain alignment?
Could temporal or multi-view data mitigate ambiguities inherent in single-frame VCR from surveillance?

Why it matters for bot defense

For bot-defense practitioners and CAPTCHA engineers, this paper exemplifies the fundamental challenges of learning robust classifiers under severe class imbalance and natural ambiguity, especially when critical discriminative cues (vehicle color here, captcha text or image semantics there) are noisy or partially hidden. The approach demonstrates that synthetic data generation combined with careful imbalance-aware training and conservative augmentations can improve recognition of rare but operationally important classes. The detailed error analysis also underscores that some recognition failures are not model limitations but inherent input ambiguities—paralleling challenges when dealing with adversarial or low-quality bot inputs. Techniques such as domain-aligned synthetic augmentation, weighted loss functions, and foreground emphasis could inspire analogous strategies in CAPTCHA hardening and bot-detection where minority but critical bot behaviors must be identified despite sparse or ambiguous signals. Furthermore, the manual quality control of synthetic samples warns that naive augmentation may degrade performance if not carefully curated—a caution relevant when synthesizing or perturbing CAPTCHA challenges to foil bots.

Cite

bibtex

@article{arxiv2606_13625,
  title={Revisiting Vehicle Color Recognition in Long-Tailed Surveillance Scenarios},
  author={Vinícius Orrú and Bruno H. Foggiatto and Gabriel E. Lima and David Menotti and Rayson Laroca},
  journal={arXiv preprint arXiv:2606.13625},
  year={2026},
  url={https://arxiv.org/abs/2606.13625}
}
