
Rethinking Transfer Learning for Industrial Inspection: DINOv3 vs. ImageNet Pretraining Across RGB and X-ray Tasks

Source: arXiv:2605.23472 · Published 2026-05-22 · By Mehdi Gharbage, Céline Teulière, Pierre Bouges, Thierry Chateau

TL;DR

This paper investigates the efficacy of large-scale self-supervised pretraining, specifically DINOv3 distillation, compared to conventional supervised ImageNet pretraining for supervised industrial visual inspection tasks. Industrial inspection differs significantly from natural-image domains due to fine-grained, small defects and sometimes different imaging modalities such as X-ray, challenging the transferability of foundation models pretrained on web-scale natural images. The authors conduct a controlled comparison using ConvNeXt backbones pretrained either via supervised ImageNet classification or DINOv3 distillation, benchmarked against a ResNet-50 baseline. They evaluate transfer on three core tasks—semantic segmentation, instance segmentation, and object detection—across four industrial datasets spanning RGB surface defects and X-ray images. Both frozen-backbone and fully finetuned regimes are considered.

Key findings reveal that DINOv3 features do not provide consistent advantages in frozen transfer scenarios and underperform under X-ray modality shift. However, when full finetuning is permitted, DINOv3 initialization yields better final results and faster convergence on RGB inspection datasets than ImageNet pretraining. Conversely, supervised ImageNet pretraining remains a more robust starting point for X-ray defect detection in both the frozen and finetuned settings. This nuanced behavior indicates that modern vision foundation models like DINOv3 may enhance supervised RGB defect inspection, but their benefit hinges on sufficient downstream adaptation and on how closely the target modality resembles natural images.

Key findings

  • DINOv3-distilled ConvNeXt-T backbone outperforms ResNet-50 ImageNet baseline by +10.59 mIoU on Severstal semantic segmentation after full finetuning.
  • On Rubber Rings semantic segmentation, DINOv3 achieves +10.29 mIoU gain over ImageNet-pretrained ConvNeXt-T after full finetuning.
  • In frozen backbone regime, DINOv3 and ImageNet pretraining give similar performance on Severstal (62.40 vs 62.04 mIoU), indicating no clear DINOv3 advantage without adaptation.
  • On RarePlanes instance segmentation, DINOv3 reaches the best mask mAP (84.50 vs. 82.88 for ImageNet pretraining) only after full finetuning; with a frozen backbone, ImageNet pretraining is better (72.89 vs. 70.36).
  • For GDXray X-ray defect detection, supervised ImageNet pretraining clearly outperforms DINOv3: 29.74 vs. 27.84 box mAP after finetuning, and the gap widens with a frozen backbone (21.32 vs. 7.88).
  • ConvNeXt backbones improve over classical ResNet-50 in all tasks after finetuning regardless of pretraining method.
  • Faster convergence and higher final performance with DINOv3 on RGB tasks occur only under full finetuning, not frozen transfer.

Threat model

n/a — this work is not about adversary modeling or attack-defense, but rather studies transfer learning behavior of pretrained backbones under domain and modality shifts in industrial inspection imagery.

Methodology — deep read

The study focuses on supervised industrial visual inspection with three downstream tasks: semantic segmentation, instance segmentation, and object detection. There is no adversarial setting; the question is how pretrained representations transfer under domain and modality shifts in industrial imagery.

The core backbones are convolutional networks (ConvNeXt-T) pretrained under two paradigms: (1) supervised ImageNet-1k classification and (2) DINOv3 self-supervised distillation from a large-scale teacher trained on the LVD-1689M dataset. A standard ResNet-50 pretrained on ImageNet serves as a conventional baseline. The study evaluates transfer on four industrial datasets:

  • Severstal (RGB steel surface defects, semantic segmentation)
  • Rubber Rings (RGB defects in manufactured parts, semantic segmentation)
  • RarePlanes (RGB aerial images, instance segmentation)
  • GDXray Castings (X-ray images, object detection)

Downstream decoders are standard architectures: Mask2Former for semantic segmentation, Mask R-CNN for instance segmentation, and Faster R-CNN for detection. Experiments follow controlled training settings without dataset-specific hyperparameter tuning, and they employ defect-aware cropping to bias training samples toward defect regions.
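The summary does not describe how the defect-aware cropping is implemented. One plausible reading, given here only as a minimal sketch, is to center the crop window on a randomly chosen defect pixel with some probability and fall back to a uniform random crop otherwise. The function name `defect_aware_crop`, the `p_defect` parameter, and its default value are all assumptions for illustration, not the authors' procedure.

```python
import random


def defect_aware_crop(mask, crop_h, crop_w, p_defect=0.8, rng=None):
    """Pick the (top, left) corner of a crop window biased toward defects.

    mask is a 2D list of 0/1 values where 1 marks a defect pixel.
    p_defect is a hypothetical knob: the chance of centering on a defect.
    """
    rng = rng or random.Random()
    height, width = len(mask), len(mask[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop window is larger than the image")

    defects = [(r, c) for r in range(height) for c in range(width) if mask[r][c]]
    if defects and rng.random() < p_defect:
        # Center the window on one defect pixel, then clamp it into the image.
        r, c = rng.choice(defects)
        top = min(max(r - crop_h // 2, 0), height - crop_h)
        left = min(max(c - crop_w // 2, 0), width - crop_w)
    else:
        # Uniform random crop, used for defect-free images or the fallback case.
        top = rng.randint(0, height - crop_h)
        left = rng.randint(0, width - crop_w)
    return top, left


# Toy 8x8 mask with a single defect pixel at row 6, column 6.
toy_mask = [[0] * 8 for _ in range(8)]
toy_mask[6][6] = 1
corner = defect_aware_crop(toy_mask, 4, 4, p_defect=1.0, rng=random.Random(0))
```

With `p_defect=1.0` the window always contains the chosen defect pixel, because the clamp only shifts the window toward the image interior. Lower values mix in background-only crops so the model still sees defect-free regions.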

Two adaptation regimes are tested: a frozen backbone, where only the decoder heads are trained, and full finetuning, where the backbone is optimized jointly with the decoder. The metrics are mean Intersection-over-Union (mIoU) for semantic segmentation, mask mean Average Precision (mAP) for instance segmentation, and bounding-box mAP for detection.
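For readers less familiar with the segmentation metric, the following sketch computes mIoU from flat lists of predicted and ground-truth class labels. It is a textbook formulation, not the paper's evaluation code; in particular, skipping classes that appear in neither prediction nor ground truth is an assumption, and benchmark toolkits differ on that detail.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over classes, on a 0..1 scale.

    pred and target are equal-length sequences of integer class labels.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both pred and target
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Six pixels, two classes (0 = background, 1 = defect).
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)  # each class has IoU 2/4
```

The scores quoted in this summary are on a 0 to 100 scale, so the value returned here would be multiplied by 100 for comparison.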

Training uses the Detectron2 framework with a fixed seed for reproducibility. For example, on Severstal semantic segmentation using Mask2Former with ConvNeXt-T, the authors train both frozen and fully finetuned models initialized with either ImageNet-supervised or DINOv3-distilled weights. They compare mIoU over training iterations as well as final validation scores. A similar protocol applies to the other tasks and datasets.
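The difference between the two regimes comes down to which parameters receive gradients. The sketch below illustrates this in plain PyTorch rather than Detectron2; `TinySegmenter` and `configure_regime` are hypothetical stand-ins and do not reflect the authors' ConvNeXt-T or Mask2Former code.

```python
import torch
from torch import nn


class TinySegmenter(nn.Module):
    """Toy backbone + decoder pair (hypothetical, for illustration only)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(8, num_classes, 1)

    def forward(self, x):
        return self.decoder(self.backbone(x))


def configure_regime(model, regime):
    """Return the trainable parameters for 'frozen' or 'finetune'."""
    if regime not in {"frozen", "finetune"}:
        raise ValueError(f"unknown regime: {regime}")
    train_backbone = regime == "finetune"
    for param in model.backbone.parameters():
        param.requires_grad_(train_backbone)
    return [p for p in model.parameters() if p.requires_grad]


def count(params):
    return sum(p.numel() for p in params)


frozen_params = configure_regime(TinySegmenter(), "frozen")
finetune_params = configure_regime(TinySegmenter(), "finetune")

# Only the selected parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(frozen_params, lr=1e-4)
```

In the toy model the frozen regime trains only the 18 decoder parameters, while full finetuning trains all 242. With real backbones that contain normalization layers, a frozen setup usually also keeps the backbone in evaluation mode; whether the authors do so is not stated in this summary.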

The approach isolates the pretraining effect by using identical architectures, decoders, and training schedules, focusing comparison on pretraining paradigm (supervised vs. self-supervised) and adaptation regime (frozen vs. finetuned). This enables controlled evaluation of transferability and adaptation difficulty across modalities and tasks.
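The controlled design can be pictured as a grid in which only the initialization and the adaptation regime vary, while the decoder and metric are fixed per dataset. The enumeration below is a sketch of that grid built from the pairings named in this summary; the dictionary keys and the seed value 0 are assumptions, and the ResNet-50 baseline runs are left out for brevity.

```python
from itertools import product

# Dataset -> (task, decoder, metric), as paired in the summary.
TASKS = {
    "Severstal": ("semantic segmentation", "Mask2Former", "mIoU"),
    "Rubber Rings": ("semantic segmentation", "Mask2Former", "mIoU"),
    "RarePlanes": ("instance segmentation", "Mask R-CNN", "mask mAP"),
    "GDXray Castings": ("object detection", "Faster R-CNN", "box mAP"),
}
INITS = ("imagenet_supervised", "dinov3_distilled")
REGIMES = ("frozen", "finetune")


def build_grid(seed=0):
    """List every ConvNeXt-T run; only init and regime vary per dataset."""
    runs = []
    for (dataset, (task, decoder, metric)), init, regime in product(
        TASKS.items(), INITS, REGIMES
    ):
        runs.append(
            {
                "dataset": dataset,
                "task": task,
                "decoder": decoder,
                "metric": metric,
                "backbone": "ConvNeXt-T",
                "init": init,
                "regime": regime,
                "seed": seed,  # placeholder: the actual seed is not reported
            }
        )
    return runs


grid = build_grid()
```

This yields 16 runs: four datasets, two initializations, and two regimes. Any difference between two runs that share a dataset and a regime can then be attributed to the pretraining paradigm alone.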

The release of code and pretrained weights is not explicitly stated. Data splits follow official or custom partitions detailed in Table 1. Overall, the methodology combines careful architectural control, multiple datasets, dense prediction tasks, and two adaptation regimes to dissect the practical value of modern foundation models for industrial inspection under domain and modality shift.

Technical innovations

  • A controlled comparative analysis of supervised ImageNet pretraining versus large-scale DINOv3 self-supervised distillation for industrial inspection tasks.
  • Evaluation of transferability under both frozen backbone and full finetuning regimes on multiple downstream tasks and datasets.
  • Inclusion of strong modality shift testing via X-ray imagery to probe robustness of web-scale pretrained features beyond RGB natural-image domains.
  • Application of defect-aware cropping and standardized decoder architectures to fairly isolate pretrained backbone effects.

Datasets

  • Severstal — 6666 images (5332 train, 1334 val) — RGB steel surface defect dataset
  • Rubber Rings — 197 images (158 train, 39 val) — RGB rubber ring defect dataset
  • RarePlanes — 5587 images (4474 train, 1113 val) — RGB aerial imagery for instance segmentation
  • GDXray Castings — 1062 images (706 train, 356 val) — X-ray castings defect dataset
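As a quick sanity check on the sizes listed above, the short script below confirms that each train and validation count sums to the stated total and derives the validation share. The numbers are copied from the list; the variable names are only illustrative.

```python
# name -> (total, train, val), copied from the dataset list.
SPLITS = {
    "Severstal": (6666, 5332, 1334),
    "Rubber Rings": (197, 158, 39),
    "RarePlanes": (5587, 4474, 1113),
    "GDXray Castings": (1062, 706, 356),
}

# Every split should add up to the reported total.
totals_ok = all(train + val == total for total, train, val in SPLITS.values())

# Fraction of each dataset held out for validation, rounded to two decimals.
val_share = {
    name: round(val / total, 2) for name, (total, train, val) in SPLITS.items()
}
```

The three RGB datasets hold out roughly 20% of their images for validation, while GDXray Castings holds out about 34%, leaving it with a proportionally smaller training set.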

Baselines vs proposed

  • Severstal mIoU, full finetuning: ResNet-50 ImageNet supervised baseline = 63.28; DINOv3 ConvNeXt-T = 64.01
  • Severstal mIoU, frozen: ConvNeXt-T ImageNet supervised = 62.04; DINOv3 = 62.40
  • Severstal mIoU, full finetuning: ConvNeXt-T ImageNet supervised = 62.97; DINOv3 = 64.01
  • Rubber Rings mIoU: ResNet-50 ImageNet supervised baseline = 73.87; DINOv3 ConvNeXt-T, full finetuning = 75.60
  • Rubber Rings mIoU, frozen: ConvNeXt-T ImageNet supervised = 73.25; DINOv3 = 72.32
  • RarePlanes instance segmentation mask mAP, frozen: ImageNet ConvNeXt-T = 72.89; DINOv3 = 70.36. Full finetuning: ImageNet = 82.88; DINOv3 = 84.50
  • GDXray detection box mAP@50, frozen: ImageNet ConvNeXt-T = 21.32; DINOv3 = 7.88. Full finetuning: ImageNet = 29.74; DINOv3 = 27.84
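To make the pattern in these numbers easier to see, the sketch below computes the DINOv3 minus ImageNet difference for each ConvNeXt-T pairing where the list gives both values. The figures are copied from the list above; the dictionary layout is only illustrative.

```python
# (dataset, metric, regime) -> scores for the two ConvNeXt-T initializations.
REPORTED = {
    ("Severstal", "mIoU", "frozen"): {"imagenet": 62.04, "dinov3": 62.40},
    ("Severstal", "mIoU", "finetune"): {"imagenet": 62.97, "dinov3": 64.01},
    ("Rubber Rings", "mIoU", "frozen"): {"imagenet": 73.25, "dinov3": 72.32},
    ("RarePlanes", "mask mAP", "frozen"): {"imagenet": 72.89, "dinov3": 70.36},
    ("RarePlanes", "mask mAP", "finetune"): {"imagenet": 82.88, "dinov3": 84.50},
    ("GDXray", "box mAP@50", "frozen"): {"imagenet": 21.32, "dinov3": 7.88},
    ("GDXray", "box mAP@50", "finetune"): {"imagenet": 29.74, "dinov3": 27.84},
}

# Positive values mean DINOv3 initialization scored higher.
deltas = {
    key: round(scores["dinov3"] - scores["imagenet"], 2)
    for key, scores in REPORTED.items()
}

for (dataset, metric, regime), delta in deltas.items():
    print(f"{dataset:12s} {metric:10s} {regime:8s} {delta:+.2f}")
```

Read this way, DINOv3 is ahead only on the finetuned RGB rows (+1.04 on Severstal, +1.62 on RarePlanes), roughly level or behind on the frozen RGB rows, and behind on both GDXray rows, with the largest deficit (-13.44) in the frozen X-ray setting.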

Limitations

  • Limited exploration of Vision Transformer backbones; study focused exclusively on ConvNet architectures.
  • Lack of extensive hyperparameter tuning or adaptation strategies that might improve DINOv3 transferability on X-ray data.
  • No adversarial or robustness evaluation against realistic industrial defect variations or noisy labels.
  • Frozen feature transfer evaluation may not capture full capacity of self-supervised models for all tasks.
  • The absence of details on code or pretrained-weight availability limits reproducibility verification.
  • Datasets used are relatively modest in size and do not cover all possible industrial modalities or defect types.

Open questions / follow-ons

  • Can industrial-specific self-supervised pretraining (beyond natural image web data) improve robustness and transfer under X-ray or other modality shifts?
  • How do Vision Transformer architectures pretrained with DINOv3 compare in the same controlled industrial inspection setting?
  • What adaptation methods or domain-tuning strategies can further leverage DINOv3 features for strong or zero-shot inspection performance?
  • Can combining supervised and self-supervised pretraining modalities produce more universal pretrained backbones for industrial domains?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this study underlines that large-scale self-supervised pretrained vision models may offer more useful feature initializations when full finetuning is possible and when the target domain resembles natural images in modality and appearance. However, in scenarios where only frozen features are leveraged or where image modalities differ significantly (e.g., X-ray or other specialized sensor data), conventional supervised pretraining on curated datasets may remain more reliable. This suggests careful evaluation of foundation model applicability is needed when designing visual CAPTCHAs or bot-detection image pipelines involving non-natural or specialized imagery. Moreover, fine-grained recognition tasks with small visual targets like defect detection may particularly benefit from the inductive biases of convolutional architectures and domain-specific adaptation strategies, which could inform CAPTCHA challenge design for high precision localization. Overall, this work encourages cautious optimism about modern foundation models but emphasizes adaptation regime and domain similarity as critical factors for successful transfer.

Cite

bibtex
@article{arxiv2605_23472,
  title={Rethinking Transfer Learning for Industrial Inspection: DINOv3 vs. ImageNet Pretraining Across RGB and X-ray Tasks},
  author={Mehdi Gharbage and Céline Teulière and Pierre Bouges and Thierry Chateau},
  journal={arXiv preprint arXiv:2605.23472},
  year={2026},
  url={https://arxiv.org/abs/2605.23472}
}


Articles are CC BY 4.0 — feel free to quote with attribution