Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

Source: arXiv:2605.06643 · Published 2026-05-07 · By Hao Dong, Hongzhao Li, Shupan Li, Muhammad Haris Khan, Eleni Chatzi, Olga Fink

TL;DR

This paper addresses the unclear progress in Multimodal Domain Generalization (MMDG), a field aiming to improve robustness of models across unseen domains by leveraging multiple input modalities (e.g., video, audio, text). Despite many recent specialized methods, inconsistent evaluation practices and fragmented datasets have obscured whether reported gains reflect true advances or experimental artifacts. To resolve this, the authors introduce MMDG-Bench, the first comprehensive, standardized benchmark unifying six datasets spanning three tasks (action recognition, fault diagnosis, sentiment analysis), six modality combinations, and nine representative algorithms. Over 7,400 neural networks were trained across 95 cross-domain tasks using uniform data splits, hyperparameter tuning, and evaluation protocols that also assess corruption robustness, missing modality scenarios, and model trustworthiness. Results reveal that recent methods offer only marginal improvements over the strong ERM baseline, with no single approach consistently dominating across datasets or modalities. A large performance gap to the Oracle upper bound confirms MMDG remains a challenging open problem. Additionally, trimodal fusion does not reliably outperform the best bimodal setups, and all models degrade heavily when confronted with corrupted or missing inputs, sometimes harming calibration metrics. Thus, the paper surfaces important limitations of current MMDG research and highlights new directions for reliable, robust multimodal deployment.

Key findings

Recent specialized MMDG methods yield only marginal improvements over the ERM baseline under fair, standardized evaluation.
No single method consistently dominates across datasets, modality combinations, or task families; performance rankings fluctuate widely.
A substantial gap to the Oracle (target-trained) model remains: on HAC video+audio, for example, the Oracle reaches 92.81% mean accuracy versus 70.95% for the best method (MOOSA), so MMDG is far from solved.
Trimodal fusion (video+audio+flow) does not consistently outperform strongest bimodal pairs; some methods even degrade performance when adding modalities.
All evaluated methods experience significant accuracy drops under realistic input corruptions (video defocus, audio wind noise) and missing modality conditions.
Methods optimized for clean-domain alignment often lack robustness to corrupted inputs and missing modalities, with some harming model trustworthiness despite accuracy gains.
Misclassification detection and out-of-distribution detection are complementary but distinct trustworthiness dimensions; no method excels across both simultaneously.
Corruption robustness and modality balancing objectives improve resilience but remain insufficient to close the robustness gap.

Threat model

The adversary induces environmental domain shifts and realistic input corruptions such as sensor noise and occlusions. The attacker cannot alter training data or access target domain samples during training but may cause missing modalities or corrupted inputs at test time. The defender aims to train models that generalize across unseen domains and remain robust despite corrupted or incomplete multimodal observations.

Methodology — deep read

The paper introduces MMDG-Bench, a unified benchmark to fairly evaluate multimodal domain generalization.

Threat model & assumptions: The adversary induces domain shifts manifesting as changes in environmental conditions, sensor noise, or cultural variations. The model must generalize to unseen target domains without access to target domain data during training, under both multi-source and single-source domain generalization settings. Realistic corruptions and missing modalities simulate sensor failures and input perturbations.
Data: MMDG-Bench includes six public datasets spanning three tasks: EPIC-Kitchens and HAC for action recognition (video, audio, optical flow modalities); the HUST motor dataset for mechanical fault diagnosis (vibration, acoustic); and CMU-MOSI, CMU-MOSEI, and CH-SIMS for multimodal sentiment analysis (video, audio, text). Each dataset contains multiple domains with varying environmental or cultural characteristics. The benchmark applies standardized preprocessing and standardized source-target domain splits in both the multi-source and single-source settings.
Algorithms: Nine representative MMDG methods were evaluated, with ERM as the strong baseline. Methods include RNA-Net (feature norm alignment), SimMMDG (modality-shared/specific decomposition with contrastive learning), MOOSA (self-supervised masked cross-modal translation and jigsaw puzzles), CMRF (sharpness-aware minimization with feature distillation), NEL (nonpolarized learning), JAT (adversarial domain invariance), MBCD (collaborative distillation and modality dropout), GMP (gradient modulation). An Oracle model trained with target domain data serves as upper bound. Architectures include pretrained SlowFast and ResNet backbones for video/audio, 1D CNNs for vibration/acoustic, Transformer encoders for sentiment.
Training regime: The authors ran a random hyperparameter search (10 trials per method-dataset pair), with early stopping based on training-domain validation. Each best setting was retrained with two additional random seeds to reduce variance. All models were trained on GPUs, and batch sizes, optimizers, and model selection criteria were standardized across methods to ensure reproducibility and fair comparison.
Evaluation protocol: Performance was evaluated using classification accuracy, F1, MAE (sentiment), corruption robustness measured by accuracy degradation on corrupted inputs, missing modality generalization by ablating modalities during inference, and trustworthiness via misclassification detection (AURC, AUROC, FPR95) and OOD detection (AUROC, FPR95). Results are aggregated across 95 unique domain-transfer tasks for robust conclusions.
Reproducibility: The authors released the MMDG-Bench code and standardized pipeline to enable fair comparison in future work. The datasets are public, but no frozen model checkpoints were reported as released.
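The selection procedure described under "Training regime" can be sketched as follows. This is a minimal sketch, not the authors' pipeline: `train_and_validate`, the search space, and its ranges are hypothetical stand-ins, and reporting the mean and standard deviation over three runs is an assumption. Only the 10 trials, the selection on training-domain validation data, and the two additional seeds come from the description above.

```python
import math
import random
import statistics

# Hypothetical search space; the benchmark's actual ranges are not given here.
SEARCH_SPACE = {
    "lr": lambda rng: 10 ** rng.uniform(-5, -3),
    "weight_decay": lambda rng: 10 ** rng.uniform(-6, -3),
    "batch_size": lambda rng: rng.choice([16, 32, 64]),
}


def sample_config(rng):
    """Draw one random hyperparameter configuration."""
    return {name: draw(rng) for name, draw in SEARCH_SPACE.items()}


def train_and_validate(config, seed):
    """Stand-in for real training; returns (val_acc, test_acc).

    val_acc is meant to come from held-out SOURCE-domain data only, so the
    target domain never influences model selection. The numbers are synthetic.
    """
    rng = random.Random(seed)
    quality = 0.65 - 0.05 * abs(math.log10(config["lr"]) + 4.0)
    val_acc = quality + rng.gauss(0, 0.01)
    test_acc = quality - 0.05 + rng.gauss(0, 0.01)
    return val_acc, test_acc


def select_and_report(n_trials=10, n_extra_seeds=2, search_seed=0):
    rng = random.Random(search_seed)
    trials = []
    for _ in range(n_trials):
        cfg = sample_config(rng)
        val_acc, test_acc = train_and_validate(cfg, seed=0)
        trials.append((val_acc, test_acc, cfg))

    # Pick the configuration by validation accuracy, never by test accuracy.
    _, best_test, best_cfg = max(trials, key=lambda t: t[0])

    # Retrain the selected configuration with additional seeds.
    test_accs = [best_test]
    for seed in range(1, n_extra_seeds + 1):
        _, test_acc = train_and_validate(best_cfg, seed=seed)
        test_accs.append(test_acc)
    return best_cfg, statistics.mean(test_accs), statistics.stdev(test_accs)


best_cfg, mean_acc, std_acc = select_and_report()
print(best_cfg, f"{mean_acc:.3f} +/- {std_acc:.3f}")
```

The point of the structure is that the target-domain score is only read after the configuration has been fixed, which is what makes comparisons between methods fair.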

One concrete example: On the EPIC-Kitchens action recognition dataset with video and audio modalities, multi-source domain generalization results show the ERM baseline at 59.78% mean accuracy, MOOSA at 63.76%, and the Oracle at 71.05%. Under audio wind noise corruption, accuracy drops by 0.77 to 4.22 points on average, while video defocus blur is far more damaging (drops of 7.97 to 12.82 points). Removing the audio modality causes minor degradation, but removing video causes catastrophic failure (a drop of over 40 points). This pattern highlights a strong modality hierarchy and the need for robustness-focused approaches.
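The missing-modality measurement in this example can be illustrated with a toy late-fusion model. This is a sketch under stated assumptions: the data are synthetic, the "model" simply sums per-modality logits, and zero-filling the absent modality is one assumed way to simulate a missing input, not necessarily the benchmark's. Corruption robustness is measured the same way, as clean accuracy minus accuracy on the perturbed input.

```python
import numpy as np


def make_toy_data(n=2000, n_classes=4, seed=0):
    """Synthetic two-modality data with a deliberately dominant 'video'."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_classes, size=n)
    onehot = np.eye(n_classes)[y]
    # Assumption: video carries a strong class signal, audio a weak one.
    video = 3.0 * onehot + rng.normal(size=(n, n_classes))
    audio = 0.5 * onehot + 0.3 * rng.normal(size=(n, n_classes))
    return {"video": video, "audio": audio}, y


def late_fusion_predict(inputs):
    """Toy model: each modality's features act as logits; fusion sums them."""
    logits = sum(inputs.values())
    return logits.argmax(axis=1)


def accuracy(pred, y):
    return float((pred == y).mean())


def ablate(inputs, missing):
    """Zero-fill one modality at inference time (assumed ablation scheme)."""
    return {
        name: (np.zeros_like(x) if name == missing else x)
        for name, x in inputs.items()
    }


def missing_modality_drops(inputs, y):
    """Accuracy drop in percentage points when each modality is removed."""
    clean = accuracy(late_fusion_predict(inputs), y)
    return {
        name: 100.0 * (clean - accuracy(late_fusion_predict(ablate(inputs, name)), y))
        for name in inputs
    }


inputs, labels = make_toy_data()
for name, drop in missing_modality_drops(inputs, labels).items():
    print(f"without {name}: {drop:.2f} point drop")
```

Reporting the drop per removed modality, rather than a single averaged number, is what exposes the kind of hierarchy described above.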

Technical innovations

MMDG-Bench: First standardized, large-scale benchmark for multimodal domain generalization covering diverse tasks, modality combinations, and robustness/trustworthiness metrics.
Comprehensive evaluation beyond clean accuracy includes corruption robustness, missing modality generalization, misclassification detection, and OOD detection under consistent protocols.
Systematic hyperparameter tuning and seed averaging to isolate true algorithmic improvements versus evaluation artifacts.
Demonstration that specialized MMDG objectives provide only marginal gains over ERM when fairly compared, highlighting the importance of rigorous benchmarking.
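The trustworthiness metrics named in the evaluation protocol (AUROC, FPR95, AURC) can be computed from per-sample confidence scores. The sketch below uses common definitions and is not taken from the MMDG-Bench code: the function names are illustrative, and using a single confidence score per sample (for example the maximum softmax probability) is an assumption. For misclassification detection the positive class is "correctly classified"; for OOD detection it is "in-distribution". Higher AUROC is better, while lower FPR95 and AURC are better.

```python
import numpy as np


def auroc(scores, positive):
    """Probability that a random positive scores above a random negative."""
    scores = np.asarray(scores, dtype=float)
    positive = np.asarray(positive, dtype=bool)
    pos, neg = scores[positive], scores[~positive]
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def fpr_at_95_tpr(scores, positive):
    """False positive rate at the threshold that keeps >= 95% of positives."""
    scores = np.asarray(scores, dtype=float)
    positive = np.asarray(positive, dtype=bool)
    pos = np.sort(scores[positive])
    neg = scores[~positive]
    threshold = pos[int(np.floor(0.05 * len(pos)))]
    return float((neg >= threshold).mean())


def aurc(confidence, correct):
    """Area under the risk-coverage curve (mean error rate over coverages)."""
    confidence = np.asarray(confidence, dtype=float)
    errors = 1.0 - np.asarray(correct, dtype=float)
    order = np.argsort(-confidence, kind="stable")   # most confident first
    cum_errors = np.cumsum(errors[order])
    risks = cum_errors / np.arange(1, len(errors) + 1)
    return float(risks.mean())


conf = [0.9, 0.8, 0.7, 0.6]
correct = [1, 1, 0, 1]
print(auroc(conf, correct), fpr_at_95_tpr(conf, correct), aurc(conf, correct))
```

The pairwise AUROC here is quadratic in the number of samples, which is fine for illustration; a rank-based implementation would be preferable at benchmark scale.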

Datasets

EPIC-Kitchens — video/audio/optical flow — egocentric action recognition
HAC — video/audio/optical flow — human, animal, cartoon action recognition
HUST motor — vibration/acoustic signals — mechanical fault diagnosis
CMU-MOSI — video/audio/text — multimodal sentiment analysis
CMU-MOSEI — video/audio/text — multimodal sentiment analysis
CH-SIMS — video/audio/text — multimodal sentiment analysis
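The dataset list above maps naturally onto a small registry from which bimodal and trimodal combinations can be enumerated. This is an illustrative sketch: the dictionary layout and the helper name are assumptions, and the combinations the benchmark actually evaluates may be a subset of what this enumerates.

```python
from itertools import combinations

# Tasks and modalities as listed above; the dict layout itself is illustrative.
DATASETS = {
    "EPIC-Kitchens": {"task": "action recognition", "modalities": ("video", "audio", "flow")},
    "HAC": {"task": "action recognition", "modalities": ("video", "audio", "flow")},
    "HUST motor": {"task": "fault diagnosis", "modalities": ("vibration", "acoustic")},
    "CMU-MOSI": {"task": "sentiment analysis", "modalities": ("video", "audio", "text")},
    "CMU-MOSEI": {"task": "sentiment analysis", "modalities": ("video", "audio", "text")},
    "CH-SIMS": {"task": "sentiment analysis", "modalities": ("video", "audio", "text")},
}


def modality_combinations(modalities, min_size=2):
    """All subsets with at least `min_size` modalities, smallest first."""
    return [
        combo
        for size in range(min_size, len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]


for name, info in DATASETS.items():
    combos = ["+".join(c) for c in modality_combinations(info["modalities"])]
    print(f"{name} ({info['task']}): {combos}")
```

For a three-modality dataset this yields three bimodal pairs and one trimodal set, which is the comparison behind the finding that trimodal fusion does not reliably beat the best pair.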

Baselines vs proposed

On EPIC-Kitchens V+A: ERM accuracy mean 59.78% vs MOOSA 63.76%; Oracle 71.05%
On HAC V+A: ERM mean 68.93% vs MOOSA 70.95%; Oracle 92.81% mean accuracy
On HUST motor (vibration+acoustic): ERM 69.90% vs MOOSA 78.23%; Oracle 99.87%
On sentiment analysis (MOSI, MOSEI → SIMS): ERM ACC2 65.63% vs MOOSA 66.60%; Oracle 75.22%
Under audio corruption, accuracy dropped by 0.77-4.22 points across methods; video corruption caused larger drops of 7.97-12.82 points
Removing the video modality at inference causes accuracy drops of up to 43.93 points, whereas removing audio causes drops of only 0.32-3.20 points
SimMMDG achieves the best AUROC for both misclassification detection and OOD detection, although method rankings differ by metric
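The clean-accuracy figures in this list can be turned into two quantities that make the "marginal gains, large gap" claim concrete: the gain over ERM and the share of the ERM-to-Oracle gap that MOOSA closes. The sketch below only does arithmetic on the numbers quoted above; the helper name and the "fraction closed" summary are illustrative choices, not metrics reported by the paper. The sentiment row uses ACC2 rather than plain accuracy.

```python
# Scores (%) as quoted in the list above: (ERM, MOOSA, Oracle).
RESULTS = {
    "EPIC-Kitchens V+A": (59.78, 63.76, 71.05),
    "HAC V+A": (68.93, 70.95, 92.81),
    "HUST motor": (69.90, 78.23, 99.87),
    "Sentiment (MOSI, MOSEI -> SIMS)": (65.63, 66.60, 75.22),
}


def gap_summary(erm, method, oracle):
    """Return (gain over ERM, remaining gap to Oracle, fraction of gap closed)."""
    gain = method - erm
    remaining = oracle - method
    closed = gain / (oracle - erm)
    return gain, remaining, closed


for name, (erm, method, oracle) in RESULTS.items():
    gain, remaining, closed = gap_summary(erm, method, oracle)
    print(
        f"{name}: +{gain:.2f} over ERM, {remaining:.2f} below Oracle, "
        f"{closed:.0%} of the ERM-to-Oracle gap closed"
    )
```

On these figures MOOSA closes roughly a third of the ERM-to-Oracle gap on EPIC-Kitchens but under a tenth of it on HAC.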

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.06643.

Fig 2: Illustration of three core tasks included in the MMDG-Bench.

Figs 3 to 8 (page 4).

Limitations

Evaluation focuses on three task families, potentially limiting generalization to other multimodal applications.
Study covers nine representative but not exhaustive MMDG methods; newer approaches post-2026 not included.
Robustness evaluations consider two synthetic corruptions (wind noise, defocus blur) but real-world corruptions are more diverse.
Oracle upper bound requires labeled target data, which is often unavailable in practice, limiting practical applicability of absolute gap measure.
Single missing-modality ablations may not test robustness under multi-modality failures or adversarial manipulations.
Trustworthiness evaluations use standard uncertainty metrics but do not explore calibrated Bayesian or test-time adaptation methods.

Open questions / follow-ons

How can future models better integrate modality balancing and competition-aware mechanisms to consistently improve trimodal fusion?
What techniques can jointly improve both corruption robustness and model trustworthiness under domain shifts?
Can novel self-supervised or adaptation schemes close the large remaining gap to the Oracle target domain model?
How do multimodal domain generalization strategies perform under adversarial or multiple modality corruption attacks?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this study underscores that incorporating multiple modalities is not a guaranteed path to robust domain generalization; naive fusion may even undermine robustness under real-world conditions like sensor noise or partial observations. The finding that clean accuracy does not predict corruption robustness or trustworthiness highlights the importance of extensive evaluation under corrupted and missing inputs before deploying multimodal defenses. The large performance gap to oracle models also suggests current MMDG techniques remain immature for high-assurance security deployments. Practitioners should carefully evaluate modality importance hierarchies and focus on robustness to input failures, as auxiliary modalities like audio may sometimes degrade performance if not properly handled. This work encourages adopting standardized, reproducible benchmarks covering diverse modalities, domain shifts, and trust metrics to assess multimodal approaches in bot-defense scenarios rigorously.

Cite

bibtex

@article{arxiv2605_06643,
  title={Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study},
  author={Hao Dong and Hongzhao Li and Shupan Li and Muhammad Haris Khan and Eleni Chatzi and Olga Fink},
  journal={arXiv preprint arXiv:2605.06643},
  year={2026},
  url={https://arxiv.org/abs/2605.06643}
}
