Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
Source: arXiv:2605.06643 · Published 2026-05-07 · By Hao Dong, Hongzhao Li, Shupan Li, Muhammad Haris Khan, Eleni Chatzi, Olga Fink
TL;DR
This paper addresses the unclear progress in Multimodal Domain Generalization (MMDG), a field aiming to improve robustness of models across unseen domains by leveraging multiple input modalities (e.g., video, audio, text). Despite many recent specialized methods, inconsistent evaluation practices and fragmented datasets have obscured whether reported gains reflect true advances or experimental artifacts. To resolve this, the authors introduce MMDG-Bench, the first comprehensive, standardized benchmark unifying six datasets spanning three tasks (action recognition, fault diagnosis, sentiment analysis), six modality combinations, and nine representative algorithms. Over 7,400 neural networks were trained across 95 cross-domain tasks using uniform data splits, hyperparameter tuning, and evaluation protocols that also assess corruption robustness, missing modality scenarios, and model trustworthiness. Results reveal that recent methods offer only marginal improvements over the strong ERM baseline, with no single approach consistently dominating across datasets or modalities. A large performance gap to the Oracle upper bound confirms MMDG remains a challenging open problem. Additionally, trimodal fusion does not reliably outperform the best bimodal setups, and all models degrade heavily when confronted with corrupted or missing inputs, sometimes harming calibration metrics. Thus, the paper surfaces important limitations of current MMDG research and highlights new directions for reliable, robust multimodal deployment.
Key findings
- Recent specialized MMDG methods yield only marginal improvements over ERM baseline under fair and standardized evaluation.
- No single method consistently dominates across datasets, modality combinations, or task families; performance rankings fluctuate widely.
- A substantial gap to the Oracle (target-trained) model remains: on HAC video+audio, the Oracle reaches 92.81% mean accuracy versus 70.95% for the best method (MOOSA), so MMDG is far from solved.
- Trimodal fusion (video+audio+flow) does not consistently outperform strongest bimodal pairs; some methods even degrade performance when adding modalities.
- All evaluated methods experience significant accuracy drops under realistic input corruptions (video defocus, audio wind noise) and missing modality conditions.
- Methods optimized for clean-domain alignment often lack robustness to corrupted inputs and missing modalities, with some harming model trustworthiness despite accuracy gains.
- Misclassification detection and out-of-distribution detection are complementary but distinct trustworthiness dimensions; no method excels across both simultaneously.
- Corruption robustness and modality balancing objectives improve resilience but remain insufficient to close the robustness gap.
Threat model
The adversary induces environmental domain shifts and realistic input corruptions such as sensor noise and occlusions. The attacker cannot alter training data or access target domain samples during training but may cause missing modalities or corrupted inputs at test time. The defender aims to train models that generalize across unseen domains and remain robust despite corrupted or incomplete multimodal observations.
Methodology — deep read
The paper introduces MMDG-Bench, a unified benchmark to fairly evaluate multimodal domain generalization.
Threat model & assumptions: The adversary induces domain shifts manifesting as changes in environmental conditions, sensor noise, or cultural variations. The model must generalize to unseen target domains without access to target domain data during training, under both multi-source and single-source domain generalization settings. Realistic corruptions and missing modalities simulate sensor failures and input perturbations.
Data: MMDG-Bench includes six public datasets spanning three tasks: EPIC-Kitchens and HAC for action recognition (video, audio, optical flow modalities); the HUST motor dataset for mechanical fault diagnosis (vibration, acoustic); and CMU-MOSI, CMU-MOSEI, and CH-SIMS for multimodal sentiment analysis (video, audio, text). Each dataset contains multiple domains with varying environmental or cultural characteristics. Splits and preprocessing are standardized, and source-target domain splits are defined for both the multi-source and the single-source settings.
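The two split settings can be sketched as follows. This is an illustration only: the domain labels `D1`, `D2`, `D3` are hypothetical, and the assumption that the single-source setting pairs each source domain with one different target domain is not confirmed by this summary.

```python
from itertools import permutations

def multi_source_splits(domains):
    """Leave-one-domain-out: train on all other domains, test on the held-out one."""
    return [([d for d in domains if d != target], target) for target in domains]

def single_source_splits(domains):
    """Train on exactly one domain, test on one different domain (assumed pairing)."""
    return [([src], tgt) for src, tgt in permutations(domains, 2)]

# Hypothetical domain labels, for illustration only.
domains = ["D1", "D2", "D3"]
multi = multi_source_splits(domains)
single = single_source_splits(domains)

for sources, target in multi + single:
    print(f"train on {sources} -> test on {target}")
```

With three domains this enumerates three multi-source tasks and six single-source tasks. The target domain never appears among the sources in either setting.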
Algorithms: Nine representative algorithms were evaluated: the strong ERM baseline plus eight specialized methods. The specialized methods are RNA-Net (feature norm alignment), SimMMDG (modality-shared/specific decomposition with contrastive learning), MOOSA (self-supervised masked cross-modal translation and jigsaw puzzles), CMRF (sharpness-aware minimization with feature distillation), NEL (nonpolarized learning), JAT (adversarial domain invariance), MBCD (collaborative distillation and modality dropout), and GMP (gradient modulation). An Oracle model trained with target-domain data serves as an upper bound. Architectures include pretrained SlowFast and ResNet backbones for video/audio, 1D CNNs for vibration/acoustic signals, and Transformer encoders for sentiment.
Training regime: The authors ran a random hyperparameter search (10 trials per method-dataset pair) with early stopping based on training-domain validation. Each best setting was then retrained with two additional random seeds to reduce variance. All models were trained on GPUs, with batch sizes, optimizers, and model selection criteria standardized across methods to ensure reproducibility and fair comparison.
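A minimal sketch of this tuning protocol, assuming a two-parameter search space. The hyperparameter names, their ranges, and the `train_and_eval` stand-in (which returns synthetic numbers instead of training a model) are all hypothetical and are not taken from the MMDG-Bench code.

```python
import random
import statistics

def sample_config(rng):
    """Draw one configuration; the hyperparameter names and ranges are assumed."""
    return {
        "lr": 10 ** rng.uniform(-5, -3),
        "weight_decay": 10 ** rng.uniform(-6, -3),
    }

def train_and_eval(config, seed):
    """Hypothetical stand-in for a real training run.

    Returns (source-domain validation accuracy, target-domain test accuracy).
    The numbers are synthetic; a real implementation would train a model.
    """
    rng = random.Random(f"{sorted(config.items())}-{seed}")
    val_acc = 0.60 + 0.10 * rng.random()
    test_acc = val_acc - 0.05 * rng.random()
    return val_acc, test_acc

def run_protocol(n_trials=10, n_extra_seeds=2, search_seed=0):
    rng = random.Random(search_seed)
    trials = []
    for _ in range(n_trials):
        cfg = sample_config(rng)
        val_acc, test_acc = train_and_eval(cfg, seed=0)
        trials.append((val_acc, test_acc, cfg))
    # Model selection looks only at source-domain validation accuracy;
    # target-domain test accuracy is never used to pick a configuration.
    best_val, best_test, best_cfg = max(trials, key=lambda t: t[0])
    # Retrain the selected configuration with additional seeds.
    test_accs = [best_test]
    for seed in range(1, n_extra_seeds + 1):
        test_accs.append(train_and_eval(best_cfg, seed)[1])
    n_runs = n_trials + n_extra_seeds
    return best_cfg, statistics.mean(test_accs), statistics.stdev(test_accs), n_runs

best_cfg, mean_acc, std_acc, n_runs = run_protocol()
print(f"{n_runs} runs, target accuracy {mean_acc:.3f} +/- {std_acc:.3f}")
```

The point of the sketch is the selection rule: picking the configuration by target-domain accuracy would leak target information and inflate reported gains, which is one of the evaluation artifacts the benchmark is designed to remove.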
Evaluation protocol: Performance was evaluated using classification accuracy, F1, MAE (sentiment), corruption robustness measured by accuracy degradation on corrupted inputs, missing modality generalization by ablating modalities during inference, and trustworthiness via misclassification detection (AURC, AUROC, FPR95) and OOD detection (AUROC, FPR95). Results are aggregated across 95 unique domain-transfer tasks for robust conclusions.
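For readers unfamiliar with the trustworthiness metrics, the sketch below computes AUROC, FPR95, and AURC for misclassification detection on toy data. It assumes the confidence score is something like the maximum softmax probability and that correct predictions are the positive class; conventions vary between papers, and the exact definitions used in MMDG-Bench are not stated in this summary.

```python
import math

def auroc(pos, neg):
    """Probability that a positive scores above a negative (ties count half)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fpr_at_95_tpr(pos, neg):
    """False-positive rate at the threshold that keeps at least 95% of positives."""
    ranked = sorted(pos, reverse=True)
    k = math.ceil(0.95 * len(ranked))
    threshold = ranked[k - 1]
    return sum(q >= threshold for q in neg) / len(neg)

def aurc(confidences, correct):
    """Area under the risk-coverage curve: mean error rate over the
    top-k most confident predictions, for k = 1..n."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors, risks = 0, []
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / k)
    return sum(risks) / len(risks)

# Toy data: confidence per prediction and whether the prediction was right.
confidences = [0.9, 0.8, 0.7, 0.6, 0.4]
correct = [True, True, False, True, False]

pos = [c for c, ok in zip(confidences, correct) if ok]
neg = [c for c, ok in zip(confidences, correct) if not ok]

mis_auroc = auroc(pos, neg)
mis_fpr95 = fpr_at_95_tpr(pos, neg)
mis_aurc = aurc(confidences, correct)
print(mis_auroc, mis_fpr95, mis_aurc)
```

The same `auroc` and `fpr_at_95_tpr` functions apply to OOD detection if `pos` holds scores of in-distribution samples and `neg` holds scores of out-of-distribution samples. Higher AUROC is better, while lower FPR95 and lower AURC are better.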
Reproducibility: The authors released the MMDG-Bench code and standardized pipeline to enable fair comparison in future work. The datasets are public, but no release of trained model weights was reported.
One concrete example: On the EPIC-Kitchens action recognition dataset with video and audio, the multi-source domain generalization results show the ERM baseline at 59.78% mean accuracy, MOOSA at 63.76%, and the Oracle at 71.05%. Under audio wind noise corruption, accuracy drops by 0.77 to 4.22 points on average across methods, while video defocus blur is far more damaging (drops of 7.97 to 12.82 points). Removing the audio modality causes minor degradation, but removing video causes catastrophic failure (a drop of over 40 points). This pattern reveals a strong modality hierarchy and the need for robustness-focused approaches.
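The missing-modality test described above can be sketched as below. This is a toy illustration under stated assumptions: the missing modality is replaced by zeros at inference (the summary does not say how the benchmark ablates modalities), the "model" is a hypothetical late-fusion rule that sums per-modality logits, and the numbers are invented to show the mechanics, not to reproduce the paper's results.

```python
def ablate(batch, missing):
    """Return a copy of `batch` with the named modality replaced by zeros."""
    return {
        name: [[0.0] * len(vec) for vec in feats] if name == missing else feats
        for name, feats in batch.items()
    }

def predict(batch):
    """Toy late fusion: sum per-modality logits and take the argmax."""
    preds = []
    for per_modality in zip(*batch.values()):  # logit vectors for one sample
        fused = [sum(col) for col in zip(*per_modality)]
        preds.append(fused.index(max(fused)))
    return preds

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Invented two-class logits: video is informative, audio is weak.
labels = [0, 1, 1, 0]
batch = {
    "video": [[2.0, 0.0], [0.0, 2.0], [0.5, 1.5], [1.0, 0.2]],
    "audio": [[0.3, 0.1], [0.2, 0.1], [0.0, 0.4], [0.1, 0.6]],
}

acc_full = accuracy(predict(batch), labels)
acc_no_audio = accuracy(predict(ablate(batch, "audio")), labels)
acc_no_video = accuracy(predict(ablate(batch, "video")), labels)

for name, acc in [("full", acc_full), ("no audio", acc_no_audio), ("no video", acc_no_video)]:
    print(f"{name}: {100 * acc:.1f}% (drop {100 * (acc_full - acc):.1f} points)")
```

Reporting the drop in accuracy points per removed modality, as in the last loop, is what exposes a modality hierarchy: a model that leans almost entirely on one modality loses little when the other is removed and collapses when the dominant one is.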
Technical innovations
- MMDG-Bench: First standardized, large-scale benchmark for multimodal domain generalization covering diverse tasks, modality combinations, and robustness/trustworthiness metrics.
- Comprehensive evaluation beyond clean accuracy includes corruption robustness, missing modality generalization, misclassification detection, and OOD detection under consistent protocols.
- Systematic hyperparameter tuning and seed averaging to isolate true algorithmic improvements versus evaluation artifacts.
- Demonstration that specialized MMDG objectives provide only marginal gains over ERM when fairly compared, highlighting the importance of rigorous benchmarking.
Datasets
- EPIC-Kitchens — video/audio/optical flow — egocentric action recognition
- HAC — video/audio/optical flow — human, animal, cartoon action recognition
- HUST motor — vibration/acoustic signals — mechanical fault diagnosis
- CMU-MOSI — video/audio/text — multimodal sentiment analysis
- CMU-MOSEI — video/audio/text — multimodal sentiment analysis
- CH-SIMS — video/audio/text — multimodal sentiment analysis
Baselines vs proposed
- On EPIC-Kitchens V+A: ERM accuracy mean 59.78% vs MOOSA 63.76%; Oracle 71.05%
- On HAC V+A: ERM mean 68.93% vs MOOSA 70.95%; Oracle 92.81% mean accuracy
- On HUST motor (vibration+acoustic): ERM 69.90% vs MOOSA 78.23%; Oracle 99.87%
- On sentiment analysis (MOSI, MOSEI → SIMS): ERM ACC2 65.63% vs MOOSA 66.60%; Oracle 75.22%
- Under audio corruption, accuracy dropped by 0.77 to 4.22 points across methods; video corruption caused larger drops of 7.97 to 12.82 points
- Removing the video modality at inference causes up to a 43.93-point accuracy drop, whereas removing audio causes drops of only 0.32 to 3.20 points
- SimMMDG achieves the best AUROC for both misclassification detection and OOD detection, but method rankings differ by metric
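The clean-accuracy numbers above can be put on a common scale by asking how much of the ERM-to-Oracle gap MOOSA closes on each benchmark. The sketch below uses only the figures listed in this section; the "fraction of gap closed" statistic is an illustrative summary added here, not a metric reported by the paper.

```python
# (ERM, MOOSA, Oracle) accuracies in percent, copied from the list above.
results = {
    "EPIC-Kitchens V+A": (59.78, 63.76, 71.05),
    "HAC V+A": (68.93, 70.95, 92.81),
    "HUST motor": (69.90, 78.23, 99.87),
    "Sentiment (ACC2)": (65.63, 66.60, 75.22),
}

def gap_summary(erm, moosa, oracle):
    """Return (gain over ERM, remaining gap to Oracle, fraction of gap closed)."""
    gain = moosa - erm
    remaining = oracle - moosa
    closed = gain / (oracle - erm)
    return round(gain, 2), round(remaining, 2), round(closed, 3)

summary = {name: gap_summary(*vals) for name, vals in results.items()}

for name, (gain, remaining, closed) in summary.items():
    print(f"{name}: +{gain} over ERM, {remaining} below Oracle, {closed:.1%} of gap closed")
```

On EPIC-Kitchens, for example, MOOSA gains 3.98 points over ERM but still sits 7.29 points below the Oracle, closing roughly 35% of the gap. On HAC the closed fraction is under 10%.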
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.06643.

Fig 2 (page 4): Illustration of three core tasks included in the MMDG-Bench.

Figs 3-8 (page 4).
Limitations
- Evaluation focuses on three task families, potentially limiting generalization to other multimodal applications.
- Study covers nine representative but not exhaustive MMDG methods; newer approaches post-2026 not included.
- Robustness evaluations consider two synthetic corruptions (wind noise, defocus blur) but real-world corruptions are more diverse.
- Oracle upper bound requires labeled target data, which is often unavailable in practice, limiting practical applicability of absolute gap measure.
- Single missing-modality ablations may not test robustness under multi-modality failures or adversarial manipulations.
- Trustworthiness evaluations use standard uncertainty metrics but do not explore calibrated Bayesian or test-time adaptation methods.
Open questions / follow-ons
- How can future models better integrate modality balancing and competition-aware mechanisms to consistently improve trimodal fusion?
- What techniques can jointly improve both corruption robustness and model trustworthiness under domain shifts?
- Can novel self-supervised or adaptation schemes close the large remaining gap to the Oracle target domain model?
- How do multimodal domain generalization strategies perform under adversarial or multiple modality corruption attacks?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, this study underscores that incorporating multiple modalities is not a guaranteed path to robust domain generalization; naive fusion may even undermine robustness under real-world conditions like sensor noise or partial observations. The finding that clean accuracy does not predict corruption robustness or trustworthiness highlights the importance of extensive evaluation under corrupted and missing inputs before deploying multimodal defenses. The large performance gap to oracle models also suggests current MMDG techniques remain immature for high-assurance security deployments. Practitioners should carefully evaluate modality importance hierarchies and focus on robustness to input failures, as auxiliary modalities like audio may sometimes degrade performance if not properly handled. This work encourages adopting standardized, reproducible benchmarks covering diverse modalities, domain shifts, and trust metrics to assess multimodal approaches in bot-defense scenarios rigorously.
Cite
@article{arxiv2605_06643,
title={Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study},
author={Hao Dong and Hongzhao Li and Shupan Li and Muhammad Haris Khan and Eleni Chatzi and Olga Fink},
journal={arXiv preprint arXiv:2605.06643},
year={2026},
url={https://arxiv.org/abs/2605.06643}
}