Missing-Token Prompted Reliability-Aware Fusion for Robust Polyglot Speaker Identification

Source: arXiv:2606.12495 · Published 2026-06-10 · By Peng Jia, Li Dai, Jia Li, Zhenzhen Hu, Ye Zhao, Richang Hong

TL;DR

This paper addresses robust speaker identification in realistic polyglot situations where speakers use multiple languages and face modality (visual cues) may be missing. The key challenges are generalizing speaker discriminative features across languages and maintaining reliable performance when face data is unavailable, due to occlusion or detection failure. The authors propose MRAF, a novel framework which introduces a learnable missing token to represent the absence of face embeddings instead of zero padding, reducing distribution shifts between complete and missing-face inputs. MRAF further employs a reliability-aware cross-attention fusion module to estimate and weight the confidence of face and audio modalities before fusing their embeddings, allowing adaptive reliance on more dependable cues per sample. Training combines multi-branch classification losses, audio-only knowledge distillation, and center loss to improve missing-modality robustness and speaker discrimination. Evaluated on the MAV-Celeb dataset following the POLY-SIM 2026 Challenge protocol, MRAF achieves perfect accuracy (100%) on in-language and cross-lingual multimodal tasks (P3, P5) and competitive performance on missing-face (audio-only) tests (P4, P6), ranking second overall. Ablations confirm the benefits of the missing token, reliability-aware fusion, and balanced multimodal/audio-only training. The approach offers a unified token-level model robust to missing face data and cross-lingual variation.

Key findings

MRAF achieves 100% top-1 accuracy on the complete-modality in-language (P3) and cross-lingual (P5) scenarios on the POLY-SIM 2026 test set.
On missing-face (audio-only) tasks P4 and P6, MRAF attains 98.95% and 99.32% accuracy respectively, outperforming zero-filling and heuristic completion baselines.
Compared to the baseline method achieving 73.37% average accuracy across tasks, MRAF improves average accuracy to 99.57%, with large gains (46.4% on P4 and 55.4% on P6).
Reliability-aware cross-attention fusion surpasses linear (+12.6 percentage points), gated (+13.2 points), and LSTM (+0.22 points) fusion in average accuracy (Table 3).
The learnable missing token representation for missing face inputs improves P6 accuracy from 99.01% (zero filling) to 99.32% and yields the highest average accuracy (Table 4).
Training with 20% audio-only and 80% full-modality samples balances multimodal utilization and missing-face robustness, yielding best overall results (Fig. 3).
Multi-branch classification with fusion, audio-only knowledge distillation, and center loss enhances speaker discriminability and missing-modality resilience.
The reliability scorers estimate scalar modality confidence per sample, used to reweight and modulate token embeddings adaptively before fusion.

Threat model

The study does not explicitly frame a formal adversary model but assumes natural modality availability challenges such as occluded or missing face data, noisy audio, and cross-lingual domain shifts. The system must robustly infer speaker ID despite incomplete or degraded multimodal inputs but does not consider malicious adversaries actively attempting to spoof or fool the system.

Methodology — deep read

The study tackles polyglot multimodal speaker identification, aiming to predict speaker IDs from paired face and audio embeddings across multiple languages, with missing face modality possible at inference.

Threat Model & Assumptions: The adversary context is not explicitly defined as this is a robustness task rather than adversarial security. The model assumes face and audio modalities may be partially missing or unreliable due to occlusions, noise, or mismatched languages. The system must generalize across language domains and varying modality availability.

Data: Experiments use the MAV-Celeb dataset, which contains English and Urdu speakers with paired face and speech data from unconstrained YouTube videos. The data are split into train/val/test sets of 4039/1290/1521 English samples and 9304/1779/1623 Urdu samples (Table 1), with English used for training and Urdu for cross-lingual evaluation. Labels are speaker IDs.

Architecture: MRAF consists of (1) Missing-Token Prompted Modality Embedding, (2) Reliability-Aware Cross-Attention Fusion, and (3) Multi-Branch Classification.

Missing-Token Embedding: Face and audio input features are projected via learned linear layers into a sequence of latent tokens plus positional embeddings, processed through multiple Transformer encoder layers. When face is missing, a learnable missing token replaces the zero face input, allowing trainable missing-face representation alignment in the token space.
Reliability-Aware Fusion: Scalar reliability scores for face and audio modalities are estimated via lightweight MLP scorers applied to respective embeddings, producing normalized modality weights. Token embeddings are scaled by these weights and fused through bidirectional cross-attention, allowing tokens in each modality to attend to the other while emphasizing higher-confidence inputs. The fused embedding is pooled and normalized to produce the final multimodal representation.
Classification: Three classifier heads predict speaker IDs from face embedding, audio embedding, and fused embedding. Fusion head output is used at inference.
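
The data flow through the first two components can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The softmax normalization of the two reliability scores, the single linear unit standing in for the MLP scorer, the projection-free single-head attention, the repetition of the missing token across positions, and all names (`MISSING_TOKEN`, `face_tokens_or_missing`, `reliability_weights`, `fuse`) are hypothetical choices made for the example.

```python
import math
import random

DIM = 4  # toy embedding size; the summary reports 512

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Assumption: the learnable missing token is one trainable vector. Here it is
# a fixed random vector because nothing is trained in this sketch.
_rng = random.Random(0)
MISSING_TOKEN = [_rng.uniform(-0.1, 0.1) for _ in range(DIM)]

def face_tokens_or_missing(face_tokens, n_tokens):
    """Use real face tokens when present, else repeat the missing token."""
    if face_tokens is None:
        return [list(MISSING_TOKEN) for _ in range(n_tokens)]
    return face_tokens

def reliability_weights(face_tokens, audio_tokens, w_face, b_face, w_audio, b_audio):
    """Score each modality from its mean-pooled tokens, then normalize.

    A single linear unit stands in for the lightweight MLP scorer, and
    softmax normalization is an assumption of this sketch.
    """
    def score(tokens, w, b):
        pooled = [sum(col) / len(tokens) for col in zip(*tokens)]
        return sum(p * wi for p, wi in zip(pooled, w)) + b
    return softmax([score(face_tokens, w_face, b_face),
                    score(audio_tokens, w_audio, b_audio)])

def cross_attend(queries, keys):
    """Single-head dot-product attention with no learned projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        attn = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        out.append([sum(w * k[j] for w, k in zip(attn, keys)) for j in range(d)])
    return out

def fuse(face_tokens, audio_tokens, weights):
    """Scale tokens by reliability, cross-attend both ways, pool, L2-normalize."""
    r_face, r_audio = weights
    face = [[r_face * x for x in tok] for tok in face_tokens]
    audio = [[r_audio * x for x in tok] for tok in audio_tokens]
    attended = cross_attend(face, audio) + cross_attend(audio, face)
    pooled = [sum(col) / len(attended) for col in zip(*attended)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

# Audio-only sample: the face input is absent, so the missing token is used.
audio = [[0.5, -0.2, 0.1, 0.3], [0.1, 0.4, -0.3, 0.2]]
face = face_tokens_or_missing(None, n_tokens=2)
# Hand-picked scorer parameters that favor audio, purely for illustration.
weights = reliability_weights(face, audio, [0.0] * DIM, -2.0, [0.0] * DIM, 2.0)
fused = fuse(face, audio, weights)
```

In the sketch the face weight comes out small because the scorer parameters were chosen that way. In the described model the scorers would have to learn this behavior from the mixed full-modality and audio-only training data.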

Training Regime: Trained with Adam optimizer, learning rate 1e-4, batch size 64, dropout 0.1, embedding dim 512. Training uses a mixture of full-modality (face+audio) and audio-only samples at 80% and 20%, respectively, to simulate missing-face conditions and improve robustness. Loss combines cross-entropy on all three branches, audio-only knowledge distillation (KL divergence with temperature=2.0) from fusion branch to audio branch logits, and center loss on fused embeddings for intra-class compactness. Weighting hyperparameters balance these terms.
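
The combined objective can be sketched for a single sample as follows. This is an illustrative reading of the description above, not the paper's exact formulation: the `T**2` scaling of the distillation term follows the common knowledge-distillation convention and is an assumption, the weights `lambda_kd` and `lambda_center` are hypothetical placeholders (the summary gives no values), and a real implementation would also stop gradients through the teacher logits.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def cross_entropy(logits, label):
    return -log_softmax(logits)[label]

def kd_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2 (assumed)."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl

def center_loss(embedding, center):
    """Half the squared distance between an embedding and its class center."""
    return 0.5 * sum((e - c) ** 2 for e, c in zip(embedding, center))

def total_loss(face_logits, audio_logits, fused_logits, label, fused_emb, center,
               lambda_kd=1.0, lambda_center=0.01):  # hypothetical weights
    # Cross-entropy on all three classifier branches.
    ce = (cross_entropy(face_logits, label)
          + cross_entropy(audio_logits, label)
          + cross_entropy(fused_logits, label))
    # Fusion branch acts as teacher, audio branch as student.
    kd = kd_kl(fused_logits, audio_logits)
    return ce + lambda_kd * kd + lambda_center * center_loss(fused_emb, center)

loss = total_loss(
    face_logits=[0.2, 0.1, -0.3],
    audio_logits=[1.0, 0.2, -0.5],
    fused_logits=[2.0, 0.1, -1.0],
    label=0,
    fused_emb=[0.6, 0.8],
    center=[0.5, 0.9],
)
```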

Evaluation: Evaluation follows the POLY-SIM 2026 Challenge protocol over 4 settings: P3 (in-language full-modality), P4 (in-language audio-only), P5 (cross-lingual full-modality), P6 (cross-lingual audio-only). Top-1 accuracy is the metric. Comparisons include official baseline and other top submissions. Ablations evaluate fusion methods, missing modality modeling methods, and training sampling ratios. The model is run on an RTX 4090 GPU.
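
As a quick check on the metric, the sketch below computes top-1 accuracy and averages the four per-setting accuracies reported in this summary. The function name `top1_accuracy` is illustrative, and treating the overall score as an unweighted mean over P3 to P6 is an assumption, although it does reproduce the reported 99.57%.

```python
def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted speaker ID equals the true ID."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Per-setting accuracies (percent) as reported in the key findings.
reported = {"P3": 100.00, "P4": 98.95, "P5": 100.00, "P6": 99.32}

# Unweighted mean over the four settings (assumed averaging scheme).
average = sum(reported.values()) / len(reported)
print(f"{average:.2f}")  # prints 99.57
```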

Reproducibility: The authors plan to release code and pretrained models publicly at the project GitHub repository linked in the paper. The dataset is the public MAV-Celeb corpus used by the benchmark.

End-to-End Example: During inference on an audio-only cross-lingual sample (P6), the face embedding input is replaced by the learnable missing token. The model estimates near-zero face reliability and high audio reliability, weights the tokens accordingly, and uses cross-attention fusion to produce the fused representation. The fusion classification head outputs the predicted speaker ID robustly despite missing visual input and language domain shift.

Technical innovations

Introduction of a learnable missing token to represent absent face modality, enabling unified token-level modeling and reduced distribution shift versus zero-filling.
Reliability-aware cross-attention fusion module that estimates per-sample face and audio confidence scores to adaptively weight and fuse modality token embeddings bidirectionally.
Joint optimization of multi-branch classification heads with audio-only knowledge distillation bridging full-modality and missing-face inference modes.
Balanced training sampling strategy mixing full-modality and audio-only samples to improve robustness without sacrificing visual cue exploitation.
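
The balanced sampling strategy can be illustrated with a small face-dropping routine. This is a sketch under assumptions: the summary gives only the 80%/20% ratio, so the per-sample independent draw, the dictionary sample layout, and the name `mix_modalities` are hypothetical.

```python
import random

def mix_modalities(batch, audio_only_ratio=0.2, rng=None):
    """Mark roughly `audio_only_ratio` of the samples as audio-only.

    Each sample is a dict with "face" and "audio" entries. Dropped faces are
    set to None so a downstream model can substitute its missing token.
    """
    rng = rng or random.Random()
    mixed = []
    for sample in batch:
        if rng.random() < audio_only_ratio:
            mixed.append({**sample, "face": None})  # simulate a missing face
        else:
            mixed.append(sample)
    return mixed

batch = [{"face": [0.1, 0.2], "audio": [0.3, 0.4]} for _ in range(8)]
mixed = mix_modalities(batch, rng=random.Random(0))
n_audio_only = sum(s["face"] is None for s in mixed)
```

Drawing per sample keeps both conditions present within most batches. Fixing an exact count per batch would be an equally plausible reading of the 80/20 split.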

Datasets

MAV-Celeb: approx. 19,600 samples total across train/val/test (the sum of the split counts reported in the Data paragraph); public YouTube-based dataset of English and Urdu speakers

Baselines vs proposed

Baseline (mmosc): average accuracy = 0.7337 vs MRAF: 0.9957
Linear fusion: avg accuracy = 0.8693 vs MRAF cross-attention: 0.9957
Gated fusion: avg accuracy = 0.8633 vs MRAF cross-attention: 0.9957
LSTM fusion: avg accuracy = 0.9935 vs MRAF cross-attention: 0.9957
Zero filling missing face: avg accuracy = 0.9940 vs learnable missing token: 0.9957
Audio completion missing face: avg accuracy = 0.9940 vs learnable missing token: 0.9957
Memory bank selection missing face: avg accuracy = 0.9946 vs learnable missing token: 0.9957
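
The gaps implied by the comparisons above can be recomputed directly. The short script below uses only the average accuracies listed in this section and expresses each gap in percentage points. The dictionary keys are shorthand labels chosen for the example.

```python
# Average accuracies as listed in the comparisons above.
mraf = 0.9957
alternatives = {
    "Baseline": 0.7337,
    "Linear fusion": 0.8693,
    "Gated fusion": 0.8633,
    "LSTM fusion": 0.9935,
    "Zero filling": 0.9940,
    "Audio completion": 0.9940,
    "Memory bank selection": 0.9946,
}

# Gap to MRAF in percentage points, rounded to two decimals.
gaps = {name: round((mraf - acc) * 100, 2) for name, acc in alternatives.items()}

for name, gap in gaps.items():
    print(f"{name}: +{gap:.2f} points")
```

The fusion gaps (12.64, 13.24, and 0.22 points) match the rounded figures quoted in the key findings, while the missing-face modeling alternatives all sit within about 0.2 points of the learnable missing token.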

Limitations

Model may degrade on noisy audio, low-quality face images, or face-audio mismatches reducing reliable modality cues.
Cross-lingual shifts in pronunciation and acoustic conditions remain challenging beyond the tested English and Urdu languages.
Only evaluated on MAV-Celeb dataset and POLY-SIM 2026 splits; generalization to wider languages and unseen conditions is untested.
No adversarial robustness or intentional modality corruption attacks analyzed.
Ablation on hyperparameter sensitivity and alternative reliability scorer designs could be expanded.

Open questions / follow-ons

How does MRAF perform under adversarial or spoofing attacks targeting either modality?
Can the learnable missing token concept generalize to other missing-modality scenarios beyond missing face, such as missing audio or multiple missing modalities?
What are the trade-offs and behaviors if more language domains or unseen languages are introduced during evaluation?
Can stronger language-invariant speaker representations be developed to further reduce phonetic variability impact?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, MRAF exemplifies robust multimodal biometric fusion techniques resilient to modality absence or degradation, a common scenario in real-world security systems with video and audio inputs. Its learnable missing token mechanism provides a method to reduce distribution shift effects when face data is unavailable—a typical occurrence due to camera occlusion or privacy restrictions. Additionally, the reliability-aware fusion strategy adaptively weighs modalities based on quality, improving system reliability by discounting noisy or missing signals.

Practitioners can consider similar trainable placeholder token approaches and modality confidence scoring when designing biometric or behavioral challenge-response systems that must degrade gracefully in partial input conditions. The audio-only distillation approach also informs how to transfer multimodal knowledge to unimodal fallback modes for robustness. However, care should be taken as the method focuses on speaker-identification tasks and assumes predefined speaker classes, thus direct application to open-set bot detection or generalized CAPTCHA may require adaptation. Evaluation across adversarial conditions and broader language settings would be needed for deployment-oriented risk assessment.

Cite

@article{arxiv2606_12495,
  title={Missing-Token Prompted Reliability-Aware Fusion for Robust Polyglot Speaker Identification},
  author={Peng Jia and Li Dai and Jia Li and Zhenzhen Hu and Ye Zhao and Richang Hong},
  journal={arXiv preprint arXiv:2606.12495},
  year={2026},
  url={https://arxiv.org/abs/2606.12495}
}
