From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing

Source: arXiv:2606.14639 · Published 2026-06-12 · By Hugo Daumain, Driss Matrouf, Khaled Khelif, Mickael Rouvier

TL;DR

This work addresses the detection of highly realistic synthetic (spoofed) speech, a problem made urgent by advances in text-to-speech and voice conversion. Traditional anti-spoofing systems often fail to generalize to unseen spoofing methods because they depend on specific artifact patterns. The authors convert a self-supervised speech model (WavLM-Large) into a Mixture-of-Experts (MoE) architecture by replacing the feed-forward blocks in selected transformer layers with multiple expert networks controlled by a gating mechanism. This design lets specialized experts capture complementary acoustic patterns while retaining the pretrained representations. In experiments on 14 diverse spoofing datasets, the best MoE model reduces the macro Equal Error Rate (EER) from 5.46% to 4.81%, an 11.9% relative improvement over the baseline SSL model. Full MoE fine-tuning also outperforms prior LoRA-based expert adaptation methods that keep the backbone frozen. Routing analysis shows no strong expert specialization by spoofing method, which suggests the experts learn more complex patterns than a per-attack split. Overall, the study shows that transforming a pretrained speech model into an MoE architecture can improve robustness and generalization in speech anti-spoofing.

Key findings

Converting the feed-forward blocks of the last 6 transformer layers of WavLM-Large into an MoE with 4 experts and top-k=1 routing reduces macro EER from 5.46% to 4.81%, an 11.9% relative improvement (Tables 4 and 6).
Statistical pooling for the gating network’s utterance-level embedding outperforms attentive, max, and mean pooling strategies, achieving the lowest macro EER of 4.81% (Table 5).
Activating multiple experts simultaneously (top-k > 1) tends to degrade performance, and top-k=1 consistently yields better EERs, indicating that routing each utterance to a single expert is beneficial (Table 6).
A smaller MoE with 2 experts and top-k=1 produces competitive results (macro EER=4.98%) while requiring significantly fewer parameters (227M vs 329M) (Table 6).
Comparing full fine-tuning MoE to LoRA-based low-rank expert adaptation shows the latter performs worse across all rank settings (best LoRA macro EER=6.66% vs 4.81%), indicating the advantage of full expert networks and joint fine-tuning (Table 7).
Expert activation distributions remain relatively balanced across different spoofing synthesizers, with low Jensen-Shannon divergences (~0.1-0.3), showing no clear routing specialization by attack type (Figure 3, Table 8).
Inserting MoE layers in the last transformer layers performs better than inserting them in the early layers, in all layers, or in alternating layers, suggesting that higher-level features benefit more from expert specialization (Table 4).
Restricting MHFA classification to the first 13 transformer layers instead of all 24 yields better downstream spoofing detection, confirming low- and mid-layer features are more relevant.
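
The routing-balance finding above rests on comparing expert-usage distributions with the Jensen-Shannon divergence. The following is a minimal sketch of that comparison, not the authors' code. It assumes base-2 logarithms (so the divergence lies in [0, 1]); the summary does not say which base, or whether the divergence or its square root (the JS distance), was reported. The usage histograms are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base-2 logs)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution M = (P + Q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing by convention.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical expert-usage histograms for two synthesizers (4 experts each).
usage_a = [0.30, 0.25, 0.25, 0.20]
usage_b = [0.20, 0.30, 0.20, 0.30]
divergence = js_divergence(usage_a, usage_b)
print(f"JS divergence: {divergence:.4f}")
```

Identical histograms give 0 and histograms with disjoint support give 1 under this convention, so small values indicate that two synthesizers are routed to the experts in similar proportions.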

Threat model

Adversaries are attackers who generate spoofed speech using a variety of neural-based text-to-speech, voice conversion, and audio manipulation methods that produce highly naturalistic speech aiming to impersonate a legitimate speaker. They can use unknown or unseen synthesis technologies not present in the defender's training data. The defender's system cannot assume access to spoofing examples from all present and future spoofing methods. Adaptive, live adversarial attacks or attacks designed specifically to evade the gating mechanism are not modeled.

Methodology — deep read

The paper targets an adversary who uses diverse and evolving speech synthesis and voice conversion methods to generate spoofed audio that impersonates a genuine speaker. The defender's challenge is to generalize detection to spoofing types for which no training samples exist. The approach builds on pretrained self-supervised learning (SSL) speech models, specifically the 24-layer WavLM-Large transformer. Its convolutional feature extractor followed by transformer encoders produces rich acoustic and linguistic representations.
The conventional feed-forward network (FFN) modules in selected transformer layers are replaced by Mixture-of-Experts (MoE) modules containing multiple FFN experts (E=2 to 5 in the experiments). Each expert is initialized with the original FFN weights from the pretrained SSL model, which preserves the learned representations. A gating network pools the frame-level outputs of the multi-head attention block into a fixed-size utterance-level embedding (using mean, max, statistical, or attentive statistical pooling), applies a softmax over experts, and activates the top-k experts for each input. An auxiliary load-balancing loss encourages even expert utilization and prevents collapse to a single expert. The final loss combines the binary cross-entropy detection loss with this auxiliary loss.
Training uses six diverse, publicly available spoofing datasets containing over 1.4 million samples that span many spoofing methods, languages, and acoustic conditions. Development sets from several datasets are used for checkpoint selection. Evaluation is done on 14 spoofing datasets drawn from different challenges and corpora, covering diverse spoofing types and real-world recordings. The EER is computed per dataset and then averaged (macro EER), and also computed once over all trials pooled together (micro EER).
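
The utterance-level routing described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's implementation: the gate is a single linear layer without bias, statistical pooling uses the population standard deviation, and the selected top-k probabilities are renormalized to sum to 1. The names `stat_pool`, `route`, and `gate_weights` are hypothetical.

```python
import math

def stat_pool(frames):
    """Statistical pooling: per-dimension mean and std over time, concatenated."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Population standard deviation (divide by n); an assumption of this sketch.
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n) for d in range(dim)]
    return mean + std

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(frames, gate_weights, k=1):
    """Return ([(expert_index, weight), ...] for the top-k experts, all gate probabilities)."""
    pooled = stat_pool(frames)
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in gate_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top], probs

# Toy input: 2 frames of 2-dim features, 4 experts (gate rows have 2 * dim entries).
frames = [[1.0, 0.0], [3.0, 0.0]]
gate_weights = [
    [0.1, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.2, 0.0],
    [-0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
selected, probs = route(frames, gate_weights, k=1)
print(selected)  # the whole utterance is sent to one expert when k=1
```

Because the gate sees one pooled vector per utterance, every frame of that utterance is processed by the same expert(s), which is the property the Limitations section refers to.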
The architecture and hyperparameters are studied extensively: the number and positions of MoE transformer layers, the number of experts E, the gating pooling strategy, and the top-k routing parameter. Models are trained with AdamW, batch size 128, 80k steps, a learning rate of 2e-5 with warmup, weight decay 1e-2, and progressive unfreezing of the SSL backbone parameters over 15% of the training steps (12k steps). Data augmentation with codec artifacts, noise, and reverberation is applied on the fly.
Baselines include full SSL fine-tuning without MoE and, for direct comparison, LoRA-based low-rank expert adaptation with a frozen backbone. Per-corpus EER results illustrate the impact of each design choice. Expert activation distributions across spoofing synthesizers are compared using the Jensen-Shannon divergence to analyze specialization.
The authors release code based on the open-source Kiwano speaker embedding toolkit, and the training data is a combination of public spoofing datasets, so the methodology is reproducible given access to the same corpora and training setup.
As a concrete example, the best model inserts MoE layers in the last 6 WavLM transformer layers, with 4 full FFN experts per layer and top-k=1 gating with statistical pooling, and is trained end-to-end with the balancing loss on six large spoofing corpora. This yields a macro EER of 4.81% across 14 diverse test datasets, improving over the baseline of 5.46%. The experts show no clear specialization by spoofing method and instead appear to learn complementary patterns.
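
The difference between the macro and micro EER used in the evaluation can be illustrated with a small sketch. It assumes that higher scores mean "more likely bona fide" and uses a plain threshold sweep that returns the operating point where false acceptance and false rejection are closest; the paper's exact EER routine is not described in this summary, and the toy scores are hypothetical.

```python
def eer(scores, labels):
    """Equal error rate; labels: 1 = bona fide, 0 = spoof; higher score = more bona fide."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    best_gap, best_eer = float("inf"), 1.0
    for t in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= t for s in neg) / len(neg)  # spoof accepted as bona fide
        frr = sum(s < t for s in pos) / len(pos)   # bona fide rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

def macro_micro_eer(datasets):
    """datasets: dict mapping name -> (scores, labels). Returns (macro, micro, per_set)."""
    per_set = {name: eer(s, y) for name, (s, y) in datasets.items()}
    macro = sum(per_set.values()) / len(per_set)
    all_scores = [x for s, _ in datasets.values() for x in s]
    all_labels = [x for _, y in datasets.values() for x in y]
    return macro, eer(all_scores, all_labels), per_set

# Two toy corpora, each perfectly separable but with different score ranges.
toy = {
    "corpus_a": ([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]),
    "corpus_b": ([5.0, 4.0, 3.0, 2.0], [1, 1, 0, 0]),
}
macro, micro, per_set = macro_micro_eer(toy)
print(macro, micro)  # macro = 0.0, micro = 0.5 for this toy data
```

In this toy case each corpus has its own ideal threshold, so a single pooled threshold performs much worse. Score-range mismatch across corpora is one general reason a micro EER can sit well above the macro EER; whether it explains the gap reported here is not stated in the summary.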

Technical innovations

Full Mixture-of-Experts conversion of pretrained self-supervised speech models by replacing feed-forward networks with multiple full expert FFNs jointly fine-tuned, rather than low-rank LoRA modules.
Layer-wise gating network using utterance-level pooled features for top-k routing of experts, enabling conditional expert activation on speech input characteristics.
Use of an auxiliary load-balancing loss on the gating probabilities to prevent expert collapse and keep expert utilization diverse throughout training.
Extensive architectural study identifying optimal MoE insertion points (last 6 layers), number of experts (4), gating pooling strategy (statistical pooling), and routing top-k=1 for best spoofing detection performance.
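
To make the load-balancing idea concrete, here is a sketch of one common formulation, the Switch-Transformer-style auxiliary loss E * sum_i f_i * P_i, where f_i is the fraction of utterances whose top-1 expert is i and P_i is the mean gate probability for expert i. This is an assumption chosen for illustration: the summary does not give the paper's exact loss, and the function name is hypothetical.

```python
def load_balance_loss(gate_probs):
    """Switch-style balancing loss over a batch of per-utterance gate probabilities.

    gate_probs: list of probability vectors, one per utterance, each over E experts.
    The value is about 1.0 when routing is uniform and grows as routing collapses.
    """
    n, num_experts = len(gate_probs), len(gate_probs[0])
    counts = [0] * num_experts
    for p in gate_probs:
        # Top-1 expert for this utterance.
        counts[max(range(num_experts), key=lambda i: p[i])] += 1
    frac = [c / n for c in counts]  # f_i
    mean_prob = [sum(p[i] for p in gate_probs) / n for i in range(num_experts)]  # P_i
    return num_experts * sum(f * m for f, m in zip(frac, mean_prob))

balanced = load_balance_loss([[0.7, 0.3], [0.3, 0.7]])
collapsed = load_balance_loss([[0.9, 0.1], [0.8, 0.2]])
print(balanced, collapsed)
```

In training, a term like this would be added to the binary cross-entropy loss with a small weight; that weight is a hyperparameter not reported in this summary.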

Datasets

ASVspoof5 — 182K samples — public anti-spoofing corpus
ASVspoof2019 LA — 25K samples — public anti-spoofing corpus
ADD2022 — 27K samples — public anti-spoofing corpus
FakeOrReal — 54K samples — public anti-spoofing corpus
Codecfake — 741K samples — public anti-spoofing corpus
MLADD — 382K samples — public anti-spoofing corpus
Evaluation datasets: ASVspoof2019 LA, ASVspoof2021 LA and DF, ASVspoof5, Sonar, FakeOrReal, DFADD, Codecfake, LibriSeVoc, ADD2022 (Track1 and Track3), ADD2023 (Track1), InTheWild

Baselines vs proposed

Baseline WavLM-Large (13 layers): Macro EER = 5.46%, Micro EER = 14.95%
MoE (last 6 layers, E=4, k=1, stat pooling): Macro EER = 4.81%, Micro EER = 12.34% (11.9% relative improvement in macro EER)
MoE-LoRA rank 8 (E=8): Macro EER = 6.80%, Micro EER = 16.23% vs MoE full fine-tuning 4.81% / 12.34%
MoE-LoRA rank 64 (E=16): Macro EER = 6.66%, Micro EER = 16.06% vs MoE full fine-tuning 4.81% / 12.34%
Increasing top-k routing from 1 to 2 or 3 degrades MoE performance (e.g. Macro EER 5.33% at k=2 vs 4.81% at k=1)
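
The relative improvements implied by the figures above follow from simple arithmetic. The short sketch below recomputes them from the reported EERs; the micro figure is derived here from the listed numbers and is not quoted from the paper.

```python
def relative_improvement(baseline, proposed):
    """Relative reduction of an error rate, in percent."""
    return 100.0 * (baseline - proposed) / baseline

macro_gain = relative_improvement(5.46, 4.81)
micro_gain = relative_improvement(14.95, 12.34)
print(f"macro: {macro_gain:.1f}%  micro: {micro_gain:.1f}%")
# prints "macro: 11.9%  micro: 17.5%"
```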

Limitations

No adversarial attacks are simulated; robustness is tested on unseen spoofing methods but not against malicious adaptive attackers.
Expert specialization analysis limited; no clear interpretable association between experts and spoofing types found.
Model size increase: MoE models have notably larger parameter counts (up to 378M) compared to standard SSL baselines (178M).
The approach prioritizes performance over parameter efficiency, which may limit applicability in resource-constrained environments.
Training and evaluation rely on publicly available spoofing datasets; real-world scenarios with novel unknown attacks could differ.
The gating mechanism uses utterance-level pooled features, potentially limiting temporal fine-grained expert routing.

Open questions / follow-ons

Can expert specialization be explicitly guided to capture distinct spoofing methods by architectural or training modifications?
How well do MoE-based models withstand adaptive adversarial spoofing attacks crafted to bypass gated experts?
What are the trade-offs in computation and latency for MoE architectures in real-time anti-spoofing deployments?
Can the gating network be improved to leverage more fine-grained temporal dynamics instead of utterance-level pooling?

Why it matters for bot defense

From a bot-defense and CAPTCHA perspective, this research highlights the importance of scalable model architectures that generalize robustly to novel and unseen audio synthesis attacks, a problem analogous to keeping up with evolving bot generation techniques. The MoE architecture's conditional expert activation and load balancing could inspire adaptive CAPTCHA challenges that modulate difficulty or presentation based on detected user behavior or input features. The modular expert design may also offer insights for multi-expert ensemble defenses against emerging spoofing or automated audio manipulation threats. However, the increase in model complexity and resource requirements must be weighed against the real-time constraints of interactive bot-defense systems.

Cite

bibtex

@article{arxiv2606_14639,
  title={From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing},
  author={Hugo Daumain and Driss Matrouf and Khaled Khelif and Mickael Rouvier},
  journal={arXiv preprint arXiv:2606.14639},
  year={2026},
  url={https://arxiv.org/abs/2606.14639}
}
