Beyond Speaker Independence: Evaluating Cross-Lingual Acoustic-to-Articulatory Inversion Across Finnish and Russian

Source: arXiv:2606.20478 · Published 2026-06-18 · By Ruchi Pandey, Tomi Kinnunen

TL;DR

This paper studies acoustic-to-articulatory inversion (AAI), the problem of estimating continuous vocal tract movements from the acoustic speech signal. Prior AAI work has relied on small datasets dominated by English speakers and has generalized poorly across speakers and languages. The authors address this by benchmarking AAI on FROST-EMA, a newly introduced bilingual Finnish-Russian articulatory dataset of 18 speakers recorded under uniform conditions. They systematically evaluate three factors that affect AAI robustness: the articulatory target (raw EMA sensor coordinates versus tract variables), the acoustic front-end (classic MFCC features versus frozen self-supervised learning embeddings), and the inversion back-end (BiLSTM versus a lightweight attention model). They isolate domain shift with evaluation protocols for cross-gender transfer within the same language and cross-language transfer within the same gender.

Cross-gender mismatch lowers Pearson correlation moderately (about 0.05–0.10), while cross-language mismatch causes a larger drop (0.10–0.20). SSL features such as Wav2Vec 2.0 consistently outperform MFCCs, and BiLSTM back-ends outperform attention models, particularly with limited training data. Tract variables give overall accuracy comparable to raw EMA but provide more interpretable error attribution. The study provides the first systematic, controlled cross-lingual benchmark on bilingual non-English articulatory data; it establishes baselines and shows how anatomical and phonological differences compound to challenge domain generalization in AAI.

Key findings

Cross-gender mismatch causes Pearson correlation drops of approximately 0.05 to 0.10 relative to in-domain speaker-independent baselines.
Cross-language mismatch incurs larger correlation declines of about 0.10 to 0.20, showing language difference is a more significant domain shift than gender.
Combined language and gender mismatch produces the largest performance degradation, indicating that anatomical and phonological factors compound.
Self-supervised learning (SSL) acoustic front-ends (Wav2Vec 2.0 and MMS-300m) outperform MFCCs consistently, both within and across domains.
BiLSTM inversion back-ends outperform lightweight Transformer attention models across all acoustic front-ends and articulatory targets, especially given limited training data.
Raw EMA coordinates and tract variables as articulatory targets yield comparable aggregate inversion accuracy (~0.4–0.5 Pearson r), but tract variables enable more diagnostic analysis of error sources.
Tongue sensors (TT, TB, TD) generally achieve higher correlation than lip sensors (UL, LL), with vertical (Z-axis) coordinates predicted more accurately than horizontal (X-axis).
Lip protrusion variables (LP) show weak acoustic recoverability, while tongue constriction locations degrade substantially under cross-language transfer due to language-specific articulation differences.

Threat model

The paper has no adversary in the usual security sense. The "threat" is natural domain shift: the model is evaluated on an unseen speaker who differs from the training speakers in gender or language within the dataset. The model has no access to the test speaker's articulatory data or detailed biomechanics, so it must generalize to new anatomical and phonological traits under speaker-independent evaluation. No active adversarial manipulation, signal tampering, or intentional spoofing is modeled.

Methodology — deep read

The paper investigates acoustic-to-articulatory inversion (AAI) as a regression task that predicts articulatory trajectories from acoustic speech features under varying domain shifts. The shifts of interest are unseen speakers who differ from the training speakers either by gender (within the same language) or by language (within the same gender), reflecting real-world speaker and language variability. No adversarial attacks are considered.
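The mismatch conditions described above can be sketched as a small split generator. This is only an illustration: the `Speaker` record, the speaker IDs, and the helper names are hypothetical, and the paper's exact pairing of training and test groups is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    sid: str       # hypothetical speaker ID
    language: str  # "FIN" or "RUS"
    gender: str    # "F" or "M"


def shift_type(train_group, test_group):
    """Name the shift between two (language, gender) groups by which factor differs."""
    lang_differs = train_group[0] != test_group[0]
    gender_differs = train_group[1] != test_group[1]
    if lang_differs and gender_differs:
        return "cross-language+gender"
    if lang_differs:
        return "cross-language"
    if gender_differs:
        return "cross-gender"
    return "in-domain"


def make_splits(speakers, train_group, test_group):
    """Return a list of (train_speakers, test_speakers) pairs.

    Matched groups yield leave-one-speaker-out folds; mismatched groups
    yield a single split, since the two speaker sets are already disjoint.
    """
    train = [s for s in speakers if (s.language, s.gender) == train_group]
    test = [s for s in speakers if (s.language, s.gender) == test_group]
    if train_group == test_group:
        return [([t for t in train if t != held], [held]) for held in train]
    return [(train, test)]
```

In every returned pair the test speakers are absent from the training set, which is the speaker-independence requirement the protocols share.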

The core dataset is FROST-EMA, a controlled bilingual corpus of 18 Finnish and Russian speakers (8 female, 10 male) recorded with electromagnetic articulography (EMA) at 1250 Hz using five midsagittal sensors (upper lip, lower lip, tongue tip, tongue blade, tongue dorsum). The EMA coordinates are preprocessed by interpolating sensor dropouts, applying a Butterworth low-pass filter at 20 Hz, decimating to a 50 Hz frame rate (a factor of 25), and z-score normalizing per utterance and sensor. Two articulatory target representations are evaluated: raw 10-dimensional EMA coordinates (X and Z per sensor), and 5-dimensional tract variables encoding lip aperture, lip protrusion, and tongue constriction locations computed per utterance.
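The preprocessing chain for a single EMA channel can be sketched in plain Python as below. Several choices are assumptions rather than details from the paper: dropouts are marked with `None`, the Butterworth filter is second order and applied in a single causal pass (the paper's filter order and whether it is zero-phase are not stated), and all function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def interpolate_dropouts(x):
    """Linearly interpolate None samples; gaps at the edges copy the nearest valid value."""
    valid = [i for i, v in enumerate(x) if v is not None]
    if not valid:
        raise ValueError("channel has no valid samples")
    out = [0.0] * len(x)
    for i in range(valid[0]):                 # leading gap
        out[i] = x[valid[0]]
    for i in range(valid[-1], len(x)):        # last valid sample and trailing gap
        out[i] = x[valid[-1]]
    for a, b in zip(valid, valid[1:]):        # interior samples and gaps
        for i in range(a, b):
            t = (i - a) / (b - a)
            out[i] = x[a] + t * (x[b] - x[a])
    return out


def butter2_lowpass(x, cutoff_hz, fs_hz):
    """Second-order Butterworth low-pass as a biquad (bilinear transform, Q = 1/sqrt(2))."""
    w0 = 2.0 * math.pi * cutoff_hz / fs_hz
    alpha = math.sin(w0) / math.sqrt(2.0)     # sin(w0) / (2Q) with Q = 1/sqrt(2)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b0 = (1.0 - cos_w0) / 2.0 / a0
    b1 = (1.0 - cos_w0) / a0
    b2 = b0
    a1 = -2.0 * cos_w0 / a0
    a2 = (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = x[0]                  # start in steady state to limit the transient
    y = []
    for v in x:
        out = b0 * v + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, v
        y2, y1 = y1, out
        y.append(out)
    return y


def zscore(x):
    """Zero-mean, unit-variance normalization over one utterance and channel."""
    mu, sigma = fmean(x), pstdev(x)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in x]


def preprocess_channel(raw, fs_in=1250, fs_out=50, cutoff_hz=20.0):
    """Dropout interpolation, 20 Hz low-pass, decimation 1250 Hz -> 50 Hz, z-scoring."""
    step = fs_in // fs_out                    # 1250 / 50 = 25
    filtered = butter2_lowpass(interpolate_dropouts(raw), cutoff_hz, fs_in)
    return zscore(filtered[::step])
```

The 20 Hz cutoff sits below the 25 Hz Nyquist limit of the 50 Hz output, so filtering before keeping every 25th sample limits aliasing. A production pipeline would more likely rely on a signal-processing library with a zero-phase filter.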

The acoustic front-ends are MFCCs (40 dim) and frozen self-supervised speech representations from Wav2Vec 2.0 Base (768 dim), XLSR-53 Large (1024 dim), and MMS-300m (1024 dim). These SSL models were pretrained on large speech corpora (multilingual for XLSR-53 and MMS-300m) that exclude FROST-EMA speakers.

Two inversion back-ends are benchmarked: a two-layer BiLSTM with hidden size 256 per direction feeding into a two-layer MLP that outputs framewise articulatory estimates, and a lightweight Transformer encoder with 4 self-attention layers (4 heads, embedding dim = 256) plus a two-layer MLP output. Both are trained with MSE loss and the Adam optimizer (lr = 1e-3, batch size 8) for up to 50 epochs, with early stopping on validation loss over a 10% random split. Input acoustic features and targets are windowed into 100-frame segments (~2 seconds at 50 Hz).
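The segment length follows directly from the frame rate: 100 frames at 50 Hz is 2 seconds. A minimal windowing sketch is shown below; the non-overlapping hop and the dropping of a short final remainder are assumptions, since the summary does not state how segments are cut, and `window_pairs` is a hypothetical helper name.

```python
FRAME_RATE_HZ = 50       # articulatory frame rate after decimation
WINDOW_FRAMES = 100      # segment length used for training

SEGMENT_SECONDS = WINDOW_FRAMES / FRAME_RATE_HZ   # 100 / 50 = 2.0


def window_pairs(features, targets, win=WINDOW_FRAMES):
    """Cut time-aligned feature/target sequences into non-overlapping win-frame segments.

    Frames beyond the last full window are dropped (an assumption).
    """
    n = min(len(features), len(targets))
    return [
        (features[start:start + win], targets[start:start + win])
        for start in range(0, n - win + 1, win)
    ]
```

For example, a 5-second utterance (250 frames) yields two full segments, and the last 50 frames are discarded under these assumptions.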

Evaluation uses a strict leave-one-speaker-out (LOSO) protocol to ensure speaker independence, combined with cross-gender and cross-language conditions that isolate gender and language mismatches. Pearson correlation between predicted and true articulatory trajectories is computed per channel and aggregated. The authors run ablations over front-ends, targets, and inversion models in both domain-matched and domain-mismatched scenarios. Results are averaged across the four language-gender groups (Finnish male/female, Russian male/female).
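The per-channel metric can be sketched as follows. The Pearson formula is standard, but the aggregation shown here (an unweighted mean over channels for one utterance) is an assumption, as the summary does not say whether correlations are pooled per utterance, per speaker, or over all frames. The function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def channel_correlations(pred, true):
    """pred/true: lists of frames, each frame a list of channel values.

    Returns (per-channel r, unweighted mean over channels).
    """
    n_channels = len(true[0])
    per_channel = [
        pearson_r([frame[c] for frame in pred], [frame[c] for frame in true])
        for c in range(n_channels)
    ]
    return per_channel, sum(per_channel) / n_channels
```

Because Pearson r is invariant to scale and offset, a prediction with the right trajectory shape but the wrong amplitude still scores well; an error metric such as RMSE would be needed to expose that.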

The pipeline is reproducible in principle because FROST-EMA is publicly described; however, the paper does not mention a release of code or pretrained weights, and details such as exact random seeds and hardware specs are not given. Overall, the methodology carefully isolates anatomical and linguistic domain shifts and benchmarks multiple acoustic and model design choices on a controlled bilingual articulatory dataset.

Technical innovations

Introduction and systematic benchmarking of controlled cross-gender and cross-language domain shift evaluation protocols within a bilingual Finnish-Russian EMA corpus for AAI.
First systematic comparison of articulatory prediction targets including raw EMA coordinates versus interpretable constriction-based tract variables under domain mismatches.
Demonstration that frozen self-supervised learning speech features (Wav2Vec 2.0, MMS-300m) consistently outperform MFCCs for cross-domain articulatory inversion without fine-tuning.
Empirical evidence that BiLSTM inversion back-ends outperform lightweight Transformer self-attention models under low-resource articulatory training settings.
Diagnostic correlation analysis revealing greater acoustic variability in tongue constriction locations (TVs) across languages versus lip variables, highlighting the phonological basis of domain mismatch.

Datasets

FROST-EMA — 18 speakers (11 Finnish, 7 Russian; 8 female, 10 male) — bilingual controlled articulatory corpus with electromagnetic articulography at 1250 Hz; publicly described in [25]

Baselines vs proposed

MFCC + BiLSTM (FIN-M in-domain): Pearson r = 0.30 vs Wav2Vec 2.0 + BiLSTM: r = 0.40 on raw EMA targets
MFCC + Attn-lite (FIN-M in-domain): r = 0.24 vs Wav2Vec 2.0 + Attn-lite: r = 0.34 on raw EMA targets
XLSR-53 + BiLSTM underperforms Wav2Vec 2.0 + BiLSTM: r = 0.31 vs 0.40 in-domain
TV targets with Wav2Vec 2.0 + BiLSTM (FIN-M in-domain): r = 0.49 vs raw EMA targets r = 0.40
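Cross-gender mismatch causes an r drop of approx. 0.05–0.10 from the in-domain baseline (e.g., FIN-F → FIN-M: tongue TT drops from ~0.46 to ~0.34, a per-sensor drop of ~0.12 that slightly exceeds the aggregate range)
Cross-language mismatch causes an r drop of approx. 0.10–0.20 (e.g., FIN-M → RUS-M: from ~0.40 in-domain to ~0.30–0.34 cross-language, a drop of ~0.06–0.10 that sits at the low end of that range)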
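The gaps implied by the in-domain numbers listed above can be tabulated with a few lines of arithmetic. The values are copied from this summary; the dictionary layout and the `delta` helper are hypothetical, and the XLSR-53 entry is assumed to be the same FIN-M condition because it is compared against the same 0.40 baseline.

```python
# In-domain Pearson r keyed by (front-end, back-end, target), as listed above.
reported = {
    ("MFCC", "BiLSTM", "EMA"): 0.30,
    ("Wav2Vec 2.0", "BiLSTM", "EMA"): 0.40,
    ("MFCC", "Attn-lite", "EMA"): 0.24,
    ("Wav2Vec 2.0", "Attn-lite", "EMA"): 0.34,
    ("XLSR-53", "BiLSTM", "EMA"): 0.31,   # group assumed to be FIN-M
    ("Wav2Vec 2.0", "BiLSTM", "TV"): 0.49,
}


def delta(a, b):
    """Difference in reported r between configuration a and configuration b."""
    return round(reported[a] - reported[b], 2)


ssl_gain_bilstm = delta(("Wav2Vec 2.0", "BiLSTM", "EMA"), ("MFCC", "BiLSTM", "EMA"))        # 0.1
ssl_gain_attn = delta(("Wav2Vec 2.0", "Attn-lite", "EMA"), ("MFCC", "Attn-lite", "EMA"))    # 0.1
bilstm_gain = delta(("Wav2Vec 2.0", "BiLSTM", "EMA"), ("Wav2Vec 2.0", "Attn-lite", "EMA"))  # 0.06
tv_gain = delta(("Wav2Vec 2.0", "BiLSTM", "TV"), ("Wav2Vec 2.0", "BiLSTM", "EMA"))          # 0.09
```

On these figures, switching from MFCCs to Wav2Vec 2.0 is worth about 0.10 in r for either back-end, which is larger than the 0.06 gained by switching from Attn-lite to BiLSTM.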

Limitations

Small speaker pools, especially for the Russian female group (only 2 speakers), reduce the statistical power and robustness of those specific cross-domain results.
No fine-tuning of SSL front-ends was performed, which might improve results but was outside the scope.
The study covers only L1 conditions and excludes the L2 and imitated accented speech scenarios present in FROST-EMA, limiting generality.
Evaluation is restricted to the Pearson correlation metric, without exploring perceptual or downstream task impact.
The lightweight Transformer attention model was relatively small; stronger or hybrid architectures might close the performance gap with BiLSTM but were not explored.
No adversarial or out-of-distribution examples were tested beyond defined cross-gender and cross-language shifts.

Open questions / follow-ons

How can speaker adaptation or domain adaptation techniques be incorporated to reduce cross-language and cross-gender degradation in AAI performance?
What is the impact of fine-tuning SSL acoustic front-ends on articulatory inversion robustness across diverse speaker and language domains?
Can larger or hybrid inversion back-end architectures combining recurrent and attention mechanisms provide better generalization given limited articulatory training data?
How do L2 and imitated accented speech conditions affect cross-domain articulatory inversion performance and what strategies mitigate such effects?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners working on speech-based liveness or challenge-response systems that use articulatory signatures, this study highlights the domain shift challenges in modeling speaker-independent and cross-lingual articulatory patterns. Robust articulation estimation matters for secure voice biometrics and speech-driven human verification, but the results here show that inversion models trained on one language or gender lose accuracy when applied to another (roughly 0.05–0.20 in Pearson r). The FROST-EMA benchmark quantifies this drop and suggests that system designers include cross-domain evaluation, use self-supervised acoustic features, and prefer recurrent back-ends when articulatory training data is scarce. Interpretable tract variable targets also help identify which articulatory components transfer well across domains, which can guide more focused defense mechanisms. However, the limits on dataset diversity and the absence of real-world accented speech conditions mean more research is needed before articulatory inversion is deployed for security-critical bot detection or CAPTCHA systems at scale.

Cite

bibtex

@article{arxiv2606_20478,
  title={Beyond Speaker Independence: Evaluating Cross-Lingual Acoustic-to-Articulatory Inversion Across Finnish and Russian},
  author={Ruchi Pandey and Tomi Kinnunen},
  journal={arXiv preprint arXiv:2606.20478},
  year={2026},
  url={https://arxiv.org/abs/2606.20478}
}
