
MAJIC: Leveraging Articulatory Motion for Speech-based Emotion Recognition

Source: arXiv:2606.18228 · Published 2026-06-16 · By Tanmay Srivastava, Paras Bhavnani, Benjir Alvee Islam, Shubham Jain

TL;DR

MAJIC introduces a multimodal speech emotion recognition (SER) system that integrates articulatory motion data from jaw and facial muscles, captured via wearable IMU sensors near the temporomandibular joint (TMJ), with traditional audio features. The motivation arises from a significant performance drop of prior SER models trained on acted, exaggerated emotional speech when applied to subtle, spontaneous expression by non-actors. MAJIC addresses this gap by engineering novel features from low-frequency jaw motion, mid-frequency facial vibrations, and high-frequency IMU components reflecting speech-induced vibrations. These articulatory signals complement audio prosodic features and help capture pre-speech emotional cues unavailable to audio-only systems. Evaluated on a dataset of 20 multilingual participants across multiple sessions and speech scenarios, including prompted and conversational speech, MAJIC achieves 93% accuracy and 91% F1 score, outperforming strong audio and prior multimodal baselines by substantial margins.

The core innovation of MAJIC lies in leveraging a lightweight, privacy-preserving IMU sensor to capture biomechanical articulatory features linked to emotional states, combined with a multi-task learning model that jointly predicts emotion category, valence, and arousal, incorporating semantic relationships via pretrained text embeddings. The system demonstrates robustness across users, languages, and less exaggerated emotional speech, using personalized adaptation with limited per-user data. These results highlight the complementary value of articulatory motion sensing for robust real-world speech emotion recognition beyond constrained scripted datasets.

Key findings

  • MAJIC achieves 93% accuracy and 91% F1 score on a dataset of 20 participants spanning 10 languages, surpassing audio-only baselines by at least 26.2% F1 score.
  • State-of-the-art audio-only SER accuracy drops from ~85% on professional actor speech to 65% for non-actors; combined audio+text models drop from 88% to 44.3%.
  • Jaw motion features (0-5 Hz) such as sharpness and smoothness reliably distinguish emotions, with anger showing sharp, irregular motion and sadness smoother, more regular motion.
  • Facial vibration features (5-20 Hz) characterize muscle tension patterns, with high tremor spectral power for fear and disgust and low tremors for happiness and neutral states.
  • High-frequency IMU components (20-50 Hz) capture speech-induced mechanical vibrations correlating with emotional arousal; anger and disgust exhibit higher energy than fear or sadness.
  • Pre-speech articulatory motion analysis (jaw displacement and build-up rate) provides emotion-discriminative cues preceding vocalization.
  • Multi-task learning combining categorical emotions and valence-arousal prediction, guided by semantic relationships from RoBERTa embeddings, improves emotion recognition robustness.
  • Leave-one-user-out cross-validation with 20% user-specific fine-tuning achieves consistent generalization across languages, speech scenarios, and phrase variations.
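The three IMU frequency bands named in the findings above can be illustrated with a small sketch. It splits a 100 Hz signal into the 0-5 Hz, 5-20 Hz, and 20-50 Hz components using a brick-wall FFT mask. The mask is an assumption chosen for brevity (this summary does not state the paper's filter design), and `split_bands`, `band_energy`, and the synthetic demo signal are hypothetical names and data introduced only for illustration.

```python
import numpy as np

FS = 100.0  # IMU sampling rate in Hz, as stated in the methodology
BANDS = {"jaw": (0.0, 5.0), "facial": (5.0, 20.0), "speech": (20.0, 50.0)}

def split_bands(signal, fs=FS, bands=BANDS):
    """Split a 1-D signal into band-limited components with an FFT mask.

    Assumption: a brick-wall mask stands in for whatever filters the
    authors actually used.
    """
    x = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    out = {}
    for name, (lo, hi) in bands.items():
        # Half-open interval [lo, hi) so no bin is counted twice;
        # a band that reaches Nyquist keeps its upper edge as well.
        upper = freqs <= hi if hi >= fs / 2 else freqs < hi
        mask = (freqs >= lo) & upper
        out[name] = np.fft.irfft(spectrum * mask, n=x.size)
    return out

def band_energy(component):
    """Mean squared amplitude of one band-limited component."""
    return float(np.mean(np.square(component)))

# Synthetic demo: one sinusoid inside each band (2 Hz, 12 Hz, 30 Hz).
t = np.arange(0, 4.0, 1.0 / FS)
demo = (np.sin(2 * np.pi * 2 * t)
        + 0.5 * np.sin(2 * np.pi * 12 * t)
        + 0.25 * np.sin(2 * np.pi * 30 * t))
parts = split_bands(demo)
energies = {name: band_energy(comp) for name, comp in parts.items()}
```

In this demo each sinusoid lands in exactly one band, so the three components sum back to the input. Real IMU data would need windowing or a proper filter design to limit spectral leakage.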

Threat model

The adversary is the inherent variability and subtlety of emotional expression by everyday, non-actor speakers, whose emotional cues in speech are less exaggerated and harder to classify than acted speech. The system assumes no direct adversarial attacks but must contend with ambient body and head motion noise interfering with the jaw motion sensing. It cannot rely on intrusive, privacy-sensitive modalities or professional actor datasets. It cannot control or know the speaker's linguistic content or emotional intensity a priori.

Methodology — deep read

The study targets non-actor speakers, whose subtle, non-exaggerated emotional speech challenges conventional SER models trained primarily on acted data. In this framing, the adversary is the natural variability of emotional expression, which lowers recognition accuracy. The method assumes access to multimodal sensor data and participant self-reported emotion labels, and uses no invasive or privacy-sensitive modalities.

Data were collected from 20 participants (12 male, 8 female), all fluent in English plus additional languages, across two recording sessions separated by at least three days. The dataset covers 6 emotion categories (Happy, Sad, Fear, Disgust, Anger, Neutral) recorded in 6 scenarios: different phrases, same phrases, native language speech, paragraph reading, conversational responses, and noisy conditions. Signals were captured with a twin-IMU setup: one IMU on the TMJ acquires jaw and facial motion (gyroscope and accelerometer at 100 Hz), and a second on the temporal bone serves as a reference for noise cancellation. A microphone records audio at 48 kHz. Labels came from participant self-reports and were verified with high inter-rater consistency (Pearson r=0.93). Preprocessing included gravity compensation and adaptive noise removal: an FIR filter adapted with the least-mean-squares (LMS) algorithm uses the reference IMU to isolate jaw motion from body and head movement.
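The adaptive noise removal step can be sketched as a textbook LMS noise canceller. The code below is a minimal illustration, not the authors' implementation: the tap count, step size, function name `lms_cancel`, and the synthetic jaw and head signals are all assumptions made for the example.

```python
import numpy as np

def lms_cancel(primary, reference, n_taps=8, mu=0.01):
    """Subtract from `primary` the part predictable from `reference`.

    An FIR filter with `n_taps` weights is adapted sample by sample with
    the LMS rule; the error signal is the cleaned output.
    """
    d = np.asarray(primary, dtype=float)
    r = np.asarray(reference, dtype=float)
    w = np.zeros(n_taps)
    cleaned = np.zeros_like(d)
    for n in range(d.size):
        # Most recent reference samples, newest first, zero-padded at the start.
        window = r[max(0, n - n_taps + 1): n + 1][::-1]
        x = np.zeros(n_taps)
        x[:window.size] = window
        e = d[n] - w @ x      # error = estimate of jaw-only motion
        w += 2 * mu * e * x   # LMS weight update
        cleaned[n] = e
    return cleaned, w

# Synthetic demo (assumed data): a 2 Hz jaw movement plus head motion that
# the reference IMU observes directly.
fs = 100.0
t = np.arange(5000) / fs
rng = np.random.default_rng(0)
head = rng.standard_normal(t.size)
jaw = 0.3 * np.sin(2 * np.pi * 2 * t)
tmj_signal = jaw + 0.8 * head

cleaned, weights = lms_cancel(tmj_signal, head)
half = t.size // 2
raw_error = float(np.mean((tmj_signal[half:] - jaw[half:]) ** 2))
residual_error = float(np.mean((cleaned[half:] - jaw[half:]) ** 2))
```

After the weights converge, the first tap settles near the 0.8 coupling used in the demo and the residual error falls well below the raw error. On real recordings the step size would need tuning against the reference signal's power.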

Feature engineering separated the IMU signals into three frequency bands reflecting jaw motion (0-5 Hz), facial vibrations (5-20 Hz), and high-frequency speech-induced vibrations transmitted through bone (20-50 Hz). Jaw motion features include sharpness (second derivative magnitude), smoothness (spectral arc length), activity bursts and inter-burst interval durations, and pre-speech jaw displacement and build-up rates. Facial vibration features comprise spectral power, dominant frequency, and frequency spread, which capture muscle tremor and tension patterns. The high-frequency IMU components quantify energy variation correlated with emotional arousal. Wavelet multi-resolution analysis of the raw IMU signals quantifies energy and entropy at multiple temporal scales to distinguish rapid from slow articulatory patterns. Standard statistical measures (mean, standard deviation, skewness, kurtosis, and crest factor) were also computed from the IMU signals.
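A few of the jaw motion features listed above can be sketched as follows. The exact definitions are not given in this summary, so every formula here is an assumption: sharpness as the mean absolute second difference, bursts as runs where the magnitude exceeds a threshold, and pre-speech build-up rate as peak displacement divided by the pre-onset duration. All function names are hypothetical, and spectral arc length is omitted for brevity.

```python
import numpy as np

def sharpness(x, fs):
    """Mean absolute second derivative, estimated by finite differences."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.abs(np.diff(x, n=2))) * fs ** 2)

def crest_factor(x):
    """Peak magnitude divided by RMS."""
    x = np.asarray(x, dtype=float)
    return float(np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2)))

def bursts(x, fs, threshold):
    """Return (burst durations, inter-burst intervals) in seconds.

    Assumption: a burst is a run of samples with |x| above `threshold`.
    """
    active = np.abs(np.asarray(x, dtype=float)) > threshold
    padded = np.concatenate(([False], active, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)   # first sample of each burst
    ends = np.flatnonzero(edges == -1)    # one past the last sample
    durations = (ends - starts) / fs
    gaps = (starts[1:] - ends[:-1]) / fs
    return durations, gaps

def pre_speech(x, fs, onset_index):
    """Peak displacement before speech onset and its build-up rate."""
    seg = np.asarray(x[:onset_index], dtype=float)
    displacement = float(np.max(np.abs(seg - seg[0])))
    return displacement, displacement / (seg.size / fs)

# Small synthetic checks (assumed data).
fs = 100.0
t = np.arange(100) / fs
parabola_sharpness = sharpness(t ** 2, fs)   # second derivative of t^2 is 2
square_crest = crest_factor([1.0, -1.0, 1.0, -1.0])
pulse = np.zeros(100)
pulse[10:20] = 1.0
pulse[50:70] = 1.0
durations, gaps = bursts(pulse, fs, threshold=0.5)
ramp = np.arange(100) * 0.01
displacement, rate = pre_speech(ramp, fs, onset_index=50)
```

The threshold and the speech onset index would in practice come from the data, for example from an energy-based voice activity detector on the audio channel.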

Audio features come from the emobase set extracted with openSMILE. The set includes 26 low-level descriptors and their delta coefficients, covering pitch, formants, spectral characteristics, voicing probability, intensity, loudness, and line spectral frequencies. These are recognized as robust audio features for expression recognition.

The learning architecture utilizes a multi-task ensemble model combining Support Vector Machines, Neural Networks, and Nearest Neighbor classifiers for emotion classification. The model simultaneously predicts categorical emotions and auxiliary valence-arousal outputs discretized into coarse levels based on established mappings, with losses combined by empirically chosen weights. To better capture semantic proximity among emotions, joint embedding distances from RoBERTa pretrained text embeddings encode label relationships, optimized by a cosine-distance based metric loss with a margin to maintain separation between distinct emotions.
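The combined objective described above can be sketched as a weighted sum of three classification losses and a cosine-distance margin term. This is an illustrative reading, not the paper's formulation: the hinge form of the metric loss, the weights, the margin value, and the one-hot vectors standing in for RoBERTa label embeddings are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                      # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[target])

def cosine_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def margin_metric_loss(sample_emb, label_embs, target, margin=0.2):
    """Hinge loss: the true label embedding should be closer to the sample
    than every other label embedding by at least `margin`."""
    pos = cosine_distance(sample_emb, label_embs[target])
    losses = [max(0.0, margin + pos - cosine_distance(sample_emb, emb))
              for k, emb in enumerate(label_embs) if k != target]
    return float(np.mean(losses))

def multitask_loss(emo_logits, val_logits, aro_logits, sample_emb, label_embs,
                   emo_t, val_t, aro_t, weights=(1.0, 0.5, 0.5, 0.5)):
    """Weighted sum of emotion, valence, arousal, and metric losses.

    Assumption: the weights are placeholders for the empirically chosen ones.
    """
    terms = (
        cross_entropy(emo_logits, emo_t),
        cross_entropy(val_logits, val_t),
        cross_entropy(aro_logits, aro_t),
        margin_metric_loss(sample_emb, label_embs, emo_t),
    )
    return float(sum(w * term for w, term in zip(weights, terms)))

# Demo with placeholder embeddings (one-hot vectors, not RoBERTa outputs).
label_embs = np.eye(6)
aligned = multitask_loss(np.zeros(6), np.zeros(3), np.zeros(3),
                         label_embs[2], label_embs, emo_t=2, val_t=1, aro_t=1)
misaligned_metric = margin_metric_loss([1.0, 0.0], np.eye(2), target=1)
```

With uniform logits and a sample embedding that coincides with its label embedding, the metric term vanishes and only the cross-entropy terms remain. A sample embedded on the wrong label is penalized by the full distance plus the margin.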

Training employed leave-one-user-out cross-validation: for each held-out test user, 20% of that user's data was used to fine-tune (personalize) the model, and the remaining 80% was used for testing. This protocol tests robustness to unseen speakers while allowing a small amount of user-specific adaptation from limited data. Hyperparameters were tuned on the baseline models and carried over to the proposed system to ensure a fair comparison.
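The evaluation protocol can be sketched as an index generator. This is a minimal sketch assuming a random 20/80 split of the held-out user's samples; whether the authors stratified that split by emotion or session is not stated here, and `louo_splits` is a hypothetical helper name.

```python
import random

def louo_splits(user_ids, adapt_fraction=0.2, seed=0):
    """Yield (held_out_user, train_idx, adapt_idx, test_idx) per user.

    train_idx: all samples from the other users
    adapt_idx: the fraction of the held-out user's samples used for fine-tuning
    test_idx:  the rest of the held-out user's samples
    """
    rng = random.Random(seed)
    for user in sorted(set(user_ids)):
        train = [i for i, u in enumerate(user_ids) if u != user]
        own = [i for i, u in enumerate(user_ids) if u == user]
        rng.shuffle(own)  # assumption: random, unstratified split
        n_adapt = max(1, round(adapt_fraction * len(own)))
        yield user, train, sorted(own[:n_adapt]), sorted(own[n_adapt:])

# Demo: 4 users with 10 samples each (assumed toy data).
user_ids = [u for u in range(4) for _ in range(10)]
splits = list(louo_splits(user_ids))
```

Each fold trains on the other users, fine-tunes on `adapt_idx`, and reports metrics on `test_idx` only, so no test sample is ever seen during adaptation.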

Evaluation metrics are classification accuracy and F1 score, the latter emphasizing performance across imbalanced classes. Baselines include a fine-tuned Emo2Vec audio-only model, an IMU-derived approach that transforms jaw motion into mel spectrograms, and a combined audio + IMU baseline from Jawthenticate that extracts intonation and rhythm features from jaw motion. The proposed MAJIC model significantly outperformed all baselines across all scenarios. Statistical significance and ablation results were reported, showing the contribution of each articulatory frequency band and of the multi-task losses. Code release status is not stated; the dataset comes from the authors' user study and is presumably private.
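For reference, the two metrics can be computed as below. The averaging scheme for F1 is not stated in this summary, so the macro average (an unweighted mean of per-class F1) is an assumption, and the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example (assumed labels).
y_true = ["happy", "happy", "sad", "sad"]
y_pred = ["happy", "sad", "sad", "sad"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here accuracy is 0.75 while macro F1 is lower, because the missed "happy" sample pulls down that class's recall. This is why F1 is the more informative number when classes are imbalanced.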

Technical innovations

  • Utilization of a single wearable IMU sensor near the TMJ to non-invasively capture multi-band articulatory motion (jaw, facial muscle vibrations, high-frequency speech-induced vibrations) for SER.
  • Engineering novel articulatory motion features including sharpness, smoothness, activity bursts, and pre-speech motion characteristics reflecting subtle emotional cues absent from audio.
  • A multi-modal multi-task learning framework combining audio features with IMU articulatory features and semantic emotion embeddings from RoBERTa to jointly predict emotions and valence-arousal dimensions.
  • Adaptive noise cancellation of motion artifacts using a twin-IMU setup and an LMS-adapted FIR filter to isolate jaw articulatory signals from head/body movement.

Datasets

  • MAJIC user study dataset — 20 participants, multiple languages (10), six emotion classes, 2 recording sessions each — private, collected by authors

Baselines vs proposed

  • Emobase audio-only baseline: ~64.8% F1 score vs. 91% F1 score for MAJIC
  • IMU to Audio [24]-derived baseline: falls short of MAJIC by a significant margin (exact % not specified)
  • Audio + IMU features baseline from Jawthenticate [23]: MAJIC's F1 score is higher by 26.2% on average

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.18228.

Fig 5: Overview of MAJIC. We extract articulatory and audio features, com-

Fig 7: illustrates IBI, where sad and neutral states tend to

Fig 9: Different emotions embedding space from text projected in 2D using

Fig 10: Some example phrases from our user study. We have phrases of

Fig 11: The mean F1-Score for our system is greater than 90%, surpassing the other baselines by at least 26.2% on average.

Limitations

  • Dataset size is limited to 20 participants, which may not capture full variability in population or speech patterns.
  • Evaluation primarily on prompted and semi-controlled conversational speech; spontaneous natural emotional speech scenarios not extensively tested.
  • Emotion labels rely on self-report and limited annotator validation, with coarse discrete valence-arousal mapping instead of continuous affect labels.
  • No explicit adversarial robustness or cross-corpus validation conducted to test generalization beyond the collected dataset.
  • High-frequency IMU components above 50 Hz are not considered; detailed signal-noise analysis of wearable sensors under practical usage conditions is not reported.
  • Code and dataset release status is unclear, which may limit reproducibility and comparison by other researchers.

Open questions / follow-ons

  • How does MAJIC generalize to spontaneous, in-the-wild emotional speech beyond prompted scenarios and scripted phrases?
  • Can the addition of continuous valence-arousal annotations, rather than discretized labels, improve model fidelity and emotion nuance capture?
  • What are the effects of long-term user adaptation and day-to-day variability in jaw motion patterns on recognition robustness?
  • How does the system perform on larger, more diverse populations and languages not represented in the current dataset?

Why it matters for bot defense

For practitioners in bot-defense and CAPTCHA, MAJIC’s approach shows the promise of lightweight wearable sensors that capture articulatory biomechanics to supplement audio in emotion recognition. This could strengthen voice-based user authentication and liveness detection by incorporating subtle physiological motion signatures that are difficult for synthetic speech or replay attacks to replicate. The integration of multi-band articulatory features with semantic multi-task learning also suggests avenues for richer, context-aware behavioral biometrics. However, deploying such IMU sensors widely for authentication may face usability and privacy barriers to adoption.

From a CAPTCHA designer’s perspective, leveraging articulatory motion signals could enhance bot-vs-human differentiation by analyzing biomechanical articulations coupled with vocal cues. This could augment audio CAPTCHAs or continuous authentication systems by detecting natural motion-emotion signatures, which are unlikely to be synthesized or replayed accurately. Yet, the current system’s reliance on wearables and personalized model adaptation highlights tradeoffs between robustness and practicality that would need careful consideration in real-world security applications.

Cite

bibtex
@article{arxiv2606_18228,
  title={MAJIC: Leveraging Articulatory Motion for Speech-based Emotion Recognition},
  author={Tanmay Srivastava and Paras Bhavnani and Benjir Alvee Islam and Shubham Jain},
  journal={arXiv preprint arXiv:2606.18228},
  year={2026},
  url={https://arxiv.org/abs/2606.18228}
}


Articles are CC BY 4.0 — feel free to quote with attribution