What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study
Source: arXiv:2606.09717 · Published 2026-06-08 · By Zhu Li, Shekhar Nayak, Matt Coler
TL;DR
This study addresses the challenge of understanding which specific prosodic features independently drive sarcasm perception in speech. Prior research has relied on naturally produced speech where prosodic cues such as pitch, loudness, and speech rate co-vary, making it difficult to isolate their individual effects. The authors leverage a modern neural text-to-speech (TTS) system with prompt-based prosody control to synthetically manipulate speech rate, pitch variation, and loudness orthogonally. They generate a carefully selected stimulus set controlling these three dimensions independently and collect human listener sarcasm and naturalness ratings on these stimuli. They also evaluate a multimodal foundation model’s sarcasm and naturalness predictions on the same stimuli for comparison.
Results indicate that human sarcasm perception relies primarily on loudness increases, especially combined with flat pitch contours, while speech rate and pitch variation show weaker or nonsignificant effects. However, the foundation model weights speech rate more heavily, assigning higher sarcasm ratings to slowed speech, but pays little attention to loudness. This reveals a clear divergence between human and artificial prosody cue weighting for sarcasm recognition. The study showcases how controllable neural TTS enables rigorous causal testing of prosodic contributions to pragmatic speech perception.
Key findings
- Human sarcasm perception is significantly driven by loudness (β = 0.285, p = .017), with louder utterances rated as more sarcastic.
- Speech rate (β = 0.061, p = .617) and pitch variation (β = 0.138, p = .248) showed no significant main effects on human sarcasm ratings.
- Human naturalness ratings favored fast speech (β = 0.090, p = 3.1×10⁻⁶) and soft loudness (β = 0.113, p = .0006), while pitch contour had no significant effect on naturalness (β = −0.041, p = .289).
- Foundation model sarcasm ratings were significantly influenced by speech rate (β = 0.313, p = .009), rating slow speech as more sarcastic, but not by loudness (β = 0.035, p = .773) or pitch (β = 0.132, p = .272).
- Model and human sarcasm cue-weighting patterns showed no rank correlation across conditions (Spearman ρ = −0.11, p = .26), indicating divergent prosodic feature reliance.
- Orthogonal prosodic manipulations yielded large effect sizes on intended dimensions (pitch variation d = 1.14; loudness d = 0.81; duration d = 1.76), with minimal spillover (all |d| < 0.25).
- Human inter-rater reliability for sarcasm ratings was modest at the individual level (ICC(2,1) = 0.15) but high after aggregation (ICC(2,k) = 0.92), supporting the stability of averaged judgments.
- Foundation model prediction reliability across random seeds was moderate for sarcasm (ICC(2,k) = 0.80) but lower for naturalness (ICC(2,k) = 0.67).
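The two reliability figures above follow the standard two-way random-effects ICC definitions (Shrout and Fleiss). The sketch below shows how ICC(2,1) and ICC(2,k) fall out of the ANOVA mean squares. It assumes a complete stimuli × raters matrix, which may not match the paper's design (each listener heard only a subset of stimuli), and the function name `icc2` and the toy matrix are illustrative, not taken from the paper.

```python
def icc2(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a complete ratings matrix.

    ratings: list of rows, one per stimulus; each row holds one score per rater.
    Assumption: no missing cells (a simplification of the real study design).
    """
    n = len(ratings)          # number of stimuli (targets)
    k = len(ratings[0])       # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                # between-stimulus mean square
    msc = ss_cols / (k - 1)                # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))     # residual mean square

    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average


# Toy matrix: 6 stimuli rated by 4 raters (illustrative values only)
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single, average = icc2(toy)
print(round(single, 2), round(average, 2))
```

A low single-rater value alongside a high averaged value is the expected pattern when individual judgments are noisy but many raters are pooled.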
Threat model
n/a. The paper focuses on psycholinguistic perception mechanisms rather than security threats or adversarial settings.
Methodology — deep read
Threat model and assumptions: The study does not focus on adversarial threats but on psycholinguistic perception. The goal is to identify causal contributions of prosodic cues to sarcasm perception when lexical content and speaker identity are controlled. Listeners are assumed to have typical English proficiency and experience with sarcasm interpretation. The computational foundation model is assumed to simulate perceptual processes without contextual information.
Data provenance, size, labels, splits, preprocessing: The linguistic stimulus set was adapted from Bryant and Fox Tree (2002) and consists of 24 short, semantically neutral English utterances that can be interpreted as sincere or sarcastic depending on prosody or context. Each utterance was synthesized under eight prosodic conditions (two levels each of pitch variation: dynamic vs. flat; loudness: loud vs. soft; speech rate: fast vs. slow), yielding 192 stimuli in total.
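The 2 × 2 × 2 factorial crossing described above can be enumerated in a few lines. This is a minimal sketch: the factor and level names follow the summary, while the utterance IDs are placeholders because the actual utterance texts are not listed here.

```python
from itertools import product

# Factor levels as named in the summary
FACTORS = {
    "pitch_variation": ("dynamic", "flat"),
    "loudness": ("loud", "soft"),
    "speech_rate": ("fast", "slow"),
}


def build_design(utterances):
    """Cross every utterance with every combination of factor levels."""
    conditions = [
        dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())
    ]
    return [{"utterance": u, **cond} for u in utterances for cond in conditions]


# Placeholder IDs standing in for the 24 utterances
utterances = [f"utt_{i:02d}" for i in range(1, 25)]
design = build_design(utterances)
print(len(design))  # 24 utterances x 8 conditions = 192
```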
Synthetic stimuli were generated with the Qwen3-TTS-12Hz-1.7B-CustomVoice model using a single synthetic voice to eliminate inter-speaker variability. For each utterance-condition pair, 100 candidate samples were generated with stochastic decoding (varied random seeds and temperature), and acoustic features (pitch SD, loudness RMS, duration) were extracted from each. Stimulus selection aimed to maximize the effect size in the target prosodic dimension while minimizing spillover to the others; orthogonality was confirmed with Cohen’s d.
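One way to picture the selection step is as a scoring problem over the candidate pool. The sketch below is not the authors' procedure: the scoring rule (target z-score in the intended direction minus the absolute z-scores of the other features) is a hypothetical stand-in, and the feature keys and toy values are invented for illustration.

```python
from statistics import mean, stdev

FEATURES = ("pitch_sd", "rms", "duration")  # hypothetical feature keys


def zscores(candidates):
    """Standardize each feature across the candidate pool."""
    stats = {
        f: (mean(c[f] for c in candidates), stdev(c[f] for c in candidates))
        for f in FEATURES
    }
    return [
        {f: (c[f] - stats[f][0]) / stats[f][1] for f in FEATURES}
        for c in candidates
    ]


def select_candidate(candidates, target, direction):
    """Pick the candidate that moves `target` furthest in `direction`
    (+1 or -1) while staying close to the pool mean on other features.
    Assumed scoring rule, not the paper's exact criterion."""
    z = zscores(candidates)

    def score(i):
        spillover = sum(abs(z[i][f]) for f in FEATURES if f != target)
        return direction * z[i][target] - spillover

    best = max(range(len(candidates)), key=score)
    return candidates[best]


# Toy pool: candidate "c" is loudest but also drifts in pitch and duration
pool = [
    {"id": "a", "pitch_sd": 20.0, "rms": 0.10, "duration": 1.0},
    {"id": "b", "pitch_sd": 21.0, "rms": 0.20, "duration": 1.0},
    {"id": "c", "pitch_sd": 35.0, "rms": 0.22, "duration": 1.6},
    {"id": "d", "pitch_sd": 19.0, "rms": 0.09, "duration": 0.9},
]
print(select_candidate(pool, target="rms", direction=+1)["id"])
```

In the toy pool the rule prefers candidate "b" over the louder "c", because "c" also shifts pitch variation and duration, which is the kind of spillover the selection is meant to avoid.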
Architecture / algorithm: The core novel component is the use of prompt-based prosodic conditioning of a large neural TTS system, specifying instructions for speech rate, pitch contour shape, and intensity to produce fine-grained, orthogonally manipulated speech samples.
The foundation model for perception evaluation is Qwen3-Omni, a large-scale multimodal foundation model with audio input processing capability, prompted to simulate sarcasm perception based on prosody.
Training regime: No models were trained as part of this paper, and the human study involves no training step. Model inference was repeated six times with different random seeds, and the results were averaged for robustness.
Evaluation protocol: Human participants (N = 66 native or near-native English speakers) each listened online to 24 utterances covering all 8 conditions and rated sarcasm and naturalness on 5-point Likert scales. Reliability was assessed with ICC, and results were aggregated over participants. Linear mixed-effects models estimated the fixed effects of speech rate, pitch variation, and loudness, with random intercepts for participant and utterance. Tukey-corrected pairwise post-hoc tests examined differences between conditions.
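Written out, a model of the kind described would take roughly the following form. This is a sketch under assumptions: it shows main effects only, the predictor coding (for example 0/1 or ±0.5) is not stated in this summary, and the paper may also include interaction terms.

```latex
y_{ij} = \beta_0
       + \beta_{\text{rate}}\, R_{ij}
       + \beta_{\text{pitch}}\, P_{ij}
       + \beta_{\text{loud}}\, L_{ij}
       + u_i + v_j + \varepsilon_{ij},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_j \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{ij}$ is the rating given by participant $i$ to utterance $j$; $R_{ij}$, $P_{ij}$, and $L_{ij}$ are the binary speech-rate, pitch-variation, and loudness levels of the stimulus that participant heard; and $u_i$ and $v_j$ are the random intercepts for participant and utterance. The reported β values correspond to the three fixed-effect slopes.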
Model evaluations used the identical stimuli, with ratings averaged over six random seeds. For each stimulus the model produced sarcasm and naturalness ratings plus a categorical label accompanied by a linguistic explanation. The statistical analyses paralleled those applied to the human data.
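The seed averaging and the human-versus-model rank comparison can be sketched in plain Python. The function names and toy numbers below are illustrative; the paper's own analysis code is not available here, and a library routine such as SciPy's `spearmanr` would normally be used instead of the hand-rolled version.

```python
from statistics import mean


def average_over_seeds(ratings_by_seed):
    """ratings_by_seed: one list per seed, aligned by stimulus index."""
    return [mean(vals) for vals in zip(*ratings_by_seed)]


def rank(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Toy example: three seeds, four stimuli (values are made up)
model = average_over_seeds([[3, 2, 4, 1], [3, 3, 5, 1], [3, 1, 3, 1]])
human = [2.1, 3.4, 1.8, 4.0]
print(model, round(spearman(human, model), 2))
```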
Reproducibility: The Qwen3-TTS and Qwen3-Omni models are publicly released. Stimuli and analysis procedures are described in detail including orthogonality criteria, acoustic validation, and statistical models, supporting reproducibility. No explicit code release noted; the dataset of stimuli and ratings may be restricted.
Example end-to-end: For the utterance "great job" synthesized under the slow, flat, loud condition, multiple candidates were generated via sampling, acoustic features were extracted, and the exemplar with the maximal loudness effect and minimal pitch and duration spillover was selected. Listeners rated this stimulus high in sarcasm and moderate in naturalness, whereas the model's sarcasm rating was driven largely by the slow rate and much less by loudness, illustrating how controlled prosody separates human and model cue use.
Technical innovations
- Use of prompt-based conditioning on a neural TTS to orthogonally manipulate pitch variation, loudness, and speech rate in synthetic speech.
- A novel stimulus selection procedure optimizing effect size in target prosodic dimensions while minimizing cross-dimensional spillover to establish orthogonality.
- Comparative analysis framework integrating controlled synthetic speech perception by both humans and a multimodal foundation model for sarcasm recognition.
- Demonstration that loudness is a causal driver of human sarcasm perception, in contrast to the evaluated foundation model (Qwen3-Omni), which relies on speech rate.
Datasets
- Bryant and Fox Tree stimulus set — 24 short English utterances — adapted from Bryant and Fox Tree (2002)
- Synthetic speech stimuli — 192 stimuli (24 utterances × 8 prosodic conditions) — generated via Qwen3-TTS
Baselines vs proposed
- Human sarcasm ratings: loudness effect β = 0.285 (p = .017) vs. speech rate β = 0.061 (p = .617) vs. pitch variation β = 0.138 (p = .248)
- Model sarcasm ratings: speech rate effect β = 0.313 (p = .009) vs. loudness β = 0.035 (p = .773) vs. pitch variation β = 0.132 (p = .272)
- Aggregated reliability for sarcasm ratings: human ICC(2,k) = 0.92 (across listeners) vs. model ICC(2,k) = 0.80 (across random seeds)
- Cohen’s d for prosodic dimension manipulations: pitch variation 1.14; loudness 0.81; duration 1.76; non-target effects small (all |d| < 0.25)
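The effect sizes in the last item are Cohen's d values. A minimal sketch of the pooled-standard-deviation form is given below; whether the authors used exactly this variant is an assumption, and the toy numbers are illustrative rather than measured.

```python
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (sample variances)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Toy check: both groups have variance 4, means differ by 1, so d = 0.5
print(cohens_d([2, 4, 6], [1, 3, 5]))
```

Applied to the validation step, `a` and `b` would be one acoustic feature (for example loudness RMS) measured on the stimuli at the two levels of a factor: a large |d| is wanted when the feature belongs to the manipulated factor and a small |d| when it does not.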
Limitations
- No contextual or multi-speaker variability was included; all stimuli used a single synthetic speaker voice, limiting ecological validity.
- The orthogonality between prosodic dimensions is statistical, but minor unmeasured acoustic confounds may remain.
- Human listeners evaluated stimuli devoid of contextual cues, which modulate sarcasm perception in natural settings.
- Model evaluations were conducted on a single foundation model (Qwen3-Omni); results may not generalize to other architectures.
- The human subject sample size (N=66) and the limited number of utterances (24) constrain generalizability across languages or utterance types.
- No adversarial or noisy conditions were tested; robustness of cue-weighting under real-world conditions remains unknown.
Open questions / follow-ons
- How do prosodic cue weightings for sarcasm perception generalize across multiple speakers and languages beyond English?
- What is the role of context, discourse, and interlocutor knowledge in modulating prosodic sarcasm cues?
- Can multimodal models be trained or fine-tuned to better align their sarcasm perception cue-weighting with human listeners?
- How do other prosodic or voice quality features (e.g., voice timbre, creakiness) influence sarcasm perception alongside pitch, loudness, and rate?
Why it matters for bot defense
This work highlights that fine-grained control and causal manipulation of prosodic features in synthetic speech is feasible using recent neural TTS systems, which can produce highly natural yet systematically varied stimuli. For bot-defense and CAPTCHA applications involving speech, understanding which acoustic cues humans rely on for pragmatic intent perception (here, sarcasm) and how machine models differ in cue weighting is critical. It suggests that automated systems may misinterpret prosodic signals due to differing sensitivities, potentially leading to false positives or negatives in voice-based authentication or bot detection scenarios.
Furthermore, the approach of prompt-based prosody control combined with rigorous acoustic validation presents a methodological template for generating test stimuli to benchmark voice-driven behavioral detectors or challenge-response tests. The divergence between human and model perception underscores the need to incorporate human-aligned prosody sensitivity in bot-defense speech systems to avoid semantic spoofing or manipulations that exploit differences in prosody interpretation.
Cite
@article{arxiv2606_09717,
title={What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study},
author={Zhu Li and Shekhar Nayak and Matt Coler},
journal={arXiv preprint arXiv:2606.09717},
year={2026},
url={https://arxiv.org/abs/2606.09717}
}