Automatic Labelling of Speech Translation Errors

Source: arXiv:2606.06047 · Published 2026-06-04 · By Dominik Macháček, Maike Züfle, Ondrej Klejch

TL;DR

This paper addresses the challenge of evaluating quality and confidence in Speech Translation (ST) systems by proposing the Speech Translation Error Labelling (STEL) task. The key problem is that errors in ST outputs reduce trustworthiness and have critical consequences, but there is currently no established methodology to automatically label and assess these errors considering both speech processing and translation contributions. The authors develop a novel annotation protocol tailored to end-to-end long-form speech translation that accommodates communicative goals and context, distinguishing error severities and types. They create a small multi-lingual test dataset with human annotations spanning four language directions (Cs→En, En→Cs, En→De, En→He) covering 32 minutes of speech.

They benchmark two state-of-the-art automatic systems for STEL: a text-only MT quality estimation model (XCOMET-XL) and a multimodal large language model (Qwen2.5-Omni) capable of processing audio and text. Their analysis shows these automatic systems achieve roughly half the precision of humans in error span detection and severity classification. Importantly, systems that directly process speech signals (multimodal) better detect speech-processing related errors, while text-only systems better detect translation-only errors. This complementarity suggests future STEL methods should incorporate direct speech input. Despite the limited dataset size, this foundational work and released resources provide a valuable step towards systematic automatic labelling and evaluation of speech translation errors.

Key findings

The automatic STEL systems achieve about 50% of human precision on error span labelling (e.g., unweighted F1 38.7% for XCOMET vs 71.9% human on En→Cs).
XCOMET (text-only) performs better at detecting translation-only errors (WER=0) while Qwen (multimodal) performs better at detecting speech-processing errors (WER>0).
Using gold transcripts instead of ASR transcripts yields only slight improvements (0–3.6 absolute points), indicating STEL complexity lies beyond speech recognition errors.
Annotator agreement on repeated annotations ranges from 56% to 72%, underscoring inherent task difficulty and annotation noise.
The STEL dataset covers 329 sentence-like segments over 32 minutes of audio across four language directions, requiring 6.5 hours of annotation effort.
Severity-weighted F1 scores are generally lower than unweighted, indicating difficulty in assigning precise error severities automatically.
Kendall’s τ correlation between automatic systems’ segment-level scores and human direct assessment varies widely depending on system and language direction (e.g., 32.5 vs 2.1 for Qwen on En→De vs En→Cs).
Text-only and speech-processing approaches are complementary for comprehensive speech translation error labelling.

Threat model

The adversary is not explicitly modeled since the work focuses on speech translation error labelling for evaluation purposes rather than a security threat scenario. The automatic STEL systems do not assume adversarial manipulation but are evaluated on realistic ST outputs with natural errors in speech processing and translation.

Methodology — deep read

The authors propose the Speech Translation Error Labelling (STEL) task, which focuses on identifying spans of errors in speech translation outputs along with their severity, from a user-centric and communication-informed perspective. The primary threat model is realistic user scenarios where ground-truth source transcripts are unavailable, and the system must estimate error spans and severities in ST output. No adversarial attacks are considered.

Data provenance: The dataset contains 32 minutes of audio segmented into 329 sentence-like segments across four language directions: Czech to English and English to Czech, German, and Hebrew. The segments were chosen from existing ST test sets (ELIT Robothon debate and ACL6060), with manual adjustments for segmentation and alignment. Annotation was performed by native speakers of target languages (Czech, German, Hebrew) who were English L2 speakers. Each annotator provided STEL span and severity annotations under a protocol inspired by prior text MT error span annotations but adapted for ST context and communication goals. Inter-annotator agreement was measured on repeated annotation of two directions.

Architecture and algorithms: The authors evaluate two existing automatic systems for performing STEL:

XCOMET-XL2, a text-only reference-free MT quality estimation model producing sentence-level quality scores and error spans. For STEL, it is given ASR transcripts as source input instead of gold transcripts to mimic realistic conditions.
Qwen2.5-Omni-7B, a multimodal large language model capable of ingesting both source speech (audio) and ASR transcripts to produce error span labels and severity classifications.

Training regime: The paper uses publicly available pretrained models; no additional training is done. Inference is performed on an NVIDIA A100 GPU.

Evaluation protocol: The authors evaluate span detection quality via character-level micro F1 scores, both unweighted and weighted by severity. Severity-weighted F1 applies partial credit for mismatched severity labels. They also measure correlation of predicted segment-level quality scores to human direct assessment scores using Kendall’s τ. Comparisons involve automatic STEL on ASR transcripts vs gold transcripts, and ablations on modality (audio+ASR vs ASR only, audio only).

Reproducibility: Code and dataset have been publicly released. Model weights are from public repositories. The dataset is limited in size and languages but intended to benchmark automatic STEL research.

Concrete example: For an English to Czech segment with recorded speech, an ST candidate translation is aligned sentence-wise. Annotators mark error spans and assign severity labels (Critical, Minor, Negligible, Redundancy). The automatic systems receive the ASR transcript and/or audio input and output error spans with these severity labels. The predictions are then compared via F1 overlap with the annotated spans to evaluate performance.

Technical innovations

A novel annotation protocol for speech translation error labelling emphasizing user-centric and communicative goals adapted from text MT error span annotations.
Creation of a new multi-lingual STEL test dataset with human-labelled error spans and severity for end-to-end long-form speech translation.
First systematic evaluation of automatic STEL using both text-only MT QE models (XCOMET) and multimodal LLMs (Qwen) processing speech and transcripts.
Demonstration that speech-processing inputs are necessary for detecting speech-related errors while text-only models excel at translation-only errors, advocating for complementary multimodal approaches.

Datasets

STEL Dataset — 329 segments, 32 minutes of speech — drawn from ELIT Robothon and ACL6060 test sets, manually annotated for error spans and severity.

Baselines vs proposed

XCOMET (ASR transcripts) F1 unweighted = 38.7% vs human annotator F1 unweighted = 71.9% on En→Cs
Qwen (ASR+audio) weighted F1 = 23.9% vs XCOMET (ASR) weighted F1 = 15.1% average across languages
XCOMET (ASR) weighted F1 = 10.78% on speech-processing errors vs 9.95% on translation-only errors (Table 3)
Qwen (ASR+audio) weighted F1 = 12.98% on speech-processing vs 8.80% on translation-only errors
Using gold transcripts increases weighted F1 by up to 3.6 points (varies by language/direction)
Human annotator repeated annotations achieve 56–72% agreement, highlighting task difficulty

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.06047.

Fig 1

Fig 1 (page 3).

Fig 1

Fig 1: Screenshot of the annotation platform with instructions for the annotators. It is a version of Pearmut

Limitations

Dataset is small (only 32 minutes speech, 329 segments) and limited to four European language directions; may not capture variability in domains, accents, emotions.
Only one annotator per language direction, which restricts evaluation of inter-annotator variability and ground-truth reliability.
No evaluation on out-of-domain or noisy speech, or more challenging speech phenomena like dialects, emotions, or spontaneous speech.
Limited investigation of alternative ST and STEL model architectures or hyperparameter tuning.
Automatic systems rely heavily on quality of ASR transcripts; use of gold transcripts only marginally improves results indicating system and task limits.
Severity taxonomy is fixed in XCOMET and not easily customizable, may limit generalizability to other use cases.

Open questions / follow-ons

How to scale STEL annotations with multiple annotators and across more diverse languages, domains, and speech styles?
Can multimodal large language models be further adapted or finetuned specifically for STEL to improve error severity classification beyond current half-human precision?
What is the impact of segmentation errors on STEL accuracy and how can systems be made robust to imperfect segmentation in long-form speech?
How to integrate STEL error labelling into interactive systems for confidence estimation, user alerting, or real-time post-editing in practical speech translation applications?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners working at the intersection of speech and text, this work illustrates the complexity of automatically evaluating speech translation quality, a capability potentially relevant for assessing speech-based user inputs or voice-driven interactions requiring verification or understanding. The finding that combining speech and text modalities improves error detection over text-only signals highlights that systems defending against automated abuse should consider multimodal inputs for robustness. Furthermore, the STEL annotation protocol focusing on human communication goals underscores the importance of context-aware error detection beyond surface-level text correctness, which could inspire more sophisticated signal quality and intent estimation in voice-driven bot detection. However, current automatic methods reach only about half the accuracy of humans, signaling that real-world systems should be cautious in fully automating speech translation error detection without human-in-the-loop verification or fallback mechanisms.

Cite

bibtex

@article{arxiv2606_06047,
  title={ Automatic Labelling of Speech Translation Errors },
  author={ Dominik Macháček and Maike Züfle and Ondrej Klejch },
  journal={arXiv preprint arXiv:2606.06047},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.06047}
}

Automatic Labelling of Speech Translation Errors ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​