WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

Source: arXiv:2605.13846 · Published 2026-05-13 · By Ziheng Zhang, Yunzhong Hou, Naijing Liu, Liang Zheng

TL;DR

WARDEN addresses the critical challenge of transcribing and translating Wardaman, a critically endangered Australian indigenous language, with only 6 hours of annotated audio data available. Unlike typical end-to-end speech-to-text translation systems that rely on large datasets, WARDEN adopts a two-stage approach: first transcribing the audio into phonemic script, then translating that transcription into English. To overcome data scarcity, it uses a novel cross-lingual initialization of the ASR model tokens from Sundanese—identified as the phonologically closest language—and integrates expert-curated Wardaman-English dictionary entries into a large language model (LLM) translation stage. This two-step, knowledge-augmented design allows WARDEN to outperform larger open-source and proprietary models despite extremely limited training data. Empirically, WARDEN achieves the lowest word error rate (WER) for transcription and improves translation BLEU scores substantially compared to unified or zero-shot approaches.

Key findings

Using Sundanese token initialization for fine-tuning Whisper large-v3 reduces transcription WER from 0.64 to 0.52, a 0.12 absolute improvement.
WARDEN fine-tuned transcription model outperforms zero-shot Whisper, Speech2Text, and Wav2Vec2 baselines; best baseline WER is 0.64 vs WARDEN's 0.52.
Lexicon conditioning plus LoRA fine-tuned Qwen3-8B LLM raises translation BLEU-4 to 12.40, a +6.28 improvement over fine-tuning alone (6.12 BLEU) and +4.86 over best GPT-5 method (7.54 BLEU).
Translation using ground truth transcription and lexicon conditioning (oracle) achieves BLEU-4 of 16.42, showing upper bound for system.
Data augmentation via concatenated utterances and inclusion of ASR-predicted transcripts improves translation BLEU by up to 6.23 points.
Phoneme distance computed by Hamming distance on PHOIBLE inventories correlates with transcription WER; Sundanese shows shortest distance and best zero-shot and fine-tuned performance.
Lexicon matcher uses character error rate threshold of 0.2 and top-3 candidate lexicons retrieval for best translation accuracy.
Removing lexicon conditioning or fine-tuning decreases BLEU by 6.28 and 9.57 respectively; both missing causes a 10.43 BLEU drop.

Threat model

n/a — This paper focuses on low-resource language transcription and translation challenges, and does not address adversarial threat models or security concerns.

Methodology — deep read

The research tackles the low-resource challenge for endangered language transcription and translation with only 6 hours of annotated audio from Wardaman.

Threat Model and Assumptions: The adversary model is not applicable; this is a low-data setting problem rather than security. They assume only limited aligned speech, transcription, and translation data (6 hours). The LLM and ASR models are pretrained on large corpora but not including Wardaman. Linguistic expert knowledge in the form of lexicons is available.
Data: The dataset comes from longitudinal linguistic fieldwork with 98 recorded sessions, around 6 hours (~23,436 seconds) of handwritten ELAN-annotated audio/video recordings. The data include primary tiers of time-aligned Wardaman transcriptions and English translations. The Wardaman-English dictionary contains roughly 2,000 cleaned lexical entries from a FieldWorks database including part-of-speech and glosses covering about 30% of corpus vocabulary.
Architecture & Algorithm:

Transcription stage: Starts from Whisper-large-v3 model pretrained on many languages but excluding Wardaman. Initialization of the Wardaman language token is replaced with the Sundanese language token due to phoneme similarity (measured by Hamming distance on phoneme inventories from PHOIBLE). The model fully fine-tunes all parameters with a low learning rate.
Translation stage: Uses a large language model (Qwen3-8B and GPT-5) fine-tuned with Low-Rank Adaptation (LoRA) technique. Transcription outputs from stage one are matched against lexicon entries using character error rate (CER) < 0.2 and affix matching. Retrieved lexicon candidates (top 3) plus transcription are concatenated with system prompts instructing the LLM to perform translation conditioned on lexical cues.

Training Regime:

Transcription fine-tuned on 8 NVIDIA RTX 3090 GPUs using DeepSpeed zero-2 with learning rate 0.0001 and batch size 4. Full parameter tuning of Whisper.
Translation fine-tuned with LoRA on DeepSpeed zero-2 with learning rate 0.001, batch size 2. Training uses data augmentation: short utterances, concatenated longer utterances, and noisy ASR-predicted transcripts.

Evaluation Protocol:

Transcription quality evaluated by Word Error Rate (WER) on held-out test data with no speaker overlap.
Translation evaluated using BLEU-4 scores comparing system outputs to human translations.
Baselines: zero-shot Whisper, Wav2Vec2, Speech2Text, fine-tuned Whisper, and LLM baselines with and without lexicon conditioning and fine-tuning.
Ablations on lexicon use, token initialization, input augmentations, CER and top-k settings for lexicon matching.

Reproducibility:

Authors provide code and data.
The dataset is partially public via ELAR and FLEx dictionaries, though access may require registration.
Training details and hyperparameters are described sufficiently for replication.

Example: An input Wardaman audio segment is first processed by Whisper ASR fine-tuned from Sundanese token initialization to yield a phonemic transcript. The transcript words are matched against lexicon candidates via CER and affix patterns to retrieve definitions and parts of speech. These are formatted with system prompts and provided to a LoRA fine-tuned Qwen3-8B model that outputs an English translation. Augmentations such as concatenated utterances and noisy transcripts improve robustness. The two-stage approach helps cope with the extreme data scarcity.

Technical innovations

Cross-lingual ASR token initialization using the closest phonologically related proxy language (Sundanese) improves fine-tuning on extremely low-resource Wardaman data.
Two-stage separated modeling approach (transcription then translation) avoids data-hungry unified speech-to-text translation models under severe data constraints.
Lexicon-enhanced translation by retrieving relevant dictionary entries via character error rate and affix matching to augment LLM input, improving translation accuracy.
Use of LoRA fine-tuning on large language models to efficiently inject domain-specific lexical knowledge into translation with limited training data.

Datasets

Wardaman ELAN-annotated corpus — ~6 hours annotated speech segments from 98 recordings — sourced from long-term linguistic fieldwork archived at ELAR
Wardaman-English dictionary — ~2,000 entries — compiled and cleaned from FLEx (FieldWorks Language Explorer) software

Baselines vs proposed

Speech2Text (no fine-tuning): WER = 2.16 vs WARDEN Whisper fine-tuned with Sundanese init WER = 0.52
Wav2Vec2 (fine-tuned): WER = 0.81 vs WARDEN 0.52
Whisper (fine-tuned): 0.64 vs WARDEN (fine-tuned with Sundanese init) 0.52
Qwen3-8B fine-tuned no lexicon BLEU-4 = 6.12 vs WARDEN (lexicon+fine-tune) BLEU-4 = 12.40
GPT-5 no fine-tune lexicon BLEU-4 = 7.54 vs WARDEN (Qwen3-8B lexicon+fine-tune) 12.40
Whisper audio-to-text translation fine-tuned BLEU-4 = 1.42 (lowest)
Oracle (ground-truth transcript + lexicon + fine-tune): BLEU-4 = 16.42

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.13846.

Fig 1

Fig 1: Overview of the WARDEN system. For transcription,

Fig 2

Fig 2: Similarity visualization of six candidate languages to

Fig 3

Fig 3: LLM input organization for lexicon-augmented trans-

Fig 4

Fig 4: An example of lexicon matching. For each word in the Wardaman transcription result, the matcher retrieves the most relevant

Limitations

Extremely small dataset (~6 hours) limits generalization and robustness of transcription and translation.
No adversarial evaluation or robustness tests under noise, speaker variation, or domain shift beyond test set withholding.
Lexicon coverage is only about 30% of the corpus vocabulary, limiting potential lexicon-based translation improvements.
Translation evaluation solely based on BLEU; semantic adequacy or user-centric measures not reported.
LLM fine-tuning requires significant compute resources which may not be feasible for all language documentation projects.
The system relies on expert-curated dictionaries and linguistic knowledge which may not exist for other endangered languages.

Open questions / follow-ons

Can the two-stage approach generalize to other extremely low-resource endangered languages without close proxy languages like Sundanese?
How effective is the lexicon-augmented LLM translation when lexicon coverage is sparse or incomplete?
What domain adaptations or robustness improvements can be added for transcription and translation under varying recording conditions or speaker diversity?
Could self-supervised or unsupervised pretraining on unannotated Wardaman audio improve performance further?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners working in language understanding or voice input, WARDEN illustrates effective strategies for multimodal low-resource language processing under extreme data constraints. The separation of transcription and translation stages combined with targeted knowledge injection (via phonological similarity and lexicon matches) enables robust utility of large pretrained models in tasks where direct end-to-end fine-tuning fails due to data scarcity. This approach highlights how domain-specific auxiliary knowledge and cross-lingual transfer can substantially improve performance when labeled examples are rare — principles transferable to bot detection systems relying on NLP for low-resource languages or dialects. Additionally, the paper demonstrates practical use of Low-Rank Adaptation (LoRA) for efficient LLM fine-tuning with scarce data, relevant for adaptation of general models to niche security-relevant languages or regional speech patterns. Hence, the techniques may inspire more data-efficient voice CAPTCHA designs and improved bot detection in underrepresented languages through careful decomposition and expert knowledge infusion.

Cite

bibtex

@article{arxiv2605_13846,
  title={ WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data },
  author={ Ziheng Zhang and Yunzhong Hou and Naijing Liu and Liang Zheng },
  journal={arXiv preprint arXiv:2605.13846},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.13846}
}

WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​