Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Source: arXiv:2605.30274 · Published 2026-05-28 · By Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang, Min Zhang et al.

TL;DR

This paper addresses the critical challenge of document-level machine translation (DocMT) for ultra-long texts, where limited context windows in large language models (LLMs) constrain global coherence and redundant contextual information degrades translation quality. The authors propose LOONG, a human-like translation agent that mimics expert translators' cognitive workflows by maintaining a multi-granularity 3E memory module comprising Essence (summaries), Exemplars (sentence pairs), and Entities (terminology and knowledge entries). Unlike previous methods that attend passively to entire context histories, LOONG adaptively selects the most relevant context using a deep reasoning observe-and-act process optimized via reinforcement learning on preference data from sampled translation trajectories. Through this selective retrieval and reasoning, LOONG achieves markedly superior translation quality across English⇔Chinese, German, and French language pairs, with average gains up to 13.0 points on sentence- and document-level metrics. The approach exhibits strong generalization to out-of-domain data and unseen language pairs, as well as robustness to noisy, distracting context and ultra-long document translation scenarios where prior methods degrade or fail.

Key findings

LOONG improves average translation quality by up to 13.0 points across three evaluation metrics (sCOMET, dCOMET, LLM-as-Judge) compared to strong baselines including DELTA and Doc2Doc.
On News Commentary En ⇒ Xx with a Qwen3-8B backbone, LOONG achieves an sCOMET of 87.3 versus 86.9 for the naive sentence-level baseline and 86.7 for DELTA, suggesting that it filters noise effectively when using context.
LOONG attains a 7.1-point improvement on the LLM-as-Judge metric over DELTA (82.3 vs 75.2) in the Xx ⇒ En direction, indicating superior discourse cohesion and coherence.
Ablation removing the Essence memory component causes the largest performance drop, indicating global semantic summaries are critical for maintaining document-level cohesion.
LOONG maintains stable translation quality even with injected pseudo-context noise of 30-50 sentences, whereas baselines degrade significantly (Figure 4).
Cross-lingual evaluation on six unseen languages shows consistent performance improvements, indicating the learned context strategy generalizes across language pairs.
On ultra-long documents (first 12 chapters of Journey to the West, 51,854 words), LOONG completes the translation without failure, unlike Doc2Doc, which fails once the accumulated context exceeds the model's context window.
LOONG consistently outperforms baselines across four backbone LLMs (Qwen2.5/3, Llama3.1) of various sizes (7B to 14B parameters), demonstrating model-agnostic robustness.

Threat model

This work does not define an adversarial threat model. Its focus is agentic context selection that improves translation quality and robustness under noisy or distracting context, with the implicit assumption that no attacker can manipulate model parameters or intervene in the inference pipeline.

Methodology — deep read

Threat Model & Assumptions: There is no explicit adversary; the focus is on improving translation quality and robustness to noisy or irrelevant context rather than on resisting adversarial attacks. The setting assumes long documents that exceed standard LLM context windows, where unfiltered context leads to quality drops. The agentic translation framework also assumes oracle access to reference translations for preference data construction.
Data: Training data consists of documents >50 lines sampled from News Commentary V18.1 (approx. 500 documents per language pair for En⇔Zh, De, Fr). Evaluation covers in-domain News Commentary and WMT24++ benchmarks, and out-of-domain datasets GuofengV1 (literary webnovels), IWSLT2017 TED Talks (speech domain), and ultra-long classical Chinese novel Journey to the West (51,854 words).
Architecture/Algorithm: The LOONG agent processes documents segmented into 5-sentence chunks. It maintains a 3E memory module with:
- Essence: Semantic summaries of previous segments, retrieved by embedding similarity.
- Exemplars: Bilingual sentence pairs as style exemplars, retrieved similarly.
- Entities: Extracted bilingual named entities with type and attribute info.

The agent performs a three-step observe-and-act pipeline for each segment translation, sequentially observing Essence, Exemplar, and Entity candidates, producing reasoning thoughts and selecting relevant items. This reduces a combinatorial search space to a tractable additive one.
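The retrieval by embedding similarity described for the 3E memory can be illustrated with a minimal sketch. The bag-of-words `embed` function is a stand-in assumption, since the summary does not name the encoder LOONG uses, and `retrieve_top_k` is a hypothetical helper name. The default `k=4` follows the retrieval size mentioned under Limitations.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: str, memory: list[str], k: int = 4) -> list[str]:
    """Return the k memory items most similar to the query segment."""
    q = embed(query)
    ranked = sorted(memory, key=lambda item: cosine(q, embed(item)), reverse=True)
    return ranked[:k]


# Hypothetical Essence memory holding summaries of earlier segments.
essence_memory = [
    "The central bank raised interest rates to fight inflation.",
    "Farmers protested against new water regulations.",
    "Inflation slowed after the bank tightened policy.",
]
candidates = retrieve_top_k(
    "The bank kept interest rates high as inflation eased.", essence_memory, k=2
)
```

In this toy run the two bank-related summaries are returned and the unrelated one is dropped. In LOONG these retrieved items are only candidates; the observe-and-act step then decides which of them to keep.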

The agent generates multiple sampled reasoning actions (M=7) and translations per action (N=5) to build a preference dataset containing triples (observation, preferred action, dispreferred action) or (input, best translation, worst translation).
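The action-level half of this construction can be sketched as follows, assuming each sampled action already has the utility scores of its N translations. The function name `build_preference_pair`, the tie handling, and the toy scores are illustrative assumptions, not details from the paper.

```python
from statistics import mean


def build_preference_pair(observation, action_scores):
    """Pick the preferred and dispreferred action for one observation.

    action_scores maps each sampled action to the utility scores of the
    translations sampled under it (M actions, N scores each in the paper).
    """
    if len(action_scores) < 2:
        raise ValueError("need at least two sampled actions to form a pair")
    avg = {action: mean(scores) for action, scores in action_scores.items()}
    preferred = max(avg, key=avg.get)
    dispreferred = min(avg, key=avg.get)
    if avg[preferred] == avg[dispreferred]:
        return None  # assumption: skip observations with no preference signal
    return observation, preferred, dispreferred


# Toy example with 3 actions and 2 translations each (made-up scores).
pair = build_preference_pair(
    "segment 7 + Essence candidates",
    {
        "keep summaries 1 and 3": [0.86, 0.88],
        "keep all summaries": [0.84, 0.85],
        "keep no summaries": [0.80, 0.83],
    },
)
```

The translation-level triples (input, best translation, worst translation) could be built the same way by ranking individual translations instead of action averages.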

Training Regime: The agent first undergoes cold-start supervised fine-tuning (SFT) on preferred instances for 1 epoch with batch size 64 and learning rate 1e-5, on 4 NVIDIA A800 GPUs with DeepSpeed ZeRO optimization. Direct preference optimization (DPO) then further fine-tunes the policy on the full preference dataset for 1 epoch with batch size 32 and learning rate 5e-6, applying LoRA adapters (rank 8) to reduce GPU memory consumption. The maximum input length is set to 2560 tokens.
Evaluation Protocol: The evaluation metrics are sCOMET (sentence-level), dCOMET (document-level), and a GPT-4-based LLM-as-Judge that assesses general quality, cohesion, coherence, style, and terminology consistency (scored 1-100). Baselines include Sentence (isolated translation), Segment (chunk-based), Doc2Doc (full history), and DELTA (memory retrieval without filtering). Evaluations cover multiple domains and language pairs as well as ultra-long documents. Ablations remove each memory component and each training stage in turn. Robustness is tested by augmenting the source with pseudo-context noise.
Reproducibility: Code is publicly released at the authors' GitHub repository. The datasets are common, publicly available MT benchmarks, some of them filtered by document length. Model backbone weights (Qwen, Llama) are standard.
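For readers unfamiliar with DPO, the per-pair objective from the original DPO formulation can be written in a few lines. This is a generic sketch rather than the authors' training code: the inputs are assumed to be summed token log-probabilities of the preferred and dispreferred outputs under the policy and a frozen reference model, and `beta=0.1` is an assumed value that this summary does not report.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is log(2) when the policy equals the reference, and it shrinks as the
# policy raises the preferred output's likelihood relative to the dispreferred one.
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```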

Concrete Example End-to-End: For a segment s_τ, LOONG retrieves top-K candidate contexts from the Essence summaries, Exemplars, and Entities. It first observes the Essence candidates and generates reasoning to select the useful summaries, then observes the Exemplars to decide which sentence pairs help stylistic consistency, and finally evaluates the Entity candidates for accurate terminology. During data collection, multiple candidate actions are sampled at each step, and each action guides N sampled translations whose quality is scored by sCOMET. The action with the highest average utility is labeled preferred and the one with the lowest is labeled dispreferred, yielding data for policy optimization. After training, inference uses one sampled action per step and produces a final translation that a recursive splitting algorithm keeps strictly sentence-aligned with the source segment. The memory is then updated, and the process repeats for the next segment until the document is fully translated.
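The inference loop in this walkthrough can be sketched as a skeleton. All four callables (`retrieve`, `select`, `translate`, `update`) are hypothetical placeholders for model-backed components, and the toy versions below only exercise the control flow. The paper's recursive splitting algorithm is not reproduced here; the sketch simply rejects a misaligned output.

```python
def chunk(sentences, size=5):
    """Split a document into fixed-size segments (5 sentences in the paper)."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def translate_document(sentences, retrieve, select, translate, update, size=5):
    """Segment-by-segment loop: retrieve, select per memory type, translate, update."""
    memory = {"essence": [], "exemplars": [], "entities": []}
    output = []
    for segment in chunk(sentences, size):
        # Retrieve candidates from each memory type, then observe them in order.
        candidates = {kind: retrieve(segment, items) for kind, items in memory.items()}
        context = {}
        for kind in ("essence", "exemplars", "entities"):
            context[kind] = select(kind, segment, candidates[kind])
        target = translate(segment, context)
        if len(target) != len(segment):
            raise ValueError("translation must stay sentence-aligned with the source")
        output.extend(target)
        update(memory, segment, target)
    return output


# Toy stand-ins (assumptions) so the skeleton runs end to end.
def retrieve(segment, items):
    return items[-4:]

def select(kind, segment, candidates):
    return candidates

def translate(segment, context):
    return [s.upper() for s in segment]

def update(memory, segment, target):
    memory["exemplars"].extend(zip(segment, target))

source = [f"sentence {i}" for i in range(12)]
result = translate_document(source, retrieve, select, translate, update)
```

Because each segment only sees the selected memory items rather than the full history, the prompt size stays bounded regardless of document length, which is the property the ultra-long experiment relies on.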

Technical innovations

Introduction of a 3E multi-granularity memory module (Essence summaries, Exemplars, Entities) that stores distinct contextual facets to enhance document-level translation.
An observe-and-act adaptive sequential context selection procedure that decomposes context reasoning across memory types, enabling deep reasoning to filter relevant historical context.
Use of reinforcement learning via direct preference optimization (DPO) on sampled observe-and-act trajectories to optimize the adaptive context policy.
A recursive alignment-enforced translation algorithm that ensures strict sentence-level alignment between source and target, enabling accurate evaluation and memory upkeep.
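One way to read the claim that sequential observation turns a combinatorial search space into an additive one is to count candidate selections. The sketch below assumes each step may keep any subset of its K retrieved candidates, and it uses K = 4 for all three memory types purely for illustration; neither assumption is confirmed by this summary.

```python
def joint_search_space(sizes):
    """Number of joint selections if all memory types are decided at once."""
    total = 1
    for k in sizes:
        total *= 2 ** k  # every subset of k candidates, multiplied across types
    return total


def sequential_search_space(sizes):
    """Number of selections considered when each memory type is decided in turn."""
    return sum(2 ** k for k in sizes)


sizes = (4, 4, 4)  # assumed K for Essence, Exemplars, Entities
joint = joint_search_space(sizes)            # 4096
sequential = sequential_search_space(sizes)  # 48
```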

Datasets

News Commentary V18.1 — ~500 documents per language pair (En⇔Zh, De, Fr), filtered by length >50 lines — Public
WMT24++ — Unknown size, mixed news and other domains — Public via Huggingface
GuofengV1 — 1,445 documents, literary web novels — Public
IWSLT2017 — 1,939 TED talks documents, speech domain — Public
Journey to the West — 51,854 words (ultra-long classical Chinese novel) — Public

Baselines vs proposed

Sentence baseline: sCOMET = 86.9 vs LOONG = 87.3 on En ⇒ Xx News Commentary (Qwen3-8B)
DELTA baseline: LLM-as-Judge = 75.2 vs LOONG = 82.3 (Xx ⇒ En)
Doc2Doc baseline: fails on ultra-long documents after context limit exceeded vs LOONG completes smoothly (Figure 1)
Ablation w/o Essence: average metrics drop from 80.2 to 79.0, largest drop among memory components
Ablation w/o Translation data: average drops to 63.6 vs 80.2 full model, showing necessity of combined strategy data

Limitations

Fixed segment length (5 sentences) may not align well with natural discourse boundaries, limiting coherence optimization.
Additional inference latency and compute arise from multi-step observe-and-act reasoning compared to single-pass translation.
Use of COMET as reward model for RL might imperfectly reflect human preferences on document-level translation quality.
Current experiments limited to En⇔Zh, De, Fr and some unseen language tests; more diverse or low-resource languages remain to be evaluated.
Memory retrieval sizes (Ks and Kx fixed to 4) are tunable hyperparameters that might not be optimal for all document lengths or domains.

Open questions / follow-ons

How to dynamically segment documents based on discourse or semantic boundaries rather than fixed sentence counts to further improve coherence?
Can efficiency bottlenecks from the multi-step reasoning and reinforcement learning optimization be reduced via distillation or speculative decoding?
Are there more accurate or human-aligned metrics than COMET for RL reward modeling in long document MT?
How does LOONG perform on truly low-resource languages or extremely heterogeneous domain documents?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, LOONG illustrates how intelligent adaptive context filtering combined with learned reasoning policies can robustly handle ultra-long, noisy input streams without degradation. This demonstrates the value of agentic, memory-augmented models that mimic human selective attention in complex language understanding tasks, which could inspire novel approaches for robust behavioral or language-based bot filtering where noisy or lengthy user inputs must be parsed carefully. LOONG's reinforcement learning strategy and multi-granular memory design may translate to improved methods for selecting relevant information in CAPTCHA challenge generation or context-aware bot detection pipelines, particularly when multi-turn user interactions or long histories are involved.

Cite

@article{arxiv2605_30274,
  title={Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection},
  author={Yutong Wang and Xuebo Liu and Derek F. Wong and Zhilin Li and Rongqing Jiang and Min Zhang and Shimin Tao and Daimeng Wei and Min Zhang},
  journal={arXiv preprint arXiv:2605.30274},
  year={2026},
  url={https://arxiv.org/abs/2605.30274}
}
