VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation
Source: arXiv:2605.24675 · Published 2026-05-23 · By Bo Li, Ronghao Chen, Ningyuan Deng, Huacan Wang, Shaolin Zhu, Lijie Wen
TL;DR
This paper addresses multilingual Web image translation, where text embedded in images on the web (e.g., social media posts or e-commerce product images) must be accurately recognized and translated. Standard Large Vision-Language Models (LVLMs) struggle because their visual encoders prioritize coarse semantics and fail to capture the fine-grained visual details needed to disambiguate diverse character morphologies and layouts in web content. To close this "visual representation gap," the authors propose VaaWIT, an end-to-end adaptation framework that injects detailed visual features into frozen large language models (LLMs) with minimal computational overhead.
VaaWIT introduces two key components: a Dual-Stream Attention Module (DSAM) that fuses semantic and fine-grained visual details via bidirectional cross-attention, and a Visual-Aware Adapter (VAA) that dynamically modulates frozen LLM layers with the fused visual context through a gated adapter. This combination enables deeper synergy between image-level semantics and spatially precise character morphology for robust multilingual translation, including low-resource languages. Extensive experiments on eight translation tasks across three public benchmarks demonstrate that VaaWIT significantly outperforms open-source SOTA cascaded and end-to-end LVLM baselines and achieves comparable or better results than leading proprietary models such as GPT-4.1 and Gemini2.5 Pro. The approach attains these gains with only around 50 million trainable parameters and reasonable training time (~18 hours), proving both effective and efficient for real-world deployment.
Key findings
- VaaWIT surpasses cascaded OCR+translation pipelines by more than 50 BLEU points on ZH-EN, demonstrating the clear advantage of integrated end-to-end modeling.
- On the ZH-EN task, VaaWIT achieves 65.9 BLEU and 94.8 COMET, outperforming the previous best image translation model Translatotron-V by +13.3 BLEU and +11.7 COMET.
- VaaWIT trained on Qwen3 (8B LLM) outperforms fully fine-tuned Qwen3-VL by 7.6 BLEU and 2.9 COMET on IT-EN while using only 50M trainable parameters versus 8B for full fine-tuning.
- Removing the Dual-Stream Attention Module (DSAM) causes a 2.2 BLEU and 5.4 COMET drop on ZH-EN; removing the Visual-Aware Adapter (VAA) causes a 0.9 BLEU and 5.0 COMET drop, showing that both components contribute significantly and synergistically.
- VaaWIT performs consistently well across eight language pairs including high-resource (EN, IT) and low-resource (ZH, JA, KO, HI, TH), showing good multilingual generalization.
- VaaWIT matches or exceeds the proprietary commercial baselines GPT4.1 and Gemini2.5 Pro on many tasks, despite a much smaller trainable parameter budget.
- Different gating strategies for VAA reveal global gating (single gate vector) best balances performance and computational efficiency, achieving results within 0.4 BLEU of more expensive schemes but with ~50% less latency.
- Bidirectional cross-attention fusion in DSAM outperforms simpler fusion methods (concatenation, element-wise sum, one-way attention) by 2-3 BLEU points, highlighting the importance of joint semantic-detail interaction.
Threat model
The implicit threat model considers noisy, stylistically diverse Web images containing embedded text where the main challenge is to accurately recognize and translate the text without error propagation. The adversary is effectively the visual complexity and semantic ambiguity of Web images, including varied fonts, backgrounds, and layouts. The model assumes access only to raw pixel inputs and does not consider active adversaries attempting to evade recognition or inject malicious content.
Methodology — deep read
Threat Model & Assumptions: The work targets multilingual Web image translation. No adversary is explicitly defined; the challenge is to robustly recognize and translate embedded text from noisy, stylistically diverse real-world web images across many languages. The model assumes access to raw images containing text and aims to produce correct translations without the error propagation common in cascaded OCR+NMT systems.
Data: Datasets include three public multilingual Web image datasets: MIT-10M (multiple language pairs including EN-IT, IT-EN, EN-JA, JA-EN), ECOIT (Chinese e-commerce ZH-EN), and OPUS-MIT-5M (synthetic social-media style HI-EN, KO-EN, TH-EN). Together these cover 8 translation tasks spanning high-resource and low-resource languages. Data consists of raw images with embedded text and paired target translation sequences. Standard splits and pre-processing are used.
Architecture / Algorithm: VaaWIT consists of three integrated components:
- Dual-Stream Visual Encoding: Parallel usage of a multilingual semantic encoder (e.g., SigLIP) capturing global semantic features F_sem, and a visual detail encoder (e.g., DINOv2) capturing fine-grained morphological details F_det, both producing patch-level feature sequences.
- Dual-Stream Attention Module (DSAM): Projects F_sem and F_det into a shared space, then performs bidirectional cross-attention (Semantic-Guided Detail Refinement and Detail-Informed Semantic Refinement) to produce refined features Ĥ_s and Ĥ_d, which are concatenated and fused by an MLP to yield a unified fused feature H_fused.
- Visual-Aware Adapter (VAA): A lightweight adapter module injected inside frozen LLM transformer layers that combines the fused global visual context vector h_g (via average pooling over H_fused) with the Feed-Forward Network output of each layer via a gating mechanism. The gating network predicts a soft gate vector that dynamically modulates adaptation impact per dimension. The adapter uses a bottleneck design with down-projection and up-projection weights, and is trained with residual connections to preserve underlying LLM knowledge.
Training Regime: Training proceeds in two stages with frozen visual encoders and LLM backbone:
- Stage 1 (Visual-Language Alignment): The model is trained as an image captioning task to make H_fused aligned with the LLM semantic space, minimizing negative log-likelihood of ground-truth text tokens.
- Stage 2 (Multi-Task Joint Learning): The model jointly optimizes three tasks: Image-Text Matching (binary classification of matched pairs for semantic consistency), Text Translation Learning (text-to-text machine translation to retain general MT ability), and Image Translation Learning (multimodal translation conditioned on H_fused and source text). The final loss is a weighted sum prioritizing end-to-end translation. Optimization hyperparameters and hardware details (epochs, batch sizes, random seeds) are detailed in Appendix (unavailable here). Training of DSAM and VAA requires about one epoch per stage (~18 hours total).
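
The Stage 2 objective described above can be illustrated with a small sketch. It assumes the three task losses are an ITM binary cross-entropy and two token-level negative log-likelihoods (TTL and ITL), combined as a weighted sum. The weight values and the function names are hypothetical: the summary only states that the sum prioritizes end-to-end image translation, not the actual coefficients.

```python
import math

def token_nll(token_probs):
    """Mean negative log-likelihood of the ground-truth tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def binary_cross_entropy(p_match, label):
    """ITM loss for one image-text pair; label is 1 (matched) or 0 (mismatched)."""
    return -(label * math.log(p_match) + (1 - label) * math.log(1.0 - p_match))

# Hypothetical weights (assumption): the largest weight goes to image translation.
WEIGHTS = {"itm": 0.2, "ttl": 0.3, "itl": 0.5}

def stage2_loss(itm, ttl, itl, weights=WEIGHTS):
    """Weighted sum of the three Stage 2 task losses."""
    return weights["itm"] * itm + weights["ttl"] * ttl + weights["itl"] * itl

# Toy probabilities the model assigns to the correct label / tokens.
itm = binary_cross_entropy(p_match=0.9, label=1)
ttl = token_nll([0.8, 0.7, 0.9])
itl = token_nll([0.6, 0.5, 0.7])
total = stage2_loss(itm, ttl, itl)
print(f"ITM={itm:.3f} TTL={ttl:.3f} ITL={itl:.3f} total={total:.3f}")
```

The same `token_nll` form would also serve as the Stage 1 captioning loss, since that stage minimizes the negative log-likelihood of the ground-truth text tokens.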
Evaluation Protocol: Quantitative evaluation is conducted on 8 multilingual Web image translation tasks covering diverse language pairs. Metrics used include BLEU (SacreBLEU) for surface n-gram similarity and COMET, a neural metric measuring semantic similarity. Baselines include cascaded OCR+Google or Microsoft Translate APIs, zero-shot LVLMs (Qwen3-VL, LLaMA3.2, LLaVA-OV), and proprietary commercial LVLMs (GPT4.1, Gemini2.5 Pro), as well as SOTA end-to-end specialized image translation models (ItNet, PEIT, Translatotron-V, etc.). Ablations disable DSAM and/or VAA to measure individual component impact. Additional analysis studies gating strategies and feature fusion approaches. Experiments also compare adaptation/tuning methods such as Chain-of-Thought prompting, LoRA, and full fine-tuning.
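
To make the BLEU metric concrete, here is a simplified corpus-level BLEU in pure Python. It is a sketch under stated assumptions, not the SacreBLEU implementation the paper uses: it tokenizes on whitespace, takes a single reference per hypothesis, and applies no smoothing, so its scores will not match SacreBLEU exactly. The function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100): clipped n-gram precision + brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any empty precision zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)

reference = ["the cat sat on the mat"]
exact = corpus_bleu(["the cat sat on the mat"], reference)
partial = corpus_bleu(["the cat sat on a mat"], reference)
disjoint = corpus_bleu(["completely different words here now"], reference)
print(exact, round(partial, 1), disjoint)
```

COMET, by contrast, is a learned neural metric and cannot be reduced to a few lines like this.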
Reproducibility: Code and trained weights are not explicitly stated as released. The datasets used are public benchmarks. The model relies on large pretrained components (SigLIP, DINOv2, Qwen3 or LLaMA3 LLMs) that may not be fully open-source, so full reproduction may require access to these resources. Training details and hyperparameters are provided in the paper's appendix.
Concrete Example: For an input Web image X_v, the dual-stream encoders extract semantic and detail features F_sem and F_det using the SigLIP and DINOv2 backbones. DSAM projects these features and cross-attends them bidirectionally to produce H_fused. The global visual context h_g is average-pooled from H_fused and fed into the VAA modules inside each LLM transformer layer, which gate the adaptation onto the FFN output. The LLM generates translated tokens autoregressively conditioned on both visual and text features, optimized jointly on the ITM, TTL, and ITL losses, enabling simultaneous OCR and translation with improved accuracy over cascaded or zero-shot LVLMs.
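
The fusion step in this walkthrough can be sketched in NumPy. This is a minimal illustration, not the authors' implementation. Assumptions: single-head attention without learned query/key/value projections, residual connections around each attention step, both streams having the same number of patches, a one-layer `tanh` map standing in for the fusion MLP, and random arrays standing in for SigLIP and DINOv2 features. All dimensions and weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head scaled dot-product attention: queries attend to context."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)      # (N_q, N_c)
    return softmax(scores, axis=-1) @ context      # (N_q, d)

def dsam(f_sem, f_det, w_sem, w_det, w_fuse):
    """Bidirectional cross-attention fusion of semantic and detail features."""
    h_s = f_sem @ w_sem                            # project to shared space
    h_d = f_det @ w_det
    h_d_ref = h_d + cross_attention(h_d, h_s)      # semantic-guided detail refinement
    h_s_ref = h_s + cross_attention(h_s, h_d)      # detail-informed semantic refinement
    concat = np.concatenate([h_s_ref, h_d_ref], axis=-1)  # (N, 2d)
    return np.tanh(concat @ w_fuse)                # one-layer stand-in for the MLP

rng = np.random.default_rng(0)
n_patches, d_sem, d_det, d_shared = 16, 32, 24, 8   # hypothetical sizes
f_sem = rng.normal(size=(n_patches, d_sem))         # stand-in for F_sem
f_det = rng.normal(size=(n_patches, d_det))         # stand-in for F_det
w_sem = rng.normal(size=(d_sem, d_shared)) * 0.1
w_det = rng.normal(size=(d_det, d_shared)) * 0.1
w_fuse = rng.normal(size=(2 * d_shared, d_shared)) * 0.1

h_fused = dsam(f_sem, f_det, w_sem, w_det, w_fuse)  # H_fused, patch-level
h_g = h_fused.mean(axis=0)                          # global context via average pooling
print(h_fused.shape, h_g.shape)
```

Here each stream queries the other, so detail patches are reweighted by semantic context and vice versa before concatenation. Which stream acts as query in each named refinement step is an assumption of this sketch.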
Technical innovations
- Dual-Stream Attention Module (DSAM) enabling bidirectional cross-attention fusion between multilingual semantic features and fine-grained visual details to produce a unified, noise-robust visual representation.
- Visual-Aware Adapter (VAA), a lightweight, parameter-efficient gating adapter dynamically modulating frozen LLM transformer layers with fused visual cues for visual-context grounded linguistic generation.
- Two-stage training combining visual-language alignment (image captioning) with multi-task joint learning (image-text matching, text translation, and image translation) to effectively align visual and linguistic modalities in a frozen LVLM setup.
- Demonstration that parameter-efficient tuning of only DSAM and VAA (~50M parameters) on large frozen LLM backbones can surpass full fine-tuning and zero-shot baselines, achieving competitive or superior performance compared to proprietary commercial systems.
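
The gated adapter idea listed above can be sketched as follows. This is a hedged illustration of a bottleneck adapter with a global gate, not the paper's exact formulation. Assumptions: the visual context h_g is projected and added to the FFN output before the down-projection, the bottleneck uses ReLU, the gate is a single sigmoid vector computed from h_g, and the up-projection starts at zero so the frozen LLM's output is initially unchanged. All names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def visual_aware_adapter(ffn_out, h_g, w_vis, w_down, w_up, w_gate):
    """Gated bottleneck adapter applied to one layer's FFN output.

    ffn_out: (T, d_model) token states from the frozen FFN
    h_g:     (d_vis,) pooled global visual context
    """
    vis = h_g @ w_vis                                        # (d_model,)
    bottleneck = np.maximum(0.0, (ffn_out + vis) @ w_down)   # (T, r), ReLU
    delta = bottleneck @ w_up                                # (T, d_model)
    gate = sigmoid(h_g @ w_gate)                             # (d_model,), global gate
    return ffn_out + gate * delta, gate                      # residual keeps LLM signal

rng = np.random.default_rng(0)
n_tokens, d_model, d_vis, rank = 5, 12, 8, 4                 # hypothetical sizes
ffn_out = rng.normal(size=(n_tokens, d_model))
h_g = rng.normal(size=(d_vis,))
w_vis = rng.normal(size=(d_vis, d_model)) * 0.1
w_down = rng.normal(size=(d_model, rank)) * 0.1
w_gate = rng.normal(size=(d_vis, d_model)) * 0.1

# Zero-initialized up-projection (assumption): the adapter starts as an identity.
out_init, gate = visual_aware_adapter(
    ffn_out, h_g, w_vis, w_down, np.zeros((rank, d_model)), w_gate)

# After training, a nonzero up-projection lets the gate scale the visual update.
w_up = rng.normal(size=(rank, d_model)) * 0.1
out_adapted, _ = visual_aware_adapter(ffn_out, h_g, w_vis, w_down, w_up, w_gate)
print(np.allclose(out_init, ffn_out), gate.shape)
```

Because the gate depends only on h_g, it is computed once per image rather than per token, which is one plausible reading of why global gating is cheaper than finer-grained schemes.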
Datasets
- MIT-10M — Large-scale multilingual web images with text — Public dataset
- ECOIT — Chinese e-commerce product images (ZH-EN) — Public dataset
- OPUS-MIT-5M — Synthetic social media style multilingual images (HI-EN, KO-EN, TH-EN) — Public dataset
Baselines vs proposed
- EasyOCR + Google Translate API: ZH-EN BLEU = 9.9 vs VaaWIT 65.9
- PP-OCR + Microsoft Translator API: ZH-EN BLEU = 9.6 vs VaaWIT 65.9
- Qwen3-VL (8B) zero-shot: ZH-EN BLEU = 32.3 vs VaaWIT 65.9
- Qwen3-VL (32B) zero-shot: ZH-EN BLEU = 39.2 vs VaaWIT 65.9
- LLaMA3.2 (90B) zero-shot: ZH-EN BLEU = 7.9 vs VaaWIT 65.9
- GPT4.1 commercial model: ZH-EN BLEU = 46.1 vs VaaWIT 65.9
- Gemini2.5 Pro commercial model: ZH-EN BLEU = 40.1 vs VaaWIT 65.9
- Full Fine-Tuning Qwen3-VL: IT-EN BLEU = 58.4 vs VaaWIT 66.0
- ItNet: ZH-EN BLEU = 39.3 vs VaaWIT 65.9
- Translatotron-V: ZH-EN BLEU = 52.6 vs VaaWIT 65.9
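
The gaps implied by the list above can be checked with simple arithmetic. The BLEU values are copied from this section; the dictionary layout and variable names are only illustrative, and the trainable-parameter ratio uses the rounded figures quoted in the summary (about 50M versus 8B).

```python
VAAWIT_ZH_EN = 65.9

# ZH-EN BLEU scores of the baselines listed above.
zh_en_baselines = {
    "EasyOCR + Google Translate API": 9.9,
    "PP-OCR + Microsoft Translator API": 9.6,
    "Qwen3-VL (8B) zero-shot": 32.3,
    "Qwen3-VL (32B) zero-shot": 39.2,
    "LLaMA3.2 (90B) zero-shot": 7.9,
    "GPT4.1": 46.1,
    "Gemini2.5 Pro": 40.1,
    "ItNet": 39.3,
    "Translatotron-V": 52.6,
}

# BLEU gain of VaaWIT over each baseline, rounded to one decimal place.
gaps = {name: round(VAAWIT_ZH_EN - bleu, 1) for name, bleu in zh_en_baselines.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: +{gap} BLEU")

# IT-EN comparison against full fine-tuning of Qwen3-VL.
it_en_gap = round(66.0 - 58.4, 1)

# Share of parameters trained, using the rounded figures from the summary.
trainable_fraction = 50e6 / 8e9
print(f"IT-EN gap: +{it_en_gap} BLEU, trainable share: {trainable_fraction:.3%}")
```

The smallest ZH-EN margin is over Translatotron-V and the largest is over the zero-shot LLaMA3.2 (90B) baseline; both cascaded pipelines trail by more than 50 BLEU.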
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.24675.

Fig 1: Overview of VaaWIT. It addresses the complexity of Web image translation by decomposing the visual-linguistic

Fig 2 (page 2).

Fig 3: Case Study of VaaWIT Framework.

Figs 4 to 8 (page 9).
Limitations
- The approach relies on large frozen pretrained visual backbones (SigLIP, DINOv2) and LLMs (Qwen3, LLaMA3), which may limit accessibility or generalization to other architectures.
- Hyperparameter and hardware details are reported in the paper's appendix, but no public code or checkpoints are explicitly stated as released, which limits full reproducibility.
- No explicit adversarial robustness or out-of-domain distribution shift analysis is reported.
- Although experiments cover multiple languages, extremely low-resource languages or complex scripts requiring specific OCR models remain untested.
- Evaluation focuses on automated metrics (BLEU, COMET); no human evaluation or qualitative error analyses are reported in detail.
- Dependence on average pooling of fused features into a global context vector may lose some spatial detail important for very complex layouts.
Open questions / follow-ons
- How well does VaaWIT generalize to unseen languages, scripts, or highly artistic fonts not present in training data?
- What is the robustness of VaaWIT to adversarially perturbed images or occlusions designed to fool OCR and translation?
- Can the DSAM and VAA mechanisms be extended or improved to incorporate spatial layout or multi-line text structure explicitly?
- How does human evaluation compare to automatic metrics in assessing real-world translation quality, error types, and usability?
Why it matters for bot defense
This work is relevant for CAPTCHA and bot-defense practitioners who work on multimodal challenges, particularly those building systems that must read multilingual text embedded in web images and translate it accurately. VaaWIT demonstrates that combining fine-grained visual detail extraction with large language models via efficient adapters can significantly improve recognition and translation accuracy compared to cascaded systems or zero-shot LVLMs. For CAPTCHA systems relying on text-in-image challenges, adapting LLMs with dual-stream attention and gated visual adapters could improve robustness against automated solvers by better understanding complex visual-text interactions and reducing hallucinations. Furthermore, the parameter-efficient tuning approach enables deployment on large backbones without prohibitive computational cost, facilitating practical integration into live bot-defense pipelines that require quick inference and adaptation across languages. Finally, the insight that naive concatenation of visual and semantic features leads to hallucinations underscores the importance of deep multimodal fusion designs to prevent solver errors arising from ambiguous visual cues.
Cite
@article{arxiv2605_24675,
title={VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation},
author={Bo Li and Ronghao Chen and Ningyuan Deng and Huacan Wang and Shaolin Zhu and Lijie Wen},
journal={arXiv preprint arXiv:2605.24675},
year={2026},
url={https://arxiv.org/abs/2605.24675}
}