ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Source: arXiv:2605.15198 · Published 2026-05-14 · By Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng

TL;DR

This paper addresses visual reasoning tasks that require generating and using intermediate visual states interleaved with text. Existing approaches fall into three paradigms: unified models that explicitly generate intermediate images at high computational and architectural cost; agentic models that generate code or tool calls for external modules, incurring context-switching latency; and latent visual reasoning that operates on continuous latent embeddings but generalizes poorly and is incompatible with standard autoregressive training. The authors propose ATLAS, a framework that represents discrete visual operations as special "functional tokens" added to the standard tokenizer vocabulary and emitted within the autoregressive sequence. Each functional token stands for an internalized visual reasoning step (e.g., drawing an auxiliary line or adding a text annotation), so the model neither generates intermediate images nor calls external executors. This design preserves compatibility with existing scalable SFT and RL training methods and keeps the reasoning trace interpretable.

Functional tokens are sparse among many text tokens, so their learning signal is weak during RL (the authors call this gradient dilution). To address it, the authors introduce Latent-Anchored GRPO (LA-GRPO), which adds a statically weighted auxiliary loss that focuses updates on functional tokens. The system is trained in two stages: supervised fine-tuning on the newly curated ATLAS-178K dataset of functional-token-annotated reasoning trajectories spanning 40+ tasks, followed by reinforcement learning on a composite reward that balances answer correctness, functional-token usage, and format adherence against penalties for verbosity and token spamming. On V*, BLINK, and WeMath, ATLAS improves over its base model and prior VLMs; on BLINK, for example, accuracy rises from 22.8% (Qwen2.5-VL) to 51.3% (ATLAS with LA-GRPO). LA-GRPO is reported to be more stable and more effective than standard GRPO. Qualitative examples show interpretable, stepwise visual reasoning without explicit intermediate images. The authors position ATLAS as a bridge between agentic and latent visual reasoning: compact, generalizable, and trainable with standard pipelines.

Key findings

ATLAS functional tokens occupy only 2.3% of generated tokens but yield substantial visual reasoning capacity.
On the BLINK benchmark, Qwen2.5-VL achieves 22.8% accuracy, while ATLAS with LA-GRPO reaches 51.3%.
ATLAS-SFT alone improves BLINK from 22.8% to 46.0%, showing the supervised dataset and token design yield strong gains.
LA-GRPO improves over standard GRPO, increasing multi-view reasoning from 43.6% to 53.4%.
Training uses a curated ATLAS-178K dataset covering 40+ tasks extracted from V-Interaction-400K and supplemented with V-Perception-40K for perceptual preservation.
Five discrete functional tokens (<|Manip|>, <|Shape|>, <|Line|>, <|Arrow|>, <|Text|>) map to common visual operations, simplifying visual reasoning tokenization.
LA-GRPO's auxiliary token-level objective stabilizes training by preventing gradient dilution on sparse functional tokens during RL.
ATLAS maintains compatibility with standard autoregressive transformer training without architectural changes, enabling efficient scaling.

Threat model

n/a - This work is primarily a model-design and training-methodology contribution for visual reasoning. It does not define an adversary or a security threat model.

Methodology — deep read

Threat Model & Assumptions: The paper concerns visual reasoning model design and assumes no hostile adversary. The difficulty it targets is modeling intermediate visual operations in multimodal reasoning with minimal overhead and maximal compatibility with standard training.
Data: The primary training dataset is ATLAS-178K. The authors parse the publicly available V-Interaction-400K preview subset, extract the visual operations (e.g., line drawing, text annotation) that map onto the functional token set, and retain 138K high-quality samples across 40+ tasks. These are augmented with V-Perception-40K to preserve low-level visual ability, which accounts for the 178K total. For RL fine-tuning, We-Math 2.0, MMK12, and ThinkLite supply diverse visual reasoning problems. Preprocessing converts code actions into token trajectories with inserted functional tokens, which are then polished with Gemini-2.5-Pro for naturalness.
Architecture / Algorithm: ATLAS extends a standard autoregressive vision-language model (Qwen2.5-VL-7B) by augmenting its vocabulary with 5 discrete functional tokens representing internalized visual operations (<|Manip|>, <|Shape|>, <|Line|>, <|Arrow|>, <|Text|>). These tokens are generated as part of the normal next-token prediction sequence interleaved with textual tokens. No changes are made to the transformer architecture, vision encoder or decoder pipelines. The model learns to invoke these tokens at reasoning steps needing specific visual manipulation or annotation. This contrasts with agentic methods that call external code/tool executors or latent methods that use continuous embeddings. ATLAS thus bridges both in a compact vocabulary-level representation.
Training Regime: Training has two stages. Stage 1 is supervised fine-tuning (SFT) on ATLAS-178K with standard cross-entropy loss over sequences containing functional tokens; the vision encoder is frozen, and the visual projector and language model are trained. Stage 2 is reinforcement learning with Group Relative Policy Optimization (GRPO) on a composite reward that combines answer accuracy, functional-token usage, and formatting compliance with penalties for verbosity and token spamming. Because functional tokens are sparse, their gradient signal is diluted under a sequence-level objective; Latent-Anchored GRPO (LA-GRPO) therefore adds a statically weighted auxiliary token-level loss computed only at functional-token positions. This stabilizes functional-token training without altering the sequence-level GRPO objective. RL updates the aligner and language model for 1 epoch.
Evaluation Protocol: ATLAS is evaluated on visual reasoning benchmarks including V*, BLINK, and WeMath, which cover a range of question types and difficulty. Automated evaluation uses rule-based answer parsing followed by an LLM judge (Qwen3-VL-235B-A22B-Instruct) for correctness and format validation. Performance is reported as average accuracy. Baselines include closed-source models (GPT-4o, Claude-4-Sonnet, Gemini2.5-Pro), the open Qwen2.5-VL, and agentic and latent reasoning approaches. Ablations compare SFT-only, GRPO, and LA-GRPO training. Statistical significance and cross-validation details are not stated.
Reproducibility: The paper points to a public project page for code and data. ATLAS-178K is newly curated but derived from public source datasets, and the models build on the publicly available Qwen2.5-VL. Training details (freezing strategy, epochs, hyperparameters) are described at a high level, with no seeds or hardware specifics. The functional token vocabulary and mapping tables are provided, which makes the main ideas reproducible, though the paper does not explicitly say that the complete training pipeline code is released.
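
The Training Regime item describes LA-GRPO only in words. A minimal Python sketch of the idea follows, assuming a simplified, unclipped policy-gradient surrogate with length-normalized log-probabilities. The function names, the `aux_weight` value, and the exact form of the auxiliary term are illustrative assumptions, not the paper's formulation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def la_grpo_loss(logprobs, func_masks, rewards, aux_weight=0.5):
    """Sequence-level surrogate plus a token-level anchor on functional tokens.

    logprobs:   one list of per-token log-probabilities per sampled response
    func_masks: same shape; 1 where the token is a functional token, else 0
    rewards:    one scalar reward per response
    aux_weight: static weight of the anchor term (hypothetical value)
    """
    advantages = group_advantages(rewards)
    seq_terms, anchor_terms = [], []
    for lp, mask, adv in zip(logprobs, func_masks, advantages):
        # Sequence-level term: every token shares the same advantage.
        seq_terms.append(-adv * mean(lp))
        # Anchor term: average over functional-token positions only.
        func_lp = [l for l, m in zip(lp, mask) if m]
        if func_lp:
            anchor_terms.append(-adv * mean(func_lp))
    seq_loss = mean(seq_terms)
    anchor_loss = mean(anchor_terms) if anchor_terms else 0.0
    return seq_loss + aux_weight * anchor_loss, seq_loss, anchor_loss


# Toy group of two responses; only the first contains a functional token.
total, seq_loss, anchor_loss = la_grpo_loss(
    logprobs=[[-1.0, -2.0], [-3.0, -1.0]],
    func_masks=[[0, 1], [0, 0]],
    rewards=[1.0, 0.0],
)
print(total, seq_loss, anchor_loss)
```

In this sketch, a single functional token in a response of T tokens receives 1/T of that response's weight in the sequence-level term. The anchor term averages over functional positions only, so its per-token weight does not shrink as the response grows longer.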

Concrete example: For a visual reasoning query "How many plates on the counter?", the model predicts a normal text sequence interleaved with functional tokens like <|Shape|> (to highlight spatial regions), <|Arrow|> (to point out relevant items), and <|Text|> (to annotate counts), internally representing the reasoning states without generating intermediate images. This leads to improved answer correctness by guiding multi-step reasoning with sparse discrete visual operations learned from training trajectories.
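
To make the shape of such a trace concrete, the sketch below splits a hand-written response into pieces while keeping functional tokens atomic, then lists the operations in order and their share of the sequence. The trace text (including the answer "3") is invented for illustration, and whitespace splitting is only a stand-in for the model's real subword tokenizer.

```python
import re

FUNCTIONAL_TOKENS = ("<|Manip|>", "<|Shape|>", "<|Line|>", "<|Arrow|>", "<|Text|>")

# One alternation that matches any functional token; the capture group makes
# re.split keep the matched tokens in its output.
_PATTERN = re.compile("(" + "|".join(re.escape(t) for t in FUNCTIONAL_TOKENS) + ")")


def split_trace(text):
    """Split text into pieces, treating each functional token as one piece."""
    pieces = []
    for chunk in _PATTERN.split(text):
        if chunk in FUNCTIONAL_TOKENS:
            pieces.append(chunk)
        else:
            pieces.extend(chunk.split())  # crude word-level stand-in
    return pieces


def functional_ops(pieces):
    """Return the functional tokens in the order they were emitted."""
    return [p for p in pieces if p in FUNCTIONAL_TOKENS]


# Hypothetical trace for the query "How many plates on the counter?"
trace = (
    "<|Shape|> outline the counter region. "
    "<|Arrow|> point at each plate. "
    "<|Text|> label them 1, 2, 3. Answer: 3"
)
pieces = split_trace(trace)
ops = functional_ops(pieces)
share = len(ops) / len(pieces)
print(ops)
print(f"functional share: {share:.1%}")
```

The share in this toy trace is far higher than the 2.3% reported in the key findings, because the example is short and word-level pieces are coarser than subword tokens.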

Technical innovations

Introduction of discrete functional tokens embedded within the standard tokenizer vocabulary to represent internalized visual reasoning operations, combining properties of agentic and latent methods.
Design of a small, generalizable token taxonomy (<|Manip|>, <|Shape|>, <|Line|>, <|Arrow|>, <|Text|>) for diverse visual operations enabling compact, interpretable visual reasoning.
Training methodology combining supervised fine-tuning on functional-token annotated trajectories with reinforcement learning using a composite reward balancing answer correctness, token usage, and output formatting.
Latent-Anchored GRPO (LA-GRPO), a novel RL objective augmenting standard policy optimization with a token-level auxiliary loss on sparse functional tokens to mitigate gradient dilution and stabilize training.
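
The composite reward named in the list above can be sketched as a weighted sum of its components. Everything specific here is an assumption made for illustration: the `<answer>...</answer>` output format, the weights, and the `max_words` and `max_functional` thresholds are hypothetical and are not taken from the paper.

```python
import re

FUNCTIONAL_TOKENS = ("<|Manip|>", "<|Shape|>", "<|Line|>", "<|Arrow|>", "<|Text|>")

# Hypothetical output format: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def composite_reward(response, gold, max_words=200, max_functional=8):
    """Toy composite reward: correctness + usage + format - penalties."""
    match = ANSWER_RE.search(response)
    format_ok = match is not None
    correct = format_ok and match.group(1).strip() == gold.strip()
    n_func = sum(response.count(tok) for tok in FUNCTIONAL_TOKENS)
    n_words = len(response.split())

    reward = 1.0 * correct                          # answer correctness
    reward += 0.2 * (0 < n_func <= max_functional)  # functional tokens used sensibly
    reward += 0.1 * format_ok                       # format adherence
    reward -= 0.2 * (n_words > max_words)           # verbosity penalty
    reward -= 0.5 * (n_func > max_functional)       # token-spamming penalty
    return reward


good = "<|Shape|> mark the plates <|Text|> count them <answer>3</answer>"
spam = "<|Line|>" * 20 + " <answer>3</answer>"
print(composite_reward(good, "3"), composite_reward(spam, "3"))
```

Both toy responses give the correct answer, but the spamming one scores lower because it loses the usage bonus and incurs the spam penalty. That is the behavior the usage and penalty terms are meant to shape.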

Datasets

ATLAS-178K — 138,000 examples curated from parsed code in the V-Interaction-400K public subset, plus the 40,000 V-Perception-40K examples
V-Perception-40K — 40,000 examples — public dataset for low-level visual supervision
We-Math 2.0 — unspecified size — visual reasoning data used for RL training
MMK12 — unspecified size — visual reasoning data used for RL training
ThinkLite — unspecified size — visual reasoning data used for RL training
V-Interaction-400K (preview subset) — publicly released data source for ATLAS-178K parsing

Baselines vs proposed

Qwen2.5-VL on BLINK: accuracy = 22.8% vs ATLAS SFT: 46.0% vs ATLAS GRPO: 50.5% vs ATLAS LA-GRPO: 51.3%
Qwen2.5-VL on WeMath: 36.2% vs ATLAS SFT: 28.9% vs ATLAS GRPO: 40.3% vs ATLAS LA-GRPO: 45.0%
ATLAS LA-GRPO multi-view reasoning accuracy improves from 43.6% (GRPO) to 53.4%
On average across V* and BLINK benchmarks, ATLAS LA-GRPO outperforms models like V-Thinker, MCOT, Latent Visual Reasoning (LVR), and unified or agentic visual reasoning baselines.
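
The absolute gains implied by the figures above are easy to misread in the "vs" format, so the sketch below computes them in percentage points. The accuracy values are copied from the lines above; the helper name `gains_over_baseline` is illustrative.

```python
# Accuracies (%) as reported in the comparison lines above.
BLINK = {"Qwen2.5-VL": 22.8, "ATLAS SFT": 46.0, "ATLAS GRPO": 50.5, "ATLAS LA-GRPO": 51.3}
WEMATH = {"Qwen2.5-VL": 36.2, "ATLAS SFT": 28.9, "ATLAS GRPO": 40.3, "ATLAS LA-GRPO": 45.0}


def gains_over_baseline(scores, baseline="Qwen2.5-VL"):
    """Return each variant's gain over the baseline, in percentage points."""
    base = scores[baseline]
    return {
        name: round(acc - base, 1)
        for name, acc in scores.items()
        if name != baseline
    }


for benchmark, scores in (("BLINK", BLINK), ("WeMath", WEMATH)):
    print(benchmark, gains_over_baseline(scores))
```

Read this way, SFT alone sits 7.3 points below the base model on WeMath, and the RL stage recovers and exceeds the baseline (+4.1 with GRPO, +8.8 with LA-GRPO). On BLINK every stage is an improvement.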

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.15198.

Fig 1: Comparison of Visual Reasoning Paradigms. I: Unified models generate intermediate pixel-level images. II: Agentic

Fig 2: Overall Pipeline of ATLAS. ATLAS represents visual operations as functional tokens within the standard

Fig 3: Latent-Anchored GRPO. Standard GRPO provides sequence-level advantages to all generated tokens, which can

Limitations

Functional token vocabulary is limited to five general tokens, which may restrict expressiveness for more complex or specialized visual operations.
The ATLAS-178K dataset and mapping rely on well-structured code extraction and may not cover all visual reasoning scenarios or noisy real-world contexts.
Authors note some unstable functional-token invocation behaviors after RL, such as token spamming, despite mitigation with LA-GRPO.
Evaluation lacks explicit mention of statistical significance tests or robustness to distribution shifts or adversarial examples.
Hardware, training seeds, and full code details are not fully described, hindering exact reproducibility.
The solution focuses on efficiency and training compatibility but does not analyze inference latency or resource consumption quantitatively compared to unified visual generation methods.

Open questions / follow-ons

How does expanding the functional-token vocabulary to more diverse or fine-grained visual operations impact scalability and reasoning performance?
Can ATLAS functional tokens be integrated with or learned from visual grounding or pixel-level supervision to improve precision?
How robust is ATLAS to noisy or ambiguous inputs, and does it generalize well to out-of-domain visual reasoning tasks?
What are the trade-offs in inference efficiency and latency compared to existing agentic or unified visual reasoning frameworks in large-scale deployments?

Why it matters for bot defense

ATLAS presents a compact and scalable approach to interleaved visual reasoning by representing discrete visual operations as functional tokens within an autoregressive sequence. For bot-defense and CAPTCHA engineers, this method provides a promising way to integrate multi-step visual reasoning directly into language-based models with minimal overhead, avoiding costly intermediate image generation or external tool dependencies. The functional-token design offers interpretable signaling of visual operations that could enhance CAPTCHA challenges requiring stepwise reasoning or contextual image understanding.

Additionally, LA-GRPO addresses the problem of optimizing sparse yet critical signals in token sequences, a useful insight for reinforcement learning on multimodal tasks with sparse supervision. ATLAS's architectural and training ideas could inspire CAPTCHA generation or verification mechanisms that embed lightweight visual reasoning directly in language models, potentially increasing robustness against automated solvers while keeping training and inference efficient.

Cite

bibtex

@article{arxiv2605_15198,
  title={ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both},
  author={Ziyu Guo and Rain Liu and Xinyan Chen and Pheng-Ann Heng},
  journal={arXiv preprint arXiv:2605.15198},
  year={2026},
  url={https://arxiv.org/abs/2605.15198}
}
