Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Source: arXiv:2606.05702 · Published 2026-06-04 · By Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing, Ziqi Xu et al.

TL;DR

This paper addresses the underexplored ability of Vision-Language Models (VLMs) to perform chronological reasoning, that is, understanding and ordering events or objects across time, beyond the simple frame sequencing common in video benchmarks. The authors propose a benchmark suite designed specifically to probe how VLMs perceive and reason about temporal information both within images and across modalities (image and text). To this end, they construct three specialized datasets: (1) CHA, featuring visually similar Chinese artifacts spanning five dynasties with subtle stylistic evolution; (2) SPEED, a diverse collection of timestamped images spanning modern history in five thematic subdomains; and (3) HistNews, a curated set of historic news events paired with images for cross-modal chronological alignment. Through extensive zero-shot evaluation of six state-of-the-art models (both open and closed source), they find that whatever promise VLMs show is often superficial: models rely heavily on spurious shortcuts such as color cues (e.g., associating grayscale with older time periods) rather than authentic chronological reasoning. This work offers a rigorous multipronged evaluation framework and diagnostic insights to guide the future design of multimodal models with robust temporal understanding.

Key findings

Qwen2.5-VL-7B fine-tuned on CHA improves average dynasty classification accuracy from 43.43% (zero-shot) to 56.57% (Table II), confirming learnable chronological signals.
Six evaluated VLMs (Gemini, GPT, Qwen, InternVL, MiniCPM, GLM) attain overall chronological reasoning accuracy below 60% on combined benchmarks (Fig. 1), indicating substantial room for improvement.
Shortcut Task experiments show that converting images to grayscale (removing color cues) lowers accuracy in recognizing temporal order by up to 20-30%, evidence that models rely on a spurious 'grayscale is old' heuristic.
Artifacts-Sort task Kendall's Tau scores for fine-grained intra-category sorting are low (<0.4), showing limited ability to correctly order visually similar items chronologically.
In News-Multimodal tasks requiring alignment of textual event time with image year, models achieve below 50% accuracy under zero-shot, highlighting weaknesses in cross-modal temporal grounding.
Manual curation removed explicit OCR chronological cues (e.g., year digits in images) so that evaluations reflect genuine chronological reasoning rather than reading the date directly from visible text.
Datasets cover wide temporal ranges: CHA spans five Chinese dynasties (~600 years), SPEED captures 1952–2025 across five domains, and HistNews covers 1946–2025 events with cross verification by multiple annotators.
Randomization of prompts and answer choices ensured no positional bias in multiple-choice evaluations.

Threat model

The evaluation treats the Vision-Language Models themselves as adversaries that may exploit spurious cues such as image color instead of exhibiting genuine chronological reasoning. The models have no direct access to metadata or explicit year labels (these are removed or randomized), so their knowledge is limited to patterns learned from training data. The threat is that models produce superficially correlated but logically incorrect chronological predictions by leveraging shortcuts such as "grayscale implies old". The evaluation aims to detect and quantify this behavior.

Methodology — deep read

The paper constructs a rigorous evaluation framework targeting chronological reasoning in Vision-Language Models, comprising three main tasks: Artifacts, Shortcut, and News.

Threat model & assumptions: The adversary here is conceptually the VLM itself, which may rely on shortcuts or superficial cues rather than authentic chronological signals. The evaluation seeks to isolate whether models genuinely understand time-based logic, assuming no explicit chronological metadata or direct textual leakage is provided.
Data provenance and preprocessing:

CHA dataset: 887 expert-verified high-res images of Chinese artifacts from five dynasties. Data sourced from official governmental open data platforms (e.g., https://data.gov.tw), with expert annotation to ensure chronological fidelity.
SPEED dataset: Initially 2,077 images crawled from Wikimedia Commons spanning 1952 to 2025 across Sports, Politics, Electronics, Emergency, and Diversity categories. After manual filtering for quality and removing explicit OCR cues and ambiguous cityscapes, 1,028 images remain.
HistNews: 400 curated news events (1946-2025) from Wikipedia, manually verified and structured in JSON with time, description, and source URL.
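The HistNews record layout described above (time, description, source URL) can be pictured with a small loader. This is only a sketch: the field names `time`, `description`, and `source_url`, the ISO-style date string, and the sample record are assumptions for illustration, not the dataset's actual schema or contents.

```python
import json
from dataclasses import dataclass

# Hypothetical field names; the released JSON may use different keys.
REQUIRED_FIELDS = {"time", "description", "source_url"}


@dataclass(frozen=True)
class NewsEvent:
    time: str          # assumed to start with a four-digit year, e.g. "1969-07-20"
    description: str
    source_url: str


def event_year(event: NewsEvent) -> int:
    """Extract the year from the assumed 'YYYY-...' time string."""
    return int(event.time[:4])


def parse_event(raw: str) -> NewsEvent:
    """Parse one JSON record and check it against the stated 1946-2025 range."""
    obj = json.loads(raw)
    missing = REQUIRED_FIELDS - set(obj)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    event = NewsEvent(obj["time"], obj["description"], obj["source_url"])
    if not 1946 <= event_year(event) <= 2025:
        raise ValueError(f"year outside 1946-2025: {event.time}")
    return event


# Illustrative record only; not taken from the dataset.
sample = json.dumps({
    "time": "1969-07-20",
    "description": "Apollo 11 lands on the Moon.",
    "source_url": "https://en.wikipedia.org/wiki/Apollo_11",
})
event = parse_event(sample)
```

Validating the year range at load time is one cheap way to catch annotation slips before they reach the News-Year metric.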

Architecture/algorithms:

The benchmark is model-agnostic; six state-of-the-art VLMs are evaluated in zero-shot mode using unified prompt templates.
The Shortcut Task introduces controlled grayscale versions of images using ITU-R 601 luma conversion and three-channel reconstruction to test reliance on color as a shortcut.
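A dependency-free sketch of the grayscale step, assuming the standard ITU-R BT.601 luma weights (0.299, 0.587, 0.114) and an image represented as nested lists of 8-bit RGB tuples. The function name and the list-based representation are illustrative choices; the authors' implementation is not shown in this summary, and in practice a vectorized NumPy or Pillow routine would be the usual choice.

```python
# BT.601 luma weights for R, G, B.
LUMA_R, LUMA_G, LUMA_B = 0.299, 0.587, 0.114


def to_luma_gray(image):
    """Convert an RGB image to grayscale while keeping three channels.

    `image` is a list of rows; each row is a list of (R, G, B) ints in 0..255.
    Every output pixel repeats the luma value in all three channels, so the
    result has the same shape a VLM image encoder expects for color input.
    """
    gray = []
    for row in image:
        gray_row = []
        for r, g, b in row:
            y = int(round(LUMA_R * r + LUMA_G * g + LUMA_B * b))
            y = max(0, min(255, y))  # guard against rounding past the 8-bit range
            gray_row.append((y, y, y))
        gray.append(gray_row)
    return gray


# A 1x3 test image: pure red, pure green, pure blue.
gray = to_luma_gray([[(255, 0, 0), (0, 255, 0), (0, 0, 255)]])
```

Replicating the luma value across three channels keeps the input format identical between the color and grayscale conditions, so any accuracy change can be attributed to the missing color information rather than to a change in tensor shape.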

Training regime:

One supervised fine-tuning baseline was produced by fine-tuning Qwen2.5-VL-7B on CHA, improving average dynasty classification accuracy from 43.43% to 56.57%.
All other evaluations run the models zero-shot with deterministic decoding (temperature=0) and randomized input ordering to prevent positional bias.
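The randomized ordering used to prevent positional bias can be sketched as follows. The helper name `build_multiple_choice`, the letter labels, and the prompt layout are assumptions for illustration; the paper's actual prompt template is not reproduced here.

```python
import random
import string


def build_multiple_choice(question, options, answer, rng):
    """Shuffle the answer options and return (prompt_text, gold_label).

    A seeded `random.Random` instance keeps each run reproducible while the
    correct option still lands in a different position from item to item.
    """
    if answer not in options:
        raise ValueError("answer must be one of the options")
    shuffled = list(options)          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    labels = string.ascii_uppercase[:len(shuffled)]
    lines = [question]
    lines += [f"{label}. {opt}" for label, opt in zip(labels, shuffled)]
    gold = labels[shuffled.index(answer)]
    return "\n".join(lines), gold


DYNASTIES = ["Tang", "Song", "Yuan", "Ming", "Qing"]
prompt, gold = build_multiple_choice(
    "In which dynasty did this artifact first appear?",
    DYNASTIES,
    "Tang",
    random.Random(0),
)
```

Scoring against the returned gold label rather than a fixed position means a model that always answers "A" gains nothing from the layout.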

Evaluation protocol:

Artifacts Task:
- Chronological Localization: multi-choice dynasty classification accuracy.
- Artifacts-Sort: Order ranking using Kendall’s Tau score across intra-category (n=4) and cross-category (n=5) sets.
Shortcut Task: Three variants per test pair (color-color, grayscale-color, color-grayscale) to isolate the effect of color on binary chronological order classification; 1,200 controlled tests.
News Task:
- News-Year: mean absolute error on predicting the exact year from a single image.
- News-Multimodal: multi-choice accuracy on aligning images with event textual descriptions.
Statistical controls include prompt standardization, shuffling images and options, and human annotator consensus to exclude direct clue leakage.
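The two non-accuracy metrics in this protocol, Kendall's Tau for Artifacts-Sort and mean absolute error for News-Year, can be written out in a few lines. This is a minimal sketch assuming each ordering contains the same distinct items with no ties (the tau-a form); whether the paper uses a tie-corrected variant is not stated in this summary, and the function names are illustrative.

```python
from itertools import combinations


def kendall_tau(predicted, truth):
    """Kendall's tau-a between two orderings of the same distinct items.

    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    """
    if len(predicted) != len(truth) or set(predicted) != set(truth):
        raise ValueError("orderings must contain the same items")
    if len(truth) < 2:
        raise ValueError("need at least two items")
    position = {item: i for i, item in enumerate(predicted)}
    concordant = discordant = 0
    # Each pair (a, b) is taken with a before b in the ground-truth order.
    for a, b in combinations(truth, 2):
        if position[a] < position[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def mean_absolute_error(predicted_years, true_years):
    """Average absolute gap, in years, between predictions and ground truth."""
    if len(predicted_years) != len(true_years) or not true_years:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) for p, t in zip(predicted_years, true_years))
    return total / len(true_years)


# An intra-category set (n=4) where the model swaps the last two items.
tau = kendall_tau(["a", "b", "d", "c"], ["a", "b", "c", "d"])
```

With n=4 there are only six item pairs, so a single adjacent swap already lowers tau from 1.0 to about 0.67; scores below 0.4 therefore correspond to at least two misordered pairs per set.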

Reproducibility:

Source code and datasets released at https://github.com/LuoRenqiang/ChronoVision.
Datasets partially from public sources (government websites, Wikimedia Commons, Wikipedia) with expert/manual verification.

Concrete example: In the Artifacts-Chronological Localization task, the model receives an image of a jade artifact with a prompt: “In which dynasty did this artifact first appear? Options: Tang, Song, Yuan, Ming, Qing.” The zero-shot Qwen2.5-VL-7B model predicts “Song” while ground truth is “Tang.” After fine-tuning on CHA, the same model correctly predicts “Tang,” increasing accuracy and confirming the learning signal for fine-grained stylistic chronological discrimination. Meanwhile, in the Shortcut Task, replacing color images with grayscale versions causes model confidence in ordering temporal pairs to diminish, confirming spurious color shortcut usage.

Technical innovations

A novel benchmark framework evaluating chronological reasoning in Vision-Language Models beyond video-frame ordering, emphasizing long-span historical evolution and multimodal integration.
Construction of three specialized datasets (CHA, SPEED, HistNews) combining expert-verified historical artifacts, timestamped domain-diverse photos, and curated event-based text-image pairs to holistically probe chronological reasoning and alignment.
Development of a Shortcut Task isolating color bias via controlled grayscale transformations to diagnostically reveal VLMs’ exploitation of superficial visual shortcuts instead of authentic temporal logic.
Rigorous human-in-the-loop verification protocols ensuring removal of explicit OCR or direct chronological metadata, enforcing assessment of genuine reasoning rather than trivial shortcut exploitation.

Datasets

CHA (Chinese Historical Artifacts) — 887 images spanning Tang to Qing dynasties (600+ years) — sourced from official government open platforms, annotated by museum experts.
SPEED (Sports, Politics, Electronics, Emergency, Diversity) — 1,028 curated, timestamped images from 1952–2025 — from Wikimedia Commons, filtered for ambiguity and OCR leakage.
HistNews — 400 historic events from 1946 to 2025 paired with images — curated from Wikipedia and manually verified.

Baselines vs proposed

Qwen2.5-VL-7B zero-shot on CHA: average dynasty classification accuracy = 43.43% vs fine-tuned: 56.57% (Table II).
Overall performance (Fig. 1): Gemini ~55%, GPT ~58%, Qwen ~52%, InternVL ~48%, MiniCPM ~45%, GLM ~40% integrated accuracy across all benchmark tasks.
Shortcut Task color vs grayscale performance drop ~20-30%, illustrating color shortcut reliance.
Artifacts-Sort Kendall’s Tau scores <0.4 intra-category, indicating poor chronological ordering.
News-Multimodal task zero-shot accuracy <50%, suggesting weak cross-modal temporal alignment.
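The Shortcut Task drop quoted above can be read either as percentage points or as a relative change, and the summary does not say which. The sketch below computes both from per-condition results; the record format and the toy numbers are made up for illustration and are not the paper's data.

```python
from collections import defaultdict


def accuracy_by_condition(results):
    """Aggregate (condition, is_correct) records into per-condition accuracy."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, is_correct in results:
        totals[condition] += 1
        hits[condition] += int(is_correct)
    return {condition: hits[condition] / totals[condition] for condition in totals}


def accuracy_drop(baseline, perturbed):
    """Return (drop in percentage points, relative drop in percent)."""
    points = (baseline - perturbed) * 100
    relative = (baseline - perturbed) / baseline * 100 if baseline else float("nan")
    return points, relative


# Toy records (NOT from the paper): 8/10 correct with color, 6/10 with grayscale.
toy_results = (
    [("color-color", True)] * 8 + [("color-color", False)] * 2
    + [("grayscale-color", True)] * 6 + [("grayscale-color", False)] * 4
)
acc = accuracy_by_condition(toy_results)
points, relative = accuracy_drop(acc["color-color"], acc["grayscale-color"])
```

In this toy case the same change reads as a 20-point drop or a 25% relative drop, which is why reports of shortcut reliance should state the convention used.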

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.05702.

Fig 2

Fig 2: Sample images and data structures from the proposed benchmark, including the Artifacts (CHA), Shortcut (SPEED), and News (HistNews) tasks.

Fig 3

Fig 3: A cityscape photo excluded from the Politics category.

Fig 4

Fig 4: A photo containing OCR-readable text, excluded from the Sports category.

Fig 1

Fig 1: Overall performance of six VLMs across the proposed benchmark.

Fig 5

Fig 5: An Example of the Artifacts-Chronological Localization Task.

Fig 6

Fig 6: An Example of the Artifacts-Sort Task.

Fig 7

Fig 7 (page 3).

Fig 8

Fig 8 (page 3).

Limitations

Evaluations are primarily zero-shot, with a single fine-tuning experiment on CHA; whether fine-tuning gains generalize to the other datasets remains untested.
Models evaluated are predominantly large-scale transformers; applicability to other architectures or lightweight models unexplored.
The Shortcut Task controls for color, but other shortcuts (e.g., texture or style correlations) remain unexamined.
Historical artifact categories focus on Chinese dynasties; geographical and cultural generalization to other histories is uncertain.
No adversarial or robustness testing against deliberate temporal perturbations is reported.
Distribution shift in real-world deployment scenarios and temporal domain gaps are not examined.

Open questions / follow-ons

Can fine-tuning or specialized temporal reasoning architectures significantly reduce shortcut reliance and improve chronological understanding across diverse domains?
How do different VLM architectures (e.g., contrastive vs generative) compare in their ability to encode fine-grained temporal features?
What additional modalities or auxiliary signals (e.g., metadata, temporal graph embeddings) can enhance multimodal chronological alignment?
How robust are these models’ temporal reasoning capabilities under adversarially perturbed or out-of-distribution chronological contexts?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work highlights a subtle yet critical dimension of multimodal model evaluation: temporal reasoning and shortcut susceptibility. Many security systems rely on vision-language models to detect spoofing or contextual anomalies. Because current VLMs often rely on superficial visual features such as color rather than authentic temporal reasoning, they could be fooled by simple image manipulations (e.g., grayscale conversion) or by inputs that exploit outdated visual priors. The benchmark and datasets provide a diagnostic tool for analyzing bot or adversarial inputs that manipulate visual time cues to evade detection. Moreover, the demonstrated difficulty of aligning textual and visual timestamps underscores how hard it is to verify multimodal temporal authenticity in suspicious user-generated content, a vector attackers could exploit. Integrating these temporal reasoning tests into CAPTCHA or bot-defense pipelines could help detect the temporal inconsistencies characteristic of automated or manipulated media.

Cite

bibtex

@article{arxiv2606_05702,
  title={Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models},
  author={Haoyu Zhou and Qing Qing and Caichong Li and Qixin Zhang and Yongcheng Jing and Ziqi Xu and Juncheng Hu and Xikun Zhang and Renqiang Luo},
  journal={arXiv preprint arXiv:2606.05702},
  year={2026},
  url={https://arxiv.org/abs/2606.05702}
}
