DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?

Source: arXiv:2605.29615 · Published 2026-05-28 · By Linhao Zhang, Aiwei Liu, Yuan Liu, Xiao Zhou

TL;DR

This paper introduces DiffSpot, a benchmark that evaluates vision-language models (VLMs) on a fine-grained perception task: spotting subtle differences in rendered web interfaces. Existing VLM benchmarks focus on high-level image-text alignment and often neglect subtle, localized visual changes within complex GUI contexts. DiffSpot addresses this gap by programmatically generating before/after screenshot pairs of web pages in which exactly one CSS property of a single target element is mutated. The benchmark contains 4,400 pairs: 3,900 has-diff pairs balanced over 13 CSS operators and three difficulty tiers, plus 500 no-difference controls that measure hallucination. Zero-shot evaluation of 13 recent VLMs reveals very limited ability to identify these changes: the top model reaches only 40.7% Recall, and every model scores below 23% Recall on the Hard tier. Notably, difficulty is dominated by the type of CSS property changed rather than by pixel-difference magnitude or CLIP embedding distance, indicating that current VLMs struggle to perceive and name subtle CSS-level visual attributes. Open-ended visual difference detection on web UIs thus remains a significant challenge.

Key findings

DiffSpot consists of 4,400 image pairs: 3,900 has-diff pairs balanced across 13 CSS properties × 3 difficulty tiers (100 per cell), plus 500 no-diff pairs for hallucination control.
The best model, Gemini 3.1 Pro, achieves only 40.7% recall on true visual changes, missing roughly 3 out of 5 changes (Table 1).
Recall drops sharply with difficulty: Gemini 3.1 Pro falls from 60.5% on Easy to 22.7% on Hard-tier changes (−37.8 pp), with all models below 23% on Hard.
Per-operator recall varies greatly, with justify changes detected at up to 87.0% recall, while gradient, line_height, and rounded operators remain under 27% (Fig 3a).
Recall correlates poorly with pixel magnitude of difference (r = −0.08) and CLIP image distance (r = +0.06), showing that visual signal strength alone does not predict VLM detection ability (Fig 4).
No-diff pairs reveal a trade-off: models with higher recall tend to hallucinate more differences on unchanged pairs, while some abstain completely, achieving nearly 100% specificity but near-zero recall.
Scaling model size alone does not guarantee improvement; in the Qwen3-VL family, reasoning mode helps only at larger scale (at 235B, "Thinking" mode beats "Instruct" mode by 12.4 pp in overall accuracy, 28.3% vs 15.9%).
Judging model responses using three different LLM judges shows high consistency (Cohen’s κ > 0.9) and stable model rankings.

Threat model

The paper does not define an adversary in the security sense; it tests the limits of model perception. Read through a threat-model lens, the closest analogue is a black-box vision-language model attempting zero-shot, open-ended detection of a subtle style difference localized to a single web interface element, with no ability to access or modify the underlying HTML/CSS. Such a model cannot rely on semantic or large-scale visual cues; it must perceive fine-grained pixel-level differences confined to the mutated element’s bounding box.

Methodology — deep read

The paper's core methodology revolves around constructing a high-quality benchmark for fine-grained visual difference detection on rendered web interfaces using a 5-stage pipeline:

Threat model & assumptions: The task assumes a VLM must identify exactly what changed in a pair of near-identical screenshots (specifically, the mutation of one CSS property on a single target element) with no prior fine-tuning. Adversaries are not the focus; the benchmark tests the limits of model perception.
Data collection & processing: Starting from 2 million seed domains (Chrome User Experience Report and Majestic top 1M), the authors crawl and render 17.75 million page URLs using a headless Chromium + Playwright setup at a fixed 1280×800 viewport. They deduplicate by HTML structure fingerprint, yielding 9.04M unique pages. Each page is paired with a self-contained HTML version regenerated by a large language model to strip licensed content. Candidate pairs are filtered for realness using three VLM judges, content rules (for example, to exclude PII), and an LLM domain/style labeler.
Programmatic mutation: They define 13 CSS-property-level mutation operators grouped into typography, color, layout, and shape. Each operator has two mutation mechanisms (Tailwind CSS class swap or inline-style override). For each operator, mutations are stratified into Easy/Medium/Hard tiers by magnitude (e.g., size of color change, rounding radius offsets). Mutations act only on one element and property to isolate differences.
Grounding gate: To ensure that each mutation produces exactly one local pixel change, the authors render the mutated screenshot and compute its pixel difference from the original relative to the target element’s bounding box. Pairs with zero difference inside the bounding box, or any non-zero difference outside it, are rejected, so every accepted change is confined to the element.
Polish and filtering: The accepted grounded mutations are converted into polished natural language difference descriptions by an LLM. Additional filters remove low-quality cases.
Stratified sampling generates the final balanced dataset with 100 pairs per operator×tier cell (3,900 in total) plus 500 no-diff controls (identical pairs rendered twice).
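
The grounding gate described in the list above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `passes_grounding_gate`, the nested-list image representation, and the exclusive right/bottom bounding-box convention are all assumptions made for illustration, and both images are assumed to have identical dimensions.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]          # image[y][x], row-major (assumed layout)
BBox = Tuple[int, int, int, int]   # (left, top, right, bottom), right/bottom exclusive


def passes_grounding_gate(before: Image, after: Image, bbox: BBox) -> bool:
    """Accept a pair only if pixels differ inside bbox and nowhere outside it."""
    left, top, right, bottom = bbox
    changed_inside = False
    for y, (row_before, row_after) in enumerate(zip(before, after)):
        for x, (p_before, p_after) in enumerate(zip(row_before, row_after)):
            if p_before == p_after:
                continue
            if left <= x < right and top <= y < bottom:
                changed_inside = True
            else:
                return False  # a change leaked outside the target element
    return changed_inside  # False when nothing changed at all


def blank(width: int, height: int) -> Image:
    """Hypothetical helper: an all-white image for the toy example."""
    return [[(255, 255, 255) for _ in range(width)] for _ in range(height)]


before = blank(8, 6)
bbox = (2, 1, 5, 4)

local = blank(8, 6)
local[2][3] = (0, 0, 0)      # one changed pixel inside the bbox

leaked = blank(8, 6)
leaked[2][3] = (0, 0, 0)     # inside the bbox
leaked[5][7] = (0, 0, 0)     # outside the bbox

print(passes_grounding_gate(before, local, bbox))    # True
print(passes_grounding_gate(before, leaked, bbox))   # False
print(passes_grounding_gate(before, before, bbox))   # False
```

A real pipeline would more likely operate on decoded screenshot arrays, but the two rejection conditions (no change inside, any change outside) are the same.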

Evaluation is zero-shot. Thirteen recent VLMs including proprietary APIs (Gemini, GPT-5.4, Claude Opus) and open-weight models (Kimi K2.5, Qwen3.5, GLM-4.6V, InternVL3.5) are evaluated using a uniform prompt instructing the model to list differences between before/after screenshots. Models output open-ended text, which is auto-scored by an LLM judge under a visual-effect equivalence rubric tolerant to paraphrase. Metrics reported include per-tier Recall on has-diff pairs and specificity on no-diff pairs.
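
As a rough illustration of the two reported metrics, the sketch below computes per-tier Recall on has-diff pairs and specificity on no-diff pairs from already-judged records. The record fields (`has_diff`, `tier`, `hit`, `reported_any`) are hypothetical names, and the sketch assumes a no-diff pair counts as correct only when the model reports no difference at all; the paper's exact scoring rules may differ.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def summarize(records: List[dict]) -> Tuple[Dict[str, float], Optional[float]]:
    """Return (per-tier recall on has-diff pairs, specificity on no-diff pairs)."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    clean = 0            # no-diff pairs where the model reported nothing
    no_diff_total = 0
    for r in records:
        if r["has_diff"]:
            totals[r["tier"]] += 1
            hits[r["tier"]] += int(r["hit"])   # judge matched the ground truth
        else:
            no_diff_total += 1
            clean += int(not r["reported_any"])
    recall = {tier: hits[tier] / totals[tier] for tier in totals}
    specificity = clean / no_diff_total if no_diff_total else None
    return recall, specificity


# Toy records, invented purely to exercise the function.
toy = [
    {"has_diff": True, "tier": "easy", "hit": True},
    {"has_diff": True, "tier": "easy", "hit": False},
    {"has_diff": True, "tier": "hard", "hit": False},
    {"has_diff": True, "tier": "hard", "hit": False},
    {"has_diff": False, "tier": None, "reported_any": False},
    {"has_diff": False, "tier": None, "reported_any": True},
]

recall, specificity = summarize(toy)
print(recall)        # {'easy': 0.5, 'hard': 0.0}
print(specificity)   # 0.5
```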

The authors select a 16,384-token output budget with greedy decoding. Statistical validation shows the three judge LLMs have strong inter-rater reliability (Cohen’s kappa 0.92–0.94). A realness audit confirms rendered pages are visually close to originals.
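
For readers unfamiliar with the agreement statistic, here is a minimal sketch of Cohen's kappa for two judges. The paper uses three judge LLMs; comparing them pairwise over binary match / no-match verdicts is an assumption here, and the verdict lists are toy data rather than results from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy verdicts (1 = judge says the model found the ground-truth difference).
judge_1 = [1, 1, 1, 0, 0, 0, 1, 0]
judge_2 = [1, 1, 1, 0, 0, 0, 0, 0]

print(cohens_kappa(judge_1, judge_2))   # 0.75
```

In the toy data the judges agree on 7 of 8 items (0.875) while chance agreement is 0.5, giving kappa = 0.375 / 0.5 = 0.75.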

In sum, DiffSpot makes difference generation programmatic, controlled, and reproducible, allowing fine-grained evaluation of zero-shot VLM perception under controlled variation of mutation type and magnitude, with hallucination control.

Technical innovations

Code-driven difference generation by programmatically mutating single CSS properties in self-contained HTML pages to create near-identical before/after web UI screenshots.
Grounding gate mechanism validating that pixel difference is strictly localized within the target element’s bounding box to enforce precise code-to-pixel correspondence of changes.
A stratified dataset design balancing 13 CSS operators and 3 difficulty tiers with 100 pairs per cell plus no-diff controls to comprehensively probe property-level VLM capabilities.
Open-ended evaluation using LLM judges that match model free-text lists of visual differences against programmatic ground truth under a visual-effect-equivalence rubric.
Analysis showing pixel difference magnitude and CLIP embedding distance do not predict difficulty, highlighting the limitations of current visual features for subtle CSS-level changes.
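
The two mutation mechanisms mentioned in the methodology (Tailwind CSS class swap and inline-style override) can be sketched as simple string operations on an element's attributes. This is an illustrative sketch only: `swap_class` and `override_inline_style` are hypothetical helpers, not the authors' operators, and the style parser is deliberately naive (it would break on values that contain semicolons).

```python
def swap_class(class_attr: str, old: str, new: str) -> str:
    """Replace exactly one utility class token, leaving the others untouched."""
    tokens = class_attr.split()
    if tokens.count(old) != 1:
        raise ValueError(f"expected exactly one occurrence of {old!r}")
    return " ".join(new if token == old else token for token in tokens)


def override_inline_style(style_attr: str, prop: str, value: str) -> str:
    """Set one CSS property in an inline style string, keeping the rest."""
    declarations = {}
    for chunk in style_attr.split(";"):      # naive split, see lead-in
        if ":" in chunk:
            name, val = chunk.split(":", 1)
            declarations[name.strip()] = val.strip()
    declarations[prop] = value               # override or append the property
    return "; ".join(f"{k}: {v}" for k, v in declarations.items())


print(swap_class("flex justify-start rounded-md", "rounded-md", "rounded-lg"))
# flex justify-start rounded-lg
print(override_inline_style("color: red; line-height: 1.5", "line-height", "1.6"))
# color: red; line-height: 1.6
```

Restricting each mutation to a single token or a single declaration mirrors the benchmark's one-element, one-property constraint.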

Datasets

DiffSpot: 4,400 pairs (3,900 has-diff + 500 no-diff), programmatically generated by mutating pages drawn from a pool of 9.04M unique web pages; public at https://huggingface.co/datasets/tencent/DiffSpot

Baselines vs proposed

Trivial always-no-diff baseline: Accuracy = 11.4%
Gemini 3.1 Pro (proprietary): Recall has-diff = 40.7%, Overall Accuracy = 47.2%
Kimi K2.5 (open-weight, 1T params): Recall has-diff = 36.4%, Overall Accuracy = 42.2%
InternVL3.5-30B-A3B (open-weight): Recall has-diff = 4.2%, Overall Accuracy = 15.0%
Qwen3-VL-235B-Thinking vs Qwen3-VL-235B-Instruct: Accuracy 28.3% vs 15.9% (+12.4 pp), Easy Recall 30.1% vs 9.6% (+20.5 pp), showing the benefit of reasoning mode at scale
Models with higher recall tend to hallucinate more on no-diff pairs; GLM-4.6V-Flash hallucinates differences on 24.2% of no-diff pairs, Claude Opus 4.7 on 0.4%
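
The relationship between the numbers above can be checked with a short calculation. The sketch assumes that overall accuracy counts a has-diff pair as correct when the ground-truth change is found and a no-diff pair as correct when no difference is reported; under that assumption the always-no-diff baseline scores 500 / 4,400, which matches the 11.4% listed above. The function names are illustrative.

```python
HAS_DIFF = 13 * 3 * 100   # 13 operators x 3 tiers x 100 pairs = 3,900
NO_DIFF = 500
TOTAL = HAS_DIFF + NO_DIFF  # 4,400


def overall_accuracy(recall: float, specificity: float) -> float:
    """Pair-weighted accuracy over has-diff and no-diff pairs (assumed definition)."""
    return (recall * HAS_DIFF + specificity * NO_DIFF) / TOTAL


def implied_specificity(recall: float, accuracy: float) -> float:
    """Invert overall_accuracy to recover specificity from recall and accuracy."""
    return (accuracy * TOTAL - recall * HAS_DIFF) / NO_DIFF


# A model that always answers "no difference": recall 0, specificity 1.
trivial = overall_accuracy(recall=0.0, specificity=1.0)
print(f"{trivial:.1%}")   # 11.4%
```

Because the reported percentages are rounded, any specificity recovered with `implied_specificity` is only a back-of-the-envelope estimate, not a figure from the paper.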

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.29615.

Fig 3: Recall heatmaps across 13 models. Cells: has-diff Recall (%). Columns: models, sorted

Fig 4: Per-operator visual-signal magnitude vs. recall. Each dot is one CSS operator; both axes

Limitations

Benchmark focuses exclusively on single atomic CSS-property mutations, ignoring compound or reflow-heavy changes common in practice.
Zero-shot evaluation does not consider fine-tuning or adaptation which might improve recognition of subtle style changes.
The no-diff control only measures hallucination under identical renderings, not robustness to adversarial or noisy inputs.
Using LLM judges introduces potential bias despite high inter-rater agreement; human evaluation might reveal nuances that the judges miss.
Dataset limited to desktop viewport 1280×800 and headless Chromium rendering; real-world platforms and devices vary widely.
CSS mutation parameters and operator selections are fixed; other visual properties or frameworks might expose additional failure modes.

Open questions / follow-ons

Can fine-tuning or auxiliary training on CSS property changes improve VLM fine-grained visual diff detection substantially?
What role do architectural elements, multimodal fusion strategies, or attention mechanisms play in better parsing subtle style changes?
How do compound or multi-attribute visual changes impact model detection compared to single-property mutations?
Can incorporating layout or DOM structure information alongside pixel data boost performance on web UI difference tasks?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners focusing on web interfaces, DiffSpot provides a controlled benchmark that exposes the current limits of state-of-the-art VLMs on subtle GUI-level visual change detection. It highlights how difficult it is to train or use VLMs to robustly perceive localized style or layout differences, whether those differences are exploited or monitored in bot detection or checked in design validation. The property-level stratification shows which types of visual change are hardest for models, indicating where defenses that depend on fine-grained UI changes being visible (or invisible) to a model might fail or succeed. The grounding gate also suggests a method for validating that a visual change is localized to the intended UI component, which could improve automated toolchains for interface change verification or bot interaction auditing. However, the benchmark’s constraints (single-property mutations, zero-shot settings) mean that applying these findings to dynamic, real-world UX changes or adversarial bot behavior requires further adaptation and robustness research.

Cite

bibtex

@article{arxiv2605_29615,
  title={DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?},
  author={Linhao Zhang and Aiwei Liu and Yuan Liu and Xiao Zhou},
  journal={arXiv preprint arXiv:2605.29615},
  year={2026},
  url={https://arxiv.org/abs/2605.29615}
}
