From Web to Pixels: Bringing Agentic Search into Visual Perception

Source: arXiv:2605.12497 · Published 2026-05-12 · By Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue

TL;DR

This paper studies a harder open-world version of visual perception: instead of grounding or segmenting an object based only on what is visible or what a model already knows, the system must first recover the target identity from external evidence on the web and then map that resolved entity back to a specific object in the image. The authors call this setting Perception Deep Research. The core motivation is that many real queries involve recent events, long-tail entities, brand/creator relations, or multi-hop clues that are not directly inferable from pixels alone.

To make the problem measurable, they introduce WebEyes, an object-anchored benchmark that links each annotated instance to verifiable web evidence and knowledge-intensive questions, with three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. They also propose Pixel-Searcher, an agentic workflow that searches for hidden entity evidence, resolves the entity, and then binds it to a box, mask, or answer choice. On WebEyes, Pixel-Searcher is the strongest open-source method across all three task views, but the paper finds that most failures happen before final mask refinement: the bottlenecks are evidence acquisition, identity resolution, and instance binding.

Key findings

WebEyes contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples; 637 of the QA pairs support Search-based VQA options.
On Search-based Grounding, Pixel-Searcher improves Qwen3-VL-8B from 26.81 to 34.17 IoU and from 32.61 to 41.30 [email protected] (Table 2).
On Search-based Segmentation, Pixel-Searcher improves Qwen3-VL-8B from 35.78 to 39.17 gIoU and from 25.94 to 32.41 cIoU (Table 3).
On Search-based VQA, Pixel-Searcher improves Qwen3-VL-8B from 36.34 to 42.24 overall accuracy (Table 4).
The benchmark construction pipeline rejected 38.2% of automatically generated candidates with shortcut filters, then rejected another 49.2% of the remaining samples in manual review.
Among 389 segmentation failures analyzed, 304 were search/entity errors, 75 were entity-correct region errors, and only 10 were box-to-mask transfer errors; the dominant failure mode is upstream of masking.
In Search-based Grounding, the best closed-source model in Table 2 is Gemini-3.1-Pro at 30.52 IoU / 35.09 [email protected] overall, while Pixel-Searcher reaches 34.17 IoU / 41.30 [email protected].
In Search-based Segmentation, the best closed-source model in Table 3 is Gemini-3.1-Pro at 54.56 gIoU / 38.76 cIoU overall, compared with Pixel-Searcher at 39.17 gIoU / 32.41 cIoU; Pixel-Searcher still leads among open-source methods.
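The rejection rates and failure counts above can be combined into overall figures with a little arithmetic. The following is a minimal sketch that uses only numbers quoted in the findings. It assumes the two rejection stages apply sequentially, as the phrase "of the remaining samples" suggests; the variable names are illustrative.

```python
# Figures quoted in the key findings above.
shortcut_reject = 0.382  # share rejected by automatic shortcut filters
manual_reject = 0.492    # share of the remainder rejected in manual review

# Assumption: manual review is applied only to survivors of the filters.
retained = (1 - shortcut_reject) * (1 - manual_reject)

failures = {
    "search/entity": 304,
    "entity-correct region": 75,
    "box-to-mask transfer": 10,
}
total = sum(failures.values())
shares = {name: count / total for name, count in failures.items()}

print(f"retained candidates: {retained:.1%}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under that assumption, roughly 31% of automatically generated candidates survive both stages, and about 78% of the analyzed segmentation failures are search/entity errors.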

Threat model

The adversary is the task’s inherent open-world ambiguity: the query intentionally omits the target’s name and instead encodes it through indirect facts, recent events, brand relations, creators, or multi-hop clues. The system is assumed to have access to web search and public evidence, but not to hidden labels or private knowledge; the image may contain multiple plausible candidates and distractors. The model cannot assume the decisive identity is directly visible or recoverable from frozen parametric knowledge alone, and it must verify that the resolved identity matches a concrete region before outputting a box, mask, or answer choice.

Methodology — deep read

Threat model and task framing: the adversary is not a malicious user in the classical security sense but the open-world ambiguity of the query itself. The system is given an image and a knowledge-intensive query whose decisive evidence lies outside the image and may not be present in frozen model knowledge. The model is assumed to have access to web search during inference, but not to hidden ground-truth annotations. The paper explicitly assumes the query may refer to an object indirectly via creator, brand, release history, role, recent event, or multi-hop relations, and that the target may be confounded by distractor objects with similar appearance. The core requirement is to resolve the hidden identity from evidence first, then ground it to a concrete region. The paper argues that direct perception baselines are insufficient because image cues and priors suffice only when the target is already visually obvious.

Data provenance and benchmark construction: WebEyes is built object-first. The authors start from images collected from web image search, news pages, and social-media posts, focusing on multi-instance scenes with plausible distractors and on categories such as icons, celebrities, pop-culture IPs, anime/game characters, products, and vehicles. Annotators mark foreground instances, refine masks using SAM3, and save the mask, box, object name, and category. For each object, the system summarizes visible cues into text, then performs a three-round chained search using the object name, category, context, and image-checkable features. The search is constrained to public evidence within a six-month window before annotation, and the evidence is expanded into multi-hop paths. From these records, the authors generate knowledge-based questions by hiding the target entity name and direct visual attributes. The benchmark includes 120 images, 473 objects, 645 unique QA pairs, and 1,927 task samples; 645 SearchGround and 645 SearchSeg samples are derived from those QA pairs, and 637 support SearchVQA. The excerpted text gives no train/val/test split. It states only that all methods use the same WebEyes inputs and splits, which suggests a fixed benchmark split whose sizes are not visible in the provided text.
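The object-first layout described above can be pictured as a small record schema. This is a hypothetical sketch: the class names, field names, box coordinates, and mask path are illustrative and are not the released WebEyes format. It also shows how 645 QA pairs, 637 of which have answer options, expand to the 1,927 task samples reported.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ObjectAnnotation:
    """One annotated instance (field names are assumptions)."""
    object_name: str
    category: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    mask_path: str


@dataclass
class QAPair:
    question: str  # hides the entity name and direct visual attributes
    target: ObjectAnnotation
    evidence_urls: List[str] = field(default_factory=list)
    vqa_options: Optional[List[str]] = None  # None when no option set exists


def task_samples(qa_pairs):
    """Every QA pair yields a grounding and a segmentation sample;
    only pairs with answer options also yield a VQA sample."""
    samples = []
    for qa in qa_pairs:
        samples.append(("SearchGround", qa))
        samples.append(("SearchSeg", qa))
        if qa.vqa_options:
            samples.append(("SearchVQA", qa))
    return samples


# Toy records with placeholder values.
target = ObjectAnnotation("Nintendo Switch 2", "product",
                          (10.0, 20.0, 110.0, 220.0), "masks/placeholder.png")
with_options = QAPair("placeholder question", target,
                      vqa_options=["option A", "option B"])
without_options = QAPair("placeholder question", target)
samples = task_samples([with_options, without_options])
print([task for task, _ in samples])

# Count check against the figures in the text.
assert 645 + 645 + 637 == 1927
```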

Architecture / algorithm: Pixel-Searcher is a two-phase agentic pipeline. In phase 1, Agentic Search & Target Resolution, the model decomposes the query if needed and alternates among Search, Reason, and Resolve actions until it produces a structured hypothesis $h = \{e, c, K\}$, where $e$ is the resolved entity name, $c$ is its coarse visual category, and $K$ is a set of image-checkable cues distilled from external evidence. This hypothesis is the novel bridge between web evidence and visual grounding: instead of carrying forward the whole query, the system keeps only the resolved identity and the cues that can be checked against image content. In phase 2, Agentic Grounding & Tool Use, the resolved hypothesis is used to bind the target to a region in the image. For Search-based Grounding, the verified box is returned directly. For Search-based Segmentation, the verified box is passed to a promptable segmentation tool; in the experiments this is SAM3. For Search-based VQA, the process runs in reverse: given a highlighted region and answer options, the model converts each option into an entity-level summary and picks the one best supported by the region and evidence. The paper’s equations define the resolved hypothesis as $h = R(q, E_{1:T})$, grounding as $b^* = A_{\mathrm{bind}}(I, h)$, segmentation as $\hat{y}_{\mathrm{seg}} = T_{\mathrm{seg}}(I, b^*)$, and the VQA answer as $\arg\max_k A_{\mathrm{vqa}}(I, b, o_k)$. A concrete example in Fig. 2 shows the model decomposing a question about a device with a Mario Kart World bundle into subfacts, finding through search that Mario is tied to Donkey Kong under the Jumpman name, then resolving the target as Nintendo Switch 2 and grounding it to a box.
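The control flow of the two phases can be sketched with injectable callables. This is not the authors' implementation. The function names, the `max_steps` budget, and the toy stubs are all assumptions; a real system would back `search`, `reason`, `resolve`, `bind`, `segment`, and `score` with a search API, a multimodal model, and a segmentation tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    entity: str      # e: resolved entity name
    category: str    # c: coarse visual category
    cues: List[str]  # K: image-checkable cues


def resolve_target(query, search, reason, resolve, max_steps=5):
    """Phase 1: loop over Reason, Search, Resolve until a hypothesis appears."""
    evidence = []
    for _ in range(max_steps):
        sub_query = reason(query, evidence)    # decide what to look up next
        evidence.extend(search(sub_query))     # gather external evidence
        hypothesis = resolve(query, evidence)  # try to commit to an entity
        if hypothesis is not None:
            return hypothesis
    return None


def ground(image, hypothesis, bind, segment=None):
    """Phase 2: bind h to a box; optionally hand the box to a mask tool."""
    box = bind(image, hypothesis)
    return box if segment is None else segment(image, box)


def answer_vqa(image, box, options, score):
    """VQA view: index of the option best supported by the region."""
    return max(range(len(options)), key=lambda k: score(image, box, options[k]))


# Toy stubs standing in for real search and model calls.
def toy_reason(query, evidence):
    return query if not evidence else evidence[-1]

def toy_search(sub_query):
    return [f"note about {sub_query}"]

def toy_resolve(query, evidence):
    if len(evidence) >= 2:
        return Hypothesis("Nintendo Switch 2", "product", ["placeholder cue"])
    return None

h = resolve_target("device with a Mario Kart World bundle",
                   toy_search, toy_reason, toy_resolve)
box = ground(None, h, bind=lambda img, hyp: (10.0, 20.0, 110.0, 220.0))
choice = answer_vqa(None, box, ["some other device", h.entity],
                    score=lambda img, b, option: float(option == h.entity))
print(h.entity, box, choice)
```

Passing the stages in as callables keeps the sketch testable without network access and makes it easy to log each stage separately.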

Training regime and implementation details: the excerpt does not describe end-to-end supervised training of Pixel-Searcher; it reads as an inference-time agentic workflow layered on top of existing multimodal models and tools. The paper explicitly says all methods use the same WebEyes inputs, splits, and task-specific output interfaces, without task-specific fine-tuning. Direct baselines predict boxes from image plus query, and segmentation baselines convert boxes to masks using SAM3. Qwen3-VL-8B-Instruct serves as the open-source general baseline for comparison with Pixel-Searcher; Qwen-3.5 reportedly had weaker instruction following and often failed to output valid boxes in preliminary trials. The excerpt does not provide optimizer, learning rate, epoch count, batch size, random seed strategy, or hardware for training, which is consistent with a benchmark/inference paper rather than a training-heavy model paper.

Evaluation protocol, metrics, and a concrete end-to-end evaluation example: the authors evaluate three task views. Search-based Grounding uses IoU and [email protected]. Search-based Segmentation uses gIoU and cIoU. Search-based VQA uses exact-match accuracy over answer choices. They compare Pixel-Searcher against closed-source multimodal systems, open-source grounding/segmentation/QA-specific models, and open-source general models. Table 2 shows category-stratified grounding results across Vehicles, Pop-IP, Anime, ICON, Celebrities, and PRODUCT; Table 3 and Table 4 do the same for segmentation and VQA. The central empirical claim is not just that Pixel-Searcher beats other open-source methods, but that the gains are strongest when the target identity is ambiguous and needs web resolution before pixel localization. One concrete example: on Search-based Grounding, Qwen3-VL-8B reaches 26.81 overall IoU and 32.61 [email protected], while Pixel-Searcher reaches 34.17 IoU and 41.30 [email protected]. The ablation in Table 5 removes or simplifies grounding cues and shows that direct localization candidates are crucial: removing them drops IoU from 34.17 to 20.14 and [email protected] from 41.30 to 19.72. Reference matching and contradiction checking give smaller but consistent improvements. The failure analysis then ties the remaining errors to search/entity resolution rather than mask refinement, with only 10 of 389 failures attributed to box-to-mask transfer.
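The metrics named above can be sketched in plain Python. The excerpt does not spell out the formulas, so this follows the common convention: box IoU averaged over samples, a hit rate at an IoU threshold, gIoU as the mean of per-sample mask IoUs, and cIoU as summed intersections over summed unions. The 0.5 default threshold and the set-of-pixels mask representation are assumptions.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_scores(preds, gts, threshold=0.5):
    """Mean box IoU and the share of samples with IoU >= threshold."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    mean_iou = sum(ious) / len(ious)
    hit_rate = sum(i >= threshold for i in ious) / len(ious)
    return mean_iou, hit_rate


def mask_scores(preds, gts):
    """preds and gts are lists of sets of (row, col) pixels.
    gIoU averages per-sample IoU; cIoU pools pixels across samples."""
    per_sample, inter_sum, union_sum = [], 0, 0
    for p, g in zip(preds, gts):
        inter, union = len(p & g), len(p | g)
        per_sample.append(inter / union if union else 1.0)
        inter_sum += inter
        union_sum += union
    giou = sum(per_sample) / len(per_sample)
    ciou = inter_sum / union_sum if union_sum else 1.0
    return giou, ciou


# Toy data: one perfect prediction and one poor one.
mean_iou, hit_rate = grounding_scores(
    [(0, 0, 2, 2), (0, 0, 2, 2)],
    [(0, 0, 2, 2), (1, 1, 3, 3)],
)
giou, ciou = mask_scores(
    [{(0, 0), (0, 1)}, {(0, 0)}],
    [{(0, 0), (0, 1)}, {(0, 0), (0, 1), (1, 0), (1, 1)}],
)
print(mean_iou, hit_rate, giou, ciou)
```

On the toy masks gIoU is 0.625 while cIoU is 0.5. cIoU weights large objects more heavily, which is one reason the two numbers can diverge in the reported tables.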

Reproducibility: the excerpt points to a GitHub repository and project website, and the authors present WebEyes as a released benchmark. However, the provided text does not specify whether raw source data are redistributed, whether all evidence URLs are frozen, whether model weights are released, or whether the exact annotation split and seed settings are published. The paper does make the workflow fairly inspectable through its staged pipeline and per-sample evidence chain, but several operational details needed for full reproduction are not visible in the excerpt.

Technical innovations

Defines Perception Deep Research as a new open-world visual perception setting where external web evidence must be resolved before an object can be grounded.
Introduces WebEyes, an object-anchored benchmark that ties each instance to verifiable evidence and supports grounding, segmentation, and VQA from the same annotation layer.
Proposes Pixel-Searcher, a two-phase agentic search-to-pixel workflow that first resolves a hidden entity into {entity, category, cues} and then binds that hypothesis to a box, mask, or answer.
Uses an explicit search-reason-resolve loop plus tool-assisted grounding, rather than one-shot text-to-box prediction or pure parametric reasoning.
Provides diagnostic ablations and failure decomposition showing that the dominant bottleneck is evidence acquisition and identity resolution, not mask refinement.

Datasets

WebEyes — 120 images, 473 annotated object instances, 645 unique QA pairs, 1,927 task samples — constructed from web image search, news pages, and social-media posts

Baselines vs proposed

Qwen3-VL-8B: SearchGround IoU = 26.81 vs proposed: 34.17; [email protected] = 32.61 vs proposed: 41.30
Qwen3-VL-8B: SearchSeg gIoU = 35.78 vs proposed: 39.17; cIoU = 25.94 vs proposed: 32.41
Qwen3-VL-8B: SearchVQA accuracy = 36.34 vs proposed: 42.24
Gemini-3.1-Pro: SearchGround IoU = 30.52 vs proposed: 34.17; [email protected] = 35.09 vs proposed: 41.30
Gemini-3.1-Pro: SearchSeg gIoU = 54.56 vs proposed: 39.17; cIoU = 38.76 vs proposed: 32.41
Gemini-3.1-Pro: SearchVQA accuracy = 63.82 vs proposed: 42.24
OneThinker-8B: SearchGround IoU = 32.78 vs proposed: 34.17; [email protected] = 38.70 vs proposed: 41.30
OneThinker-8B: SearchSeg gIoU = 35.60 vs proposed: 39.17; cIoU = 24.46 vs proposed: 32.41
OneThinker-8B: SearchVQA accuracy = 28.26 vs proposed: 42.24
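The comparison above is easier to scan as signed gaps, where a positive value means Pixel-Searcher is ahead. This small sketch copies the (baseline, proposed) pairs from the list and subtracts them. The thresholded grounding accuracy is left out for brevity, and the dictionary layout is illustrative.

```python
# (baseline, proposed) pairs copied from the list above.
results = {
    "Qwen3-VL-8B": {
        "SearchGround IoU": (26.81, 34.17),
        "SearchSeg gIoU": (35.78, 39.17),
        "SearchSeg cIoU": (25.94, 32.41),
        "SearchVQA accuracy": (36.34, 42.24),
    },
    "Gemini-3.1-Pro": {
        "SearchGround IoU": (30.52, 34.17),
        "SearchSeg gIoU": (54.56, 39.17),
        "SearchSeg cIoU": (38.76, 32.41),
        "SearchVQA accuracy": (63.82, 42.24),
    },
    "OneThinker-8B": {
        "SearchGround IoU": (32.78, 34.17),
        "SearchSeg gIoU": (35.60, 39.17),
        "SearchSeg cIoU": (24.46, 32.41),
        "SearchVQA accuracy": (28.26, 42.24),
    },
}


def gaps(table):
    """Proposed minus baseline, in points, rounded to two decimals."""
    return {
        model: {metric: round(proposed - baseline, 2)
                for metric, (baseline, proposed) in metrics.items()}
        for model, metrics in table.items()
    }


deltas = gaps(results)
for model, metrics in deltas.items():
    for metric, delta in metrics.items():
        print(f"{model:15s} {metric:20s} {delta:+.2f}")
```

The gaps show the pattern the summary describes: Pixel-Searcher leads both open-source baselines on every listed metric and leads Gemini-3.1-Pro on grounding IoU, but trails it by 15.39 gIoU points on segmentation and 21.58 points on VQA accuracy.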

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.12497.

Fig 1: Our Perception Deep Research extends grounded perception from visual-cue reference and

Fig 2: Overview of WebEyes generation and Pixel-Searcher inference.

Fig 3: Examples of WebEyes task views: Search-based Segmentation outputs a mask, Search-based

Fig 4: WebEyes Category

Fig 5: shows the construction process. WebEyes follows an object-first workflow: each annotated object

Limitations

The benchmark is small: 120 images and 473 objects, so category diversity is limited relative to mainstream grounding datasets.
The excerpt does not expose the exact train/val/test split, seed strategy, or full implementation details needed for perfect reproduction.
Most evidence is obtained through web search within a six-month window before annotation, so performance may depend on access to current or archived web content and the exact search API behavior.
The paper shows strong open-source gains, but the gap to some closed-source systems remains substantial on segmentation and VQA, especially in fine-grained semantic comparison.
The failure analysis suggests that the system still struggles with evidence acquisition and entity resolution, meaning the approach may degrade sharply on queries requiring weak, noisy, or missing web evidence.
The benchmark construction relies on human filtering after automated generation; this may bias the final set toward questions that are cleaner or more verifiable than naturally occurring user queries.

Open questions / follow-ons

How robust is Pixel-Searcher under search engine drift, paywalled sources, or evidence retrieved from different languages and regions?
Can the search-reason-resolve loop be made more reliable by explicit contradiction tracking or citation-aware evidence scoring?
Would larger gains come from better entity disambiguation, better visual instance binding, or better benchmark design that separates those failure modes more cleanly?
How well does the approach transfer to genuinely temporal queries where the correct answer changes after the benchmark’s six-month evidence window?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, the main relevance is methodological rather than directly security-specific: the paper shows how to turn a vague high-level query into a structured evidence search process and then bind the answer to a concrete visual instance. That is analogous to systems that must distinguish a real user’s intent from lookalike candidates or distractor content using external evidence rather than only local image features.

If you are designing visual bot detection, this paper is a reminder that the hard part may not be pixel classification or segmentation at all; it can be upstream identity resolution and evidence aggregation. The benchmark setup is also useful as an evaluation template for adversarial ambiguity: one could adapt the object-anchored, evidence-linked design to test whether a model can resist visually plausible but semantically wrong candidates under web-derived hints. More broadly, the failure analysis suggests that robust systems should instrument the search stage, not just the final classifier, because most errors arise before the pixel-level decision is made.
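One way to instrument the stages rather than only the final output is to attribute each failed sample to the earliest stage that went wrong, mirroring the three failure classes the paper reports. The sketch below is hypothetical: the function name, the log format, and both thresholds are assumptions, and the paper's own attribution procedure may differ.

```python
from collections import Counter


def failure_stage(entity_ok, box_iou, mask_iou, box_thr=0.5, mask_thr=0.5):
    """Label a sample by the earliest stage that failed.
    Thresholds are illustrative, not taken from the paper."""
    if not entity_ok:
        return "search/entity"
    if box_iou < box_thr:
        return "entity-correct region"
    if mask_iou < mask_thr:
        return "box-to-mask transfer"
    return "not a failure"


# Toy per-sample logs: (entity resolved correctly, box IoU, mask IoU).
logs = [
    (False, 0.10, 0.05),
    (False, 0.80, 0.75),
    (True, 0.20, 0.15),
    (True, 0.90, 0.30),
    (True, 0.90, 0.85),
]
breakdown = Counter(failure_stage(*row) for row in logs)
print(breakdown)
```

Checking entity correctness first means a lucky box on the wrong entity still counts as an upstream error, which is the distinction the paper's failure analysis relies on.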

Cite

@article{arxiv2605_12497,
  title={From Web to Pixels: Bringing Agentic Search into Visual Perception},
  author={Bokang Yang and Xinyi Sun and Kaituo Feng and Xingping Dong and Dongming Wu and Xiangyu Yue},
  journal={arXiv preprint arXiv:2605.12497},
  year={2026},
  url={https://arxiv.org/abs/2605.12497}
}
