AllSERP: Exhaustive Per-Element Enrichment of the Versatile AdSERP Dataset

Source: arXiv:2605.04949 · Published 2026-05-06 · By K. Andrew Edmonds

TL;DR

AllSERP is a dataset enrichment layer built on top of AdSERP (Latifzadeh et al., SIGIR 2025), the only public IR dataset combining eye gaze, cursor, scroll, pupil, and click telemetry against ground-truth screenshots and captured HTML from real commercial-intent Google SERPs. The core problem AllSERP addresses is that AdSERP's bounding boxes covered only ad surfaces, leaving roughly 84.5% of attributable clicks falling on untyped organic results, knowledge panels, People-Also-Ask widgets, image packs, and other SERP elements — making per-element behavioral analysis impossible on the existing corpus without re-collection.

AllSERP introduces a four-phase computer-vision and HTML-parsing pipeline that anchors pixel-accurate bounding boxes to the original SERP screenshots (rather than re-rendering stale HTML, which drifts 13–45 px in a 2026 browser), assigns semantic element types across 13 categories via an 8-tier HTML parser, fills inter-result Y-axis gaps via a midpoint-split heuristic, and attributes clicks via X+Y containment. The pipeline is validated structurally against the shipped AdSERP ad rectangles (0 disagreements across 38,250 classifications, mean IoU = 1.000). Click attribution under the gap-fill flavor covers 91.7% of the 2,776-trial corpus; the remaining 231 trials are flagged and excluded for right-rail, chrome, or missing-click reasons.

The release ships per-trial JSONs, a 37,142-row corpus CSV, a browser-based replay viewer for 147 curated trials, and a fully reproducible single-script pipeline under MIT license. A descriptive behavioral inventory across nine main-axis element types surfaces click-fixation dissociation (dd_top fixated 99.7% of the time but receives only 9.6% of clicks), position-decay in organic click rates (position 0 earns 39.5% of organic clicks; Spearman ρ = −0.624 in the organic-only flavor), and above-fold geometry versus click-outcome mismatches across element types. The dataset is a pre-AI-Overviews snapshot (collected 2022-12-16 to 2023-03-13) whose value as a behavioral baseline is argued to grow as post-SGE replications need a clean prior-era reference.

Key findings

AdSERP's shipped bounding boxes covered only ad surfaces, attributing just 15.5% of clicks; AllSERP's gap-fill flavor attributes 91.7% of trials' final clicks to a typed main-axis AOI, with 231 trials flagged as off-axis or missing (67 right-rail, 91 chrome/far-off-target, 73 no-click/pathological).
Phase C ad-vs-non-ad classification produces 0 disagreements across 38,250 individual classifications compared to the shipped AdSERP ad rectangles, with mean IoU = 1.000 and 0 of 26,590 organic bboxes overlapping any shipped ad rectangle.
dd_top (top-of-page display ads) achieves near-universal fixation (99.7% of dd_top AOIs fixated) but only 9.6% of attributed clicks, while organic results achieve 55.6% fixation and 79.1% of clicks — a sharper click-fixation dissociation than the prior ad-vs-organic aggregate permitted.
In the organic-only position flavor (Fig 3 left), position 0 earns 39.5% of organic clicks and click rate decays with Spearman ρ = −0.624; in the organic-hybrid flavor (Fig 3 right) where position 0 is the topmost main-axis card (often a dd_top), the click peak shifts to position 1 at 25.6% with ρ = −0.939.
Re-rendering saved 2022–2023 SERP HTML in a 2026 browser produces 13–45 px layout drift relative to original screenshots, motivating screenshot-anchored rather than DOM-anchored AOI extraction for downstream reuse.
Organic results appear above the fold in 97.3% of trials and capture 79.1% of attributed clicks; widgets and ads dominate above-fold geometry (dd_top above-fold in 57.0% of trials) but not click outcomes.
The corpus covers 2,775 processable trials (1 dropped for missing meta or fixations) producing 37,142 typed AOI rows under the gap-fill flavor, spanning 13 element types including organic, dd_top, native_ad, paa, image_pack, knowledge_panel, top_places, and others.
Native ads (native_ad) are fixated in 36.4% of AOIs and receive 5.9% of clicks; People-Also-Ask (paa) achieves 40.6% fixation and 1.7% of clicks, consistent with attention-without-commitment patterns across non-organic surfaces.
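
The fixation and click percentages above use different denominators: fixation is a share of each element type's AOIs, while click share is a fraction of all attributed clicks. A minimal sketch of that bookkeeping follows. The row keys `etype`, `fixated`, and `clicked` are hypothetical stand-ins, not the released CSV's actual column names, and the toy rows are invented for illustration.

```python
from collections import defaultdict

def element_inventory(rows):
    """Per-etype fixation rate (over AOIs) and click share (over all clicks)."""
    n_aois = defaultdict(int)
    n_fixated = defaultdict(int)
    n_clicked = defaultdict(int)
    for row in rows:
        etype = row["etype"]  # hypothetical key, not the released schema
        n_aois[etype] += 1
        n_fixated[etype] += int(row["fixated"])
        n_clicked[etype] += int(row["clicked"])
    total_clicks = sum(n_clicked.values())
    return {
        etype: {
            # denominator: AOIs of this element type
            "fixated_pct": 100.0 * n_fixated[etype] / n_aois[etype],
            # denominator: all attributed clicks in the corpus
            "click_share_pct": (100.0 * n_clicked[etype] / total_clicks
                                if total_clicks else 0.0),
        }
        for etype in n_aois
    }

# Toy rows, invented for illustration only.
rows = [
    {"etype": "dd_top", "fixated": 1, "clicked": 0},
    {"etype": "dd_top", "fixated": 1, "clicked": 1},
    {"etype": "organic", "fixated": 1, "clicked": 1},
    {"etype": "organic", "fixated": 0, "clicked": 0},
    {"etype": "organic", "fixated": 1, "clicked": 1},
    {"etype": "organic", "fixated": 1, "clicked": 1},
]
inventory = element_inventory(rows)
print(inventory["dd_top"])   # {'fixated_pct': 100.0, 'click_share_pct': 25.0}
print(inventory["organic"])  # {'fixated_pct': 75.0, 'click_share_pct': 75.0}
```

Keeping the two denominators separate is what lets an element type be fixated almost always yet account for a small share of clicks.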

Methodology — deep read

Threat model and assumptions: This is not a security paper but an IR/HCI dataset enrichment. The adversarial concern, if any, is methodological: the pipeline must not introduce geometric errors that corrupt downstream click or gaze attribution. The key assumption is that the original AdSERP screenshots are pixel-faithful records of what participants saw during the 2022–2023 collection sessions, and that all gaze and cursor signals are coordinate-registered to those screenshots. A secondary assumption is that saved SERP HTML, while layout-unstable under re-rendering, is structurally stable enough for element-type labeling via HTML parsing.

Data provenance and corpus: The underlying data is AdSERP (Latifzadeh, Gwizdka, Leiva, SIGIR 2025), available on Zenodo (zenodo.org/records/15236546) under CC-BY-4.0. It contains 2,776 trials of full-page SERP screenshots, captured HTML, 150 Hz Gazepoint eye tracking, evtrack mouse telemetry, scroll events, pupil signals, and click events, collected on commercial-intent Google SERPs between 2022-12-16 and 2023-03-13. AllSERP processes 2,775 of these (1 dropped for missing metadata or fixations). No new data collection is performed; all AllSERP outputs are derived from AdSERP assets. Labels are not from human annotators — they are pipeline-produced and validated structurally.

Phase A — CV card-span extraction: The pipeline applies per-row standard-deviation row-projection to the main SERP column of each screenshot, producing card spans (vertical extents of SERP result cards). Shipped AdSERP ad rectangles take precedence over CV-detected spans on overlap, preventing ad geometry from being overwritten. A composite-trigger height threshold identifies multi-item cards (image_pack, PAA, top_stories) and triggers inner subdivision. Right-rail dd_right cards are detected but routed off the main axis. The output of Phase A is a set of vertical card spans for the main column.
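
As a rough illustration of per-row standard-deviation row-projection, the sketch below marks a pixel row as "busy" when its grayscale standard deviation exceeds a threshold and groups consecutive busy rows into spans. The threshold, the minimum height, and the function name `card_spans` are assumptions made for this example; the paper's actual parameters, ad-rectangle precedence, and composite-card subdivision are not reproduced here.

```python
from statistics import pstdev

def card_spans(gray_rows, std_threshold=2.0, min_height=3):
    """Return half-open (y_top, y_bottom) spans of consecutive busy rows.

    gray_rows: list of pixel rows (lists of grayscale values) cropped to the
    main SERP column. A row counts as blank when its standard deviation is
    at or below std_threshold (an assumed value, not the paper's).
    """
    spans = []
    start = None
    for y, row in enumerate(gray_rows):
        busy = pstdev(row) > std_threshold
        if busy and start is None:
            start = y                      # a new span opens
        elif not busy and start is not None:
            if y - start >= min_height:    # drop specks shorter than min_height
                spans.append((start, y))
            start = None
    if start is not None and len(gray_rows) - start >= min_height:
        spans.append((start, len(gray_rows)))  # span running to the last row
    return spans

# Toy 15-row "screenshot": blank rows are uniform, text rows alternate.
blank = [255] * 8
text = [255, 0, 255, 0, 255, 0, 255, 0]
image = [blank] * 2 + [text] * 4 + [blank] * 3 + [text] * 5 + [blank]
print(card_spans(image))  # [(2, 6), (9, 14)]
```

On real screenshots the same idea would run on a decoded image array; pure lists are used here only to keep the example dependency-free.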

Phase B — HTML structural labeling: The saved SERP HTML for each trial is walked through an 8-tier priority chain to assign element-type labels. The chain proceeds: (1) heading text patterns, (2) structural class signatures, (3) data-attrid attributes, (4) fallback structural patterns, down to a chrome sweep for footer artifacts. This produces 13 element types: organic, dd_top, native_ad, dd_right, top_places, knowledge_panel, paa, image_pack, related_searches, pagination, other_widget, unknown_widget, and chrome. Phase B outputs labels only — no geometry — because re-rendered HTML coordinates do not match screenshot pixel coordinates.
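
A priority chain of this kind can be sketched as an ordered rule list in which the first matching tier wins. Everything concrete below is hypothetical: the node keys (`heading`, `classes`, `attrid`, `has_title_link`, `in_footer`), the match patterns, and the reduced number of tiers are illustrative only and are not taken from the paper's parser. Only the element-type names come from the list above.

```python
# Ordered (tier_name, predicate, etype) rules; earlier tiers take priority.
# All patterns are hypothetical placeholders, not the paper's signatures.
RULES = [
    ("heading_text",
     lambda n: "people also ask" in n.get("heading", "").lower(), "paa"),
    ("heading_text",
     lambda n: "related searches" in n.get("heading", "").lower(),
     "related_searches"),
    ("class_signature",
     lambda n: "ad-block" in n.get("classes", ()), "native_ad"),
    ("data_attrid",
     lambda n: n.get("attrid", "").startswith("kp:"), "knowledge_panel"),
    ("fallback_structure",
     lambda n: bool(n.get("has_title_link")), "organic"),
    ("chrome_sweep",
     lambda n: bool(n.get("in_footer")), "chrome"),
]

def label_node(node):
    """Return (etype, tier) for the first rule that matches the node."""
    for tier, matches, etype in RULES:
        if matches(node):
            return etype, tier
    return "unknown_widget", "no_match"

nodes = [
    {"heading": "People also ask", "has_title_link": True},
    {"has_title_link": True},
    {"in_footer": True},
    {},
]
for node in nodes:
    print(label_node(node))
```

The first toy node matches both the heading tier and the fallback tier; the chain resolves it to `paa` because the heading tier is checked first, which is the point of ordering the tiers.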

Phase C — Geometry-label binding and consistency validation: Labels from Phase B are bound to card spans from Phase A using document order as the binding axis, described as the 'only stable axis when HTML and screenshot don't share a coordinate space.' Ad identity is propagated by spatial overlap with shipped ad rectangles. The consistency validation checks that Phase C's ad classifications agree with the shipped AdSERP ad rectangles across all three ad etypes (dd_top, native_ad, dd_right): 38,250 individual classifications, 0 disagreements, mean IoU = 1.000. The paper explicitly frames this as an internal-consistency check, not independent-annotator validation — Phase C inherits ad identity from the same rectangles it is checked against. Non-ad partition (organic vs paa vs image_pack etc.) is validated only against HTML structure and visual spot-check on 147 curated trials.
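
The binding and the consistency check can be sketched as follows, assuming boxes are `(x0, y0, x1, y1)` tuples in screenshot pixels. The helper names and the mismatch handling (raising an error) are assumptions for illustration; the paper's summary above does not say how count mismatches between spans and labels are resolved.

```python
AD_ETYPES = ("dd_top", "native_ad", "dd_right")

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def bind_by_document_order(spans, labels):
    """Pair the i-th span (top to bottom) with the i-th label (HTML order)."""
    if len(spans) != len(labels):
        # Assumed behavior: refuse to guess when the counts disagree.
        raise ValueError("span/label count mismatch")
    ordered = sorted(spans, key=lambda box: box[1])
    return [{"bbox": box, "etype": etype} for box, etype in zip(ordered, labels)]

def count_ad_disagreements(aois, shipped_ads):
    """Count AOIs whose ad/non-ad label contradicts overlap with shipped ads."""
    disagreements = 0
    for aoi in aois:
        overlaps_ad = any(iou(aoi["bbox"], ad) > 0 for ad in shipped_ads)
        labeled_ad = aoi["etype"] in AD_ETYPES
        if overlaps_ad != labeled_ad:
            disagreements += 1
    return disagreements

# Toy trial: three spans (given out of order) and three labels in HTML order.
spans = [(0, 300, 600, 420), (0, 100, 600, 200), (0, 210, 600, 290)]
labels = ["dd_top", "organic", "organic"]
shipped_ads = [(0, 100, 600, 200)]

aois = bind_by_document_order(spans, labels)
print(aois[0])                                    # topmost span gets "dd_top"
print(iou(aois[0]["bbox"], shipped_ads[0]))       # 1.0
print(count_ad_disagreements(aois, shipped_ads))  # 0
```

Because ad identity here is inherited from the same rectangles it is compared against, a zero count only demonstrates internal consistency, matching the caveat stated above.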

Phase D — Gap-fill flavor (typed_gapfill): A midpoint-split heuristic extends adjacent organic bbox pairs so that every Y-coordinate in the main column belongs to some AOI. The upper organic bbox extends downward and the lower extends upward to their shared midpoint, clamped so organics never cross ad or widget boundaries. This produces the typed_gapfill flavor. The tight-bbox flavor (typed, Phases A–C only) is also released. Both flavors ship; the paper advises consumers to select based on their analytic question. Click attribution under typed_gapfill requires both X and Y containment within a main-axis AOI, with a ±5 px X / ±10 px Y tolerance fallback for link-padding clicks. Y-only attribution would incorrectly route right-rail and chrome clicks into main-axis AOIs; X containment refuses these.
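
A minimal sketch of the midpoint split, assuming each AOI is a dict with `etype`, `y0`, and `y1` keys (hypothetical names). This version only closes gaps between vertically adjacent organic pairs and leaves every other boundary untouched, which is one simple way to honor the clamp; the paper's exact clamping logic may differ.

```python
def gap_fill(aois):
    """Return copies of the AOIs with organic-organic vertical gaps closed.

    For each vertically adjacent pair of organic AOIs separated by a gap,
    the upper box grows down and the lower box grows up to the midpoint.
    Ads and widgets are never moved, so organics cannot cross into them.
    """
    out = [dict(a) for a in sorted(aois, key=lambda a: a["y0"])]
    for upper, lower in zip(out, out[1:]):
        both_organic = upper["etype"] == lower["etype"] == "organic"
        if both_organic and lower["y0"] > upper["y1"]:
            mid = (upper["y1"] + lower["y0"]) / 2
            upper["y1"] = mid
            lower["y0"] = mid
    return out

# Toy column: two organics with a 20 px gap, then a PAA widget.
tight = [
    {"etype": "organic", "y0": 100, "y1": 180},
    {"etype": "organic", "y0": 200, "y1": 280},
    {"etype": "paa", "y0": 300, "y1": 400},
]
filled = gap_fill(tight)
for aoi in filled:
    print(aoi["etype"], aoi["y0"], aoi["y1"])
# The organic pair now meets at y = 190; the gap before the PAA widget stays.
```

The input list is copied rather than mutated, so the tight-bbox and gap-fill flavors can be kept side by side, mirroring the dual-flavor release.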

Evaluation protocol and reproducibility: There is no machine learning training or held-out test set in the conventional sense — the pipeline is deterministic. Evaluation is (a) internal consistency against shipped ad rectangles (38,250 comparisons, reported in §3), (b) descriptive behavioral inventory across 9 main-axis element types (Table 1), and (c) position-decay analysis under two flavors (Fig 3, Spearman ρ). No statistical significance tests beyond Spearman correlation are reported. No cross-validation or adversarial evaluation is performed. Independent-annotator validation on the non-ad partition is explicitly named as future work. Full reproducibility is claimed: a single-script entry point (scripts/build_aois.py) regenerates per-trial JSONs from AdSERP screenshots and HTML; the pipeline, per-trial JSONs, corpus CSV, and replay viewer are released under MIT/CC-BY-4.0. A concrete end-to-end example: running build_aois.py --trial p010-b2-t6 processes that trial's screenshot through Phase A row-projection to detect card spans, Phase B HTML walking to assign types, Phase C document-order binding, and Phase D gap-fill extension, emitting a per-trial JSON with typed AOIs; the replay viewer at andyed.github.io renders colored AOI rectangles, numbered fixation circles, cursor polyline, and pupil/LF/HF sparklines for visual verification (Fig 2).
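
For readers who want to recompute a position-decay statistic on their own slice of the corpus, Spearman's ρ is the Pearson correlation of the rank-transformed values. The dependency-free sketch below uses average ranks for ties; the click rates in the toy example are invented and are not the paper's Fig 3 values.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over a run of tied values
        shared = (i + j) / 2 + 1        # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation of the average ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy data: click rate falls strictly with position, so rho is -1.
positions = [0, 1, 2, 3, 4]
click_rates = [0.40, 0.22, 0.15, 0.12, 0.08]
rho = spearman_rho(positions, click_rates)
print(round(rho, 3))  # -1.0
```

Because ρ depends only on rank order, the two position flavors can yield different values from the same trials simply by changing which card counts as position 0.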

Technical innovations

Screenshot-anchored AOI extraction via CV row-projection, deliberately bypassing re-rendering of saved SERP HTML (which drifts 13–45 px in a 2026 browser) and instead anchoring geometry to the same pixel space as the recorded gaze and cursor streams — prior AdSERP work extracted AOIs from the live DOM at collection time, a route unavailable for downstream reuse.
8-tier HTML priority chain for element-type labeling that decouples structural classification from geometric extraction, enabling stable semantic labels even when HTML layout coordinates are unreliable; prior AdSERP shipped only an ad-vs-non-ad split.
Midpoint-split typed_gapfill flavor that extends adjacent organic bboxes to a shared boundary so every Y-coordinate in the main column is attributable to an AOI, with X+Y containment click attribution that rejects right-rail and chrome clicks — prior analyses on AdSERP used Y-only or absolute-rank attribution that could not make this distinction.
Trial-level off-axis flag (is_main_axis_click) that categorizes 231 trials by failure mode (67 dd_right, 91 chrome/far-off-target, 73 no-click/pathological), providing a principled exclusion criterion that was absent from the original AdSERP release.
Dual-flavor release (typed tight-bbox and typed_gapfill) with explicit guidance on which flavor matches which analytic question (e.g., 'what does the top organic earn?' vs 'what does the topmost SERP slot earn?'), illustrated by the divergent Spearman ρ values in Fig 3 (−0.624 vs −0.939).
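
The X+Y containment rule and the off-axis flag described in this list can be sketched as below. The AOI keys (`x0`, `y0`, `x1`, `y1`) and the toy coordinates are hypothetical; the ±5 px X and ±10 px Y tolerances are the values stated in the methodology, applied here as a second pass only when strict containment fails. Taking the first match when several padded boxes contain the click is a simplification chosen for this example.

```python
def attribute_click(click, aois, tol_x=5, tol_y=10):
    """Return the main-axis AOI containing the click, or None if off-axis."""
    x, y = click

    def contains(aoi, pad_x, pad_y):
        return (aoi["x0"] - pad_x <= x <= aoi["x1"] + pad_x
                and aoi["y0"] - pad_y <= y <= aoi["y1"] + pad_y)

    # Pass 1: strict containment. Pass 2: padded fallback for link padding.
    for pad_x, pad_y in ((0, 0), (tol_x, tol_y)):
        for aoi in aois:
            if contains(aoi, pad_x, pad_y):
                return aoi
    return None

# Toy main column spanning x = 180..780 (coordinates invented).
main_axis = [
    {"etype": "dd_top", "x0": 180, "y0": 100, "x1": 780, "y1": 200},
    {"etype": "organic", "x0": 180, "y0": 220, "x1": 780, "y1": 320},
]

inside = attribute_click((400, 250), main_axis)       # strict hit
padded = attribute_click((783, 250), main_axis)       # 3 px right of the box
right_rail = attribute_click((1000, 250), main_axis)  # same Y, far X

print(inside["etype"], padded["etype"], right_rail)   # organic organic None
is_main_axis_click = right_rail is not None
print(is_main_axis_click)                             # False
```

The right-rail click shares its Y-coordinate with the organic result, so a Y-only rule would have attributed it; requiring X containment is what lets it be flagged instead.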

Datasets

AdSERP — 2,776 trials, full-page SERP screenshots + HTML + 150 Hz gaze + evtrack cursor + scroll + pupil + clicks — Latifzadeh, Gwizdka, Leiva, SIGIR 2025; Zenodo CC-BY-4.0 (zenodo.org/records/15236546)
AllSERP derived outputs — 37,142 typed AOI rows across 2,775 trials; per-trial JSONs (~19 MB); corpus CSV; cursor-approach feature file — released CC-BY-4.0 via attentional-foraging and approach-retreat GitHub repos

Baselines vs proposed

AdSERP shipped ad-only bboxes: click attribution coverage = 15.5% of attributable clicks vs AllSERP typed_gapfill: 91.7% of corpus trials' final clicks attributed to a main-axis AOI
AdSERP ad-vs-organic aggregate: per-element fixation coverage not separately reported vs AllSERP per-element inventory: 99.7% of dd_top AOIs fixated, 9.6% of attributed clicks on dd_top
AdSight four-slot taxonomy (prior work on AdSERP): 4 slot types vs AllSERP: 9 main-axis element types across 37,142 AOI rows
Organic-only position flavor (Fig 3 left): position-0 click rate = 39.5%, Spearman ρ = −0.624 vs organic-hybrid flavor (Fig 3 right): position-1 click rate = 25.6% (peak), Spearman ρ = −0.939

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.04949.

Fig 1: The four-phase pipeline applied to a synthetic SERP. (A) CV row-projection on the main column produces card spans

Fig 2: Replay viewer rendering of trial p010-b2-t6 (viewer

Fig 3: Click rate by position under the two main flavors AllSERP releases. Left: organic-only flavor — position 0 is the topmost

Limitations

Forced-choice task design: AdSERP elicits exactly one click per query within a 1–2 minute window; the corpus does not capture query refinement, reformulation, pagination, abandonment, or multi-query sessions, so behavioral statistics reflect constrained-choice SERP evaluation rather than naturalistic free search.
Non-ad partition lacks independent-annotator validation: the organic/paa/image_pack/widget type assignments are checked only against the HTML structure that Phase B consumes and a visual spot-check on 147 curated trials; independent-annotator validation on a held-out subset is explicitly deferred to future work.
Phase D midpoint-split is a heuristic, not DOM-derived ground truth: inter-result gap sizes range 5–60 px, meaning misattribution is bounded but not zero; the paper ships both flavors as a workaround but does not quantify actual misattribution rate.
Right-rail coverage gap: approximately 1% of trials show a right-rail click with no shipped rectangle; this is inherited from AdSERP and not resolved by AllSERP, which tracks only shipped dd_right rectangles for off-axis classification.
Pre-AI-Overviews snapshot only (collection ended 2023-03-13, about a week before Bard's public launch and more than a year before broad AI Overviews deployment in May 2024): results do not generalize to current Google SERPs with AI answer cards, which would require an additional ai_overview etype and would likely show substantively different click-fixation dynamics.
Snippet-TTR rank confound: the released content-feature file carries a noted caveat that snippet text-type ratio is confounded with rank position; consumers must partial out this confound before making per-position content claims — the paper flags this but does not resolve it.
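
One simple way to partial out rank before making per-position content claims is to regress the content feature on rank and analyze the residuals. The sketch below uses ordinary least squares with a single linear term, which is an assumption chosen for illustration; the paper does not prescribe a method, and a non-linear rank effect would need a richer model. The feature values are invented.

```python
def residualize(values, ranks):
    """Residuals of `values` after removing a linear trend in `ranks`."""
    n = len(values)
    mean_rank = sum(ranks) / n
    mean_value = sum(values) / n
    sxx = sum((r - mean_rank) ** 2 for r in ranks)
    sxy = sum((r - mean_rank) * (v - mean_value)
              for r, v in zip(ranks, values))
    slope = sxy / sxx
    intercept = mean_value - slope * mean_rank
    return [v - (intercept + slope * r) for r, v in zip(ranks, values)]

# Toy data: a snippet feature that drifts downward with rank (invented).
ranks = [0, 1, 2, 3, 4]
snippet_feature = [0.80, 0.74, 0.71, 0.62, 0.60]
residuals = residualize(snippet_feature, ranks)

# By construction the residuals have no remaining linear relation to rank.
linear_leftover = sum(r * e for r, e in zip(ranks, residuals))
print(abs(linear_leftover) < 1e-9)  # True
```

Downstream comparisons would then use `residuals` in place of the raw feature, so that any remaining association with clicks is not simply a restatement of rank.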

Open questions / follow-ons

How do per-element click-fixation dissociation patterns (e.g., dd_top fixated 99.7% but clicked 9.6%) change after AI Overviews rollout, where an AI answer card occupies the topmost slot? A post-May-2024 replication using Phase B with an ai_overview etype extension is the natural next step.
Can cursor approach-retreat episode geometry computed per typed AOI (rather than per rank) distinguish decision hesitation patterns that differ by element type — e.g., does approach-retreat before an organic result have different kinematic signatures than approach-retreat before a PAA widget?
What is the actual misattribution rate of the midpoint-split gap-fill heuristic relative to DOM-derived ground truth? A small-N study collecting live DOM coordinates alongside screenshots on matched queries could bound this empirically.
Can pupillometric cognitive-load signals (LF/HF, RIPA2) conditioned on AllSERP element types reveal differential cognitive effort across SERP surface types (organic inspection vs knowledge-panel scan vs ad-card glance), and does that effort profile predict click outcomes within element type?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, AllSERP's primary relevance is as a ground-truth behavioral substrate for distinguishing human SERP interaction from bot-simulated interaction at element-type granularity. The corpus provides 150 Hz gaze, evtrack cursor telemetry, scroll, and pupil signals from real humans performing commercial-intent search tasks, now labeled at the per-element level (organic results, ads, PAA widgets, knowledge panels, etc.). A bot-defense engineer could use the typed AOI geometry and the cursor-approach feature file to characterize what human approach-retreat kinematics look like specifically when interacting with organic results versus ads, a level of conditioning that was previously unavailable. The click-fixation dissociation findings (e.g., 99.7% of dd_top AOIs fixated but only 9.6% of attributed clicks) are directly relevant to bot detection: a bot script that clicks top-of-page ads at rates inconsistent with this human baseline would be anomalous in a way that the prior ad-vs-organic aggregate could not sharply characterize.

The limitations matter equally for practitioners: the corpus uses a forced-choice task design (one click per query, 1–2 minute window) rather than naturalistic search, which means the human behavioral distribution may not match real user populations on production SERPs. The pre-AI-Overviews snapshot (collected from 2022-12-16 to 2023-03-13) means any model trained on this data is calibrated to a SERP layout that no longer exists at scale, reducing direct applicability to current Google SERPs. That said, the pipeline is era-agnostic, and the methodology for producing per-element behavioral baselines from screenshot-anchored CV and HTML parsing is transferable: a practitioner running a CAPTCHA product that renders search-like interfaces or harvests SERP interaction signals could adapt the four-phase pipeline to their own corpus, with only Phase B's label set requiring modification for new surface types.

Cite

bibtex

@article{arxiv2605_04949,
  title={AllSERP: Exhaustive Per-Element Enrichment of the Versatile AdSERP Dataset},
  author={K. Andrew Edmonds},
  journal={arXiv preprint arXiv:2605.04949},
  year={2026},
  url={https://arxiv.org/abs/2605.04949}
}
