SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

Source: arXiv:2605.19219 · Published 2026-05-19 · By Han Li, Vibhor Malik, Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ailin Fan et al.

TL;DR

The paper addresses the challenge of costly and slow A/B testing in e-commerce, where diverting real traffic to experimental variants can degrade user experience and require weeks for statistical significance. It presents SimGym, a unified framework that uses vision-language model (VLM) powered synthetic browser agents to simulate shopping sessions on live storefront variants and predict user behavioral shifts. SimGym integrates a traffic-grounded persona generation pipeline that derives diverse buyer archetypes and shopping intents from real clickstream data, a multimodal agent architecture that combines visual and structured webpage inputs with memory and action guardrails, and an evaluation protocol comparing simulated outcome shifts with observed results from real A/B tests. Validated on 50 real-world A/B theme-change experiments from a major e-commerce platform covering 16 countries and 11 product categories, SimGym achieves 77% directional alignment and 0.55 Pearson correlation with human add-to-cart (A2C) shifts, while reducing test cycle times from weeks to minutes. This demonstrates that synthetic agents can reliably pre-test UI changes with minimal risk exposure to real users.

Key findings

SimGym attains 77% directional alignment and 0.55 Pearson correlation with observed human add-to-cart (A2C) shift directions and magnitudes across 50 real storefront A/B tests.
Using a vision-enabled Gemini 3 Flash model improves alignment from 70% to 77% and correlation from 0.49 to 0.55 compared to text-only agent input, showing the value of visual multimodal perception.
Open-source language models (GPT-OSS) with text-only input achieve 59% alignment and 0.41 correlation, indicating reasonable though lower predictive validity without visual context.
Full buyer persona input (archetype plus shopping intent) is critical: reducing the input to shopping intent alone drops alignment to 44% and correlation to near zero, and product-only input yields near-chance alignment (51%) and negative correlation (-0.14).
Removing episodic session memory drops directional alignment from 77% to 42% and correlation to zero, due to incoherent navigation and frequent loops.
Agent simulation runtime per shop is low: about 5.3 minutes for 600 agents with the vision-enabled Gemini 3 Flash model.
Directional and magnitude predictive validity stabilize at around 300 agents per shop, with diminishing returns beyond that.
Shopping persona population construction is fully automated from historical clickstream data using clustering, product preference extraction, and profile composition to reflect real user heterogeneity per merchant.

Threat model

The framework assumes an analyst or merchant wishing to pre-test storefront UI experiments without deploying variants to real users due to cost and risk of exposure. No explicit adversarial agent attacks or manipulations are considered; synthetic agents model normal buyer behavior grounded in historical traffic. The threat model excludes active attackers trying to deceive or exploit the system.

Methodology — deep read

The paper presents an end-to-end framework to simulate e-commerce A/B tests using synthetic agents powered by vision-language models acting in live browsers. The intended user is a merchant or analyst who wants to pre-assess storefront UI changes without diverting real traffic or risking degraded user experience. Adversaries are not discussed; the synthetic agents model normal buyer behavior based on observed data.

Data provenance includes proprietary production clickstream data from a major e-commerce platform, spanning 50 curated storefront pairs across 16 countries and 11 product categories. Historical clickstream sessions (disjoint from A/B test periods) are used to generate buyer persona archetypes and shopping intents, which ground agents in realistic heterogeneity.

Persona generation follows a six-stage pipeline: clustering user sessions by behavioral features; extracting product preferences per cluster using GPT-5; generating structured shopping intents (product targets plus purchase decision guides) that condition browsing behavior; aggregating buyer behavioral patterns; building multi-dimensional buyer archetypes across behavioral and value metrics; and composing final personas that merge archetypes and intents into agent prompts.
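The final composition stage can be pictured as merging an archetype record with a shopping intent into a single prompt string. The sketch below is illustrative only: the class names `Archetype` and `ShoppingIntent`, their fields, the function `compose_persona`, and the prompt wording are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Archetype:
    # Hypothetical fields; the summary only says archetypes are
    # multi-dimensional profiles over behavioral and value metrics.
    name: str
    traits: List[str]


@dataclass
class ShoppingIntent:
    # A product target plus a purchase decision guide.
    product_target: str
    decision_guide: str


def compose_persona(archetype: Archetype, intent: ShoppingIntent) -> str:
    """Merge an archetype and an intent into one agent prompt (illustrative)."""
    traits = "; ".join(archetype.traits)
    return (
        f"You are a shopper of type '{archetype.name}' ({traits}). "
        f"You are looking for: {intent.product_target}. "
        f"Decide whether to add to cart using this guide: {intent.decision_guide}"
    )


persona = compose_persona(
    Archetype("deal seeker", ["price sensitive", "compares options"]),
    ShoppingIntent("running shoes", "buy only if reviews are visible"),
)
print(persona)
```

Keeping the archetype and the intent as separate records makes the reported ablations (intent only, product only) easy to express as different prompt builders over the same data.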

SimGym’s agent architecture consists of a live-browser VLM-powered agent interacting in an observe-plan-act loop with multimodal inputs: DOM-derived accessibility trees (semantic webpage structure) and rendered page screenshots provide complementary perception. The VLM plans actions and termination decisions conditioned on shopping goals, personas, episodic session memory (all past observations, actions, and outcomes), and execution guardrails that prevent loops and handle errors. Actions include clicks, scrolls, and navigation executed via a browser controller.
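A rough sketch of such an observe-plan-act loop follows, with the VLM planner and the browser replaced by plain Python callables. Everything named here (`Step`, `Session`, `run_session`, the `"stop"` action, the repeat threshold) is an assumption made for illustration; the paper's actual guardrail rules and action format are not given in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    observation: str
    action: str
    outcome: str


@dataclass
class Session:
    persona: str
    memory: List[Step] = field(default_factory=list)  # episodic session memory


def run_session(
    session: Session,
    observe: Callable[[], str],
    plan: Callable[[str, str, List[Step]], str],
    act: Callable[[str], str],
    max_steps: int = 20,
    max_repeats: int = 3,
) -> str:
    """Loop until the planner stops, a guardrail trips, or steps run out."""
    for _ in range(max_steps):
        obs = observe()
        action = plan(session.persona, obs, session.memory)
        if action == "stop":
            return "terminated"
        # Loop guardrail: same action on the same observation too many times.
        repeats = sum(
            1 for s in session.memory
            if s.action == action and s.observation == obs
        )
        if repeats >= max_repeats:
            return "loop_guardrail"
        try:
            outcome = act(action)
        except Exception as exc:  # error guardrail: record the failure, keep going
            outcome = f"error: {exc}"
        session.memory.append(Step(obs, action, outcome))
    return "step_limit"


# Stub environment standing in for a live browser and a VLM planner.
state = {"page": "home"}

def observe() -> str:
    return state["page"]

def plan(persona: str, obs: str, memory: List[Step]) -> str:
    if obs == "home":
        return "click:product"
    if not any(s.action == "click:add_to_cart" for s in memory):
        return "click:add_to_cart"
    return "stop"

def act(action: str) -> str:
    if action == "click:product":
        state["page"] = "product"
    return "ok"

session = Session(persona="deal seeker looking for running shoes")
status = run_session(session, observe, plan, act)
print(status, [s.action for s in session.memory])
```

Passing the full memory list to the planner on every step mirrors the description above; without it, the stub planner could not tell that it had already added the item to the cart.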

Training details of the LLM/VLM agents are not extensively described, as commercial models (Gemini 3 Flash) and open-source GPT variants are used through APIs or internal clusters. The core innovation is in prompt engineering, memory management, and system integration.

Evaluation uses a curated "golden" dataset of 50 real theme-change A/B test pairs with validated human add-to-cart shifts as ground truth. The protocol excludes confounding experiments with promotions, pricing or assortment shifts. Synthetic agents simulate sessions (600 agents per shop, repeated twice) on both control and treatment variants. Evaluation metrics include directional alignment percentage (agreement in sign of A2C shift) and Pearson correlation of magnitudes between synthetic and real outcome shifts, with 95% confidence intervals.
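The two metrics are simple to state in code. The sketch below assumes one predicted and one observed A2C shift per shop; the function names and the toy numbers are made up, and counting a zero shift as a disagreement is an assumption rather than the paper's stated rule. Confidence intervals are omitted.

```python
import math
from typing import Sequence


def directional_alignment(pred: Sequence[float], real: Sequence[float]) -> float:
    """Fraction of shops where predicted and observed shifts share a sign.

    Assumption: a shift of exactly zero on either side counts as disagreement.
    """
    agree = sum(1 for p, r in zip(pred, real) if p * r > 0)
    return agree / len(pred)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy per-shop shifts (not data from the paper).
pred = [0.02, -0.01, 0.03, -0.02]
real = [0.01, -0.02, -0.01, -0.03]

print(directional_alignment(pred, real))  # 3 of 4 signs agree
print(pearson(pred, real))
```

Directional alignment ignores magnitude, which is why the summary reports it alongside Pearson correlation: an agent population can get the sign right on most shops while still misjudging how large each shift is.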

Ablation studies test the effect of persona input (full archetype plus intent vs intent only vs product only) and episodic memory access (with vs without). Removing buyer archetypes or session memory significantly degrades performance, confirming their importance for coherent multi-step shopping simulation.

Sensitivity analysis using bootstrap resampling determines that 300 agents per shop is sufficient to stabilize predictive metrics with low variance. The system runs simulations in minutes per shop, enabling rapid iteration.
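One way to picture this kind of sensitivity analysis is to resample agents with replacement at different population sizes and watch the spread of the estimated shift narrow. The sketch below is a generic bootstrap on made-up binary add-to-cart outcomes for a single shop; the function names, the 10th to 90th percentile band, and the outcome rates are illustrative assumptions, and the paper's exact resampling procedure may differ.

```python
import random
from typing import List, Tuple


def a2c_shift(control: List[int], treatment: List[int]) -> float:
    """Difference in add-to-cart rate, treatment minus control."""
    return sum(treatment) / len(treatment) - sum(control) / len(control)


def bootstrap_band(
    control: List[int],
    treatment: List[int],
    n_agents: int,
    n_boot: int = 500,
    seed: int = 0,
) -> Tuple[float, float]:
    """10th and 90th percentile of the shift when resampling n_agents per arm."""
    rng = random.Random(seed)
    shifts = []
    for _ in range(n_boot):
        c = rng.choices(control, k=n_agents)    # sample with replacement
        t = rng.choices(treatment, k=n_agents)
        shifts.append(a2c_shift(c, t))
    shifts.sort()
    return shifts[int(0.10 * (n_boot - 1))], shifts[int(0.90 * (n_boot - 1))]


# Made-up outcomes: 1 = agent added to cart, 0 = did not.
control = [1] * 60 + [0] * 540      # 10% A2C rate
treatment = [1] * 90 + [0] * 510    # 15% A2C rate

for n in (50, 300):
    lo, hi = bootstrap_band(control, treatment, n_agents=n)
    print(n, round(lo, 3), round(hi, 3), "width", round(hi - lo, 3))
```

The band width shrinks roughly with the square root of the agent count, which is consistent with the diminishing returns the summary reports beyond about 300 agents per shop.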

The evaluation does not include adversarial attacks or distribution shift tests beyond the selected shop set. Code release and weights are not mentioned explicitly, suggesting closed source. The dataset is proprietary.

One end-to-end example: a synthetic agent with a full persona is initialized for a shop, assigned a shopping goal and archetype parameters derived from clustering. The agent observes the live storefront page both visually and via DOM, reasons with the VLM about next steps conditioned on browsing memory, clicks product images or cart buttons through the browser controller, and iterates until session termination. Results aggregate add-to-cart outcomes across agents to compute predicted A2C shift, which is then compared to the real human-measured shift from the paired A/B test.
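The aggregation step at the end of this example can be sketched as follows. The session record format (`added_to_cart` flag) and the function names are hypothetical, and the summary does not say whether the paper reports the shift as an absolute difference or a relative lift, so both are computed here.

```python
from typing import Dict, List


def a2c_rate(sessions: List[Dict[str, bool]]) -> float:
    """Share of agent sessions that ended with an add-to-cart."""
    return sum(1 for s in sessions if s["added_to_cart"]) / len(sessions)


def predicted_shift(
    control: List[Dict[str, bool]], treatment: List[Dict[str, bool]]
) -> Dict[str, float]:
    """Aggregate agent outcomes on both variants into a predicted A2C shift."""
    c, t = a2c_rate(control), a2c_rate(treatment)
    return {
        "control": c,
        "treatment": t,
        "absolute": t - c,
        # Relative lift is undefined when no control agent added to cart.
        "relative": (t - c) / c if c else float("nan"),
    }


# Toy sessions: 2 of 10 control agents and 3 of 10 treatment agents add to cart.
control = [{"added_to_cart": i < 2} for i in range(10)]
treatment = [{"added_to_cart": i < 3} for i in range(10)]

shift = predicted_shift(control, treatment)
observed_human_shift = 0.015  # made-up value standing in for the real A/B result
same_direction = shift["absolute"] * observed_human_shift > 0
print(shift, same_direction)
```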

Technical innovations

A traffic-grounded buyer persona generation pipeline that automatically extracts realistic buyer archetypes and shopping intents from merchant-specific production clickstream data.
A multimodal vision-language model (VLM) agent architecture combining DOM-based semantic webpage representation and visual screenshots for perception, integrated with episodic memory and execution guardrails to support coherent multi-step shopping in live browser environments.
An end-to-end synthetic A/B testing framework producing predictive behavioral shift metrics by simulating buyer agents on paired control and treatment storefront variants, validated against observed real human A/B outcomes.
A novel evaluation protocol comparing simulated and observed real buyer add-to-cart (A2C) outcome shifts, focusing on directional alignment and Pearson correlation as predictive validity metrics across diverse shops.

Datasets

Golden A/B Test Storefront Dataset — 50 paired control-treatment storefront variants from a major e-commerce platform — proprietary Shopify production data

Baselines vs proposed

Gemini 3 Flash (vision enabled): alignment = 77%, correlation = 0.55 vs text-only Gemini 3 Flash: alignment = 70%, correlation = 0.49
Gemini 3 Flash (text only): alignment = 70%, correlation = 0.49 vs GPT-OSS (text only): alignment = 59%, correlation = 0.41
Full Persona input: alignment = 77%, correlation = 0.55 vs Shopping Intent Only: alignment = 44%, correlation ≈ 0
Full Persona input: alignment = 77%, correlation = 0.55 vs Product Only: alignment = 51%, correlation = -0.14
With Episodic Session Memory: alignment = 77%, correlation = 0.55 vs No Memory: alignment = 42%, correlation = 0

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.19219.

Fig 1

Fig 1: SimGym framework overview.

Fig 2

Fig 2: Dataset distribution of the 50-storefront golden set spanning 16 countries and 11 industries.

Fig 3

Fig 3: Human-agent agreement in A2C shifts. Each panel plots human-observed versus agent-

Fig 4

Fig 4: Agent sample-size sensitivity on the 50-shop golden set. Shaded bands denote the 10th–90th

Fig 5

Fig 5 (page 8).

Fig 6

Fig 6 (page 9).

Fig 7

Fig 7 (page 9).

Fig 5

Fig 5: Buyer Archetype Construction Framework.

Limitations

Evaluation is conducted only on a single major e-commerce platform; generalization to other merchants or industries is unproven.
The framework focuses on add-to-cart (A2C) shifts as the primary behavioral metric, without evaluation on downstream conversion or revenue impacts.
No adversarial robustness or distribution shift tests are reported, limiting insight into synthetic agent behavior under unexpected interface changes.
Model training details and reproducibility claims, including code and weights, are not provided, limiting external verification.
Persona generation relies on clustering and GPT-5 based extraction, which may inherit biases or be sensitive to input noise.
No analysis of potential error modes from simulated user heterogeneity or multi-agent interaction interference is discussed.

Open questions / follow-ons

How well does SimGym generalize to other e-commerce platforms with different storefront architectures, product assortments, or traffic characteristics?
Can the framework extend beyond add-to-cart metrics to predict downstream conversion rates, order values, or user retention?
What is the impact of further aligning agent behavior with human traces via post-training or reinforcement learning on predictive validity?
How robust are the synthetic agents to distributional shifts, unusual promotional campaigns, or adversarial UI manipulations?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, SimGym is a concrete example of synthetic, multimodal agents emulating complex user interactions in a web environment, calibrated against real traffic data. The use of episodic memory, multimodal perception combining DOM and visual cues, and behavioral persona conditioning highlights design patterns that make automation realistic. Understanding how these agents predict human response shifts to UI modifications can inform CAPTCHA systems that aim to distinguish genuine users from bots by modeling expected user behaviors and detecting deviations. SimGym's reliance on guardrails and error handling also shows practical ways to keep agents reliable when interacting with dynamic web content. While focused on e-commerce experimentation, the techniques could inspire improved automated user simulators for CAPTCHA evaluation or adversarial bot detection systems. However, the approach requires substantial behavioral history and integration with live browser environments, which poses challenges in generalized bot-risk scenarios where such ground truth is less available.

Cite

bibtex

@article{arxiv2605_19219,
  title={SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents},
  author={Han Li and Vibhor Malik and Zahra Zanjani Foumani and Alberto Castelo and Shuang Xie and Ailin Fan and Keat Yang Koay and Yuanzheng Zhu and Meysam Feghhi and Ronie Uliana and Zhaoyu Zhang and Angelo Ocana Martins and Mingyu Zhao and Francis Pelland and Jonathan Faerman and Nikolas LeBlanc and Aaron Glazer and Andrew McNamara and Zhong Wu and Lingyun Wang},
  journal={arXiv preprint arXiv:2605.19219},
  year={2026},
  url={https://arxiv.org/abs/2605.19219}
}
