SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

Source: arXiv:2605.14205 · Published 2026-05-14 · By Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ted Chaiwachirasak, Han Li, Lingyun Wang

TL;DR

SimPersona addresses a limitation of large language model (LLM)-based e-commerce agents: they typically simulate buyers as a single "average" persona and so fail to capture the diversity and distribution of real buyer behaviors. The paper introduces a framework that learns discrete, behaviorally meaningful buyer personas from raw clickstream data using a behavior-aware vector-quantized variational autoencoder (VQ-VAE). Each buyer is assigned a discrete persona token, which is added to the LLM vocabulary and used to condition fine-tuned agents, so personas can be assigned at scale without prompt engineering or per-store retraining. Evaluated on 8.37 million buyers across 42 held-out live storefronts, SimPersona aligns closely with real-world buyer conversion rates (78.7% stratified action rate alignment), produces interpretable behavioral variation across personas, and outperforms a baseline model with 8× more parameters on goal-oriented shopping tasks.

Key innovations include the combination of discrete persona discovery and LLM grounding through a two-stage fine-tuning process, enabling robust transfer to unseen stores without adaptation. The authors also provide an open-source data pipeline to process raw noisy e-commerce logs into buyer representations and agent training traces. Overall, SimPersona offers a scalable, behaviorally faithful approach to simulating realistic multi-persona e-commerce buyer populations beyond brittle handcrafted prompt personas.

Key findings

SimPersona achieves 78.7% stratified action rate alignment (ARA) with real buyers across funnel strata A–D compared to 57.4% alignment for mismatch baselines (Table 2).
The behavior-aware VQ-VAE reduces engagement incoherence within clusters to <0.5%, versus 66.7% under MiniBatch k-means, indicating semantically coherent personas (Table 1).
Persona tokens induce strong behavioral separation with Cohen's d=2.80 for purchase intensity and significant t-test p-values <10^-14 (Table 3).
SimPersona outperforms the GPT-OSS-120B baseline on instruction-following shopping tasks, with cart and checkout success rates higher by 9.7pp and 23.8pp respectively (Table 4).
Two-stage fine-tuning (token embedding warmup plus full fine-tuning) is essential to learn meaningful persona token grounding, preventing shortcut learning (Section 2.3 and Appendix F).
SimPersona generalizes to 42 unseen live storefronts with no per-store adaptation, assigning personas via a single forward encoder pass (Section 3.2 and Figure 7).
The exploration behavioral axis did not transfer reliably to agent behavior, possibly due to label skew and simulation floor effects (Table 3).
An open-source data pipeline converts raw event logs and clickstreams into semantically enriched buyer-session vectors and agent training traces.
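For readers less familiar with the effect size quoted above, the following is a minimal sketch of Cohen's d using the standard pooled-standard-deviation form. The function name and the sample values are illustrative assumptions; they are not data from the paper, and the paper may compute the statistic with a different variant.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Hypothetical per-session purchase-action counts for two persona groups.
# These numbers are made up for illustration only.
high_intensity = [4, 5, 6, 5, 4]
low_intensity = [1, 2, 1, 2, 2]

d = cohens_d(high_intensity, low_intensity)
print(f"Cohen's d = {d:.2f}")
```

By the usual rule of thumb, values of d above 0.8 are read as large effects, which is why a reported d of 2.80 indicates strong separation between persona groups.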

Threat model

n/a. The paper focuses on modeling and simulating buyer personas rather than on adversarial threat scenarios, and it considers no explicit adversarial capabilities or attacks. The work assumes benign historical clickstream data from real buyers, used to learn behaviorally meaningful clusters and to condition agents.

Methodology — deep read

Threat Model & Assumptions: The adversary is not directly modeled, as this work focuses on buyer persona discovery and agent conditioning rather than adversarial attack or defense. The assumed scenario is a platform with raw clickstream logs from buyers interacting with storefronts; the aim is to model heterogeneous real buyer behaviors faithfully for simulation, not to withstand adversarial evasion.
Data: The authors collected raw event-level clickstream logs (page views, searches, cart and checkout actions) from production e-commerce systems across multiple shops. The evaluation data comprise 8.37 million unique buyers across 42 held-out storefronts; the training subset comprises 44,559 buyer-shop pairs sampled from 39 shops and balanced across the five buyer funnel strata A–E (purchasers to bouncers), with each buyer assigned to a stratum by purchase funnel stage. The data were enriched by joining with product catalog metadata, and product titles were embedded into 768-d vectors reduced to 128-d by PCA. Behavioral features included 16 scalar engagement and funnel metrics. The data were split 85/15 into train and validation sets. Preprocessing aggregated sessions into buyer-session vectors and converted browsing sessions into multi-turn supervised fine-tuning (SFT) agent traces using the GPT-5 and Gemini 3 Flash LLMs.
Architecture / Algorithm: SimPersona's persona discovery uses a behavior-aware VQ-VAE with a learned codebook of K discrete embeddings. The encoder maps 403-d buyer-session vectors through linear + ReLU + dropout layers to a latent vector z_e ∈ R^D, which is quantized via nearest-neighbor lookup to a discrete codebook entry e_k. The decoder reconstructs the buyer features so that behavioral and product-preference signals are preserved. The VQ-VAE loss combines a reconstruction loss (group-aware, to balance scalars and embeddings), a vector-quantization commitment loss, a three-stage contrastive InfoNCE loss on funnel strata and behavioral similarity, and three auxiliary cross-entropy losses predicting coarse bins (low/medium/high) for engagement, exploration, and purchase intensity. Together these terms encourage semantically meaningful, behaviorally interpretable persona clusters that capture latent buying behavior.

The learned discrete codes are directly mapped to newly introduced special persona tokens in the vocabulary of a Qwen3-14B-Base LLM agent. This enables conditioning the LLM on compact persona tokens to drive behavior.
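The quantization step and the code-to-token mapping can be illustrated with a short sketch. It assumes squared Euclidean distance for the nearest-neighbor lookup (the usual VQ-VAE choice) and a hypothetical token naming scheme `<persona_k>`; the toy 2-d codebook stands in for the learned K × D codebook, and none of these specifics are taken from the paper.

```python
def quantize(z_e, codebook):
    """Return (k, e_k): the index and entry of the codebook vector nearest to z_e."""
    def sq_dist(u, v):
        # Squared L2 distance; assumed here, as in standard VQ-VAE lookups.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    k = min(range(len(codebook)), key=lambda i: sq_dist(z_e, codebook[i]))
    return k, codebook[k]


def persona_token(k):
    """Map a codebook index to a special-token string (hypothetical naming)."""
    return f"<persona_{k}>"


# Toy codebook with K = 3 entries in D = 2 dimensions (illustrative values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# An encoder output z_e for one buyer (illustrative values).
k, e_k = quantize([0.9, 1.2], codebook)
token = persona_token(k)
print(k, e_k, token)
```

Because assignment is just an encoder pass followed by this lookup, assigning personas to new buyers needs no LLM call and no retraining, which is consistent with the scalability claim in the text.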

Training regime: The VQ-VAE and auxiliary heads were trained on the balanced 44,559 buyer-session dataset. Codebook entries were initialized by k-means++ and updated via exponential moving averages, and dead embeddings were reinitialized from random samples. The LLM was fine-tuned in two stages: (a) a persona grounding phase, with the backbone frozen and only the persona token embeddings trained on intent-neutral short product-interest goals, so that the model learns what each persona token means as a behavior distribution; and (b) an action-oriented phase, with the backbone fully unfrozen and explicit buy/browse goal intents, so that the model learns how to act on the persona token and the session goal together. The fine-tuning used a small session corpus (~3,600 sessions) disjoint from the evaluation stores to promote generalization. Hyperparameters and seeds are detailed in the appendices; inference scales to millions of buyers via a single encoder forward pass.
Evaluation protocol: Conversion alignment metrics compared agent-simulated actions (add-to-cart, checkout) to matching buyer cohorts grouped by persona token and funnel stratum, computing alignment scores as 1 minus the absolute difference in rates, then averaging. Baselines include mismatched token pairings and a much larger GPT-OSS-120B model. Behavioral fidelity was tested by grouping simulated sessions by persona auxiliary label (engagement/purchase bins) and applying t-tests, Cohen’s d, Kruskal-Wallis, and permutation tests to the simulated action distributions. Instruction following was tested on 70 deterministic shopping tasks sampled from real buyer sessions, each repeated 10 times, measuring success rates and step counts. The method was evaluated on 42 fully held-out storefronts unseen during training. Ablations compared single-stage with two-stage fine-tuning.
Reproducibility: The authors release an open-source data pipeline converting raw event logs into buyer-session vectors and agent traces. The VQ-VAE training and token grounding code are integrated with this pipeline. However, the raw clickstream data and model weights are not fully public, limiting complete reproducibility. Detailed appendices with hyperparameters and training protocols are provided. The approach depends on the proprietary GPT-5 and Gemini 3 Flash LLMs for key steps, which may limit exact replication. Overall, the method is described thoroughly enough for partial reproduction with accessible components.
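The alignment score described under the evaluation protocol (1 minus the absolute difference between real and simulated rates, averaged) can be sketched as follows. The cohort keys, the function name, and the rates are illustrative assumptions; the paper's exact stratification and weighting may differ.

```python
def action_rate_alignment(real_rates, sim_rates):
    """Average of 1 - |real - simulated| over the cohort keys present in both inputs."""
    keys = sorted(real_rates.keys() & sim_rates.keys())
    if not keys:
        raise ValueError("no shared cohorts to compare")
    scores = [1.0 - abs(real_rates[k] - sim_rates[k]) for k in keys]
    return sum(scores) / len(scores)


# Hypothetical action rates keyed by (funnel stratum, action).
# These values are made up for illustration and are not from the paper.
real = {
    ("A", "add_to_cart"): 0.80,
    ("A", "checkout"): 0.60,
    ("B", "add_to_cart"): 0.40,
    ("B", "checkout"): 0.00,
}
sim = {
    ("A", "add_to_cart"): 0.70,
    ("A", "checkout"): 0.50,
    ("B", "add_to_cart"): 0.55,
    ("B", "checkout"): 0.05,
}

ara = action_rate_alignment(real, sim)
print(f"ARA = {ara:.3f}")
```

Under this definition a perfectly matched agent scores 1.0, and a mismatch baseline (an agent conditioned on the wrong persona token) would be scored with the same function against the same real cohorts.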

Example end-to-end flow: A buyer’s raw clickstream data is enriched with product embeddings, converted into a 403-d feature vector, encoded by a trained VQ-VAE into a latent code corresponding to one of K discrete personas, which maps to a persona token. This token is fed to the fine-tuned LLM agent along with a browsing intent prompt, conditioning it to simulate behavior matching that buyer type on a live storefront, producing actions and reasoning traces. The agent’s simulated conversion rates and behavioral patterns closely match those of real buyers with that persona code, validating the approach.
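The training regime above mentions exponential-moving-average (EMA) codebook updates and reinitialization of dead embeddings. The sketch below shows the generic form of both steps in plain Python. The decay value, the dead-code threshold, the epsilon smoothing, and all function names are assumptions for illustration; the paper's actual settings are in its appendices and are not reproduced here.

```python
import random


def ema_update(codebook, cluster_size, ema_sum, batch, assignments,
               decay=0.99, eps=1e-5):
    """One in-place EMA codebook update from a batch of encoder outputs."""
    K, D = len(codebook), len(codebook[0])
    counts = [0] * K
    sums = [[0.0] * D for _ in range(K)]
    # Accumulate per-code counts and sums of the vectors assigned to each code.
    for z, k in zip(batch, assignments):
        counts[k] += 1
        for j in range(D):
            sums[k][j] += z[j]
    for k in range(K):
        # Smooth the usage count and the summed vectors, then renormalize.
        cluster_size[k] = decay * cluster_size[k] + (1 - decay) * counts[k]
        for j in range(D):
            ema_sum[k][j] = decay * ema_sum[k][j] + (1 - decay) * sums[k][j]
        codebook[k] = [s / (cluster_size[k] + eps) for s in ema_sum[k]]


def reinit_dead(codebook, cluster_size, ema_sum, batch,
                threshold=1e-3, rng=random):
    """Replace codes whose smoothed usage fell below threshold with a random batch sample."""
    for k in range(len(codebook)):
        if cluster_size[k] < threshold:
            z = list(rng.choice(batch))
            codebook[k] = z
            cluster_size[k] = 1.0
            ema_sum[k] = list(z)


# Toy usage with K = 2 codes in D = 2 dimensions (illustrative values).
codebook = [[0.0, 0.0], [5.0, 5.0]]
cluster_size = [1.0, 1.0]
ema_sum = [[0.0, 0.0], [5.0, 5.0]]
batch = [[2.0, 2.0], [2.0, 2.0]]
ema_update(codebook, cluster_size, ema_sum, batch, assignments=[0, 0])
reinit_dead(codebook, cluster_size, ema_sum, batch)
```

The design point is that codes which stop receiving assignments see their smoothed count decay toward zero, so reseeding them from real encoder outputs keeps the whole codebook in use.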

Technical innovations

Behavior-aware VQ-VAE that jointly optimizes reconstruction, commitment, three-stage contrastive loss, and auxiliary behavioral classification heads to learn semantically meaningful discrete buyer personas from raw clickstreams, unlike geometric clustering methods.
Two-stage supervised fine-tuning framework that first freezes the LLM backbone to ground persona token embeddings with intent-neutral prompts, then fully fine-tunes with explicit session intent to robustly fuse persona and task signals, preventing shortcut learning.
Direct integration of learned discrete persona codes as new special tokens in the LLM vocabulary enabling compact, scalable, single-pass persona assignment and conditioning without expensive prompt engineering or retraining.
Open-source end-to-end pipeline converting raw fragmented production e-commerce logs into semantically enriched buyer-session vectors and multi-turn LLM training traces for grounded agent simulation.

Datasets

Balanced buyer-session dataset — 44,559 buyer-shop pairs from 39 shops — proprietary production logs, balanced across funnel strata A–E for training and validation
Evaluation dataset — 8.37 million unique buyers across 42 held-out live storefronts — proprietary production logs

Baselines vs proposed

MiniBatch k-means (K=256): engagement incoherence = 66.7% vs VQ-VAE: 0.5%
MiniBatch k-means: stratum purity = 78.9% vs VQ-VAE: 84.5% (Table 1)
SimPersona token-aligned agents: stratified Action Rate Alignment (ARA) = 78.7% vs mismatch baselines: 57.4% (Table 2)
SimPersona vs GPT-OSS-120B on instruction-following: cart add success 91.4% vs 81.7%, checkout success 76.9% vs 53.1% (Table 4)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.14205.

Fig 1: SIMPERSONA framework overview. Top-left: behavioral features and product embeddings …

Fig 2: End-to-end data pipeline described in Section 2.1.

Fig 3: Data enrichment. Raw event-level tables are joined with the product catalog, collection …

Fig 4: VQ-VAE input construction for a single buyer–shop pair. Top: product embeddings are …

Fig 5: SFT trace generation from enriched clickstreams. Row 1: the enriched session record …

Fig 6: Stratum distribution recovery across all 42 storefronts. Solid bars show the real funnel-stage …

Fig 7: Store-level behavioral reconstruction from persona token distributions. The codebook …

Fig 8: Per-shop error-rate comparison between two-stage and single-stage SFT (sorted by two-…

Limitations

Exploration-axis behavioral distinctions do not reliably transfer to simulated agents, possibly due to skewed label distributions and a simulation setup that removes near-zero exploration sessions.
Evaluation is limited to interaction alignment and instruction following; no adversarial or robustness tests against malicious buyers or spoofed clickstreams are reported.
Dependency on powerful proprietary LLMs (GPT-5, Gemini 3 Flash) for data enrichment and agent simulation limits reproducibility and accessibility.
Data and models are not fully public; raw clickstream datasets remain private, limiting external verification.
Codebook size K and other hyperparameters are fixed; impact of different settings and scalability limits not fully explored.
Simulation fidelity depends on the quality of the reconstructed navigation traces; errors or artifacts in rewriting or replaying sessions may affect agent training.

Open questions / follow-ons

Can the exploration behavioral axis be better captured or leveraged in simulation by improved labeling or alternative metrics?
How well do SimPersona agents generalize under distribution shifts, e.g., new product types, seasonal patterns, or emerging buyer behaviors?
What are the robustness limits when agents encounter adversarial or manipulated clickstream data during persona assignment or simulation?
Can the approach be extended to multi-modal buyer signals beyond clickstreams, such as reviews, ratings, or social interactions?

Why it matters for bot defense

Bot-defense and CAPTCHA practitioners aiming to detect or simulate human behavior on e-commerce sites can draw valuable insights from SimPersona’s discrete persona conditioning. By learning compact, statistically grounded behavioral clusters from raw clickstreams, defenders could differentiate between genuine heterogeneous buyer types and superficial or scripted bots that lack such distributional fidelity. The integration of discrete persona tokens into an LLM agent provides a scalable mechanism for generating realistic multi-persona simulations that reflect real population mixtures, improving the realism of synthetic traffic used for anomaly detection or end-to-end testing.

However, challenges remain for bot-detection applications, including understanding how adversarial bots might mimic or subvert such behavioral clusters, and how to assign personas robustly from noisy or manipulated trace data. Persona-aware detection thresholds, or defenses that adapt dynamically to population shifts, could be an interesting research direction inspired by this framework. Overall, SimPersona offers a grounded approach to buyer persona simulation that could inform next-generation behavioral modeling in bot defense systems.

Cite

@article{arxiv2605_14205,
  title={SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents},
  author={Zahra Zanjani Foumani and Alberto Castelo and Shuang Xie and Ted Chaiwachirasak and Han Li and Lingyun Wang},
  journal={arXiv preprint arXiv:2605.14205},
  year={2026},
  url={https://arxiv.org/abs/2605.14205}
}
