Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

Source: arXiv:2606.11118 · Published 2026-06-09 · By Rahul Roy, Nur Sunar, Jayashankar M. Swaminathan

TL;DR

This paper addresses the problem of dynamic assortment optimization on two-sided online platforms where both customer and seller preference parameters are initially unknown. Unlike classical one-sided assortment problems that only model customer preferences, this work studies the realistic setting where customers propose transactions to a subset (assortment) of sellers offered by the platform, and sellers then decide whether to accept proposals after a fixed number of periods. Both customer and seller choices are modeled by multinomial logit (MNL) models with unknown parameters, creating a complex two-way learning challenge with delayed feedback on payoffs. The authors formulate this as a dynamic two-way learning (TWL) problem and design a novel upper confidence bound (UCB)-based algorithm that simultaneously learns parameters of both sides' models while optimizing assortment selection over time. Their main theoretical result proves the algorithm achieves a worst-case regret growing polylogarithmically in the time horizon T, a significant improvement over prior one-sided learning approaches that incurred regret scaling on order square root of T. They also establish matching lower bounds showing this regret rate is optimal. Extensive simulations validate the approach across scenarios and demonstrate superiority over algorithms adapted from prior one-way learning methods. Extensions show robustness to varying platform information sharing and reward dependence on customer types. Overall this is the first work to rigorously address dynamic assortment with unknown two-sided preferences and delayed multi-period matching decisions in online platforms.

Key findings

The proposed UCB-like algorithm achieves worst-case regret bounded by O(log²(NT)), where N is number of sellers and T is time horizon, representing polylogarithmic growth (Theorem 1).
This regret bound substantially improves over prior one-sided learning results with regret scaling as ~O(√T) (Agrawal et al. 2019, Aznag et al. 2021).
A matching lower bound of Ω(log²(NT)) is proved for the TWL problem, establishing rate optimality of the algorithm (Theorem 2).
Simulation studies show the algorithm outperforms an adapted version of Agrawal et al.'s UCB algorithm designed for one-way learning, across all tested problem instances.
Increasing the maximum assortment size B beyond a threshold yields negligible improvement in total platform reward, indicating diminishing returns on assortment size.
The algorithm's performance remains robust regardless of how much customer information is shared with sellers (Theorem 3).
The algorithm also remains effective whether platform rewards depend on customer types or not (Theorem 4).

Threat model

The adversary is essentially the uncertainty in unknown and heterogeneous customer and seller preference parameters governing their MNL choice models. These parameters and the stochastic arrival sequence of customers are initially unknown and must be learned online. The adversary does not actively strategize or manipulate platform decisions. The platform receives only indirect feedback delayed by epochs, complicating learning.

Methodology — deep read

Threat Model & Assumptions: The platform manager controls which subset of sellers (assortment) to display to each arriving customer but starts with no knowledge of the preferences of customers or sellers. Adversaries are not explicitly considered, but the problem models uncertainty in customer type arrivals and unknown multinomial logit (MNL) choice parameters for both sides. Feedback on seller decisions only occurs after fixed-length epochs, creating delayed and partial observability.
Data: The model assumes discrete time horizon T split into M epochs of K periods each, with N sellers and multiple customer types C. Customer arrivals are stochastic with unknown sequence, and customer types are revealed upon arrival but unknown beforehand. The platform observes for each period the offered assortment, customer choice (proposal), and at epoch end, seller acceptances. The MNL choice probabilities for customers and sellers depend on unknown parameter matrices V (customer preferences) and U (seller preferences).
Architecture / Algorithm: The policy is a UCB-based online dynamic assortment algorithm tailored to the two-way learning problem. It maintains confidence bounds on unknown MNL parameters for customers and sellers, updates estimates based on observed customer proposals and seller acceptances after each epoch, and selects assortments balancing exploration of uncertain preferences and exploitation of estimated optima. The algorithm leverages the structure of multinomial logit models and the delayed feedback framework to prune the search space and update parameter estimates.
Training / Operation Regime: The horizon T and epoch length K define the update intervals. The algorithm iteratively offers assortments each period and updates parameters at epoch ends when seller choices are observed. The paper does not specify hardware or random seed details as it focuses on theoretical guarantees and simulations.
Evaluation Protocol: Performance is evaluated by the regret metric, comparing expected cumulative reward of the algorithm against a clairvoyant policy that knows all customer and seller MNL parameters and the customer arrival sequence in advance. The regret upper and lower bounds are theoretically derived (Theorems 1 & 2). Simulations compare the algorithm to a baseline adapted from Agrawal et al. (2019)'s one-sided UCB method across multiple scenarios varying assortment size, number of sellers, and customer heterogeneity.
Reproducibility: The paper mentions simulation studies but does not specify code or data releases. The model relies on synthetic problem instances parameterized by MNL preference matrices. Detailed enough mathematical descriptions allow reproduction of the algorithm and experiments, although no explicit code repository is cited.

Example End-to-End: Consider a period t in epoch m where a customer of unknown type arrives. The algorithm selects an assortment St based on current parameter estimates and confidence bounds. The customer, according to the unknown customer MNL parameters V, proposes to a seller in St or chooses outside option. The proposal updates the proposal matrix C(t). At epoch end, sellers choose which proposal to accept according to unknown seller MNL parameters U based on counts of proposals received. Seller choices yield rewards observed by the platform, which updates parameter confidence estimates and refines future assortment decisions. This repeat interaction allows two-way learning and convergence toward optimal assortments minimizing regret.

Technical innovations

Formulation of dynamic assortment selection under two-sided incomplete information with unknown MNL parameters for both customers and sellers, creating a fundamentally new two-way learning problem.
Development of a novel UCB-based online learning algorithm that simultaneously estimates parameters of both sides’ MNL models amid delayed, epoch-wise feedback.
Proof that the new algorithm achieves polylogarithmic worst-case regret O(log²(NT)) vs prior one-sided learning algorithms with Õ(√T) regret, establishing a substantial theoretical improvement.
Establishing matching lower bounds that show this regret rate is rate-optimal, characterizing the fundamental difficulty of two-sided dynamic assortment learning.
Demonstrating the algorithm's robustness to information sharing variations and reward dependence, supporting its applicability to realistic platform scenarios.

Baselines vs proposed

Adapted Agrawal et al. (2019) one-sided UCB algorithm: significantly higher regret and lower total reward performance compared to proposed two-way learning algorithm across all tested problem instances.
Proposed algorithm worst-case regret: O(log²(NT)) vs one-sided algorithms regret ~O(√T) in dynamic assortment settings.

Limitations

Theoretical analysis assumes multinomial logit choice models for both customers and sellers, which may limit applicability to settings with more complex preference structures.
Simulation evaluation is conducted on synthetic data; real-world datasets or platform deployment results are not provided.
Algorithm requires knowledge of number of sellers N, maximum assortment size B, and epoch length K as fixed parameters.
Delayed feedback via fixed-length epochs creates practical challenges that may complicate applicability to platforms with more fluid or asynchronous seller decisions.
The model does not explicitly consider strategic behavior or adversarial manipulation by participants.

Open questions / follow-ons

How would the approach extend to more sophisticated or non-MNL choice models capturing richer preference or strategic interactions?
Can the algorithm be adapted to continuous customer arrival streams and asynchronous, real-time seller decisions rather than fixed-length epochs?
What are the effects of incorporating strategic or adversarial participants who may misreport or manipulate proposals?
How would empirical performance hold on real-world platform data, including effects of noise, missing data, or nonstationarity in preferences over time?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners in online platforms, the paper’s contributions provide insights into optimizing user-facing selections when two-sided preferences are unknown and learned over time with delayed feedback. While CAPTCHA primarily focuses on differentiating humans from bots, understanding dynamic assortment and interaction models between users and providers helps frame sophisticated multi-agent learning scenarios on a platform. The two-way learning approach illuminates challenges when platform managers only receive delayed signals, analogous to delayed bot detection feedback, suggesting the importance of algorithms that carefully balance exploration and exploitation in adversarial or uncertain multi-actor environments. Recognizing that standard one-sided learning assumptions can lead to suboptimal decisions underscores the necessity of designing defenses and decision systems robust to complex, interacting unknown behaviors. However, the model assumes rational MNL-based behavior rather than adversarial attackers, so direct application to bot detection requires extensions incorporating adversarial dynamics. Still, the regret minimization framework and the handling of incomplete, delayed signals could inspire new designs for CAPTCHA triggering policies or trust scoring on multi-sided services.

Cite

bibtex

@article{arxiv2606_11118,
  title={ Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides },
  author={ Rahul Roy and Nur Sunar and Jayashankar M. Swaminathan },
  journal={arXiv preprint arXiv:2606.11118},
  year={ 2026 },
  url={https://arxiv.org/abs/2606.11118}
}

Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Baselines vs proposed ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​