Regret Minimization with Adaptive Opponents in Repeated Games
Source: arXiv:2606.06486 · Published 2026-06-04 · By Mingyang Liu, Asuman Ozdaglar, Tiancheng Yu, Kaiqing Zhang
TL;DR
This paper addresses regret minimization in repeated games against adaptive opponents -- those who respond based on the entire history of play. The classical external regret framework, which treats losses as fixed sequences, cannot capture such opponent adaptivity, causing convergence to suboptimal equilibria. To solve this, the authors introduce a new metric, Repeated Policy Regret (RP-Regret), which measures regret relative to the best-in-hindsight strategy accounting for opponents' adaptive responses. RP-Regret generalizes prior notions by allowing dynamic comparator strategies and adaptive opponents, making it more native to repeated-game settings. The paper identifies necessary conditions for sublinear RP-Regret, including bounds on comparator variation and imperfect recall (finite memory with exponentially decaying dependence on past histories). To handle the inherently non-convex optimization, they propose three algorithmic approaches: (i) a non-convex optimization oracle; (ii) minimizing a convex, linearized surrogate called Local RP-Regret via projected gradient descent; and (iii) reformulating the repeated game as a Markov game with occupancy-measure-based convex optimization when opponents change slowly. Crucially, minimizing RP-Regret or its surrogate leads players to subgame perfect equilibria with improved cooperation, demonstrated experimentally on games like Stag-Hunt. The work advances the practical learning of equilibria in repeated games involving adaptive strategic opponents, beyond classical regret frameworks.
Key findings
- RP-Regret accounts for opponents' adaptive responses to history, unlike classical external regret, and can identify cooperative equilibria (e.g., tit-for-tat in Iterated Prisoner's Dilemma) yielding higher utilities (0.6 vs 0.2 utility).
- Necessary conditions for sublinear RP-Regret include the comparator's strategy variation growth as O(T^p) for p<1, and players exhibiting imperfect recall modeled as exponential decay memory (Condition 3).
- Under these conditions, a non-convex optimization oracle can achieve sublinear RP-Regret (proof in Appendix G).
- Local RP-Regret (LRP-Regret), a convex surrogate measuring one-step deviations, can be minimized by projected gradient descent with regret bound Rlocal_T / T ≤ Õ(|A|^(m+1) sqrt(PT/T) + C_γ^m), where m=O(log 1/ε) history length, PT comparator variation.
- Reformulating repeated games as Markov games with bounded memory allows occupancy-measure linear programming methods to minimize RP-Regret under slowly changing opponent strategies (also satisfying Condition 1).
- All proposed methods converge to equilibria with improved social welfare (higher utility), outperforming classical no-regret algorithms which converge to low-utility defect-defect equilibria.
- Occupancy measure constraints were designed to prevent error accumulation and ensure the regret-minimizer’s strategy matches empirical opponent behavior at each timestep.
- Experimental validation on Stag-Hunt shows regret minimization per RP-Regret leads to more cooperative, higher reward outcomes.
Threat model
Adaptive opponents with full capability to respond strategically and condition their actions on the entire history of play. They have bounded recall (imperfect memory modeled as exponential decay), but can adapt within that memory window arbitrarily. The players cannot control or predict opponents' exact strategies but know that deviations induce responses. Opponents cannot have unbounded perfect recall or change arbitrarily fast beyond bounded variation constraints. The regret minimizer aims to learn strategies minimizing regret accounting for such adaptivity.
Methodology — deep read
Threat Model and Assumptions: The adversary consists of adaptive opponents in repeated games who choose strategies responsive to the entire observed history of play. The players know that deviations will trigger opponent responses. Unlike classical no-regret learning assuming static or oblivious opponents, this work models opponents as strategic and history-dependent with finite memory exhibiting exponential decay recall (Condition 3). Comparator strategies are allowed to vary over time but with sublinear total variation (Condition 1). Opponents also satisfy bounded-memory assumptions to ensure tractability.
Data and Preprocessing: The setting is fully theoretical without external datasets. The authors test their algorithms empirically on classical repeated games like Iterated Prisoner's Dilemma and Stag-Hunt, which have well-known payoff matrices and standard utility structures. No data preprocessing is needed.
Architecture / Algorithm:
- RP-Regret metric compares the cumulative expected loss of the actual joint strategies to the minimum loss achievable by a dynamic comparator strategy that can adapt to history and opponent responses.
- RP-Regret is non-convex because the expected loss depends on products of past strategies and distributions over histories.
- To overcome this, three methods are proposed: i) Oracle-Based: Assuming access to a non-convex optimization oracle, one can minimize RP-Regret directly if Conditions 1 and 3 hold. ii) Local RP-Regret: Introduce a convex surrogate that considers only local (one-step) deviations from current strategies. Expected losses become linear in the deviation, enabling projected gradient descent updates. iii) Markov Game Reformulation: Model the repeated game with M-bounded memory as a Markov game. Use occupancy measures as decision variables to convexify the expected time-average loss, incorporating online convex optimization with time-varying constraints to ensure the regulated players’ strategy marginals align with observed opponent strategies and prevent error accumulation.
Training Regime: The algorithms operate over T rounds/iterations of repeated game play. The Local RP-Regret uses learning rates η=Θ(√(PT/T)) and history window m=O(log(1/ε)) to guarantee regret bounds. No seed strategy is discussed as these are online, adaptive optimization algorithms rather than fixed training runs.
Evaluation Protocol: Metrics include average RP-Regret, Local RP-Regret, total expected loss, and resulting utilities at equilibrium. Baselines include classical external regret minimization algorithms. Experimental slices highlight differences in equilibrium reached (defect-defect vs tit-for-tat in IPD). Theoretical convergence rates for regret are provided with order notation including dependencies on history length and action space size. No fold-based cross-validation is relevant due to the theoretical nature.
Reproducibility: Code and exact implementation details are not specified in the paper. The theoretical results appear sufficiently detailed to allow reproduction by expert readers. Experiments use standard repeated games to illustrate results rather than specialized datasets needing access or privacy concerns.
Example End-to-End: Consider Iterated Prisoner's Dilemma with both players minimizing classical external regret – the outcome converges to defect-defect with 0.2 utility. The paper shows that tit-for-tat achieves linear external regret but sublinear RP-Regret. Using the Local RP-Regret algorithm with projection updates on strategy distributions conditioned on recent history, players gradually adjust to tit-for-tat-like strategies, maintaining cooperation, and converge to equilibria with utilities around 0.6. This demonstrates RP-Regret's ability to identify adaptive cooperative equilibria overlooked by classical regret approaches.
Technical innovations
- Definition of Repeated Policy Regret (RP-Regret) capturing regret against adaptive, history-dependent opponents with dynamic comparator strategies in repeated games, addressing limitations of external and policy regret notions.
- Identification of necessary conditions for sublinear RP-Regret including bounded comparator variation (Condition 1) and imperfect recall modeled as exponential decay memory (Condition 3), characterizing fundamental learning limits.
- Development of Local RP-Regret, a convex linearized surrogate of RP-Regret enabling efficient projected gradient algorithms with formal regret bounds.
- Reformulation of repeated games with bounded memory as Markov games and use of occupancy-measure-based online convex optimization with novel time-varying constraints to directly minimize RP-Regret under slowly changing opponents.
Baselines vs proposed
- Classical external regret minimization: Iterated Prisoner's Dilemma average utility = 0.2 (defect-defect equilibrium) vs RP-Regret minimization (e.g., tit-for-tat) utility = 0.6
- Local RP-Regret projected gradient descent: regret bound Rlocal_T / T ≤ Õ(|A|^{m+1} sqrt(PT / T) + C_γ^m) vs no guarantee for classical external regret methods in adaptive opponent setting
- Markov game occupancy measure algorithm: convergence to subgame perfect equilibria under bounded opponent variation vs no convergence guarantees for classical approaches in adaptive repeated games
Limitations
- Minimization of RP-Regret is non-convex and thus computationally intractable in general without access to a non-convex optimization oracle.
- The local surrogate (Local RP-Regret) scales exponentially in the history window length m, limiting practical applicability to games with very large action spaces or memory requirements.
- The Markov game approach requires opponents to change strategies slowly (Condition 1 applied to opponents), which may not hold in highly volatile settings.
- The theoretical analysis focuses on bounded-memory strategies with exponential decay assumptions, which may not capture all real opponent behaviors.
- Experimental evaluations are limited to canonical small repeated games (e.g., Stag-Hunt, IPD); scalability and performance on larger real-world strategic interactions remain to be studied.
- The approach requires players to have some control or knowledge of their memory decay parameters and comparator variation bounds, which can be challenging in practice.
Open questions / follow-ons
- Can algorithms for minimizing RP-Regret be scaled and adapted efficiently for large action/state spaces and longer memory without exponential blowup?
- How robust are RP-Regret minimization methods against fully adaptive opponents who may violate the slow-change or bounded memory assumptions?
- What are the implications of RP-Regret minimization in multi-agent settings beyond two-player, e.g., large N-player games with complex interaction networks?
- Can RP-Regret concepts be extended to stochastic or partially observable repeated games with incomplete information?
Why it matters for bot defense
Bot-defense and CAPTCHA systems often face adversaries capable of adaptive, history-dependent strategies that traditional external regret models cannot adequately capture. This paper’s novel regret framework (RP-Regret) and algorithms provide a rigorous way to model and defend against such adaptive adversaries over repeated interactions. By accounting for adversaries’ memory and adaptivity, defense strategies can better anticipate and minimize exploitability by bots that condition actions on prior system responses. The occupancy measure-based approach and local linearized regret minimization offer promising algorithmic tools for designing adaptive challenge-response policies that converge to equilibria resilient against dynamic attack behaviors. However, the computational costs and assumptions about adversary memory dynamics require careful consideration when transferring to large-scale practical bot defense. Overall, the paper advances the theoretical foundation for learning equilibria under strategic, adaptive attacks, a key challenge in CAPTCHA and bot mitigation.
Cite
@article{arxiv2606_06486,
title={ Regret Minimization with Adaptive Opponents in Repeated Games },
author={ Mingyang Liu and Asuman Ozdaglar and Tiancheng Yu and Kaiqing Zhang },
journal={arXiv preprint arXiv:2606.06486},
year={ 2026 },
url={https://arxiv.org/abs/2606.06486}
}