Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
Source: arXiv:2605.08053 · Published 2026-05-08 · By Gugan Thoppe, L. A. Prashanth, Ankur Naskar, Sanjay Bhat
TL;DR
This paper addresses a foundational gap in risk-sensitive reinforcement learning: the absence of principled, model-free, value-based (Q-learning-style) algorithms for optimizing exponential utility in discounted infinite-horizon MDPs. Prior work on exponential utility in MDPs was largely confined to the average-reward setting or relied on policy gradient methods that lack guarantees of convergence to the optimal stationary policy. The authors build on the Bellman-type consistency equation from Porteus (1975), which imposes inter-stage consistency to avoid time-inconsistency, and derive two distinct Q-value-style operators — T (operating in log-space, an L∞ contraction) and F (operating in the positive orthant, a contraction in the Thompson/sup-log metric). They prove that the fixed points of these operators yield an optimal stationary policy for the exponential-utility objective, then construct two tabular model-free algorithms that converge to those fixed points.
The two proposed algorithms differ structurally in their timescale design. The first is a two-timescale algorithm targeting the fixed point of T: a fast auxiliary recursion tracks the inner conditional expectation (which appears inside a logarithm, preventing direct single-sample estimation), while a slow recursion updates the Q-estimates using a log-transform of the fast iterate. The second is a one-timescale algorithm directly targeting the fixed point of F, which admits unbiased single-sample stochastic estimates because the expectation appears linearly at the outermost level of F. Almost-sure convergence is established for both; finite-time rates of O(n^{-β/2}) are established for the two-timescale algorithm, and a scalar-only O(n^{-1/2}) rate (hiding log factors) is established for the one-timescale variant.
The convergence analysis for the one-timescale algorithm is the most technically novel contribution. Because F's contraction is in the sup-log metric, which is not compatible with additive stochastic approximation updates in the original coordinates, standard global contraction arguments fail. The authors instead prove stability via boundedness of iterates within a compact box K, then establish asymptotic stability of the limiting ODE using monotonicity, homogeneity (F(cx) = c^γ F(x)), and Dini-derivative comparison arguments against scaled fixed points — a technique that does not rely on a Lyapunov function in the usual sense.
Key findings
- The operator T (eq. 5-6), derived from the Porteus (1975) Bellman-type equation, is a γ-contraction in the L∞ norm on R^{SA}, enabling direct stochastic approximation via timescale separation.
- The operator F (eq. 2) is a γ-contraction in the sup-log (Thompson) metric d(x1,x2) = ||ln x1 - ln x2||∞ on R^{SA}_{++}, but this contraction is incompatible with additive updates in original coordinates.
- The two-timescale algorithm achieves finite-time error E||Q_n - Q*||∞ = O(n^{-β/2}) for step-sizes α_n = (n+1)^{-α}, β_n = (n+1)^{-β} with 1/2 < β < α < 1 in the i.i.d. generative-model setting; choosing β close to 1 yields rates approaching O(n^{-1/2}).
- The one-timescale scalar finite-time analysis yields E|x_n/x* - 1| = Õ(n^{-1/2}), but the full vector finite-time rate is left open due to the absence of a globally usable contraction in additive coordinates.
- In a two-state illustrative MDP (S={s, s_bad}, γ=0.9, θ=0.1, P(s_bad|s,risk)=0.01, r(s,risk)=1), the risk-neutral optimal policy selects 'risk' (Q*_rn(s,risk)≈0.917 vs Q*_rn(s,safe)≈0.826), while the exponential-utility operator T assigns Q*(s,risk)≈-47.59 vs Q*(s,safe)=0, reversing the policy to 'safe'.
- Greedy policies induced by the fixed points x* of F (equivalently Q* of T) are proved optimal among all stationary Markovian policies (including randomized ones) for the exponential-utility objective — the first such guarantee for tabular model-free algorithms in discounted MDPs.
- The two-timescale fast-tracking error satisfies E||g_n - exp(-θ/γ · T(Q_n))||∞ ≤ C_g · sqrt(β_n), establishing that the inner-expectation estimator converges at rate O(sqrt(β_n)).
- No experiments or simulation results are reported; all contributions are theoretical (convergence proofs, finite-time bounds, structural characterization).
Methodology — deep read
THREAT MODEL AND PROBLEM SETTING: This is a theoretical RL paper, not a security paper. The 'adversary' in the MDP sense is nature/environment stochasticity. The MDP M = (S, A, P, r, γ) is finite, tabular, and the agent has fixed risk-aversion parameter θ > 0 (or equivalently θ_hat > 0). The agent is model-free: it does not know P or r directly, only observing transitions. The Porteus (1975) formulation is adopted specifically because it enforces inter-stage consistency, yielding a well-defined fixed-point structure and stationary optimal policies — unlike the naive formulation that directly exponentiates the discounted return, which leads to time-inconsistency and non-stationary optimal policies.
DATA AND SAMPLING: There are no real-world datasets. The algorithms are analyzed under two sampling regimes. (1) Markovian sampling: a behavior policy π_n generates a trajectory (s_n, a_n, s_{n+1}) where s_{n+1} ~ P(·|s_n, a_n). Assumption 1 (for two-timescale) and Assumption 2 (for one-timescale) require uniform ergodicity of the induced Markov chain and continuity of the transition kernel as a function of the current iterate, uniformly over the bounded iterate set. (2) Generative model (i.i.d.): for finite-time results, (s_n, a_n) is drawn i.i.d. from a distribution ν with λ = min_{s,a} ν(s,a) > 0. This simplification isolates the nonlinear operator effects from Markovian noise complications.
ARCHITECTURE — OPERATOR DESIGN: Two Bellman-type operators are introduced. The T-operator (eq. 5) maps R^{SA} → R^{SA} via T(Q)(s,a) = r(s,a) - (γ/θ) ln[Σ_{s'} P(s'|s,a) exp(-θ max_{a'} Q(s',a'))], which is an L∞ γ-contraction. The F-operator (eq. 2) maps R^{SA}{++} → R^{SA} via F(x)(s,a) = exp(-θ/γ · r(s,a)) · [Σ_{s'} P(s'|s,a) (min_{a'} x(s',a'))^γ], a γ-contraction in the sup-log metric. The two are related by the conjugacy in eq. (7): F(x) = exp(-θ/γ · T(-γ/θ · ln x)), so their fixed points are related by Q* = -γ/θ · ln x*. The key novelty of F is that its conditional expectation appears linearly at the outermost level, enabling unbiased single-sample Monte Carlo estimation; in T, the expectation is inside a logarithm, preventing direct unbiased estimation from a single transition.
ALGORITHM 1 — TWO-TIMESCALE (eqs. 8-9): Q_{n+1} = Q_n + α_n[-γ/θ · ln g_n - Q_n]; g_{n+1} = g_n + β_n · e_{s_n,a_n}[G(Q_n, s_n, a_n, s_{n+1}) - g_n(s_n, a_n)], where G(Q,s,a,s') = exp(-θ/γ · r(s,a) - θ · max_{a'} Q(s',a')). The fast iterate g_n ∈ R^{SA}{++} tracks the conditional expectation E[exp(-θ max Q(s',a')) | s,a] pointwise. The slow iterate Q_n tracks Q* via the plug-in -γ/θ · ln g_n. Step-sizes satisfy Σ α_n = Σ β_n = ∞, Σ(α_n² + β_n²) < ∞, and α_n/β_n → 0 (fast timescale dominates).
ALGORITHM 2 — ONE-TIMESCALE (eqs. 10-11): x_{n+1} = x_n + α_n · e_{s_n,a_n}[F_hat_n - x_n(s_n, a_n)], where F_hat_n = exp(-θ/γ · r(s_n,a_n)) · (min_{a'} x_n(s_{n+1},a'))^γ. The sample F_hat_n is an unbiased estimate of F(x_n)(s_n,a_n) given (s_n, a_n, s_{n+1}). Step-sizes satisfy Σ α_n = ∞, Σ α_n² < ∞. After convergence, Q* is recovered as -γ/θ · ln x*.
CONVERGENCE ANALYSIS — TWO-TIMESCALE: The proof uses Borkar (2008)'s ODE method with timescale separation. The fast recursion is shown (via martingale arguments) to track g*(Q_n) = exp(-θ/γ · T(Q_n)) with tracking error ||e_n||∞ → 0 a.s. Substituting this into the slow recursion yields a perturbed ODE Q_dot = T(Q) - Q, whose unique globally asymptotically stable equilibrium is Q* (by γ-contraction of T in L∞). Finite-time rates use step-sizes α_n = (n+1)^{-α}, β_n = (n+1)^{-β} (1/2 < β < α < 1), yielding E||Q_n - Q*||∞ ≤ C_Q^(1) exp(-((1-γ)/(1-α)) n^{1-α}) + C_Q^(2) sqrt(β_n) = O(n^{-β/2}).
CONVERGENCE ANALYSIS — ONE-TIMESCALE: The compact invariant set K = [C_ℓ, C_u]^{SA} is established (eq. 13), where C_ℓ = exp(-θ/(γ(1-γ)) ||r||∞), C_u = exp(θ/(γ(1-γ)) ||r||∞). F is locally Lipschitz on K (the map x → x^γ, γ ∈ (0,1) is not globally Lipschitz near 0, but K is bounded away from 0). The limiting ODE x_dot = D_{π_x}(t)[F(x) - x] is shown globally asymptotically stable at x* using the sandwich m(t) = min_i x_i(t)/x*_i, M(t) = max_i x_i(t)/x*_i, and Dini-derivative inequalities D^+ m(t) ≥ ℓ_K(m(t)^γ - m(t)) and D^+ M(t) ≤ u_K(M(t)^γ - M(t)), both driven toward 1 by the scalar ODE z_dot = z^γ - z. The scalar finite-time rate Õ(n^{-1/2}) is proved for the scalar analogue only; the paper explicitly acknowledges the full vector rate is open.
REPRODUCIBILITY: No code is released. No numerical experiments are conducted. The paper is entirely theoretical. The illustrative two-state MDP example (Section 2.1) can be reproduced analytically from the given parameters.
Technical innovations
- First Q-learning-style algorithms with provable almost-sure convergence guarantees for exponential-utility optimization in discounted (infinite-horizon) tabular MDPs, extending the Porteus (1975) planning framework to the model-free RL setting where prior work used only policy gradient methods (Noorani et al. 2025, Enders et al. 2024) without optimal stationary policy guarantees.
- Introduction of the F-operator (eq. 2) as a Q-value extension of the Porteus Bellman equation, proved to be a γ-contraction in the Thompson/sup-log metric and shown to admit unbiased single-sample stochastic estimates due to the linearity of the expectation at the outermost level — a structural distinction absent from the T-operator.
- Novel asymptotic convergence proof for a stochastic approximation recursion driven by a sublinear power-law operator (x^γ, γ ∈ (0,1)) using Dini derivatives, monotonicity, and homogeneity arguments in place of standard global contraction or quadratic Lyapunov methods, applicable to settings where the contraction metric is incompatible with additive updates.
- Finite-time convergence rate O(n^{-β/2}) for the two-timescale algorithm via timescale separation and propagation of the fast-tracking error bound E||g_n - g*(Q_n)||∞ = O(sqrt(β_n)) into the slow iterate, with the rate approaching O(n^{-1/2}) as β → 1.
- Proof that the greedy stationary policy induced by the fixed points of T and F is optimal among all stationary Markovian policies (including randomized) for the exponential-utility objective in discounted MDPs, a result previously established only in the planning/model-based context.
Limitations
- No numerical experiments whatsoever — all claims are theoretical; the algorithms' practical performance (convergence speed, sensitivity to θ, behavior with large state/action spaces) is entirely unvalidated empirically.
- Finite-time rate for the one-timescale algorithm is proved only for a scalar analogue (Proposition 4.4), not the full vector recursion; the paper explicitly states the vector finite-time rate is an open problem due to lack of a globally usable contraction.
- The finite-time results for the two-timescale algorithm are derived under the i.i.d. generative-model assumption (not Markovian sampling), leaving the practically relevant Markovian noise setting without explicit finite-time rates.
- The analysis is restricted to tabular (finite S, finite A) MDPs; extension to function approximation, continuous state/action spaces, or deep RL settings is not addressed and would require fundamentally different techniques.
- The risk-aversion parameter θ is fixed throughout; the paper does not address how to select θ, the sensitivity of convergence rates to θ, or the case of adaptive/learned risk aversion.
- Optimality is proved only among stationary policies; the relationship between the Porteus formulation's optimal stationary policy and the globally optimal (possibly non-stationary) policy for the raw exponentiated discounted return objective is not discussed in depth.
- No comparison to existing policy gradient methods for exponential utility (e.g., Noorani et al. 2025, Moharrami et al. 2022) in terms of sample complexity or practical convergence speed.
Open questions / follow-ons
- Can finite-time convergence rates be established for the full vector one-timescale recursion under Markovian sampling, given that the power-law operator F lacks a globally usable contraction in additive coordinates?
- Can these algorithms be extended to function approximation (linear or neural) settings analogous to deep Q-networks, and what structural properties of T and F survive under approximation errors?
- What is the sample complexity comparison between the proposed Q-learning-style algorithms and policy gradient methods (e.g., Noorani et al. 2025) for exponential-utility optimization, and in which regimes does each dominate?
- Can the one-timescale F-based algorithm be accelerated using variance reduction or momentum techniques (e.g., analogues of double Q-learning or prioritized replay) that preserve the positivity constraint x ∈ R^{SA}_{++}?
Why it matters for bot defense
This paper is a pure theoretical RL paper with no direct connection to CAPTCHA, bot detection, or web security. The exponential utility framework it develops is designed for risk-sensitive sequential decision-making, where rare but catastrophic outcomes (large negative rewards) are penalized more heavily than a risk-neutral agent would. A bot-defense engineer would find no immediately applicable technique here.
The indirect relevance, if any, lies in the possibility of framing bot-defense policy design as a risk-sensitive MDP: for example, a challenge-serving agent that must balance user friction (cost per legitimate user challenged) against the rare but severe outcome of a sophisticated bot bypassing defenses undetected. In such a framing, exponential utility (which accounts for all moments of the return distribution) might be preferred over expected-value optimization, which can systematically under-weight low-probability, high-damage bot breaches. However, the tabular, finite state/action space requirement of the current work makes it inapplicable to any realistic bot-defense deployment scenario without substantial extension to function approximation and continuous observation spaces. This paper is relevant to a research engineer who works on the theoretical foundations of risk-sensitive decision-making and might eventually inform more practical risk-aware bot-defense policy learning, but it offers no near-term engineering takeaway.
Cite
@article{arxiv2605_08053,
title={ Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs },
author={ Gugan Thoppe and L. A. Prashanth and Ankur Naskar and Sanjay Bhat },
journal={arXiv preprint arXiv:2605.08053},
year={ 2026 },
url={https://arxiv.org/abs/2605.08053}
}