Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

Source: arXiv:2605.08053 · Published 2026-05-08 · By Gugan Thoppe, L. A. Prashanth, Ankur Naskar, Sanjay Bhat

TL;DR

This paper addresses a foundational gap in risk-sensitive reinforcement learning: the absence of principled, model-free, value-based (Q-learning-style) algorithms for optimizing exponential utility in discounted infinite-horizon MDPs. Prior work on exponential utility in MDPs was largely confined to the average-reward setting or relied on policy gradient methods that lack guarantees of convergence to the optimal stationary policy. The authors build on the Bellman-type consistency equation from Porteus (1975), which imposes inter-stage consistency to avoid time-inconsistency, and derive two distinct Q-value-style operators — T (operating in log-space, an L∞ contraction) and F (operating in the positive orthant, a contraction in the Thompson/sup-log metric). They prove that the fixed points of these operators yield an optimal stationary policy for the exponential-utility objective, then construct two tabular model-free algorithms that converge to those fixed points.

The two proposed algorithms differ structurally in their timescale design. The first is a two-timescale algorithm targeting the fixed point of T: a fast auxiliary recursion tracks the inner conditional expectation (which appears inside a logarithm, preventing direct single-sample estimation), while a slow recursion updates the Q-estimates using a log-transform of the fast iterate. The second is a one-timescale algorithm directly targeting the fixed point of F, which admits unbiased single-sample stochastic estimates because the expectation appears linearly at the outermost level of F. Almost-sure convergence is established for both; finite-time rates of O(n^{-β/2}) are established for the two-timescale algorithm, and a scalar-only O(n^{-1/2}) rate (hiding log factors) is established for the one-timescale variant.

The convergence analysis for the one-timescale algorithm is the most technically novel contribution. Because F's contraction is in the sup-log metric, which is not compatible with additive stochastic approximation updates in the original coordinates, standard global contraction arguments fail. The authors instead prove stability via boundedness of iterates within a compact box K, then establish asymptotic stability of the limiting ODE using monotonicity, homogeneity (F(cx) = c^γ F(x)), and Dini-derivative comparison arguments against scaled fixed points — a technique that does not rely on a Lyapunov function in the usual sense.

Key findings

The operator T (eq. 5-6), derived from the Porteus (1975) Bellman-type equation, is a γ-contraction in the L∞ norm on R^{SA}, enabling direct stochastic approximation via timescale separation.
The operator F (eq. 2) is a γ-contraction in the sup-log (Thompson) metric d(x1,x2) = ||ln x1 - ln x2||∞ on R^{SA}_{++}, but this contraction is incompatible with additive updates in original coordinates.
The two-timescale algorithm achieves finite-time error E||Q_n - Q*||∞ = O(n^{-β/2}) for step-sizes α_n = (n+1)^{-α}, β_n = (n+1)^{-β} with 1/2 < β < α < 1 in the i.i.d. generative-model setting; choosing β close to 1 yields rates approaching O(n^{-1/2}).
The one-timescale scalar finite-time analysis yields E|x_n/x* - 1| = Õ(n^{-1/2}), but the full vector finite-time rate is left open due to the absence of a globally usable contraction in additive coordinates.
In a two-state illustrative MDP (S={s, s_bad}, γ=0.9, θ=0.1, P(s_bad|s,risk)=0.01, r(s,risk)=1), the risk-neutral optimal policy selects 'risk' (Q*_rn(s,risk)≈0.917 vs Q*_rn(s,safe)≈0.826), while the exponential-utility operator T assigns Q*(s,risk)≈-47.59 vs Q*(s,safe)=0, reversing the policy to 'safe'.
Greedy policies induced by the fixed points x* of F (equivalently Q* of T) are proved optimal among all stationary Markovian policies (including randomized ones) for the exponential-utility objective — the first such guarantee for tabular model-free algorithms in discounted MDPs.
The two-timescale fast-tracking error satisfies E||g_n - exp(-θ/γ · T(Q_n))||∞ ≤ C_g · sqrt(β_n), establishing that the inner-expectation estimator converges at rate O(sqrt(β_n)).
No experiments or simulation results are reported; all contributions are theoretical (convergence proofs, finite-time bounds, structural characterization).

Methodology — deep read

THREAT MODEL AND PROBLEM SETTING: This is a theoretical RL paper, not a security paper. The 'adversary' in the MDP sense is nature/environment stochasticity. The MDP M = (S, A, P, r, γ) is finite, tabular, and the agent has fixed risk-aversion parameter θ > 0 (or equivalently θ_hat > 0). The agent is model-free: it does not know P or r directly, only observing transitions. The Porteus (1975) formulation is adopted specifically because it enforces inter-stage consistency, yielding a well-defined fixed-point structure and stationary optimal policies — unlike the naive formulation that directly exponentiates the discounted return, which leads to time-inconsistency and non-stationary optimal policies.

DATA AND SAMPLING: There are no real-world datasets. The algorithms are analyzed under two sampling regimes. (1) Markovian sampling: a behavior policy π_n generates a trajectory (s_n, a_n, s_{n+1}) where s_{n+1} ~ P(·|s_n, a_n). Assumption 1 (for two-timescale) and Assumption 2 (for one-timescale) require uniform ergodicity of the induced Markov chain and continuity of the transition kernel as a function of the current iterate, uniformly over the bounded iterate set. (2) Generative model (i.i.d.): for finite-time results, (s_n, a_n) is drawn i.i.d. from a distribution ν with λ = min_{s,a} ν(s,a) > 0. This simplification isolates the nonlinear operator effects from Markovian noise complications.

ARCHITECTURE — OPERATOR DESIGN: Two Bellman-type operators are introduced. The T-operator (eq. 5) maps R^{SA} → R^{SA} via T(Q)(s,a) = r(s,a) - (γ/θ) ln[Σ_{s'} P(s'|s,a) exp(-θ max_{a'} Q(s',a'))], which is an L∞ γ-contraction. The F-operator (eq. 2) maps R^{SA}{++} → R^{SA} via F(x)(s,a) = exp(-θ/γ · r(s,a)) · [Σ_{s'} P(s'|s,a) (min_{a'} x(s',a'))^γ], a γ-contraction in the sup-log metric. The two are related by the conjugacy in eq. (7): F(x) = exp(-θ/γ · T(-γ/θ · ln x)), so their fixed points are related by Q* = -γ/θ · ln x*. The key novelty of F is that its conditional expectation appears linearly at the outermost level, enabling unbiased single-sample Monte Carlo estimation; in T, the expectation is inside a logarithm, preventing direct unbiased estimation from a single transition.

ALGORITHM 1 — TWO-TIMESCALE (eqs. 8-9): Q_{n+1} = Q_n + α_n[-γ/θ · ln g_n - Q_n]; g_{n+1} = g_n + β_n · e_{s_n,a_n}[G(Q_n, s_n, a_n, s_{n+1}) - g_n(s_n, a_n)], where G(Q,s,a,s') = exp(-θ/γ · r(s,a) - θ · max_{a'} Q(s',a')). The fast iterate g_n ∈ R^{SA}{++} tracks the conditional expectation E[exp(-θ max Q(s',a')) | s,a] pointwise. The slow iterate Q_n tracks Q* via the plug-in -γ/θ · ln g_n. Step-sizes satisfy Σ α_n = Σ β_n = ∞, Σ(α_n² + β_n²) < ∞, and α_n/β_n → 0 (fast timescale dominates).

ALGORITHM 2 — ONE-TIMESCALE (eqs. 10-11): x_{n+1} = x_n + α_n · e_{s_n,a_n}[F_hat_n - x_n(s_n, a_n)], where F_hat_n = exp(-θ/γ · r(s_n,a_n)) · (min_{a'} x_n(s_{n+1},a'))^γ. The sample F_hat_n is an unbiased estimate of F(x_n)(s_n,a_n) given (s_n, a_n, s_{n+1}). Step-sizes satisfy Σ α_n = ∞, Σ α_n² < ∞. After convergence, Q* is recovered as -γ/θ · ln x*.

CONVERGENCE ANALYSIS — TWO-TIMESCALE: The proof uses Borkar (2008)'s ODE method with timescale separation. The fast recursion is shown (via martingale arguments) to track g*(Q_n) = exp(-θ/γ · T(Q_n)) with tracking error ||e_n||∞ → 0 a.s. Substituting this into the slow recursion yields a perturbed ODE Q_dot = T(Q) - Q, whose unique globally asymptotically stable equilibrium is Q* (by γ-contraction of T in L∞). Finite-time rates use step-sizes α_n = (n+1)^{-α}, β_n = (n+1)^{-β} (1/2 < β < α < 1), yielding E||Q_n - Q*||∞ ≤ C_Q^(1) exp(-((1-γ)/(1-α)) n^{1-α}) + C_Q^(2) sqrt(β_n) = O(n^{-β/2}).

CONVERGENCE ANALYSIS — ONE-TIMESCALE: The compact invariant set K = [C_ℓ, C_u]^{SA} is established (eq. 13), where C_ℓ = exp(-θ/(γ(1-γ)) ||r||∞), C_u = exp(θ/(γ(1-γ)) ||r||∞). F is locally Lipschitz on K (the map x → x^γ, γ ∈ (0,1) is not globally Lipschitz near 0, but K is bounded away from 0). The limiting ODE x_dot = D_{π_x}(t)[F(x) - x] is shown globally asymptotically stable at x* using the sandwich m(t) = min_i x_i(t)/x*_i, M(t) = max_i x_i(t)/x*_i, and Dini-derivative inequalities D^+ m(t) ≥ ℓ_K(m(t)^γ - m(t)) and D^+ M(t) ≤ u_K(M(t)^γ - M(t)), both driven toward 1 by the scalar ODE z_dot = z^γ - z. The scalar finite-time rate Õ(n^{-1/2}) is proved for the scalar analogue only; the paper explicitly acknowledges the full vector rate is open.

REPRODUCIBILITY: No code is released. No numerical experiments are conducted. The paper is entirely theoretical. The illustrative two-state MDP example (Section 2.1) can be reproduced analytically from the given parameters.

Technical innovations

First Q-learning-style algorithms with provable almost-sure convergence guarantees for exponential-utility optimization in discounted (infinite-horizon) tabular MDPs, extending the Porteus (1975) planning framework to the model-free RL setting where prior work used only policy gradient methods (Noorani et al. 2025, Enders et al. 2024) without optimal stationary policy guarantees.
Introduction of the F-operator (eq. 2) as a Q-value extension of the Porteus Bellman equation, proved to be a γ-contraction in the Thompson/sup-log metric and shown to admit unbiased single-sample stochastic estimates due to the linearity of the expectation at the outermost level — a structural distinction absent from the T-operator.
Novel asymptotic convergence proof for a stochastic approximation recursion driven by a sublinear power-law operator (x^γ, γ ∈ (0,1)) using Dini derivatives, monotonicity, and homogeneity arguments in place of standard global contraction or quadratic Lyapunov methods, applicable to settings where the contraction metric is incompatible with additive updates.
Finite-time convergence rate O(n^{-β/2}) for the two-timescale algorithm via timescale separation and propagation of the fast-tracking error bound E||g_n - g*(Q_n)||∞ = O(sqrt(β_n)) into the slow iterate, with the rate approaching O(n^{-1/2}) as β → 1.
Proof that the greedy stationary policy induced by the fixed points of T and F is optimal among all stationary Markovian policies (including randomized) for the exponential-utility objective in discounted MDPs, a result previously established only in the planning/model-based context.

Limitations

No numerical experiments whatsoever — all claims are theoretical; the algorithms' practical performance (convergence speed, sensitivity to θ, behavior with large state/action spaces) is entirely unvalidated empirically.
Finite-time rate for the one-timescale algorithm is proved only for a scalar analogue (Proposition 4.4), not the full vector recursion; the paper explicitly states the vector finite-time rate is an open problem due to lack of a globally usable contraction.
The finite-time results for the two-timescale algorithm are derived under the i.i.d. generative-model assumption (not Markovian sampling), leaving the practically relevant Markovian noise setting without explicit finite-time rates.
The analysis is restricted to tabular (finite S, finite A) MDPs; extension to function approximation, continuous state/action spaces, or deep RL settings is not addressed and would require fundamentally different techniques.
The risk-aversion parameter θ is fixed throughout; the paper does not address how to select θ, the sensitivity of convergence rates to θ, or the case of adaptive/learned risk aversion.
Optimality is proved only among stationary policies; the relationship between the Porteus formulation's optimal stationary policy and the globally optimal (possibly non-stationary) policy for the raw exponentiated discounted return objective is not discussed in depth.
No comparison to existing policy gradient methods for exponential utility (e.g., Noorani et al. 2025, Moharrami et al. 2022) in terms of sample complexity or practical convergence speed.

Open questions / follow-ons

Can finite-time convergence rates be established for the full vector one-timescale recursion under Markovian sampling, given that the power-law operator F lacks a globally usable contraction in additive coordinates?
Can these algorithms be extended to function approximation (linear or neural) settings analogous to deep Q-networks, and what structural properties of T and F survive under approximation errors?
What is the sample complexity comparison between the proposed Q-learning-style algorithms and policy gradient methods (e.g., Noorani et al. 2025) for exponential-utility optimization, and in which regimes does each dominate?
Can the one-timescale F-based algorithm be accelerated using variance reduction or momentum techniques (e.g., analogues of double Q-learning or prioritized replay) that preserve the positivity constraint x ∈ R^{SA}_{++}?

Why it matters for bot defense

This paper is a pure theoretical RL paper with no direct connection to CAPTCHA, bot detection, or web security. The exponential utility framework it develops is designed for risk-sensitive sequential decision-making, where rare but catastrophic outcomes (large negative rewards) are penalized more heavily than a risk-neutral agent would. A bot-defense engineer would find no immediately applicable technique here.

The indirect relevance, if any, lies in the possibility of framing bot-defense policy design as a risk-sensitive MDP: for example, a challenge-serving agent that must balance user friction (cost per legitimate user challenged) against the rare but severe outcome of a sophisticated bot bypassing defenses undetected. In such a framing, exponential utility (which accounts for all moments of the return distribution) might be preferred over expected-value optimization, which can systematically under-weight low-probability, high-damage bot breaches. However, the tabular, finite state/action space requirement of the current work makes it inapplicable to any realistic bot-defense deployment scenario without substantial extension to function approximation and continuous observation spaces. This paper is relevant to a research engineer who works on the theoretical foundations of risk-sensitive decision-making and might eventually inform more practical risk-aware bot-defense policy learning, but it offers no near-term engineering takeaway.

Cite

bibtex

@article{arxiv2605_08053,
  title={ Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs },
  author={ Gugan Thoppe and L. A. Prashanth and Ankur Naskar and Sanjay Bhat },
  journal={arXiv preprint arXiv:2605.08053},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.08053}
}

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs ​

TL;DR ​

Key findings ​

Methodology — deep read ​

Technical innovations ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​