
SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control

Source: arXiv:2605.00787 · Published 2026-05-01 · By Stavros Orfanoudakis, Pedro P. Vergara

TL;DR

This work addresses a limitation of reinforcement learning (RL) for continuous control: policy updates typically rely on local gradients and do not explicitly exploit the global geometry of the value function. The authors propose SAVGO, a novel RL algorithm that jointly learns a state-action embedding space whose cosine similarity reflects action-value similarity. In this space, pairs of state-action inputs with similar Q-values lie close together (high cosine similarity), while dissimilar pairs point in different directions. The learned geometry supports a similarity kernel over candidate actions during policy improvement, directing updates toward regions of higher value beyond local gradient steps.

SAVGO tightly integrates representation learning, value estimation, and policy optimization through a consistent geometric objective within an off-policy actor-critic framework, preserving scalability. Empirical evaluation on MuJoCo continuous control tasks demonstrates improved sample efficiency and higher final returns compared to strong baselines (PPO, TD3, SAC, TQC), particularly in high-dimensional settings like Humanoid and Walker2d. Ablations confirm the importance of the value-geometry learning and kernel-based similarity weighting for performance gains.

Key findings

  • SAVGO achieves the best, or close to the best, mean maximum evaluation returns across MuJoCo v5 tasks after 1M steps, e.g., Humanoid: SAVGO (7687 ± 319) vs TQC (6330 ± 227) and SAC (5351 ± 75).
  • Similarity-weighted policy updates yield faster early learning and reduced variance compared to SAC, TD3, PPO, and TQC, especially on high-dimensional problems (Fig. 4).
  • Ablation studies (Fig. 5) show the curvature parameter λ controlling the target cosine similarity mapping critically affects performance, with λ ≈ 1–1.5 balancing stability and ranking precision.
  • Policy improvement benefits from candidate sample set size K ≥ 64 actions, with no gains beyond K = 256–512 but increased compute cost (Fig. 6).
  • Freezing the encoder or using uniform kernel weights (removing geometry-based weighting) degrades performance, confirming learned value geometry is essential (Table 2).
  • SAVGO integrates a value-geometry loss with critic updates for stable embedding training, avoiding collapse or instability seen in prior bisimulation-based methods.
  • Training runtime is approximately 8 hours for 1M steps on Nvidia V100 GPUs, slower than baselines due to extra embedding/kernel computations but within practical bounds.

Methodology — deep read

  1. Threat Model & Assumptions: The adversary context is not the focus here as this is an RL algorithmic contribution rather than a security paper. The method assumes standard RL environments modeled as discounted Markov Decision Processes (MDPs) with continuous states and actions. No explicit adversary or robustness to adversarial inputs is discussed.

  2. Data: Benchmark environments used for evaluation are from the MuJoCo continuous control suite (version 5). Each environment run uses 1 million environment steps for training, with data collected via interaction using the current stochastic policy, stored in a replay buffer. Multiple random seeds (typically 5 or 10) are used to gather stable statistics. Observations are normalized but no other preprocessing details are mentioned.

  3. Architecture / Algorithm: SAVGO extends an off-policy actor-critic framework similar to Soft Actor-Critic (SAC), with two critics and a stochastic actor.

  • A separate value-geometry encoder network z_ψ learns joint state-action embeddings in a normalized d-dimensional space.
  • The encoder is trained such that cosine similarity between embeddings predicts similarity in action-values Q(s,a).
  • Supervision uses pairs drawn from mini-batches: each pair's normalized value gap Δ_{i,j} determines a target cosine similarity Y_{i,j} = 1 - 2*(Δ_{i,j})^λ (a minimal sketch of this mapping follows the methodology list).
  • The actor is updated by sampling an anchor action from the current policy and a candidate set of K actions per state.
  • Candidate embeddings' cosine similarities to the anchor define kernel weights (softmax with temperature ρ).
  • Weighted Q-values from candidate actions form a similarity-weighted value estimate, replacing the local pointwise Q in the actor loss.
  • Temperature ρ is annealed over time to sharpen weighting as training progresses.
  • Target networks for critic and encoder weights are updated via Polyak averaging for stability.
  4. Training Regime: Training proceeds with standard replay-buffer sampling, mini-batches of size B (exact size not specified), and gradient-descent updates for the critic, encoder, and actor networks.
  • Networks have two layers with 256 nodes each.
  • Candidate set size K varies between 64 and 256 depending on action dimensions.
  • Temperature ρ is annealed with a cosine schedule from 0.75 to 0.05 over 200k steps.
  • Training hardware includes Nvidia V100 GPUs and Intel Xeon 48-core CPUs.
  • Seeds: 5 or 10 per experiment.
  5. Evaluation Protocol: Performance is evaluated on 8 MuJoCo environments covering low to high action-dimensional continuous control tasks.
  • Metrics: episodic return averaged over seeds, reporting mean and standard deviation.
  • Baselines: PPO, TD3, SAC, TQC from Stable Baselines 3.
  • Ablations include varying λ (curvature parameter), candidate set size K, encoder freezing, kernel weighting, and normalization strategies.
  • Results focus on max evaluation return over 1M steps and training curves.
  6. Reproducibility: The implementation code is publicly released at https://github.com/StavrosOrf/DistanceRL.
  • Details on hyperparameters and training schedules are provided in supplementary materials.
  • The MuJoCo suite is publicly available; training is open domain with random seeds documented.
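
To make the encoder supervision concrete, below is a minimal PyTorch sketch of a value-geometry loss over a minibatch, using the target mapping Y_{i,j} = 1 - 2*(Δ_{i,j})^λ. The function name, the batch-wise normalization of the value gaps, and the MSE regression form are illustrative assumptions rather than the authors' exact implementation; the released repository is the authoritative reference.

```python
import torch
import torch.nn.functional as F

def value_geometry_loss(z, q, lam=1.0):
    """Hedged sketch of the value-geometry objective (names, gap
    normalization, and MSE form are assumptions, not the paper's code).

    z   : (B, d) state-action embeddings z_psi(s_i, a_i) for a minibatch
    q   : (B,)   critic estimates Q(s_i, a_i) used as supervision
    lam : curvature parameter lambda shaping the gap-to-similarity mapping
    """
    z = F.normalize(z, dim=-1)            # unit-norm embeddings
    cos = z @ z.t()                       # (B, B) pairwise cosine similarities

    # Normalized value gaps Delta_{i,j} in [0, 1] within the minibatch.
    gap = torch.abs(q.unsqueeze(0) - q.unsqueeze(1))
    gap = gap / gap.max().clamp_min(1e-8)

    # Target similarity Y_{i,j} = 1 - 2 * Delta_{i,j}^lambda:
    # identical Q-values map to +1, maximally different ones to -1.
    target = 1.0 - 2.0 * gap.pow(lam)

    # Regress the observed cosine similarities onto the targets.
    return F.mse_loss(cos, target)
```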

Concrete example: During each actor update, for a state s_t, an anchor action â_t ∼ π_θ is sampled, and K candidate actions {a_t^k} are drawn (e.g., from π_θ).

  • Compute embeddings z_ψ(s_t, â_t) and z_ψ(s_t, a_t^k), then cosine similarities with the anchor.
  • Convert similarities to kernel weights via softmax with temperature ρ.
  • Evaluate each candidate's Q-value as the minimum over the two critics.
  • Form the kernel-weighted average of these Q-values, Q̂(s_t, â_t), and update policy parameters θ to maximize Q̂(s_t, â_t) (with entropy regularization).
  • Simultaneously update critics and encoder using batches sampled from replay. This procedure leverages the learned geometry to produce more informed policy gradient signals beyond local gradients at a single sampled action.
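
A hedged PyTorch sketch of this similarity-weighted actor update follows, together with a cosine schedule for the temperature ρ as described in the training regime. The module interfaces (actor.sample, encoder(s, a), the two critics), the SAC-style entropy coefficient α, and the exact shapes are illustrative assumptions; only the overall procedure follows the paper.

```python
import math
import torch
import torch.nn.functional as F

def anneal_rho(step, rho_start=0.75, rho_end=0.05, horizon=200_000):
    """Cosine schedule for the kernel temperature rho (endpoint values are
    from the paper; the exact schedule shape is an assumption)."""
    t = min(step, horizon) / horizon
    return rho_end + 0.5 * (rho_start - rho_end) * (1.0 + math.cos(math.pi * t))

def similarity_weighted_actor_loss(actor, encoder, critic1, critic2,
                                   s, K=64, rho=0.5, alpha=0.2):
    """Sketch of the kernel-weighted policy objective (interfaces assumed:
    actor.sample(s) -> (action, log_prob); encoder(s, a) -> embedding;
    critic(s, a) -> scalar Q-value per row)."""
    B = s.shape[0]

    # Anchor action from the current policy (reparameterized sample).
    a_hat, log_pi = actor.sample(s)                        # (B, a_dim), (B,)

    # K candidate actions per state, here also drawn from the policy.
    s_rep = s.unsqueeze(1).expand(B, K, -1).reshape(B * K, -1)
    a_cand, _ = actor.sample(s_rep)                        # (B*K, a_dim)

    # Normalized state-action embeddings from the value-geometry encoder.
    z_hat = F.normalize(encoder(s, a_hat), dim=-1)         # (B, d)
    z_cand = F.normalize(encoder(s_rep, a_cand), dim=-1).view(B, K, -1)

    # Kernel weights: softmax over cosine similarities to the anchor.
    sims = (z_cand * z_hat.unsqueeze(1)).sum(-1)           # (B, K)
    w = F.softmax(sims / rho, dim=-1)

    # Candidate Q-values, taking the minimum over the two critics.
    q_cand = torch.min(critic1(s_rep, a_cand),
                       critic2(s_rep, a_cand)).view(B, K)

    # Similarity-weighted value estimate replacing the pointwise Q(s, a_hat).
    q_weighted = (w * q_cand).sum(-1)                      # (B,)

    # SAC-style objective: maximize the weighted value plus an entropy bonus.
    return (alpha * log_pi - q_weighted).mean()
```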

Technical innovations

  • Learning a normalized state-action embedding space where cosine similarity explicitly reflects action-value similarity, rather than generic representation similarity.
  • Deriving a geometry-aware policy improvement operator that performs similarity-weighted aggregation over sampled candidate actions, enabling policy updates guided by a kernel over action embeddings.
  • Combining representation learning, value estimation, and policy optimization into a unified geometry-consistent objective within an off-policy actor-critic framework.
  • Introducing a curvature-controlled mapping from normalized value gaps to cosine-similarity targets that flexibly shapes the embedding geometry.

Datasets

  • MuJoCo v5 continuous control suite — ~1 million environment steps per task per run — publicly available

Baselines vs proposed

  • PPO: max evaluation return on Humanoid = 554 ± 62 vs SAVGO: 7687 ± 319
  • TD3: max evaluation return on Humanoid = 799 ± 266 vs SAVGO: 7687 ± 319
  • SAC: max evaluation return on Humanoid = 5351 ± 75 vs SAVGO: 7687 ± 319
  • TQC: max evaluation return on Humanoid = 6330 ± 227 vs SAVGO: 7687 ± 319
  • TQC: max evaluation return on HalfCheetah = 12114 ± 1006 vs SAVGO: 10545 ± 616 (competitive but slightly lower)
  • SAVGO improves sample efficiency and stability over SAC, TD3, and PPO on Walker2d and Humanoid (Fig. 4).

Limitations

  • SAVGO has increased computational cost, approximately 8 hours training time for 1M steps versus 2–4.5 hours for baselines due to encoder and kernel computations.
  • Experiments are limited to standard MuJoCo v5 benchmarks; broader generalization to more diverse or real-world tasks is untested.
  • No adversarial robustness or safety analysis is performed—focus is on improved optimization geometry rather than security.
  • Fixed hyperparameters for curvature λ and candidate size K require tuning per task for best results; sensitivity remains significant.
  • The method assumes availability of accurate critic Q-value estimates; the embedding relies on these for supervision, potentially limiting robustness in highly stochastic environments.

Open questions / follow-ons

  • How does SAVGO perform under distribution shifts or environmental perturbations affecting the learned geometry?
  • Can the value-geometry learning be combined with pixel-based or high-dimensional raw observation inputs for end-to-end training?
  • What is the impact of learned geometry on offline RL or batch RL scenarios with fixed datasets?
  • Could the approach be extended to discrete or hybrid action spaces, or multi-agent settings?

Why it matters for bot defense

While SAVGO is not directly related to CAPTCHAs or bot defense, its core insight—learning and leveraging geometry-consistent similarity metrics in a complex, continuous action space—could inspire bot-defense models that need to evaluate similarity between user interactions or probe behaviors. The idea of embedding value-driven similarity and using kernel-weighted aggregation rather than local pointwise estimates could be adapted for behavioral similarity scoring in CAPTCHA puzzle solving or interactive authentication stages. However, the continuous control setting and policy optimization specifics differ substantially from CAPTCHAs, so application would require significant adaptation and domain-specific engineering.

Practitioners focused on bot defense and CAPTCHAs might take away the principle that shaping learned representations to directly reflect utility or value with well-calibrated similarity metrics can enable more effective, non-local decision updates and robust evaluations, which could improve detection algorithms or adaptive challenge selection mechanisms.

Cite

```bibtex
@article{arxiv2605_00787,
  title={SAVGO: Learning State-Action Value Geometry with Cosine Similarity for Continuous Control},
  author={Stavros Orfanoudakis and Pedro P. Vergara},
  journal={arXiv preprint arXiv:2605.00787},
  year={2026},
  url={https://arxiv.org/abs/2605.00787}
}
```

Read the full paper: https://arxiv.org/abs/2605.00787

Articles are CC BY 4.0 — feel free to quote with attribution