Mem-$π$: Adaptive Memory through Learning When and What to Generate

Source: arXiv:2605.21463 · Published 2026-05-20 · By Xiaoqiang Wang, Chao Wang, Hadi Nekoei, Christopher Pal, Alexandre Lacoste, Spandana Gella et al.

TL;DR

This paper addresses the limitations of conventional memory-augmented large language model (LLM) agents, which rely on static retrieval from episodic memory banks or skill libraries. These retrieval-based systems often return guidance that is irrelevant, overly specific, or poorly aligned with the agent’s current context. The authors propose Mem-π, an adaptive memory framework in which memory is modeled as a dedicated generative policy that decides both when to generate guidance and what guidance to produce, conditioned on the agent’s current context. This shifts memory from static retrieval to on-demand generation, enabling concise, context-aware hints that better support complex agent tasks.

Mem-π is trained via a two-stage process: experience distillation converts an offline experience bank into a parametric generative memory policy through supervised learning, and adaptation distillation fine-tunes this policy with reinforcement learning to optimize task success and regulate when generation is beneficial (including an abstention mechanism). Extensive evaluation across diverse agentic domains such as web navigation (WebArena, WorkArena), terminal tool use (LifelongAgentBench), and text-based embodied interaction (ALFWorld) demonstrates Mem-π consistently outperforms retrieval-based and prior RL-optimized baselines, achieving over 30% relative improvements on web navigation tasks while using fewer memory tokens. The model learns to abstain on easier tasks and generate adaptive guidance on harder tasks, leading to improved performance-efficiency trade-offs and better generalization across different downstream agents.

Key findings

Mem-π achieves a +8.1 pp absolute improvement on WebArena over the Stage 1 model and a +25.4 pp gain in the CMS subdomain (Table 1).
Mem-π improves base agent success rate from 42.0% to 50.3% on WorkArena and from 85.3% to 91.6% on ALFWorld (Table 1).
Experience distillation (Stage 1) alone matches or exceeds RL baselines, e.g. 35.0% SR on WebArena vs Memory-R1’s 33.2%.
Decision-content decoupled RL objective with structured counterfactual rollouts improves performance by 4.8 pp on WebArena compared to vanilla GRPO (Table 2).
Visual observations yield a consistent +2.7 pp SR improvement on WebArena versus text-only input (Figure 3).
Mem-π's learned calibration yields a 71% abstention rate on the easiest WebArena tasks and only 13% on the hardest, aligning generation with task difficulty (Figure 4).
Mem-π transfers well to unseen stronger agents (GPT-5.4-mini), providing 16.0 pp SR gain vs 4.3 pp for RAG on WebArena (Table 3).
Generative memory reduces average memory tokens used by 31% compared to Stage 1 and 38% compared to Memory-R1, while improving SR by 8.1 pp on WebArena (Figure 5).

Threat model

The paper does not define a security adversary model. The closest analogue to an adversary is the task environment itself: challenging or ambiguous tasks can lead a memory-augmented agent astray. The paper assumes no malicious manipulation of memory or inputs and instead focuses on limiting the harm from irrelevant or misleading guidance generated in uncertain or out-of-distribution contexts.

Methodology — deep read

The core idea frames memory as a generative policy πmem distinct from the downstream agent.

Threat Model & Assumptions: There is no explicit adversary; the only source of difficulty considered is the task environment itself, and the paper includes no adversarial attack or robustness evaluation. The method assumes the environment provides a task success signal and that memory is not adversarially corrupted.
Data: The offline experience bank E consists of pairs (x, m), where each x=(q, o) combines the task instruction q and environment observation o (textual descriptions and optionally screenshots), and m is guidance as textual hints derived from distilled decisive steps in agent trajectories. The bank is collected from existing agent interaction logs across four diverse benchmarks (WebArena, WorkArena, LifelongAgentBench, ALFWorld). Data is split into train/test sets following original benchmarks, with training experience used for supervised learning.
Architecture / Algorithm: πmem is a dedicated language or vision-language generative model (Qwen-2.5-7B-Instruct or Qwen-2.5-VL-7B-Instruct), separate from the downstream LLM agent. It outputs either an abstention token [ABSTAIN] or a generated guidance string prepended to the agent's context. The output y = d ⊕ m includes a decision token d ∈ {[GENERATE], [ABSTAIN]} and memory guidance m. Stage 1 experience distillation trains πmem to map from task contexts (q, o) to memory hints m via autoregressive supervised learning, maximizing log-likelihood of training guidance.
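The Stage 1 objective described above (autoregressive maximum likelihood over the experience bank E) can be written out as follows. This is a sketch of the standard form such an objective takes, not a formula quoted from the paper; the parameter symbol θ and the token index t are notation assumed here.

```latex
% Assumed notation: \theta are the parameters of \pi_{\text{mem}},
% m_t is the t-th token of the guidance m, and x = (q, o) is the task context.
\mathcal{L}_{\text{S1}}(\theta)
  = -\,\mathbb{E}_{(x,\,m)\sim\mathcal{E}}
    \left[ \sum_{t=1}^{|m|}
      \log \pi_{\text{mem}}\!\left(m_t \mid x,\; m_{<t};\; \theta\right)
    \right],
\qquad x = (q, o).
```

Minimizing this loss is equivalent to maximizing the log-likelihood of the training guidance given the task instruction and observation.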

Stage 2 adaptation distillation fine-tunes πmem with reinforcement learning to optimize downstream task reward plus a length-penalty regularizer for memory brevity. The authors design a decision-content decoupled reinforcement learning objective built on Group Relative Policy Optimization (GRPO). It trains πmem on structured rollout groups that contain both an abstain branch and multiple generate variants. The RL advantage is decomposed into a decision-level advantage, which compares the abstain and generate branches, and a content-level advantage, which ranks the generated memory variants against each other. A token-level gating mechanism applies the decision advantage to the decision token and the content advantage to the memory tokens, the latter only when generating memory is beneficial, which prevents harmful updates to guidance when abstention is preferred.
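The advantage decomposition and gating described above can be illustrated with a small sketch. The exact formulas are not given in this summary, so everything below is an assumption made for illustration: the group layout (one abstain rollout plus several generate variants), the gap `delta` between the mean generate reward and the mean abstain reward, the sign convention for the decision advantage, the standard-deviation normalization of the content advantage, and the `length_coef` value. The names `Rollout`, `shaped_reward`, and `token_advantages` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

GENERATE, ABSTAIN = "[GENERATE]", "[ABSTAIN]"


@dataclass
class Rollout:
    decision: str            # GENERATE or ABSTAIN
    num_content_tokens: int  # number of memory tokens after the decision token
    reward: float            # downstream reward, already length-penalized


def shaped_reward(success: bool, num_memory_tokens: int,
                  length_coef: float = 0.001) -> float:
    """Task success minus a length penalty (coefficient is an assumed value)."""
    return float(success) - length_coef * num_memory_tokens


def token_advantages(group: list[Rollout], eps: float = 1e-6) -> list[list[float]]:
    """Return per-token advantages for each rollout in one structured group.

    The first entry of each list belongs to the decision token; the rest
    belong to the content (memory) tokens.
    """
    gen_rewards = [r.reward for r in group if r.decision == GENERATE]
    abs_rewards = [r.reward for r in group if r.decision == ABSTAIN]
    if not gen_rewards or not abs_rewards:
        raise ValueError("group needs at least one abstain and one generate rollout")

    gen_mean = mean(gen_rewards)
    # Decision-level signal: does generating beat abstaining on this context?
    delta = gen_mean - mean(abs_rewards)
    gen_std = pstdev(gen_rewards)

    advantages = []
    for r in group:
        if r.decision == GENERATE:
            decision_adv = delta
            # Gate: only rank content variants when generating helps at all.
            content_adv = (r.reward - gen_mean) / (gen_std + eps) if delta > 0 else 0.0
        else:
            decision_adv = -delta
            content_adv = 0.0
        advantages.append([decision_adv] + [content_adv] * r.num_content_tokens)
    return advantages
```

With a group of four rollouts (one abstain, three generate), the decision token of every rollout receives a signal about routing, while memory tokens are only updated when the generate branch outperforms abstention on average. In a real training loop these per-token values would weight the policy-gradient terms of a GRPO-style loss.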

Training Regime: Stage 1 uses supervised learning on the offline experience bank with an autoregressive maximum-likelihood objective. Stage 2 starts from the Stage 1 weights and applies GRPO with paired structured rollouts, sampling G=4 outputs per context. The embeddings of the special decision tokens [GENERATE] and [ABSTAIN] are initialized symmetrically for balanced exploration. The remaining training details are not fully specified in the text available for this summary, but they involve multiple seeds, reward shaping, and length-regularization hyperparameters.
Evaluation Protocol: Experiments evaluate task success rate (percent tasks solved) across four benchmarks with held-out tasks and test splits. Baselines include no-memory base agents, retrieval-based RAG and Mem0, and RL-learned retrieval policies Memory-R1 and MemRL. Evaluations include ablation studies on training stages and RL objective components, comparison of text-only vs vision-language inputs, analysis of learned abstention behavior correlated with task difficulty, cross-agent transfer by evaluating on unseen agents without memory retraining, and efficiency measured by memory token usage.
Reproducibility: Code and data release are not explicitly mentioned in the provided content; the offline experience bank (JEF-Hinter) and benchmarks are public or referenced. Models are based on Qwen and GPT variants but exact checkpoint or parameter counts for Mem-π are not detailed. The paper describes architecture and training algorithm clearly enough for replication given access to data and compute.

Example workflow (WebArena): given a multi-step web navigation task query q and a screenshot o, the Stage 1 policy generates a candidate textual memory guidance m, and the Stage 2 RL-trained policy learns to weigh generating m against abstaining. If generating m improves downstream agent success, the policy outputs [GENERATE] followed by m (e.g., instructions to click certain page elements). Otherwise it outputs [ABSTAIN], letting the base agent proceed without added hints. Any generated guidance is prepended to the agent input, where it influences the agent’s subsequent actions and raises overall task success.
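At inference time, the workflow above reduces to parsing the memory policy's output y = d ⊕ m and deciding whether to prepend anything to the agent's input. The sketch below shows one way this could look. The function names, the "Hint:/Task:/Observation:" prompt template, and the choice to treat malformed output as abstention are all assumptions for illustration, not details from the paper.

```python
GENERATE, ABSTAIN = "[GENERATE]", "[ABSTAIN]"


def parse_memory_output(y: str) -> tuple[str, str]:
    """Split the memory policy output into (decision token, guidance text)."""
    text = y.strip()
    if text.startswith(GENERATE):
        return GENERATE, text[len(GENERATE):].strip()
    # Assumed fallback: anything else, including malformed output, is abstention.
    return ABSTAIN, ""


def build_agent_input(memory_output: str, task: str, observation: str) -> str:
    """Prepend guidance to the agent context only when the policy generated it."""
    decision, guidance = parse_memory_output(memory_output)
    parts = []
    if decision == GENERATE and guidance:
        parts.append(f"Hint: {guidance}")
    parts.append(f"Task: {task}")
    parts.append(f"Observation: {observation}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_agent_input(
        "[GENERATE] Open the Orders page before filtering by date.",
        "Find the most recent order",
        "<page accessibility tree>",
    ))
```

Keeping the decision token separate from the guidance text means the downstream agent never sees the routing token itself, only an optional hint.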

Technical innovations

Modeling agent memory as a dedicated generative policy πmem that jointly learns when to generate guidance and what to generate, shifting memory from static retrieval to adaptive on-demand synthesis.
A two-stage training framework combining offline experience distillation with decision-content decoupled reinforcement learning adaptation optimizing both generation utility and abstention behavior.
Decision-content decoupled policy optimization via structured counterfactual rollout groups to disentangle learning signals for routing decisions and generated content, addressing imbalance between decision-token and content-token gradients.
Incorporation of an explicit abstention token enabling the generative memory policy to reliably skip guidance generation when it is unhelpful or potentially harmful.
Extension of Group Relative Policy Optimization (GRPO) with token-level advantage decomposition and ∆-gating to selectively update the decision and content parts of memory output.

Datasets

WebArena — 812 tasks — public web navigation benchmark (Zhou et al., 2023)
WorkArena — 33 task templates, multiple seeds — enterprise web navigation on ServiceNow platform (Drouin et al., 2024)
LifelongAgentBench (LAB) — 500 tasks for DB and 500 for OS subsets — terminal tool use benchmark (Zheng et al., 2025)
ALFWorld — 3,553 train, 134 test tasks — text-based embodied household tasks (Shridhar et al., 2020b)
JEF-Hinter — experience bank distilling interaction traces into memory hints — proprietary offline dataset from training agents

Baselines vs proposed

Base Agent: WebArena SR = 28.4% vs Mem-π full = 34.6%
RAG (retrieval augmented generation): WebArena avg SR = 33.6% vs Mem-π = 43.1%
MemRL (RL-optimized retrieval): WebArena avg SR = 36.5% vs Mem-π = 43.1%
Mem-π Stage 1 (supervised only): WebArena SR = 35.0% vs full Mem-π (with RL) = 43.1%
Ablation no Stage 1 init: WebArena SR = 37.9% (−5.2 pp from full)
Ablation no structured rollout: WebArena SR = 38.3% (−4.8 pp)
Text-only Mem-π vs Vision-Language Mem-π on WebArena: 40.4% vs 43.1% avg SR
Cross-agent transfer: Mem-π on Qwen2.5-7B agent WebArena +18.2 pp vs RAG +4.2 pp
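Because this summary mixes absolute gains (percentage points) with relative improvements, a short calculation helps keep the two apart. The sketch below recomputes both from the WebArena success rates listed above; the helper name `gains` and the dictionary layout are illustrative choices, and only numbers that appear in this list are used.

```python
def gains(baseline_sr: float, proposed_sr: float) -> tuple[float, float]:
    """Return (absolute gain in pp, relative gain in %) between two success rates."""
    absolute_pp = round(proposed_sr - baseline_sr, 1)
    relative_pct = round(100 * (proposed_sr - baseline_sr) / baseline_sr, 1)
    return absolute_pp, relative_pct


# WebArena average success rates (%) as reported in the list above.
FULL_MEM_PI = 43.1
BASELINES = {
    "RAG": 33.6,
    "MemRL": 36.5,
    "Mem-pi Stage 1": 35.0,
}

if __name__ == "__main__":
    for name, sr in BASELINES.items():
        abs_pp, rel_pct = gains(sr, FULL_MEM_PI)
        print(f"{name}: +{abs_pp} pp absolute, +{rel_pct}% relative")
```

The same helper applies to the ablation rows: subtracting an ablated success rate from 43.1 gives the pp drop quoted in parentheses.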

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.21463.

Fig 2: Overview of Mem-π. We train the generative memory policy πmem in two stages. Experience Distillation

Fig 8: Sample experience entries drawn from the offline bank E used to train Mem-π, one per benchmark. Each entry

Fig 7: Abstention statistics. Left: final per-benchmark abstention rates after Stage 2 training. Right: abstention

Limitations

No explicit robustness or adversarial memory corruption evaluation conducted; threat model does not consider malicious attackers.
Evaluation limited to specific benchmarks involving web navigation, terminal commands, and text-based embodied tasks; may not generalize to all LLM agent domains.
Dependency on quality and coverage of offline experience bank; no detailed analysis of performance under sparse or noisy memory data.
Stage 2 reinforcement learning requires task success feedback signal, which may not be available in all real-world scenarios.
No ablation or comparison on model scale of the memory policy or downstream agent to assess resource-performance tradeoffs.
Method currently only demonstrated on large generative LLMs with specialized memory policies; applicability to smaller models or other modalities untested.

Open questions / follow-ons

How does Mem-π perform under adversarial manipulation of memory inputs or environment observations?
Can the generative memory policy be efficiently updated or fine-tuned online to incorporate new experience without offline distillation?
What are the trade-offs of scaling memory policy model size relative to the downstream agent on performance and efficiency?
Can the decision-content decoupled RL framework extend to multimodal or hierarchical memory outputs beyond text guidance?

Why it matters for bot defense

Bot-defense and CAPTCHA systems relying on automated agents or AI bots could benefit from Mem-π's approach to adaptive memory management. Unlike static retrieval-based memory that risks memorizing obsolete or context-mismatched hints, a generative memory policy that decides when and what to generate can provide context-aware, dynamic guidance to improve agent decision-making reliability. Especially for CAPTCHA-solving or evasion tasks where environmental context shifts rapidly, the ability to abstain from generating misleading memory and produce concise, relevant cues is valuable.

Integrating adaptive memory generation could also improve interpretability and traceability of agent actions in bot detection systems by clarifying when memory influence is applied. However, practitioners should consider the reliance on reinforcement feedback for training, which may not always be obtainable in real-time attack detection scenarios. The technique’s demonstrated generalization to stronger unseen downstream agents suggests it could be adaptable across various CAPTCHA-solving architectures.

Cite

bibtex

@article{arxiv2605_21463,
  title={ Mem-$π$: Adaptive Memory through Learning When and What to Generate },
  author={ Xiaoqiang Wang and Chao Wang and Hadi Nekoei and Christopher Pal and Alexandre Lacoste and Spandana Gella and Bang Liu and Perouz Taslakian },
  journal={arXiv preprint arXiv:2605.21463},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.21463}
}
