MemoryWAM: Efficient World Action Modeling with Persistent Memory

Source: arXiv:2606.20562 · Published 2026-06-18 · By Sizhe Yang, Juncheng Mu, Tianming Wei, Chenhao Lu, Xiaofan Li, Linning Xu et al.

TL;DR

MemoryWAM addresses a core challenge in world action models (WAMs) for robotic manipulation: balancing robust long-term memory of past observations with computational efficiency during inference. Prior WAMs either rely on fixed-size sliding-window memory, which limits task performance on non-Markovian tasks dependent on long-range history, or maintain full historical context with costly latency and GPU memory usage growing linearly with trajectory length. MemoryWAM introduces a hybrid memory architecture inspired by human cognition comprising three components: a sliding window of recent frames for immediate control, event-boundary anchor frames preserving salient initial task context, and a small number of learned gist tokens that compactly summarize long-range history. This design reduces temporal complexity from O(N) to O(N/d), where d is a compression ratio, maintaining persistent memory efficiently.

Technically, MemoryWAM couples a pretrained video diffusion transformer that encodes video latents with a separate action diffusion transformer that generates future actions, both operating within a mixture-of-transformers framework. During inference, recent and anchor frames are retained in full, while older frames are compressed into gist tokens that attend selectively to their corresponding frame and to history, enabling compact yet task-relevant memory retrieval. On RMBench, a challenging long-horizon, memory-dependent robotic manipulation benchmark, MemoryWAM achieves an average success rate of 83.0%, outperforming LingBot-VA with full-history memory (78.2%) and the sliding-window baselines, which score below 11%. It also shows significantly lower inference latency and GPU memory usage than full-history methods, scaling to trajectories of up to 1600 frames. Real-world experiments on dual-arm robot tasks confirm these efficiency and performance advantages, demonstrating MemoryWAM's suitability for practical long-horizon, memory-aware robotic control.

Key findings

MemoryWAM reduces the growth of inference latency and GPU memory usage with trajectory length from O(N) to O(N/d), where d=15 is the compression ratio between the full tokens and the gist tokens kept per frame.
On RMBench, MemoryWAM achieves an average success rate of 83.0% across 9 tasks, outperforming LingBot-VA (78.2%) with full-history memory and sliding-window baselines π0.5 (10.4%) and FastWAM (5.9%).
In real-world robotic tasks (Shell Game, Look and Press), MemoryWAM achieves 18/20 and 15/20 successes, surpassing LingBot-VA (13/20 and 14/20) and observation-only π0.5.
Ablation studies show that removing gist tokens causes the largest drop in task success (to a 40% average, versus 92.5% for the full MemoryWAM), confirming the importance of compact long-term history.
A hybrid memory of recent frames, event-boundary anchor frames, and gist tokens gives the best balance between retaining task-relevant history and computational efficiency; it outperforms full-attention memory, which yields lower policy performance despite keeping more tokens.
Compared to other memory designs, hybrid memory achieves the same 87% success rate on the Press Button task as full attention but with substantially reduced inference latency and GPU memory use (Fig 4).
Even with 1600-frame sequences, hybrid memory remains more efficient than recurrent neural networks and test-time training-based mechanisms while maintaining accuracy.
Model size is approximately 6B parameters, combining a 5B video DiT and 1B action DiT, trained with combined video prediction and action denoising objectives.
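The O(N) to O(N/d) claim above can be made concrete by counting cached tokens. The following is a minimal sketch, not the paper's code. It assumes that d is the ratio of full tokens to gist tokens per frame, so with 8 gist tokens and d=15 each frame would carry 120 full tokens; that per-frame count is derived here, not quoted from the paper. The function names are hypothetical, and the window size (4), anchor count (2), and gist count (8) are the values reported in the methodology.

```python
def full_history_tokens(n_frames: int, tokens_per_frame: int = 120) -> int:
    """Cached tokens when every frame keeps all of its tokens."""
    return n_frames * tokens_per_frame


def hybrid_tokens(
    n_frames: int,
    tokens_per_frame: int = 120,  # assumption: d * gist_per_frame = 15 * 8
    n_recent: int = 4,
    n_init: int = 2,
    gist_per_frame: int = 8,
) -> int:
    """Cached tokens when only anchor and recent frames stay in full.

    Every other frame is assumed to contribute gist tokens only.
    """
    n_full = min(n_frames, n_recent + n_init)
    n_compressed = n_frames - n_full
    return n_full * tokens_per_frame + n_compressed * gist_per_frame


for n in (100, 400, 1600):
    full = full_history_tokens(n)
    hybrid = hybrid_tokens(n)
    print(f"N={n}: full={full}, hybrid={hybrid}, ratio={full / hybrid:.1f}")
```

Under these assumptions the ratio between the two counts approaches d as N grows, because the fixed cost of the full-token anchor and recent frames becomes negligible.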

Threat model

The paper is primarily focused on robotic manipulation through memory modeling rather than security or adversarial threats. The implicit adversary, if any, is the challenge of non-Markovian tasks where critical information is outside the recent observation window. No malicious adversary or attacker is modeled.

Methodology — deep read

Threat model & assumptions: MemoryWAM assumes a non-adversarial setting typical in robot manipulation, where the agent must make memory-dependent decisions over long horizons. The system must efficiently recall task-relevant past observations despite occlusions or delayed rewards. No explicit adversarial threat is modeled.
Data: The model is evaluated primarily on RMBench, a memory-dependent robotic manipulation benchmark comprising 9 dual-arm tasks with varying memory requirements. Each task is trained with 50 expert demonstration episodes, and performance is measured by success rate over 100 rollouts. Input observations come from three cameras concatenated into a mosaic image, encoded into latent tokens by a 3D causal video VAE. Actions and proprioception states are continuous 14-dimensional vectors.
Architecture: MemoryWAM uses a mixture-of-transformers (MoT) structure with two main components: (a) a pretrained video diffusion transformer (DiT) that encodes video frames into latent tokens and models dynamics, and (b) a separate action DiT that predicts action chunks conditioned on the video representation and text instructions. The video DiT has 24 attention heads, 30 transformer blocks, and hidden dimension 3072. The action DiT also has 24 heads and 30 blocks but a reduced hidden dimension of 1024, yielding about 1B parameters. The model incorporates a causal video VAE to compress visual input.
Training regime: Both video and action DiTs are trained using continuous flow-matching diffusion objectives with 1000 timesteps. Noise schedules are logit-normal distributions with different shifts for video and action. Gaussian noise augmentation is applied to video latents during training for robustness. The optimizer is AdamW with learning rate 2e-4 and weight decay 0.01, and the batch size is 1 per GPU across 8 GPUs. The loss is a weighted sum of the video prediction and action denoising MSE losses.
Hybrid memory mechanism: During inference, MemoryWAM builds a hybrid memory cache with three parts: (1) a sliding window of the N_recent = 4 most recent video frames, kept with full tokens for high-fidelity short-term context; (2) a small fixed set of N_init = 2 event-boundary anchor frames, kept with full tokens to preserve the salient initial task state; and (3) compressed gist tokens (M = 8 per frame) that summarize each older frame through learnable gist token vectors. The full tokens of older frames are evicted to save memory. The action DiT attends to this hybrid cache for memory-dependent action generation. This reduces time and space complexity from O(N) to O(N/d), with compression ratio d = 15.
Evaluation protocol: Evaluation metrics include success rate per task on RMBench, inference latency (time per action chunk), and GPU memory cost measured on a per-layer basis for different memory mechanisms and sequence lengths. Baselines include observation-to-action model π0.5, efficient sliding-window WAM FastWAM, and full-history WAM LingBot-VA. Ablations analyze contributions of anchor frames, gist tokens, and sliding window components.
Reproducibility: The paper reports using the pretrained Wan2.2-TI2V-5B video DiT and 3D causal video VAE models, with details on hyperparameters. The RMBench dataset is public and referenced via arXiv:2603.01229. The paper does not explicitly state that code or frozen model weights are released. Training seeds are not mentioned. Inference uses the hybrid memory masking so that it matches training conditions.
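The training regime described above (logit-normal timestep sampling with a shift, flow-matching MSE, and a weighted sum of the video and action losses) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the linear noising path, the shift formula, the loss weights, and all function names are assumptions, and the discretization into 1000 timesteps is omitted.

```python
import math
import random


def sample_timestep(rng, mean=0.0, std=1.0, shift=1.0):
    """Draw t in (0, 1) from a logit-normal distribution, then shift it.

    The shift formula is a common flow-matching convention (assumed here).
    """
    t = 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))
    return shift * t / (1.0 + (shift - 1.0) * t)


def flow_matching_mse(clean, predict_velocity, t, rng):
    """MSE between predicted and target velocity for one sample.

    Assumes the linear path x_t = (1 - t) * clean + t * noise,
    whose target velocity is noise - clean.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    x_t = [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]
    target = [n - c for c, n in zip(clean, noise)]
    pred = predict_velocity(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(clean)


def combined_loss(video_loss, action_loss, w_video=1.0, w_action=1.0):
    """Weighted sum of the two branch losses (weights are placeholders)."""
    return w_video * video_loss + w_action * action_loss


rng = random.Random(0)


def zero_model(x_t, t):
    """Stand-in predictor that always outputs a zero velocity."""
    return [0.0] * len(x_t)


# A 14-dimensional action vector, matching the action size in the summary.
action_loss = flow_matching_mse([0.1] * 14, zero_model, sample_timestep(rng, shift=1.0), rng)
# A tiny stand-in for a video latent; real latents are far larger.
video_loss = flow_matching_mse([0.0] * 32, zero_model, sample_timestep(rng, shift=5.0), rng)
print(combined_loss(video_loss, action_loss))
```

The two calls use different shift values only to mirror the statement that the video and action branches use different shifts; the specific numbers are placeholders.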

End-to-end example: In the Press Button task, MemoryWAM processes a 1600-frame sequence by keeping the 4 most recent frames with full tokens, keeping the 2 initial frames as anchors, and compressing the remaining 1594 frames into 8 gist tokens each. The video DiT updates a KV cache with this hybrid memory structure; the action DiT then denoises action tokens conditioned on this compact yet persistent memory to output a 16-step action chunk. This keeps the task success rate at 87% while holding inference latency and GPU memory usage low relative to full-history methods.
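The bookkeeping in this example can be sketched as a small cache with an eviction rule. This is a hypothetical illustration, not the paper's KV-cache code: the class name and methods are invented for this sketch, anchors are simply taken to be the first frames (as in the example above, without modeling event-boundary detection), gist tokens are assumed to be available when a frame arrives, and 120 full tokens per frame is an assumed figure derived from d=15 and 8 gist tokens.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HybridMemoryCache:
    """Toy cache: anchors and recent frames in full, older frames as gists."""

    n_recent: int = 4
    n_init: int = 2
    anchors: list = field(default_factory=list)
    gists: list = field(default_factory=list)
    recent: deque = field(default_factory=deque)

    def append(self, full_tokens, gist_tokens):
        """Add one frame; evict the oldest recent frame down to its gist."""
        if len(self.anchors) < self.n_init:
            self.anchors.append(full_tokens)
            return
        self.recent.append((full_tokens, gist_tokens))
        if len(self.recent) > self.n_recent:
            _, evicted_gist = self.recent.popleft()
            self.gists.append(evicted_gist)

    def num_tokens(self):
        """Tokens an attention layer would see under this scheme."""
        full = sum(len(a) for a in self.anchors)
        full += sum(len(f) for f, _ in self.recent)
        return full + sum(len(g) for g in self.gists)


cache = HybridMemoryCache()
for frame_id in range(1600):
    # Integers stand in for token embeddings; only the counts matter here.
    cache.append([frame_id] * 120, [frame_id] * 8)

print(len(cache.anchors), len(cache.recent), len(cache.gists), cache.num_tokens())
```

After 1600 frames the cache holds 2 anchor frames, 4 recent frames, and gist tokens for 1594 older frames, matching the split described in the example.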

Technical innovations

Hybrid memory combining a sliding window of recent frames, event-boundary anchor frames, and learnable gist tokens to maintain persistent long-range history compactly.
A tailored attention mechanism enabling actions to attend jointly to high-fidelity short-term context, anchor frames, and compressed gist tokens within a mixture-of-transformers video-action diffusion framework.
Compression of long-range historical video frames into a small fixed number of learnable gist tokens per frame, reducing attention and KV cache complexity from O(N) to O(N/d), with compression ratio d=15.
Training with noisy latent mixing in video diffusion branch to improve robustness to imperfect conditioning at test time, bridging video prediction supervision and efficient inference without explicit video generation.

Datasets

RMBench — simulation dual-arm robotic manipulation benchmark with memory-dependent long-horizon tasks — public, referenced via arXiv:2603.01229

Baselines vs proposed

π0.5 observation-to-action model: average success rate = 10.4% vs MemoryWAM 83.0%
FastWAM efficient sliding window WAM: average success rate = 5.9% vs MemoryWAM 83.0%
LingBot-VA full-history WAM: average success rate = 78.2% vs MemoryWAM 83.0%
Press Button task success rate: full attention = 87%, hybrid memory = 87%, RNN memory and TTT memory = low (exact values not specified)
Inference latency at 1600 frames: hybrid memory < RNN and TTT memory < full attention (exact ms not specified)
Real-world Shell Game task: LingBot-VA 13/20 vs MemoryWAM 18/20 successes
Real-world Look and Press task: LingBot-VA 14/20 vs MemoryWAM 15/20 successes
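The real-world results are counts out of 20 trials, so it can help to convert them to rates with an uncertainty range. The sketch below uses the standard Wilson score interval at roughly 95% confidence (z = 1.96). It is an illustrative calculation on the counts listed above and is not an analysis reported by the paper; the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


results = {
    "Shell Game / MemoryWAM": (18, 20),
    "Shell Game / LingBot-VA": (13, 20),
    "Look and Press / MemoryWAM": (15, 20),
    "Look and Press / LingBot-VA": (14, 20),
}

for name, (k, n) in results.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k / n:.0%} (interval {lo:.2f} to {hi:.2f})")
```

With only 20 trials per task the intervals are wide, which is worth keeping in mind when comparing the one-trial difference on Look and Press.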

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.20562.

Fig 1: Overview. Prior WAMs typically face a memory-efficiency trade-off: sliding-window

Fig 2 (page 1).

Fig 3 (page 1).

Fig 4 (page 1).

Fig 5 (page 1).

Fig 6 (page 1).

Fig 7 (page 1).

Fig 8 (page 1).

Limitations

MemoryWAM inherits limitations of video diffusion models, including limited semantic reasoning and understanding capacity.
The compression of long-term memory via gist tokens may omit fine-grained details necessary for highly precise or novel manipulations.
Model evaluation focuses on scripted benchmark tasks and limited real-world tests; generalization to highly unstructured or adversarial environments is untested.
Hybrid memory parameters (number of gist tokens, anchor frames, window size) are fixed; adaptive or learned memory management remains unexplored.
Code and pretrained weights are not explicitly released, potentially hindering reproducibility and wider adoption.

Open questions / follow-ons

Can hybrid memory architectures be extended or adapted to dynamically manage memory size or compression based on task demands or environment complexity?
How would integrating higher-level semantic or symbolic reasoning modules complement MemoryWAM’s diffusion-based visual dynamics to improve reasoning and generalization?
Would joint training of video and action diffusion with end-to-end policy gradients further improve manipulation robustness compared to purely supervised diffusion losses?
What are the effects of memory compression on long-term manipulation tasks involving unseen objects or drastic environment changes?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, MemoryWAM’s approach illustrates a strategy for compressing long-term memory in sequential decision-making models while balancing efficiency and context retention. Although the work targets robotic manipulation, its hybrid design (recent detailed context, salient event anchors, and compressed gist representations) could inspire detection models or CAPTCHA challenges that must handle long behavioral histories or multi-step interactions without prohibitive inference cost. It also suggests ways to keep model cost manageable as sequence length grows while retaining essential contextual cues, a challenge that arises when modeling complex user behavior or bot interactions over extended sessions. Finally, selectively preserving key frames and compressing the remainder could translate into practical schemes for maintaining historical user context in CAPTCHA systems with reduced latency and storage overhead.

Cite

bibtex

@article{arxiv2606_20562,
  title={MemoryWAM: Efficient World Action Modeling with Persistent Memory},
  author={Sizhe Yang and Juncheng Mu and Tianming Wei and Chenhao Lu and Xiaofan Li and Linning Xu and Zhengrong Xue and Zhecheng Yuan and Dahua Lin and Jiangmiao Pang and Huazhe Xu},
  journal={arXiv preprint arXiv:2606.20562},
  year={2026},
  url={https://arxiv.org/abs/2606.20562}
}
