Task-Adaptive Retrieval over Agentic Multi-Modal Web Histories via Learned Graph Memory

Source: arXiv:2604.07863 · Published 2026-04-09 · By Saman Forouzandeh, Kamal Berahmand, Mahdi Jalili

TL;DR

This paper addresses the challenge of retrieving relevant information from long, multi-modal web interaction histories in agentic web navigation tasks. Unlike traditional document retrieval, relevance here dynamically evolves with the task state and depends on heterogeneous input modalities (visual screenshots, HTML text, structured knowledge). Existing retrieval methods rely mainly on static similarity thresholds or fixed-size buffers, which fail to capture the task-adaptive and modality-specific temporal nature of memory relevance. The authors propose Adaptive Cross-Modal Graph Memory (ACGM), a framework that learns sparse, task-adaptive graph structures over agent histories by optimizing a neural relevance predictor via policy gradients using downstream task success as reward.

ACGM models modality-specific temporal decay rates (visual signals decay 4.3× faster than text), learns sparse connectivity (3.2 edges per node), and achieves efficient sub-linear O(log T) retrieval. Evaluated on three large-scale multi-modal web navigation benchmarks (WebShop, VisualWebArena, Mind2Web), ACGM substantially outperforms 19 strong baselines—including GPT-4o, dense retrievers, neural re-rankers, and graph-based methods—achieving up to 82.7 nDCG@10 (+9.3 points) and 89.2% Precision@10 (+7.7). The authors also provide detailed ablations showing that learned edge prediction, modality-specific temporal modeling, and graph sparsity are critical to performance and efficiency. Overall, the work introduces a new paradigm for multi-modal history retrieval by jointly learning retrieval structure with temporal and task adaptivity.

Key findings

ACGM achieves 82.7 nDCG@10 on WebShop, outperforming GPT-4o by +9.3 points and MAHA by +9.1 points (p < 0.001).
Learned graph sparsity reduces average node degree from 8.7 to 3.2 edges, enabling O(log T) retrieval with 4.8× speedup over flat dense retrieval.
Visual modality decays 4.3× faster than textual, with learned temporal decay rates λ_v = 0.47 vs. λ_x = 0.11, improving Precision@10 by 8.7 points compared to uniform decay.
Two-stage training (supervised edge pre-training plus policy-gradient fine-tuning) reduces gradient variance by 38% and yields +11.5 nDCG gain over supervised only.
Removing learned edges and using dense graphs drops performance by Δ = −16.6 nDCG@10; removing temporal decay drops by Δ = −14.0.
The model maintains high retrieval precision (86.3%) and low latency (14.7 ms at T=100) with only 2.1 GB index memory, compared to 48 ms and 6.3 GB for dense BLIP-2.
Retrieval quality strongly correlates with downstream task success (Pearson ρ = 0.93, R^2 = 0.991).
ACGM generalizes zero-shot across task and website distribution shifts without retraining.

Threat model

The work does not present a formal adversarial threat model; it assumes a cooperative agent aiming to retrieve relevant past observations from multi-modal histories for improved task success. There is no explicit adversary attempting to manipulate retrieval or corrupt memory. The adversary capabilities and restrictions are not defined because the focus is on learning effective adaptive retrieval rather than adversarial defense.

Methodology — deep read

The paper tackles retrieval in the context of multi-modal web navigation agents, aiming to find observations from a growing history H_t of visual (screenshots), textual (HTML text), and structured (knowledge graph) data relevant for the current task state at time t. The key challenges addressed are dynamic, task-dependent relevance that changes during navigation, heterogeneous modality-specific temporal decay, and retrieval efficiency with long histories.

The threat model involves agents navigating web tasks with histories up to 100+ steps. The adversary is not explicitly modeled; the focus is on improving the agent's internal retrieval for task success.

Data consists of three public benchmarks: WebShop (1180 product search tasks, avg 24.3 steps), VisualWebArena (910 tasks, avg 31.7 steps), and Mind2Web (2350 tasks across 137 websites, avg 7.3 steps). Expert demonstrations and relevance labels are provided by expert action sequences and temporal window annotations (5-step relevance window).

Architecture-wise, ACGM first encodes observations to frozen pre-trained embeddings (CLIP for visuals, RoBERTa for text) projected jointly to 512-dim vectors. It learns a neural relevance predictor g_φ, a 2-layer MLP (512, 256, 1) with ReLU, that estimates relevance probability between observation pairs based on embeddings, cosine similarity, modality indicators, and temporal distance Δt.

Temporal decay rates λ_m for each modality m ∈ {visual, text, knowledge} are learned and regularized toward human annotation-grounded priors (λ_v=0.47, λ_x=0.11, λ_k=0.23). Decay modulates attention weights exponentially with time difference, enabling more realistic modeling of memory decay.

The learned relevance graph is constructed by applying a threshold P(relevant(i,j)) > 0.5 on g_φ outputs to define sparse edges—only ~3.2 edges per node—versus 8.7 in dense graphs. Older observations are organized hierarchically in a 4-ary tree via online k-means clustering for scalable O(log T) retrieval with beam search.

Training follows a two-stage approach:

Stage 1: Supervised pre-training of g_φ for 50K steps on labeled edge data derived from embedding similarity (>0.6 cosine) and action type similarity. Uses 12,500 expert demonstrations of avg length 24.3 steps. Training on 8xA100 GPUs (~18h). This seeds the model with realistic edges.
Stage 2: Policy-gradient optimization (REINFORCE) for 50K steps with task success binary reward and exponential moving average baseline for variance reduction. Fine-tunes g_φ and decay rates to maximize downstream success, discovering task-adaptive relevance. Uses AdamW optimizer, learning rate 10⁻⁵.

Evaluation uses held-out 3-fold cross-validation with multiple IR metrics (nDCG@10, MAP@10, Precision@10, Recall@10, MRR) plus downstream task success measurement. Statistical significance tested with paired t-tests, Bonferroni corrected across folds.

A concrete example: For a WebShop product search trajectory of 24 steps, ACGM constructs a sparse graph over all past observations, applies learned modality-specific decay and graph edges to identify a small subset of relevant historical screenshots, HTML texts, and knowledge triples, leveraging them to inform the action policy and successfully complete the purchase task. During training, g_φ is first supervised on expert edge labels, then fine-tuned by reinforcing edges that improve purchase success, iteratively improving graph connectivity.

Code and data splits are made publicly available, enabling reproducibility. The dataset benchmarks are mostly open. Model weights and exact seeds are not explicitly stated, but hardware and training time are detailed. Overall, the approach tightly couples the representation, learned temporal and graph structure, and training via task rewards to jointly optimize retrieval for interactive multi-modal web navigation.

Technical innovations

A learned neural relevance predictor optimized end-to-end via policy gradients from downstream task success to construct task-adaptive sparse retrieval graphs.
Modality-specific temporal decay rates for visual, textual, and knowledge modalities initialized from human annotations and fine-tuned during training.
Efficient hierarchical retrieval method organizing older observations in a 4-ary tree, enabling O(log T) time retrieval scaling for long interaction histories.
Two-stage training combining supervised edge pre-training with reinforcement policy-gradient fine-tuning to reduce variance and discover task-specific retrieval patterns.

Datasets

WebShop — 1,180 product search tasks — public benchmark with expert demonstrations
VisualWebArena — 910 real-world visual web navigation tasks — public benchmark with expert annotations
Mind2Web — 2,350 open-ended tasks across 137 websites and 31 domains — public benchmark with multi-split evaluation

Baselines vs proposed

GPT-4o: nDCG@10 = 73.4 vs. ACGM: 82.7 (+9.3)
MAHA (graph-based): nDCG@10 = 73.6 vs. ACGM: 82.7 (+9.1)
BLIP-2 Retrieval: Precision@10 = 77.1% vs. ACGM: 89.2% (+12.1%)
Dense Retrieval (flat): Latency at T=100 = 48 ms vs. ACGM: 14.7 ms (3.3× speedup)
Uniform temporal decay baseline: Precision@10 drop by 8.7 points compared to modality-specific decay in ACGM
Supervised only training (no policy gradient): nDCG@10 drops by 11.5 compared to full ACGM
Dense graph (no sparsity): nDCG@10 drops by 16.6 points
No temporal decay: nDCG@10 drops by 14.0 points

Limitations

Policy-gradient training relies on binary task success reward, which may limit fine-grained learning signals and could lead to credit assignment challenges despite variance reduction techniques.
While zero-shot transfer generalization across task and website distributions is demonstrated, the robustness under adversarial or out-of-distribution interactions remains untested.
The benchmarks involve relatively moderate trajectory lengths (avg 7-31 steps), leaving open how the approach scales qualitatively and quantitatively to very long histories beyond T=100 in complex real-world settings.
The approach requires large computational resources (8×A100 GPUs, ~40 GPU hours training) which may limit accessibility for smaller teams or real-time adaptation scenarios.
Human annotation-derived priors for temporal decay may not generalize across all domains or modalities; fine-tuning partly mitigates this but requires downstream task rewards.
Exact code and trained model reproducibility details are limited regarding random seeds and hyperparameter tuning beyond what is reported.

Open questions / follow-ons

How robust is ACGM retrieval under adversarial perturbations or noisy/corrupted observations in the multi-modal history?
Can the learned graph memory structure dynamically adapt online to changing task distributions or user intents without retraining?
How does the approach scale and maintain efficiency with histories orders of magnitude longer than 100 steps in open-ended web navigation?
Can fine-grained, shaped reward functions beyond binary task success further improve retrieval learning stability and precision?

Why it matters for bot defense

For bot-defense or CAPTCHA lifecycle assessment practitioners, ACGM offers insights into how multi-modal agent interaction histories can be efficiently and adaptively retrieved in a task-specific manner. Bot behaviors, often involving long action sequences and multi-modal signals, pose challenges for fixed retrieval or detection heuristics. The methodology here—learning sparse relevance graphs with modality-specific temporal decay optimized directly for downstream success—could inspire improved behavioral fingerprinting or anomaly detection systems for bots that rely on sequence context.

Furthermore, the idea of coupling task-level feedback to retrieval relevance suggests bot detectors might dynamically adapt which historical signals to prioritize based on evolving attack strategies. The efficient hierarchical retrieval with low latency supports real-time defense scenarios. While this work is not about adversarial robustness per se, its modular design and strong ablation insights provide a technical foundation to build adaptive, multi-modal history retrieval systems crucial for advanced bot defense and CAPTCHA challenge analysis.

Cite

bibtex

@article{arxiv2604_07863,
  title={ Task-Adaptive Retrieval over Agentic Multi-Modal Web Histories via Learned Graph Memory },
  author={ Saman Forouzandeh and Kamal Berahmand and Mahdi Jalili },
  journal={arXiv preprint arXiv:2604.07863},
  year={ 2026 },
  url={https://arxiv.org/abs/2604.07863}
}

Task-Adaptive Retrieval over Agentic Multi-Modal Web Histories via Learned Graph Memory ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​