M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

Source: arXiv:2606.07402 · Published 2026-06-05 · By Zhengjun Huang, Wenxuan Liu, Zhoujin Tian, Wei Chen, Junle Chen, Yuqian Wu et al.

TL;DR

M3Exam addresses the lack of realistic benchmarks for multimodal conversational memory in user-agent interactions, where heterogeneous multimodal data accumulates over multiple sessions. Previous benchmarks mostly cover human-human dialogues with sparse or static visuals; they do not capture reasoning over evolving multimodal file interactions or interpreting unstated user context. M3Exam is a query-centric benchmark of 239 multi-session conversations across 15 personas, with 3,025 dialogue rounds and 1,799 multimodal artifacts, paired with 5,150 evaluation queries spanning retrieval, multi-hop reasoning, implicit inference, and thematic domain knowledge. Evaluations of state-of-the-art multimodal large language models (MLLMs) and multimodal memory systems reveal persistent weaknesses in cross-modal grounding and implicit reasoning.

Key findings

M3Exam includes 239 multi-session conversations (15 personas) accumulating 1,799 multimodal artifacts and 5,150 evaluation questions.
34.9% of queries require cross-modal reasoning and file management; 14.9% require implicit inference from unstated context.
The strongest frontier MLLM (GLM-5.1) achieves only 54.9% overall F1, which shows the benchmark's difficulty.
Agentic-memory systems improve over base backbones but still lag frontier MLLMs, highlighting challenges in multimodal grounding.
M3Proctor, the paper's proposed memory system, reaches an overall LLM-judge score of 0.484, reported as a 13% absolute accuracy gain over prior agentic-memory baselines.
M3Proctor achieves this while reducing index construction time by 80% and tokens retrieved per query by over 70%.
Multimodal information lift reaches up to +0.30 F1 on multimodal reasoning questions, showing textual surrogates alone are insufficient.
M3Proctor’s modality cascade triggers on 68% of file management and 51% of multimodal reasoning queries, boosting accuracy selectively.

Threat model

n/a — This paper addresses multimodal memory and reasoning benchmarks for language agents rather than adversarial security or threats.

Methodology — deep read

The task formalized by M3Exam is to answer queries over an evolving multimodal conversational memory representing user-agent interactions across multiple sessions. Data provenance starts from a persona (biography), a core event narrative, and pools of multimodal files (images, documents, charts). An LLM generates chronologically ordered core and distractor events with attached queries and image keywords, audited by automated self-checks and human inspection. Events are synthesized into multi-turn conversations using a sliding window exposing event batches incrementally, producing 3,025 dialogue rounds grouped into 239 sessions.
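The incremental exposure of event batches can be pictured with a small sketch. The `window` and `stride` parameters, the function name, and the toy event labels are illustrative assumptions; this summary does not state the settings the paper actually uses.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def sliding_batches(events: Sequence[T], window: int, stride: int) -> Iterator[List[T]]:
    """Yield overlapping batches of chronologically ordered events.

    `window` and `stride` are hypothetical knobs chosen for illustration.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    start = 0
    while start < len(events):
        yield list(events[start:start + window])
        if start + window >= len(events):
            break  # this batch already reaches the end of the timeline
        start += stride


# Toy timeline: core events (C*) interleaved with distractor events (D*).
timeline = ["C1", "D1", "C2", "C3", "D2"]
batches = list(sliding_batches(timeline, window=3, stride=2))
for batch in batches:
    print(batch)
```

Each batch would be handed to the conversation generator in turn, so later dialogue rounds can refer back to events that overlap with the previous batch.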

Each session includes dialogues interleaved with multimodal files cumulatively visible up to the query turn. Metadata summaries are appended to memory history to facilitate long-term context. A typed question bank of 5,150 queries is generated with LLM prompts and human validation, spanning eight categories: memorizing (single-session retrieval, file management, factual judgment), reasoning (multi-session, multimodal, temporal), and interpreting (implicit inference, thematic reasoning). Each question is annotated with supporting-fact rounds across sessions and modalities, requiring retrieval and complex reasoning.
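One way to hold the typed question bank in code is sketched below. The three groups and eight categories come from the description above; the class names, field names, and the example question are assumptions made for illustration and are not the benchmark's released schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Ability(Enum):
    MEMORIZING = "memorizing"
    REASONING = "reasoning"
    INTERPRETING = "interpreting"


class QuestionType(Enum):
    SINGLE_SESSION_RETRIEVAL = ("single-session retrieval", Ability.MEMORIZING)
    FILE_MANAGEMENT = ("file management", Ability.MEMORIZING)
    FACTUAL_JUDGMENT = ("factual judgment", Ability.MEMORIZING)
    MULTI_SESSION = ("multi-session reasoning", Ability.REASONING)
    MULTIMODAL = ("multimodal reasoning", Ability.REASONING)
    TEMPORAL = ("temporal reasoning", Ability.REASONING)
    IMPLICIT_INFERENCE = ("implicit inference", Ability.INTERPRETING)
    THEMATIC = ("thematic reasoning", Ability.INTERPRETING)

    def __init__(self, label: str, ability: Ability) -> None:
        self.label = label
        self.ability = ability


@dataclass(frozen=True)
class Question:
    text: str
    answer: str
    qtype: QuestionType
    # Supporting-fact rounds as (session_id, round_id) pairs (hypothetical layout).
    supporting_rounds: Tuple[Tuple[int, int], ...] = ()
    modalities: Tuple[str, ...] = ()

    def is_cross_session(self) -> bool:
        """True when the supporting facts come from more than one session."""
        return len({session for session, _ in self.supporting_rounds}) > 1


# Made-up example in the spirit of the espresso scenario.
q = Question(
    text="Which earlier photo showed the crema fading fastest?",
    answer="the photo shared in the second session",
    qtype=QuestionType.MULTIMODAL,
    supporting_rounds=((2, 4), (5, 1)),
    modalities=("image", "document"),
)
print(q.qtype.label, q.qtype.ability.value, q.is_cross_session())
```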

Baseline evaluation includes frontier black-box MLLMs answering from labeled evidence without external memory and agentic-memory systems built on the Qwen-2.5-VL-7B backbone that store and retrieve external memory. Agentic-memory baselines are text-only or multimodal retrieval-augmented generation (RAG) systems. Metrics include Exact Match, token-level F1, BLEU-1, and LLM-as-a-Judge score (using Qwen-2.5-VL-32B Instruct).
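For readers unfamiliar with the string-overlap metrics, here is a minimal sketch of Exact Match and token-level F1. The normalization (lowercasing and stripping punctuation) is a common convention assumed here; the paper's exact normalization is not given in this summary, and BLEU-1 and the LLM-judge score are omitted.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace (assumed scheme)."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


pred = "The crema faded after 30 seconds"
ref = "crema faded in 30 seconds"
print(exact_match(pred, ref))          # 0.0
print(round(token_f1(pred, ref), 4))   # 0.7273 (4 shared tokens, P = 4/6, R = 4/5)
```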

The method M3Proctor introduces modality-aware multimodal memory designed to address indiscriminate raw visual consumption in prior work. At index construction, each raw modality (image, document, chart) is projected into a textual surrogate with modality tags, including captions and transcription of chart data. This enables a text-retriever to operate on surrogates. At inference, M3Proctor detects query modality bias (which modalities are needed) with an instruction-tuned LLM, then re-ranks retrieved chunks by modality-biased scores favoring relevant modalities. A cascaded answering pipeline first attempts to answer using text surrogates alone and escalates to raw visual sources only if needed based on confidence tests and fused modality-evidence scores, thereby reducing token usage.
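The inference path described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the additive scoring rule, the `alpha` weight, the single confidence `threshold`, and all names (`Chunk`, `rerank`, `cascade_answer`, the stub answerers) are assumptions, and the paper's fused modality-evidence score is reduced here to one confidence check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Chunk:
    text: str           # textual surrogate stored in the index
    modality: str       # "text", "image", "document", or "chart"
    score: float        # base text-retriever score
    raw_ref: str = ""   # hypothetical pointer to the raw file, empty for plain text


def rerank(chunks: List[Chunk], bias: Dict[str, float], alpha: float = 0.5) -> List[Chunk]:
    """Sort by base score plus alpha times the query's bias toward the chunk's modality."""
    return sorted(
        chunks,
        key=lambda c: c.score + alpha * bias.get(c.modality, 0.0),
        reverse=True,
    )


TextAnswerer = Callable[[str, List[Chunk]], Tuple[str, float]]
RawAnswerer = Callable[[str, List[Chunk]], str]


def cascade_answer(
    query: str,
    chunks: List[Chunk],
    bias: Dict[str, float],
    answer_text: TextAnswerer,
    answer_raw: RawAnswerer,
    threshold: float = 0.7,
    top_k: int = 3,
) -> Tuple[str, bool]:
    """Return (answer, escalated). Escalate only when the surrogate-only answer is unsure."""
    ranked = rerank(chunks, bias)[:top_k]
    answer, confidence = answer_text(query, ranked)
    raw_candidates = [c for c in ranked if c.raw_ref]
    if confidence >= threshold or not raw_candidates:
        return answer, False
    return answer_raw(query, raw_candidates), True


chunks = [
    Chunk("User prefers light roast", "text", 0.60),
    Chunk("[image] espresso shot with thin crema", "image", 0.55, "img_012.jpg"),
    Chunk("[chart] extraction time vs. crema height", "chart", 0.40, "chart_003.png"),
]
bias = {"image": 0.6, "chart": 0.2, "text": 0.2}  # would come from the bias detector

# Stub answerers standing in for model calls.
unsure = lambda q, cs: ("unsure", 0.4)
confident = lambda q, cs: ("light roast", 0.9)
look_at_raw = lambda q, cs: f"answered from {cs[0].raw_ref}"

escalated_result = cascade_answer("How did the crema look?", chunks, bias, unsure, look_at_raw)
direct_result = cascade_answer("Which roast do I like?", chunks, bias, confident, look_at_raw)
print(escalated_result, direct_result)
```

The point of the design is that raw files are only loaded on the second path, so queries answerable from surrogates never pay the visual-token cost.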

This cascade approach balances accuracy and efficiency, escalating for roughly 26% of queries. The evaluation compares raw and ablated variants on main metrics, derives modality lift of raw vs surrogate evidence, and analyzes efficiency tradeoffs for M3Proctor vs other memory methods. The paper supplies code and data for reproducibility.

One concrete example involves answering queries about espresso crema fading (photos and PDFs over several sessions), requiring the system to retrieve relevant previous multimodal files, interpret implicit user expertise, and reason over cross-modal evidence.

Overall, the methodology combines realistic multi-session user simulation, multimodal artifact curation, LLM-driven timeline and conversation synthesis, rigorous question construction, and a novel retrieval cascade to stress-test multimodal memory systems under practical user-agent scenarios.

Technical innovations

M3Exam: A comprehensive realistic benchmark for multimodal long-term conversational memory spanning multiple sessions, modalities, and implicit user context.
M3Proctor: A modality-aware multimodal memory system that detects query modality bias and escalates to raw visual sources only when surrogate text is insufficient.
Projection of raw modalities into compact textual surrogates with modality tags enables scalable text-based indexing and retrieval.
Modality-biased re-ranking of retrieved chunks prioritizes relevant evidence modalities to improve retrieval precision.
A cost-aware cascade answering pipeline that balances accuracy and token cost by selectively escalating to raw visual data on-demand.
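The surrogate-projection idea in the list above can be illustrated with a toy index. Everything here is an assumption for illustration: the tag format, the `Artifact` fields, and the keyword-overlap search, which merely stands in for whatever text retriever the system uses. In a real pipeline the description would come from a captioning or transcription step rather than being supplied by hand.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Artifact:
    file_id: str
    modality: str      # "image", "document", or "chart"
    description: str   # caption, extracted text, or transcribed chart data


def to_surrogate(artifact: Artifact) -> str:
    """Prefix the description with a modality tag so the tag travels with the text."""
    return f"[{artifact.modality.upper()}:{artifact.file_id}] {artifact.description}"


def build_index(artifacts: Iterable[Artifact]) -> Dict[str, str]:
    """Map each file id to its textual surrogate."""
    return {a.file_id: to_surrogate(a) for a in artifacts}


def keyword_search(index: Dict[str, str], query: str) -> List[str]:
    """Rank file ids by how many query words appear in the surrogate (toy retriever)."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), fid) for fid, text in index.items()]
    return [fid for hits, fid in sorted(scored, key=lambda x: (-x[0], x[1])) if hits > 0]


artifacts = [
    Artifact("img_012", "image", "espresso shot with thin pale crema"),
    Artifact("chart_003", "chart", "extraction time 25s 30s 35s versus crema height"),
]
index = build_index(artifacts)
hits = keyword_search(index, "crema height")
print(index["img_012"])
print(hits)
```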

Datasets

M3Exam — 239 multi-session conversations, 3,025 rounds, 1,799 multimodal artifacts, 5,150 evaluation questions — released at https://anonymous.4open.science/r/M-3-Exam-128D

Baselines vs proposed

GLM-5.1: Overall F1 = 0.5493 vs M3Proctor Overall LLM-J = 0.4838 (7B backbone)
Qwen-2.5-VL-7B base: Overall F1 = 0.4179 vs M3Proctor Overall LLM-J = 0.4838
Universal-RAG: Overall F1 = 0.4366 vs M3Proctor Overall LLM-J = 0.4838
RAG-Anything: Overall F1 = 0.4481 vs M3Proctor Overall LLM-J = 0.4838
MIRIX: LLM-J score = 0.4560 vs M3Proctor = 0.4838
Index build time: MIRIX = 43,271s vs M3Proctor = 72s; Tokens per query: MIRIX = 13,491 vs M3Proctor = 4,591
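The efficiency figures in the last row can be turned into relative reductions with a few lines of arithmetic. The numbers are the ones listed above; the comparison is against MIRIX only, so the results need not match percentages quoted elsewhere that may be computed against a different set of baselines.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction by which `proposed` is smaller than `baseline`."""
    return 1.0 - proposed / baseline


index_reduction = relative_reduction(43_271, 72)      # index build time, seconds
token_reduction = relative_reduction(13_491, 4_591)   # tokens per query
speedup = 43_271 / 72

print(f"index build time reduction vs MIRIX: {index_reduction:.1%}")  # 99.8%
print(f"tokens per query reduction vs MIRIX: {token_reduction:.1%}")  # 66.0%
print(f"index build speedup vs MIRIX: {speedup:.0f}x")                # 601x
```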

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.07402.

Fig 1: Overview of M3Exam.

Fig 2: Overall pipeline of M3Exam, designed to evaluate multimodal memory ability in realistic scenarios.

Limitations

Benchmark conversations are synthetically generated from LLM simulations guided by persona narratives, which might lack some real-world conversational nuance.
The complexity of implicit inference relies on proxy LLM ability and design rather than fully natural implicit understanding.
Evaluation focuses on the Qwen-2.5-VL-7B backbone; results on larger or more diverse backbones are deferred to the appendices and not fully detailed.
No explicit adversarial evaluation against sophisticated multimodal attacker strategies is described.
Efficiency measurements do not cover real-time latency or resource constraints beyond token counts and index building time.
Some design choices such as modality bias detection thresholds and fusion weights may be sensitive to task and dataset specifics.

Open questions / follow-ons

How well do modality bias detection and cascade strategies generalize to unseen or out-of-domain multimodal tasks?
Can the synthetic persona-driven conversation generation be augmented with real user data to improve naturalness and challenge diversity?
What architectural innovations in model design could better encode long-term multimodal memory without relying heavily on external retrieval?
How does M3Proctor’s cascade approach perform under latency or resource-constrained inference settings in deployed agents?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, M3Exam offers a nuanced benchmark illustrating how long-term multimodal memory systems struggle with implicit inference and cross-modal grounding, two capabilities that adversarial bots might exploit to bypass text-based defenses. The modality-aware retrieval and cascade inference in M3Proctor show the advantage of selective evidence consumption, which reduces overhead while maintaining accuracy. This is an important principle for designing efficient real-world bot-detection agents that handle complex user interaction histories involving images or documents. Such benchmarks and techniques can help CAPTCHA designers evaluate whether their challenge-response protocols sufficiently stress multimodal reasoning and implicit user intent inference, potentially improving robustness to sophisticated automated solvers. The taxonomy of question types (memorizing, reasoning, interpreting) and the emphasis on implicit context also offer insights into formulating challenges that test models' multimodal comprehension beyond superficial text matching.

Cite

bibtex

@article{arxiv2606_07402,
  title={M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions},
  author={Zhengjun Huang and Wenxuan Liu and Zhoujin Tian and Wei Chen and Junle Chen and Yuqian Wu and Fangyuan Zhang and Qintian Guo and Xiaofang Zhou},
  journal={arXiv preprint arXiv:2606.07402},
  year={2026},
  url={https://arxiv.org/abs/2606.07402}
}
