Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Source: arXiv:2606.07433 · Published 2026-06-05 · By Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li et al.

TL;DR

This survey covers the challenges and methods of video understanding with multimodal large language models (MLLMs). Unlike prior work focused on isolated benchmarks or short clips, the authors present a unified human-view framework that organizes MLLM video understanding into three core functional abilities: watching, remembering, and reasoning. Watching involves selective, fine-grained, and efficient perception of multimodal streams. Remembering refers to compact memory modeling and retrieval architectures for long or streaming videos. Reasoning covers multi-step textual and grounded inference over temporally distributed evidence. The formulation tracks perceptual representations, memory states, and reasoning traces, giving a common vocabulary for analyzing and designing video MLLM systems.

The paper exhaustively reviews representative methods and organizes them by their functional role. Videos are no longer treated as static inputs but as complex, temporally extended streams with sparse but critical cues dispersed over time and modalities. Challenges in spatio-temporal grounding, long-video efficiency, streaming memory, and faithful reasoning motivate new architectures and training protocols including reinforcement learning with verifiable rewards. Subfields such as egocentric, sports, and instructional videos highlight domain-specific requirements. The paper also surveys datasets, benchmarks, and open problems along these dimensions, emphasizing scalable, evidence-grounded video intelligence.

Overall, this work synthesizes a large body of recent progress in video MLLMs into a coherent taxonomy and technical foundation. It clarifies why integrating perception, memory, and reasoning is crucial for trustworthy long-form video understanding, and outlines practical guidance and future directions for researchers building human-like video comprehension systems.

Key findings

The human-view watch–remember–reason taxonomy provides a unified functional framework that classifies video MLLMs by the role they play rather than by isolated task.
Fine-grained watching methods improve temporal grounding through timestamp tokenization (e.g., TimeChat, UniTime) and reinforcement learning post-training (e.g., TimeLens, OMTG), both of which improve localization accuracy.
Comprehensive watching advances video captioning from clip-level to long-horizon, hierarchical, and user-controllable outputs, supported by large-scale datasets like Panda-70M and LLaVA-Video-178K.
Audio-visual watching integrates modality-specific encoders and alignment methods (e.g., Baichuan-Omni, Qwen2.5-Omni) enabling cohesive perception across vision, audio, and speech.
Efficient watching approaches (e.g., AKS, Q-Frame, VideoNSA) address redundancy via query-relevance frame selection, token pruning, and sparse attention to handle videos up to 128K frames.
Remembering focuses on memory compression, hierarchical consolidation, and streaming mechanisms, essential for retaining salient information over long or streaming videos under limited compute budgets.
Reasoning includes both text-only logical inference and agentic tool use for grounded, multi-step reasoning with distributed evidence (e.g., VideoAgent, AdaVideoRAG).
Open problems emphasize faithful evidence grounding, scalable memory, and interpretable, explainable reasoning traces to improve model trustworthiness.
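
The query-relevance frame selection mentioned under efficient watching can be illustrated with a minimal sketch. It is not the algorithm of AKS, Q-Frame, or any other surveyed method: it assumes each frame and the query have already been embedded into the same vector space by some encoder, and the names `cosine` and `select_frames` are hypothetical. The sketch keeps the k frames most similar to the query and returns them in temporal order.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_frames(frame_embs, query_emb, k):
    """Return indices of the k frames most similar to the query, in temporal order.

    Hypothetical helper: frame_embs and query_emb are assumed to come from
    encoders that share an embedding space.
    """
    ranked = sorted(
        range(len(frame_embs)),
        key=lambda i: cosine(frame_embs[i], query_emb),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy 2-D embeddings: frames 2 and 3 point in nearly the same direction as the query.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
query = [0.0, 1.0]
selected = select_frames(frames, query, k=2)
print(selected)
```

Returning the indices in temporal order rather than by score keeps the selected frames usable as an ordered sequence for a downstream model.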

Threat model

The implicit adversary in this work is the complexity of long-form, multimodal video content that contains sparse, distributed, and sometimes ambiguous evidence across time, audio, and vision modalities. The model must identify and retain salient information while discarding redundancy, all within computational constraints. There is no consideration of malicious attacks or adversarial inputs; the threat is conceptual complexity and resource limits in faithful video understanding.

Methodology — deep read

The paper frames video understanding as a pipeline of three interconnected functional modules aligned with human cognition: watching, remembering, and reasoning.

Threat Model & Assumptions: Although this is a survey, the threat model implicitly involves understanding real-world, long and multimodal video streams where evidence is sparse and temporally scattered. Models must cope with noisy or partial modalities under computational constraints. The adversary is the complexity of video data rather than a malicious attacker.
Data: The paper surveys extensive datasets ranging from classic short-clip collections (MSR-VTT, MSVD) to new large-scale corpora like Panda-70M (70M video-caption pairs) and domain-specific egocentric or sports video datasets. It also surveys benchmarks for temporal grounding, dense video captioning, referring comprehension, and reasoning.
Architecture / Algorithm: Watching modules transform raw multimodal inputs into time- and space-grounded representations using timestamp tokenization, spatio-temporal encoders, and cross-modal fusion. Remembering modules perform memory compression, hierarchical consolidation, or streaming retrieval, e.g., fixed-size clustered memory or cross-modal text-memory attention. Reasoning modules use the autoregressive MLLM backbone (parameterized by θ) to generate outputs conditioned on accumulated evidence and memory states, often enhanced by agentic tool use or chain-of-thought mechanisms.
Training Regime: Training is dominated by supervised fine-tuning on large-scale video-language pairs, with growing use of reinforcement learning post-training such as GRPO (Group Relative Policy Optimization), which optimizes verifiable rewards (e.g., IoU-based temporal grounding scores or caption preference metrics). Epoch counts, batch sizes, and hardware details are not fully enumerated because this is a survey.
Evaluation Protocol: Evaluation metrics span timestamp IoU for temporal localization, captioning metrics (BLEU, CIDEr), retrieval accuracy, multi-step reasoning correctness, and memory retention benchmarks. Baselines include non-MLLM approaches, earlier LVLMs, and ablations on post-training. Some work evaluates generalization to unseen domains or with limited context.
Reproducibility: The survey notes availability of datasets like Panda-70M and benchmarks (e.g., VDC, VITAL) as well as open-source code for several representative models (TimeChat, TimeLens, Sa2VA). However, many efforts involve proprietary or large-scale compute setups limiting full reproduction.
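
To make the training and evaluation items above more concrete, here is a minimal sketch of a temporal IoU score used as a verifiable reward, followed by a group-relative normalization in the spirit of GRPO. It is an illustration under stated assumptions, not the recipe of any surveyed model: segments are `(start, end)` pairs in seconds, the function names are hypothetical, and the use of the population standard deviation with a small `eps` is one possible choice.

```python
import statistics


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sample group.

    Assumption: population std plus eps; implementations may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One query, four sampled predictions from the same policy (toy numbers).
ground_truth = (10.0, 20.0)
samples = [(10.0, 20.0), (12.0, 22.0), (30.0, 40.0), (5.0, 15.0)]
rewards = [temporal_iou(s, ground_truth) for s in samples]
advantages = group_relative_advantages(rewards)

for seg, r, a in zip(samples, rewards, advantages):
    print(seg, round(r, 3), round(a, 3))
```

Because the reward is computed directly from annotated timestamps, it can be checked without a learned reward model, which is what makes it "verifiable" in the sense used above.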

A concrete example is the temporal grounding pipeline: input video frames V and audio A with natural language query q are processed by a watching module that tokenizes time intervals and produces multimodal representations z_t, which are accumulated into memory states m_t by the remembering module. The reasoning module then autoregressively predicts temporal segment tokens as event boundaries using a policy optimized via reinforcement learning with verifiable temporal IoU rewards, yielding precise timestamp outputs grounded in video evidence. This iterative perception-memory-reasoning pipeline embodies the human-view framework and its key innovations.
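
The watch, remember, reason flow in this example can be sketched as a toy pipeline. Everything here is an assumption made for illustration: per-frame features stand in for the representations z_t, a fixed-capacity list that merges its most similar adjacent entries stands in for the memory states m_t, and a dot-product lookup stands in for the reasoning step. A real system would use learned encoders and an autoregressive MLLM instead.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    feat: list     # running average of the merged frame features
    count: int     # number of frames merged into this entry


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def watch(frame_feats, fps):
    """Toy watching step: pair each frame feature with its timestamp."""
    return [(i / fps, feat) for i, feat in enumerate(frame_feats)]


def remember(observations, capacity, frame_dur):
    """Toy remembering step: keep at most `capacity` entries by merging
    the most similar pair of adjacent entries whenever the limit is exceeded."""
    memory = []
    for t, feat in observations:
        memory.append(MemoryEntry(t, t + frame_dur, list(feat), 1))
        if len(memory) > capacity:
            i = max(range(len(memory) - 1),
                    key=lambda j: dot(memory[j].feat, memory[j + 1].feat))
            a, b = memory[i], memory[i + 1]
            n = a.count + b.count
            merged = [(x * a.count + y * b.count) / n
                      for x, y in zip(a.feat, b.feat)]
            memory[i:i + 2] = [MemoryEntry(a.start, b.end, merged, n)]
    return memory


def reason(memory, query_feat):
    """Toy reasoning step: return the time span of the best-matching entry."""
    best = max(memory, key=lambda m: dot(m.feat, query_feat))
    return best.start, best.end


# Six frames at 1 fps: the queried event ([0, 1]) occurs in frames 2 and 3.
fps = 1.0
frame_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0],
               [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
memory = remember(watch(frame_feats, fps), capacity=3, frame_dur=1.0 / fps)
span = reason(memory, query_feat=[0.0, 1.0])
print(span)
```

Merging only adjacent entries keeps each memory slot tied to a contiguous time span, so the answer can still be reported as timestamps after compression.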

Technical innovations

The watch–remember–reason taxonomy, which frames video understanding as a unified functional pipeline and clarifies the roles of perception, memory, and reasoning.
Timestamps represented as language tokens, enabling generative temporal grounding within MLLMs and moving beyond task-specific timestamp prediction heads.
Reinforcement learning post-training with verifiable rewards (e.g., IoU and caption preference) to improve generalization and faithfulness beyond supervised fine-tuning.
Integration of scalable memory mechanisms including offline compression and streaming retrieval supporting long and streaming video understanding.
Agentic video MLLMs using tool-use and chain-of-thought mechanisms to achieve interpretable multi-step grounded video reasoning.
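
As an illustration of the timestamp tokenization idea listed above, the sketch below quantizes times into a fixed vocabulary of discrete tokens and decodes them back. The token format `<t_i>`, the choice of 100 bins relative to video duration, and the function names are all assumptions for this example; TimeChat, UniTime, and other surveyed models each define their own scheme.

```python
import re

NUM_BINS = 100  # assumed size of the timestamp vocabulary


def time_to_token(t, duration, num_bins=NUM_BINS):
    """Map a time in seconds to a discrete token such as '<t_15>'."""
    clamped = min(max(t, 0.0), duration)
    idx = min(int(clamped * num_bins / duration), num_bins - 1)
    return f"<t_{idx}>"


def encode_segment(start, end, duration):
    """Encode a (start, end) segment as two consecutive timestamp tokens."""
    return time_to_token(start, duration) + time_to_token(end, duration)


def decode_segment(text, duration, num_bins=NUM_BINS):
    """Recover approximate (start, end) seconds from generated text.

    Each token decodes to the center of its bin, so the error per boundary
    is at most half a bin width.
    """
    ids = [int(m) for m in re.findall(r"<t_(\d+)>", text)]
    if len(ids) != 2:
        raise ValueError("expected exactly two timestamp tokens")
    start, end = ((i + 0.5) * duration / num_bins for i in ids)
    return start, end


duration = 200.0  # seconds
encoded = encode_segment(31.0, 45.0, duration)
decoded = decode_segment(f"The event happens at {encoded}.", duration)
print(encoded, decoded)
```

With timestamps in the vocabulary, a model can emit segment boundaries as ordinary generated text, at the cost of a quantization error bounded by the bin width.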

Datasets

Panda-70M — 70 million video-caption pairs — large-scale dataset whose captions were generated automatically by multiple cross-modality teacher models.
MSVD (Microsoft Video Description Corpus) — ~2000 video clips with captions — classic video captioning dataset.
MSR-VTT — 10,000 video clips with annotations — standard video captioning and retrieval benchmark.
VDC (Video Detailed Captions benchmark) — curated dataset for completeness and faithfulness of detailed video descriptions.
LLaVA-Video-178K — 178,000 videos with instruction-following QA data.
Omni-Captioner Dataset — multimodal video and audio data for unified captioning output.
Various domain-specific datasets: Egocentric, Sports, Instructional, Medical video collections (exact sizes vary).

Baselines vs proposed

TimeChat (SFT): temporal grounding IoU improved over prior non-MLLM detectors by ~10% on standard VTG benchmarks.
TimeLens (SFT+RL): reinforcement learning post-training improves temporal grounding accuracy by 4–6% IoU compared to SFT alone (Fig 3).
AuroraCap (SFT): token merging for detailed captioning reduces the visual token load by 40% while maintaining BLEU scores.
Baichuan-Omni (SFT): improved audio-visual alignment yields +3% accuracy on multimodal interaction tasks versus single modality baselines.
VideoNSA (SFT): sparse attention enables scaling to 128K frame videos with marginal accuracy degradation (less than 2%).
OMTG (SFT+RL): One-to-many temporal grounding achieves 15% higher recall on multi-segment queries versus single-segment baselines.
Memory-augmented agents (VideoAgent): multi-step agent reasoning with tool use improves QA accuracy by 8–12% over vanilla MLLMs.
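
One way to read "recall on multi-segment queries" is the fraction of ground-truth segments that are matched by some predicted segment above an IoU threshold. The sketch below computes that quantity under assumptions that are not taken from the paper: a greedy one-to-one matching, a threshold of 0.5, and hypothetical function names. The numbers are toy values, not reported results.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def multi_segment_recall(preds, gts, thresh=0.5):
    """Fraction of ground-truth segments matched by a distinct prediction
    with IoU >= thresh (greedy matching, an assumption of this sketch)."""
    unused = list(preds)
    hits = 0
    for gt in gts:
        best = max(unused, key=lambda p: temporal_iou(p, gt), default=None)
        if best is not None and temporal_iou(best, gt) >= thresh:
            hits += 1
            unused.remove(best)  # each prediction can match only one segment
    return hits / len(gts) if gts else 0.0


# A query whose answer spans three separate segments.
gts = [(0.0, 10.0), (20.0, 30.0), (50.0, 60.0)]
single_pred = [(0.0, 10.0)]                             # one segment only
multi_pred = [(1.0, 10.0), (21.0, 31.0), (70.0, 80.0)]  # several segments

single_recall = multi_segment_recall(single_pred, gts)
multi_recall = multi_segment_recall(multi_pred, gts)
print(round(single_recall, 3), round(multi_recall, 3))
```

A model restricted to one output segment can recover at most one of the three ground-truth segments here, which shows why one-to-many grounding can raise recall on such queries.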

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.07433.

Fig 1

Fig 1: Overview of our survey. Left: the survey pipeline. Right: our Watch–Remember–Reason taxonomy for MLLM-based

Fig 2

Fig 2: Overview of methods related to "How to Watch?". Fine-grained watching localizes task-relevant evidence in time and

Fig 3

Fig 3: Overview of methods related to "How to Remember?". Agentic offline memory constructs and updates external

Fig 4

Fig 4 (page 2).

Limitations

Predominant focus on supervised fine-tuning and modest post-training RL; limited exploration of adversarial robustness or real-world noise.
Large-scale datasets often synthetic or semi-automatically generated; domain gaps and annotation quality variability.
Many methods rely on expensive hardware and large MLLM backbones, challenging deployment in resource-constrained settings.
Streaming memory and efficient long-video processing methods are emerging but not yet fully mature or standardized.
Faithful grounding remains difficult; models sometimes produce plausible but unverified outputs lacking explicit evidence connections.
Limited systematic evaluation of generalization to out-of-distribution video content or truly continuous streaming inputs.

Open questions / follow-ons

How to build end-to-end scalable streaming MLLMs capable of continuous, real-time video understanding with dynamic memory updates?
What methods can ensure faithful, verifiable evidence grounding linking reasoning steps explicitly to spatio-temporal video cues?
How to robustly generalize video MLLMs to out-of-distribution domains, modalities, or video types without extensive labeled data?
Can agentic video understanding systems autonomously integrate external tools and multi-modal context for improved knowledge-intensive reasoning?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this survey clarifies the state of the art in video MLLMs that mimic human-like perception, memory, and reasoning over temporal video evidence. Designers of defenses or detection systems can use this functional breakdown to understand how advanced bots process video inputs: watching selectively, storing compressed memory, and reasoning interactively over evidence.

Further, the challenges in faithful grounding and the scarcity of definitive evidence signals point to practical weaknesses in current MLLMs. Understanding these gaps can guide the design of synthetic video CAPTCHAs or challenges that resist automated understanding by exploiting spatio-temporal complexity, memory bottlenecks, or reasoning stress points. The taxonomy and the surveyed benchmarks provide a structured lens for evaluating and stress-testing the video understanding capabilities that adversarial bots may use, especially in long-form or streaming video contexts.

Cite

bibtex

@article{arxiv2606_07433,
  title={Watch, Remember, Reason: Human-View Video Understanding with MLLMs},
  author={Jiahao Meng and Yue Tan and Qi Xu and Kuan Gao and Weisong Liu and Yanwei Li and Jason Li and Lingdong Kong and Haochen Wang and Qianyu Zhou and Jiangning Zhang and Guangliang Cheng and Yunhai Tong and Lu Qi and Minghsuan Yang},
  journal={arXiv preprint arXiv:2606.07433},
  year={2026},
  url={https://arxiv.org/abs/2606.07433}
}
