StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

Source: arXiv:2606.06338 · Published 2026-06-04 · By Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge et al.

TL;DR

This paper addresses the challenge of deep video understanding (DVU) in VideoQA, which requires reasoning over long-range, complex storylines with multi-faceted question types and detailed story elements like specific characters, actions, and locations. Prior datasets for DVU have been limited in scale and scope due to the labor-intensive manual construction process and imbalanced coverage of question types and story elements. The authors improve an earlier automated question-answer generation system (StoryMind) by designing StoryMindv2, a multi-agent framework with a novel supervisor-guided QA generation mechanism and a multi-reviewer voting filtration process. This enables creation of StoryVideoQA, the largest DVU dataset to date with 363K QA pairs spanning 393.2 hours of diverse story videos (TV series and movies) and balanced coverage across 14 fine-grained question topics. Evaluations of 20 state-of-the-art VideoQA models on StoryVideoQA reveal significant performance drops compared to factoid datasets, exposing their inability to model long-range character relationships and complex storylines. To overcome this, the authors propose PlotTree, a novel hierarchical video understanding agent that organizes video content into a multi-level plot structure and enables efficient reasoning over the storyline’s long-range evolution, achieving superior QA accuracy. Overall, the paper contributes an improved data generation framework, a massive new DVU benchmark, and a promising new video reasoning method to advance research on story-level video understanding.

Key findings

StoryMindv2 improves QA generation accuracy from 77.37% to 85.96% on longer, more complex movies versus prior StoryMind (Fig 4).
StoryVideoQA contains 363K QA pairs over 393.2 hours of video (3 TV series averaging 1,635 s per video and 78 movies averaging 7,878 s), making it the largest DVU dataset to date (Table 1).
StoryVideoQA covers 14 fine-grained topics, formed by crossing 2 question types (perception, inference) with 7 combinations of story elements (character, action, location), and distributes QAs evenly across them (Gini index 0.927).
Across 20 SOTA VideoQA methods, accuracy on StoryVideoQA's deep video understanding task drops by up to 48.9% relative to factoid VideoQA datasets (Fig 2).
PlotTree, a hierarchical LLM-driven agent, organizes videos into a recursive plot node tree for efficient long-range reasoning, outperforming baselines on StoryVideoQA (specific metrics unclear).
Supervisor agent in StoryMindv2 framework reduces QA generation error rate by automatically detecting faults and providing corrective feedback, increasing accuracy while maintaining efficiency (Table 2).
Multi-reviewer voting filtration with 3 GPT-4 based reviewers improves final QA quality, balancing dataset scale and reliability (Table 3).
StoryVideoQA incorporates diverse video sources, including Friends, The Big Bang Theory, Game of Thrones, and 78 top-ranked movies from IMDB and Douban, broadening genre and complexity.
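
The count of 14 topics follows from simple combinatorics. The sketch below assumes the 7 story element combinations are the non-empty subsets of {character, action, location}, which matches the count but is not confirmed by the paper's own naming. The constant and function names are illustrative.

```python
from itertools import combinations

QUESTION_TYPES = ("perception", "inference")
STORY_ELEMENTS = ("character", "action", "location")


def enumerate_topics():
    """Return every (question type, element subset) pair.

    Assumption: a "story element combination" is any non-empty subset
    of the three elements, giving 2**3 - 1 = 7 combinations.
    """
    subsets = [
        combo
        for size in range(1, len(STORY_ELEMENTS) + 1)
        for combo in combinations(STORY_ELEMENTS, size)
    ]
    return [(qtype, combo) for qtype in QUESTION_TYPES for combo in subsets]


topics = enumerate_topics()
for qtype, combo in topics:
    print(f"{qtype}: {'+'.join(combo)}")
```

Under this assumption the loop prints 14 lines, one per topic, from `perception: character` through `inference: character+action+location`.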

Threat model

n/a. The paper focuses on dataset construction and video understanding methodology rather than security or adversarial issues. The closest analogue to an adversary is the VideoQA model under test, which fails when it cannot maintain long-range character associations or coherent storyline reasoning.

Methodology — deep read

The authors tackle the DVU challenge of generating and evaluating question-answer (QA) pairs on long story videos where manual data creation is not scalable or balanced. They build StoryMindv2, an enhanced multi-agent collaboration framework consisting of four main stages: (1) Data preparation, (2) QA generation, (3) QA filtration, and (4) Difficulty measurement.

Threat Model & Assumptions: No adversarial manipulation is considered. The only "adversary", in a loose sense, is the set of baseline VideoQA models evaluated on the dataset. The focus is on automated high-fidelity data generation and deep storyline reasoning rather than adversarial robustness.
Data: StoryMindv2 uses a large-scale corpus of time-aligned scripts and subtitles collected from public datasets (PAINS) and web sources for 3 TV series (Friends, The Big Bang Theory, Game of Thrones) and 78 top-rated movies (IMDB/Douban). Scripts provide rich contextual detail, while subtitles provide precise timestamps. Dynamic Time Warping (DTW) aligns each script with its subtitles, and manual corrections are applied to maintain fidelity. The result is 393.2 hours of video with aligned textual metadata.
Architecture/Algorithm: StoryMindv2 features a two-agent generator-supervisor collaboration. The generator, powered by Gemini-2.0-flash LLM, produces QA pairs conditioned on video descriptions and fine-grained topic prompts (14 categories combining question type and story element combinations). A dynamic control mechanism deactivates QA generation tools for saturated topics to maintain balanced coverage. The supervisor agent inspects generated QA batches using a fault archive and retrieval-augmented generation (RAG) to identify and delete erroneous QAs and provide targeted feedback to the generator for iterative quality improvement.
Training Regime: Generation runs in batches, with QA generation and supervision operating in tandem so that accuracy improves progressively. The average cost per QA is 0.365 minutes with supervision versus 0.081 minutes without (about 4.5 times slower), which is still well below manual generation (> 2 minutes per QA). Hyperparameters for generation, optimization, or fine-tuning are not provided, as the approach primarily relies on pre-trained LLMs with prompt engineering.
Evaluation Protocol: QA quality is evaluated initially by manual verification on 2,000 generated QAs (accuracy metrics in Table 2). Dataset balance across 14 fine-grained topics is measured using Gini index and entropy (Table 1). The final StoryVideoQA dataset is used as a benchmark to evaluate 20 recent state-of-the-art VideoQA models, encompassing video language models (VLMs), multimodal large language models (MLLMs), and video understanding agents. The evaluation metrics include QA accuracy on multiple-choice tasks, with performance declines quantified relative to factoid VideoQA datasets.
Reproducibility: The paper references an open GitHub project page hosting the StoryVideoQA dataset and related code. Some of the underlying raw scripts and subtitles come from external public sources. Implementation details for the newly introduced PlotTree agent are not spelled out but are presumably included in the code release. Parts of the verification and data alignment were done manually, which may be hard to replicate exactly.
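
The Data step above names Dynamic Time Warping for script-subtitle alignment without giving details. The following is a minimal sketch of classic DTW over two lists of text lines. The cost function (one minus Jaccard similarity of lowercased tokens) and the function names are assumptions made for illustration, not the authors' implementation.

```python
def token_cost(a: str, b: str) -> float:
    """Assumed local cost: 1 - Jaccard similarity of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def dtw_align(script, subtitles):
    """Return the DTW warping path as (script_index, subtitle_index) pairs."""
    n, m = len(script), len(subtitles)
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning script[:i] with subtitles[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = token_cost(script[i - 1], subtitles[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


script = [
    "Hi, how are you doing today?",
    "I am fine, thanks for asking.",
    "Let us get some coffee.",
]
subtitles = [
    "hi, how are you",
    "doing today?",
    "i am fine, thanks for asking.",
    "let us get some coffee.",
]
path = dtw_align(script, subtitles)
print(path)
```

In this toy input the first script line is split across two subtitle lines, so the path maps script line 0 to subtitles 0 and 1. Each script line can then inherit the timestamps of the subtitles it is paired with.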

Example End-to-End: For a TV episode or movie, aligned script-subtitle pairs produce a video description used to prompt the generator to produce QA pairs constrained by fine-grained topic tools. The supervisor agent analyzes the batch output, utilizing fault archive retrieval to identify similar past mistakes, deletes invalid pairs, and feeds corrective feedback to the generator. Once enough balanced QAs are generated, multi-reviewer voting by three GPT-4 reviewers filters the QA set. The collected QA pairs constitute the StoryVideoQA dataset for benchmarking VideoQA models. Finally, PlotTree converts a long video into recursively clustered plot nodes forming a hierarchical tree that supports multi-level reasoning to answer deep inference questions on storylines.
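
The generation half of this walkthrough can be sketched as a toy loop. Everything below is hypothetical scaffolding: the generator, supervisor, and reviewers are deterministic stubs standing in for LLM calls, the three topic names and the quota are made up, and the majority threshold is an assumption, since the summary only says that three reviewers vote.

```python
from collections import Counter

# Hypothetical subset of the 14 topics, for illustration only.
TOPICS = ["perception/character", "inference/character", "inference/location"]


def stub_generator(topic, serial):
    """Stand-in for the LLM generator."""
    return {"id": serial, "topic": topic, "question": f"question {serial}", "answer": "A"}


def stub_supervisor_ok(qa):
    """Stand-in for the supervisor: pretend every 4th QA is faulty."""
    return qa["id"] % 4 != 0


def stub_reviewer_votes(qa):
    """Stand-in for three reviewers, each returning approve/reject."""
    return [True, qa["id"] % 2 == 1, qa["id"] % 5 != 0]


def majority_accept(votes):
    """Assumed rule: keep a QA when more than half of the reviewers approve."""
    return 2 * sum(votes) > len(votes)


def generate_balanced(quota, max_rounds=100):
    """Generate until every topic holds `quota` supervisor-approved QAs."""
    counts, pool, serial = Counter(), [], 0
    for _ in range(max_rounds):
        # Dynamic topic control: saturated topics are no longer offered.
        enabled = [t for t in TOPICS if counts[t] < quota]
        if not enabled:
            break
        for topic in enabled:
            serial += 1
            qa = stub_generator(topic, serial)
            if stub_supervisor_ok(qa):  # faulty QAs are deleted
                pool.append(qa)
                counts[topic] += 1
    return pool, counts


pool, counts = generate_balanced(quota=3)
final = [qa for qa in pool if majority_accept(stub_reviewer_votes(qa))]
print(dict(counts), [qa["id"] for qa in final])
```

Because the supervisor removes some QAs, topics fill at different rates, and the loop keeps generating only for the topics still below quota. Voting runs afterwards as a separate filter, mirroring the order described above.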

Technical innovations

StoryMindv2 introduces a supervisor-guided QA generation mechanism that uses a fault archive and retrieval-augmented feedback to iteratively correct and improve generated QA quality for complex, long-range story videos.
A refined multi-reviewer GPT-4 voting filtration strategy replaces strict consistency filtering to maintain high QA quality while enabling large-scale, balanced dataset construction.
A dynamic, fine-grained topic control mechanism deactivates generation tools for saturated QA types ensuring balanced coverage across 14 perception/inference and story element question categories.
PlotTree: a novel LLM-driven video understanding agent that restructures long videos into a hierarchical plot tree for efficient multi-level storyline reasoning, enabling deeper comprehension over existing RAG-based flat retrieval approaches.
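
To make the hierarchical plot tree idea concrete, here is a minimal sketch under strong assumptions: leaves are scene summaries, consecutive nodes are grouped with a fixed branching factor, parent summaries come from a stub that concatenates child text, and lookup descends by keyword overlap. The paper's actual clustering and LLM-driven reasoning are not described in this summary, so `PlotNode`, `build_tree`, and `locate` are hypothetical names, and the scenes are invented.

```python
from dataclasses import dataclass, field


@dataclass
class PlotNode:
    summary: str
    start: int  # index of the first scene covered
    end: int    # index of the last scene covered (inclusive)
    children: list = field(default_factory=list)


def stub_summarize(nodes):
    """Stand-in for an LLM summarizer: joins the child summaries."""
    return " / ".join(node.summary for node in nodes)


def build_tree(scenes, branching=2):
    """Group consecutive nodes level by level until one root remains."""
    if not scenes:
        raise ValueError("need at least one scene")
    if branching < 2:
        raise ValueError("branching must be at least 2")
    level = [PlotNode(text, i, i) for i, text in enumerate(scenes)]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), branching):
            group = level[i:i + branching]
            parents.append(
                PlotNode(stub_summarize(group), group[0].start, group[-1].end, group)
            )
        level = parents
    return level[0]


def overlap(query, text):
    """Assumed relevance score: number of shared lowercased tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def locate(root, query):
    """Descend from the root, following the best-matching child each time."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.summary))
    return node


scenes = [
    "Anna finds a letter",
    "Anna meets Ben at the station",
    "Ben hides the map",
    "Anna confronts Ben about the map",
]
root = build_tree(scenes)
hit = locate(root, "who hides the map")
print(hit.start, hit.summary)
```

The point of the structure is that a query inspects a few children per level instead of scanning every scene, which is the contrast with flat retrieval drawn above.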

Datasets

StoryVideoQA: 363K QA pairs, 393.2 hours (3 TV series + 78 movies); built by automated generation with human verification from public script and subtitle sources
FriendsQA: 44.6K QAs, 89.6 hours; prior TV series dataset generated by the earlier StoryMind system
MovieQA: 14.9K QAs, 381 hours; publicly available movie QA dataset
TVQA: 144.9K QAs, 461 hours; public TV series QA dataset
TVQA+: 29.4K QAs, 71.7 hours; public TV QA dataset
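
One quick way to compare these datasets is QA density, the number of QA pairs per hour of video. The sketch below only divides the rounded figures listed above, so the results are approximate; the dictionary and function names are illustrative.

```python
# (QA pairs, hours of video), copied from the rounded figures in the list above.
DATASETS = {
    "StoryVideoQA": (363_000, 393.2),
    "FriendsQA": (44_600, 89.6),
    "MovieQA": (14_900, 381.0),
    "TVQA": (144_900, 461.0),
    "TVQA+": (29_400, 71.7),
}


def qa_per_hour(stats):
    """Return QA pairs per hour of video for each dataset."""
    return {name: qas / hours for name, (qas, hours) in stats.items()}


density = qa_per_hour(DATASETS)
for name, value in sorted(density.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<13}{value:8.1f} QA pairs per hour")
```

On these figures StoryVideoQA is the densest of the five at roughly 923 QA pairs per hour, while MovieQA is the sparsest at about 39.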

Baselines vs proposed

StoryMind QA generation: accuracy 77.37% vs StoryMindv2 with supervisor: 85.96% (Fig 4, Table 2)
Factoid VideoQA (NExT-QA): average accuracy of roughly 73.8-78.6%; on StoryVideoQA's deep video understanding task, accuracy drops by up to 48.9% (Fig 2)
Gini index for StoryVideoQA QA topic balance at 0.927 vs prior DVU datasets ranging from 0.701 to 0.821 (Table 1)
Self-BLEU-2 (a diversity proxy; lower means more diverse): StoryMindv2 79.3% vs StoryMind 80.1%, a slight improvement in diversity
PlotTree outperforms 20 SOTA models on StoryVideoQA in QA accuracy (specific numbers not detailed)
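
The balance figures can be put in context with a short calculation. This sketch assumes the reported "Gini index" is the Gini impurity, 1 minus the sum of squared topic shares, where higher means more balanced. That reading is an assumption: it fits the summary's use of a higher value as better, and 0.927 sits just under the ceiling of 13/14 for 14 topics. The topic counts below are invented for illustration.

```python
import math


def gini_impurity(counts):
    """1 - sum(p_i^2): 0 for a single topic, 1 - 1/k for k equal topics."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def shannon_entropy(counts):
    """Entropy in bits: log2(k) for k equally sized topics."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


uniform = [100] * 14          # perfectly balanced toy distribution
skewed = [1000] + [10] * 13   # one dominant topic

for name, counts in (("uniform", uniform), ("skewed", skewed)):
    print(f"{name}: gini={gini_impurity(counts):.4f} entropy={shannon_entropy(counts):.4f}")
```

For 14 equally sized topics the impurity ceiling is 13/14, about 0.9286, and the entropy ceiling is log2(14), about 3.807 bits.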

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.06338.

Fig 1: Comparisons of factoid VideoQA and DVU.

Fig 2: VideoQA methods’ performance declines

Fig 3: The distribution of 5 datasets QAs across

Fig 4: Comparison of the accuracy in automatic

Figs 5-8 (page 2).

Limitations

The QA generation quality, though improved, still requires extensive manual verification for accuracy validation on sampled sets, indicating some residual errors.
The StoryVideoQA dataset, while large, may inherit biases or gaps from the underlying script and subtitle sources, which do not perfectly align with final video content.
Evaluations focus on multiple-choice QA accuracy; real-world open-ended or generative QA challenges remain unexplored.
PlotTree’s hierarchical reasoning depends heavily on accurate plot node extraction and clustering; sensitivity to errors in these steps is not deeply analyzed.
No explicit adversarial robustness or distribution shift testing was performed to assess model generalization under manipulated or domain-shifted story videos.
The reliance on large pretrained LLMs (Gemini-2.0-flash, GPT-4) requires substantial compute resources, limiting accessibility for some practitioners.

Open questions / follow-ons

How can automated generation frameworks like StoryMindv2 be extended to support open-ended, generative QA beyond multiple-choice formats?
What are the effects of noisy or misaligned script-subtitle data on the final QA quality, and can self-supervised alignment improve dataset fidelity?
How robust is PlotTree to variations in plot extraction accuracy, and can adaptive node clustering improve its reasoning over diverse genres?
Can similar multi-agent generation and hierarchical reasoning techniques be adapted for other long-range multimodal understanding tasks beyond VideoQA?

Why it matters for bot defense

This work is relevant to bot-defense and CAPTCHA practitioners who need to gauge how well AI systems can comprehend and reason over video. StoryVideoQA sets a new bar for evaluating models that track long-range narrative and character interactions, well beyond simple factoid or short-clip video classification tasks. For bot defense, it highlights the growing complexity required in CAPTCHA and challenge design to counter systems equipped with large-scale multimodal understanding.

The use of multi-agent generation and hierarchical reasoning (PlotTree) offers insights on how sophisticated automated agents can be structured to process and reason over complex narratives — a consideration for evaluating the trustworthiness and interpretability of video understanding models in security settings. However, the demonstrated performance drops by current SOTA models on this dataset suggest that deep storyline reasoning remains a challenging problem, meaning defenses based on requiring true story-level comprehension in video CAPTCHAs might still be robust in near-term deployment.

Cite

bibtex

@article{arxiv2606_06338,
  title={StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset},
  author={Zhengqian Wu and Zhixian Liu and Aodong Chen and Jingyang Zhang and Ruizhe Li and Hanlin Ge and Zhongyuan Wang and Chunxia Xiao and Chao Liang},
  journal={arXiv preprint arXiv:2606.06338},
  year={2026},
  url={https://arxiv.org/abs/2606.06338}
}
