EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

Source: arXiv:2605.15199 · Published 2026-05-14 · By Ruozhen He, Meng Wei, Ziyan Yang, Vicente Ordonez

TL;DR

EntityBench addresses the challenge of keeping characters, objects, and locations consistent across long multi-shot video sequences in automated video generation. Prior benchmarks lacked multi-entity tracking, had limited episode lengths, or used metrics insufficient for evaluating entity consistency. EntityBench introduces a large-scale benchmark spanning 140 episodes and 2,491 shots sourced from real narrative media, with per-shot entity schedules covering up to 50 shots, 13 characters, 8 locations, and 22 objects per episode. It pairs this dataset with a three-pillar evaluation framework that separately measures intra-shot visual quality, prompt-following alignment, and, centrally, cross-shot entity consistency. Consistency is scored with embedding similarity and LLM-based criteria, and is gated by fidelity so that renderings that are consistent but incorrect are not rewarded.

To establish a baseline on EntityBench, the authors propose EntityMem, a memory-augmented video generation framework that maintains persistent, per-entity visual and textual reference banks generated and verified before video synthesis. This explicit entity memory enables retrieval of consistent entity appearances across shots without contamination from scene context or error accumulation. Experiments on 2,491 shots show that existing methods degrade sharply in entity consistency as recurrence distance grows, while EntityMem achieves the highest character fidelity (Cohen's d = +2.33 compared to baselines) and presence metrics, though these gains come at some cost to object fidelity. The work thus sets a new standard for evaluating and approaching entity-consistent long-range multi-shot video generation.

Key findings

EntityBench consists of 140 episodes with 2,491 shots; individual episodes contain up to 13 characters, 8 locations, and 22 objects, with recurrence gaps of up to 48 shots.
The benchmark uses a three-pillar evaluation framework with 51 metrics across intra-shot quality, prompt-following, and cross-shot consistency, including embedding similarity and LLM pairwise judgments.
EntityMem, the proposed memory-augmented system, achieves the highest character presence (0.967 vs. 0.882 for best baseline HoloCine) and face fidelity (0.740 vs. 0.452 for next-best baseline StoryMem).
EntityMem obtains strong effect sizes on character fidelity intra-shot metrics (Cohen's d +1.71) and character presence (+1.23) relative to StoryMem.
On cross-shot LLM-based character consistency, EntityMem improves llm_face_accuracy 1.8x (0.406 vs. 0.226) and wins across all 6 character consistency metrics.
In contrast, EntityMem performs worse than StoryMem on object fidelity (0.601 vs. 0.618) and on DINOv2 embedding-based consistency (Cohen's d = -0.50), reflecting a trade-off from focusing on character memory.
Visual quality metrics from Pillar 1 (intra-shot) favor holistic generation methods (CineTrans, HoloCine), but these drop sharply in prompt-alignment and entity fidelity compared to EntityMem.
Cross-shot consistency degrades sharply with recurrence distance in existing methods without explicit entity memory.
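
The notion of recurrence distance used in these findings can be illustrated with a minimal sketch. It assumes a hypothetical schedule format (one set of entity IDs per shot) and defines a gap as the difference in shot index between consecutive appearances of the same entity; the benchmark's own schema and exact gap definition may differ.

```python
from collections import defaultdict


def recurrence_gaps(schedule):
    """Return, per entity, the gaps between consecutive appearances.

    `schedule` is a list indexed by shot; each item is the set of entity IDs
    scheduled for that shot (a hypothetical format, not the benchmark schema).
    A gap is the difference in shot index (an assumed definition).
    """
    last_seen = {}
    gaps = defaultdict(list)
    for shot_idx, entities in enumerate(schedule):
        for entity in sorted(entities):
            if entity in last_seen:
                gaps[entity].append(shot_idx - last_seen[entity])
            last_seen[entity] = shot_idx
    return dict(gaps)


# Toy five-shot episode with two characters and one location.
schedule = [{"alice", "cafe"}, {"bob"}, {"alice"}, {"cafe", "bob"}, {"alice"}]
gaps = recurrence_gaps(schedule)
max_gap = max(g for per_entity in gaps.values() for g in per_entity)
```

Grouping consistency scores by these gaps is one way to examine how a method behaves as an entity reappears after more intervening shots.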

Threat model

The adversary here is implicit: it is the cumulative visual inconsistency that arises in automated multi-shot video generation systems, which may lose or alter entity appearance details over long sequences. No party manipulates data; the threat is the inability of generative models to maintain identity fidelity across multiple shots and intervening context. The setting assumes no external attacker capable of direct model manipulation or prompt tampering and focuses instead on robustness to long-range entity recurrence and scene variation.

Methodology — deep read

Threat Model & Assumptions: The adversary here is implicit—it is the source of inconsistency in automated multi-shot video generation systems that fail to maintain persistent visual identity for entities (characters, objects, locations) across potentially long-range shot recurrences. The goal is to measure and improve robustness of entity consistency when generating sequences of video shots from text prompts.
Data: EntityBench is curated from 100,000 raw clips of real narrative media. After visual quality filtering, 45,589 clips remain; 606 episodes are then extracted after filtering out episodes with excessive subtitles or a documentary style. A sliding window selects contiguous shot windows scored by cross-shot entity recurrence and interaction density, retaining 2,491 shots over 140 episodes (Table 2). Each shot is annotated with an explicit per-shot entity schedule covering characters (987 total), objects (2,077), and locations (654). Entities are detected with face and object detectors, then tracked and clustered into consistent identities using hierarchical clustering, with LLM-based deduplication to merge fragmented clusters.
Architecture / Algorithm: The key technical contribution is EntityMem, a memory-augmented generation framework. It builds a persistent per-entity memory bank holding verified visual references (portraits for characters, panoramic backgrounds for locations, objects selectively) generated and verified by a set of LLM-driven agents prior to video generation. At generation time, the video backbone model retrieves these references per-entity and composes shots by overlaying these on planned layouts. This disentangles entity appearance from scene context and temporally anchors identities to reduce accumulation of visual drift.
Training Regime: EntityMem does not modify base video generation model training but introduces a pre-generation stage where multi-agent LLMs generate and verify entity reference images and textual descriptions. During video generation, the video synthesis model receives prompts augmented with these cached entity memories. Hardware used includes two nodes with 8 NVIDIA L20 GPUs for full benchmark runs. No additional fine-tuning of video models is reported for EntityMem.
Evaluation Protocol: EntityBench's evaluation is a three-pillar framework: (1) Pillar 1 assesses intra-shot quality on six VBench metrics (e.g., motion smoothness, aesthetic quality); (2) Pillar 2 measures prompt-following, including presence of scheduled entities and entity fidelity via multimodal LLM criteria that check detailed attributes; (3) Pillar 3 tests cross-shot consistency using DINOv2 embedding similarity and LLM pairwise judgments that compare entity appearances shot-to-shot. A fidelity gate admits into Pillar 3 only those (shot, entity) pairs that pass Pillar 2's fidelity thresholds, to avoid rewarding consistent but incorrect renderings. The baselines compared are StoryMem, HoloCine, and CineTrans.
Reproducibility: Code and data for EntityBench and EntityMem are publicly released at https://github.com/Catherine-R-He/EntityBench/. The video generative backbones are open-sourced state-of-the-art models; no closed datasets are used besides the curated clips sourced from public narrative media.
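
The fidelity gate in the evaluation protocol can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the record layout, the 0.5 threshold, and the two-dimensional toy vectors (standing in for DINOv2 embeddings) are all assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def gated_consistency(records, threshold=0.5):
    """Mean pairwise cosine similarity over crops that pass the fidelity gate.

    `records` is a list of dicts with keys "entity", "shot", "fidelity", and
    "embedding" (a hypothetical layout). The threshold value is an assumption.
    Entities with fewer than two admitted crops are omitted from the result.
    """
    admitted = {}
    for r in records:
        if r["fidelity"] >= threshold:
            admitted.setdefault(r["entity"], []).append(r["embedding"])
    scores = {}
    for entity, embs in admitted.items():
        pairs = list(combinations(embs, 2))
        if pairs:
            scores[entity] = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return scores


records = [
    {"entity": "alice", "shot": 0, "fidelity": 0.9, "embedding": [1.0, 0.0]},
    {"entity": "alice", "shot": 3, "fidelity": 0.8, "embedding": [1.0, 0.0]},
    # Low-fidelity crop: excluded before any consistency is computed.
    {"entity": "alice", "shot": 5, "fidelity": 0.2, "embedding": [0.0, 1.0]},
    # Perfectly consistent but incorrect renderings: never scored.
    {"entity": "lamp", "shot": 1, "fidelity": 0.1, "embedding": [1.0, 0.0]},
    {"entity": "lamp", "shot": 2, "fidelity": 0.1, "embedding": [1.0, 0.0]},
]
scores = gated_consistency(records)
```

In this toy input the "lamp" crops are identical across shots yet fail the gate, so they earn no consistency score; that is the behavior the gate is meant to enforce.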

Example Walkthrough: For a particular episode with multiple recurring characters, EntityMem first generates per-character portraits based on their first-appearance descriptions and verifies them with VLM and LLM agents. It stores these in a memory bank. When generating each shot, EntityMem composes keyframes positioning these portraits and injects per-entity descriptions into the prompt for video synthesis. The generated video frames are evaluated for intra-shot quality and entity presence; then fidelity-gated entity crops across shots are compared for embedding and LLM-based identity consistency, showing marked improvement over baselines.
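
The walkthrough's memory bank and prompt injection can be sketched in a few lines. Every name here (EntityRecord, EntityMemory, augment_prompt, the file path, the bracketed tag format) is hypothetical and chosen for illustration; the released code may organize this quite differently, and the image-level keyframe composition is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class EntityRecord:
    entity_id: str
    kind: str            # "character", "location", or "object"
    description: str     # textual reference for the entity
    reference_path: str  # placeholder path to a reference image
    verified: bool = False


@dataclass
class EntityMemory:
    records: dict = field(default_factory=dict)

    def add(self, record):
        # Only references that passed verification enter the bank.
        if not record.verified:
            raise ValueError(f"unverified reference for {record.entity_id}")
        self.records[record.entity_id] = record

    def retrieve(self, entity_ids):
        # Entities without a stored reference are skipped.
        return [self.records[e] for e in entity_ids if e in self.records]


def augment_prompt(shot_prompt, scheduled_ids, memory):
    """Append the stored description of each scheduled entity to the prompt."""
    refs = memory.retrieve(scheduled_ids)
    lines = [f"[{r.kind}:{r.entity_id}] {r.description}" for r in refs]
    return "\n".join([shot_prompt, *lines])


memory = EntityMemory()
memory.add(EntityRecord("alice", "character", "red coat, short black hair",
                        "refs/alice.png", verified=True))
prompt = augment_prompt("Alice walks into the cafe.", ["alice", "cafe"], memory)
```

Because every shot reads from the same stored record rather than from the previous shot's output, an error in one shot does not propagate into the reference used for the next.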

Technical innovations

EntityBench introduces a large-scale multi-shot video generation benchmark with explicit per-shot schedules for multiple entity types (characters, objects, locations) tracked simultaneously across long episodes.
A three-pillar evaluation framework disentangles intra-shot quality, prompt-following alignment, and cross-shot entity consistency with 51 automated metrics combining embedding similarity and multimodal LLM judgments gated by fidelity.
EntityMem, a memory-augmented generation system, maintains persistent, per-entity visual and textual references generated and verified before video synthesis, enabling explicit recall of entity identity across shots.
Use of specialized LLM agents to manage stages of entity reference generation, layout keyframe composition, and shot-level memory-augmented generation for more consistent video narratives.

Datasets

EntityBench — 2,491 shots across 140 episodes — curated from real narrative media with LLM enrichment, public release at https://github.com/Catherine-R-He/EntityBench/

Baselines vs proposed

StoryMem: face_fidelity = 0.452 vs EntityMem: 0.740 (+63.7%)
StoryMem: character_presence = 0.849 vs EntityMem: 0.967 (+13.9%)
HoloCine: character_presence = 0.882 vs EntityMem: 0.967 (+9.7%)
CineTrans: character_presence = 0.796 vs EntityMem: 0.967 (+21.4%)
StoryMem: llm_face_accuracy = 0.226 vs EntityMem: 0.406 (+79.6%)
StoryMem: cs_face (DINOv2) = 0.792 vs EntityMem: 0.737 (-7.0%)
CineTrans: imaging_quality = 68.57 vs EntityMem: 66.00 (-3.75%)
EntityMem wins character intra-shot fidelity with Cohen’s d = +1.71 vs StoryMem
EntityMem trails on object fidelity (StoryMem 0.618 vs EntityMem 0.601)
EntityMem trails on embedding-based cross-shot consistency (Cohen’s d = -0.50 vs StoryMem)
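
The two kinds of comparison used in this list, relative change and Cohen's d, can be reproduced with a short sketch. The pooled-standard-deviation form of Cohen's d below is an assumption; the summary does not state which variant the paper uses, and the sample lists are toy values, not benchmark data.

```python
import math
from statistics import mean, variance


def relative_change(baseline, proposed):
    """Percentage change of `proposed` relative to `baseline`."""
    return (proposed - baseline) / baseline * 100.0


def cohens_d(a, b):
    """Cohen's d using the pooled sample variance (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Relative changes recomputed from the rounded values listed above.
face_fidelity_gain = relative_change(0.452, 0.740)
llm_face_gain = relative_change(0.226, 0.406)

# Toy per-episode scores for two methods, for illustration only.
toy_d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

Recomputing from the rounded values can differ from the listed percentages in the last digit (for example, 0.882 to 0.967 gives about +9.6%), which is consistent with the listed figures having been derived from unrounded scores.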

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.15199.

Fig 2: Qualitative comparison on a representative episode. Multiple characters recur in shots

Fig 24: DINOv2 similarity measures consistency in a different way from LLM.

Limitations

EntityMem improves character fidelity but at some cost to object fidelity and embedding-based cross-shot similarity, indicating tradeoffs in entity memory allocation.
Evaluation relies heavily on LLM-based judgment metrics; while comprehensive, this adds indirectness and potential bias in assessing visual consistency.
Dataset is curated and annotated from existing narrative media, which may limit diversity compared to fully synthetic or user-generated multi-shot prompts.
Recurrence gaps beyond 48 shots are not explored, so very long-range entity consistency remains untested.
Evaluation focuses on English language action and entity descriptions; performance in other languages or more abstract prompts is not studied.
The architecture and training are based on existing backbone models; EntityMem itself does not include end-to-end training, so gains may plateau without deeper integration.

Open questions / follow-ons

How to better integrate object memory management to overcome fidelity trade-offs observed in EntityMem?
Can multi-shot video generation models incorporate explicit entity consistency objectives during model training rather than inference-time memory augmentation?
How does entity consistency scale with even longer episodes exceeding 50 shots or with more complex interactions among many entity types?
What are the best ways to generalize multi-entity consistency evaluation beyond English narratives or scripted media?

Why it matters for bot defense

EntityBench and EntityMem provide a framework for rigorously assessing and improving the consistency of visual entities across sequences, a property critical for generating coherent visual narratives resistant to spoofing or automated generation artifacts. For bot-defense practitioners designing CAPTCHAs or bot-detection mechanisms that use multi-frame or video challenges, consistent entity features across frames are essential, and inconsistencies are a common failure mode for generative bots attempting to mimic human-like video or animation patterns. EntityBench offers a standardized benchmark with detailed metrics to quantify these consistency attributes, and it can guide development of generation models with explicit memory mechanisms, like EntityMem, that reduce identity drift. The modular three-pillar evaluation framework may also inspire multi-dimensional scoring functions for CAPTCHAs that evaluate user responses or synthetic bot outputs on prompt adherence and consistency rather than single-frame verification alone. Practical deployment, however, requires attention to latency and scalability given the complexity of long-range memory management and LLM-guided verification.

Cite

@article{arxiv2605_15199,
  title={EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation},
  author={Ruozhen He and Meng Wei and Ziyan Yang and Vicente Ordonez},
  journal={arXiv preprint arXiv:2605.15199},
  year={2026},
  url={https://arxiv.org/abs/2605.15199}
}
