Quantitative Video World Model Evaluation for Geometric-Consistency

Source: arXiv:2605.15185 · Published 2026-05-14 · By Jiaxin Wu, Yihao Pi, Yinling Zhang, Yuheng Li, Xueyan Zou

TL;DR

This paper addresses the challenge of quantitatively evaluating geometric and physical consistency in generative video models, which are increasingly framed as implicit world models. Traditional evaluation metrics focus on perceptual quality and semantics and cannot detect fundamental 3D geometric failures such as scale-depth misalignment, non-rigid motion, and inconsistent 3D trajectories. To close this gap, the authors propose PDI-Bench, an evaluation framework whose perception pipeline combines segmentation (SAM 2), monocular 3D reconstruction (MegaSaM), and temporal tracking (CoTracker3) to lift 2D video content into a verifiable 3D space. They define the Perspective Distortion Index (PDI), an interpretable metric that aggregates residual errors in scale-depth invariance, 3D motion consistency, and structural rigidity to diagnose physical hallucinations in generated videos. The accompanying PDI-Dataset covers five scenarios that stress geometric constraints. Quantitative evaluation of six state-of-the-art video generators reveals persistent geometry-specific failure modes missed by standard perceptual metrics: some strong models approach ground-truth fidelity, while others exhibit severe scale hallucinations and motion jitter. A human expert study confirms PDI's alignment with human judgment. The work offers a physical yardstick for progress toward video-based world models with realistic 3D geometry and motion.

Key findings

Ground Truth (real-world) videos achieve a PDI score of 0.1206 with minimal geometric errors, establishing a low baseline for physical consistency.
Seedance 2.0 achieves the best generative model PDI score at 0.2422, with 0% outliers and an 89.3% MathPass rate, indicating superior geometric self-consistency.
CogVideoX-3 performs near GT on motion consistency (ε_t = 0.2033) and structural rigidity (ε_r = 0.2065), showing smooth and coherent object synthesis.
Transformer-based models Sora and HunyuanVideo show severe scale hallucinations with ε_s > 1.67, a 25x increase over GT, causing volume breathing and inconsistent object sizes.
Curved motion scenarios cause catastrophic failures in models like Sora (PDI = 2.13) with massive scale distortion (ε_s = 4.87), reflecting poor perspective invariance during rotations.
Partial occlusion tests reveal that HunyuanVideo suffers severe degradation (PDI = 2.41, ε_s = 5.38), indicating that the model forgets object scale and loses physical permanence once the object is partly hidden.
A human expert study with 7 CV experts shows perfect rank-order correlation (ρ=1.0) between subjective physical realism scores and PDI rankings, confirming PDI's perceptual validity.
Autoregressive long-video generation on Wan2.1 shows scale drift (ε_s ≈ 2.86) despite stable smooth trajectories (ε_t ≈ 0.32), indicating loss of 3D volume memory over time.
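The rank-order agreement reported for the human study can be illustrated with a small Spearman correlation sketch. The scores below are made-up placeholder values, not numbers from the paper, and the comparison against negated PDI is an assumption that reflects "lower PDI means more physically plausible".

```python
def ranks(values):
    """Return 1-based ranks, using the average rank for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model values (illustration only, not from the paper).
pdi_scores = [0.12, 0.24, 0.55, 0.90]      # lower is better
human_realism = [4.8, 4.1, 3.0, 2.2]       # higher is better

# Negate PDI so that both lists are "higher is better" before ranking.
rho = spearman_rho([-p for p in pdi_scores], human_realism)
print(rho)
```

A value of 1.0 here only says the two orderings agree; it says nothing about how far apart the models are on either scale.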

Threat model

The adversary is a generative video model that produces synthetic video sequences intended to appear visually plausible but may harbor underlying physical inconsistencies such as incorrect 3D scale, implausible motion, or structural deformations. The evaluator has no access to the model internals or ground-truth 3D data, relying only on monocular RGB video to detect violations of fundamental physical and geometric laws. The adversary cannot alter camera parameters or 3D reconstruction steps, and the evaluation assumes the perception pipeline yields sufficiently accurate 3D lifts and segmentations.

Methodology — deep read

PDI-Bench evaluates black-box generative video models from monocular RGB video alone, with no access to internal 3D representations or ground-truth 3D data. The pipeline converts 2D video frames into 3D coordinate trajectories, which are then audited for physical consistency.

The data is the newly curated PDI-Dataset of 183 video sequences: 15 high-quality real-world ground-truth videos plus 168 synthetic videos generated by 6 state-of-the-art models (3 open-source, 3 closed-source). The videos cover 5 physically challenging scenarios: longitudinal convergence, dynamic tracking, biological motion, curved motion, and partial occlusion.

Preprocessing consists of automated text-to-box prompting with Florence-2, segmentation masks from SAM 2, monocular 3D reconstruction with MegaSaM (producing time-sequenced pointmaps and camera poses), and dense point tracking with CoTracker3 to produce anchor trajectories.
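To make the idea of lifting pixels into a camera-independent 3D frame concrete, here is a minimal sketch using a standard pinhole camera model. It is not MegaSaM's API or the authors' code: the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the camera-to-world pose convention are all assumptions chosen for illustration.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def camera_to_world(p_cam, R, t):
    """Apply an assumed camera-to-world pose: X_world = R @ X_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


# Hypothetical intrinsics and poses (illustration only).
FX = FY = 500.0
CX, CY = 320.0, 240.0
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A static point at world (2, 0, 5) seen from two camera positions.
# Frame 0: camera at the origin, so the point projects to u = 520.
# Frame 1: camera moved 1 unit along +x, so the point projects to u = 420.
frame0 = camera_to_world(
    unproject(520.0, 240.0, 5.0, FX, FY, CX, CY), IDENTITY, (0.0, 0.0, 0.0)
)
frame1 = camera_to_world(
    unproject(420.0, 240.0, 5.0, FX, FY, CX, CY), IDENTITY, (1.0, 0.0, 0.0)
)
print(frame0, frame1)
```

The pixel moves by 100 columns between the two frames, yet both observations map to the same world point once the camera pose is applied. That is the sense in which world-frame trajectories are decoupled from camera motion.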

The core algorithm implements a Target-Uplift-Anchor pipeline: Semantic Targeting isolates objects per frame, Geometric Uplifting lifts 2D pixels to 3D world coordinates decoupled from camera motion, and Structural Anchoring seeds and tracks point anchors through CoTracker3 to obtain precise 3D anchor trajectories. These 3D trajectories serve as the basis for three physically motivated residual computations:

Scale-Depth Alignment residual (ε_scale, abbreviated ε_s elsewhere in this summary): tests whether pixel height multiplied by depth stays invariant across frames, flagging volume breathing or unnatural scale changes.
3D Motion Consistency residual (ε_traj, abbreviated ε_t): audits kinematic plausibility by evaluating 3D centroid velocity and acceleration, penalizing unnatural jitter or sudden directional reversals.
Structural Rigidity residual (ε_rigidity, abbreviated ε_r): measures the median absolute deviation of pairwise 3D anchor distances over time to detect non-Euclidean deformations (e.g., the jello effect).

The final Perspective Distortion Index (PDI) is a weighted sum of these residuals, with weights empirically set to 0.4, 0.4, and 0.2 respectively.
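The three residuals and their weighted sum can be sketched as follows. Only the quantities being tested and the 0.4/0.4/0.2 weights come from the summary above. The specific normalizations (relative deviation from the median, acceleration divided by mean speed, MAD divided by median distance) are assumptions made for this illustration and may differ from the paper's exact formulas.

```python
import math
from itertools import combinations
from statistics import median


def rms(xs):
    """Root mean square of a list of numbers."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


def scale_depth_residual(heights_px, depths):
    """Relative RMS deviation of (pixel height * depth) from its median."""
    products = [h * z for h, z in zip(heights_px, depths)]
    ref = median(products)
    return rms([(p - ref) / ref for p in products])


def trajectory_residual(centroids, eps=1e-9):
    """RMS centroid acceleration, normalized by mean speed (assumed form)."""
    vel = [tuple(b[i] - a[i] for i in range(3))
           for a, b in zip(centroids, centroids[1:])]
    acc = [tuple(b[i] - a[i] for i in range(3)) for a, b in zip(vel, vel[1:])]
    mean_speed = sum(math.hypot(*v) for v in vel) / len(vel)
    return rms([math.hypot(*a) for a in acc]) / (mean_speed + eps)


def rigidity_residual(anchor_tracks):
    """Mean over anchor pairs of MAD(distance over time) / median distance.

    anchor_tracks[k][t] is the 3D position of anchor k at frame t.
    """
    scores = []
    for a, b in combinations(anchor_tracks, 2):
        dists = [math.dist(p, q) for p, q in zip(a, b)]
        ref = median(dists)
        mad = median([abs(d - ref) for d in dists])
        scores.append(mad / ref)
    return sum(scores) / len(scores)


def pdi(eps_scale, eps_traj, eps_rigid, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three residuals, with the weights quoted above."""
    w_s, w_t, w_r = weights
    return w_s * eps_scale + w_t * eps_traj + w_r * eps_rigid


# Synthetic rigid object receding at constant velocity (illustration only).
depths = [5.0 + 0.5 * t for t in range(8)]
heights = [500.0 / z for z in depths]          # perspective: h = f * H / z
top = [(0.0, 1.0, z) for z in depths]
bottom = [(0.0, 0.0, z) for z in depths]
centroids = [(0.0, 0.5, z) for z in depths]

eps_t = trajectory_residual(centroids)
eps_r = rigidity_residual([top, bottom])
clean = pdi(scale_depth_residual(heights, depths), eps_t, eps_r)

# Same motion, but the apparent height "breathes" by 20% every other frame.
breathing = [h * (1.2 if t % 2 else 1.0) for t, h in enumerate(heights)]
distorted = pdi(scale_depth_residual(breathing, depths), eps_t, eps_r)
print(clean, distorted)
```

In this toy case the clean clip scores essentially zero, while the breathing clip is penalized only through the scale term. The same separation is what lets the benchmark report smooth motion and broken scale for a single clip.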

Training regime details do not apply here as this is an evaluation benchmark; the backbone perceptual models (SAM 2, MegaSaM, CoTracker3) are pre-existing and frozen. Evaluation involves calculating PDI scores for each video in the PDI-Dataset across the 6 generative models and ground truth, stratified by scenario. Metrics reported are RMSE of residuals and derived statistics like standard deviation, outlier percentage, and a MathPass rate indicating physically plausible clips. Statistical confidence intervals are reported for PDI values. A human expert perceptual study with 7 CV experts scoring 105 clips validates PDI’s alignment with subjective judgments. The paper includes a case study analyzing autoregressive generation beyond training context.

Reproducibility: The authors release PDI-Bench code and the PDI-Dataset publicly at https://pdi-bench.github.io/, enabling external validation and further research on geometric evaluation.

Technical innovations

PDI-Bench introduces an explicit quantitative framework to audit 3D geometric consistency in generative videos by lifting 2D pixel data into a verifiable 3D coordinate system decoupled from camera ego-motion.
The Perspective Distortion Index (PDI) synthesizes three orthogonal residuals (scale-depth alignment, 3D motion consistency, structural rigidity) as physically grounded, interpretable metrics to detect specific types of geometric hallucinations.
The Target-Uplift-Anchor perceptual pipeline uniquely combines state-of-the-art video segmentation (SAM 2), monocular 3D reconstruction (MegaSaM), and dense tracking (CoTracker3) to generate reliable 3D anchor trajectories within objects for fine-grained physical auditing.
The authors present PDI-Dataset, a novel benchmark emphasizing geometrically challenging scenarios designed to systematically stress-test spatial consistency in generated videos.
The paper demonstrates that established perceptual metrics such as FVD and CLIP-based scores do not capture fundamental 3D geometric errors, which is where PDI-Bench adds diagnostic value.

Datasets

PDI-Dataset — 183 videos (15 real-world GT, 168 synthetic) — publicly released at https://pdi-bench.github.io/

Baselines vs proposed

Ground Truth (GT): PDI score = 0.1206 vs Seedance 2.0: PDI score = 0.2422 (best generative model)
CogVideoX-3: motion residual ε_t = 0.2033, rigidity residual ε_r = 0.2065 near GT vs Sora: scale residual ε_s = 1.6753 (severe distortion)
Seedance 2.0: 0% outliers and 89.3% MathPass rate vs HunyuanVideo: 14.3% outliers and 57.1% MathPass rate
Autoregressive Wan2.1: scale error ε_s = 2.8583 with stable trajectory error ε_t = 0.3170, indicating scale drift that persists even when motion remains smooth
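The percentages above can be checked against whole clip counts. This assumes, as an inference that the summary does not state, that the 168 synthetic clips are split evenly across the 6 models and that both rates are computed over each model's clips. The helper name `matching_count` is hypothetical.

```python
TOTAL_SYNTHETIC = 168
NUM_MODELS = 6

# Assumption: an even split, which gives 28 clips per model.
clips_per_model = TOTAL_SYNTHETIC // NUM_MODELS


def matching_count(rate_percent, n):
    """Return the integer k for which k / n rounds to rate_percent (1 decimal)."""
    for k in range(n + 1):
        if round(100 * k / n, 1) == rate_percent:
            return k
    return None


for label, rate in [
    ("Seedance 2.0 MathPass", 89.3),
    ("HunyuanVideo MathPass", 57.1),
    ("HunyuanVideo outliers", 14.3),
]:
    print(label, rate, "->", matching_count(rate, clips_per_model),
          "of", clips_per_model)
```

Under that assumption, each reported rate corresponds to a whole number of clips out of 28. This is a consistency check on the reported figures, not a confirmation of how the paper computed them.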

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.15185.

Fig 1: Overview of the PDI-Bench Evaluation.

Fig 2: The three key perspectives that PDI-Bench is evaluating for geometric consistency.

Fig 3: Overview of the PDI-Dataset and Experimental Setup.

Limitations

PDI-Bench relies on monocular 3D reconstruction (MegaSaM) and tracking (CoTracker3), which may have residual noise and partial inaccuracies affecting final metrics.
Evaluation focuses on rigid or semi-rigid objects; non-rigid or highly deformable scenarios are only partially addressed via structural rigidity residual but remain challenging.
The dataset includes only 15 real-world ground truth videos, which might limit generalizability of physical baselines in diverse real scenes.
The human expert study involves only 7 experts; although their ratings were consistent, this is a relatively small perceptual validation sample.
PDI as a diagnostic tool assumes stable camera pose estimation; extreme viewpoint changes or erratic camera motion may degrade 3D uplift precision and metric reliability.
No adversarial evaluation or robustness tests against deliberate physical inconsistency generation are reported.

Open questions / follow-ons

How to extend PDI-Bench to handle highly deformable objects or fluid dynamics where rigidity assumptions break down?
Can real-time or online geometric audits be developed for use during generative video synthesis to impose physical constraints proactively?
What improvements to monocular 3D reconstruction or tracking accuracy are necessary to reduce noise and false positives in geometric consistency metrics?
How do these geometric consistency failures impact downstream tasks relying on video world models, such as robotics or AR?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, PDI-Bench offers a physically grounded methodology for distinguishing spatially coherent synthetic videos from superficially plausible fakes. Its metrics could inform defenses against bots or generated content that violates geometric consistency, for example in video CAPTCHAs or screens with moving objects. Embedding geometric-consistency checks may raise the bar for forged video generation, forcing adversaries to produce physically believable 3D motion rather than mere 2D pixel-level plausibility. However, running these multi-stage 3D audits in real-time or low-latency settings remains a challenge. Practitioners may also draw on PDI-Bench's diagnostic framework to design or test adversarial generation techniques that preserve latent world structure, which in turn can inform detection or filtering pipelines.

Cite

bibtex

@article{arxiv2605_15185,
  title={Quantitative Video World Model Evaluation for Geometric-Consistency},
  author={Jiaxin Wu and Yihao Pi and Yinling Zhang and Yuheng Li and Xueyan Zou},
  journal={arXiv preprint arXiv:2605.15185},
  year={2026},
  url={https://arxiv.org/abs/2605.15185}
}
