WorldOlympiad: Can Your World Model Survive a Triathlon?
Source: arXiv:2606.11129 · Published 2026-06-09 · By Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang, Dakai An, Akide Liu et al.
TL;DR
WorldOlympiad introduces a comprehensive benchmark to evaluate video-based world models across three critical dimensions: physical faithfulness, geometric consistency, and interaction fidelity. Existing benchmarks predominantly focus on visual quality or semantic alignment in short clips, failing to diagnose whether generative models respect physical laws, maintain coherent 3D structure, or sustain controllable interactions over long horizons. WorldOlympiad fills this gap by decomposing evaluation into three tracks—physical (mechanics, thermodynamics, material properties), geometry (3D reconstruction, camera trajectory consistency), and interaction (instruction following and chunk-to-chunk coherence).
The benchmark spans three major downstream domains—gaming, robotics, and general real-world videos—covering diverse challenges from embodied manipulation to open-domain motion and camera dynamics. It includes a dataset of 1,000 carefully curated and annotated long videos with a multi-stage chunk captioning pipeline enabling detailed evaluation prompts. Evaluation of eight state-of-the-art long-video generation models using WorldOlympiad reveals significant deficits in long-horizon physical reasoning, geometric stability, and interactive control despite recent advances in video diffusion models. These results underscore the need for more structured, domain-agnostic evaluation protocols to rigorously diagnose and guide future video world model development.
Key findings
- LingBot-World achieves the highest overall score of 0.683, with 0.942 physical faithfulness and 0.734 interaction fidelity, demonstrating strong long-horizon state preservation.
- Cosmos-Predict-2.5 reaches 0.671 overall and 0.906 physical faithfulness despite a smaller scale (2B parameters), showing targeted physical training can offset model size.
- Geometric consistency scores remain low across all models: the strongest, Hunyuan-WorldPlay, reaches only 0.424, highlighting unresolved challenges in maintaining 3D spatial structure.
- Physical faithfulness is uneven: mechanics-related rules are better respected than thermodynamics or material property behaviors, which remain fragile under generation.
- Interaction fidelity, measured by chunk-level caption alignment and transition smoothness, shows that models struggle to sustain coherent long-term interactions.
- Eight evaluated video-generation pipelines show a systematic gap between visual plausibility and deeper world-model capabilities such as physics and 3D consistency.
- The benchmark dataset contains 1,000 videos split into 400 robotics, 400 gaming, and 200 real-world long videos, enabling cross-domain evaluation.
- Domain-specialized models (LingBot-World gaming, Cosmos-Predict-2.5 robotics) generalize reasonably well to other domains, but some specialized models like WoW underperform outside their focus.
Threat model
The adversary is a generative video-based world model that predicts future frames from past observations and control inputs. The model is assumed not to have direct access to ground-truth physical parameters or future states but must generate outputs conforming to physical laws, coherent 3D structure, and faithful interactive behavior across long temporal horizons. The evaluation does not consider adversarial manipulation or poisoning of training data, nor does it assume access to hidden scene state beyond video frames and control captions.
Methodology — deep read
The WorldOlympiad benchmark is designed to evaluate video-based world models with a strong focus on 1) physical realism, 2) geometric 3D consistency, and 3) interaction fidelity over long horizons.
Threat Model & Assumptions: The adversary here is implicit: the video world models themselves, which generate future frames conditioned on past frames and control signals. The evaluation assumes models have no privileged knowledge of ground-truth future states but must generate physically plausible, geometrically consistent, and semantically aligned videos with smooth interaction transitions.
Data Provenance and Preparation: The benchmark dataset comprises 1,000 manually curated long videos—400 robotics videos from RoboCOIN (bimanual robotic manipulation), 400 gaming videos from GameGen-X (open-world game video dataset), and 200 real-world videos from LVD-2M (long-take videos with motion). Videos are chunked into up to six non-overlapping temporal segments, each captioned by a 3-stage pipeline using Gemini-3-Pro-Preview MLLM for chunking, captioning (action labels + scene captions), then refinement to ensure terminological consistency and remove hallucinations.
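The chunk annotations described above can be pictured as a small record type. The following is a minimal sketch, not the authors' schema: `ChunkAnnotation`, `validate_chunks`, and all field names are hypothetical, and only the constraints stated in the text (at most six chunks per video, non-overlapping, each carrying an action label and a scene caption) are encoded.

```python
from dataclasses import dataclass
from typing import List

MAX_CHUNKS = 6  # "up to six" segments per video, per the text above


@dataclass(frozen=True)
class ChunkAnnotation:
    """Hypothetical record for one temporal chunk (field names are assumptions)."""
    start_s: float      # chunk start time in seconds
    end_s: float        # chunk end time in seconds
    action_label: str   # short action label from the captioning stage
    scene_caption: str  # scene caption after the refinement stage


def validate_chunks(chunks: List[ChunkAnnotation]) -> None:
    """Raise ValueError if the chunk list violates the stated constraints."""
    if not 1 <= len(chunks) <= MAX_CHUNKS:
        raise ValueError(f"expected 1..{MAX_CHUNKS} chunks, got {len(chunks)}")
    ordered = sorted(chunks, key=lambda c: c.start_s)
    for c in ordered:
        if c.end_s <= c.start_s:
            raise ValueError(f"empty or reversed chunk: {c}")
    # Adjacent chunks may touch (end == next start) but must not overlap.
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt.start_s < prev.end_s:
            raise ValueError("chunks overlap")
```

A check like this would run once per video after the refinement stage, before the captions are turned into generation prompts.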
Architecture/Algorithm: The evaluation framework uses multiple AI models for assessment:
- Physical evaluation uses SAM to segment relevant objects and an MLLM to judge adherence to physical rules (gravity, buoyancy, melting, combustion, etc.) via compliance classification with an explanation and a confidence value.
- Geometry evaluation reconstructs generated videos into Gaussian-splat 3D scenes via a Depth Anything 3 pipeline, producing 3D renderings and camera trajectories. An MLLM judges spatial structure, meta-view quality, and alignment between predicted and reference trajectories to compute a normalized 3D consistency score.
- Interaction evaluation combines CLIP cosine similarity between frames and chunk captions with a multi-level MLLM rubric scoring chunk-level visual/text alignment, chunk-transition smoothness, and global video coherence.
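The CLIP signal in the interaction track reduces to a cosine similarity between embedding vectors. The sketch below shows only that arithmetic and assumes the frame and caption embeddings have already been computed by a CLIP model. The function names and the choice to average over frames are assumptions for illustration, not details confirmed by the summary.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def chunk_caption_similarity(
    frame_embeddings: Sequence[Sequence[float]],
    caption_embedding: Sequence[float],
) -> float:
    """Mean frame-to-caption cosine similarity for one chunk.

    Averaging over frames is an assumption; the benchmark may pool differently.
    """
    if not frame_embeddings:
        raise ValueError("chunk has no frames")
    sims = [cosine_similarity(f, caption_embedding) for f in frame_embeddings]
    return sum(sims) / len(sims)
```

In practice the inputs would be image and text embeddings from the same CLIP checkpoint, so that both live in a shared embedding space.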
Training Regime: Not applicable, as the models under test are publicly available video generation pipelines used off-the-shelf.
Evaluation Protocol: Each model generates videos corresponding to the 1,000 benchmark videos with original chunk captions mapped to native chunk sizes. Scores are computed per video then averaged across subsets and domains. Physical compliance is averaged over applicable metrics per subset. Geometry scores combine static scene reconstruction, meta-views, and camera trajectory. Interaction scores blend CLIP semantic adherence with MLLM multi-level judging. The three track scores are equally weighted to yield an overall leaderboard ranking.
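The equal weighting of the three tracks can be checked against the figures quoted in this summary. The sketch below is an illustration rather than the released evaluation code: `overall_score` and the track keys are hypothetical names, and the inputs are the LingBot-World scores reported here (0.942 physical, 0.373 3D consistency, 0.734 interaction).

```python
from statistics import mean
from typing import Dict

# Track names are assumptions chosen for this sketch.
TRACKS = ("physical", "geometry", "interaction")


def overall_score(track_scores: Dict[str, float]) -> float:
    """Equal-weighted mean of the three track scores."""
    missing = [t for t in TRACKS if t not in track_scores]
    if missing:
        raise KeyError(f"missing track scores: {missing}")
    return mean(track_scores[t] for t in TRACKS)


# LingBot-World track scores as quoted in this summary.
lingbot_world = {"physical": 0.942, "geometry": 0.373, "interaction": 0.734}
print(round(overall_score(lingbot_world), 3))  # 0.683
```

The result agrees with the reported overall score of 0.683, which is consistent with a plain unweighted average across tracks.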
Reproducibility: Code and dataset splits are released on the project GitHub. Evaluations use publicly available pretrained models and a standardized automated pipeline that implements the metrics. However, full reproduction depends on access to checkpoints for the eight long-video generation pipelines evaluated, some of which may be private or partially closed.
Technical innovations
- Decomposition of video-based world model evaluation into three complementary assessment tracks: physical law adherence, 3D geometric consistency, and long-horizon interaction fidelity.
- Use of multimodal large language models (MLLMs) as judges to interpret physical phenomena, assign compliance scores, and evaluate chunk-level, transition, and global interaction coherence.
- Geometry evaluation via reconstruction of generated videos into Gaussian-splat 3D representations and evaluation of cross-view scene consistency and camera trajectory alignment.
- A three-stage chunk-caption-refine pipeline that uses MLLMs and segmentation to generate finely curated temporal chunks annotated with action labels and scene captions to support interpretable evaluation.
Datasets
- WorldOlympiad Benchmark — 1,000 videos (400 robotics from RoboCOIN, 400 gaming from GameGen-X, 200 real-world from LVD-2M) — curated and filtered for physical consistency and long-video generation assessment
Baselines vs proposed
- Matrix-Game 2.0: Overall score = 0.231 vs LingBot-World: 0.683
- Cosmos-Predict-2.5: Physical faithfulness = 0.906 vs LingBot-World: 0.942
- Hunyuan-WorldPlay: 3D consistency = 0.424 vs LingBot-World: 0.373
- WoW: Interaction fidelity = 0.345 vs Cosmos-Predict-2.5: 0.707
- Rolling Forcing: Overall score = 0.610 vs LongLive: 0.584
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.11129.

Fig 1: Overview of the WorldOlympiad pipeline for data collection, long-video generation, and multi-dimensional

Fig 2: Data collection overview across robotics, gaming, and real-world video sources.

Figs 3-8 (page 2).
Limitations
- Physical reasoning evaluation relies on MLLM judgment and segmentation masks, which may introduce bias or errors if object segmentation or MLLM understanding is imperfect.
- Geometric evaluation is limited to static scene reconstructions with dynamic objects removed, so it may miss inconsistencies in the dynamics of moving objects.
- Interaction evaluation uses CLIP similarity as a lightweight signal, which can be insensitive to finer aspects of video semantics or action fidelity.
- Benchmark scenarios focus on three specific downstream application domains, potentially limiting generalization to other video world model uses.
- Evaluations do not include adversarial or out-of-distribution attacks on the models to stress-test robustness of physical or interaction reasoning.
- Some evaluated models are closed-source or rely on private data, limiting full reproducibility of generation results.
Open questions / follow-ons
- How can video world models better internalize and generalize complex thermodynamic and material property behaviors, which currently remain fragile under generation?
- What architectural or memory mechanisms can close the large gap in 3D spatial consistency and camera trajectory alignment in generative video models?
- Can interactive generation techniques be improved to ensure smoother, more coherent chunk-to-chunk transitions over extremely long time horizons?
- How well do the proposed evaluation metrics hold under adversarial scenarios or unseen domain shifts beyond the benchmark's datasets?
Why it matters for bot defense
From a bot-defense and CAPTCHA perspective, WorldOlympiad offers a rigorous benchmark for evaluating the fidelity of video generation models that aim to simulate realistic environments over time. For CAPTCHAs or challenge-response tests involving video, knowing whether generative models can obey physical laws, maintain consistent 3D geometry, and sustain coherent interactive actions is critical for detecting synthetic or bot-generated content. The multidimensional metrics, which combine physical rule compliance, geometric consistency, and interaction alignment, provide interpretable signals that can augment bot-detection heuristics beyond traditional visual quality or semantic checks. Moreover, the evaluation of long-horizon interaction fidelity parallels the challenge of verifying whether an agent can follow complex instructions over sequential inputs, a key concern in interaction-based defenses.
Bot-defense engineers might leverage WorldOlympiad’s methodology to develop new assessments or verification modules for video CAPTCHA systems that test not only visual realism but also the underlying world-model consistency of candidate responses. This can help distinguish genuine human-generated or physically consistent interactions from those produced by generative models that fail to maintain stable physics or geometry over time. However, applying these techniques requires segmentation, Gaussian-splat reconstruction, and MLLM-based judgments, none of which is currently trivial to deploy in real-time service contexts.
Cite
@article{arxiv2606_11129,
title={WorldOlympiad: Can Your World Model Survive a Triathlon?},
author={Yuke Zhao and Wangbo Zhao and Weijie Wang and Zeyu Zhang and Dakai An and Akide Liu and Yinghao Yu and Jiasheng Tang and Fan Wang and Wei Wang and Bohan Zhuang},
journal={arXiv preprint arXiv:2606.11129},
year={2026},
url={https://arxiv.org/abs/2606.11129}
}