minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Source: arXiv:2605.30263 · Published 2026-05-28 · By Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen, Wenqiang Sun et al.

TL;DR

minWM addresses the challenge of converting powerful but offline and multi-step video diffusion foundation models into real-time, interactive video world models that are camera-controllable, causal, and low-latency. The key innovation is a full-stack, open-source pipeline that covers data preparation with ground-truth camera trajectories, controllable fine-tuning of bidirectional diffusion backbones, and a multi-stage autoregressive (AR) diffusion distillation workflow (Causal Forcing / Causal Forcing++), culminating in few-step AR models suitable for live interactive applications. The framework is modular and architecture-agnostic, demonstrated on two large-scale open backbones (Wan2.1-T2V-1.3B and HY1.5-TI2V-8B) and supports adapting existing video world models to new latency and data requirements. Extensive ablations clarify key training conditions for controllability, such as data quality, training steps, and batch size.

Empirically, minWM reduces first-frame latency by more than 220× relative to multi-step bidirectional baselines (223.75× for HY1.5 and 236.64× for Wan2.1 on a single A800 GPU) while preserving camera-conditioned video generation. It provides runnable scripts, checkpoints, and documentation for reproducing real-time interactive video world models. The work establishes an extensible recipe for research and engineering efforts that target practical video world modeling with real-time user control and low-latency rollout.

Key findings

  • minWM achieves 223.75× first-frame latency reduction over multi-step bidirectional HY1.5 baseline on a single A800 GPU (Tab. 1).
  • minWM achieves 236.64× first-frame latency reduction over multi-step bidirectional Wan2.1 baseline on the same hardware.
  • Camera-controllable generation is preserved after distillation, supporting dynamic camera trajectory inputs (Fig. 2).
  • Models trained with estimated camera poses from SpatialVid [34] failed to learn reliable camera controllability under the current setup (Fig. 3a).
  • Using ground-truth camera trajectories obtained via 3D reconstruction and rendering (DL3DV dataset [35]) enables successful camera control learning (Fig. 3b).
  • WorldPlay-generated camera trajectories from OpenVid [36] similarly enable camera-controllable training in an open-source setting (Fig. 3c).
  • Camera controllability emerges progressively during training: the HY1.5 model is largely uncontrollable at 1-2K steps, begins to respond around 5K steps, and achieves strong controllability after 8K steps (Fig. 4).
  • Batch size critically affects camera control training: Wan2.1 training fails below batch size 4, improves substantially at 8, and achieves high controllability at 16 (Fig. 5).

Threat model

n/a — The paper focuses on engineering methods for video world models rather than addressing adversarial threats. The interactive user supplies camera trajectories and expects low-latency, causal video generation.

Methodology — deep read

The core methodology follows a two-phase pipeline to convert text-to-video (T2V) or text-and-image-to-video (TI2V) bidirectional diffusion models into few-step autoregressive (AR) video world models with camera controllability, enabling real-time interaction.

Threat model: No adversary is defined, since the work targets the engineering of interactive video models rather than adversarial robustness. The operating assumption is an interactive user who supplies camera trajectory inputs and expects the model to generate plausible future frames causally and with low latency.

Data: Training data comprises videos with associated camera parameters (intrinsics and extrinsics). Ground-truth camera trajectories are emphasized as critical. Attempts to use SpatialVid [34] with estimated poses failed to yield stable controllable models, presumably due to pose estimation noise. Instead, DL3DV [35] 3D reconstructed scenes are rendered with prescribed trajectories, and OpenVid [36] images combined with WorldPlay [8] synthetic trajectories serve as effective open-source datasets.
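
To make "videos with associated camera parameters" concrete, the following is a minimal sketch of a prescribed camera trajectory stored as per-frame intrinsics and world-to-camera extrinsics. The `CameraFrame` record, the `pan_trajectory` helper, the yaw-only motion, and the focal length are all hypothetical choices for illustration; the frame count and principal point simply reuse the 77-frame, 480×832 setting mentioned in this summary and are not the paper's data format.

```python
import math
from dataclasses import dataclass


@dataclass
class CameraFrame:
    """Hypothetical per-frame camera annotation."""
    intrinsics: list       # 3x3 pinhole matrix K
    world_to_camera: list  # 4x4 rigid transform [R | t; 0 0 0 1]


def pan_trajectory(num_frames, total_yaw_deg, fx, fy, cx, cy):
    """Prescribe a smooth pan: rotate about the y-axis by total_yaw_deg overall."""
    K = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    frames = []
    for i in range(num_frames):
        # Linear interpolation of the yaw angle from 0 to total_yaw_deg.
        yaw = math.radians(total_yaw_deg) * i / max(num_frames - 1, 1)
        c, s = math.cos(yaw), math.sin(yaw)
        w2c = [
            [c, 0.0, s, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [-s, 0.0, c, 0.0],
            [0.0, 0.0, 0.0, 1.0],
        ]
        frames.append(CameraFrame(K, w2c))
    return frames


# Assumed values: 77 frames, 30-degree pan, principal point at the image center.
trajectory = pan_trajectory(77, 30.0, fx=500.0, fy=500.0, cx=416.0, cy=240.0)
print(len(trajectory), trajectory[-1].world_to_camera[0])
```

Because the trajectory is prescribed rather than estimated, every frame's pose is exact by construction, which is the property the data ablation identifies as important.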

Phase 1 (Camera-Controllable Bidirectional Model Fine-tuning): The bidirectional diffusion backbone (Wan2.1-T2V-1.3B or HY1.5-TI2V-8B) is fine-tuned on camera-annotated videos using PRoPE, a relative positional encoding of camera parameters that is injected into the self-attention layers. This conditions the diffusion model on camera trajectories while retaining the generative priors learned by the foundation model.

PRoPE uses lifted projective matrices combining camera intrinsics and world-to-camera extrinsics to transform query/key/value tensors within self-attention, encoding relative camera information between tokens.
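
As a rough illustration of why this encoding is relative, the sketch below builds a 4×4 lifted matrix from intrinsics and world-to-camera extrinsics and scores a query/key pair through the product P_q P_k^{-1}. Placing the transpose on the query side and the inverse on the key side is an assumption based on how PRoPE is usually described; this summary does not spell it out, and minWM's implementation may differ. All helper names, the yaw-only rotation, and the 4-dimensional toy features are hypothetical, and the value-side transform is omitted.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def rigid(yaw, t):
    """4x4 rigid transform: rotation about the y-axis plus translation t."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, 0.0, s, t[0]], [0.0, 1.0, 0.0, t[1]], [-s, 0.0, c, t[2]], [0.0, 0.0, 0.0, 1.0]]


def lifted_projection(fx, fy, cx, cy, yaw, t):
    """Lift the 3x3 intrinsics to 4x4 and compose with world-to-camera extrinsics."""
    intr = [[fx, 0.0, cx, 0.0], [0.0, fy, cy, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    return matmul(intr, rigid(yaw, t))


def prope_score(q, k, P_q, P_k):
    """Attention logit q^T (P_q P_k^{-1}) k for one 4-dimensional feature block."""
    q_t = matmul(transpose(P_q), [[v] for v in q])  # P_q^T q
    k_t = matmul(inverse(P_k), [[v] for v in k])    # P_k^{-1} k
    return sum(a[0] * b[0] for a, b in zip(q_t, k_t))


q, k = [0.3, -1.0, 0.5, 2.0], [1.0, 0.2, -0.7, 0.4]
P_a = lifted_projection(1.2, 1.2, 0.5, 0.5, yaw=0.1, t=(0.0, 0.0, 1.0))
P_b = lifted_projection(1.2, 1.2, 0.5, 0.5, yaw=-0.3, t=(0.2, 0.0, 1.5))
G = rigid(0.7, (1.0, -2.0, 0.5))  # arbitrary change of world coordinates

score = prope_score(q, k, P_a, P_b)
score_moved = prope_score(q, k, matmul(P_a, G), matmul(P_b, G))
print(score, score_moved)
```

The two printed scores agree up to floating-point error: re-expressing both cameras in a different world frame leaves the logit unchanged, since only the relative transform between the two cameras enters the product.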

Phase 2 (AR Diffusion Distillation): The camera-controllable bidirectional model is distilled into a few-step AR diffusion generator via a 3-stage process:

  1. AR Diffusion Training (teacher forcing): the model is fine-tuned under causal attention to autoregressively generate frames conditioned on past video frames and camera controls.
  2. Causal ODE or Causal Consistency Distillation (CD) Initialization: to cut the latency of multi-step generation, the model learns few-step generation directly by regressing noisy intermediate frames to clean frames. It does so either from teacher-generated PF-ODE trajectories (causal ODE) or via causal CD, which avoids costly offline data generation.
  3. Asymmetric Distribution Matching Distillation (DMD): a post-training stage where the few-step AR model self-rolls out entire sequences and is aligned with the bidirectional teacher distribution using a diffusion score matching loss, improving generation quality.
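
The causal attention used from stage 1 onward can be pictured as a chunk-wise mask. The sketch below assumes frames attend bidirectionally within their own chunk and causally to all earlier chunks, a common design for chunked AR video models that this summary does not confirm for minWM; the function name is hypothetical and the chunk size of 4 is taken from the training details reported here.

```python
def block_causal_mask(num_frames, chunk_size):
    """mask[q][k] is True when query frame q may attend to key frame k.

    Frames see every frame in their own chunk and in all earlier chunks,
    but nothing from later chunks.
    """
    return [
        [(k // chunk_size) <= (q // chunk_size) for k in range(num_frames)]
        for q in range(num_frames)
    ]


mask = block_causal_mask(num_frames=8, chunk_size=4)
for row in mask:
    # Print 1 for "may attend" and 0 for "masked out".
    print("".join("1" if allowed else "0" for allowed in row))
```

In a real model each frame expands into many spatial tokens, so the same pattern would be applied at token granularity.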

All stages condition on camera parameters, which preserves controllability. Training uses batch size 32 and learning rates between roughly 2e-6 and 1e-5, with staged step counts (e.g., HY1.5: 8K bidirectional fine-tuning, 4K for AR stage 1, 1.5K for stage 2, 500 for stage 3). The autoregressive chunk size is 4 frames, and the few-step AR model uses 4 denoising steps per chunk.
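
The control flow of chunked few-step rollout can be sketched as follows. This is a toy under stated assumptions: `toy_denoise` is a hypothetical stand-in for the distilled AR denoiser, frames are single floats rather than latents, and camera conditioning and KV caching are omitted. Only the chunk size of 4 and the 4 denoising steps come from the text.

```python
import random


def toy_denoise(chunk, context, step, num_steps):
    """Hypothetical denoiser: pulls the noisy chunk toward the mean of past frames."""
    target = sum(context) / len(context) if context else 0.0
    blend = (step + 1) / num_steps
    return [(1.0 - blend) * x + blend * target for x in chunk]


def rollout(num_frames, chunk_size=4, num_steps=4, seed=0):
    """Generate frames chunk by chunk, each chunk in a fixed number of steps."""
    rng = random.Random(seed)
    frames, calls, calls_to_first_chunk = [], 0, None
    for start in range(0, num_frames, chunk_size):
        size = min(chunk_size, num_frames - start)
        chunk = [rng.gauss(0.0, 1.0) for _ in range(size)]  # start from noise
        for step in range(num_steps):
            # Each call sees only already-generated frames (causal context).
            chunk = toy_denoise(chunk, frames, step, num_steps)
            calls += 1
        frames.extend(chunk)
        if calls_to_first_chunk is None:
            calls_to_first_chunk = calls  # first chunk is ready to display here
    return frames, calls, calls_to_first_chunk


frames, calls, first = rollout(num_frames=20)
print(len(frames), calls, first)
```

The point of the sketch is that the first chunk becomes available after only `num_steps` denoiser calls, independent of the total video length, whereas a bidirectional model must denoise the whole clip before any frame can be shown.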

Evaluation measures first-frame generation latency, controllability via qualitative video samples, and ablations exploring training data, number of fine-tuning steps, and batch-size effects.

Reproducibility: The full pipeline is released as open source with runnable scripts, checkpoints for intermediate and final models, and documented inference code at https://github.com/shengshu-ai/minWM. The training datasets appear to be partially open (OpenVid, DL3DV) or synthetically generated via WorldPlay; SpatialVid is public but was shown to be insufficient for camera control under the current protocol.

Concrete example: Starting from the Wan2.1-T2V-1.3B bidirectional model, fine-tune with PRoPE camera conditioning on WorldPlay-generated video data for 5K steps, then run AR diffusion training (4K steps), causal CD distillation (2K steps), and asymmetric DMD (200 steps). The result is a model that autoregressively generates 77-frame 480×832 videos conditioned on camera motion with just 4 denoising steps per chunk, cutting first-frame generation latency from 269.1s to 1.14s on an A800 GPU.
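
The staged step counts quoted in this summary can be collected into a small config for comparison. The `StageSchedule` class and its field names are hypothetical and do not correspond to the repository's actual configuration files; the numbers are the ones reported above for HY1.5 and Wan2.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageSchedule:
    """Hypothetical container for per-stage training steps."""
    bidirectional_finetune: int  # Phase 1: camera-controllable fine-tuning
    ar_diffusion: int            # Phase 2, stage 1: teacher-forced AR training
    few_step_init: int           # Phase 2, stage 2: causal ODE / CD initialization
    asymmetric_dmd: int          # Phase 2, stage 3: asymmetric DMD

    def total(self):
        return (self.bidirectional_finetune + self.ar_diffusion
                + self.few_step_init + self.asymmetric_dmd)


SCHEDULES = {
    "HY1.5": StageSchedule(8000, 4000, 1500, 500),
    "Wan2.1": StageSchedule(5000, 4000, 2000, 200),
}

for name, schedule in SCHEDULES.items():
    print(f"{name}: {schedule.total()} total steps")
```

Summed this way, the recipe amounts to 14,000 optimizer steps for HY1.5 and 11,200 for Wan2.1, with the final DMD stage being the shortest in both cases.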

Technical innovations

  • Integration of PRoPE camera relative positional encoding into video diffusion self-attention to enable camera-controllable generation while maintaining bidirectional diffusion priors.
  • Full-stack pipeline combining camera-controllable bidirectional fine-tuning with Causal Forcing / Causal Forcing++ AR diffusion distillation stages (AR training, causal ODE/CD initialization, asymmetric DMD) into few-step AR video generators for real-time rollout.
  • Modular architecture-agnostic design demonstrated across cross-attention Wan2.1-T2V and MMDiT-style HY1.5-TI2V backbones.
  • Empirical ablation revealing critical dependency of camera-controllable training on accurate ground-truth camera trajectories vs noisy estimated poses.

Datasets

  • DL3DV — size unspecified — 3D reconstructed scenes with rendered camera trajectories
  • OpenVid — large-scale image dataset — combined with WorldPlay-generated camera trajectories
  • SpatialVid — large-scale video dataset with estimated camera poses — public, but found insufficient for controllable training under current setup

Baselines vs proposed

  • HY1.5 multi-step bidirectional: first-frame latency = 771.041s vs minWM few-step AR = 3.446s
  • Wan2.1 multi-step bidirectional: first-frame latency = 269.055s vs minWM few-step AR = 1.137s
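
The speedup factors quoted under Key findings follow directly from these latencies. The short check below recomputes them; the dictionary layout and function name are illustrative only.

```python
# First-frame latencies in seconds: (multi-step bidirectional, minWM few-step AR).
LATENCY_S = {
    "HY1.5": (771.041, 3.446),
    "Wan2.1": (269.055, 1.137),
}


def speedup(baseline_s, few_step_s):
    """Ratio of baseline latency to few-step AR latency."""
    return baseline_s / few_step_s


for name, (baseline_s, few_step_s) in LATENCY_S.items():
    print(f"{name}: {speedup(baseline_s, few_step_s):.2f}x")
```

Rounded to two decimals, the ratios match the reported 223.75× (HY1.5) and 236.64× (Wan2.1).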

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.30263.

Fig 1: Overview of minWM. minWM is a full-stack pipeline that converts T2V/TI2V foundation models

Fig 2: Camera-controllable generation with the distilled few-step AR model. The model supports generation

Fig 3: Effect of training data on camera-controllable generation. Under our current setup, directly training

Fig 4: Effect of training steps on camera-controllable generation. Using HY1.5 as an example, we observe

Fig 6 (page 7).

Fig 7 (page 7).

Fig 8 (page 7).

Limitations

  • Training success depends heavily on availability of ground-truth or high-quality camera trajectories; current pose estimation from SpatialVid is insufficient.
  • Latency measurements exclude VAE encoding/decoding time, which may be substantial in deployment.
  • Quality of the few-step AR model, while improved over multi-step AR, still does not fully match bidirectional diffusion quality without post-training DMD.
  • Reported experiments are limited to two open backbones (Wan2.1, HY1.5); generalization to other architectures remains untested.
  • No thorough adversarial or distribution-shift robustness evaluations are provided.
  • Large batch size requirements (≥16) and multiple staged training phases may restrict replicability for researchers with limited compute.

Open questions / follow-ons

  • How to extend minWM to incorporate more diverse control signals beyond camera parameters, such as pose, object interaction, or semantic actions?
  • Can the framework be adapted to reduce batch size and training resources further while maintaining strong controllability?
  • To what extent can minWM generalize to other video diffusion architectures beyond Wan2.1 and HY1.5, especially closed-source or proprietary models?
  • How robust is camera-controllable interactive video generation under domain shifts or noisy user inputs in real-world deployment?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners exploring video-based human interaction verification, minWM provides valuable insights and tools for real-time, controllable video generation using foundation diffusion models. The emphasis on causal, low-latency autoregressive generation conditioned on camera trajectories may inspire novel interactive CAPTCHA challenges that adapt dynamically to user behavior in the video domain. minWM's open-source modular framework and publicly released training recipes lower the barrier to creating tailored interactive video models that can respond to articulated user inputs at low latency, a core requirement for engaging bot-defense tasks involving video or embodied user signals. However, application in security-critical settings would require further evaluation of adversarial robustness and data provenance for reliable deployment.

Cite

bibtex
@article{arxiv2605_30263,
  title={minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models},
  author={Min Zhao and Hongzhou Zhu and Bokai Yan and Zihan Zhou and Yimin Chen and Wenqiang Sun and Kaiwen Zheng and Guande He and Xiao Yang and Chongxuan Li and Fan Bao and Jun Zhu},
  journal={arXiv preprint arXiv:2605.30263},
  year={2026},
  url={https://arxiv.org/abs/2605.30263}
}

Articles are CC BY 4.0 — feel free to quote with attribution