CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
Source: arXiv:2605.19995 · Published 2026-05-19 · By Hongji Yang, Songlian Li, Yucheng Zhou, Xiaotong Zhao, Alan Zhao, Chengzhong Xu et al.
TL;DR
CogOmniControl addresses controllable video generation under the sparse, abstract, or complex conditions typical of professional workflows, such as storyboard sketches and clay renders. Existing methods struggle to interpret such abstract inputs and to align generated videos with the user's creative intent, often producing semantic misalignment, poor temporal dynamics, and artifacts. CogOmniControl factorizes the problem into two stages: creative intent cognition with a specialized Vision-Language Model (CogVLM) and video synthesis with a unified diffusion transformer (CogOmniDiT). CogVLM is obtained by fine-tuning a base VLM on authentic anime production data and then applying reinforcement learning, which lets it interpret multimodal inputs more precisely and infer dense reasoning outputs that guide video synthesis.
The framework introduces a closed-loop architecture in which CogVLM not only drives generation but also predicts appropriate evaluators, which select the best video through a Best-of-N process and thereby improve quality and alignment. Two new professional benchmarks, CogReasonBench and CogControlBench, derived from real-world animation production data, are used to validate the model's ability to understand creative intent and generate faithful videos. Empirical results show that CogOmniControl outperforms existing open-source models on multiple metrics and narrows the gap with leading proprietary systems.
Key findings
- CogVLM fine-tuned with supervised and reinforcement learning improves multimodal intent understanding, raising average reasoning scores from ~3.7 (base VLM) to 4.47 on CogReasonBench (Table 2).
- CogOmniControl achieves an overall VLM-as-a-Judge score of 0.727 on CogControlBench, surpassing open-source competitors like VINO (0.686) and VACE-Wan2.1 (0.665), and approaching proprietary Seedance2.0 (0.750) (Table 3).
- Best-of-N selection (N=4), guided by an adaptive evaluator harness predicted by CogVLM, further raises the score to 0.742, indicating that closed-loop verification is more effective than fixed evaluators.
- Ablation studies reveal reinforcement fine-tuning of CogVLM and CogOmniDiT yields incremental improvements in multimodal intent alignment and video quality metrics compared to supervised fine-tuning alone (Table 4).
- Qualitative comparisons show that CogOmniControl better preserves identity consistency and motion naturalness, and avoids ghosting and other artifacts under abstract conditions such as clay renders, unlike adapter-based or generic VLM-guided baselines.
- CogOmniControl handles heterogeneous input conditions including pose, depth, line art, storyboard sketches, and clay renders within a unified sequence modeling framework (Eq. 5).
- The holistic reward in CogVLM training incorporates creative intent, physical plausibility, information integrity, and motion description, quantitatively guiding reasoning improvements (Eq. 3).
- CogVLM's ability to adapt evaluator selection per input condition reflects deeper understanding of task-specific quality criteria, improving Best-of-N video selection.
Methodology — deep read
Threat Model & Assumptions: No explicit adversary is defined, since the work targets robust creative video generation rather than security. The main challenge is the model's capacity to infer creative intent correctly from sparse, abstract, or multimodal conditions (storyboard sketches, clay renders, reference images, textual descriptions) and to align the output with it. The system assumes that input conditions may contain semantic conflicts or discrepancies that need to be reconciled.
Data: Training data includes a newly curated dataset from authentic professional anime production workflows, comprising aligned storyboard clips, clay render videos, and their corresponding final videos. The authors also use general video generation data from community sources and VACE-Bench. The CogReasonBench and CogControlBench benchmarks contain about 200 high-resolution (720p) samples annotated for semantic alignment; because they carry real creative intent rather than synthetic data, they support rigorous evaluation of intent cognition and controllable generation.
Architecture / Algorithm:
- CogVLM: A fine-tuned Vision-Language Model based on Qwen3-VL-8B-Thinking. It processes multimodal inputs (control video, reference image, textual prompt) and performs visual reasoning to infer the dense creative intent that guides generation. Training involves supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) using holistic and accuracy rewards targeting intent, physics, integrity and motion.
- CogOmniDiT: A video diffusion transformer that generates the video by concatenating noisy latent inputs, latents of reference images and control videos, plus the semantic embeddings from CogVLM (Eq. 5). It fuses diverse conditions via in-context learning in self-attention layers to preserve multifaceted control.
- Evaluator Harness: CogVLM additionally predicts a set of evaluators tailored to the input conditions to score candidate videos during a Best-of-N selection at inference, closing the loop between reasoning, generation, and verification.
- Training Regime:
  - CogVLM is supervised fine-tuned with LoRA for 3 epochs at a learning rate of 1e-5, followed by 500 steps of reinforcement fine-tuning using advantage-weighted policy optimization driven by composite rewards.
  - CogOmniDiT undergoes three-stage LoRA training: initial LoRA on the diffusion backbone, then the addition of a trainable connector module aligned to the frozen CogVLM, followed by joint fine-tuning.
  - Hardware: 32 NVIDIA H20 GPUs.
- Evaluation Protocol:
  - Quantitative VLM-as-a-Judge scores, with Gemini3.1-Pro as the judge, assess multimodal intent alignment and video quality (aesthetics, image quality, temporal flickering, motion smoothness, identity consistency).
  - Models are benchmarked on CogReasonBench (VLM reasoning capabilities) and CogControlBench (video controllability and quality).
  - Comparisons are made against established open-source and proprietary baselines such as VINO, VACE, OmniWeaving, and Seedance2.0.
  - Ablations test the effects of supervised vs. reinforcement fine-tuning on both CogVLM and CogOmniDiT.
- Reproducibility:
  - The authors provide a project website with documentation.
  - The paper does not make clear whether code or pretrained weights will be released.
  - The CogControlBench and CogReasonBench datasets are newly introduced and may be proprietary or only partially open.
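The unified sequence construction described for CogOmniDiT (Eq. 5) can be pictured as a concatenation along the token axis. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_unified_sequence`, the segment labels, the concatenation order, and the toy dimensions are all hypothetical. The summary above only states that noisy latents, reference-image latents, control-video latents, and CogVLM semantic embeddings are concatenated and fused by self-attention.

```python
from typing import Dict, List, Tuple

Token = List[float]  # one token embedding (toy stand-in for a latent patch)


def build_unified_sequence(
    streams: Dict[str, List[Token]],
    order: Tuple[str, ...] = ("noisy", "reference", "control", "semantic"),
) -> Tuple[List[Token], List[str]]:
    """Concatenate condition streams along the token axis.

    Returns the joint token list plus a parallel list of segment labels.
    The order and the labels are assumptions made for illustration only.
    """
    tokens: List[Token] = []
    segments: List[str] = []
    width = None
    for name in order:
        for tok in streams.get(name, []):  # a missing condition is skipped
            if width is None:
                width = len(tok)
            elif len(tok) != width:
                raise ValueError(
                    f"stream {name!r} has width {len(tok)}, expected {width}"
                )
            tokens.append(tok)
            segments.append(name)
    return tokens, segments


def toy_stream(n_tokens: int, dim: int, fill: float) -> List[Token]:
    """Build a dummy stream of identical tokens for demonstration."""
    return [[fill] * dim for _ in range(n_tokens)]


streams = {
    "noisy": toy_stream(6, 4, 0.0),      # latents being denoised
    "reference": toy_stream(2, 4, 1.0),  # reference-image latents
    "control": toy_stream(6, 4, 2.0),    # control-video latents
    "semantic": toy_stream(3, 4, 3.0),   # reasoner embeddings after a connector
}
sequence, segment_ids = build_unified_sequence(streams)
print(len(sequence), segment_ids[:7])
```

Because every condition shares one sequence, a standard self-attention layer can attend across all of them without a separate adapter per condition type, which is the design property the summary attributes to in-context fusion.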
Example end-to-end: Given an abstract clay render video of a character flying through a forest, a low-detail storyboard, and a reference image of the protagonist, CogVLM interprets these multimodal cues to infer detailed creative intent (e.g., clothing fluttering, lighting changes). It outputs reasoning text and selects evaluators for the Best-of-N video generation. CogOmniDiT receives noisy latents plus the CogVLM embeddings and synthesizes candidate videos aligned to the inferred intent. The evaluator harness ranks these candidates, and the top-ranked video becomes the final output, chosen for preserving identity and smooth motion while matching the abstract conditions.
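The selection step in this walkthrough can be sketched as a small Best-of-N loop. This is an illustrative sketch only: the evaluator names, the `predict_evaluators` rule, the toy "video" dictionaries, and the averaging of evaluator scores are assumptions standing in for CogVLM's evaluator prediction and the real scoring models, which the summary does not specify.

```python
import random
from statistics import mean
from typing import Callable, Dict, List

Video = Dict[str, float]  # toy stand-in: a "video" is a dict of quality attributes
Evaluator = Callable[[Video], float]

# Hypothetical evaluator registry; the names are illustrative only.
EVALUATORS: Dict[str, Evaluator] = {
    "identity_consistency": lambda v: v["identity"],
    "motion_smoothness": lambda v: v["motion"],
    "temporal_flicker": lambda v: 1.0 - v["flicker"],  # less flicker is better
}


def predict_evaluators(condition_type: str) -> List[str]:
    """Stand-in for the reasoner choosing evaluators per input condition."""
    if condition_type == "clay_render":
        return ["identity_consistency", "motion_smoothness"]
    return ["motion_smoothness", "temporal_flicker"]


def generate_candidate(rng: random.Random) -> Video:
    """Stand-in for one sampling run of the video generator."""
    return {"identity": rng.random(), "motion": rng.random(), "flicker": rng.random()}


def best_of_n(condition_type: str, n: int = 4, seed: int = 0) -> Video:
    """Sample n candidates and keep the one with the highest mean score
    over the evaluators predicted for this condition type."""
    rng = random.Random(seed)
    chosen = predict_evaluators(condition_type)
    candidates = [generate_candidate(rng) for _ in range(n)]

    def score(video: Video) -> float:
        return mean([EVALUATORS[name](video) for name in chosen])

    return max(candidates, key=score)


best = best_of_n("clay_render", n=4, seed=0)
print(sorted(best.items()))
```

The point of the adaptive harness is visible in `predict_evaluators`: a different input condition changes which criteria decide the winner, so the same four candidates may rank differently for a clay render than for, say, a pose sequence.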
Technical innovations
- Specialized CogVLM, trained via supervised and reinforcement fine-tuning on authentic anime production data, that infers abstract and sparse creative intent from multimodal inputs more accurately than generic VLMs.
- Unified video diffusion transformer (CogOmniDiT) that concatenates noisy latents, control video latents, reference image latents, and semantic VLM embeddings, so that heterogeneous conditions are fused in self-attention for precise controllable video generation.
- Closed-loop reasoning-generation-verification framework where CogVLM predicts per-input adaptive evaluator sets used in Best-of-N sampling to optimize final video quality and alignment.
- Holistic reward design combining creative intent, physical plausibility, information integrity, and motion description metrics for reinforcement fine-tuning of the vision-language reasoner (CogVLM), transforming subjective evaluation into verifiable accuracy.
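To make the holistic reward concrete, the sketch below combines the four named components into one scalar. The exact form of Eq. 3 is not reproduced in this summary, so the weighted average, the equal default weights, the [0, 1] score range, and the names `RubricScores` and `holistic_reward` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

COMPONENTS = (
    "creative_intent",
    "physical_plausibility",
    "information_integrity",
    "motion_description",
)


@dataclass(frozen=True)
class RubricScores:
    """Per-component scores for one reasoning output, each assumed in [0, 1]."""
    creative_intent: float
    physical_plausibility: float
    information_integrity: float
    motion_description: float


def holistic_reward(
    scores: RubricScores, weights: Optional[Dict[str, float]] = None
) -> float:
    """Weighted average of the four component scores.

    Equal weights are an assumption, not a value taken from the paper.
    """
    weights = weights or {name: 1.0 for name in COMPONENTS}
    total_weight = sum(weights[name] for name in COMPONENTS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    for name in COMPONENTS:
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    weighted = sum(weights[name] * getattr(scores, name) for name in COMPONENTS)
    return weighted / total_weight


example = RubricScores(
    creative_intent=0.8,
    physical_plausibility=0.6,
    information_integrity=1.0,
    motion_description=0.4,
)
print(round(holistic_reward(example), 3))
```

Keeping the components separate until the final reduction makes it easy to log which criterion drove a low reward, which is useful when a reinforcement fine-tuning run stalls.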
Datasets
- CogControlBench — 200 high-resolution (720p) samples — proprietary professional anime production workflow data
- CogReasonBench — professional storyboards, clay render videos, and final videos — proprietary curated from in-house anime pipeline
- VACE-Bench — unspecified size — public/community
- General controllable generation data — various sources from community collections
Baselines vs proposed
- VINO: overall score = 0.686 vs CogOmniControl: 0.727
- VACE-Wan2.1: overall score = 0.665 vs CogOmniControl: 0.727
- Seedance2.0 (proprietary): overall score = 0.750 vs CogOmniControl: 0.727
- CogOmniControl (Best-of-N with adaptive evaluator): 0.742 vs CogOmniControl (single sample): 0.727
- Qwen3-VL-8B-Thinking + CogOmniDiT(SFT): multimodal intent metric = 3.142 vs CogVLM(RFT) + CogOmniDiT(RFT): 3.588 (ablation)
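The overall scores above are easier to compare as relative gains. The short script below works through that arithmetic using only the numbers listed in this section; the helper names `relative_gain` and `gap_closed` are introduced here for illustration.

```python
# Overall VLM-as-a-Judge scores on CogControlBench, as listed above.
SCORES = {
    "VACE-Wan2.1": 0.665,
    "VINO": 0.686,
    "CogOmniControl": 0.727,
    "CogOmniControl (Best-of-N)": 0.742,
    "Seedance2.0": 0.750,
}


def relative_gain(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old`."""
    return (new - old) / old


def gap_closed(baseline: float, improved: float, target: float) -> float:
    """Share of the baseline-to-target gap recovered by `improved`."""
    return (improved - baseline) / (target - baseline)


single = SCORES["CogOmniControl"]
for name in ("VINO", "VACE-Wan2.1"):
    print(f"gain over {name}: {relative_gain(single, SCORES[name]):+.1%}")

closed = gap_closed(
    single, SCORES["CogOmniControl (Best-of-N)"], SCORES["Seedance2.0"]
)
print(f"gap to Seedance2.0 closed by Best-of-N: {closed:.0%}")
```

In relative terms the single-sample model is roughly 6% above VINO and roughly 9% above VACE-Wan2.1, and Best-of-N with N=4 recovers about 65% of the remaining gap to Seedance2.0 (0.015 of 0.023), at the cost of generating four candidates per input.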
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.19995.

Fig 1: The motivation of our CogOmniControl. The adapter-based methods and video generation

Fig 2: The overall framework of the proposed CogOmniControl. During inference, CogVLM

Figs 3-8 (page 2).
Limitations
- The new benchmarks, CogControlBench and CogReasonBench, are relatively small (~200 samples), which limits how far the results can be generalized.
- Code release and dataset public availability are not clearly stated, potentially hindering reproducibility and external validation.
- No adversarial robustness evaluation or stress testing under noisy or conflicting conditions beyond professional workflows was presented.
- Evaluation relies heavily on LLM-based VLM-as-a-Judge metrics, which may inherit biases or limitations of the evaluator models.
- The framework, while powerful, requires substantial compute (the reported setup uses 32 NVIDIA H20 GPUs), which may limit accessibility.
- No explicit user study or human preference evaluation for subjective creative alignment was reported.
Open questions / follow-ons
- How well does CogOmniControl generalize to domains beyond anime-style professional workflows, such as real-world or diverse video content?
- Can the evaluator harness framework be extended to incorporate human feedback dynamically during inference to further improve video selection?
- What is the robustness of CogVLM reasoning and video generation under adversarial or highly noisy abstract control inputs?
- How scalable is the closed-loop Best-of-N approach in terms of computational cost versus generation quality improvement?
Why it matters for bot defense
From a bot-defense and CAPTCHA perspective, CogOmniControl illustrates how a domain-specialized reasoning model (CogVLM), trained on authentic professional intent data, can improve the alignment between abstract inputs and high-dimensional outputs, in this case videos. Factorizing control into a cognition stage and a generation stage could inspire more robust CAPTCHA generation that interprets or verifies user intent embedded in complex multimodal challenges.
The closed-loop evaluator harness, which uses reasoning to select quality measures and the best output adaptively, could also be adapted to CAPTCHA systems that generate dynamic challenges scored with contextual, task-specific criteria, improving resistance to automated solvers while maintaining human usability. The emphasis on sparse, abstract, and multimodal conditions maps well onto designing CAPTCHA puzzles that resist straightforward spoofing or simple pattern recognition by bots.
Cite
@article{arxiv2605_19995,
title={CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition},
author={Hongji Yang and Songlian Li and Yucheng Zhou and Xiaotong Zhao and Alan Zhao and Chengzhong Xu and Jianbing Shen},
journal={arXiv preprint arXiv:2605.19995},
year={2026},
url={https://arxiv.org/abs/2605.19995}
}