OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

Source: arXiv:2605.12480 · Published 2026-05-12 · By Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu, Siming Fu et al.

TL;DR

OmniNFT tackles a practical failure mode in joint audio-video generation: when you try to post-train a dual-stream diffusion model with reinforcement learning, a single scalar reward signal can improve one modality while harming the other, or improve both but damage synchronization. The paper argues that vanilla RLVR/GRPO-style fine-tuning is mismatched to this setting because the audio and video branches do not share the same optimization geometry. In particular, the authors identify three bottlenecks: inconsistent advantages across modalities, gradient leakage from video supervision into shallow audio layers, and uniform credit assignment that fails to focus learning on synchronization-critical regions.

The proposed fix is OmniNFT, a modality-aware online diffusion RL framework built around three mechanisms: routing separate advantages to the audio/video branches, surgically detaching some video-to-audio gradients in shallow layers, and reweighting the video loss using V2A cross-attention maps as a proxy for critical regions. On JavisBench and VBench with LTX-2, OmniNFT improves the overall balance of perceptual quality, cross-modal consistency, and AV synchronization. The strongest quantitative gain reported is a large reduction in DeSync on JavisBench from 0.569 to 0.269, while also raising visual quality from 2.038 to 3.326 and audio quality from 5.197 to 5.715.

Key findings

  • On JavisBench, LTX-2+OmniNFT improves visual quality from 2.038 to 3.326 (+63.2%) and audio quality from 5.197 to 5.715 (+10.0%) relative to the pretrained LTX-2 baseline.
  • OmniNFT reduces DeSync from 0.569 to 0.269 on JavisBench, a 52.7% decrease, and outperforms LTX-2+GDPO at 0.412.
  • Compared with LTX-2+GDPO, OmniNFT raises CLAP from 0.428 to 0.445 and AVHScore from 0.223 to 0.257 on JavisBench.
  • Table 3 shows modality-wise advantage routing alone lowers DeSync from 0.412 to 0.322 versus vanilla RL, while improving AV-IB from 0.233 to 0.248 and AVHScore from 0.223 to 0.240.
  • Table 3 shows layer-wise gradient surgery improves audio quality from 5.399 to 5.917 over the modality-routing-only variant, while also increasing JavisScore from 0.199 to 0.209.
  • The ablation on gradient-surgery placement (Table 2) favors shallow-layer over deep-layer detachment: VQ 3.326 vs 3.083, AQ 5.715 vs 5.577, and JavisScore 0.220 vs 0.204.
  • The region-weighting coefficient has a non-monotonic effect: λ=1.50 is the reported default and best trade-off, while λ=1.25 underperforms it (VQ 3.150 vs 3.326; AQ 5.495 vs 5.715) and λ=1.75 slightly hurts VQ (2.977) despite comparable AQ (5.714).
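
The percentage figures quoted in the findings above can be re-derived from the raw scores. The short check below is only an arithmetic sanity pass over the numbers as they appear in this summary; the helper name `relative_change` is illustrative and not from the paper.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change of `after` with respect to `before`, in percent."""
    return 100.0 * (after - before) / before


# Scores as quoted in this summary: LTX-2 vs LTX-2+OmniNFT on JavisBench.
reported = {
    "VQ": (2.038, 3.326),      # higher is better
    "AQ": (5.197, 5.715),      # higher is better
    "DeSync": (0.569, 0.269),  # lower is better
}

changes = {
    name: round(relative_change(before, after), 1)
    for name, (before, after) in reported.items()
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

The signs follow the direction of the raw score, so the DeSync improvement shows up as a negative change.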

Threat model

The adversary here is the optimization mismatch induced by multi-objective joint generation under RL fine-tuning, not a malicious external actor. The setting assumes a pretrained dual-stream audio-video generator, access to reward functions for video quality, audio quality, text alignment, and synchronization, and online sampling of grouped rollouts from the current policy. It does not assume the ability to inspect or alter user prompts beyond the training loop, and it does not consider explicit adversarial attacks on the reward models or benchmark leakage. The core constraint is that a single scalar reward cannot be relied on to supervise both modalities without inducing conflicting updates.

Methodology — deep read

The paper frames joint audio-video generation as a multi-objective RL problem under a dual-stream diffusion backbone. The adversary is not an external attacker but the optimization process itself: the model must satisfy per-modality fidelity, text alignment, cross-modal semantic consistency, and AV synchronization at the same time. The authors’ key assumption is that a pretrained dual-stream policy already exists and can be improved online by sampling groups of joint outputs, scoring them with reward models, and then updating the same model with a diffusion-RL objective. They specifically study an LTX-2 backbone and compare against vanilla RLVR-style fine-tuning and GDPO.

Data-wise, the paper’s empirical diagnosis of advantage inconsistency uses 1,400 generated samples from 175 prompts with group size 8. Those samples are used to compute separate audio and video advantages and to show that nearly half of the group members receive conflicting signs across modalities. For training and evaluation, the paper uses JavisBench and VBench. The excerpt does not specify the size of JavisBench, the exact train/test splits, or whether the prompts used for RL fine-tuning overlap with the evaluation set. The reward models are specified at a high level: VideoAlign and HPSv3 for video quality, Audiobox Aesthetics for audio quality, CLAP for audio-text alignment, and DeSync for audio-visual synchronization. The main evaluation metrics on JavisBench are VQ, AQ, TV-IB, TA-IB, CLIP, CLAP, AV-IB, AVHScore, JavisScore, and DeSync; VBench is used as an additional video-quality benchmark. The paper does not report train/val/test splits for the reward models, because they are used as frozen scorers rather than learned components in OmniNFT.
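
To make the advantage-inconsistency diagnostic concrete, here is a minimal sketch of how a sign-conflict rate could be measured over grouped rollouts. It assumes GRPO-style per-group standardization, and the rewards are synthetic random numbers, not the paper's data; the function names are hypothetical.

```python
import numpy as np


def group_normalized_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize rewards within each prompt's group.

    rewards has shape (num_prompts, group_size).
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def sign_conflict_rate(adv_video: np.ndarray, adv_audio: np.ndarray) -> float:
    """Fraction of samples whose video and audio advantages have opposite signs."""
    return float(np.mean(np.sign(adv_video) * np.sign(adv_audio) < 0))


# Same layout as the diagnostic in the text: 175 prompts x 8 samples = 1,400 samples.
rng = np.random.default_rng(0)
num_prompts, group_size = 175, 8

# ASSUMPTION: synthetic, statistically independent rewards stand in for real scorers.
r_video = rng.normal(size=(num_prompts, group_size))
r_audio = rng.normal(size=(num_prompts, group_size))

a_v = group_normalized_advantages(r_video)
a_a = group_normalized_advantages(r_audio)
rate = sign_conflict_rate(a_v, a_a)

print(f"samples: {a_v.size}, sign-conflict rate: {rate:.3f}")
```

Because the toy rewards are independent by construction, the rate here lands near one half; on real rollouts the value depends on how correlated the reward models are.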

Architecturally, OmniNFT is an online post-training wrapper around a dual-stream joint audio-video flow-matching model. In the underlying model, audio latent x_a and video latent x_v are denoised in parallel under a shared time schedule, with cross-modal attention layers mediating interaction. OmniNFT adds three mechanisms. First, modality-wise advantage routing computes separate advantages A_v, A_a, and A_av for video quality, audio quality, and AV synchronization. The routed advantage for the video branch is A_v + A_av, while the audio branch receives A_a + A_av. This is intended to avoid collapsing heterogeneous rewards into one scalar. Second, layer-wise gradient surgery acts only on A2V cross-attention. For Transformer blocks l below a shallow boundary L, the audio key/value tensors are partially detached: K̃_a = α_s·sg(K_a) + (1-α_s)·K_a, and similarly for V_a, where sg(·) denotes stop-gradient, with default L=10 and α_s=0.1. This leaves the forward pass unchanged but scales the gradient backpropagated through the audio KV path in shallow layers by (1-α_s); the authors claim these layers are primarily responsible for intra-modal audio generation. Third, region-wise loss reweighting uses V2A attention maps from the deep cross-modal blocks and late denoising steps as an intrinsic mask for critical regions. The token-wise score is computed by averaging attention over deep blocks and late steps, normalized, and turned into a weight w_i = 1 + λ·normalized_score; the final video loss becomes a weighted sum over tokens, while the audio loss remains unchanged.
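
The routing rule and the region weights described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the min-max normalization of the attention score and the division of the loss by the total weight are choices made here for the example, and all function names are hypothetical.

```python
import numpy as np


def route_advantages(a_v, a_a, a_av):
    """Video branch receives A_v + A_av; audio branch receives A_a + A_av."""
    return a_v + a_av, a_a + a_av


def region_weights(attn: np.ndarray, lam: float = 1.5, eps: float = 1e-8) -> np.ndarray:
    """Per-video-token weights w_i = 1 + lam * normalized_score_i.

    attn has shape (blocks, steps, video_tokens) and holds the V2A attention
    mass per video token for the selected deep blocks and late steps.
    ASSUMPTION: the score is min-max normalized to [0, 1].
    """
    score = attn.mean(axis=(0, 1))
    score = (score - score.min()) / (score.max() - score.min() + eps)
    return 1.0 + lam * score


def weighted_video_loss(per_token_loss: np.ndarray, weights: np.ndarray) -> float:
    """Weighted sum over tokens.

    ASSUMPTION: divided by the total weight so the scale matches a plain mean.
    """
    return float(np.sum(weights * per_token_loss) / np.sum(weights))


# Toy example: two samples, two deep blocks, one late step, four video tokens.
video_adv, audio_adv = route_advantages(
    np.array([1.0, -1.0]), np.array([-1.0, 1.0]), np.array([0.5, 0.5])
)
attn = np.array([[[0.1, 0.4, 0.1, 0.4]],
                 [[0.3, 0.2, 0.1, 0.4]]])
weights = region_weights(attn, lam=1.5)
loss = weighted_video_loss(np.array([1.0, 1.0, 1.0, 1.0]), weights)

print(video_adv, audio_adv)
print(weights, loss)
```

In the toy rollout the two samples disagree across modalities, so each branch sees a different sign after routing, which a single summed advantage would hide.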

Training follows an online loop summarized in Algorithm 1. For each prompt c, the model samples G joint outputs, evaluates the three rewards R_v, R_a, and R_av, computes separate group-normalized advantages, routes them to the modality branches, converts the routed advantages to optimality probabilities using a DiffusionNFT-style transformation, caches V2A attention maps, and stores the results in a buffer. During the update stage, the model recomputes the forward/backward pass with gradient surgery enabled, applies the region-weighted video loss plus the standard audio loss, and updates the online policy; the old policy is then moved toward the current policy by an EMA-like update with factor η_i. A concrete end-to-end example in the paper is the SpongeBob/Patrick prompt: the model samples joint video and audio for the scene, the reward models score visual quality, audio aesthetics, and AV sync, the V2A attention map highlights speaking regions, and the update emphasizes those regions while protecting shallow audio layers from being overwritten by video gradients.
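
The last step of the loop, moving the old policy toward the current one, can be illustrated with a plain exponential moving average over parameters. The summary only says the update is "EMA-like" with factor η_i, so the exact form below (η weights the old parameters) and the fixed value of η are assumptions made for this sketch.

```python
import numpy as np


def ema_update(old_params: dict, online_params: dict, eta: float) -> dict:
    """Return eta * old + (1 - eta) * online for every named parameter.

    ASSUMPTION: eta weights the old policy; the paper's convention may differ.
    """
    return {
        name: eta * old_params[name] + (1.0 - eta) * online_params[name]
        for name in old_params
    }


# Toy parameters: the old policy starts at 0, the online policy sits at 1.
old = {"w": np.zeros(3)}
online = {"w": np.ones(3)}

history = []
for _ in range(2):
    old = ema_update(old, online, eta=0.9)
    history.append(old["w"].copy())

print(history)
```

With η close to 1 the old policy trails the online policy slowly, which keeps the sampling distribution from shifting abruptly between iterations.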

Evaluation is mostly comparative and ablation-based. The main table compares LTX-2, LTX-2+GDPO, and LTX-2+OmniNFT on JavisBench, plus earlier joint-generation baselines such as JavisDiT, UniVerse-1, and JavisDiT++. OmniNFT is reported as best overall on perceptual quality, cross-modal consistency, and synchronization, though not every metric moves in the same direction: TV-IB and CLIP do not improve versus baseline, and the paper explicitly notes that text-video semantic alignment remains challenging. Ablation Table 3 incrementally adds each OmniNFT component atop vanilla RL and reports efficiency; the full method adds only negligible overhead, with training time increasing from 23.9h to 24.1h. Table 2 probes two hyperparameters: the layer boundary for gradient surgery and the region-weighting coefficient λ. The authors also provide qualitative examples in Fig. 7 and gradient/attention diagnostics in Fig. 2 and Fig. 4 to justify the design choices. Reproducibility is limited by the fact that the excerpt does not mention code release, frozen checkpoints, or public availability of the training/evaluation prompts beyond the benchmark names and project page.

Technical innovations

  • Modality-wise advantage routing separates video, audio, and synchronization rewards instead of collapsing them into one global advantage, unlike vanilla GRPO-style RLVR.
  • Layer-wise gradient surgery partially detaches the audio key/value tensors in shallow A2V cross-attention blocks, attenuating video-loss gradients that leak into the audio path there while preserving deep cross-modal interaction.
  • Region-wise loss reweighting uses deep V2A cross-attention maps from late denoising steps as an intrinsic mask for synchronization-critical visual regions.
  • The method extends DiffusionNFT-style online diffusion RL to a joint audio-video setting with modality-specific credit assignment and a dual-branch objective.
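
The partial-detach rule behind the gradient surgery can be checked numerically without an autodiff library. In the sketch below a frozen copy of K plays the role of sg(K), and finite differences are taken with respect to K only; the toy loss and all names are hypothetical, while α_s = 0.1 and the boundary of 10 are the defaults quoted in this summary.

```python
import numpy as np

ALPHA_S = 0.1          # detach coefficient quoted in the summary
SHALLOW_BOUNDARY = 10  # blocks with index below this receive surgery


def mix_kv(k, k_frozen, layer, alpha=ALPHA_S, boundary=SHALLOW_BOUNDARY):
    """K_tilde = alpha * sg(K) + (1 - alpha) * K for shallow layers, else K.

    k_frozen stands in for sg(K): equal in value to K but treated as a constant.
    """
    if layer < boundary:
        return alpha * k_frozen + (1.0 - alpha) * k
    return k


def toy_loss(q, k_tilde):
    """Hypothetical scalar loss: sum of all query-key dot products."""
    return float(np.sum(q @ k_tilde.T))


def grad_wrt_k(q, k, layer, h=1e-6):
    """Central finite differences w.r.t. K; the frozen copy is never perturbed."""
    k_frozen = k.copy()
    grad = np.zeros_like(k)
    for idx in np.ndindex(k.shape):
        step = np.zeros_like(k)
        step[idx] = h
        up = toy_loss(q, mix_kv(k + step, k_frozen, layer))
        down = toy_loss(q, mix_kv(k - step, k_frozen, layer))
        grad[idx] = (up - down) / (2 * h)
    return grad


rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))  # stand-in for video queries
k = rng.normal(size=(5, 4))  # stand-in for audio keys

forward_unchanged = np.allclose(mix_kv(k, k.copy(), layer=2), k)
g_shallow = grad_wrt_k(q, k, layer=2)
g_deep = grad_wrt_k(q, k, layer=20)

print(forward_unchanged, np.allclose(g_shallow, (1 - ALPHA_S) * g_deep, atol=1e-4))
```

The forward value is identical in both cases, and the shallow-layer gradient is the deep-layer gradient scaled by (1 - α_s), which is the behavior the formula implies.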

Datasets

  • JavisBench — size not specified in excerpt — benchmark dataset used for joint audio-video generation evaluation
  • VBench — size not specified in excerpt — public benchmark for video quality evaluation
  • 1,400 generated samples over 175 prompts, group size 8 — author-generated analysis set used for advantage-conflict study

Baselines vs proposed

  • LTX-2: VQ = 2.038 vs proposed: 3.326
  • LTX-2: AQ = 5.197 vs proposed: 5.715
  • LTX-2: DeSync = 0.569 vs proposed: 0.269
  • LTX-2+GDPO: DeSync = 0.412 vs proposed: 0.269
  • LTX-2+GDPO: CLAP = 0.428 vs proposed: 0.445
  • LTX-2+GDPO: AVHScore = 0.223 vs proposed: 0.257
  • LTX-2+vanilla RL: DeSync = 0.412 vs proposed full method: 0.269
  • LTX-2+vanilla RL: JavisScore = 0.185 vs proposed full method: 0.220
  • LTX-2+vanilla RL: VQ = 3.209 vs proposed full method: 3.326
  • LTX-2+vanilla RL: AQ = 5.523 vs proposed full method: 5.715

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.12480.

Fig 1: OmniNFT consistently improves the performance of LTX-2 in audio and visual quality,

Fig 2: Advantage inconsistency and asymmetric audio-video interaction. (a) Video and audio

Limitations

  • The paper does not fully specify dataset splits, prompt sourcing, or whether any evaluation prompts overlap with RL fine-tuning prompts.
  • TV-IB and CLIP do not improve over baseline, and the authors explicitly note that text-video semantic alignment remains challenging.
  • The strongest evidence is on JavisBench and VBench with one backbone family (LTX-2); generalization to other architectures is not demonstrated in the excerpt.
  • The reward models are externally chosen and may encode their own biases; the paper does not report reward-model calibration or sensitivity analysis.
  • The attention-map-based region weighting is heuristic: it assumes deep V2A cross-attention reliably identifies critical regions, but no direct human-annotation validation is shown in the excerpt.

Open questions / follow-ons

  • Would the same modality-wise routing and gradient surgery help other dual-stream multimodal generators, such as speech-video or music-video models, or is it specific to LTX-2’s asymmetric architecture?
  • Can the V2A-attention-derived region mask be validated against human-labeled speaking/object regions, or replaced with a more principled grounding signal?
  • How stable is online OmniNFT training under different reward-model choices, group sizes, or prompt distributions, especially when rewards disagree more strongly than in the reported benchmarks?
  • Can text-video alignment be improved without degrading the synchronization gains, given that TV-IB and CLIP remained largely flat in the reported results?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, the paper is useful less as a generation method and more as a case study in multi-objective optimization under conflicting signals. The central lesson is that when several rewards or detectors supervise different parts of a system, collapsing them into one scalar can hide disagreement and create brittle updates. That maps well to abuse detection stacks where you combine visual, behavioral, and network signals: modality-specific routing or branch-specific losses may preserve useful signal better than a single composite score.

The region-reweighting idea is also relevant to CAPTCHA design and challenge telemetry. The authors use an internal attention map to identify synchronization-critical regions and then concentrate learning there; analogously, anti-bot systems can try to localize which challenge components actually discriminate humans from automation and weight them more heavily. At the same time, the paper is a reminder that heuristic weighting can overfit to internal artifacts: if a defense model’s “important region” proxy is wrong, the system can look stronger on benchmark metrics while becoming less robust in the real world. A bot-defense engineer would therefore likely read this as an argument for per-signal diagnostics, careful ablations, and held-out attacker evaluation rather than a single aggregate accuracy number.

Cite

bibtex
@article{arxiv2605_12480,
  title={OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation},
  author={Guohui Zhang and XiaoXiao Ma and Jie Huang and Hang Xu and Hu Yu and Siming Fu and Yuming Li and Zeyue Xue and Lin Song and Haoyang Huang and Nan Duan and Feng Zhao},
  journal={arXiv preprint arXiv:2605.12480},
  year={2026},
  url={https://arxiv.org/abs/2605.12480}
}

Articles are CC BY 4.0 — feel free to quote with attribution