AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference
Source: arXiv:2606.11186 · Published 2026-06-09 · By Hangfeng Liang, Yutao Hu, Yanhan Hu, Xiaohan Wu, Wenqi Shao, Ying Fu
TL;DR
This paper addresses low-light video enhancement (LLVE): restoring severely degraded videos captured under low illumination. Recent multimodal LLVE methods that incorporate auxiliary modalities such as event streams and infrared images have achieved superior enhancement quality. However, these approaches generally assume the auxiliary modalities are available at inference time, which is often unrealistic because of sensor cost, calibration, and synchronization issues. To overcome this, the authors propose AnyModNet (AMNet), a unified multimodal framework that enables modality-agnostic inference: auxiliary modalities may be used or absent at test time. The core innovation is the Spatial-Spectral Dual-Gated (S2DG) Translator, which learns to synthesize reliable implicit auxiliary-modality representations from low-light RGB inputs by leveraging spatial illumination cues and frequency-domain selection. The authors also perform large-scale multimodal pretraining on pseudo-multimodal data synthesized from RGB-only videos to improve cross-modal correspondence learning.
Extensive experiments on multiple LLVE benchmarks show that AMNet achieves state-of-the-art performance even with RGB-only inputs, outperforming prior RGB-only and multimodal methods. When auxiliary modalities are available, AMNet further improves results. The model is robust to arbitrary modality combinations at test time, showing minimal performance degradation when auxiliary inputs are missing. Ablations confirm the importance of the dual gating modules, and pretraining scale positively correlates with implicit modality feature quality and final enhancement metrics. This work demonstrates a practical path to improved LLVE by blending multimodal training with robust missing modality inference.
Key findings
- AMNet improves PSNR by 1.47dB and SSIM by 0.02 on the DID dataset compared to prior best RGB-only methods (Table 1).
- On RGB-only inference, AMNet outperforms multimodal baselines relying on event streams by 0.4-1dB in PSNR on the SDE dataset (Table 2).
- AMNet supports inference with any subset of auxiliary modalities (infrared, event streams) with minimal performance drop relative to full multimodal input.
- Large-scale multimodal pretraining with 100% synthetic auxiliary data yields up to +1.8dB PSNR gain and reduces L2 distance between synthetic and real auxiliary features (Table 4).
- The Spatial-Spectral Dual-Gated Translator's two components (Illumination-Aware Detail Selector and Frequency-Band Selector) both individually improve performance by 0.2-0.7dB and jointly by up to 1.7dB PSNR (Table 5).
- In zero-shot evaluation without fine-tuning, AMNet surpasses recent restoration foundation models by ~3-6dB PSNR on LLVE datasets (Table 3).
- AMNet achieves consistent RGB-only enhancement across diverse datasets covering indoor and outdoor scenes and event-based data.
- Implicit auxiliary features synthesized by S2DG preserve fine spatial details and global semantic context despite severe low-light degradation (visualized in Fig. 6).
Threat model
The paper has no security threat model. It implicitly assumes a deployment context in which auxiliary modalities (event streams, infrared images) may be unavailable or corrupted at inference time because of hardware or environmental constraints. The 'adversary' is therefore missing or incomplete sensor input, not a malicious party trying to subvert the enhancement. Because the system cannot rely on auxiliary sensors always being present, robustness to modality absence is essential. Attackers, attacks, and defenses are not considered.
Methodology — deep read
Threat Model & Assumptions: The deployment scenario is degraded low-light RGB video whose auxiliary modalities (event streams, infrared images) may be missing. Auxiliary modalities provide additional structural and temporal cues but cannot be assumed available at test time, so the system must handle arbitrary modality absence robustly.
Data: Training utilizes several public high-quality video datasets spanning diverse scenes and motions. From these, pseudo multimodal data are synthesized by generating event streams (via v2e) and infrared images (using ThermalGen) conditioned on paired RGB videos. Illumination is artificially degraded to low-light conditions (1%-50% brightness) to simulate challenging scenarios. Real LLVE datasets used for evaluation include DID (413 paired videos), SDSD (indoor and outdoor subsets), and SDE, an event-based multimodal LLVE dataset.
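The illumination degradation step can be sketched as follows. This is a minimal illustration only: the source gives the 1%-50% brightness range but not the degradation formula, so the linear scaling, the uniform sampling, and the names `degrade_brightness` and `sample_low_light_clip` are all assumptions made here.

```python
import random

def degrade_brightness(frame, factor):
    """Scale pixel intensities (assumed to lie in [0, 1]) by a brightness factor."""
    if not 0.01 <= factor <= 0.5:
        raise ValueError("factor is outside the 1%-50% range quoted in the text")
    return [[pixel * factor for pixel in row] for row in frame]

def sample_low_light_clip(clip, rng):
    """Draw one brightness factor per clip (an assumption) and darken every frame."""
    factor = rng.uniform(0.01, 0.5)
    return factor, [degrade_brightness(frame, factor) for frame in clip]

# Toy 8-frame clip of 2x2 grayscale frames standing in for real video.
rng = random.Random(0)
clip = [[[0.2, 0.8], [1.0, 0.5]] for _ in range(8)]
factor, dark_clip = sample_low_light_clip(clip, rng)
print(f"brightness factor {factor:.3f}, first frame {dark_clip[0]}")
```

A realistic pipeline would likely also model sensor noise and a non-linear camera response; those are omitted because the source does not describe them.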
Architecture: AMNet consists of an RGB encoder extracting multi-scale features from low-light input frames, modality encoders for each auxiliary type (event voxel grids and infrared images), a Spatial-Spectral Dual-Gated (S2DG) Translator that generates implicit auxiliary modality features from RGB features when auxiliary inputs are missing, a fusion module combining explicit and implicit features, a ConvLSTM for temporal modeling, and a decoder predicting a residual enhancement map. The S2DG Translator includes two key modules: an Illumination-Aware Detail Selector (IADS) that computes a spatial reliability mask from low-frequency illumination features to downweight noise-dominated regions, and a Frequency-Band Selector (FBS) operating in the frequency domain via FFT to gate and scale spectral components, emphasizing informative frequency bands. A residual connection fuses low-frequency RGB features to preserve global context.
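To make the two gates concrete, here is a rough NumPy sketch of the S2DG idea on a single 2D feature map. It is not the authors' implementation: the box-blur low/high split, the sigmoid mask, the radial band layout, and the fixed `band_gains` are all assumptions, and in the real model the gates are learned.

```python
import numpy as np

def box_blur(x, k=3):
    """Crude low-frequency estimate: k x k mean filter with edge padding."""
    pad = k // 2
    padded = np.pad(x, pad, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

def illumination_mask(low_freq, sharpness=10.0, threshold=0.1):
    """IADS-like reliability mask in (0, 1); brighter regions score higher."""
    return 1.0 / (1.0 + np.exp(-sharpness * (low_freq - threshold)))

def frequency_band_gate(high_freq, band_gains):
    """FBS-like gate: scale radial frequency bands of a 2D map via the FFT."""
    h, w = high_freq.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    edges = np.linspace(0.0, radius.max(), len(band_gains) + 1)
    band = np.digitize(radius, edges[1:-1])   # band index per frequency bin
    gains = np.asarray(band_gains, dtype=float)[band]
    return np.real(np.fft.ifft2(np.fft.fft2(high_freq) * gains))

def s2dg_sketch(rgb_feat, band_gains):
    """Mask high-frequency detail, gate it spectrally, add back the low band."""
    low = box_blur(rgb_feat)
    high = rgb_feat - low
    gated = frequency_band_gate(illumination_mask(low) * high, band_gains)
    return gated + low                        # residual low-frequency path

feat = np.random.default_rng(0).random((8, 8))
implicit = s2dg_sketch(feat, band_gains=[1.0, 0.5, 0.1])
print(implicit.shape)
```

With all gains set to 1 the spectral gate is an identity, and with all gains set to 0 only the low-frequency residual path survives, which is a quick sanity check on the structure.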
Training Regime: The model is pretrained on large-scale pseudo-multimodal data with varying illumination, then fine-tuned on the target datasets. Training clips are 8 frames long and cropped to 128×128 resolution, with augmentation. The optimizer is AdamW with an initial learning rate of 2e-4 and a cosine scheduler with 5 warm-up epochs. Pretraining uses multi-GPU distributed training; inference runs on a single GPU. During training, modality absence is simulated by masking auxiliary inputs randomly or fully, which makes the model robust to missing inputs. Losses include pixel-wise L1 and SSIM reconstruction losses on the enhanced frames, plus feature-level distillation losses that align the implicitly generated modality features with real modality features.
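The learning-rate schedule can be illustrated with a small standalone function. The base rate of 2e-4 and the 5 warm-up epochs come from the summary above; the linear warm-up shape, the total of 100 epochs, the zero floor, and the name `lr_at_epoch` are assumptions for illustration.

```python
import math

def lr_at_epoch(epoch, total_epochs, base_lr=2e-4, warmup_epochs=5, min_lr=0.0):
    """Linear warm-up followed by cosine decay (per-epoch granularity)."""
    if epoch < warmup_epochs:
        # Ramp from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical 100-epoch run; the source does not state the epoch count.
schedule = [lr_at_epoch(e, total_epochs=100) for e in range(100)]
for e in (0, 4, 5, 50, 99):
    print(f"epoch {e:3d}: lr = {schedule[e]:.2e}")
```

In a PyTorch training loop the same shape would normally come from a scheduler object rather than a hand-written function; the version here only shows the arithmetic.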
Evaluation Protocol: Metrics are PSNR and SSIM compared to normal-light ground truth videos. Experiments evaluate AMNet under varying inference modality combinations, including RGB-only, RGB plus event, infrared, or both auxiliary modalities. Baselines include prior RGB-only LLVE methods, recent multimodal LLVE models (requiring auxiliary input), and foundation models for image/video restoration evaluated zero-shot. Ablations analyze the contribution of S2DG components and pretraining scale. Performance under distribution shift and adversarial attack scenarios is not reported.
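For reference, PSNR against a ground-truth frame can be computed as below. This is the standard definition, PSNR = 10·log10(MAX² / MSE); averaging per-frame scores over a clip in `clip_psnr` is an assumption, since the summary does not say how the paper aggregates over frames. SSIM is omitted for brevity.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D frames."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

def clip_psnr(pred_frames, target_frames, max_val=1.0):
    """Mean of per-frame PSNR over a clip (aggregation choice is an assumption)."""
    scores = [psnr(p, t, max_val) for p, t in zip(pred_frames, target_frames)]
    return sum(scores) / len(scores)

# Every pixel is off by 0.1, so MSE is 0.01 and PSNR is about 20 dB.
target = [[0.5, 0.5], [0.5, 0.5]]
pred = [[0.6, 0.4], [0.6, 0.4]]
score = psnr(pred, target)
print(f"{score:.2f} dB")
```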
Reproducibility: The authors provide code and pretrained models on the project page. The synthesized pseudo multimodal data pipeline uses referenced external generators (v2e, ThermalGen). Some details of hyperparameters and dataset splits follow prior benchmarks. No closed datasets are used.
An example inference workflow: Given an 8-frame low-light RGB clip without auxiliary inputs, the RGB encoder extracts features and passes them to the S2DG Translator. The IADS module estimates spatial masks from low-frequency illumination features, highlighting reliable regions. The FBS then transforms high-frequency RGB features to the frequency domain, gates and scales frequency bands, and applies the inverse transform. The result, combined with low-frequency global features, forms the implicit auxiliary representations. These are then fused with RGB features and temporal context, and the decoder predicts residual maps used to reconstruct the enhanced video frames. The same pipeline runs even when auxiliary modalities are missing.
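The modality-agnostic part of this workflow amounts to a per-modality choice between an explicit encoder and the translator. The sketch below shows only that dispatch logic; `gather_aux_features`, the toy encoders, and the toy translator are hypothetical stand-ins, not the paper's modules.

```python
from typing import Callable, Dict, List, Optional, Tuple

Feature = List[float]

def gather_aux_features(
    rgb_feat: Feature,
    aux_inputs: Dict[str, Optional[Feature]],
    encoders: Dict[str, Callable[[Feature], Feature]],
    translator: Callable[[Feature, str], Feature],
) -> Tuple[Dict[str, Feature], Dict[str, str]]:
    """Use the real encoder when a modality is present, else synthesize it from RGB."""
    feats: Dict[str, Feature] = {}
    source: Dict[str, str] = {}
    for name, encode in encoders.items():
        raw = aux_inputs.get(name)
        if raw is not None:
            feats[name], source[name] = encode(raw), "explicit"
        else:
            feats[name], source[name] = translator(rgb_feat, name), "implicit"
    return feats, source

# Toy stand-ins for the learned modules.
encoders = {
    "event": lambda x: [2.0 * v for v in x],
    "infrared": lambda x: [v + 1.0 for v in x],
}
translator = lambda rgb, name: [0.5 * v for v in rgb]

rgb_feat = [0.2, 0.4]
feats, source = gather_aux_features(rgb_feat, {"event": [1.0, 3.0]}, encoders, translator)
print(source)
```

Here the event stream is supplied and infrared is not, so the event features come from the encoder and the infrared features come from the translator; downstream fusion sees the same set of keys either way.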
Technical innovations
- Introduction of a Spatial-Spectral Dual-Gated (S2DG) Translator that synthesizes implicit auxiliary modality features from severely degraded RGB inputs by combining spatial illumination-aware gating and frequency-domain selection.
- Unified multimodal LLVE framework (AMNet) supporting modality-agnostic inference — robust to arbitrary auxiliary modality absence at test time.
- Large-scale multimodal pretraining on auxiliary event and infrared modalities synthesized from RGB-only videos, to teach robust cross-modal correspondence.
- End-to-end training combining reconstruction losses under all modality availability combinations plus feature distillation to align implicit modality representations with real features.
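One way to read the last item is as a sum of reconstruction losses over every availability subset plus a weighted distillation term. The sketch below is an interpretation, not the paper's objective: it uses only L1 for reconstruction (SSIM is left out), a mean squared distance for distillation, and a hypothetical `distill_weight` of 0.1.

```python
from itertools import combinations

AUX_MODALITIES = ("event", "infrared")

def availability_subsets(modalities=AUX_MODALITIES):
    """Every subset of auxiliary modalities, from none (RGB-only) to all."""
    return [s for r in range(len(modalities) + 1) for s in combinations(modalities, r)]

def l1_loss(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def mean_squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def total_loss(outputs, target, implicit, real, distill_weight=0.1):
    """outputs: subset -> enhanced pixels; implicit/real: modality -> features."""
    recon = sum(l1_loss(outputs[s], target) for s in availability_subsets())
    distill = sum(mean_squared_distance(implicit[m], real[m]) for m in AUX_MODALITIES)
    return recon + distill_weight * distill

# Toy values: the same prediction for all four availability subsets.
target = [1.0, 1.0]
outputs = {s: [0.9, 1.0] for s in availability_subsets()}
implicit = {"event": [0.0, 0.0], "infrared": [1.0, 1.0]}
real = {"event": [0.0, 2.0], "infrared": [1.0, 1.0]}
loss = total_loss(outputs, target, implicit, real)
print(availability_subsets(), round(loss, 4))
```

With two auxiliary modalities there are four availability subsets, which matches the RGB-only, RGB plus event, RGB plus infrared, and full-input settings evaluated in the paper.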
Datasets
- DID — 41,038 paired low-/normal-light video frames — public LLVE benchmark
- SDSD (Indoor/Outdoor) — video pairs — public LLVE benchmark
- SDE — 145,888 frames with synchronized event streams — public multimodal LLVE dataset
- Multiple large-scale public video datasets (unnamed) used as pretraining sources
- Synthetic auxiliary modalities generated using v2e and ThermalGen tools
Baselines vs proposed
- STCD: DID PSNR=30.10dB vs AMNet: 31.57dB (RGB-only setting)
- RetinexFormer: DID PSNR=25.40dB vs AMNet: 31.57dB
- EvLight++ (R only): SDE-Indoor PSNR=23.04dB vs AMNet (R only): 23.22dB
- EvLight++ (R+E): SDE-Indoor PSNR=23.22dB vs AMNet (R+E): 23.25dB
- Zero-TIG zero-shot: DID PSNR=19.69dB vs AMNet zero-shot: 25.07dB
- L2 distance between synthetic and real auxiliary features decreases from 0.328 to 0.289 with full pretraining scale, corresponding to +1.79dB PSNR gain on DID
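The gaps implied by the numbers above can be checked with a few lines of arithmetic. Because PSNR = 10·log10(MAX² / MSE), a gain of g dB corresponds to multiplying MSE by 10^(-g/10); this conversion is exact for a single frame pair and only indicative for dataset-averaged PSNR. The PSNR values are copied from the list above; the dictionary layout and helper name are illustrative.

```python
# (baseline PSNR, AMNet PSNR) in dB, copied from the comparison list above.
pairs = {
    "STCD (DID)": (30.10, 31.57),
    "RetinexFormer (DID)": (25.40, 31.57),
    "EvLight++ R only (SDE-Indoor)": (23.04, 23.22),
    "EvLight++ R+E (SDE-Indoor)": (23.22, 23.25),
    "Zero-TIG zero-shot (DID)": (19.69, 25.07),
}

def mse_ratio(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)

gains = {name: ours - base for name, (base, ours) in pairs.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} dB, MSE x {mse_ratio(gain):.3f}")
```

The STCD gap works out to the 1.47 dB quoted in the key findings, which corresponds to roughly 29% lower mean squared error.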
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.11186.

Fig 1: Comparison of different LLVE paradigms under missing-

Fig 2: illustrates the proposed AMNet, a unified mul-

Limitations
- Synthetic auxiliary modalities rely on generative models, whose fidelity may limit how well the learned cross-modal correlations reflect real sensors.
- No reported evaluation under real-world sensor noise, partial modality corruption, or asynchronous modalities.
- Performance under changing illumination distributions or unseen scenarios beyond benchmarks is unclear.
- Unclear how well the model scales to higher resolution videos or longer temporal sequences.
- Computational overhead of S2DG Translator and multimodal fusion modules is not detailed for real-time scenarios.
- No adversarial robustness or security threat assessment related to modality spoofing or manipulation is provided.
Open questions / follow-ons
- How does AMNet perform under partial modality corruption or misalignment rather than complete absence of auxiliary modalities?
- Can the implicit modality representation synthesis be extended to other modalities beyond event and infrared (e.g., depth, thermal, radar)?
- What is the computational cost and latency impact of the S2DG Translator in resource-constrained or real-time applications?
- How robust is the model to distribution shifts in illumination, noise characteristics, or motion dynamics beyond the synthetic pretraining data?
Why it matters for bot defense
Though the paper focuses on low-light video enhancement, its modality-agnostic inference paradigm directly relates to bot-defense applications where sensor data or auxiliary signals may be incomplete, corrupted, or inconsistent at inference time. The concept of synthesizing implicit auxiliary representations from degraded primary inputs could inspire more robust CAPTCHA or bot-detection systems that fuse multiple signals (e.g., behavior, network data, biometrics) but gracefully degrade when some modalities are missing. The Spatial-Spectral Dual-Gated Translator's approach to selectively gating information based on reliability estimates and frequency analysis may offer techniques to detect or mitigate spoofing attacks or noisy inputs in CAPTCHA validation pipelines. Overall, the robust multimodal learning and flexible inference capability addressed here align with needs in real-world security systems requiring resilient input fusion under adversarial or noisy conditions.
Cite
@article{arxiv2606_11186,
title={AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference},
author={Hangfeng Liang and Yutao Hu and Yanhan Hu and Xiaohan Wu and Wenqi Shao and Ying Fu},
journal={arXiv preprint arXiv:2606.11186},
year={2026},
url={https://arxiv.org/abs/2606.11186}
}