Beyond Pixel Fidelity: Minimizing Perceptual Distortion and Color Bias in Night Photography Rendering
Source: arXiv:2604.28136 · Published 2026-04-30 · By Furkan Kınlı
TL;DR
This paper tackles Night Photography Rendering (NPR), a RAW-to-RGB problem where the scene contains both severely underexposed regions and bright point light sources, creating a difficult mix of noise, color casts, and high dynamic range. The core complaint is not that existing methods fail on PSNR/SSIM, but that their outputs often look wrong to humans: they can preserve pixel fidelity while still exhibiting perceptual issues such as color bias, tone artifacts, and detail loss. The author frames this as a mismatch between objective reconstruction metrics and perceptual quality, using the NTIRE 2025 NPR challenge as the motivating benchmark.
The proposed solution, pHVI-ISPNet, is an extension of CIDNet built around the HVI color space. It adds four components: RAW-domain feature processing via an RGGB-to-Features block, wavelet-based feature propagation to reduce high-frequency loss, a sample-adaptive dynamic loss weighting scheme based on mean intensity ratios, and a feature-distribution-based loss (FDM) to encourage color/statistical consistency. On the NTIRE 2025 benchmark, the method is competitive on fidelity and reported as state of the art on both CIE2000 color difference and LPIPS, with the ablation table showing each module contributing incrementally to those gains.
Key findings
- On the NTIRE 2025 NPR test set, pHVI-ISPNet reports PSNR = 23.63 and SSIM = 0.785, close to the fidelity leader NJUST KMG at PSNR = 23.82 and SSIM = 0.793.
- pHVI-ISPNet achieves the best reported CIE2000 color difference in Table 1: ∆E = 5.42, versus NJUST KMG at 5.85 and Mialgo at 6.52.
- pHVI-ISPNet achieves the best reported LPIPS in Table 1: 0.388, versus Mialgo at 0.400 and NJUST KMG at 0.433.
- The base HVI-CIDNet ablation starts at PSNR = 21.71 / SSIM = 0.757 / ∆E = 7.06 / LPIPS = 0.418, showing that the added components are responsible for most of the final gain.
- Adding the RGGB-to-Features block gives the largest single jump in reconstruction quality in Table 2: PSNR rises from 21.71 to 23.48 and ∆E drops from 7.06 to 5.61.
- Adding wavelet-based feature propagation further improves the ablation to PSNR = 23.60, ∆E = 5.51, and LPIPS = 0.399, indicating that the multiscale detail-preservation claim is supported by the reported numbers.
- Adding L_FDM yields the best LPIPS in the ablation table: 0.379, compared with 0.399 after wavelet propagation alone and 0.418 for the base model.
- The final dynamic loss weighting step (α) gives the best PSNR and ∆E in the ablation: PSNR = 23.63 and ∆E = 5.42, while SSIM remains 0.785.
Threat model
The adversary is not a malicious actor but the imaging condition itself: severely underexposed RAW scenes, high dynamic range, sensor noise, and multiple spectrally diverse light sources that cause color casts and detail loss. The method assumes paired RAW-to-RGB supervision is available and does not consider an attacker who can manipulate the sensor stream, labels, or training data beyond natural capture variation. It also does not address intentional adversarial perturbations or test-time corruption beyond the night-photography domain.
Methodology — deep read
The threat model is not adversarial in the security sense; this is a computational photography rendering system. The implicit adversary is the difficulty of the input distribution itself: extremely dark RAW captures, strong point lights, sensor noise, and multiple illuminants that create color bias and make standard RGB/HSV processing unstable. The method assumes paired training supervision is available, with low-light RAW inputs and aligned high-quality RGB targets, and that the main failure mode to optimize against is perceptual mismatch rather than a malicious attacker.
Data comes from the NTIRE 2025 Night Photography Rendering challenge dataset. The paper says the dataset consists of paired low-light RAW images captured by a Huawei mobile phone and corresponding high-quality RGB images from a Sony camera used as ground truth. For training, the authors use random cropping with 768×768 patches. For evaluation, they use the final test phase set of 200 full-resolution images at 2000×2000. The paper mentions standard preprocessing for ground-truth alignment: black/white-level correction, packing the Bayer mosaic into a 4-channel RGGB tensor I_RAW ∈ R^{H×W×4}, and then perspective transformation plus cropping to ensure pixel alignment. The text does not specify train/validation split details, whether the challenge split was used as-is, or whether any samples were held out beyond the challenge test set.
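To make the preprocessing concrete, here is a minimal sketch of black/white-level normalization and RGGB packing. The level values, the exact CFA layout, and the output resolution convention are assumptions for illustration (packing a full-resolution mosaic halves the spatial dimensions), not details taken from the paper.

```python
import numpy as np

def pack_rggb(raw, black_level=64.0, white_level=1023.0):
    """Normalize a single-channel Bayer mosaic and pack it into a 4-channel RGGB tensor.

    raw: (H, W) sensor values, assumed to follow an RGGB CFA layout.
    Returns an (H/2, W/2, 4) float32 array in [0, 1].
    """
    raw = (raw.astype(np.float32) - black_level) / (white_level - black_level)
    raw = np.clip(raw, 0.0, 1.0)
    r  = raw[0::2, 0::2]   # R  at even rows, even cols
    g1 = raw[0::2, 1::2]   # G1 at even rows, odd cols
    g2 = raw[1::2, 0::2]   # G2 at odd rows, even cols
    b  = raw[1::2, 1::2]   # B  at odd rows, odd cols
    return np.stack([r, g1, g2, b], axis=-1)
```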
Architecturally, the model is built on CIDNet, which uses the HVI color space to decouple color and intensity. CIDNet’s core idea is to transform input features with T_HVI into complementary HV (color) and I (luminance) streams, then process them in a dual-branch U-Net. pHVI-ISPNet modifies that base in two structural ways. First, it replaces the usual 3-channel RGB entry point with an RGGB-to-Features block that ingests the 4-channel packed RAW directly and uses learned convolutions to perform feature-level demosaicking and sensor-noise suppression before HVI conversion. Second, it replaces standard down/up-sampling with wavelet-based propagation: encoder downsampling uses a discrete wavelet transform (DWT), all four sub-bands are retained and fused with a 1×1 convolution, and decoder upsampling uses inverse DWT (IDWT) with split low-pass/high-pass components, then concatenation into skip connections. The novelty here is not the wavelet transform itself, but the choice to preserve high-frequency information throughout the encoder-decoder path instead of discarding it via pooling or strided convolutions.
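The wavelet propagation pattern can be sketched with a single-level Haar transform: the encoder step keeps all four sub-bands and fuses them with a 1×1 convolution, and the decoder step splits features back into sub-bands and inverts the transform. This is a minimal sketch of the idea, not the paper's exact block; the Haar basis, the channel counts, and the placement inside the U-Net are assumptions.

```python
import torch
import torch.nn as nn

class HaarDown(nn.Module):
    """Encoder step: single-level Haar DWT, keep all 4 sub-bands, fuse with a 1x1 conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(4 * in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
        c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
        ll = (a + b + c + d) / 2          # low-pass (approximation)
        lh = (a - b + c - d) / 2          # horizontal detail
        hl = (a + b - c - d) / 2          # vertical detail
        hh = (a - b - c + d) / 2          # diagonal detail
        return self.fuse(torch.cat([ll, lh, hl, hh], dim=1))

class HaarUp(nn.Module):
    """Decoder step: split features into 4 sub-bands with a 1x1 conv, then invert the Haar DWT."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.split = nn.Conv2d(in_ch, 4 * out_ch, kernel_size=1)
        self.out_ch = out_ch

    def forward(self, x):
        ll, lh, hl, hh = torch.chunk(self.split(x), 4, dim=1)
        up = x.new_zeros(x.shape[0], self.out_ch, x.shape[-2] * 2, x.shape[-1] * 2)
        up[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2
        up[..., 0::2, 1::2] = (ll - lh + hl - hh) / 2
        up[..., 1::2, 0::2] = (ll + lh - hl - hh) / 2
        up[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2
        return up
```

The point of keeping LH/HL/HH rather than discarding them is that high-frequency detail survives each resolution change and can be restored exactly on the way back up, which is the behavior the paper attributes to its wavelet propagation.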
The training objective is a composite loss: L_Total = L_p + λ_∆E·L_∆E + λ_FDM·L_FDM. The reconstruction term L_p is itself a weighted sum over RGB and HVI domains of L1, SSIM, and edge loss terms. The weighting is sample-dependent: for each image i, α_i = max(μ_GT/μ_pred, μ_pred/μ_GT), where μ is the global mean intensity, so the loss increases when the predicted image is too dark or too bright relative to ground truth. This dynamic coefficient is applied across the batch as L_p = (1/N) Σ_i α_i · Σ_{I ∈ {I_RGB, I_HVI}} (L_1(I) + L_SSIM(I) + L_E(I)). The perceptual/color terms are the CIE2000 color-difference loss L_∆E and the Feature Distribution Matching loss L_FDM, which is described as derived from Exact Feature Distribution Matching. The paper's formula for L_FDM compares predicted feature vectors to ranked ground-truth feature vectors; it mentions that the features can be, for example, ViT [CLS] tokens, although the exact feature source used in this paper is not fully spelled out. In practice, this loss is intended to align empirical feature distributions so that color/style consistency survives multi-illuminant variation.
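A minimal sketch of the sample-adaptive weighting is below. The reconstruction, ∆E, and FDM terms are passed in as callables because the paper excerpt does not give their exact implementations or the λ values; whether gradients flow through α_i is also not stated, so detaching it here is an assumption.

```python
import torch

def sample_adaptive_weight(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """alpha_i = max(mu_GT / mu_pred, mu_pred / mu_GT), per image, on global mean intensity."""
    eps = 1e-6
    mu_pred = pred.mean(dim=(1, 2, 3)).clamp_min(eps)   # (N,)
    mu_gt = gt.mean(dim=(1, 2, 3)).clamp_min(eps)       # (N,)
    return torch.maximum(mu_gt / mu_pred, mu_pred / mu_gt)

def total_loss(pred_rgb, gt_rgb, pred_hvi, gt_hvi,
               recon_terms,               # callable (pred, gt) -> per-sample L1 + SSIM + edge loss, shape (N,)
               delta_e_loss, fdm_loss,    # callables returning scalar losses
               lam_de=1.0, lam_fdm=1.0):  # placeholder weights; the paper does not report the values
    # Detaching alpha is an assumption; the paper does not say whether gradients pass through it.
    alpha = sample_adaptive_weight(pred_rgb, gt_rgb).detach()
    l_p = (alpha * (recon_terms(pred_rgb, gt_rgb) + recon_terms(pred_hvi, gt_hvi))).mean()
    return l_p + lam_de * delta_e_loss(pred_rgb, gt_rgb) + lam_fdm * fdm_loss(pred_rgb, gt_rgb)
```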
Training and evaluation are straightforward, but the paper gives limited reproducibility detail. The model is implemented in PyTorch and trained on 2× NVIDIA RTX 2080Ti GPUs for 2500 epochs with batch size 2. Optimization uses AdamW with an initial learning rate of 2e-4, linear warmup for the first 3 epochs, then cosine annealing down to 1e-5. The paper does not report seed strategy, weight decay, mixed precision, gradient clipping, or early stopping. Evaluation uses PSNR, SSIM, LPIPS, and CIE2000 (∆E). The main comparison table includes challenge participants including NJUST KMG, Mialgo, and PSU, and the ablation study increments from base HVI-CIDNet through RGGB-to-Features, wavelet propagation, L_FDM, and dynamic loss weighting. The strongest concrete example is the final reported model, which goes from a packed RAW Bayer input through learned feature extraction, into the HVI dual-branch network with wavelet-preserving multiscale processing, and is supervised by mixed reconstruction plus perceptual/color losses; that configuration yields the best reported ∆E and LPIPS in Table 1.
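Under the stated hyperparameters (AdamW at 2e-4, 3 warmup epochs, cosine annealing to 1e-5 over 2500 epochs), one plausible way to set up the schedule is the following sketch; the per-epoch stepping and the warmup shape are assumptions, since the paper does not spell them out.

```python
import math
import torch

def build_optimizer_and_scheduler(model, epochs=2500, warmup_epochs=3, lr=2e-4, min_lr=1e-5):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    def lr_lambda(epoch):
        # Linear warmup for the first few epochs, then cosine decay from lr down to min_lr.
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        t = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        cos = 0.5 * (1.0 + math.cos(math.pi * t))
        return (min_lr + (lr - min_lr) * cos) / lr

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_lambda)
    return opt, sched
```

Under this assumption, `sched.step()` is called once per epoch; stepping per iteration would require rescaling the warmup and annealing horizons accordingly.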
Reproducibility is partial. The paper cites the NTIRE 2025 challenge dataset and reports the training setup, but it does not provide code, pretrained weights, or a public release in the text excerpt. It also relies on a challenge dataset whose exact availability and license are not described here. The exact architecture depth, channel counts, and some implementation details of the RGGB-to-Features block and HVI integration are not fully enumerated in the text, so a reader would need the code or supplementary material to reimplement it faithfully.
Technical innovations
- Directly processes packed 4-channel RAW Bayer input with a learned RGGB-to-Features front end instead of converting immediately to a 3-channel representation.
- Replaces conventional pooling/strided convolution with DWT/IDWT-based feature propagation to preserve high-frequency detail across the encoder-decoder.
- Introduces a sample-adaptive loss coefficient α_i based on the ratio of predicted and ground-truth mean intensities to stabilize learning across exposure extremes.
- Adds CIE2000 color-difference supervision and a feature distribution matching loss on top of RGB/HVI reconstruction losses to explicitly target perceptual realism and color constancy (a minimal sketch of the distribution-matching term follows this list).
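The distribution-matching term referenced above is described as derived from Exact Feature Distribution Matching, comparing predicted feature vectors against ranked (sorted) ground-truth feature vectors. A minimal sketch of such a sorted-matching loss is below; the feature source (e.g., ViT [CLS] tokens), the sorting axis, and the use of an MSE on the sorted values are assumptions, since the paper does not fully specify them.

```python
import torch
import torch.nn.functional as F

def fdm_loss(pred_feat: torch.Tensor, gt_feat: torch.Tensor) -> torch.Tensor:
    """EFDM-style loss: match the order statistics of predicted and target features.

    pred_feat, gt_feat: (N, D) feature vectors from some frozen feature extractor.
    Sorting each vector and comparing the sorted values aligns the empirical
    distributions rather than only their first- and second-order moments.
    """
    pred_sorted, _ = torch.sort(pred_feat, dim=-1)   # sorting axis is an assumption
    gt_sorted, _ = torch.sort(gt_feat, dim=-1)
    return F.mse_loss(pred_sorted, gt_sorted.detach())
```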
Datasets
- NTIRE 2025 Night Photography Rendering benchmark — 200 full-resolution test images (2000×2000); paired training set size not specified in the paper excerpt — source: NTIRE 2025 challenge dataset
- Paired low-light RAW captures from Huawei phone and corresponding Sony RGB ground truth — size not specified in the paper excerpt — source: NTIRE 2025 challenge dataset
Baselines vs proposed
- NJUST KMG: PSNR = 23.82 vs proposed: 23.63
- NJUST KMG: SSIM = 0.793 vs proposed: 0.785
- NJUST KMG: ∆E = 5.85 vs proposed: 5.42
- NJUST KMG: LPIPS = 0.433 vs proposed: 0.388
- Mialgo: ∆E = 6.52 vs proposed: 5.42
- Mialgo: LPIPS = 0.400 vs proposed: 0.388
- Base HVI-CIDNet: PSNR = 21.71 vs proposed: 23.63
- Base HVI-CIDNet: SSIM = 0.757 vs proposed: 0.785
- Base HVI-CIDNet: ∆E = 7.06 vs proposed: 5.42
- Base HVI-CIDNet: LPIPS = 0.418 vs proposed: 0.388
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.28136.

Fig 1: Perceptual gap in challenging night-time scenes.

Fig 2: The pHVI-ISPNet Architecture. Our network extends the base CIDNet [2] dual-branch structure with two key structural…

Limitations
- The method is evaluated on a single benchmark challenge dataset; there is no cross-dataset generalization test to other night-scene RAW collections.
- The paper emphasizes challenge-test performance but does not report statistical significance, confidence intervals, or variance over multiple runs.
- Implementation details are incomplete for exact reproduction: the full network depth, feature widths, and some loss weights (λ_∆E, λ_FDM) are not given in the excerpt.
- The dynamic loss weighting slightly worsens LPIPS in the final ablation compared with the preceding configuration that adds L_FDM but not α (0.388 vs 0.379 in Table 2), suggesting a trade-off that is not deeply analyzed.
- No adversarial robustness, sensor-shift robustness, or out-of-distribution illuminant tests are reported.
- The paper does not describe code release or pretrained checkpoint availability in the provided text.
Open questions / follow-ons
- How well does the HVI + RAW-domain + wavelet design transfer to other low-light RAW datasets with different sensors, CFA patterns, or color matrices?
- Would the dynamic loss weighting still help if the training distribution included more extreme exposure imbalance or stronger temporal noise, such as burst night photography?
- Can the feature-distribution loss be replaced or combined with a stronger perceptual objective that more directly correlates with human preference in night scenes?
- How sensitive are the reported gains to the exact λ weights and to the choice of feature layer used by FDM?
Why it matters for bot defense
For a bot-defense engineer, the main takeaway is less about the image-restoration task itself and more about the modeling pattern: explicitly separating color/intensity structure, preserving high-frequency information through the network, and using perceptual rather than purely pixel-based objectives. In CAPTCHA or anti-bot settings where visual realism matters, the paper is a reminder that optimizing a fidelity metric can still produce outputs that look wrong to humans or to downstream classifiers. The RAW-domain details do not transfer directly, but the broader lesson does: if a rendering or synthesis pipeline must survive human scrutiny, you may need losses and architectures that target perceptual consistency, color constancy, and multi-scale detail preservation, not just average reconstruction error.
Cite
@article{arxiv2604_28136,
title={Beyond Pixel Fidelity: Minimizing Perceptual Distortion and Color Bias in Night Photography Rendering},
author={Furkan Kınlı},
journal={arXiv preprint arXiv:2604.28136},
year={2026},
url={https://arxiv.org/abs/2604.28136}
}