An Attention-Based Denoising Model for Diffusion Weighted Imaging

Source: arXiv:2606.03903 · Published 2026-06-02 · By Prithviraj Verma, Pawan Kumar, Chandan Deshani, Prasun Chandra Tripathi

TL;DR

This paper addresses the challenge of denoising diffusion-weighted imaging (DWI) scans corrupted by signal-dependent Rician noise, which arises from magnitude reconstruction and complicates restoration due to spatially varying noise variance and intensity bias. Traditional CNN-based denoisers and classical MRI denoising methods often assume Gaussian noise or rely on handcrafted priors that do not adequately handle this heteroscedastic noise, especially under severe corruption. The authors propose a novel noise-aware denoising framework combining hierarchical Swin Transformer window attention with Restormer-based multi-dimensional gated refinement, integrating explicit noise-level conditioning and residual reconstruction to adaptively suppress Rician noise across multiple noise regimes (1–15%).

The model is trained and evaluated on the multi-site Traveling Human Phantom (THP) DWI dataset, where it achieves a mean PSNR of 33.69 dB and SSIM of 0.8539 across noise levels, compared with 26.27 dB and 0.6518 for a U-Net baseline trained under the same noise conditions. Ablation studies indicate that Swin Transformer spatial attention and Restormer channel-adaptive refinement are complementary: each variant alone scores lower than the combination. Qualitative results show the model preserving fine anatomical detail relevant to clinical interpretation. The main contribution is explicit conditioning on the noise level, combined with a design aimed at the signal-dependent statistics of Rician noise in DWI.

Key findings

Proposed Swin–Restormer model achieves mean PSNR of 33.69 dB and SSIM of 0.8539 on THP DWI dataset across 1–15% Rician noise levels.
Model maintains robust performance even at severe noise levels of 13–15%, e.g., 29.24 dB PSNR and 0.7391 SSIM at 15%.
Compared to a U-Net baseline trained under the same noise conditions, the proposed model improves PSNR from 26.27 dB to 33.69 dB and SSIM from 0.6518 to 0.8539 (Table 4).
The Swin-Transformer-only variant achieves 31.61 dB PSNR and 0.8489 SSIM, while the Restormer-only variant yields 31.83 dB PSNR and 0.8347 SSIM; the combined model exceeds both, indicating complementary benefits.
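Reports higher PSNR than classical MRI denoisers (NLM-Rician: 27.6 dB PSNR, 0.81 SSIM) and prior CNN methods such as DnCNN (28.5 dB PSNR). These baseline figures are quoted for different datasets (DW-MRI, brain MRI, Brainweb), so the comparison is indicative rather than like-for-like; NLPCA's quoted SSIM of 0.88 on Brainweb is higher than the proposed model's 0.8539.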
Explicit noise-level conditioning with spatially broadcast scalar enables adaptive denoising across multiple Rician noise intensities, improving generalization.
Residual learning framework (predicting noise residual) enhances detail preservation and convergence stability on DWI restoration.
Qualitative visual evaluation demonstrates preservation of edge sharpness and fine anatomical structures under heavy noise corruption.

Threat model

The paper does not target an active adversarial threat but rather addresses the presence of signal-dependent Rician noise arising naturally from MRI magnitude reconstruction and accelerated acquisition protocols. The objective is to denoise corrupted DWI scans exhibiting heteroscedastic noise with spatial and intensity-dependent variance. There is no assumption of malicious interference or attacker capabilities; the noise is treated as an inherent physical corruption to be adaptively suppressed while preserving anatomy.

Methodology — deep read

Problem setting: The paper assumes a realistic clinical setting in which diffusion-weighted magnitude images carry signal-dependent noise that follows a Rician distribution. The noise is heteroscedastic, with spatially varying variance and an intensity bias that is most pronounced in low-signal regions. There is no active attacker; the corruption arises during acquisition or accelerated scanning.

Dataset: The Traveling Human Phantom (THP) dataset was used, containing repeated DWI acquisitions of five healthy subjects scanned across eight sites, totaling 25,200 2D DWI slices resized to 128×128 pixels. Labels are clean DWI images; corrupted inputs are generated synthetically by adding Rician noise with standard deviation σ uniformly sampled from 1% to 15% of normalized intensity. Data was split 70% training, 10% validation, 20% testing with random 64×64 crops during training to enrich data diversity.
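The synthetic corruption and cropping described above can be sketched as follows. This is a minimal illustration using the standard Rician construction (Gaussian noise on real and imaginary channels, followed by a magnitude operation), not the authors' code. The function names, the random stand-in image, and the choice to crop before adding noise are all assumptions.

```python
import numpy as np

def add_rician_noise(clean, sigma, rng):
    """Corrupt a magnitude image in [0, 1] with Rician noise of level sigma."""
    # Independent Gaussian noise on the real and imaginary channels.
    real = clean + rng.normal(0.0, sigma, size=clean.shape)
    imag = rng.normal(0.0, sigma, size=clean.shape)
    # Taking the magnitude makes the noise signal-dependent and biased.
    return np.sqrt(real**2 + imag**2)

def sample_training_pair(clean_slice, rng, crop=64, sigma_range=(0.01, 0.15)):
    """Random crop plus a noise level drawn uniformly from sigma_range."""
    h, w = clean_slice.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = clean_slice[top:top + crop, left:left + crop]
    sigma = rng.uniform(*sigma_range)
    return add_rician_noise(patch, sigma, rng), patch, sigma

rng = np.random.default_rng(0)
clean = rng.random((128, 128))  # stand-in for a normalized 128x128 DWI slice
noisy, target, sigma = sample_training_pair(clean, rng)
```

In a zero-signal background region the output is Rayleigh-distributed with mean σ·sqrt(π/2), which is the low-intensity bias that distinguishes Rician from additive Gaussian noise.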

Architecture: The model combines hierarchical Swin Transformer blocks with Restormer blocks. The Swin blocks use shifted 8×8 window multi-head self-attention, which models long-range spatial dependencies at lower cost than global attention. Two Restormer blocks in the bottleneck apply multi-dimensional transposed attention (modeling adaptive channel inter-dependencies) and gated depthwise convolutional feed-forward networks (GDFN) for nonlinear feature refinement. The input is the noisy image concatenated with a spatially broadcast noise-level scalar map; the two are embedded jointly by a 3×3 convolution into a 64-channel feature space. The encoder uses hierarchical attention layers, the bottleneck performs Restormer-based channel-adaptive refinement, and the decoder refines and upsamples the features. The network predicts the noise residual, which is subtracted from the noisy input to produce the denoised output.
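The input conditioning and residual reconstruction steps can be illustrated independently of the network itself. In the sketch below, `build_conditioned_input`, `denoise`, and `placeholder_network` are hypothetical names; the placeholder predicts a zero residual and merely stands in for the trained Swin–Restormer model, which is not reproduced here.

```python
import numpy as np

def build_conditioned_input(noisy, sigma):
    """Stack the noisy image with a constant noise-level map: shape (2, H, W)."""
    sigma_map = np.full(noisy.shape, sigma, dtype=noisy.dtype)
    return np.stack([noisy, sigma_map], axis=0)

def placeholder_network(x):
    """Hypothetical stand-in for the trained network; predicts a zero residual."""
    return np.zeros_like(x[0])

def denoise(noisy, sigma, network=placeholder_network):
    """Residual reconstruction: denoised = noisy - predicted noise residual."""
    x = build_conditioned_input(noisy, sigma)
    residual = network(x)
    return noisy - residual

rng = np.random.default_rng(0)
noisy = rng.random((64, 64)).astype(np.float32)
x = build_conditioned_input(noisy, sigma=0.05)  # two-channel network input
denoised = denoise(noisy, sigma=0.05)           # equals noisy for the placeholder
```

Because the noise level enters as an extra input channel, one set of weights can in principle serve the whole 1–15% range rather than requiring a separate model per level.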

Training: The model uses a hybrid Charbonnier-SSIM loss that balances pixel fidelity and structural similarity (weight α=0.3). Optimization used Adam with a learning rate of 5×10^-4 and cosine annealing. Training ran for 30 epochs with batch size 4 on 64×64 patches, using an NVIDIA RTX 5050 GPU. Noise σ values were sampled uniformly from [1%, 15%] per sample to enforce noise-aware learning.
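A rough sketch of the loss and learning-rate schedule follows. The summary does not give the exact way the two loss terms are combined, so the form (1 − α)·Charbonnier + α·(1 − SSIM) is an assumption, as are the Charbonnier epsilon, the zero minimum learning rate, and per-epoch scheduling. For brevity the SSIM here uses whole-patch statistics rather than the usual sliding Gaussian window.

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Smooth L1-like penalty; eps is an assumed value."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))

def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM from whole-patch statistics (no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)

def hybrid_loss(pred, target, alpha=0.3):
    """Assumed combination: (1 - alpha) * Charbonnier + alpha * (1 - SSIM)."""
    return (1 - alpha) * charbonnier(pred, target) + alpha * (1 - global_ssim(pred, target))

def cosine_lr(epoch, total_epochs=30, base_lr=5e-4, min_lr=0.0):
    """Cosine annealing from base_lr down to min_lr (min_lr is assumed)."""
    cos = 0.5 * (1 + np.cos(np.pi * epoch / total_epochs))
    return float(min_lr + (base_lr - min_lr) * cos)

rng = np.random.default_rng(0)
target = rng.random((64, 64))
pred = target + rng.normal(0.0, 0.05, size=target.shape)
loss = hybrid_loss(pred, target)
```

The SSIM term penalizes structural distortion that a purely pixel-wise penalty can miss, while the Charbonnier term remains less sensitive to outliers than a squared error.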

Evaluation: Performance was measured by PSNR and SSIM on held-out test data corrupted by Rician noise at levels 1%–15%. The model was compared against classical denoisers (non-local means), CNN-based baselines like U-Net, DnCNN, and transformer-based variants limited to either Swin or Restormer components. An ablation study assessed component contributions. Qualitative visual assessment confirmed structural preservation.
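For reference, PSNR for images normalized to [0, 1] can be computed as below. This is the standard definition, 10·log10(MAX² / MSE); the paper's exact evaluation code and any masking of background voxels are not described in this summary, so treat the helper as illustrative.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images with the given dynamic range."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10 * np.log10(data_range ** 2 / mse))

reference = np.zeros((64, 64))
estimate = np.full((64, 64), 0.1)   # constant error of 0.1 gives MSE of about 0.01
value = psnr(reference, estimate)   # approximately 20 dB
```

Because the scale is logarithmic, every additional 10 dB corresponds to a tenfold reduction in mean squared error.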

Reproducibility: Models and data splits are described in detail. The THP data is a known public multi-center DWI dataset. Exact code release status is not stated in the paper, so reproducibility assumes reimplementation based on descriptions provided. Noise corruption simulation follows a standard Rician model (Equation 6). The training regime and hyperparameters are fully specified.

Technical innovations

Integration of hierarchical Swin Transformer windowed self-attention with Restormer multi-dimensional channel-adaptive gated refinement within a noise-aware denoising architecture.
Explicit noise-level conditioning via spatially broadcast scalar concatenated with input enables adaptive handling of heteroscedastic Rician noise across multiple corruption regimes.
Residual reconstruction framework predicting noise residual rather than direct denoised image enhances stability and detail preservation in DWI restoration.
Hybrid Charbonnier–SSIM loss function balances pixel fidelity and structural consistency optimized for medical image restoration under Rician noise.
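The windowed attention in the first item relies on splitting a feature map into non-overlapping 8×8 windows, with alternate blocks operating on a cyclically shifted map. The sketch below shows only that partitioning step in NumPy, assuming a channels-last layout and a shift of half the window size as in the original Swin Transformer design; the attention computation and the mask Swin applies to wrapped-around windows are omitted.

```python
import numpy as np

def window_partition(feat, window=8):
    """Split (H, W, C) features into (num_windows, window*window, C) token groups."""
    h, w, c = feat.shape
    assert h % window == 0 and w % window == 0, "H and W must be multiples of window"
    x = feat.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-grid axes together
    return x.reshape(-1, window * window, c)

def shifted_window_partition(feat, window=8):
    """Cyclically shift by half a window before partitioning."""
    shift = window // 2
    shifted = np.roll(feat, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)

feat = np.random.default_rng(0).random((64, 64, 4))
regular = window_partition(feat)          # 64 windows of 64 tokens each
shifted = shifted_window_partition(feat)  # same shape, windows offset by 4 pixels
```

For a 64×64 patch, global self-attention would compare 4096 × 4096 token pairs, whereas 64 windows of 64 tokens each require 64 × 64 × 64 pairs, a 64-fold reduction. The shift lets information cross window borders in successive blocks.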

Datasets

Traveling Human Phantom (THP) — 25,200 2D DWI slices — public multi-site multi-subject MRI dataset

Baselines vs proposed

U-Net baseline (Rician noise, THP DWI): PSNR = 26.27 dB, SSIM = 0.6518 vs Proposed Swin–Restormer: PSNR = 33.69 dB, SSIM = 0.8539
Swin Transformer Only: PSNR = 31.61 dB, SSIM = 0.8489 vs Proposed: 33.69 dB, 0.8539
Restormer Only: PSNR = 31.83 dB, SSIM = 0.8347 vs Proposed: 33.69 dB, 0.8539
NLM-Rician [12] on DW-MRI: PSNR = 27.6 dB, SSIM = 0.81 vs Proposed: 33.69 dB, 0.8539
Optimized NLM [1] on 3D Brain MRI: PSNR = 28.4 dB, SSIM = 0.83 vs Proposed: 33.69 dB, 0.8539
NLPCA [5] on Brainweb MRI: PSNR = 31.2 dB, SSIM = 0.88 vs Proposed: 33.69 dB, 0.8539
DnCNN [16] (Gaussian noise) Brain MRI: PSNR = 28.5 dB, SSIM = 0.82 vs Proposed: 33.69 dB, 0.8539
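To put the PSNR gaps in perspective, a difference of Δ dB corresponds to an MSE ratio of 10^(Δ/10). The snippet below applies this to the rows that appear to share the THP setup (the U-Net baseline and the two ablation variants). Since the reported figures are means of per-image PSNR values, the ratios are approximate; the variable and function names are illustrative.

```python
def mse_ratio(gain_db):
    """MSE reduction factor implied by a PSNR gain in dB."""
    return 10 ** (gain_db / 10)

PROPOSED_PSNR = 33.69
same_setup = {"U-Net": 26.27, "Swin only": 31.61, "Restormer only": 31.83}

gains = {name: PROPOSED_PSNR - value for name, value in same_setup.items()}
ratios = {name: mse_ratio(gain) for name, gain in gains.items()}

for name in same_setup:
    print(f"{name}: +{gains[name]:.2f} dB, about {ratios[name]:.2f}x lower MSE")
```

The 7.42 dB gain over U-Net therefore implies roughly a 5.5-fold lower mean squared error, while the gains over the single-component variants are considerably smaller.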

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.03903.

Fig 1

Fig 1: The proposed Swin–Restormer architecture showing encoder, bottleneck

Fig 2

Fig 2: Complete denoising workflow from preprocessing through residual recon-

Limitations

Evaluation limited to synthetic Rician noise on the THP dataset; real-world acquisition noise variability or other MRI sequences were not examined.
No explicit adversarial or worst-case noise corruption experiments to test robustness beyond uniform Rician noise levels.
No description of code or model weight release, which may hinder reproducibility and direct application.
Training and evaluation focused on relatively small 2D image patches (64×64), leaving generalization to full 3D volumes or larger resolutions untested.
The study assesses generalization across noise levels but not different scanner hardware or pathological conditions that may alter noise characteristics.

Open questions / follow-ons

How well does the model generalize to real clinical data with diverse acquisition settings, scanner types, and patient pathologies beyond synthetic noise simulation?
Can the approach be extended effectively to 3D volumetric DWI denoising or multi-shell diffusion MRI sequences?
What is the impact of this denoising strategy on downstream diffusion biomarker accuracy, e.g. apparent diffusion coefficient or fractional anisotropy metrics?
How does the model perform under adversarial or non-Rician noise conditions, including motion artifacts and physiological noise variations common in clinical scans?

Why it matters for bot defense

While this work is focused on MRI image denoising and not directly on bot-defense or CAPTCHA challenges, it provides valuable insights into adaptive noise suppression and hierarchical attention mechanisms that could inspire approaches in signal or image restoration in adversarial settings. Bot detection systems sometimes process noisy or corrupted inputs such as distorted CAPTCHAs or sensor data; techniques that combine noise-level conditioning with transformer-based contextual modeling might enhance robustness and generalization. Additionally, the explicit modeling of heteroscedastic and signal-dependent noise using noise conditioning may parallel challenges in modeling non-stationary noise in user or bot behavior signals under adversarial conditions. Engineers could explore adapting hierarchical attention frameworks and noise-aware residual learning architectures to boost denoising or signal cleaning in highly variable input environments encountered in bot-defense scenarios.

Cite

bibtex

@article{arxiv2606_03903,
  title={An Attention-Based Denoising Model for Diffusion Weighted Imaging},
  author={Prithviraj Verma and Pawan Kumar and Chandan Deshani and Prasun Chandra Tripathi},
  journal={arXiv preprint arXiv:2606.03903},
  year={2026},
  url={https://arxiv.org/abs/2606.03903}
}
