Supercharging Thermal Gaussian Splatting with Depth Estimation

Source: arXiv:2605.30328 · Published 2026-05-28 · By Manoj Biswanath, Chenxin Cai, Hannah Schieber, Daniel Roth, Benjamin Busam

TL;DR

This paper addresses robust 3D scene representation and novel view synthesis (NVS) from thermal infrared imagery, using depth estimated from the same thermal images as additional supervision so that no RGB images are needed during optimization. The authors propose Thermal-to-Depth Gaussian Splatting (TDg), which integrates monocular depth estimation directly into a single Gaussian Splatting model optimized only on thermal images. This avoids the complexity and slower convergence of prior multi-modal approaches that fuse RGB, thermal, and depth data. On the RGBT-Scenes and ThermalMix thermal datasets, TDg matches or exceeds the rendering quality of the MSMG multispectral Gaussian Splatting baseline while cutting training time by approximately 55% on average.

TDg uses a progressive joint loss that combines a thermal photometric term with a depth rendering term. The depth term's weight decays over training, so optimization first enforces geometry and then refines thermal appearance. The depth maps come from a monocular thermal depth estimator that is trained independently and used only for supervision. The unified Gaussian model is optimized without separate streams per modality, which simplifies the architecture. On average over MSMG, results improve in PSNR (+0.24), SSIM (+0.0003), and LPIPS (-0.002), with consistent runtime gains that the authors attribute to depth supervision constraining geometry and speeding up convergence. The paper also discusses open challenges: the sparse point cloud is still initialized from RGB-based COLMAP, and reconstruction fails in occluded or unseen regions covered by few views.

Key findings

The TDg method improves average PSNR from 27.18 to 27.42 (+0.24) compared to MSMG baseline on combined RGBT-Scenes and ThermalMix datasets.
Average SSIM improves slightly from 0.8896 to 0.8899 (+0.0003), and LPIPS decreases from 0.177 to 0.175 (-0.002), indicating improved visual fidelity.
TDg reduces average training time per scene from approximately 0.389 hours to 0.176 hours, a reduction of about 55%.
TDg outperforms MSMG on every scene of the RGBT-Scenes dataset except for one SSIM case; ThermalMix quality results are more mixed, but runtime improves on every scene.
Progressive joint optimization with depth decay enables transitioning from geometry-guided to photometric-focused training, stabilizing optimization.
Use of monocular depth estimation from thermal images only enables single-modality learning, avoiding misalignment and optimization conflicts of multimodal fusion.
An ablation shows that models initialized from random point clouds degrade severely (PSNR < 10) compared to COLMAP initialization (~27 PSNR).
Failure cases arise in reconstruction of large-scale or unseen areas with insufficient views, where the depth-guided method cannot densify Gaussians effectively.
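
To put the PSNR numbers above in context, the following is a minimal sketch of how PSNR is computed for a single image, assuming pixel values normalized to [0, 1] and flattened into plain lists. The helper names `psnr` and `mse_ratio_for_gain` are illustrative and do not come from the paper or its code.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks when PSNR rises by gain_db (single image)."""
    return 10.0 ** (-gain_db / 10.0)
```

Because PSNR is logarithmic, a +0.24 dB gain on a single image corresponds to a mean squared error roughly 5% lower. The reported figure is an average over scenes, so this conversion is only indicative.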

Threat model

n/a — No explicit security threat model or adversary assumptions are discussed, as the paper focuses on efficient and accurate thermal-based 3D scene reconstruction and novel view synthesis.

Methodology — deep read

The paper proposes Thermal-to-Depth Gaussian Splatting (TDg) which learns a unified 3D Gaussian radiance field solely from thermal images supplemented by monocular depth estimation derived from the same thermal modality. The approach is motivated by challenges of multimodal fusion (RGB + thermal + depth) causing slower convergence and optimization conflicts.

Threat model & assumptions: There is no security adversary; robustness here means performance in thermally challenging scenes without RGB input during optimization. The system assumes calibrated multi-view thermal images with known camera poses. Poses and the sparse point cloud are obtained from COLMAP SfM run on the paired RGB images, and a monocular depth estimation network applied to the thermal images provides depth priors.
Data: Experiments use the RGBT-Scenes (9 scenes) and ThermalMix (5 scenes) datasets, each with 640×480 thermal images paired with RGB. A standard 80%/20% train/test split is followed. Depth maps for supervision are precomputed using a thermal monocular depth estimator (Marigold method). Poses and sparse point clouds come from COLMAP RGB SfM, as thermal SfM is unreliable.
Architecture/algorithm: TDg represents the scene as N 3D Gaussians parameterized by position (µi), covariance (Σi), thermal radiance feature (ci), and opacity (αi). A single unified set of Gaussians encodes geometry and thermal radiance simultaneously. Rendering produces both thermal images and depth maps via differentiable Gaussian rasterization. Depth supervision aligns Gaussian geometry towards depth estimator outputs. The loss is a weighted sum of (a) thermal photometric loss combining L1, SSIM, and smoothness regularization; (b) depth loss combining L1 and SSIM between rendered and estimated depth, normalized to [0,1]. A progressive decay factor gradually reduces depth loss weight during training.
Training regime: Implemented in PyTorch with CUDA-accelerated rasterization and run on an NVIDIA RTX 3060 GPU. Training takes 10k to 30k iterations, fewer than the baseline. The depth loss weight starts decaying at the first iteration and fades out by half of the total iterations. Batch size is not specified; each scene is trained separately. Loss hyperparameters are mostly inherited from prior Gaussian Splatting implementations.
Evaluation protocol: Benchmarks rely on PSNR, SSIM, and LPIPS, comparing rendered thermal images to ground truth. Training time per scene is measured for efficiency. Comparisons are made against the MSMG (multi-modal Gaussian splatting) baseline from Lu et al., 2024. Ablations on point cloud initialization (random vs COLMAP) test the importance of geometric priors. Qualitative visualizations on challenging thermal scenes validate fidelity.
Reproducibility: Code and experimental scripts are publicly available at https://hannahhaensen.github.io/TDg/. Depth networks used for supervision are cited but probably static pretrained models. Sparse point clouds come from COLMAP RGB-based SfM, which is not replaced by thermal-only SfM at present.
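
The progressive joint loss described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it keeps only the L1 terms (the SSIM and smoothness terms are omitted), assumes a linear decay shape, and uses a hypothetical `initial_weight` parameter. The summary only states that the decay starts at the first iteration and fades out by half of training.

```python
def depth_weight(iteration, total_iterations, initial_weight=1.0):
    """Depth-loss weight that fades to zero by half of training.

    Assumption: linear decay. The source does not specify the decay shape.
    """
    cutoff = total_iterations / 2
    if iteration >= cutoff:
        return 0.0
    return initial_weight * (1.0 - iteration / cutoff)


def l1(pred, target):
    """Mean absolute error between two equally sized value lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def joint_loss(thermal_pred, thermal_gt, depth_pred, depth_prior,
               iteration, total_iterations):
    """Thermal L1 term plus a decaying depth L1 term (SSIM terms omitted)."""
    thermal_term = l1(thermal_pred, thermal_gt)
    depth_term = l1(depth_pred, depth_prior)
    return thermal_term + depth_weight(iteration, total_iterations) * depth_term
```

With this schedule, early iterations are dominated by agreement with the depth prior, and after the halfway point the loss reduces to the thermal photometric term alone.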

Concrete example end-to-end: Starting from multi-view thermal images and RGB SfM poses, a thermal monocular depth estimator predicts depth maps. These depth maps and the thermal images supervise a single unified 3D Gaussian splatting model of the scene. The model is optimized by minimizing a loss that combines thermal reconstruction fidelity and depth map alignment, with the depth term progressively decayed. After fewer than 30k iterations, the model produces novel thermal views and depth predictions with higher PSNR/SSIM and faster convergence than the MSMG baseline, which relies on multimodal Gaussian sets.
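
To illustrate how one set of Gaussians can yield both a thermal value and a depth value per pixel, here is a minimal sketch of front-to-back alpha compositing along a single ray, as used in standard Gaussian Splatting. Whether TDg's rasterizer computes depth in exactly this way is an assumption, and the per-sample `alpha` is taken as given rather than derived from a projected Gaussian.

```python
def composite_ray(samples):
    """Composite thermal and depth values along one ray.

    samples: list of (depth, alpha, thermal) tuples, one per Gaussian the
    ray passes through. Returns (thermal, depth, remaining_transmittance).
    """
    thermal, depth, transmittance = 0.0, 0.0, 1.0
    # Process Gaussians from nearest to farthest.
    for d, alpha, c in sorted(samples, key=lambda s: s[0]):
        weight = alpha * transmittance  # contribution of this Gaussian
        thermal += weight * c
        depth += weight * d
        transmittance *= 1.0 - alpha    # light left for Gaussians behind it
    return thermal, depth, transmittance
```

Both outputs share the same blending weights, which is why a loss on the rendered depth also moves the Gaussians that determine thermal appearance.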

Technical innovations

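A unified 3D Gaussian representation jointly optimized for thermal radiance and depth, supervised only by thermal images and thermal-derived depth; RGB images, common in prior works, are not used during optimization (they are still used for COLMAP initialization).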
Progressive joint optimization strategy with depth loss weight decay to first impose geometric constraints then focus on thermal photometric fidelity.
Monocular depth estimation from thermal images integrated as a geometric supervisory signal within Gaussian Splatting, guiding stable and efficient convergence.
Demonstration that single-modality thermal NVS can achieve comparable or better reconstruction quality and 55% reduction in training time compared to multi-modal MSMG baseline.

Datasets

RGBT-Scenes — 9 scenes — public thermal-RGB multimodal benchmark
ThermalMix — 5 scenes — public thermal-RGB multimodal benchmark
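
The 80%/20% train/test split mentioned in the methodology could be produced as in the sketch below. The paper summary does not say how views are assigned, so holding out every fifth view in sorted order is an assumption, and `split_views` is a hypothetical helper.

```python
def split_views(view_names, test_every=5):
    """Hold out every `test_every`-th view (in sorted order) for testing.

    With test_every=5 this gives an 80%/20% train/test split.
    """
    ordered = sorted(view_names)
    train, test = [], []
    for index, name in enumerate(ordered):
        if index % test_every == test_every - 1:
            test.append(name)
        else:
            train.append(name)
    return train, test
```

A deterministic, evenly spaced hold-out keeps test views spread across the camera trajectory instead of clustering them in one part of the scene.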

Baselines vs proposed

MSMG (Lu et al., 2024): PSNR = 27.18 vs TDg: 27.42
MSMG: SSIM = 0.8896 vs TDg: 0.8899
MSMG: LPIPS = 0.177 vs TDg: 0.175
MSMG: Training time = 0.389 hours/scene vs TDg: 0.176 hours/scene
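
The deltas and the training-time reduction quoted in this summary follow directly from the four averages above. The short check below recomputes them; the dictionary keys are illustrative names only.

```python
# Average metrics as reported in the comparison above.
baseline = {"psnr": 27.18, "ssim": 0.8896, "lpips": 0.177, "hours": 0.389}
tdg = {"psnr": 27.42, "ssim": 0.8899, "lpips": 0.175, "hours": 0.176}


def deltas(base, new):
    """Per-metric difference (new minus base)."""
    return {key: new[key] - base[key] for key in base}


def relative_reduction(base, new):
    """Fraction by which `new` is smaller than `base`."""
    return (base - new) / base


d = deltas(baseline, tdg)
time_saved = relative_reduction(baseline["hours"], tdg["hours"])
print(f"PSNR {d['psnr']:+.2f} dB, SSIM {d['ssim']:+.4f}, LPIPS {d['lpips']:+.3f}")
print(f"training time reduced by {time_saved:.1%}")
```

The time reduction evaluates to about 54.8%, consistent with the rounded 55% figure, and corresponds to a speedup of roughly 2.2x.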

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.30328.

Fig 1: Thermal images rendered from the GS model of our

Fig 2: TDg architecture. A unified 3D Gaussian representation (center) is optimized via dual rasterization. By comparing the

Fig 3: Rendering image results on the dataset scenes: (a)

Fig 4: Building reconstruction: The front view was correctly

Fig 5 (page 7).

Limitations

Relies on COLMAP SfM point cloud initialization from RGB images for geometric scaffolding; thermal-only SfM not feasible currently.
Random sparse point cloud initialization leads to severe overfitting and poor generalization (PSNR ~9), limiting thermal-only pipeline autonomy.
Limited ability to reconstruct large-scale or occluded regions with sparse or incomplete multi-view coverage.
Depth estimates from thermal images have scale and shift ambiguities that require normalization; they act only as supervisory priors and may be noisy.
No explicit adversarial robustness or distribution shift evaluations under varying weather or thermal sensor noise conditions.
Optimization and efficiency gains demonstrated on mid-scale indoor/outdoor scenes; scalability to highly complex outdoor settings not evaluated.
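
The scale and shift ambiguity noted above is why depth maps are normalized before comparison. A minimal sketch, assuming per-image min-max normalization to [0, 1] (the summary says depth is normalized to [0, 1] but does not give the exact procedure, and `normalize_depth` is a hypothetical helper):

```python
def normalize_depth(depth, eps=1e-8):
    """Rescale a flattened depth map to [0, 1] using its own min and max.

    The result is unchanged if the input is multiplied by a positive
    scale or offset by a constant shift.
    """
    lowest, highest = min(depth), max(depth)
    span = highest - lowest
    if span < eps:
        # Constant depth map: no usable structure, return zeros.
        return [0.0 for _ in depth]
    return [(value - lowest) / span for value in depth]
```

For example, `[2.0, 4.0, 6.0]` and its rescaled and shifted copy `[7.0, 13.0, 19.0]` both normalize to `[0.0, 0.5, 1.0]`, so the depth loss compares relative structure rather than absolute scale.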

Open questions / follow-ons

Can thermal-only SfM or feature extraction methods be developed to remove RGB dependence for sparse point cloud initialization?
How can monocular thermal depth estimation accuracy and robustness be improved, especially under challenging conditions or noisy sensors?
What techniques can mitigate failure in reconstructing large, occluded, or under-sampled areas with sparse thermal multi-view data?
How well do thermal-only Gaussian splatting models generalize to outdoor or dynamic scenes with complex thermal patterns and changing environments?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners interested in multi-modal sensing or anti-spoofing via thermal imagery, this paper demonstrates that single-modality thermal-based 3D scene reconstructions can achieve comparable quality to multi-modal RGB-thermal methods while being more efficient. The insights on progressive optimization using depth supervision and unified Gaussian representations could inform the design of thermal-based challenge-response systems or threat detection pipelines that rely solely on thermal inputs, improving robustness in low-visibility conditions without needing RGB cameras.

However, the approach still relies on auxiliary RGB-based initialization, which may limit full deployment in strictly thermal-only hardware. The depth estimation from thermal is noisy and used only to regularize geometry, so robustness to adversarial thermal patterns or spoofing attempts remains an open question. Practitioners should consider the improved efficiency tradeoffs demonstrated here when incorporating thermal NVS into real-time or embedded anti-bot vision systems.

Cite

@article{arxiv2605_30328,
  title={Supercharging Thermal Gaussian Splatting with Depth Estimation},
  author={Manoj Biswanath and Chenxin Cai and Hannah Schieber and Daniel Roth and Benjamin Busam},
  journal={arXiv preprint arXiv:2605.30328},
  year={2026},
  url={https://arxiv.org/abs/2605.30328}
}
