DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

Source: arXiv:2606.12368 · Published 2026-06-10 · By Pengfei Wang, Shihao Wang, Liyi Chen, Zhiyuan Ma, Guowen Zhang, Lei Zhang

TL;DR

This work addresses the unsolved challenge of achieving generalized metric monocular depth estimation for both narrow field-of-view (FoV) perspective images and 360° panoramic images within a unified framework. Existing methods typically specialize for either perspective or panoramic domains, struggling to handle the stark geometric differences between linear pinhole and spherical equirectangular projections, and are hampered by the scarcity of panoramic training data. DepthMaster introduces a novel approach that decomposes panoramic images into overlapping perspective patches and injects virtual camera pose conditioning as geometric priors, enabling the use of a standard Transformer backbone without specialized architectural modifications. A novel Correspondence Consistency Loss (CCL) enforces geometric coherence across overlapping patches, elegantly addressing boundary artifacts without disrupting the backbone structure.

Trained on 15 datasets including only one panoramic dataset, DepthMaster achieves state-of-the-art zero-shot performance on 13 unseen datasets spanning both perspective and panoramic domains. It surpasses specialist methods such as Depth Anything for perspective images and Depth Any Panoramas (DAP) for panoramas, while also beating universal competitors like UniK3D. The method excels at preserving fine details, global geometric consistency, and sharp boundaries, despite using significantly less panoramic training data. Extensive ablations confirm the critical contribution of CCL and camera pose conditioning for panoramic depth accuracy. This unified canonicalization of diverse camera geometries with geometric prior injection presents a scalable paradigm for robust metric depth estimation across domains.

Key findings

On scale-invariant depth, DepthMaster lowers the average AbsRel across the panoramic datasets (Stanford2D3D, Matterport3D, PanoSUNCG) from 6.62 (DA2 baseline) to 4.77 (Table 1).
For metric depth on panoramic images, DepthMaster outperforms specialist DAP by reducing AbsRel from 11.79 to 7.25.
On Matterport3D panoramic dataset, DepthMaster achieves 5.79 AbsRel vs UniK3D’s 8.14, despite UniK3D being trained on the dataset.
On 10 diverse perspective datasets, DepthMaster attains the best average rank (1.31), ahead of MoGe v2 (2.22), and the top 𝛿1 accuracy of 93.9%.
Removing Correspondence Consistency Loss (CCL) increases panorama AbsRel from 6.31 to 7.41 and lowers 𝛿1 from 93.3% to 92.4%.
Omitting virtual camera pose conditioning raises panoramic AbsRel from 6.31 to 14.36 and hurts 𝛿1 accuracy from 93.3% to 85.8%.
Training on panoramic-only data, without the perspective datasets, degrades perspective AbsRel from 7.15 to 16.34 and panoramic AbsRel from 6.31 to 8.28.
DepthMaster uses only 35K panorama samples vs 1.9M for DAP, demonstrating significant data efficiency.

Threat model

n/a — The paper focuses on improving monocular depth estimation generalization across camera domains rather than adversarial or security scenarios.

Methodology — deep read

Threat Model & Assumptions: The setting is monocular depth estimation from a single RGB image captured by either a standard perspective camera or a 360° panoramic camera. There is no adversary (n/a); the concern is generalization across camera types and domains. Training assumes metric ground-truth depth for supervision, and only one of the training datasets is panoramic.
Data: Training uses 15 datasets: 14 perspective datasets and 1 panoramic dataset (Structured3D). The authors refine the ground truth of real-world data to reduce noise. Evaluation is zero-shot on 13 held-out datasets, 10 perspective and 3 panoramic. Splits and detailed statistics are in the appendices. The authors emphasize that their training data is smaller in scale than that of prior models such as MoGe, Depth Anything, and UniK3D.
Architecture / Algorithm: DepthMaster’s core idea is canonicalizing panoramic images into six overlapping 95° FoV perspective patches with 2.5° overlap per edge, creating a unified perspective representation for both image types. Virtual projection cameras’ spatial layout (intrinsics and extrinsics) is embedded as conditioning tokens injected into the Transformer backbone. The backbone is based on DINOv2 ViT-L/14 with 24 blocks; layers 1-8 use standard self-attention, and layers 9-24 alternate between frame-wise and global attention to capture both local patch structure and inter-patch spatial dependencies. This avoids any architectural modification for panorama input.
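
The canonicalization step can be illustrated with a minimal geometric sketch. It assumes a cube-style layout of six virtual pinhole cameras, square patches, a y-down camera frame, and a standard equirectangular mapping; the names `FACES`, `patch_pixel_to_lonlat`, and `lonlat_to_equirect` are hypothetical and are not taken from the paper's code. Only the 95° field of view (90° plus 2.5° per edge) comes from the text above.

```python
import math

FOV_DEG = 95.0  # 90° cube face plus 2.5° of overlap on each edge (from the text)


def rot_y(deg):
    """Yaw: rotation about the y axis (y points down in this sketch)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_x(deg):
    """Pitch: rotation about the x axis; +90 turns the optical axis upward."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


# Assumed layout of the six virtual cameras (camera-to-world rotations).
FACES = {
    "front": rot_y(0), "right": rot_y(90), "back": rot_y(180),
    "left": rot_y(-90), "up": rot_x(90), "down": rot_x(-90),
}


def focal_length(size, fov_deg=FOV_DEG):
    """Pinhole focal length in pixels for a square patch of `size` pixels."""
    return (size / 2.0) / math.tan(math.radians(fov_deg) / 2.0)


def patch_pixel_to_lonlat(R, u, v, size, fov_deg=FOV_DEG):
    """Longitude and latitude (radians) of the ray through patch pixel (u, v)."""
    f = focal_length(size, fov_deg)
    ray = [(u - size / 2.0) / f, (v - size / 2.0) / f, 1.0]
    w = [sum(R[i][j] * ray[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in w))
    lon = math.atan2(w[0], w[2])
    lat = math.asin(max(-1.0, min(1.0, w[1] / norm)))
    return lon, lat


def lonlat_to_equirect(lon, lat, width, height):
    """Continuous equirectangular pixel coordinates for a direction."""
    return (lon / (2.0 * math.pi) + 0.5) * width, (lat / math.pi + 0.5) * height


# Where does the left edge of the front patch land in a 2048x1024 panorama?
lon, lat = patch_pixel_to_lonlat(FACES["front"], 0.0, 256.0, 512)
print(math.degrees(lon), lonlat_to_equirect(lon, lat, 2048, 1024))
```

Sampling the panorama at these coordinates for every patch pixel (for instance with bilinear interpolation) would produce one perspective patch; because each patch reaches 47.5° from its optical axis, neighboring patches share a 5° band.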

Two dense prediction heads follow the backbone: a) Point map head: predicts 3D point cloud coordinates and a validity mask. b) Surface normal head: predicts 3-channel normalized surface normals. A separate scale prediction head predicts global depth scale factor using a pooled CLS token from all patches to ensure scale coherence across patches.

The key novel loss is the Correspondence Consistency Loss (CCL) that enforces feature-level consistency across overlapping regions of patches. Patch correspondences are computed using relative rotations between virtual projection cameras and the same intrinsic matrix. CCL aggregates L2 distances between corresponding high-dimensional backbone features at multiple scales to align overlapping areas without architectural changes.
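
A rough sketch of the two ingredients described above: finding corresponding pixels from the relative rotation of two virtual cameras that share intrinsics, and averaging L2 distances between features at those locations. Everything here is an illustrative assumption: the function names are hypothetical, features are plain Python lists keyed by location, and the paper's actual sampling, normalization, and per-scale weighting are not specified in this summary.

```python
import math


def rot_y(deg):
    """Camera-to-world yaw rotation about the y axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(M):
    return [list(row) for row in zip(*M)]


def warp_pixel(u, v, R_a, R_b, f, c):
    """Map pixel (u, v) of patch A into patch B.

    Both cameras share focal length f and principal point (c, c), so the
    mapping is the pure-rotation homography K * R_b^T * R_a * K^-1.
    Returns None when the ray points behind camera B.
    """
    ray_a = [(u - c) / f, (v - c) / f, 1.0]      # K^-1 applied to the pixel
    ray_b = mat_vec(transpose(R_b), mat_vec(R_a, ray_a))
    if ray_b[2] <= 0.0:
        return None
    return f * ray_b[0] / ray_b[2] + c, f * ray_b[1] / ray_b[2] + c


def feature_consistency(feats_a, feats_b, pairs):
    """Mean L2 distance between features at corresponding locations.

    feats_a, feats_b: dict mapping a location key to a feature vector (list).
    pairs: list of (key_in_a, key_in_b) correspondences in the overlap.
    """
    if not pairs:
        return 0.0
    return sum(math.dist(feats_a[ka], feats_b[kb]) for ka, kb in pairs) / len(pairs)


def multi_scale_consistency(levels):
    """Sum the per-level losses; `levels` is a list of (feats_a, feats_b, pairs)."""
    return sum(feature_consistency(a, b, p) for a, b, p in levels)


# A point 45° to the right of the front camera is 45° to the left of the right camera.
f, c = 256.0 / math.tan(math.radians(47.5)), 256.0
print(warp_pixel(c + f, c, rot_y(0), rot_y(90), f, c))
```

Because the virtual cameras differ only by rotation, the correspondences depend on geometry alone and could be precomputed once per patch layout rather than estimated from image content.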

Training Regime: Epochs, batch size, and optimizer settings are deferred to the appendix and are not fully specified in the main text. Training is joint over mixed perspective and panoramic patches, with camera conditioning injected for panoramic patches and omitted for perspective patches. Supervision comes from the large pool of perspective datasets plus the single panoramic dataset. CCL is applied only to the overlapping regions of panoramic patches.
Evaluation Protocol: Zero-shot metric and relative depth evaluation is conducted on 13 unseen datasets with no fine-tuning. Panoramic depth is reprojected from the cubemap patches to equirectangular format for fair comparison. Metrics include AbsRel error, RMSE, and accuracy thresholds (𝛿1, 𝛿2). Baselines include specialist models like DAP and DA360 and universal models UniK3D and DAC. Ablations vary key components such as CCL, virtual camera conditioning, canonicalization, and training data domain.
Reproducibility: The authors release source code, model weights, and interactive demos on their project page. Structured3D, the panoramic training dataset, is public, and the other datasets are standard and generally openly accessible. Full training details, including hyperparameters, are deferred to the appendices.
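
For reference, the metrics named in the evaluation protocol have standard definitions, sketched below on flat lists of depth values. The function name `depth_metrics` and the validity rule (ground truth greater than zero) are assumptions for illustration; the paper's exact masking and any scale or affine alignment applied before scoring are not given in this summary.

```python
import math


def depth_metrics(pred, gt):
    """AbsRel, RMSE, and delta-1 accuracy over pixels with valid ground truth.

    pred, gt: equal-length sequences of depths; gt <= 0 marks invalid pixels.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # delta-1: share of pixels whose prediction is within a factor of 1.25.
    delta1 = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


print(depth_metrics(pred=[1.0, 2.2, 3.0, 9.9], gt=[1.0, 2.0, 5.0, 0.0]))
```

The last sample is ignored because its ground truth is marked invalid.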

Concrete example: A panorama is projected into six overlapping perspective patches, each extended by 2.5° of FoV on every side so that neighbors overlap and boundaries can be made consistent. Each patch is fed to the Transformer backbone as a separate input, together with 7D extrinsic and 4D intrinsic pose tokens that encode the virtual camera's orientation and parameters. The model predicts a dense 3D point cloud, a normal map, and a validity mask per patch. During training, CCL applied at layers 11, 15, 19, and 23 enforces feature correspondence between the overlapping regions of patches. At inference, the predicted patches are merged by reprojection and weighted averaging to form a globally consistent depth map. The scale head outputs a single global scale factor that is applied uniformly to all patches, which avoids scale discontinuities at patch boundaries.
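
The merge step can be pictured with the following sketch, which blends the predictions of all patches that cover one panorama direction and then applies a single global scale. The weighting scheme is an assumption: `border_weight` is a hypothetical linear falloff toward the patch border, and the paper's actual weights are not described in this summary. The sketch uses the Euclidean norm of each predicted 3D point as the radial distance, which is what a panoramic depth map stores.

```python
import math


def border_weight(u, v, size):
    """Hypothetical blend weight: 1 at the patch center, 0 at the border."""
    d = min(u, v, size - u, size - v)
    return max(d, 0.0) / (size / 2.0)


def blend_range(candidates, global_scale=1.0):
    """Weighted average of radial distances for one panorama direction.

    candidates: list of ((x, y, z), weight), one entry per patch that sees
    this direction, with the point expressed in that patch's camera frame.
    Returns None when no patch contributes a positive weight.
    """
    total_w = sum(w for _, w in candidates)
    if total_w <= 0.0:
        return None
    blended = sum(w * math.hypot(*p) for p, w in candidates) / total_w
    # One scale factor for every patch, so seams cannot differ in scale.
    return global_scale * blended


# Two patches see the same direction in their overlap band.
seen_by = [((0.0, 0.0, 2.0), border_weight(40.0, 256.0, 512)),
           ((0.0, 0.0, 2.2), border_weight(10.0, 256.0, 512))]
print(blend_range(seen_by, global_scale=1.5))
```

Down-weighting pixels near a patch border is one common way to hide seams; any smooth weighting that sums over the contributing patches would fit the same structure.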

Technical innovations

Reformulating panoramic depth estimation as inference on overlapping perspective patches with canonicalized geometry, enabling unified processing of perspective and panoramic images.
Introduction of Correspondence Consistency Loss (CCL), a novel feature-level dense correspondence loss across overlapping patches to enforce seamless geometric fusion without architectural modifications.
Injection of virtual projection cameras’ intrinsic and extrinsic pose embeddings as conditioning tokens into a standard Transformer backbone to provide geometric priors and resolve panorama-perspective discrepancies.
Use of alternating attention layers in the Transformer to capture both intra-patch local and inter-patch global spatial coherence.

Datasets

Structured3D — 35K panoramic training samples — public synthetic panorama dataset
Plus 14 perspective datasets totaling hundreds of thousands of images (exact numbers in the appendix), including NYUv2, KITTI, ETH3D, iBims-1, and others

Baselines vs proposed

DA2 (panoramic, scale-invariant AbsRel) = 6.62 avg vs DepthMaster = 4.77
DA360 (panoramic, affine-invariant AbsRel) = 6.39 avg vs DepthMaster = 4.58
DAP (panoramic, metric AbsRel) = 11.79 avg vs DepthMaster = 7.25
UniK3D (panoramic Matterport3D AbsRel) = 8.14 vs DepthMaster = 5.79
MoGe v2 (perspective average rank) = 2.22 vs DepthMaster = 1.31
Naive baseline (no canonicalization) panorama AbsRel = 9.45 vs DepthMaster = 6.31
w/o CCL panorama AbsRel = 7.41 vs DepthMaster = 6.31
w/o camera pose panorama AbsRel = 14.36 vs DepthMaster = 6.31
Panoramic-only training panorama AbsRel = 8.28 vs mixed training DepthMaster = 6.31
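
To put the AbsRel pairs above on a common footing, the short script below converts each one into a relative error reduction. The numbers are copied from the list; the helper name `relative_reduction` is hypothetical, and the percentages are simple arithmetic on this summary's figures rather than values reported by the paper.

```python
def relative_reduction(baseline, proposed):
    """Fractional error reduction of `proposed` relative to `baseline`."""
    return (baseline - proposed) / baseline


# (baseline AbsRel, DepthMaster AbsRel) pairs copied from the list above.
comparisons = {
    "DA2 (scale-invariant)": (6.62, 4.77),
    "DA360 (affine-invariant)": (6.39, 4.58),
    "DAP (metric)": (11.79, 7.25),
    "UniK3D (Matterport3D)": (8.14, 5.79),
    "no canonicalization": (9.45, 6.31),
    "w/o CCL": (7.41, 6.31),
    "w/o camera pose": (14.36, 6.31),
    "panoramic-only training": (8.28, 6.31),
}

for name, (base, ours) in comparisons.items():
    print(f"{name}: {100.0 * relative_reduction(base, ours):.1f}% lower AbsRel")
```

Read this way, removing camera pose conditioning costs far more than removing CCL, which matches the ordering of the ablation rows.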

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.12368.

Fig 1: DepthMaster enables high-fidelity zero-shot metric depth estimation. We visualize the 3D point clouds

Fig 2: Comparison of different projection strategies. (a) Equirectangular. (b) Tangent projection with 18 overlapping

Fig 3: The pipeline of DepthMaster. Perspective and panoramic images are unified as perspective patches and

Fig 4 to Fig 8 (page 2).

Limitations

Generalization remains limited for highly out-of-distribution and abstract artistic images (e.g., ink wash paintings).
Using only one panoramic dataset for training limits diversity; panoramic data scarcity remains a fundamental bottleneck.
Inference latency is higher than that of some competing methods (531 ms vs 427 ms on an A100 GPU at 1024×2048), which may limit real-time applications.
The method presumes known or approximated virtual camera parameters for patch canonicalization; errors in pose estimation could degrade performance.
No explicit adversarial robustness or extreme domain shift evaluation beyond zero-shot held-out datasets is reported.

Open questions / follow-ons

How would DepthMaster perform with more diverse or larger-scale panoramic training data, possibly augmented by synthetic or generated examples?
Can the CCL loss concept be extended to other geometric tasks or multi-view fusion beyond monocular depth, such as semantic segmentation or surface reconstruction?
What are the limits of geometric conditioning via virtual camera pose; could learning explicit spatial priors or deformable patch stitching improve results further?
How might this approach integrate with self-supervised or unsupervised depth learning methods to reduce dependency on metric ground truth?

Why it matters for bot defense

For bot-defense or CAPTCHA practitioners, DepthMaster offers an innovative strategy to unify complex visual input types—standard perspective images and omnidirectional panoramas—within a single model framework. This capability can be valuable when designing robust image-based human verification systems working across diverse camera geometries, such as surveillance feeds or 360° environment captures. The geometric canonicalization and soft consistency loss methodologies could inspire new approaches to fuse distorted camera inputs without costly architectural changes, enhancing generalization and scalability. Moreover, the demonstrated data efficiency in leveraging large-scale perspective data to benefit panoramic domains highlights potential cost savings and model simplification for defenses relying on geometric consistency or depth cues. Overall, DepthMaster’s design emphasizes maintaining backbone compatibility and training stability, principles worth considering when building complex image analysis pipelines in adversarial or heterogeneous deployment settings.

Cite

bibtex

@article{arxiv2606_12368,
  title={DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images},
  author={Pengfei Wang and Shihao Wang and Liyi Chen and Zhiyuan Ma and Guowen Zhang and Lei Zhang},
  journal={arXiv preprint arXiv:2606.12368},
  year={2026},
  url={https://arxiv.org/abs/2606.12368}
}
