Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning

Source: arXiv:2606.19122 · Published 2026-06-17 · By Yukai Ma, Joe Lin, Liu Liu, Honglin He, Lulu Ricketts, Brad Squicciarini et al.

TL;DR

This paper addresses the challenge of monocular 3D occupancy perception for mobile robots navigating cluttered, less structured sidewalk environments. Existing 3D occupancy methods largely target on-road autonomous driving scenarios using multi-camera and extensive LiDAR supervision, which is costly and insufficient for sidewalks. The authors propose WalkOCC, a hybrid framework combining ray-marching 3D occupancy learning with monocular RGB images. The approach bootstraps pseudo 3D occupancy supervision from limited paired LiDAR-RGB sidewalk sequences and leverages large-scale unpaired monocular images through a novel 2D-3D consistency loss that enforces ray-based semantic alignment. This hybrid training yields more stable optimization and better generalization without requiring dense 3D labels. They also contribute Sidewalk3D, a new large-scale RGB-LiDAR dataset tailored for sidewalks that includes semantic occupancy annotations for evaluation.

Extensive experiments show that WalkOCC outperforms state-of-the-art monocular occupancy baselines on the Sidewalk3D benchmark in both mIoU (16.46% vs 14.23%) and occupancy IoU (30.02% vs 27.17%). The gains are largest on dynamic, safety-critical classes such as pedestrians and vehicles. WalkOCC is also more robust to out-of-domain environmental shifts (day to night, diverse locations) and cross-embodiment shifts (different robot platforms and camera intrinsics): on challenging night scenes it reaches 8.61% mIoU against 5.55% for FlashOcc, a 55% relative gain. Qualitative results show finer recovery of subtle sidewalk structures such as curbs and gutters. The key innovation is the hybrid ray-marching training strategy that couples 2D and 3D supervision, which enables scalable, cost-efficient monocular 3D occupancy perception for sidewalk robots.

Key findings

WalkOCC achieves 16.46% mIoU and 30.02% occupancy IoU on the Sidewalk3D test set, improving on the strongest baseline, FlashOcc, at 14.23% mIoU and 27.17% occupancy IoU (Table 1).
WalkOCC improves accuracy on critical dynamic categories: pedestrian IoU rises from 8.82% to 14.59%, vehicle IoU from 14.12% to 18.74%, and cyclist IoU from 3.27% to 4.15%.
In cross-domain experiments, WalkOCC with hybrid 2D+3D training reaches 8.61% OOD mIoU on night scenes (Set 1) versus 5.55% for the FlashOcc baseline, a 55% relative gain (Table 3).
WalkOCC achieves 73.8% pixel accuracy and 25.5% 2D mIoU on cross-embodiment Set 3, outperforming WalkOCC without 2D data (71.9% and 24.7%) (Table 2).
Pseudo-3D occupancy labels generated by WalkOCC's pipeline closely match human-verified subsets, validating their use for training (Table 1 first row).
Depth-aware lifting combined with ray-marching consistency loss enables effective fusion of sparse 3D supervision and abundant 2D semantic data for monocular occupancy prediction.
WalkOCC can recover fine-grained sidewalk features like curbs, gutters, and obstacles more accurately than baselines, reducing noise in predictions (Fig. 4).
Hybrid training with 2D images enhances robustness to environmental conditions and robot embodiment shifts, preserving clearer scene understanding in OOD settings (Fig. 5).
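The absolute and relative gains quoted in the findings above can be re-derived from the reported percentages. The snippet below is only an arithmetic check on the numbers as they appear in this summary; the helper name `relative_gain` and the dictionary layout are illustrative and not taken from the paper.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# (WalkOCC, FlashOcc) figures as reported in this summary, in percent.
reported = {
    "in-domain mIoU": (16.46, 14.23),
    "in-domain occupancy IoU": (30.02, 27.17),
    "night-scene (Set 1) mIoU": (8.61, 5.55),
}

for name, (walkocc, flashocc) in reported.items():
    absolute = walkocc - flashocc
    relative = relative_gain(walkocc, flashocc)
    print(f"{name}: +{absolute:.2f} points, {relative:.1f}% relative")
```

The night-scene row is the one behind the 55% figure: the absolute gap is about three mIoU points, which is large only relative to the low baseline score.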

Threat model

n/a — The paper does not define an explicit adversarial threat model since it focuses on 3D occupancy perception for navigation, not adversarial robustness or security attacks.

Methodology — deep read

The paper proposes WalkOCC, a monocular 3D semantic occupancy prediction framework tailored for urban sidewalk robots. Its main components are as follows:

Threat Model & Assumptions: The adversary is not explicitly modeled as this is a perception method for robotic navigation. The method assumes access to limited paired RGB-LiDAR data from sidewalk scenes (used to generate pseudo 3D occupancy supervision) and abundant unpaired monocular RGB images. It presumes accurate camera calibration and fixed occupancy volume parameters.
Dataset: They collect Sidewalk3D, a large-scale RGB-LiDAR paired dataset across multiple urban sidewalk locations and time periods, annotated with semantic 3D occupancy labels on 14 classes including road, sidewalk, pedestrians, obstacles, curbs, and gutters. The dataset covers different robot embodiments (humanoid, quadruped, wheeled) and varied illumination conditions. The training split uses ~2.4K touristy daytime images paired with LiDAR, augmented by 2.2K unpaired 2D-only images from the MIMIC dataset.
Architecture: WalkOCC follows an Encoder → Lift → BEV → Decoder paradigm. A ResNet-50 backbone with FPN extracts 2D image features. A depth head predicts per-pixel depth distributions supervised by LiDAR depth. Using a depth-aware lifting module, 2D features are lifted along predicted depth bins into a 3D frustum-aligned voxel feature volume. These features are transformed to bird’s-eye-view (BEV) and refined via a lightweight BEV encoder. Finally, a 3D occupancy decoder outputs voxel-wise semantic occupancy predictions over the fixed spatial grid.
Losses and Training: The model is trained with a hybrid supervision combining:

2D semantic segmentation loss on image-plane semantic masks,
3D voxel-wise focal cross-entropy occupancy loss against pseudo-3D ground truth,
Depth regression loss guided by LiDAR depth,
A novel 2D-3D consistency loss enforcing that semantic labels aggregated by ray marching through the 3D occupancy volume match the 2D semantic predictions. The aggregation samples 3D occupancy logits along camera rays and combines them with depth-based weights.

These complementary losses regularize learning and enable leveraging large-scale unpaired 2D data alongside limited 3D labels.
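The ray-marching consistency term in the list above can be sketched for a single camera ray. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are taken to be a per-sample depth weighting that is normalised along the ray, the aggregated logits are compared to the 2D class distribution with a cross-entropy, and the function names `aggregate_ray` and `consistency_loss` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_ray(sample_logits, depth_weights):
    """Depth-weighted average of per-sample class logits along one ray.

    sample_logits: S x K list (S samples on the ray, K classes).
    depth_weights: S non-negative weights (assumed form of the weighting).
    """
    total = sum(depth_weights)
    num_classes = len(sample_logits[0])
    return [
        sum(w * logits[k] for w, logits in zip(depth_weights, sample_logits)) / total
        for k in range(num_classes)
    ]


def consistency_loss(sample_logits, depth_weights, target_probs, eps=1e-8):
    """Cross-entropy between the 2D class distribution and the ray-aggregated one."""
    ray_probs = softmax(aggregate_ray(sample_logits, depth_weights))
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, ray_probs))


# Three samples on one ray, two classes; most weight sits on the middle sample.
logits = [[0.0, 0.0], [4.0, -4.0], [0.0, 0.0]]
weights = [0.1, 0.8, 0.1]
agree = consistency_loss(logits, weights, target_probs=[1.0, 0.0])
disagree = consistency_loss(logits, weights, target_probs=[0.0, 1.0])
print(agree < disagree)
```

Because the loss only needs a 2D class distribution per pixel, it can be evaluated on images that have no LiDAR pairing, which is how the unpaired 2D data enters training in the description above.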

Training regime: The models train for 24 epochs on 8 NVIDIA A5000 GPUs, with batch sizes adapted to that setup. Inputs are monocular RGB fisheye images resized to 544×960. Depth is discretized into ~235 bins over [0.2 m, 12 m] at 0.05 m resolution. The final occupancy volume covers a 10×10×5 m cuboid around the robot with a voxel size of 0.1 m.
Evaluation: Primary metrics are 3D semantic mean Intersection-over-Union (mIoU) calculated on 14 classes, and binary occupancy IoU (occupied vs free). Cross-domain OOD generalization is tested across three subsets—nighttime environmental shift, diverse locations, and cross-robot embodiment with distinct camera intrinsics. Cross-embodiment evaluation uses 2D projection metrics (pixel accuracy, 2D mIoU) due to lack of ground-truth 3D labels.
Reproducibility: The authors commit to releasing the code, trained weights, and the Sidewalk3D dataset to facilitate future research. Details of pseudo-label generation, dataset preprocessing, and training tricks are included in appendices.
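The grid and depth-bin sizes implied by the training regime above follow from simple division. The sketch below works them out from the values quoted in this summary; the constant names and the clamping helper `depth_to_bin` are illustrative, and the endpoint convention is an assumption. A plain division of the depth range gives 236 bins, close to the ~235 quoted, so the exact count presumably depends on how the endpoints are handled.

```python
# Values quoted in this summary; the endpoint conventions are an assumption.
VOLUME_M = (10.0, 10.0, 5.0)   # occupancy cuboid extent in metres
VOXEL_M = 0.1                  # voxel edge length in metres
DEPTH_RANGE_M = (0.2, 12.0)    # depth interval in metres
DEPTH_STEP_M = 0.05            # depth bin width in metres

# Number of voxels along each axis and in total.
grid_shape = tuple(round(extent / VOXEL_M) for extent in VOLUME_M)
num_voxels = grid_shape[0] * grid_shape[1] * grid_shape[2]

# Number of depth bins from a plain division of the range by the step.
num_depth_bins = round((DEPTH_RANGE_M[1] - DEPTH_RANGE_M[0]) / DEPTH_STEP_M)


def depth_to_bin(depth_m: float) -> int:
    """Map a metric depth to a bin index, clamping to the valid range."""
    idx = int((depth_m - DEPTH_RANGE_M[0]) / DEPTH_STEP_M)
    return min(max(idx, 0), num_depth_bins - 1)


print(grid_shape, num_voxels, num_depth_bins)
```

At this resolution the decoder predicts a class for half a million voxels per frame, which gives a sense of why dense 3D labels are expensive and why pseudo-labels are used.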

End-to-end example: Given a monocular RGB image from a sidewalk robot camera, WalkOCC extracts 2D features, predicts depth distributions, and lifts features to 3D using depth-weighted lifting. The BEV encoder refines spatial features, and the 3D decoder outputs voxel-wise semantic occupancy prediction. Via ray marching, the predicted 3D occupancy is projected back to 2D, and a semantic consistency loss aligns this with 2D segmentation predictions. The model is trained to jointly minimize 2D segmentation, 3D occupancy, depth, and consistency losses, supervised by pseudo-labels derived from LiDAR-RGB data plus additional 2D-only data. This results in occupancy predictions robust to different lighting, locations, and robot types, accurately localizing dynamic agents and fine sidewalk structures.
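The lifting step in the walkthrough above can be illustrated with the outer-product formulation common in BEV pipelines: each pixel's feature vector is spread along its camera ray in proportion to the predicted depth distribution. Whether WalkOCC's depth-aware lifting module matches this exactly is an assumption; the function names are hypothetical, and pure Python lists stand in for the tensors a real implementation would use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(feature, depth_logits):
    """Spread one pixel's C-dim feature over D depth bins; returns a D x C list."""
    probs = softmax(depth_logits)
    return [[p * f for f in feature] for p in probs]


def lift_image(features, depth_logits):
    """features: H x W x C, depth_logits: H x W x D -> frustum volume H x W x D x C."""
    return [
        [lift_pixel(f, d) for f, d in zip(feat_row, depth_row)]
        for feat_row, depth_row in zip(features, depth_logits)
    ]


# Toy input: a 1 x 2 image with 2 feature channels and 3 depth bins.
features = [[[1.0, 2.0], [0.5, -1.0]]]
depth_logits = [[[0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]]
frustum = lift_image(features, depth_logits)
print(len(frustum[0][0]), len(frustum[0][0][0]))
```

Summing a pixel's lifted features over the depth axis recovers its original feature, so a confident depth prediction places the feature in one voxel while an uncertain one smears it along the ray. The frustum volume would then be resampled into the fixed voxel grid using the camera calibration, which this sketch omits.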

Technical innovations

A hybrid monocular 3D occupancy prediction model combining pseudo 3D supervision from paired RGB-LiDAR data with large-scale unpaired monocular images using a 2D-3D ray-marching consistency loss.
Depth-aware lifting of front-facing monocular features into a 3D voxel grid using predicted depth distributions to enable 3D occupancy estimation from a single camera.
Ray-marching-based semantic aggregation that aligns 3D occupancy logits with 2D semantic segmentation predictions, regularizing training with cross-modality consistency.
Introduction of Sidewalk3D, a large-scale cross-domain RGB-LiDAR paired dataset focused on challenging sidewalk environments for mobile robots.

Datasets

Sidewalk3D — ~2.4K paired RGB-LiDAR sidewalk sequences plus 2.2K unpaired monocular sidewalk images — collected across diverse urban locations, day/night, and multiple robot platforms, includes semantic 3D occupancy annotations for evaluation
MIMIC — 2D-only monocular image dataset used as additional unpaired 2D semantic data for hybrid training

Baselines vs proposed

FlashOcc: mIoU = 14.23%, occupancy IoU = 27.17% vs WalkOCC: mIoU = 16.46%, occupancy IoU = 30.02%
WalkOCC w/o 2D: OOD Set 1 mIoU = 5.70% vs WalkOCC (with 2D data): 8.61% (55% relative improvement over FlashOcc’s 5.55%)
WalkOCC w/o 2D: cross-embodiment pixel accuracy = 71.90%, 2D mIoU = 24.73% vs WalkOCC: 73.8%, 25.5%
MonoScene: mIoU = 8.42% vs WalkOCC: 16.46%
RenderOcc: mIoU = 13.03% vs WalkOCC: 16.46%
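For readers comparing the numbers above, the two 3D metrics can be computed from flattened voxel labels as follows. This is a generic sketch, not the benchmark's evaluation code: treating label 0 as free space, skipping classes absent from both prediction and ground truth, and the function names are all assumptions.

```python
def class_iou(pred, gt, cls):
    """IoU of one class over flattened voxel labels; None if the class never appears."""
    inter = sum(1 for p, g in zip(pred, gt) if p == cls and g == cls)
    union = sum(1 for p, g in zip(pred, gt) if p == cls or g == cls)
    return inter / union if union else None


def mean_iou(pred, gt, classes):
    """Mean of per-class IoUs, ignoring classes absent from both inputs."""
    ious = [class_iou(pred, gt, c) for c in classes]
    ious = [iou for iou in ious if iou is not None]
    return sum(ious) / len(ious) if ious else None


def occupancy_iou(pred, gt, free_label=0):
    """Binary IoU of occupied vs free voxels, regardless of semantic class."""
    inter = sum(1 for p, g in zip(pred, gt) if p != free_label and g != free_label)
    union = sum(1 for p, g in zip(pred, gt) if p != free_label or g != free_label)
    return inter / union if union else None


# Toy example with six voxels: 0 = free, 1 and 2 = semantic classes.
pred = [0, 1, 1, 2, 0, 2]
gt = [0, 1, 2, 2, 1, 2]
print(mean_iou(pred, gt, classes=[1, 2]), occupancy_iou(pred, gt))
```

Occupancy IoU ignores which class a voxel receives, so it is typically higher than mIoU, consistent with the gap between the two columns reported above.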

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.19122.

Fig 1: 3D occupancy prediction of challenging real-world sidewalk scenes for various mo-

Fig 2: Learning Framework of WalkOCC. WalkOCC is a lightweight BEV-based occupancy

Figs 3 to 8 (page 2).

Limitations

WalkOCC assumes accurate camera intrinsic/extrinsic calibration; errors here degrade performance.
Pseudo-label quality depends on the limited paired RGB-LiDAR data and can contain noise affecting model accuracy.
Fixed occupancy volume (10×10×5 m) may not scale well to very large or complex sidewalk scenes.
Performance on thin or sparse classes like trees remains suboptimal compared to some baselines.
Cross-embodiment evaluation lacks full 3D ground truth and relies on 2D projections, which may limit measurement fidelity.
Robustness tested on a limited set of environmental and embodiment shifts; extreme conditions like heavy rain or snow remain untested.

Open questions / follow-ons

Can WalkOCC’s framework be extended to multi-view or video inputs to improve temporal consistency in occupancy predictions?
How would explicit uncertainty modeling in depth estimation and occupancy outputs affect robustness in highly dynamic or unstructured sidewalk environments?
Can the hybrid training paradigm be adapted for fully unsupervised sidewalk occupancy learning without any paired RGB-LiDAR data?
How can the approach be generalized to other robot sensor modalities or integrated with active perception strategies to optimize data collection?

Why it matters for bot defense

This work provides valuable insights for bot-defense engineers interested in robust 3D spatial awareness from monocular cameras under diverse environmental conditions. Its hybrid training strategy demonstrates how to leverage limited paired 3D data with abundant 2D images to build generalized spatial perception models resistant to domain shifts, which is analogous to improving bot-detection or CAPTCHA systems that must adapt across different platforms and lighting. The ray-marching based 2D-3D consistency concept could inspire multi-modal alignment techniques in bot-detection contexts where visual and auxiliary sensor data are fused. The introduced Sidewalk3D dataset offers a rare, well-annotated domain-specialized benchmark valuable for testing real-world performance. Overall, the paper advances monocular 3D scene understanding with practical considerations for real robot deployments, a principle transferable to designing perceptually robust, data-efficient defenses in security-sensitive applications.

Cite

bibtex

@article{arxiv2606_19122,
  title={Monocular 3D Occupancy Perception for Robots on Sidewalks via Hybrid 2D-3D Learning},
  author={Yukai Ma and Joe Lin and Liu Liu and Honglin He and Lulu Ricketts and Brad Squicciarini and Yong Liu and Bolei Zhou},
  journal={arXiv preprint arXiv:2606.19122},
  year={2026},
  url={https://arxiv.org/abs/2606.19122}
}
