NEST3D: A High-Resolution Multimodal Dataset of Sociable Weaver Tree Nests
Source: arXiv:2606.14562 · Published 2026-06-12 · By Constanza A. Molina Catricheo, Simon Boeder, Ting-Jia Guo, Giacomo May, Clément Berthelot, Devis Tuia et al.
TL;DR
This paper addresses the shortage of high-resolution, multimodal datasets capturing the complex 3D structure of sociable weaver nests embedded in savanna trees. Prior ecological studies relied on coarse or manual measurements lacking fine-grained spatial detail, limiting automated analysis of nest geometry and ecology. To fill this gap, the authors introduce NEST3D, a 1.4 TB publicly available dataset comprising 104 nest-bearing trees with 27,945 RGB images, 111,780 multispectral images across four bands, approximately 781 million 3D points reconstructed from drone imagery, and expert-labeled semantic segmentation masks for three classes: nest, tree, and grass. The dataset is unique in providing a richly annotated, multimodal 3D ecological benchmark for automated structural and spectral analysis of these critical communal bird nests.
They benchmark leading 3D point cloud segmentation architectures, namely KPConv (convolutional), RandLA-Net (point-sampling), and the transformer-based Point Transformer V3 (PT-v3), to establish baseline segmentation performance. PT-v3 achieves the best mean Intersection over Union (mIoU) of 86.35% on the test set, including a 69.99% IoU on the challenging minority nest class, demonstrating the effectiveness of global-local self-attention for capturing nest structure despite severe class imbalance. RandLA-Net performs moderately well on majority classes but poorly on nests (nest IoU 17.98%, mIoU 50.72%), while KPConv collapses to predicting only the majority class. This evaluation highlights architecture-dependent challenges for ecological 3D segmentation under extreme class imbalance and points towards transformer models as a promising path forward.
Overall, NEST3D fills a critical ecological and computer vision data gap, enabling detailed 3D structural analysis of sociable weaver nests, supporting cross-disciplinary research from nest volume estimation to conservation, and posing a new demanding benchmark for segmentation and reconstruction in challenging natural scenes.
Key findings
- NEST3D dataset contains 104 scenes with approximately 781 million annotated 3D points, 27,945 RGB, and 111,780 multispectral images.
- Nest points comprise less than 5% of total points, creating extreme class imbalance in semantic segmentation.
- Point Transformer V3 achieves 86.35% mean IoU on the test set, with 69.99% IoU on the minority nest class.
- RandLA-Net scores 50.72% mean IoU overall but only 17.98% IoU on the nest class, with substantial confusion between the tree and nest classes.
- KPConv fails to learn minority class structure, collapsing to 0% nest IoU and predicting only the majority (grass) class, which yields 49.19% overall accuracy.
- Drone-based photogrammetry reconstruction combined with rigorous filtering produced metric-scale accurate 3D point clouds capturing complex tree and nest structures.
- Semantic labels were manually annotated by experts across three classes (nest, tree, grass) with emphasis on fine-grained boundary delineation.
- Severe structural and spectral similarity between nests and host trees poses substantial challenges for convolutional architectures.
Threat model
n/a - this work is a dataset and benchmarking study focused on ecological computer vision. It does not model adversarial threats or security attacks.
Methodology — deep read
Threat Model & Assumptions: The dataset targets ecological analysis and computer vision segmentation, not adversarial settings. The challenge is the complex 3D geometry of nests embedded in tree canopies, with severe class imbalance. No explicit adversarial threat model is discussed.
Data: Data were collected in the Kuzikus Wildlife Reserve, Namibia, a semi-arid savanna hotspot where sociable weavers nest in camel thorn and shepherd’s trees. Drone flights used a DJI Mavic 3 with a 20MP RGB camera and a 4-band multispectral camera (Green, Red, Red Edge, NIR). Images were acquired in two sessions per day by circling individual trees at a distance of 5-10 m, to optimize visibility and minimize disturbance. The dataset includes 104 unique nest-bearing trees: 83 for training and 21 for testing. Each scene contains 111-573 RGB images, four multispectral band images per RGB capture (27,945 × 4 = 111,780 in total), and a photogrammetrically reconstructed 3D point cloud at metric scale (scaled using GPS). Each 3D point carries color and spatial coordinates and was manually assigned by experts to one of three classes: nest, tree, or grass.
Architecture / Algorithm: Three architectures are benchmarked: KPConv (kernel point convolution), RandLA-Net (random sampling and local aggregation), and Point Transformer V3 (PT-v3). KPConv applies deformable convolutions defined on kernel points in space. RandLA-Net aggregates features from k-nearest neighbors using random sampling and attention-based pooling. PT-v3 employs global and local self-attention in a transformer backbone with 5 encoder and 4 decoder stages and channel widths from 32 to 512. Inputs are voxelized and normalized into a unit sphere. All models predict per-point semantic labels.
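The input preparation mentioned above (voxelization and normalization into a unit sphere) can be illustrated with a minimal pure-Python sketch. The function names, the centroid-based centering, the voxel-mean reduction, and the order of the two steps are assumptions for illustration; the summary does not specify how the authors implemented them.

```python
import math


def voxel_downsample(points, voxel_size):
    """Replace all points that fall in the same cubic voxel by their mean.

    Assumption: voxel-mean reduction; the paper's exact scheme is not given.
    """
    voxels = {}
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p)
        voxels.setdefault(key, []).append(p)
    return [
        [sum(q[k] for q in pts) / len(pts) for k in range(3)]
        for pts in voxels.values()
    ]


def normalize_to_unit_sphere(points):
    """Center points on their centroid and scale so the farthest one has norm 1."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - centroid[k] for k in range(3)] for p in points]
    radius = max(math.sqrt(sum(c * c for c in q)) for q in centered)
    if radius == 0.0:
        return centered  # degenerate cloud: all points coincide
    return [[c / radius for c in q] for q in centered]


# Hypothetical toy cloud in metres (not data from NEST3D).
cloud = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.5, 0.0, 0.0], [0.0, 2.5, 0.0]]
reduced = voxel_downsample(cloud, voxel_size=1.0)
unit = normalize_to_unit_sphere(reduced)
```

After normalization every point lies within distance 1 of the origin, which keeps coordinate magnitudes comparable across trees of different size.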
Training Regime: All models were trained for 100 epochs with the Adam or AdamW optimizer. PT-v3 used AdamW with a OneCycleLR schedule. For PT-v3, class imbalance was addressed by combining weighted cross-entropy (4x weight for nests) with Lovász loss; RandLA-Net used class-balanced random sampling with weights [1,3,1]. Batch sizes ranged from 4 to 6, with a learning rate of 0.001. Input clouds were cropped to 50k points. KPConv was configured with subsampling and convolution radii chosen for drone point clouds. Validation subsets drawn from the training set were used for hyperparameter selection.
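To make the PT-v3 loss combination concrete, the following is a minimal pure-Python sketch of a weighted cross-entropy term plus a Lovász-Softmax term. It is not the authors' implementation: the class order (nest, tree, grass), the normalization by summed weights, the equal weighting of the two terms, and all function names are assumptions made for illustration.

```python
import math

# Assumed class order (not stated in the summary): 0 = nest, 1 = tree, 2 = grass.
CLASS_WEIGHTS = [4.0, 1.0, 1.0]  # 4x weight on the nest class, as reported


def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS, eps=1e-12):
    """Weighted mean of -log p(true class), normalized by the summed weights."""
    num = sum(-weights[y] * math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den


def lovasz_grad(fg_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss for sorted errors."""
    total = sum(fg_sorted)
    grad, cum_fg, cum_bg, prev = [], 0, 0, 0.0
    for g in fg_sorted:
        cum_fg += g
        cum_bg += 1 - g
        jaccard = 1.0 - (total - cum_fg) / (total + cum_bg)
        grad.append(jaccard - prev)
        prev = jaccard
    return grad


def lovasz_softmax(probs, labels, num_classes=3):
    """Lovasz-Softmax loss averaged over the classes present in `labels`."""
    losses = []
    for c in range(num_classes):
        fg = [1 if y == c else 0 for y in labels]
        if sum(fg) == 0:
            continue  # skip classes absent from this batch
        errors = [abs(f - p[c]) for f, p in zip(fg, probs)]
        order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
        grad = lovasz_grad([fg[i] for i in order])
        losses.append(sum(errors[i] * g for i, g in zip(order, grad)))
    return sum(losses) / len(losses)


def combined_loss(probs, labels):
    # Equal weighting of the two terms is an assumption.
    return weighted_cross_entropy(probs, labels) + lovasz_softmax(probs, labels)
```

The cross-entropy term scales each point's penalty by its class weight, so errors on nest points count four times as much; the Lovász term is a surrogate for the IoU metric itself, which is why the two are often paired on imbalanced segmentation tasks.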
Evaluation Protocol: A fixed 80/20 train/test split at scene level was used. Metrics reported include overall accuracy (OA), mean Intersection over Union (mIoU), and mean class accuracy (mAcc). Per-class IoUs and confusion matrices were used to evaluate class-wise performance, with a focus on the minority nest class. Detailed per-scene and per-class metrics were tabulated. Cross-validation is not explicitly mentioned.
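For reference, the three reported metrics can all be derived from a confusion matrix. The sketch below uses the standard definitions (OA as the trace over the total, IoU as TP / (TP + FP + FN), mAcc as mean per-class recall); the function names and the toy labels are hypothetical and are not taken from the paper's evaluation code.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts points with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm):
    """Return OA, per-class IoU, mIoU, and mAcc from a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = [cm[c][c] for c in range(n)]
    true_count = [sum(cm[c]) for c in range(n)]                       # TP + FN
    pred_count = [sum(cm[r][c] for r in range(n)) for c in range(n)]  # TP + FP
    iou, acc = [], []
    for c in range(n):
        union = true_count[c] + pred_count[c] - tp[c]
        iou.append(tp[c] / union if union else 0.0)
        acc.append(tp[c] / true_count[c] if true_count[c] else 0.0)
    return {
        "OA": sum(tp) / total,
        "IoU": iou,
        "mIoU": sum(iou) / n,
        "mAcc": sum(acc) / n,
    }


# Toy labels with an assumed class order 0 = nest, 1 = tree, 2 = grass.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]
metrics = segmentation_metrics(confusion_matrix(y_true, y_pred, 3))
```

Because mIoU averages over classes rather than points, a model can reach a high OA while scoring poorly on mIoU if it misses a rare class, which is the situation this benchmark is designed to expose.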
Reproducibility: Dataset publicly available on Hugging Face under CC BY 4.0 license. Codebases of benchmarked models are open-source. The exact training scripts/configurations are not reported in full detail but baseline model implementations were referenced. Frozen weights were generated for final models evaluated.
Concrete example end-to-end: The drone circles a single nest-bearing tree, capturing RGB and multispectral images simultaneously. These images are processed with Agisoft Metashape to generate a dense 3D point cloud containing millions of points with XYZ coordinates and RGB color. Experts annotate the points in CloudCompare as nest, tree, or grass. Point clouds are normalized and subsampled to 50k points for input. PT-v3 is trained with weighted cross-entropy and Lovász loss to predict a semantic label for each point. After 100 epochs, PT-v3 achieves ~86% mean IoU on the test set, segmenting the nest structure with high precision. Error analysis shows small confusions with adjacent tree points due to similar spectral properties.
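The reduction to 50k points in the walkthrough above could look like the sketch below. Uniform random subsampling with a fixed seed is an assumption; the summary does not say whether the authors crop spatially or sample randomly, and `subsample_cloud` is a hypothetical helper name.

```python
import random


def subsample_cloud(points, labels, target=50_000, seed=0):
    """Randomly keep at most `target` points, preserving point/label pairing."""
    if len(points) <= target:
        return list(points), list(labels)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    keep = sorted(rng.sample(range(len(points)), target))
    return [points[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy cloud: 100 points whose label is derived from the x value.
toy_points = [[float(i), 0.0, 0.0] for i in range(100)]
toy_labels = [i % 3 for i in range(100)]
sub_points, sub_labels = subsample_cloud(toy_points, toy_labels, target=10)
```

Sampling indices once and applying them to both lists keeps each point aligned with its annotation, which matters when labels are stored separately from coordinates.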
Limitations in architecture-specific learning are observed, e.g., KPConv failed due to convolutional kernel instability on sparse, irregular ecological point clouds.
Technical innovations
- Compilation of a large-scale, high-resolution, multimodal dataset combining RGB, multispectral, and 3D point clouds with expert semantic annotations of sociable weaver nests in natural tree canopies.
- Benchmarking ecological 3D point cloud semantic segmentation focusing on extreme class imbalance with minority nest class representing under 5% of points.
- Application of transformer-based Point Transformer V3 architecture with combined weighted cross-entropy and Lovász loss for improved segmentation of fine-grained ecological structures.
- Empirical demonstration of convolutional approaches (KPConv) failing to generalize under ecological data sparsity and class imbalance, highlighting architecture-dependent performance in natural scene analysis.
Datasets
- NEST3D — 104 scenes, ~781 million 3D points, 27,945 RGB images, 111,780 multispectral images — publicly available at https://doi.org/10.57967/hf/8978
Baselines vs proposed
- PT-v3: mean IoU = 86.35%, Nest IoU = 69.99%, Overall Accuracy = 96.42% vs RandLA-Net: mean IoU = 50.72%, Nest IoU = 17.98%, Overall Accuracy = 73.64%
- KPConv: mean IoU = 16.40%, Nest IoU = 0.00%, Overall Accuracy = 49.19% demonstrating collapse to majority class prediction
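The KPConv figures are mutually consistent with a single-class collapse. As a quick check, assume every point is assigned to one class $c^*$ (grass, per the key findings), so that $FN_{c^*} = 0$ and the other two classes have no true positives. With $n_{c^*}$ points of that class among $N$ points in total, the IoU of $c^*$ equals the overall accuracy, and the mean over three classes follows directly:

```latex
\mathrm{IoU}_{c^*}
  = \frac{TP_{c^*}}{TP_{c^*} + FP_{c^*} + FN_{c^*}}
  = \frac{n_{c^*}}{n_{c^*} + (N - n_{c^*}) + 0}
  = \frac{n_{c^*}}{N}
  = \mathrm{OA} = 49.19\%,
\qquad
\mathrm{mIoU} = \frac{49.19\% + 0\% + 0\%}{3} \approx 16.40\%
```

This matches the reported 16.40% mean IoU, and it implies that roughly half of the test points belong to the predicted class.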
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.14562.

Fig 1: Workflow for creating the NEST3D dataset: (a) UAV-based data collection around a

Fig 2: Example tree-nest structures from the dataset. These four RGB images illustrate the

Fig 3: Confusion matrices for 3D point cloud classification comparing PTv3, RandLA-Net,

Fig 4: Qualitative comparison of 3D semantic segmentation architectures. Column (A) dis-
Limitations
- Severe class imbalance with nest points under 5% challenges segmentation and may bias metrics.
- Multispectral data was acquired but not integrated into the current baselines, leaving potential improvements unexplored.
- GPS-based scaling introduces centimeter-level positional errors affecting absolute spatial accuracy.
- Training and evaluation use fixed splits without cross-validation, limiting generalizability assessment.
- Convolution-based methods struggle with sparse, irregular ecological point clouds, indicating limited architecture robustness under challenging data.
- The dataset is geographically limited to one reserve and two tree species, possibly restricting ecological diversity.
Open questions / follow-ons
- How can multimodal fusion of RGB, multispectral, and 3D point features improve segmentation accuracy on fine-grained ecological classes?
- What architectural modifications can improve robustness and learning stability of convolutional models on sparse, irregular ecological point clouds?
- How does domain adaptation or cross-dataset transfer perform when training on conventional forest/habitat datasets and testing on NEST3D?
- Can uncertainty estimation or calibrated confidence models better handle extreme class imbalance and improve segmentation reliability?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, NEST3D offers insights on the challenges posed by extreme class imbalance and irregular 3D geometry in semantic segmentation tasks. The dataset exemplifies how transformer-based self-attention architectures can better disentangle subtle structural details from visually and spectrally similar backgrounds, compared to convolution-heavy baselines. This understanding could inform design decisions for multi-class classification systems in security contexts where rare classes or fine-grained distinctions matter (e.g., identifying anomalous or adversarial inputs).
Moreover, the paper highlights the limits of kernel-based convolutions on sparse and noisy 3D ecological data, illustrating the importance of testing ML models in data regimes that closely match the target deployment. While not a security paper, the findings encourage benchmarking architectures on datasets with strong class imbalance and subtle boundary conditions, which are often encountered when designing CAPTCHAs or bot-detection models to minimize false negatives and false positives. The future research directions suggested (e.g., multimodal fusion, transformer adaptations) will also be relevant to CAPTCHA defenses that combine multiclass classification with heterogeneous input modalities.
Cite
@article{arxiv2606_14562,
title={NEST3D: A High-Resolution Multimodal Dataset of Sociable Weaver Tree Nests},
author={Constanza A. Molina Catricheo and Simon Boeder and Ting-Jia Guo and Giacomo May and Clément Berthelot and Devis Tuia and Friedrich Fedor Reinhard and Fabio Remondino and Benjamin Risse},
journal={arXiv preprint arXiv:2606.14562},
year={2026},
url={https://arxiv.org/abs/2606.14562}
}