Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration

Source: arXiv:2606.12374 · Published 2026-06-10 · By Sadman Sakib Enan, Junaed Sattar

TL;DR

This paper addresses the critical challenge of enabling autonomous underwater vehicles (AUVs) to effectively collaborate with human divers by recognizing diver activities in complex underwater scenes. Existing underwater human-robot interaction is limited by data scarcity and environmental challenges such as low visibility. The authors present DAR-Net, a novel transformer-based framework combining temporal reasoning with pixel-level scene semantics through a multi-loss training strategy to jointly learn activity classification and spatial segmentation. The approach explicitly guides the network’s attention to relevant scene elements including divers, robots, and objects, mitigating background noise effects especially important for underwater conditions. To support this research, they introduce the first-ever Underwater Diver Activity (UDA) dataset containing over 2,600 semantically annotated images from closed-water trials featuring six distinct diver activity categories with pixel-level masks. Experimental evaluation shows DAR-Net achieves 73.33% accuracy on a held-out test set, outperforming multiple state-of-the-art video action recognition models by 7-20 percentage points, and the use of semantic supervision significantly improves attention focus and classification performance. While promising, limitations remain in data scale and diversity, motivating future work for dataset expansion and robustness in open water scenarios.

Key findings

  • DAR-Net achieves 73.33% classification accuracy on the UDA test set, outperforming LateTemporal (66.67%), RRCommNet (60.00%), 3DResNet (53.33%), SlowFast (56.67%), and R(2+1)D (60.00%).
  • DAR-Net's weighted-average precision is 76.9%, recall 73.3%, and F1-score 72.17%, indicating a reasonable balance between false positives and false negatives.
  • Semantic supervision enables DAR-Net to focus attention on relevant regions (divers, robots, and objects) rather than irrelevant background, as visualized in attention maps (Fig. 5).
  • Without semantic guidance, the model’s attention disperses to irrelevant areas like pool lane markings, reducing accuracy.
  • UDA dataset provides 2,640 pixel-level annotated images covering 6 diver activities, supporting robust spatio-temporal feature learning.
  • DAR-Net’s multi-loss formulation combines transformer classification loss and segmentation loss with trainable weights alpha and beta, dynamically balancing global and local context during training.
  • DAR-Net struggles mainly to distinguish 'Diver Busy' from 'Robot-Diver Interaction': the two share visual primitives and differ only in subtle temporal cues, which lowers per-category precision and recall (Fig. 7).
  • Training convergence is observed after 200 epochs on an Nvidia RTX6000 Ada GPU, indicating stable learning for both the classification and segmentation branches.

Threat model

n/a — paper focuses on perceptual activity recognition for underwater human-robot collaboration without an explicit security or adversarial threat model. The adversarial aspects such as malicious spoofing, tampering, or deception attacks on the perception system are not addressed.

Methodology — deep read

Threat Model & Assumptions: The adversary is not explicitly modeled as this is primarily a perception and activity recognition task for underwater human-robot collaboration. The system assumes closed-water environments with divers and AUVs equipped with RGB video capture hardware. The model does not assume adversarial input tampering or attacker knowledge.

Data: The UDA dataset was collected from GoPro video recordings in controlled closed-water pools, with 1-3 divers, 1-2 AUVs, and task-specific objects in each scene. The dataset contains 2,640 annotated images with pixel-level masks for divers, robots, and objects across 6 diver activity categories. The data was split 80/20 for training and validation. Video clips are approximately 3 seconds long (64 frames) at 1920x1080 resolution, downsampled to 320x256 for processing. Annotation correctness was manually verified.
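The 80/20 split can be illustrated with a short sketch. It assumes the split is made per image with a seeded shuffle; the summary does not say whether the authors split per image or per clip, and `split_indices` and the seed are hypothetical names chosen for illustration.

```python
import random

def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle indices 0..n_items-1 reproducibly and split them into train/val lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    n_train = int(round(n_items * train_fraction))
    return indices[:n_train], indices[n_train:]

# 2,640 annotated images, as reported for UDA (per-image split is an assumption)
train_idx, val_idx = split_indices(2640)
print(len(train_idx), len(val_idx))  # 2112 528
```

If frames from the same clip can land on both sides of a per-image split, validation scores may look optimistic, so a per-clip split would be the more conservative choice.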

Architecture & Algorithm: DAR-Net uses a ResNeXt-101 backbone for spatial feature extraction employing a split-transform-aggregate strategy optimized for underwater texture representation. The extracted spatio-temporal features are enriched with positional encodings. The network splits into two branches: a transformer-based classification branch trained with cross-entropy to predict one of six diver activities based on global temporal features, and an encoder-decoder segmentation branch trained with binary cross-entropy to predict pixel-wise masks of divers, robots, and objects. The segmentation encoder output is fused with the classification transformer to provide semantic guidance.
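To make the positional-encoding step concrete, the sketch below adds one encoding vector per frame to a sequence of per-frame features. The summary does not say which kind of positional encoding DAR-Net uses; the sinusoidal form here is an assumption, and the 8-dimensional toy features and function names are hypothetical.

```python
import math

def sinusoidal_encoding(n_positions, dim):
    """Return an n_positions x dim table of sinusoidal encodings (dim must be even)."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])  # even index: sin, odd index: cos
        table.append(row)
    return table

def add_positions(features):
    """Add the encoding for frame t to the feature vector of frame t."""
    table = sinusoidal_encoding(len(features), len(features[0]))
    return [[f + p for f, p in zip(frame, enc)] for frame, enc in zip(features, table)]

frames = [[0.0] * 8 for _ in range(64)]   # 64 frames, toy 8-dimensional features
encoded = add_positions(frames)
print(len(encoded), len(encoded[0]))  # 64 8
```

Without some such encoding, a transformer treats the frames as an unordered set, which is why the temporal order has to be injected before the classification branch.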

A multi-loss objective combines classification loss (Lclass) and segmentation loss (Lseg) with trainable weights (alpha, beta) enabling the network to balance global temporal activity recognition with local pixel-level semantic supervision. Skip connections are added in the segmentation branch to prevent vanishing gradients.
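A minimal sketch of such a multi-loss objective follows. It assumes the simplest possible form, a weighted sum alpha * Lclass + beta * Lseg; the summary does not give the exact formula, and the toy logits, mask values, and function names are hypothetical.

```python
import math

def cross_entropy(logits, target_index):
    """Multi-class cross-entropy for one sample, computed from raw logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened per-pixel mask probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def combined_loss(l_class, l_seg, alpha, beta):
    """Assumed form: a plain weighted sum of the two branch losses."""
    return alpha * l_class + beta * l_seg

# Toy values: six activity logits and a four-pixel mask
l_class = cross_entropy([2.0, 0.5, 0.1, -1.0, 0.0, 0.3], target_index=0)
l_seg = binary_cross_entropy([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0])
total = combined_loss(l_class, l_seg, alpha=1.0, beta=0.5)
print(round(l_class, 4), round(l_seg, 4), round(total, 4))
```

One design point worth noting: if alpha and beta were trained freely under a plain weighted sum, gradient descent would push both toward zero, since each loss is non-negative. Trainable weights therefore need some constraint or regularizing term, and the summary does not say which one the authors use.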

Training Regime: The model was trained using the AdamW optimizer with learning rate 1e-5 and momentum 0.9, batch size 4, over 200 epochs on an Nvidia RTX6000 Ada GPU. Data augmentation, including random cropping, flipping, and image distortion, was applied to improve robustness. Input clips of 64 frames were resized to 320x256 pixels.
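To show what these optimizer settings do, here is a sketch of a single AdamW update on one scalar parameter. It reads the reported "momentum 0.9" as AdamW's first-moment coefficient beta1, which is an assumption; `beta2`, `eps`, and `weight_decay` are not given in the summary and are set to commonly used defaults for illustration only.

```python
import math

def adamw_step(param, grad, m, v, step, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a scalar parameter; returns (param, m, v)."""
    param = param * (1.0 - lr * weight_decay)      # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * grad           # first-moment (momentum-like) average
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second-moment average
    m_hat = m / (1.0 - beta1 ** step)              # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

w, m, v = 0.5, 0.0, 0.0
for step in range(1, 4):
    grad = 2.0 * w          # gradient of the toy objective w**2
    w, m, v = adamw_step(w, grad, m, v, step)
print(w)
```

Because the update is normalized by the second-moment estimate, each early step moves a parameter by roughly the learning rate, here about 1e-5, regardless of gradient scale. That small step size is consistent with the long 200-epoch schedule.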

Evaluation Protocol: A held-out test set of 30 videos (5 per activity category, distinct from training/validation) was used. Models were evaluated on accuracy, precision, recall, and F1-score, reported as weighted averages. Performance was compared against six SOTA models fully retrained on UDA with their recommended hyperparameters to ensure a fair comparison. Ablation studies compared DAR-Net with and without semantic supervision and analyzed attention maps, while per-category performance was examined through a confusion matrix and precision-recall curves.
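The weighted-average metrics can be computed from a confusion matrix as sketched below. The 2x2 matrix is invented for illustration and is not the paper's data; `weighted_metrics` is a hypothetical helper name.

```python
def weighted_metrics(confusion):
    """Compute accuracy and support-weighted precision, recall, and F1.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n_classes)) / total
    precision = recall = f1 = 0.0
    for k in range(n_classes):
        tp = confusion[k][k]
        support = sum(confusion[k])                                  # true samples of class k
        predicted = sum(confusion[i][k] for i in range(n_classes))   # predictions of class k
        p_k = tp / predicted if predicted else 0.0
        r_k = tp / support if support else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
        weight = support / total                                     # weight by class frequency
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k
    return accuracy, precision, recall, f1

# Invented toy matrix with two classes of five samples each
acc, prec, rec, f1 = weighted_metrics([[4, 1], [2, 3]])
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```

Support-weighted recall always equals overall accuracy, which is consistent with the reported 73.3% recall sitting next to the 73.33% accuracy.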

Reproducibility: The authors plan to publicly release the UDA dataset at https://irvlab.cs.umn.edu/uda. Code release status is not explicitly stated. Weights or detailed training scripts are not currently mentioned. Dataset availability supports reproducibility of training and evaluation results.

Concrete Example: A 3-second underwater video clip (64 frames) is resized and fed into DAR-Net. ResNeXt-101 extracts spatial features, positional encodings are added, and the features are then processed in parallel by the classification and segmentation branches. The classification branch's transformer attends over time to identify the key frames and interactions that reflect one of the six diver activities. The segmentation branch simultaneously predicts pixel-level masks identifying divers, robots, and objects. During training, the classification and segmentation losses are combined with dynamically learned weights, so gradients from both reach the backbone and the heads and push them toward discriminative features focused on semantically important scene elements. At inference, the model outputs the activity class label with the maximum softmax probability. Attention maps confirm that the model focuses on divers and robots rather than background noise, which helps explain the improved accuracy.
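The final inference step, choosing the class with the highest softmax probability, can be sketched as follows. Only two of the six category names appear in this summary, so the remaining labels are placeholders; the logits are invented and `predict_activity` is a hypothetical helper.

```python
import math

# Only two category names appear in this summary; the others are placeholders.
ACTIVITY_LABELS = [
    "Diver Busy",
    "Robot-Diver Interaction",
    "activity_3", "activity_4", "activity_5", "activity_6",
]

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_activity(logits, labels=ACTIVITY_LABELS):
    """Return the label with the highest softmax probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Invented logits standing in for the classification head's output on one clip
label, confidence = predict_activity([0.2, 1.9, -0.4, 0.1, 0.0, -1.2])
print(label, round(confidence, 3))
```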

Technical innovations

  • Integration of a multi-loss objective combining transformer-based temporal classification with pixel-level semantic segmentation to jointly leverage global activity context and local scene semantics.
  • Use of trainable weighting parameters (alpha, beta) in the multi-loss function to dynamically balance classification and segmentation supervision during training.
  • Creation and planned public release of the UDA dataset containing over 2,600 pixel-annotated underwater images capturing six diver activity categories in multi-human-robot settings.
  • Application of semantic supervision to focus attention mechanisms on relevant scene elements (divers, robots, objects) in underwater multi-agent activity recognition, improving robustness in low-visibility environments.

Datasets

  • Underwater Diver Activity (UDA): 2,640 annotated images with pixel-level masks for divers, robots, and objects, collected from closed-water pool trials; public release planned at https://irvlab.cs.umn.edu/uda

Baselines vs proposed

  • 3DResNet: accuracy = 53.33% vs DAR-Net: 73.33%
  • R(2+1)D: accuracy = 60.00% vs DAR-Net: 73.33%
  • SlowFast: accuracy = 56.67% vs DAR-Net: 73.33%
  • LateTemporal: accuracy = 66.67% vs DAR-Net: 73.33%
  • RRCommNet: accuracy = 60.00% vs DAR-Net: 73.33%
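Since the held-out test set has 30 videos, each reported accuracy corresponds to a whole number of correctly classified videos. The sketch below recovers those counts, assuming every accuracy figure was computed over the same 30 test videos; `correct_count` is a hypothetical helper.

```python
TEST_VIDEOS = 30  # 5 videos for each of the 6 activity categories

# Accuracies (in percent) as listed above
reported_accuracy = {
    "DAR-Net": 73.33, "LateTemporal": 66.67, "RRCommNet": 60.00,
    "R(2+1)D": 60.00, "SlowFast": 56.67, "3DResNet": 53.33,
}

def correct_count(accuracy_percent, n_videos=TEST_VIDEOS):
    """Round a percentage accuracy to the nearest whole number of correct videos."""
    return round(accuracy_percent / 100.0 * n_videos)

counts = {name: correct_count(acc) for name, acc in reported_accuracy.items()}
for name, n in counts.items():
    print(f"{name}: {n}/{TEST_VIDEOS}")
```

Under that assumption, DAR-Net's margin over the strongest baseline, LateTemporal, is two videos, and its margin over 3DResNet is six, which is worth keeping in mind when reading the percentage gaps.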

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.12374.

Fig 1: Demonstration of the proposed diver activity recognition framework,

Fig 2: A few sample images and their semantic labels from the proposed

Fig 3: An overview of the network architecture of DAR-Net. It takes an underwater diver activity video clip as input and extracts highly discriminative

Fig 4: The training performance of DAR-Net. Note the convergence in

Fig 5 (page 1).

Fig 6 (page 1).

Limitations

  • The UDA dataset (2,640 images) is small for transformer architectures, which generally need larger datasets to generalize well.
  • Data collected only in controlled, closed-water pool environments with limited scene variability and lighting conditions, limiting real-world generalization.
  • Limited temporal extent of short video clips (~3 seconds) reduces the ability to disambiguate visually similar activities requiring longer temporal context, e.g., Diver Busy vs Robot-Diver Interaction.
  • The evaluation does not assess adversarial robustness or performance under varying underwater conditions such as turbidity, currents, or open sea.
  • No release of code or trained model weights is mentioned, which may complicate independent reproduction.

Open questions / follow-ons

  • How can training on larger and more diverse underwater datasets collected in open-water conditions improve generalization and robustness of diver activity recognition models?
  • To what extent can synthetic data augmentation or simulation-generated scenes reduce the need for costly real-world underwater data collection?
  • How can the model architecture be refined to better disentangle visually and temporally similar activities, possibly by incorporating longer video sequences or multi-modal sensors?
  • What are effective strategies for incorporating robustness against environmental variability like turbidity, lighting changes, and occlusion common in real underwater deployments?

Why it matters for bot defense

While this paper does not directly address bot-defense or CAPTCHA challenges, its core contributions on semantically guided attention and multi-loss training for robust activity recognition provide insights relevant to CAPTCHA and bot-defense practitioners focusing on video and behavior-based detection. The approach highlights that incorporating local scene semantics as auxiliary supervision can significantly improve classification accuracy in challenging visual domains—a principle that could translate to dynamic CAPTCHA systems attempting to differentiate human-like behaviors from bots in noisy visual settings. Furthermore, the creation of a domain-specific annotated dataset exemplifies how bespoke datasets are critical for training effective models in specialized contexts, a lesson relevant to CAPTCHA design and bot detection in novel environments. The paper also underscores the importance of temporal modeling combined with spatial semantic guidance—concepts beneficial for longitudinal user behavior analysis in bot-defense applications.

Cite

@article{arxiv2606_12374,
  title={Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration},
  author={Sadman Sakib Enan and Junaed Sattar},
  journal={arXiv preprint arXiv:2606.12374},
  year={2026},
  url={https://arxiv.org/abs/2606.12374}
}

Articles are CC BY 4.0 — feel free to quote with attribution