Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions

Source: arXiv:2605.22697 · Published 2026-05-21 · By Yannick Porto, Renato Martins, Thomas Chalumeau, Cedric Demonceaux

TL;DR

This paper addresses a core challenge in human action recognition (HAR): zero-shot recognition of actions under strong geometric domain shifts, such as differing camera viewpoints and body orientations, across diverse datasets. Prior Zero-Shot Action Recognition (ZSAR) methods mostly assume similar geometric conditions at training and test time; this work instead targets realistic cross-domain deployment, where unseen actions and new viewpoints degrade performance. The authors propose an orientation-aware multi-view framework that combines synthetic virtual-camera projections of motion with textual action descriptions enriched by body-orientation information. A dual-branch attention-based network encodes motion sequences conditioned on orientation angles and aligns them with orientation-aware textual prompts through a contrastive loss. Multi-view and textual data are required only at training time, and the model supports both multi-view and single-view inference. Experiments across five benchmarks (NTU-RGB+D, NW-UCLA, BABEL, and two surveillance datasets) show consistent gains over recent state-of-the-art zero-shot methods, improving both cross-domain generalization and seen-class recognition. The learned representations also yield robust transfer learning performance. Code and models are publicly released. Overall, the paper shows that explicitly modeling viewpoint and orientation cues in both motion encoding and text alignment substantially narrows the domain gap that limits zero-shot generalization in HAR.

Key findings

Orientation-aware multi-view pretraining on BABEL improves top-1 accuracy over the prior zero-shot state of the art, ViA, by +3.46 percentage points on NTU-RGB+D and +14.62 percentage points on NW-UCLA (Tab. 1).
Zero-shot cross-domain (ZSCD) accuracy rises sharply over ViA: 26.17% vs 9.06% on NTU-60 (nearly a threefold increase) and 56.12% vs 38.72% on NW-UCLA (+17.40 percentage points) (Tab. 2).
Multi-view pretraining outperforms larger datasets without multi-view, e.g., BABEL MV (44k samples) beats Posetics SV (142k samples) on cross-subject and cross-view splits (Tab. 3).
Same-domain classification on BABEL-60 improves top-1 accuracy by +5.6 percentage points and top-5 by +7.2 percentage points over the best prior methods (Tab. 4).
Ablation shows that multi-view projection training alone adds +4.6 percentage points of top-1 accuracy; the orientation-aware network adds a further +1.0 to +1.3 points (Tab. 5).
Orientation-aware text prompts generated by GPT-3.5 and encoded with CLIP improve motion-text embedding alignment and ZSCD accuracy compared with plain class-label text (Fig. 3).
The model supports multi-view inference by averaging per-view probabilities, which further improves accuracy over single-view inference.
The approach can handle both seen and unseen actions across domains without fine-tuning on target domains.

Threat model

n/a — The paper focuses on robust human action recognition under domain and viewpoint shifts, rather than security threats or adversarial attackers. The assumed challenge is domain gap and unseen actions without access to target domain labels at test time.

Methodology — deep read

Threat model and assumptions: No adversary is modeled. The scenario instead assumes unseen action classes and significant geometric domain shifts at test time, with no annotated data available for new domains. Training has access to multi-view motion sequences and text descriptions; inference uses only single-view or multi-view motion, without paired text.
Data: Training uses 3D motion sequences from the BABEL dataset (44k samples) with SMPL body parameters. Virtual 2D views are rendered by projecting the 3D skeleton joints from 12 uniformly spaced yaw angles, mimicking a multi-view camera setup; occlusions are simulated by checking the line-of-sight visibility of each limb. Text descriptions are generated for each orientation by prompting GPT-3.5 with templates, which yields viewpoint-specific semantic descriptions. Test sets include NTU-RGB+D, NW-UCLA, RHM-HAR, and MCAD, covering both lab-recorded and in-the-wild surveillance data. Data splits separate seen and unseen classes.
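The virtual-view step described above can be sketched as follows. This is only an illustration, not the authors' renderer: it assumes a y-up coordinate frame and an orthographic camera, and it skips the occlusion test entirely. The function name `project_views` is hypothetical.

```python
import math

def project_views(joints_3d, num_views=12):
    """Rotate one 3D skeleton frame about the vertical axis and project to 2D.

    joints_3d: list of (x, y, z) joint positions.
    Returns a list of (yaw_radians, [(u, v), ...]) pairs, one per view.
    Assumptions (not from the paper): y-up frame, orthographic camera
    looking along the z axis, no occlusion handling.
    """
    views = []
    for k in range(num_views):
        yaw = 2.0 * math.pi * k / num_views  # uniformly spaced yaw angles
        c, s = math.cos(yaw), math.sin(yaw)
        joints_2d = []
        for x, y, z in joints_3d:
            x_rot = c * x + s * z  # horizontal coordinate after yaw rotation
            joints_2d.append((x_rot, y))  # orthographic: depth is dropped
        views.append((yaw, joints_2d))
    return views

# Toy frame: head and two shoulders, in metres.
frame = [(0.0, 1.7, 0.0), (0.2, 1.4, 0.0), (-0.2, 1.4, 0.0)]
views = project_views(frame)
```

Each returned yaw can then be paired with its 2D sequence, which is the input the orientation-aware network needs. A perspective camera and a visibility check would replace the two marked simplifications.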
Architecture: The core model consists of:

A projection component that generates multi-view 2D skeleton sequences from a single 3D motion sequence at training.
An orientation-aware network utilizing a dual-branch multi-head cross-attention architecture. One branch conditions motion features on orientation queries, and the other conditions motion queries on orientation features. Orientation angles are encoded as high-dimensional positional encodings (L=192) derived from sine and cosine functions across multiple frequencies.
The motion encoder is a ProtoGCN spatio-temporal graph convolutional network extracting features from 2D skeletons.
Textual action descriptions enriched with orientation cues are encoded with the CLIP text encoder. The descriptions are generated via LLM prompting.
A joint embedding space aligns motion and text features.
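To make the orientation encoding concrete, here is a small sketch of a sine/cosine encoding of a yaw angle. It assumes that L=192 is the output dimension and that the frequencies are the integer harmonics 1 to 96, which keeps the code periodic in 2π; the paper's actual frequency schedule is not given in this summary and may differ. `encode_orientation` is a hypothetical name.

```python
import math

def encode_orientation(theta, dim=192):
    """Map an angle in radians to a sin/cos feature vector of length `dim`.

    Assumption: frequencies are integer harmonics 1..dim/2, so that
    theta and theta + 2*pi produce the same encoding.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    features = []
    for k in range(1, dim // 2 + 1):
        features.append(math.sin(k * theta))
        features.append(math.cos(k * theta))
    return features

code = encode_orientation(math.radians(30))
```

In the described architecture, a vector like `code` would act as the orientation query or key/value in the two cross-attention branches.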

Training regime: Joint training minimizes a weighted combination of a contrastive loss (pulling together matching text-motion pairs and pushing apart mismatches) and a cross-entropy classification loss on motion features. The batch size is 16, and the Adam optimizer is used with an initial learning rate of 0.001 that decays at epoch 30. Positive and negative pairs are constructed by concatenation. Multi-view projections and additive noise serve as training augmentations. Training takes about 12 hours on an NVIDIA A5000.
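A rough sketch of such a combined objective is shown below. It uses a symmetric InfoNCE (CLIP-style) loss as a stand-in for the paper's contrastive term, since the exact formulation is not given here. The weight `lam`, the temperature `tau`, and all function names are assumptions for illustration.

```python
import math

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def contrastive_loss(motion, text, tau=0.07):
    """Symmetric InfoNCE: motion[i] and text[i] form the matching pair."""
    n = len(motion)
    sim = [[cosine(m, t) / tau for t in text] for m in motion]
    m2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]  # text-to-motion direction
    t2m = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (m2t + t2m)

def cross_entropy(logits, labels):
    """Mean cross-entropy of class logits against integer labels."""
    return -sum(log_softmax(l)[y] for l, y in zip(logits, labels)) / len(labels)

def total_loss(motion, text, logits, labels, lam=0.5, tau=0.07):
    # Assumed convex weighting; the paper's weighting scheme may differ.
    return lam * contrastive_loss(motion, text, tau) + (1 - lam) * cross_entropy(logits, labels)

aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy embeddings, `aligned` is close to zero and `shuffled` is large, which is the behavior a contrastive term should have. A real training loop would compute this with an autodiff framework rather than plain lists.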
Evaluation protocol: The main metrics are top-1 and top-5 accuracy on standard splits that separate seen and unseen classes, for both zero-shot learning (ZSL) and zero-shot cross-domain (ZSCD) experiments. Baselines include CADA-VAE, JPoSE, ReViSE, DeViSE, PURLS, Neuron, and ViA, among others. Ablation studies analyze the contributions of the projection component, the orientation-aware modules, and the text prompts. Multi-view inference averages the output probabilities across views. No distribution-shift robustness testing is mentioned beyond the evaluated datasets.
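The two evaluation mechanics mentioned here, probability averaging across views and top-k accuracy, are simple enough to sketch directly. The helper names and the toy probabilities are illustrative only and do not come from the paper.

```python
def average_views(per_view_probs):
    """Average class probabilities over views (list of equal-length lists)."""
    n_views = len(per_view_probs)
    n_classes = len(per_view_probs[0])
    return [sum(p[c] for p in per_view_probs) / n_views for c in range(n_classes)]

def top_k_accuracy(prob_rows, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for probs, label in zip(prob_rows, labels):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

# Three views of one sample whose true class is 1 (toy numbers).
views = [[0.5, 0.4, 0.1], [0.2, 0.7, 0.1], [0.3, 0.6, 0.1]]
fused = average_views(views)
```

In this toy case the first view alone picks the wrong class, while the fused prediction picks class 1, which illustrates why averaging can help when individual viewpoints are ambiguous.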
Reproducibility: Code and pretrained weights are publicly released. The datasets are standard in the HAR community. The synthetic projection code and the LLM prompting details are described in enough detail for reproduction.

Concrete example end-to-end: During training, a 3D skeleton motion from BABEL is projected into 12 virtual camera viewpoints resulting in 2D joint sequences with simulated occlusions. Each sequence is passed through ProtoGCN to extract features. The orientation angle of each view is encoded with sinusoidal positional embeddings and used in a dual-branch cross-attention module to condition motion features on viewpoint. Concurrently, GPT-3.5 prompts generate orientation-aware textual descriptions for each action. CLIP encodes these text prompts. The motion and text features are aligned in a shared embedding space using a contrastive loss, while a motion classification loss guides learning. At inference, a single-view 2D skeleton from a novel domain is encoded and compared against text features of candidate actions from that domain to classify actions zero-shot, robust to orientation changes.
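The final inference step of this example, matching a motion embedding against the text embeddings of candidate actions, can be sketched as a nearest-neighbor search under cosine similarity. The embeddings below are toy 3-dimensional vectors standing in for real encoder outputs, and `zero_shot_classify` is a hypothetical helper.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_classify(motion_emb, text_embs):
    """Pick the candidate action whose text embedding is closest to the motion.

    text_embs: dict mapping action name -> embedding (stand-in for the
    output of a text encoder). Returns (best_action, scores).
    """
    scores = {name: cosine(motion_emb, emb) for name, emb in text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings; real ones would come from the motion and text encoders.
text_embs = {
    "wave": [0.9, 0.1, 0.0],
    "sit down": [0.0, 1.0, 0.2],
    "kick": [0.1, 0.0, 1.0],
}
motion_emb = [0.8, 0.3, 0.1]
best, scores = zero_shot_classify(motion_emb, text_embs)
```

Because the candidate set is just a dictionary of text embeddings, new action classes can be added at test time without retraining, which is the point of the zero-shot setup.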

Technical innovations

Orientation-aware dual-branch multi-head cross-attention network conditioning motion features on continuous body orientation positional encodings to improve viewpoint robustness.
Multi-view training strategy that generates virtual 2D skeleton projections with simulated occlusions from 3D pose sequences, increasing domain invariance without requiring multiple cameras at inference.
Use of large language model (GPT-3.5) to augment textual action descriptions with explicit viewpoint and limb visibility cues, enhancing cross-modal motion-text alignment.
Joint training with contrastive loss and classification loss on motion and text embeddings to unify zero-shot classification and feature learning.
Capability to perform both multi-view and single-view inference by averaging probabilistic outputs across views or using a single viewpoint.

Datasets

BABEL — 44k motion sequences — public dataset used for multi-view pretraining
NTU-RGB+D-60 — standard HAR benchmark with 60 action classes, 30/30 seen/unseen split — public
NW-UCLA — smaller HAR dataset with 10 action classes, 5/5 split — public
RHM-HAR — multi-view surveillance dataset with robot and fixed cameras, 7/7 action split — real-world application, public
MCAD — CCTV multi-view surveillance dataset with 18-27 actions, cross-subject/cross-view/cross-action splits — public

Baselines vs proposed

ViA (Pretrained) on NTU-60 ZSL: 25.02% top-1 accuracy vs Ours: 28.48%
ViA (Pretrained) on NW-UCLA ZSL: 54.47% top-1 vs Ours: 69.09%
ViA on RHM-HAR ZSL: 29.62% vs Ours: 52.03%
ViA on MCAD ZSL: 18.70% vs Ours: 50.12%
ViA on NTU-60 ZSCD (zero-shot cross-domain): 9.06% top-1 vs Ours: 26.17%
ViA on NW-UCLA ZSCD: 38.72% vs Ours: 56.12%
On BABEL-60 same domain top-1: MotionPatches 41.33%, ViA 41.79%, Ours 47.41%
Projection component alone with 2s-AGCN backbone adds +4.61% top-1 accuracy over baseline
Orientation-aware network adds further +1.0-1.3% top-1 gains in ablation
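Since the comparisons above mix absolute and relative improvements, it can help to compute both from the listed numbers. The snippet below only restates the ViA and proposed-method accuracies given in this section; the helper name `gains` is illustrative.

```python
# (ViA, Ours) top-1 accuracy in percent, copied from the list above.
reported = {
    "NTU-60 ZSL": (25.02, 28.48),
    "NW-UCLA ZSL": (54.47, 69.09),
    "RHM-HAR ZSL": (29.62, 52.03),
    "MCAD ZSL": (18.70, 50.12),
    "NTU-60 ZSCD": (9.06, 26.17),
    "NW-UCLA ZSCD": (38.72, 56.12),
}

def gains(baseline, ours):
    """Return (absolute gain in percentage points, ratio ours/baseline)."""
    return round(ours - baseline, 2), round(ours / baseline, 2)

summary = {name: gains(b, o) for name, (b, o) in reported.items()}
for name, (points, ratio) in summary.items():
    print(f"{name:13s} +{points:5.2f} points  x{ratio:.2f}")
```

On these numbers, the NTU-60 ZSCD result is close to a threefold improvement, whereas the NW-UCLA ZSCD gain is large in absolute terms but well short of a doubling.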

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.22697.

Fig 1: Geometric domain gaps of actions over four different datasets.

Limitations

Body orientation must be estimated at inference, which adds computational overhead; this step is not yet integrated into the pose estimation pipeline.
Evaluation does not consider robustness to adversarial attacks or explicit adversarial domain shifts beyond available datasets.
No ablations or results are reported on very large-scale, heterogeneous, uncurated in-the-wild video datasets.
Dependence on synthetic multi-view projections during training, which may differ from real multi-camera conditions.
Orientation-aware text prompts rely on LLM-generated descriptions, which may introduce bias or complexity if labels are unavailable or inconsistent.
Cross-domain zero-shot performance, though improved, still yields relatively modest accuracies (<30% top-1 on some splits), indicating room for improvement.

Open questions / follow-ons

How can body orientation estimation be integrated efficiently with pose estimation to reduce inference overhead?
Can the model generalize to large-scale, in-the-wild video datasets with unconstrained camera motions and occlusions?
What is the impact of noisy or limited textual action descriptions on the cross-modal alignment and zero-shot performance?
How would the method perform under strong distribution shifts such as adversarial manipulations or synthetic-to-real transfer?

Why it matters for bot defense

This work is highly relevant to practitioners in bot defense and CAPTCHA design seeking robust human action recognition models capable of generalizing across domains and unseen categories. The orientation-aware multi-view training addresses a key geometric domain gap related to varied camera angles and subject orientations, which frequently degrade real-world performance. By combining body orientation conditioning with textual prompt alignment, the approach improves zero-shot recognition without requiring exhaustive annotated data in every deployment environment. Bot-defense systems could apply similar multi-view and cross-modal learning principles to better distinguish human motion patterns from scripted or synthetic behaviors under diverse observation conditions. The method’s demonstrated transfer capabilities on surveillance datasets are particularly applicable for real-time monitoring tasks in security contexts. However, the need for reliable orientation estimation and computational cost of multi-view encoding may impose constraints in practice. Overall, this paper offers a rigorous, multi-component strategy for increasing recognition robustness against the geometric variability often encountered in automated bot detection and human verification systems.

Cite

bibtex

@article{arxiv2605_22697,
  title={Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions},
  author={Yannick Porto and Renato Martins and Thomas Chalumeau and Cedric Demonceaux},
  journal={arXiv preprint arXiv:2605.22697},
  year={2026},
  url={https://arxiv.org/abs/2605.22697}
}
