Uncertainty-Driven Anomaly Detection for Psychotic Relapse Using Smartwatches: Forecasting and Multi-Task Learning Fusion

Source: arXiv:2605.13816 · Published 2026-05-13 · By Nikolaos Tsalkitzis, Panagiotis P. Filntisis, Petros Maragos, Niki Efthymiou

TL;DR

This paper tackles early detection of psychotic relapse from passively collected smartwatch biosignals using two Transformer-based frameworks. The first forecasts cardiac feature trajectories and flags deviations from the predicted values as anomalies. The second is a multi-task model that integrates sleep, motion, and cardiac signals to learn time-aware embeddings and predict measurement timing. Both estimate predictive uncertainty with an ensemble of multilayer perceptrons and turn it into a daily anomaly score. The authors show that the two models capture complementary physiological patterns linked with relapse, and they propose a late fusion strategy that combines both anomaly scores into a unified risk signal. On the 2nd e-Prevention Grand Challenge dataset, the fused method improves the average of AUROC and AUPRC by 0.080 over the previous winning entry (0.584 vs. 0.504, about 8 percentage points), with ablations supporting the benefit of diverse phenotype fusion and uncertainty-driven scoring.

Key findings

The forecasting model achieves an average metric (AVG, the mean of AUROC and AUPRC) of 0.576 on the test set, outperforming the baseline (AVG=0.437) and the previous winning model (AVG=0.504), the latter by about 7 percentage points.
The multi-task model reaches an AVG of 0.575, likewise about 7 percentage points above the previous winning model.
Late fusion of the two models' daily anomaly scores further improves performance to an AVG of 0.584, a gain of about 8 percentage points (0.080) over the previous winning model.
Max fusion of anomaly scores produces the highest AUPRC (0.667), improving early retrieval of relapse days under severe class imbalance.
The forecasting model shows stronger AUROC (better global separability), whereas the multi-task model attains higher AUPRC (better detection under the imbalanced relapse class).
Ablations found that a one-hour stride and a two-hour window size best balance temporal context and data efficiency for both models.
ALiBi positional embeddings performed best for the multi-task model; sinusoidal embeddings worked best for the cardiac forecasting model.
Negative threshold values (τ = -0.1/-0.2) yielded a better balance of detection sensitivity and precision in the relapse decision rule.
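
The window and stride settings reported above can be made concrete with a short sketch. It assumes 5-minute feature steps (as described in the methodology), so a two-hour window spans 24 steps and a one-hour stride advances by 12. The function name `sliding_windows` and the choice to drop incomplete trailing windows are illustrative assumptions, not details taken from the paper.

```python
def sliding_windows(steps, window_minutes=120, stride_minutes=60, step_minutes=5):
    """Split a sequence of fixed-length time steps into overlapping windows.

    Assumption: incomplete windows at the end of the sequence are dropped.
    """
    win = window_minutes // step_minutes        # 24 steps for a 2-hour window
    stride = stride_minutes // step_minutes     # 12 steps for a 1-hour stride
    if win <= 0 or stride <= 0:
        raise ValueError("window and stride must cover at least one step")
    return [steps[i:i + win] for i in range(0, len(steps) - win + 1, stride)]


# One day of 5-minute steps has 288 entries.
day = list(range(288))
windows = sliding_windows(day)
print(len(windows), len(windows[0]))
```

With these settings, consecutive windows overlap by half, so each 5-minute step (apart from those at the edges of the day) appears in two windows.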

Threat model

The paper does not define an attacker. Its implicit setting is unsupervised anomaly detection: physiological signals are monitored so that psychotic relapse shows up as a deviation from an individual's baseline behavior. The challenge is natural variability and noise in wearable sensor data, not deliberate tampering. The model cannot rely on perfect labels for relapse days and must operate with partial observability and uncertainty in daily biosignal patterns.

Methodology — deep read

The setting is clinical monitoring: the goal is early detection of psychotic relapse from physiological and behavioral signals collected passively via smartwatch. The system must detect abnormal deviations under realistic noise and variability in wearable data; no explicit adversarial tampering is assumed.

Data come from the public 2nd e-Prevention Grand Challenge (ICASSP 2024) Track 2 dataset, which provides continuous multi-modal biosignals (accelerometer, gyroscope, RR intervals, heart rate, step counts, and sleep) from 8 subjects with psychotic disorders. The dataset contains about 200 days of training data (remission only) and ~85-87 days test/validation including relapse and remission periods. Features are aggregated into 5-minute intervals and daily segments, and include cardiac variability metrics (such as RMSSD and SDNN) and time embeddings.
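
As a rough illustration of the cardiac features named above, the sketch below computes RMSSD and SDNN from RR intervals grouped into 5-minute bins. RMSSD and SDNN follow their standard definitions; the binning helper, the minimum-sample rule, and all function names are assumptions made for this example and are not taken from the paper's preprocessing.

```python
import math
from statistics import mean, stdev


def rmssd(rr_ms):
    """Root mean square of successive differences between RR intervals (ms)."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def sdnn(rr_ms):
    """Sample standard deviation of RR intervals (ms)."""
    return stdev(rr_ms)


def cardiac_features(timestamps_s, rr_ms, bin_seconds=300, min_beats=3):
    """Aggregate RR intervals into fixed bins and compute per-bin features.

    Assumption: bins with fewer than `min_beats` intervals are skipped.
    """
    bins = {}
    for t, rr in zip(timestamps_s, rr_ms):
        bins.setdefault(int(t // bin_seconds), []).append(rr)
    features = {}
    for idx, values in sorted(bins.items()):
        if len(values) < min_beats:
            continue
        features[idx] = {
            "rmssd": rmssd(values),
            "sdnn": sdnn(values),
            "mean_hr_bpm": 60000.0 / mean(values),
        }
    return features


feats = cardiac_features([0, 1, 2, 3, 301], [800, 810, 790, 800, 805])
print(sorted(feats))
```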

Two main Transformer encoder-based architectures are developed. The forecasting model takes 5-minute time steps of activity and cardiac features as input. A Transformer encoder produces per-window embeddings, and a prediction head uses them for one-step-ahead forecasting of the cardiac feature vector. An ensemble of 5 MLP heads is trained with online resampling to promote diversity. The element-wise variance across the ensemble outputs gives a predictive uncertainty per window, which is aggregated per day and normalized per patient to derive an anomaly score.
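
The uncertainty step can be sketched as follows: given the predictions of several heads for one window, take the variance across heads for each predicted feature, then reduce to one number per window and one per day. The use of population variance and of plain means for both reductions is an assumption for illustration; the paper only states that element-wise variance is computed and aggregated daily.

```python
from statistics import mean, pvariance


def window_uncertainty(head_predictions):
    """Mean over features of the across-head variance for one window.

    head_predictions: list of per-head prediction vectors, shape
    (n_heads, n_features). Assumption: population variance, mean over features.
    """
    per_feature = [pvariance(values) for values in zip(*head_predictions)]
    return mean(per_feature)


def daily_uncertainty(windows):
    """Assumption: a day's uncertainty is the mean over its windows."""
    return mean(window_uncertainty(w) for w in windows)


# Two heads, two predicted features: the heads disagree only on the first.
disagreeing = [[1.0, 2.0], [3.0, 2.0]]
agreeing = [[0.0, 0.0], [0.0, 0.0]]
print(window_uncertainty(disagreeing), daily_uncertainty([disagreeing, agreeing]))
```

Windows where the heads disagree contribute more to the daily value, which is the property the anomaly score relies on.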

The second model uses multi-task learning with a Transformer encoder trained to predict measurement time embeddings and an auxiliary sleep timing prediction head from fused sleep, motion, and cardiac features. During inference, uncertainty-driven anomaly scores derive from ensemble variance in measurement-time embedding predictions (sleep head uncertainty excluded for stability). Day-level variance scores from sleep and measurement-time tasks are combined via weighted averaging (0.3 / 0.7) and normalized per patient to generate an anomaly score.

Late fusion combines continuous anomaly scores from both models by weighted averaging (α = 0.7 on heart forecasting) or max/min fusion, prior to thresholding for relapse classification.
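
A minimal sketch of the three score-level fusion rules described above, assuming both models emit one normalized anomaly score per day on a comparable scale. The function name `fuse_scores` and its signature are hypothetical; only the weighting (α = 0.7 on the forecasting score) and the max/min alternatives come from the text.

```python
def fuse_scores(forecast, multitask, mode="weighted", alpha=0.7):
    """Fuse two per-day anomaly score lists before thresholding.

    alpha is the weight on the forecasting model's score in "weighted" mode.
    """
    if len(forecast) != len(multitask):
        raise ValueError("score lists must cover the same days")
    pairs = list(zip(forecast, multitask))
    if mode == "weighted":
        return [alpha * f + (1.0 - alpha) * m for f, m in pairs]
    if mode == "max":
        return [max(f, m) for f, m in pairs]
    if mode == "min":
        return [min(f, m) for f, m in pairs]
    raise ValueError(f"unknown fusion mode: {mode}")


forecast_scores = [1.0, 0.0]
multitask_scores = [0.0, 2.0]
for mode in ("weighted", "max", "min"):
    print(mode, fuse_scores(forecast_scores, multitask_scores, mode=mode))
```

Max fusion flags a day when either model is alarmed, while min fusion requires both, which is consistent with the trade-off between AUPRC and AUROC reported for the two rules.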

Models are trained for 50 epochs with learning rate 1e-3, batch size 16, and weight decay 5e-4, using 5-head MLP ensembles for uncertainty estimation. Key hyperparameters (stride size, window length) and positional embedding types (sinusoidal, ALiBi, RoPE) were systematically evaluated. Evaluation uses AUROC, AUPRC, and their average (AVG) across 10 runs on held-out test data with daily relapse labels. Ablation studies explored decision threshold effects and fusion strategies. Patient-specific healthy distributions of uncertainty scores from remission data serve as baselines to normalize anomaly scores.
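
For readers who want to reproduce the AVG metric, the sketch below computes AUROC, an AUPRC estimate, and their mean from daily labels and scores using only the standard library. AUROC is computed as the fraction of correctly ordered relapse/remission pairs. AUPRC is approximated here by average precision, which is an assumption: the paper's exact AUPRC computation is not described in this summary, and tied scores are not specially handled in this version.

```python
def auroc(labels, scores):
    """Probability that a relapse day scores above a remission day (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision values at the rank of each relapse day."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


def avg_metric(labels, scores):
    """AVG as used in the summary: mean of AUROC and AUPRC."""
    return 0.5 * (auroc(labels, scores) + average_precision(labels, scores))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores), average_precision(labels, scores), avg_metric(labels, scores))
```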

An example end-to-end flow for the forecasting model: the 5-minute windowed sensor data for a day are embedded by the Transformer encoder and averaged into a window representation. That representation is fed into the ensemble of MLP heads, each predicting the next-step cardiac features. The variance across the heads' predictions quantifies uncertainty, which is normalized relative to the patient's remission days to produce a daily anomaly score. If the score exceeds a threshold, the day is flagged as relapse.
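
The last two steps of this flow, patient-specific normalization and thresholding, can be sketched as below. The z-score form of the normalization is an assumption: the summary only says scores are normalized relative to remission days. The default τ = -0.1 is one of the threshold values reported in the key findings; the function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_baseline(remission_uncertainties):
    """Summarize a patient's remission-day uncertainties as (mean, std).

    Assumption: a zero standard deviation is replaced by 1.0 to avoid
    division by zero.
    """
    mu = mean(remission_uncertainties)
    sigma = pstdev(remission_uncertainties)
    if sigma == 0:
        sigma = 1.0
    return mu, sigma


def anomaly_score(daily_uncertainty, baseline):
    """Assumed z-score of a day's uncertainty against the remission baseline."""
    mu, sigma = baseline
    return (daily_uncertainty - mu) / sigma


def flag_relapse(score, tau=-0.1):
    """Flag the day as relapse when the score exceeds the threshold tau."""
    return score > tau


baseline = fit_baseline([1.0, 2.0, 3.0])
for value in (1.0, 2.0, 4.0):
    score = anomaly_score(value, baseline)
    print(value, round(score, 3), flag_relapse(score))
```

Under this assumed normalization, a negative τ flags days whose uncertainty is close to the remission mean, which trades some precision for sensitivity.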

The paper does not indicate that code or model weights are publicly released. The dataset is publicly available via the e-Prevention Challenge repository.

Technical innovations

An ensemble of MLP heads over Transformer-derived embeddings that estimates predictive uncertainty for anomaly scoring.
A forecasting pipeline predicting next-step cardiac feature vectors from wearable data to detect deviations indicative of relapse.
A multi-task Transformer model simultaneously predicting measurement timestamps and sleep timing embeddings to capture complementary circadian and sleep phenotypes.
A late-fusion strategy combining complementary anomaly scores from cardiac forecasting and multi-task sleep/time models into a unified daily relapse risk score.

Datasets

e-Prevention 2nd Grand Challenge Track 2 dataset — ~372 days of multi-modal smartwatch biosignals from 8 patients — public (ICASSP 2024 challenge)

Baselines vs proposed

Baseline approach: AVG = 0.437 vs forecasting model AVG = 0.576
Hein et al. [12] (previous challenge winner): AVG = 0.504 vs multi-task model AVG = 0.575
Fusion (weighted α=0.7): AUPRC = 0.664 vs max fusion AUPRC = 0.667
Min fusion: AUROC = 0.510 (highest) but lowest AUPRC = 0.653

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.13816.

Fig 1: The proposed transformer-based relapse and anomaly detection framework where windowed wearable features are

Fig 2: Patient-wise comparison of forecasting and multi-task

Limitations

Small cohort size (8 patients) limits generalizability and robustness of findings.
Relapse detection evaluated on fixed remission-based healthy distributions; no adversarial or out-of-distribution robustness tested.
Sleep measurement data considered noisy, leading to exclusion of sleep head uncertainty in anomaly score estimation.
Late fusion applied at score level; early fusion or integrated multimodal architectures not explored but suggested for future work.
No detailed interpretability analyses on what physiological features drive detection.
No mention of real-time deployment considerations or computational costs on-device.

Open questions / follow-ons

How would early fusion of cardiac and sleep/time signals into a single architecture improve cross-modal interaction learning and performance?
Can interpretability methods be developed to link specific physiological changes to relapse alerts for clinical validation?
How robust is the approach to broader populations, unseen devices, or longer-term monitoring beyond the limited dataset?
Could adversarial or distribution shift scenarios (e.g., sensor failure or user behavior changes) undermine anomaly detection reliability?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this paper shows how Transformer encoders combined with ensemble-based uncertainty estimation can model human physiological and behavioral patterns, which may inspire anomaly detection methods for sequential biometric or behavioral streams. Its fusion of complementary modalities and its use of predictive uncertainty as the anomaly score may generalize to detecting abnormal user states or behaviors in security settings.

The emphasis on late fusion of asynchronous multimodal anomaly signals, as opposed to early joint modeling, points to ensemble design principles that may reduce false positives and missed detections when monitoring complex multi-signal data. Although the paper focuses on clinical relapse, its methods for time-aware embedding, patient-specific baseline modeling, and threshold calibration under severe class imbalance offer transferable ideas for user-authentication anomaly detection based on temporal and physiological biometrics.

Cite

bibtex

@article{arxiv2605_13816,
  title={Uncertainty-Driven Anomaly Detection for Psychotic Relapse Using Smartwatches: Forecasting and Multi-Task Learning Fusion},
  author={Nikolaos Tsalkitzis and Panagiotis P. Filntisis and Petros Maragos and Niki Efthymiou},
  journal={arXiv preprint arXiv:2605.13816},
  year={2026},
  url={https://arxiv.org/abs/2605.13816}
}
