Automated Detection of Urological Events in Bladder Pressure Signals with a Two-Stage Machine Learning Framework Validated on External Datasets

Source: arXiv:2605.21878 · Published 2026-05-21 · By Hassaan A. Bukhari, Vikram Abbaraju, Jay Patel, Becky Clarkson, Shachi Tyagi, Margot S. Damaser et al.

TL;DR

This work addresses the challenge of automating detection of three clinically important urological events, namely abdominal activity (ABD), detrusor overactivity (DO), and voiding contractions (VOID), from single-channel vesical pressure (Pves) signals recorded during urodynamics (UDS). Conventional UDS relies on invasive dual catheterization, which is uncomfortable, and on manual labeling, which is subjective. As catheter-free devices that measure only Pves become available, automated event detection from single-channel signals becomes critical. The authors present a two-stage multilayer perceptron (MLP) framework that first distinguishes VOID from non-VOID, then classifies non-VOID as ABD or DO. They extract 55 statistical features from discrete wavelet transform (DWT) decompositions of short 0.8-second Pves segments, aggregate features for event-level classification, and train and test on three independent human UDS datasets totaling 118 annotated traces and 233,338 segments. On external test sets, the system achieves up to 84% accuracy (balanced accuracy 76%) and AUC 0.85 for VOID vs non-VOID, and up to 90% accuracy (balanced accuracy 80%) and AUC 0.87 for ABD vs DO. Additional cross-dataset experiments show stable performance across train-test configurations. A proof-of-concept test on independent wireless catheter-free ambulatory Pves data from one patient shows the model detects physiologically meaningful events consistent with patient-reported symptoms. The work demonstrates the feasibility of automated urodynamic event classification from single-channel bladder pressure signals, a step toward ambulatory and home-based monitoring.

Key findings

Stage 1 (VOID vs non-VOID) achieved 84% overall accuracy, 76% balanced accuracy, F1-macro 0.74, and AUC 0.85 on external Dataset C.
Stage 2 (ABD vs DO) reached 90% overall accuracy, 80% balanced accuracy, F1-macro 0.80, and AUC 0.87 on external Dataset C.
The cascaded two-stage MLP achieved 77% accuracy and AUC up to 0.73 for three-class classification (ABD, DO, VOID) on Dataset C.
Permutation Feature Importance analysis showed most of the 55 wavelet-based features contributed meaningfully; top features came mainly from wavelet approximation and detail coefficients.
Cross-dataset training and testing confirmed generalizability, with stable performance across multiple train-test splits and datasets (Datasets A, B, C).
Proof-of-concept evaluation on 24-hour wireless catheter-free ambulatory Pves data captured clinically relevant DO and VOID events consistent with patient reports despite no preprocessing.
DO classification accuracy was lower than ABD and VOID due to fewer DO samples and greater signal variability.
Model training and evaluation completed efficiently on a standard CPU/GPU platform (<1 minute per experiment).

Threat model

The adversary in this context is the noise, artifacts, and physiological variability inherent in vesical pressure (Pves) signals recorded during urodynamics, which may obscure or distort the signatures of urological events (ABD, DO, VOID). The model assumes no direct access to abdominal pressure (Pabd): catheter-free measurement restricts sensor information to single-channel Pves. The model cannot manipulate or control the recording environment but must generalize across interpatient variability and device differences to reliably classify events.

Methodology — deep read

The authors begin by defining a clinically motivated threat model: detecting urological contraction events (ABD, DO, VOID) from single-channel Pves bladder pressure signals recorded invasively during urodynamic studies. The adversary is essentially signal noise and interpatient variability; the model must generalize across datasets without direct dual catheter signals (Pabd).

The data comprise 118 annotated UDS traces from 76 human patients with neurogenic or overactive bladder, compiled into three independent datasets: Dataset A (64 female neurogenic bladder traces), Dataset B (20 female overactive bladder traces), and Dataset C (34 male neurogenic bladder traces, spinal cord injury (SCI)). All recordings were downsampled to a common 10 Hz sampling rate. Each trace was manually annotated by expert urologists with ABD, DO, VOID, and NONE events.

The vesical pressure (Pves) signals were segmented into 0.8-second non-overlapping windows (8 samples per segment at 10 Hz), resulting in 233,338 segments across all datasets. Each segment is labeled according to the underlying event. To extract discriminative time-frequency features suitable for nonstationary signals, a 5-level discrete wavelet transform (DWT) with Daubechies 2 (Db2) mother wavelet was applied, decomposing signals into 5 approximation and 5 detail coefficient sets. Cubic spline interpolation aligned coefficient lengths.
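
The windowing step and a single wavelet level can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it shows only one Db2 analysis level with periodic boundary handling (an assumption), and it does not reproduce the 5-level decomposition or the cubic spline alignment described above. The function names and the synthetic trace are hypothetical.

```python
import math

def segment_signal(pves, fs_hz=10, window_s=0.8):
    """Split a trace into non-overlapping windows; a trailing partial window is dropped."""
    win = round(fs_hz * window_s)  # 8 samples at 10 Hz
    return [pves[i:i + win] for i in range(0, len(pves) - win + 1, win)]

# Daubechies-2 scaling (low-pass) filter and its quadrature-mirror high-pass filter.
_S3, _NORM = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
DB2_LO = [(1 + _S3) / _NORM, (3 + _S3) / _NORM, (3 - _S3) / _NORM, (1 - _S3) / _NORM]
DB2_HI = [DB2_LO[3], -DB2_LO[2], DB2_LO[1], -DB2_LO[0]]

def dwt_db2_single_level(x):
    """One Db2 analysis level with periodic (wrap-around) boundaries; expects even length."""
    n = len(x)
    approx, detail = [], []
    for i in range(0, n, 2):
        approx.append(sum(DB2_LO[k] * x[(i + k) % n] for k in range(4)))
        detail.append(sum(DB2_HI[k] * x[(i + k) % n] for k in range(4)))
    return approx, detail

# Synthetic 10 s trace sampled at 10 Hz (illustrative values, not real Pves data).
trace = [10.0 + math.sin(0.3 * t) for t in range(100)]
segments = segment_signal(trace)
approx, detail = dwt_db2_single_level(segments[0])
print(len(segments), len(segments[0]), len(approx), len(detail))  # prints: 12 8 4 4
```

With periodic boundaries the transform is orthonormal, so the energy of the approximation and detail coefficients together equals the energy of the input window.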

From each of the 10 wavelet coefficient signals per segment, four statistical features were computed: maximum, mean absolute value, median, and Shannon entropy, totalling 40 wavelet-based features per segment. Additionally, cross-correlations between approximation and detail coefficients at each level were computed, and maximum, mean, and median values derived, adding 15 more features for a total of 55 per segment.
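
A rough sketch of the four per-coefficient statistics and the feature count arithmetic follows. The entropy definition here (Shannon entropy of the normalized coefficient energies) is an assumption, since the summary does not state which variant the authors used; the cross-correlation features are only counted, not computed. Function names are hypothetical.

```python
import math
from statistics import median

def shannon_entropy(coeffs):
    """Entropy (bits) of the normalized energy distribution; an assumed definition."""
    energy = [c * c for c in coeffs]
    total = sum(energy)
    if total == 0:
        return 0.0
    probs = [e / total for e in energy if e > 0]
    return -sum(p * math.log2(p) for p in probs)

def coefficient_features(coeffs):
    """Maximum, mean absolute value, median, and entropy of one coefficient set."""
    return {
        "max": max(coeffs),
        "mean_abs": sum(abs(c) for c in coeffs) / len(coeffs),
        "median": median(coeffs),
        "entropy": shannon_entropy(coeffs),
    }

# Feature count as described in the text.
N_COEFF_SETS = 10   # 5 approximation + 5 detail sets
N_STATS = 4         # max, mean absolute value, median, entropy
N_LEVELS = 5        # one approximation/detail cross-correlation per level
N_XCORR_STATS = 3   # max, mean, median of each cross-correlation
n_features = N_COEFF_SETS * N_STATS + N_LEVELS * N_XCORR_STATS

example = coefficient_features([1.0, -1.0, 1.0, -1.0])
print(n_features, example)
```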

Because the dataset is imbalanced (NONE is the majority class), 120,161 NONE segments were sampled to approximate the combined count of ABD (23,835), DO (22,748), and VOID (71,250) segments for balanced training. Consecutive segments of the same event were grouped into events, and the median of each feature across those segments was taken to form event-level feature vectors.
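
The event-level aggregation can be illustrated with a short sketch: runs of consecutive segments sharing a label are collapsed into one event, and each feature is replaced by its median over the run. The two-feature rows and the function name are hypothetical and chosen only to keep the example small.

```python
from itertools import groupby
from statistics import median

def aggregate_events(labels, features):
    """Group consecutive same-label segments and take the per-feature median."""
    events = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        rows = features[start:start + length]
        vector = [median(column) for column in zip(*rows)]
        events.append((label, vector))
        start += length
    return events

# Six segments with two illustrative features each.
labels = ["NONE", "DO", "DO", "DO", "VOID", "VOID"]
features = [[0.0, 1.0], [2.0, 5.0], [4.0, 3.0], [9.0, 4.0], [7.0, 7.0], [9.0, 9.0]]
events = aggregate_events(labels, features)
for label, vector in events:
    print(label, vector)
```

The median is less sensitive than the mean to a single outlying segment inside an event, which is one plausible reason for preferring it here.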

The core model is a two-stage multilayer perceptron (MLP). Each stage takes the 55 features as input, followed by two hidden layers of 128 and 200 neurons with ReLU activations. The output layer of each stage has 2 neurons with softmax to produce class probabilities.
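
To make the layer sizes concrete, here is a dependency-free sketch of one stage's forward pass with randomly initialized weights. The initialization scheme and the seed are assumptions for illustration; a real implementation would most likely use a deep learning framework and trained weights.

```python
import math
import random

LAYER_SIZES = [55, 128, 200, 2]  # input, two hidden layers, two-class output

def init_params(sizes, seed=0):
    """Random weights (He-style scaling, an assumption) and zero biases."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(params, x):
    """Dense layers with ReLU on the hidden layers and softmax on the output."""
    for idx, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if idx < len(params) - 1:
            x = [max(0.0, v) for v in x]
    return softmax(x)

params = init_params(LAYER_SIZES)
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in params)
probs = forward(params, [0.1] * 55)
print(n_params, probs)
```

With these sizes each stage has 33,370 trainable parameters, which is consistent with the short training times reported.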

Stage 1 is a binary classifier: VOID vs non-VOID. Stage 2 further classifies the non-VOID samples into ABD vs DO. Feature selection via ReliefF ranked features to optimize discrimination.
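
The routing between the two stages can be sketched as below. The stage functions are hypothetical stand-ins for trained classifiers, and the 0.5 decision threshold is an assumption (it matches taking the larger of two softmax outputs).

```python
def cascade_predict(features, stage1_prob_void, stage2_prob_do, threshold=0.5):
    """Route one event through Stage 1 (VOID vs non-VOID), then Stage 2 (ABD vs DO)."""
    if stage1_prob_void(features) >= threshold:
        return "VOID"
    if stage2_prob_do(features) >= threshold:
        return "DO"
    return "ABD"

# Hypothetical stand-ins for trained models: each maps a feature vector to a probability.
def toy_stage1(features):
    return 0.9 if features[0] > 40.0 else 0.1

def toy_stage2(features):
    return 0.8 if features[1] > 0.5 else 0.2

inputs = ([55.0, 0.1], [12.0, 0.9], [12.0, 0.2])
predictions = [cascade_predict(f, toy_stage1, toy_stage2) for f in inputs]
print(predictions)  # prints: ['VOID', 'DO', 'ABD']
```

Because Stage 2 only sees what Stage 1 passes on, any VOID event that Stage 1 misses is necessarily mislabeled as ABD or DO, so Stage 1 errors carry through to the three-class result.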

Training used Adam optimizer, learning rate 0.001, cross-entropy loss, 50 epochs, batch size 128. Standard scaling (z-score) normalized features using training set statistics to avoid leakage. Training was done on a 12-core Intel CPU with 16GB RAM and GPU (15.8GB VRAM), completing training and evaluation within 1 minute per run.
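
The leakage-avoiding normalization amounts to fitting the mean and standard deviation on the training set only and reusing them for the test set. A minimal sketch, assuming the population standard deviation and a fallback of 1.0 for constant features (both assumptions):

```python
from statistics import mean, pstdev

def fit_scaler(train_rows):
    """Per-feature mean and population standard deviation from training data only."""
    columns = list(zip(*train_rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # avoid division by zero
    return means, stds

def transform(rows, means, stds):
    """Apply z-score scaling using previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
test = [[7.0, 12.0]]

means, stds = fit_scaler(train)       # statistics come from the training set
train_z = transform(train, means, stds)
test_z = transform(test, means, stds)  # test data never influences means or stds
print(means, stds, test_z)
```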

Evaluation covered several settings: internal stratified random splits (60% train, 40% test); external validation, training on Datasets A+B and testing on unseen Dataset C; and training on Dataset B alone with testing on combined A+C. Additional splits (80%-20%, 70%-30%) and train/test configurations were used to check robustness.
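
The external-validation configurations are splits at the dataset level rather than the segment level. A small sketch using the trace counts given earlier (the trace records and identifiers are hypothetical placeholders):

```python
TRACE_COUNTS = {"A": 64, "B": 20, "C": 34}

# Placeholder trace records; only the dataset membership matters here.
traces = [
    {"dataset": name, "trace_id": f"{name}{i:03d}"}
    for name, count in TRACE_COUNTS.items()
    for i in range(count)
]

def split_by_dataset(traces, train_sets, test_sets):
    """Assign whole datasets to train or test so no dataset appears on both sides."""
    train_sets, test_sets = set(train_sets), set(test_sets)
    if train_sets & test_sets:
        raise ValueError("a dataset cannot appear in both train and test")
    train = [t for t in traces if t["dataset"] in train_sets]
    test = [t for t in traces if t["dataset"] in test_sets]
    return train, test

train_ab, test_c = split_by_dataset(traces, {"A", "B"}, {"C"})
train_b, test_ac = split_by_dataset(traces, {"B"}, {"A", "C"})
print(len(train_ab), len(test_c), len(train_b), len(test_ac))  # prints: 84 34 20 98
```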

Performance metrics included overall accuracy, balanced accuracy, class-wise sensitivity and specificity, F1-macro, and AUC based on ROC curves. Confusion matrices visualized detailed classification performance.
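
The gap between overall and balanced accuracy reported above is typical of imbalanced classes. The following sketch computes both, plus F1-macro, on a made-up two-class example (the labels and predictions are illustrative, not taken from the paper):

```python
def class_counts(y_true, y_pred, cls):
    """True positives, false positives, and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    return tp, fp, fn

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (sensitivity)."""
    recalls = []
    for cls in sorted(set(y_true)):
        tp, _, fn = class_counts(y_true, y_pred, cls)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)

def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true)):
        tp, fp, fn = class_counts(y_true, y_pred, cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Two VOID events (one missed) and eight non-VOID events (all correct).
y_true = ["VOID"] * 2 + ["NON_VOID"] * 8
y_pred = ["VOID", "NON_VOID"] + ["NON_VOID"] * 8
print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred), f1_macro(y_true, y_pred))
```

Here overall accuracy is 0.90 while balanced accuracy is 0.75, because the minority class is detected only half the time.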

Permutation Feature Importance (PFI) assessed each feature's contribution by measuring the drop in accuracy when that feature's values were randomly permuted, confirming that multiple wavelet-based features affect performance.
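
The idea behind PFI can be shown with a toy model that depends on only one of two features. The model, data, repeat count, and seed below are all hypothetical; the sketch illustrates the general technique rather than the authors' exact procedure.

```python
import random

def accuracy(model, X, y):
    return sum(1 for row, label in zip(X, y) if model(row) == label) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy classifier that looks only at feature 0; feature 1 is ignored.
def toy_model(row):
    return "VOID" if row[0] > 0.5 else "NON_VOID"

X = [[i / 10, ((i * 7) % 10) / 10] for i in range(10)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling the ignored feature leaves accuracy unchanged, so its importance is zero, while shuffling the feature the model relies on produces a positive drop.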

The authors also performed a proof-of-concept test applying models trained on invasive UDS datasets directly to 24-hour wireless catheter-free ambulatory bladder pressure data from one patient, without preprocessing. Physiological event predictions matched patient-reported symptoms, demonstrating transfer to real-world ambulatory settings.

Besides the primary two-stage MLP approach, two alternative architectures were assessed: a cascaded two-stage MLP producing unified three-class output, and a single-stage MLP directly classifying three classes. These helped compare hierarchical vs flat classification effects. The two-stage approach performed best in binary subtasks but error compounding limited three-class gains.

Overall, this methodology rigorously integrates domain knowledge in feature design (wavelet analysis), event segmentation, hierarchical MLP modeling, multiple independent UDS datasets for generalization, thorough evaluation metrics, and real-world ambulatory validation, enabling robust automated urological event detection from single-channel Pves.

Technical innovations

Use of a two-stage multilayer perceptron framework hierarchically classifying VOID vs non-VOID, then ABD vs DO, addressing class imbalance and improving fine-grained event discrimination.
Extraction of 55 statistical features combining discrete wavelet transform coefficients and cross-correlations from short 0.8-second Pves segments, enabling effective time-frequency characterization of nonstationary bladder pressure signals.
Aggregation of segment-level features by median over consecutive same-class segments to form stable event-level feature representations, reducing noise and enhancing classification.
Extensive external validation strategy including cross-dataset training/testing and real-world proof-of-concept evaluation on wireless catheter-free ambulatory Pves data demonstrating model generalizability and clinical applicability.

Datasets

Dataset A — 64 female neurogenic bladder UDS traces — Cleveland VA and Cleveland Clinic
Dataset B — 20 female overactive bladder UDS traces — Cleveland VA and Cleveland Clinic
Dataset C — 34 male neurogenic bladder SCI UDS traces — Cleveland VA and Cleveland Clinic
UM ambulatory dataset — 24-hour wireless catheter-free Pves recording from 1 patient — University of Pittsburgh proprietary device

Baselines vs proposed

Single-stage MLP (ABD, DO, VOID): accuracy = 74% vs two-stage MLP (cascaded): 77% on Dataset C
Stage 1 MLP (VOID vs non-VOID): accuracy = 84%, AUC = 0.85 on Dataset C
Stage 2 MLP (ABD vs DO): accuracy = 90%, AUC = 0.87 on Dataset C
Single-stage MLP specificity for DO: 70.3% using single-channel Pves only, vs up to 92.9% reported for multi-channel methods (prior work [19])
Performance stable across various train-test splits (60-40%, 70-30%, 80-20%) and dataset combinations

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.21878.

Fig 1: Three representative Pves (vesical pressure) signals manually annotated for voiding contractions (Void, black), abdominal
Fig 2: Overview of the data processing and classification
Fig 3: ROC curves for the full classification pipeline. Panels (a–c) show performance under different training and testing
Fig 4: Permutation Feature Importance (PFI) for the MLP
Fig 5: Actual (A) and predicted (P) event annotations over bladder pressure (Pves) traces from six subjects in external validation
Fig 6: Example of proof-of-concept evaluation using ambulatory Pves recordings from the UM device. Panels a–d show
Fig 7 (page 15).
Fig 8 (page 16).

Limitations

DO event classification accuracy is lower than ABD and VOID due to fewer DO samples and greater inter-dataset variability, limiting generalization.
Manual annotation of events introduces label noise and interobserver variability; some model 'errors' correspond to potential manual labeling inconsistencies.
Proof-of-concept ambulatory evaluation done on only one patient, limiting conclusions about real-world performance.
No explicit robustness tests against sensor artifacts, noise, or signal dropouts presented.
Model relies on engineered wavelet features and an MLP architecture; deep models operating on raw Pves signals were not evaluated.
Limited patient diversity with specific neurogenic and overactive bladder populations; broader demographics not tested.

Open questions / follow-ons

How does the model perform across a larger, more heterogeneous patient cohort with diverse bladder pathologies?
Can the two-stage MLP framework be improved or replaced by end-to-end deep learning models operating on raw Pves signals for better feature learning?
What are the impacts of real-world ambulatory noise and motion artifacts on classification robustness and how can these be mitigated?
How can multi-modal data (e.g., patient-reported symptoms, bladder volume) be integrated with Pves to improve detection accuracy?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this paper highlights an approach to automated event detection from noisy, single-channel time series data using interpretable, multi-stage machine learning classification combined with engineered frequency-domain features. The hierarchical classification paradigm focusing first on broad binary distinctions before fine-grained sub-classifications could inspire analogous designs in multi-tier bot detection pipelines, where initial filtering removes obvious benign or malicious cases followed by refined classification in suspicious batches. The use of wavelet transform to extract multi-scale features from non-stationary signals demonstrates how domain-specific signal processing can fuel classifier performance and interpretability, a principle potentially applicable in behavioral biometrics or mouse/touch dynamics analysis for distinguishing bots and humans.

The paper also illustrates rigorous external validation strategies including cross-dataset testing and real-world deployment on different devices, emphasizing robust generalization beyond single lab/curated datasets—an important lesson when designing bot defenses meant to operate reliably under adversarial or shifted distributions. Finally, the interpretability via permutation feature importance offers a method to audit ML decisions, critical for transparency in security-sensitive automated decision systems.

Cite

bibtex

@article{arxiv2605_21878,
  title={Automated Detection of Urological Events in Bladder Pressure Signals with a Two-Stage Machine Learning Framework Validated on External Datasets},
  author={Hassaan A. Bukhari and Vikram Abbaraju and Jay Patel and Becky Clarkson and Shachi Tyagi and Margot S. Damaser and Steve J. A. Majerus},
  journal={arXiv preprint arXiv:2605.21878},
  year={2026},
  url={https://arxiv.org/abs/2605.21878}
}
