QUACK! Making the (Rubber) Ducky Talk: A Systematic Study of Keystroke Dynamics for HID Injection Detection
Source: arXiv:2604.15845 · Published 2026-04-17 · By Alessandro Lotto, Francesco Marchiori, Mauro Conti
TL;DR
This paper studies a problem that is adjacent to keystroke biometrics but distinct in a crucial way: not “who is typing?” but “is this a human or an injected HID device?” The authors argue that prior keystroke-dynamics defenses mostly solve user authentication, which does not map cleanly to USB Rubber Ducky-style attacks because the attacker only needs to look human, not impersonate a specific enrolled user. They therefore frame HID injection detection as a privacy-preserving, text-independent, human-vs-machine discrimination task using only timing features (hold time and flight time).
The main contribution is a systematic evaluation pipeline that compares three attacker families of increasing sophistication: naive/randomized timing generators, context-aware statistical generators, and GAN-based adaptive generators. Using the 18,816-session keystroke dataset from González et al. as the human source, they synthesize matched human/machine sessions and train lightweight detectors (RF, SVM, CNN1D, LSTM/BiLSTM) under single-generator and mixed-generator regimes. The headline result is that detection can be made robust with lightweight timing-only models, and that robustness depends more on exposure to structurally diverse generator families than on training against a single “strong” generator. They also quantify a practical observation-window trade-off, reporting that around 70 keystrokes is an important operating point where ROC-AUC exceeds 0.9 across all generators for the RF detector in their setup.
Key findings
- The evaluation uses the González et al. keystroke dataset with 18,816 independent typing sessions as the human source, then constructs synthetic machine sessions by preserving virtual-key sequences and replacing timing features (HT, FT) only.
- The authors evaluate three attacker families: Naive (mt19937, pcg64, philox; Emp-Pair; Cond-Bin), Statistical (Average, Uniform, Gaussian, Histogram, NS-Hist), and Adaptive (unconditional/conditional WGAN-GP).
- In single-generator training, the authors report that at 70 keystrokes the RF detector achieves ROC-AUC > 0.9 across all generators in the Statistical and Adaptive datasets, motivating 70 as a practical timeliness/reliability trade-off.
- Cross-generator results show clear clustering within attacker families: PRNG-based Naive generators generalize strongly to each other, while Cond-Bin and Emp-Pair form a separate mutual-transfer cluster with limited cross-cluster transfer.
- The paper reports strong asymmetry in cross-family generalization: detectors trained on Statistical generators transfer well to PRNG generators, but not to Cond-Bin or Emp-Pair; detectors trained on Naive generators do not reliably transfer to Statistical generators.
- Histogram and NS-Hist are singled out as the most transferable Statistical generators, showing stable transferability within their family and good generalization even against GAN-generated Adaptive data.
- GAN-based training does not monotonically improve evasion: detectors trained on GAN-generated data do not consistently generalize back to simpler generators, which supports the authors’ claim that diversity of attacker families matters more than adversarial model sophistication.
- The paper states that the experimental pipeline and source code are publicly released in a GitHub repository for reproducibility.
Threat model
The adversary has physical access to the target machine and injects keystrokes through a malicious HID device such as a USB Rubber Ducky. They can fully control injection timing, including slowing down input, adding random delays, using context-aware statistical generation, or training GAN-based timing synthesizers. They do not need to impersonate a specific enrolled user; they only need to make machine-generated input look human enough to evade a user-agnostic detector. The defender sees OS-level keystroke timing but does not rely on content, device fingerprinting, or hardware changes, and should treat all real human users as legitimate despite natural variation.
Methodology — deep read
The threat model is a physical-access HID injection attack: the adversary plugs in a malicious keyboard-emulating device such as a USB Rubber Ducky and injects arbitrary keystroke sequences into a workstation. The defender is assumed to observe keystroke events at the OS input layer in real or near-real time, but only the timing side channels—hold time (HT) and flight time (FT)—not text content. This is important because the paper explicitly rejects user-centric authentication framing: the attacker does not need to impersonate a particular enrolled user, only to evade a generic human-vs-machine detector. The defender is also assumed to have offline access to human typing data for training, but the detector must be user-agnostic at deployment and should not rely on device fingerprinting or hardware modifications.
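The two timing features can be made concrete with a small sketch. The definitions below (HT = release − press of the same key; FT = next press − current release) are common conventions and an assumption here, since the excerpt does not spell out the exact formulas the authors use:

```python
def timing_features(events):
    """Extract hold times (HT) and flight times (FT) from a key event stream.

    `events` is a list of (key, press_ts, release_ts) tuples in press order,
    timestamps in milliseconds. HT = release - press for each key;
    FT = next press - current release (may be negative when keys overlap).
    """
    hts = [release - press for _, press, release in events]
    fts = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return hts, fts

# Example: three key events, the third pressed before the second is released
events = [("q", 0, 85), ("u", 120, 210), ("a", 190, 290)]
hts, fts = timing_features(events)
# hts = [85, 90, 100]; fts = [35, -20] (negative FT = overlapping presses)
```

Negative flight times from overlapping key presses are one reason purely human timing is hard for a naive injector to fake.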
For data, they start from the publicly available dataset introduced by González et al., which contains 18,816 independent typing sessions. Each session includes virtual-key (VK) codes plus the timing features HT and FT. They preserve the original VK sequence to keep alignment between human and synthetic sessions, but they exclude VKs from detector input so the model cannot exploit semantic text patterns. Synthetic machine sessions are generated by replacing only HT and FT while keeping the same session structure. They organize synthetic data into three attacker families. The Naive dataset includes mt19937, pcg64, and philox PRNG-based generators that sample from empirical ranges, plus the Emp-Pair and Cond-Bin lightweight statistical generators. The Statistical dataset includes Average, Uniform, Gaussian, Histogram, and NS-Hist generators designed to capture higher-order and context-dependent timing structure. The Adaptive dataset includes unconditional and conditional WGAN-GP generators trained on human data only. The paper reports that the GANs are trained for 20,000 steps with a batch size of 250, with checkpoints saved every 5,000 steps; the selected checkpoint is chosen by checking Wasserstein-loss stabilization, HT/FT alignment, variance preservation, and the absence of mode collapse. Sessions are split 80/20 at the session level, with strict train/test separation to avoid temporal leakage.
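As an illustration of the Naive attacker family, here is a minimal sketch of a PRNG-based timing replacement, assuming uniform sampling within the empirical human range; the paper's actual generators may differ in detail. NumPy's default bit generator is PCG64, and MT19937 and Philox are also available, matching the PRNGs the paper names:

```python
import numpy as np

def naive_machine_session(human_hts, human_fts, seed=0):
    """Replace a human session's timings with PRNG-sampled ones, keeping the
    session length (and hence the VK-sequence alignment) intact.

    Sketch of a 'Naive' attacker: uniform sampling inside the empirical
    human range, with no per-key or sequential structure.
    """
    rng = np.random.default_rng(seed)  # PCG64; MT19937/Philox also available
    ht = rng.uniform(min(human_hts), max(human_hts), size=len(human_hts))
    ft = rng.uniform(min(human_fts), max(human_fts), size=len(human_fts))
    return ht, ft
```

A generator like this matches the marginal range of human timings but none of their sequential structure, which is exactly the gap the detectors are meant to exploit.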
The detector side is intentionally lightweight. They evaluate non-sequence models—Random Forest and SVM—and sequence models—1D CNN and LSTM variants, including BiLSTM. Inputs are fixed-length windows of timing features, not raw text. The methodology is split into three stages. First, single-generator training evaluates detectors trained on one synthetic generator at a time, across sequence lengths ranging initially from 10 to 1000 keystrokes, then narrowed to 10, 30, 50, 70, 100, and 200 after observing saturation. Second, cross-generator testing evaluates whether a detector trained on one generator transfers to others within the same family, across families, and from simpler to more complex attackers. Third, mixed-generator training tries to improve robustness by training on multiple generators at once, comparing balanced configurations (equal representation per generator) and unbalanced configurations that overweight the generators that appear to generalize best. This is the paper’s main methodological answer to the generalization question: rather than building a detector for each attacker type, train on a diverse subset and test whether that covers the space of likely evasive behaviors.
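A minimal sketch of the windowing and RF detector stage, under assumptions about details the excerpt leaves out (non-overlapping windows, HT/FT stacked per keystroke, zero-padded first FT); function and variable names are illustrative, not from the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

WINDOW = 70  # keystrokes per window, the paper's reported operating point

def to_windows(ht, ft, window=WINDOW):
    """Stack HT/FT into fixed-length, non-overlapping feature windows.
    FT has one fewer element than HT, so we pad it with a leading zero."""
    ft = np.concatenate([[0.0], ft])
    feats = np.stack([ht, ft], axis=1)            # (n_keys, 2)
    n = len(feats) // window
    return feats[: n * window].reshape(n, window * 2)

def train_detector(human_sessions, machine_sessions):
    """Train an RF on (ht, ft) session pairs; label 1 = injected input."""
    Xh = np.vstack([to_windows(h, f) for h, f in human_sessions])
    Xm = np.vstack([to_windows(h, f) for h, f in machine_sessions])
    X = np.vstack([Xh, Xm])
    y = np.concatenate([np.zeros(len(Xh)), np.ones(len(Xm))])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    return clf
```

Because the detector sees only timing windows, the same pipeline works regardless of what text the attacker injects.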
Evaluation is centered on ROC-AUC, with the same number of samples per test dataset to avoid majority-class bias. They also examine the trade-off between detection latency and reliability by varying input sequence length. The authors note that the unidirectional LSTM degraded as sequence length increased on the Naive dataset, so they switched to BiLSTM in later experiments; however, the strongest reported results in the text come from the RF, which achieved ROC-AUC > 0.9 at 70 keystrokes across the Statistical and Adaptive settings. Cross-generator evaluation is visually reported in Fig. 6, where they observe family-specific clusters and asymmetric transfer. The paper’s evaluation logic is essentially: if a detector can learn machine-vs-human structure rather than generator-specific artifacts, it should transfer across different synthetic attackers. Mixed training is then used to test whether a small, carefully chosen set of attacker families can cover a broader threat surface than brute-force training on a single sophisticated model. The source text indicates a public GitHub repository for the experimental pipeline, but the truncated excerpt does not provide frozen weights, exact hyperparameters for all detectors, or detailed statistical significance testing.
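The balanced ROC-AUC evaluation could look like the following sketch; the subsampling details are assumptions, since the excerpt states only that test datasets use the same number of samples per class:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def balanced_auc(clf, X_human, X_machine, seed=0):
    """Score ROC-AUC on an equal number of human and machine windows,
    mirroring the paper's balanced test sets to avoid majority-class bias."""
    rng = np.random.default_rng(seed)
    n = min(len(X_human), len(X_machine))
    Xh = X_human[rng.choice(len(X_human), n, replace=False)]
    Xm = X_machine[rng.choice(len(X_machine), n, replace=False)]
    X = np.vstack([Xh, Xm])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return roc_auc_score(y, clf.predict_proba(X)[:, 1])
```

Ranking by `predict_proba` rather than hard labels is what makes the metric threshold-free, which matters when the deployment threshold is tuned later against operational false-positive budgets.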
Technical innovations
- Recasts HID injection defense as user-agnostic human-vs-machine discrimination using only timing features, rather than enrollment-based behavioral authentication.
- Introduces a three-family attacker taxonomy for keystroke injection—Naive, Statistical, and Adaptive—and evaluates transferability across and within those families.
- Uses mixed-generator training to study whether robustness comes from attacker diversity rather than increasing generator sophistication.
- Quantifies the observation-window trade-off for continuous detection and identifies 70 keystrokes as a practical operating point in their experiments.
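The mixed-generator training idea above can be sketched as a dataset-composition step; the weights shown are hypothetical, chosen only to illustrate overweighting the generators (Histogram, NS-Hist) that the paper found most transferable:

```python
import numpy as np

def mix_generators(machine_windows, weights=None, total=10_000, seed=0):
    """Compose a mixed-generator machine training set.

    `machine_windows` maps generator name -> array of feature windows.
    With no weights, every generator gets equal representation (balanced);
    explicit weights overrepresent chosen generators (unbalanced).
    """
    rng = np.random.default_rng(seed)
    names = list(machine_windows)
    if weights is None:
        weights = {n: 1.0 for n in names}          # balanced configuration
    w = np.array([weights[n] for n in names], dtype=float)
    counts = np.round(total * w / w.sum()).astype(int)
    parts = [machine_windows[n][rng.choice(len(machine_windows[n]), c, replace=True)]
             for n, c in zip(names, counts)]
    return np.vstack(parts)

# Unbalanced (hypothetical weights): overweight the most transferable family
mixed = mix_generators(
    {"histogram": np.zeros((500, 140)), "ns_hist": np.ones((500, 140)),
     "mt19937": np.full((500, 140), 2.0)},
    weights={"histogram": 2, "ns_hist": 2, "mt19937": 1},
)
```

The human class stays fixed; only the machine class is recomposed, so balanced vs unbalanced configurations differ purely in attacker-family exposure.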
Datasets
- González et al. keystroke dataset — 18,816 independent typing sessions — public dataset
- Naive synthetic dataset — size not stated in excerpt — derived from González et al. human sessions
- Statistical synthetic dataset — size not stated in excerpt — derived from González et al. human sessions and within-subject synthetic sessions from González et al.
- Adaptive synthetic dataset — size not stated in excerpt — generated from human sessions using WGAN-GP models trained by the authors
Baselines vs proposed
- Random Forest vs SVM vs CNN1D vs LSTM/BiLSTM: the excerpt reports ROC-AUC curves, but does not provide all exact values in text; Fig. 3–5 show RF as the strongest consistently reported model and motivate focusing on RF at 70-keystroke windows.
- Unidirectional LSTM vs BiLSTM on Naive data: the unidirectional LSTM degrades as input length increases, while BiLSTM is adopted later; exact ROC-AUC values are only shown in Fig. 3 and not numerically stated in the excerpt.
- RF on Statistical/Adaptive datasets: ROC-AUC > 0.9 at 70 keystrokes across all generators, per the text accompanying Fig. 4 and Fig. 5.
- Cross-generator transfer within Naive family: PRNG generators generalize strongly among themselves, while Cond-Bin and Emp-Pair generalize mutually but weakly across clusters; exact numeric ROC-AUC values are shown in Fig. 6 but not textually enumerated in the excerpt.
- Cross-family transfer: detectors trained on Statistical generators generalize effectively to PRNG generators, but not reliably to Cond-Bin or Emp-Pair; detectors trained on Naive generators fail to transfer reliably to Statistical generators; the excerpt does not provide exact values.
- Adaptive-vs-Statistical transfer: Histogram and NS-Hist generators retain good generalization even against GAN-generated data, while detectors trained on GAN data do not consistently generalize back to simpler generators; exact numeric results are only in Fig. 6e/f.
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.15845.

Fig 1: System and Threat model. The attacker injects synthetic malicious keystrokes

Fig 6: Single-Generator training results for RF detector with 70 keystrokes.

Fig 7: Mixed-Generators training evaluation results.

Fig 5: ROC-AUC training curve on the Adaptive Dataset.

Fig 8: ROC-AUC vs keystroke size for balanced configurations, RF detector.

Fig 9: ROC-AUC vs keystroke size for unbalanced configurations, RF detector.

Fig 10: RF detector inference cost analysis.
Limitations
- The excerpt does not provide exact ROC-AUC numbers for most comparisons, so several results are qualitative in the accessible text rather than numerically reproducible from the excerpt alone.
- The synthetic machine sessions preserve virtual-key sequences while modifying only timing, which isolates timing realism but may underestimate attackers who also manipulate text structure or command composition.
- The paper evaluates only timing features (HT and FT); it does not test richer side channels, device fingerprints, or content-aware defenses, so the privacy/accuracy trade-off is only partially explored.
- The source text mentions GAN checkpoint selection criteria, but the full hyperparameter details for detector training (learning rates, regularization, tree depth, SVM kernel settings, CNN/LSTM architecture specifics) are not present in the excerpt.
- The evaluation appears to focus on held-out sessions from the same dataset distribution; the excerpt does not mention cross-device, cross-language, or cross-user distribution shift beyond generator transfer.
- No statistical significance tests or confidence intervals are described in the excerpt, so it is unclear how stable the reported differences are across random splits or seeds.
Open questions / follow-ons
- How well do these timing-only detectors hold up under true distribution shift, such as different keyboard hardware, OS input stacks, typing languages, or accessibility tools?
- Can an attacker evade the detector by optimizing directly against it, rather than against a proxy generator family, and how much does the mixed-training strategy help against white-box evasion?
- What is the minimum training set diversity needed to generalize across realistic HID attackers without overfitting to synthetic generator artifacts?
- Can the same timing-only framing detect other input injection modalities, such as mouse automation or touch-event spoofing?
Why it matters for bot defense
For bot-defense practitioners, the paper is useful less as a ready-made detector and more as a framing correction. It argues that the right question is not whether a sequence matches some user profile, but whether its timing structure looks human at all. That distinction matters in CAPTCHA-adjacent defenses because many real systems need continuous, low-friction checks on input streams without collecting text content or building per-user biometrics. A timing-only classifier over keystroke bursts could be used as one weak signal in a larger risk engine, especially for post-authentication command execution or high-risk form submissions.
The most actionable takeaway is the training lesson: diversity matters more than chasing the fanciest generator. If you want a detector that survives basic randomization plus more advanced synthetic timing, you should expose it to structurally different attacker families during development and test transfer across families, not just in-distribution performance. The other practical takeaway is the windowing trade-off. The paper’s 70-keystroke operating point suggests that early interception is feasible, but it also implies that short windows may be noisy. In production, that means you would likely combine this signal with other passive features and treat it as an early-warning indicator rather than a sole blocking decision.
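As a sketch of how this signal could sit inside a risk engine, the following hypothetical wrapper buffers the most recent keystrokes and emits a score only once a full window is available; all names and the 70-keystroke default are illustrative, and the feature layout must match whatever the classifier was trained on:

```python
from collections import deque
import numpy as np

class KeystrokeRiskSignal:
    """Maintain a rolling buffer of (HT, FT) pairs and emit a machine-likelihood
    score once enough keystrokes have accumulated. Intended as one weak signal
    in a larger risk engine, not a standalone blocking decision."""

    def __init__(self, clf, window=70):
        self.clf = clf
        self.window = window
        self.buf = deque(maxlen=window)   # oldest keystrokes roll off

    def push(self, ht, ft):
        """Feed one keystroke's timings; return a score or None if warming up."""
        self.buf.append((ht, ft))
        if len(self.buf) < self.window:
            return None                   # not enough evidence yet
        x = np.asarray(self.buf).reshape(1, -1)   # (1, 2 * window)
        return float(self.clf.predict_proba(x)[:, 1][0])
```

Emitting `None` during warm-up makes the short-window noise problem explicit: downstream logic must decide what to do before 70 keystrokes have been observed.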
Cite
@article{arxiv2604_15845,
  title={QUACK! Making the (Rubber) Ducky Talk: A Systematic Study of Keystroke Dynamics for HID Injection Detection},
  author={Alessandro Lotto and Francesco Marchiori and Mauro Conti},
  journal={arXiv preprint arXiv:2604.15845},
  year={2026},
  url={https://arxiv.org/abs/2604.15845}
}