
KVC-onGoing: Keystroke Verification Challenge

Source: arXiv:2412.20530 · Published 2024-12-29 · By Giuseppe Stragapede, Ruben Vera-Rodriguez, Ruben Tolosana, Aythami Morales, Ivan DeAndres-Tame, Naser Damer et al.

TL;DR

The Keystroke Verification Challenge - onGoing (KVC-onGoing) addresses the lack of standardized benchmarking protocols and large-scale, publicly accessible datasets for keystroke dynamics (KD) biometric verification. The challenge leverages the massive Aalto University Keystroke databases, containing tweet-length variable transcript text sequences from over 185,000 subjects across desktop and mobile platforms, simulating real-life typing conditions. This unprecedented scale and variability, combined with a rigorous, open-set experimental design and demographic annotations, enable meaningful comparisons of KD verification systems under a unified framework. The evaluation results demonstrate substantial advancements over prior state-of-the-art, achieving as low as 3.33% global Equal Error Rate (EER) on desktop and 3.61% on mobile benchmarks, with consistent improvements in False Non-Match Rates (FNMR) at 1% False Match Rate (FMR).

Beyond pure biometric performance, the challenge provides comprehensive fairness metrics broken down by demographic groups, revealing that age and gender do influence verification scores to a non-negligible degree in some settings. The ongoing nature of the challenge on the CodaLab platform facilitates continuous progress and reproducible evaluation by the research community. A detailed comparison of multiple submissions highlights the effectiveness of diverse deep learning architectures, including dual-branch CNN-RNN hybrids, pure CNNs with ArcFace loss, and Transformer-based models. KVC-onGoing thus sets a new standard dataset and protocol for keystroke verification research, emphasizing large-scale, unconstrained text input, demographic fairness, and rigorous open-set evaluation.

Key findings

  • Global Equal Error Rate (EER) as low as 3.33% on desktop keyboard verification and 3.61% on mobile keyboard verification tasks using the LSIA dual-branch CNN-RNN model.
  • False Non-Match Rate (FNMR) of 11.96% (desktop) and 17.44% (mobile) at a 1% False Match Rate (FMR) operating point, indicating robust verification under tight false-acceptance constraints.
  • The challenge's evaluation set comprises 15,000 desktop and 5,000 mobile subjects disjoint from the development set, supporting open-set evaluation.
  • Dual-branch architecture combining bidirectional GRUs and multi-layer 1D CNNs with temporal and channel attention mechanisms outperforms single-branch architectures.
  • ArcFace loss combined with a CNN architecture (VeriKVC team) achieves the second-best EER: 4.03% in the desktop scenario and 3.78% in the mobile scenario.
  • Triplet loss-based methods, while popular, generally exhibit higher EERs, such as Keystroke Wizards' models scoring 5.22% (desktop) and 5.83% (mobile).
  • Fairness metrics indicate demographic factors (age and gender) influence score distributions; subject-level thresholding improves performance compared to global thresholds.
  • Use of real, variable transcript text rather than fixed-text inputs introduces higher intra-class variability and typing errors, better simulating realistic usage.
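The global EER and FNMR@FMR figures above can be computed from pooled genuine and impostor score distributions. A minimal sketch (not the official challenge scorer), assuming higher scores mean "more likely genuine":

```python
import numpy as np

def eer_and_fnmr_at_fmr(genuine, impostor, fmr_target=0.01):
    """Global EER and FNMR at a fixed FMR from similarity scores
    (higher score = more likely genuine). Illustrative sketch only."""
    genuine = np.sort(np.asarray(genuine, dtype=float))
    impostor = np.asarray(impostor, dtype=float)
    # Sweep candidate thresholds over all observed scores.
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    fnmr = np.array([(genuine < t).mean() for t in thresholds])   # false rejects
    fmr = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    # EER: operating point where the two error rates cross.
    i = np.argmin(np.abs(fnmr - fmr))
    eer = (fnmr[i] + fmr[i]) / 2.0
    # FNMR at the loosest threshold keeping FMR at or below the target.
    ok = fmr <= fmr_target
    fnmr_at = fnmr[ok].min() if ok.any() else 1.0
    return eer, fnmr_at
```

With well-separated score distributions both quantities go to zero; in practice the per-threshold sweep is run over the challenge's full comparison lists.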

Threat model

The adversary is an impostor user attempting to be falsely accepted by the keystroke verification system by reproducing or mimicking the genuine user's keystroke dynamics. The attacker does not have access to hardware-level signals or the ability to tamper with the device input sensors, nor do they have access to enrollment session data beyond what is exposed publicly. The system aims to distinguish genuine from impostor sessions using timing and keycode features from realistic typing sequences.

Methodology — deep read

  1. Threat model & assumptions: The challenge assumes an attacker attempting to impersonate a genuine user by replicating their keystroke behavior. The adversary's knowledge is limited to publicly available data; they cannot manipulate the hardware or exploit hardware side channels. The primary goal is distinguishing genuine from impostor typing sessions in unconstrained typing tasks across desktop and mobile keyboards.

  2. Data: The challenge utilizes two large-scale public datasets from Aalto University: the Desktop Keystroke Database (~168,000 subjects) and the Mobile Keystroke Database (~60,000 subjects). Subjects typed sequences of variable English transcript sentences (tweet-length), with 15 sentences per session. For KVC-onGoing, only subjects completing all 15 sentences were included, yielding development and evaluation splits balanced demographically across six age groups and by gender. The development sets contain labeled age and gender; the mobile task's development set includes an additional unlabeled cohort to increase size. Raw data include key press and release timestamps and ASCII key codes. The datasets were preprocessed to extract timing features (Hold Time, Inter-Press Time, Inter-Key Time, Inter-Release Time), synthetic timing features, and per-key embeddings.

  3. Architecture/algorithm: Multiple neural architectures are benchmarked. The top-performing LSIA model adopts a dual-branch design combining bidirectional GRUs (512 units) capturing temporal dependencies and three 1D CNN blocks (filters 256, 512, 1024) with channel attention capturing spatial patterns. Both branches begin with temporal attention layers, and their outputs are concatenated and processed by dense layers to produce an embedding vector, normalized with L2 normalization. Distance metric learning with a specialized set-based loss minimizes global EER. Other models include VeriKVC's CNN with ArcFace loss, Keystroke Wizards' GRU and Transformer hybrids with triplet loss, U-CRISPER's Siamese GRU, and YYama's simplified transformer combined with CNN classifier. Many systems utilize positional encodings (e.g., Gaussian Range Encoding) and multi-head attention modules. The choice of features includes primary timing metrics plus ASCII codes and their derivatives.

  4. Training regime: The LSIA model is trained with batches containing data from 40 users and 15 samples per user (600 samples per batch). A curriculum-based sampling strategy pairs each user with its nearest neighbors. Batch normalization and dropout prevent overfitting. Training details vary per team: VeriKVC trains for 5000 epochs with the AdamW optimizer at learning rate 0.01 and cosine annealing; Keystroke Wizards use 160 epochs with triplet loss; U-CRISPER uses Optuna for hyperparameter tuning and trains for 200 epochs; YYama trains embedding models with the Adam optimizer and fixed learning rates for 1000-2000 epochs. Batch sizes range from 150 to 2048.

  5. Evaluation protocol: An open-set paradigm is used in which the development and evaluation subject sets are disjoint. Each subject has 5 enrollment sessions and 10 verification sessions, enabling 50 genuine comparisons and 20 impostor comparisons (split evenly between same and different demographic groups). Metrics include global and per-subject Equal Error Rate (EER), False Non-Match Rate (FNMR) at False Match Rate (FMR) thresholds of 0.1%, 1%, and 10%, Area Under the Receiver Operating Characteristic curve (AUC), accuracy, and Rank-1 identification rate. Fairness metrics such as the Fairness Discrepancy Rate (FDR), Inequity Rate, Gini Aggregation Rate (GARBE), and Skewed Impostor Rate (SIR) are computed to assess demographic performance disparities. Submissions are scored automatically on the CodaLab platform.

  6. Reproducibility: The challenge is hosted openly on CodaLab, enabling researchers to submit methods and benchmark results continuously. Data preprocessing scripts and lists of pairwise comparisons are provided. Most top teams provide code repositories. The large-scale Aalto datasets are publicly available but require a research agreement for use. Precise training seed or hardware configurations are not fully detailed but some GPU use is implied. The protocol includes clear instructions for data splits, evaluation, and metric computation.
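The four primary timing features named in step 2 can be derived directly from raw press/release timestamps. A minimal sketch, assuming millisecond timestamps and the usual keystroke-dynamics definitions (the challenge's exact conventions may differ):

```python
def timing_features(events):
    """Derive the four timing features from raw (keycode, press_ms,
    release_ms) events, one feature dict per consecutive key pair."""
    feats = []
    for (k1, p1, r1), (k2, p2, r2) in zip(events, events[1:]):
        feats.append({
            "keys": (k1, k2),
            "hold_time": r1 - p1,            # how long the key was held down
            "inter_press_time": p2 - p1,     # press-to-press latency
            "inter_key_time": p2 - r1,       # release-to-next-press (can be negative)
            "inter_release_time": r2 - r1,   # release-to-release latency
        })
    return feats
```

Negative inter-key times occur naturally when a fast typist presses the next key before releasing the previous one, which is one source of the intra-class variability the benchmark emphasizes.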
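Step 3 mentions channel attention in the CNN branch without specifying its form; a squeeze-and-excitation-style gating over a (time, channels) feature map is one plausible reading. The weights `w1`/`w2` below are illustrative placeholders, not the paper's parameters:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """SE-style channel attention over a (time, channels) feature map.
    The paper only states that the CNN branch uses channel attention;
    this squeeze-and-excitation form is an illustrative assumption."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    squeeze = x.mean(axis=0)                              # (C,) temporal average
    excite = sigmoid(np.maximum(squeeze @ w1, 0.0) @ w2)  # (C,) channel gates in (0, 1)
    return x * excite                                     # reweight each channel

rng = np.random.default_rng(0)
T, C, hidden = 50, 8, 4
x = rng.normal(size=(T, C))
w1 = rng.normal(size=(C, hidden))
w2 = rng.normal(size=(hidden, C))
y = channel_attention(x, w1, w2)
```

Because each gate lies in (0, 1), the module can only attenuate channels, letting the network emphasize the timing channels that discriminate best for a given sequence.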
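The step-4 batch composition (40 users times 15 samples = 600 samples per batch) can be sketched as follows; the curriculum-based nearest-neighbor pairing is replaced here by random user selection, a simplifying assumption:

```python
import random

def user_batches(samples_by_user, users_per_batch=40, samples_per_user=15, seed=0):
    """Yield batches of (user_id, sample) pairs mirroring the LSIA batch
    composition described in the text: 40 users x 15 samples = 600 entries.
    Users are drawn at random rather than by curriculum, for simplicity."""
    rng = random.Random(seed)
    # Only users with enough samples can contribute a full set.
    eligible = [u for u, s in samples_by_user.items() if len(s) >= samples_per_user]
    rng.shuffle(eligible)
    for i in range(0, len(eligible) - users_per_batch + 1, users_per_batch):
        batch = []
        for user in eligible[i:i + users_per_batch]:
            for sample in rng.sample(samples_by_user[user], samples_per_user):
                batch.append((user, sample))
        yield batch
```

In the actual training loop described above, the user-selection step would instead favor users whose embeddings are currently close, which supplies hard negatives for the set-based loss.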
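Of the fairness metrics in step 5, the Fairness Discrepancy Rate admits a compact sketch under one common formulation: one minus a weighted sum of the largest pairwise FMR and FNMR gaps across demographic groups at a fixed threshold. The challenge's exact weighting and aggregation may differ:

```python
import itertools
import numpy as np

def fairness_discrepancy_rate(scores_by_group, threshold, alpha=0.5):
    """FDR = 1 - (alpha * A + (1 - alpha) * B), where A and B are the largest
    pairwise gaps in FMR and FNMR across demographic groups at `threshold`.
    Sketch of one common formulation, not the official challenge scorer.
    scores_by_group maps group -> (genuine_scores, impostor_scores)."""
    fmr, fnmr = {}, {}
    for g, (gen, imp) in scores_by_group.items():
        fnmr[g] = float(np.mean(np.asarray(gen) < threshold))   # false rejects
        fmr[g] = float(np.mean(np.asarray(imp) >= threshold))   # false accepts
    a = max(abs(fmr[i] - fmr[j]) for i, j in itertools.combinations(fmr, 2))
    b = max(abs(fnmr[i] - fnmr[j]) for i, j in itertools.combinations(fnmr, 2))
    return 1.0 - (alpha * a + (1.0 - alpha) * b)
```

A value of 1.0 means all groups see identical error rates at that operating point; lower values flag the kind of age- and gender-dependent disparities the challenge reports.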

Example end-to-end: For a given subject in the desktop evaluation set with 5 enrollment and 10 verification sessions, the LSIA dual-branch model processes each session's feature vectors (key ASCII embeddings plus 4 timing features) through convolutional and recurrent modules with attention, producing an embedding vector normalized onto the unit hypersphere. Euclidean distances between enrollment and verification embeddings yield genuine and impostor scores, and averaging over the enrollment sessions stabilizes the similarity score. Scores are compared against a global threshold to produce verification decisions. Metrics such as EER are computed over all subjects' genuine/impostor scores, with fairness metrics calculated across age and gender groups to reveal biases.
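The scoring step of the walkthrough above can be sketched as follows, with the embeddings and threshold as illustrative inputs:

```python
import numpy as np

def verify(enroll_embs, probe_emb, threshold):
    """Score one verification attempt against a subject's enrollment set:
    average Euclidean distance from the probe embedding to each of the
    (L2-normalised) enrollment embeddings, accepted if below threshold.
    Illustrative sketch of the scoring step described above."""
    enroll = np.asarray(enroll_embs, dtype=float)
    probe = np.asarray(probe_emb, dtype=float)
    dists = np.linalg.norm(enroll - probe, axis=1)  # one distance per session
    score = dists.mean()                            # averaging stabilizes the score
    return score, bool(score < threshold)
```

The same routine run with a per-subject threshold instead of a global one corresponds to the subject-level thresholding that the key findings report as an improvement.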

Technical innovations

  • Introduction of a set-based loss extending SetMargin loss to sets of sets with an additive penalty for embedding compactness across user classes, directly minimizing Equal Error Rate.
  • A dual-branch model combining bidirectional GRU layers for temporal decision-process modeling and multi-layer 1D CNNs with channel attention for spatial pattern learning in keystroke timing signals.
  • Provision of an unprecedented large-scale keystroke verification benchmark with over 185,000 subjects and variable transcript text in unconstrained free-text typing scenarios across desktop and mobile.
  • Comprehensive evaluation framework incorporating both biometric verification metrics and demographic fairness indicators, enabling rigorous analysis of demographic biases.
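The exact set-of-sets loss is not reproduced here; the following toy sketch echoes its ingredients (intra-set pull, inter-set margin push, and an additive compactness penalty) without claiming to match the paper's formulation:

```python
import numpy as np

def set_margin_loss(emb_a, emb_b, margin=1.0, compactness_weight=0.1):
    """Toy SetMargin-style loss for two users' embedding sets of shape (n, d).
    Pulls each user's embeddings together, pushes the two sets apart by a
    margin, and adds a compactness penalty, echoing (not reproducing) the
    set-of-sets loss described above."""
    def mean_pairwise(a, b):
        # Average Euclidean distance between all rows of a and all rows of b.
        diffs = a[:, None, :] - b[None, :, :]
        return np.linalg.norm(diffs, axis=2).mean()
    intra = mean_pairwise(emb_a, emb_a) + mean_pairwise(emb_b, emb_b)
    inter = mean_pairwise(emb_a, emb_b)
    hinge = max(0.0, margin + intra / 2.0 - inter)  # separation with margin
    compact = compactness_weight * (emb_a.var(axis=0).sum() + emb_b.var(axis=0).sum())
    return hinge + compact
```

Tight, well-separated user sets drive the hinge to zero, which is the behavior that directly lowers global EER; overlapping sets keep the loss high.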

Datasets

  • Aalto Desktop Keystroke Database — ~168,000 subjects — Public dataset collected by Aalto University
  • Aalto Mobile Keystroke Database — ~60,000 subjects — Public dataset collected by Aalto University

Baselines vs proposed

  • LSIA dual-branch CNN+RNN with set-based loss: Global EER = 3.33% (desktop), 3.61% (mobile) vs VeriKVC CNN + ArcFace: 4.03% (desktop), 3.78% (mobile)
  • Keystroke Wizards GRU+Transformer (triplet loss): 5.22% (desktop), 5.83% (mobile) vs LSIA 3.33% (desktop), 3.61% (mobile)
  • U-CRISPER GRU Siamese (triplet loss): 6.19% (desktop), 8.76% (mobile) vs LSIA 3.33% (desktop), 3.61% (mobile)
  • YYama Transformer+CNN (contrastive + cross-entropy): 6.41% (desktop), 4.16% (mobile) vs LSIA 3.33% (desktop), 3.61% (mobile)
  • BioSense CNN with attention (cross-entropy): 10.85% (desktop), 11.83% (mobile) vs LSIA 3.33% (desktop), 3.61% (mobile)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2412.20530.

Fig 3

Fig 3: Heat maps (normal and binarized) of the impostor scores of the YYama system with the highest SIRa (top) and SIRg (bottom)

Fig 2 (page 13) · Fig 3 (page 13) · Fig 4 (page 13)

Limitations

  • While extremely large, the datasets are drawn from a single global web-based typing platform, which may limit representativeness for wider keyboard or language contexts.
  • The unconstrained transcript text scenario introduces higher intra-class variability which complicates modeling, but fixed-text scenarios remain unexplored in this benchmark.
  • Demographic fairness analyses reveal some bias remains unmitigated, especially related to age and gender, but mitigation strategies were not tested.
  • Training details such as exact hardware specifications, random seeds, and hyperparameter tuning strategies are partially opaque, hindering full reproducibility.
  • The protocols use session-level comparisons but do not address real-time continuous authentication or adversarial imitation attacks explicitly.
  • No explicit robustness evaluation against spoofing or active attacker manipulations was performed.

Open questions / follow-ons

  • How can demographic biases identified in verification scores be effectively mitigated without reducing overall system performance?
  • Can continuous or multi-modal behavioral biometrics incorporating keystroke dynamics improve robustness in unconstrained text entry scenarios?
  • How resilient are current neural KD models to sophisticated adversarial attacks or spoofing attempts?
  • What are the trade-offs between fixed-text and free-text scenarios for practical deployment in bot defense or continuous user verification?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work underscores the potential of keystroke dynamics as a transparent, hardware-agnostic behavioral biometric layer that can enhance user authentication and suspicious activity detection. The availability of a large-scale standardized benchmark with realistic, variable-text input supports deploying verification approaches that generalize well to diverse user populations and typing tasks, including on mobile devices. The new state-of-the-art in general verification accuracy (around 3.3-3.6% EER) suggests keystroke biometrics could be effectively integrated as a low-friction continuous authentication or fraud-detection signal in CAPTCHA workflows or risk-analysis systems. However, the demonstrated demographic biases highlight the need for careful fairness auditing and possibly per-user thresholding or adaptive models to prevent disproportionate false rejects or accepts among demographic groups. The community-accessible framework allows bot-defense researchers to rigorously compare novel algorithms on a realistic real-world dataset, accelerating progress toward robust, privacy-sensitive typing-behavior authentication solutions.

Cite

```bibtex
@article{arxiv2412_20530,
  title={KVC-onGoing: Keystroke Verification Challenge},
  author={Giuseppe Stragapede and Ruben Vera-Rodriguez and Ruben Tolosana and Aythami Morales and Ivan DeAndres-Tame and Naser Damer and Julian Fierrez and Javier Ortega-Garcia and Alejandro Acien and Nahuel Gonzalez and Andrei Shadrikov and Dmitrii Gordin and Leon Schmitt and Daniel Wimmer and Christoph Großmann and Joerdis Krieger and Florian Heinz and Ron Krestel and Christoffer Mayer and Simon Haberl and Helena Gschrey and Yosuke Yamagishi and Sanjay Saha and Sanka Rasnayaka and Sandareka Wickramanayake and Terence Sim and Weronika Gutfeter and Adam Baran and Mateusz Krzysztoń and Przemysław Jaskóła},
  journal={arXiv preprint arXiv:2412.20530},
  year={2024},
  url={https://arxiv.org/abs/2412.20530}
}
```

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution