
Keystroke Dynamics Against Academic Dishonesty in the Age of LLMs

Source: arXiv:2406.15335 · Published 2024-06-21 · By Debnath Kundu, Atharva Mehta, Rajesh Kumar, Naman Lal, Avinash Anand, Apoorv Singh, and Rajiv Ratn Shah

TL;DR

This paper addresses the growing challenge of detecting AI-assisted academic dishonesty in online exams and assignments, where traditional plagiarism detectors fail against paraphrased or AI-generated text. The authors propose leveraging keystroke dynamics — the detailed timing and rhythm of typing behavior — as a behavioral biometric to distinguish bona fide (human-only) text composition from AI-assisted writing. To this end, they collected a novel dataset capturing keystroke patterns from 40 students performing writing tasks both independently and with generative-AI help, and additionally evaluated on two publicly available keystroke datasets (SBU and Buffalo).

The detection model is based on an enhanced Long Short-Term Memory (LSTM) architecture called TypeNet, augmented with a Siamese network structure to compare pairs of keystroke sequences. The system was evaluated across varied scenarios including user-specific vs user-agnostic, keyboard-specific vs keyboard-agnostic, context-specific vs context-agnostic, and dataset-specific vs dataset-agnostic settings. Performance reached accuracies from approximately 75% to 86% in most condition-specific scenarios, but dropped to roughly 52%–80% under context- and user-agnostic settings. The results demonstrate measurable differences in keystroke dynamics between genuine and AI-assisted writing, suggesting behavioral biometrics can supplement traditional content-based plagiarism systems to improve detection reliability in academic integrity enforcement.

Key findings

  • Condition-specific accuracy ranges from 74.98% (keyboard-specific K3) up to 85.72% on the Buffalo dataset (dataset-specific).
  • User-specific modeling achieves 81.86% accuracy and 81.85% F1 score with FAR of 23.71% and FRR of 26.24%, outperforming user-agnostic models (accuracy ~63.56%–66.54%).
  • Keyboard-agnostic scenarios maintain stable accuracy between 78.11% and 80.54% with FAR and FRR below 30%, indicating robustness across different keyboard types.
  • Context-agnostic settings cause accuracy drops to 70.21%–78.67%, with FAR rising to 39.65%, highlighting the difficulty of generalizing across topics.
  • Dataset-agnostic evaluation shows significant performance degradation, with accuracies as low as 52.24% and FAR/FRR above 40%, indicating limitations in cross-dataset generalization.
  • The model benefits from training on combined datasets (e.g., SBU + Buffalo), reaching 73.23% accuracy when tested on the Proposed dataset.
  • The modified TypeNet pairs a Siamese LSTM network with a cosine-similarity-based loss to better distinguish assisted from bona fide typing patterns.
  • Excluding sequences with over 20% Shift-key usage, as well as very short sequences, improves training data quality.
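The last filtering rule can be sketched as a simple pre-processing pass. This is a minimal illustration, not the authors' code: the (keycode, key-down, key-up) tuple layout, the minimum length of 25 keystrokes (the paper's shortest sequence length M), and keycode 16 for Shift are all assumptions.

```python
def filter_sequences(sequences, min_len=25, max_shift_ratio=0.20, shift_code=16):
    """Keep only sequences that are long enough and have limited Shift usage.

    `sequences` is assumed to be a list of keystroke sequences, each a list
    of (keycode, t_down, t_up) tuples in typing order.
    """
    kept = []
    for seq in sequences:
        if len(seq) < min_len:
            continue  # very short sequences carry too little rhythm signal
        shift_ratio = sum(1 for key, _, _ in seq if key == shift_code) / len(seq)
        if shift_ratio > max_shift_ratio:
            continue  # heavy Shift usage distorts timing statistics
        kept.append(seq)
    return kept
```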

Threat model

The adversary is a student who may use generative AI tools (e.g., ChatGPT) during online exams or assignments to produce task responses, attempting to evade plagiarism detection with plausible AI-assisted text. The defender has access to detailed keystroke timing data during submission but does not assume direct content-overlap detection. The adversary is not assumed to manipulate or spoof the keystroke capture mechanism at a low hardware level, nor to simulate natural keystroke patterns perfectly within the detection window.

Methodology — deep read

  1. Threat Model & Assumptions: The adversary is a student potentially using generative AI tools (e.g., ChatGPT) to assist in writing exam or assignment responses. The defender has access to precise keystroke timing data during text production but does not rely on content similarity alone. The adversary cannot manipulate or spoof typing biometrics at a low level (e.g., hardware-based emulation) but may attempt to mimic natural writing.

  2. Data: Three datasets were used: (a) the SBU dataset, with 196 users performing truthful and deceptive writing in reviews and social essays; (b) the Buffalo dataset, with 148 users transcribing fixed texts across 4 keyboards; (c) the Proposed dataset, collected from 40 STEM-discipline university students typing responses to opinion- and fact-based questions in two sessions—once unaided, once while allowed to use generative AI. Keystroke events (key-down/up timestamps) are recorded with ±200 microsecond precision. The Proposed dataset contains responses structured to simulate academic exam conditions with controlled cognitive loads.
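From raw key-down/key-up timestamps one can derive the hold and flight times standard in keystroke-dynamics work. This is a sketch under the assumption that events arrive as (keycode, t_down, t_up) tuples; note the paper itself feeds normalized timestamps, keycodes, and key actions to the model rather than these derived features.

```python
import numpy as np

def keystroke_features(events):
    """Derive per-key hold times and inter-key flight times (seconds).

    `events` is an assumed layout: a list of (keycode, t_down, t_up)
    tuples, one per keystroke.
    """
    events = sorted(events, key=lambda e: e[1])  # order by key-down time
    # Hold time: how long each key stays pressed.
    hold = np.array([t_up - t_down for _, t_down, t_up in events])
    # Flight time: gap between one key's release and the next key's press.
    flight = np.array([events[i + 1][1] - events[i][2]
                       for i in range(len(events) - 1)])
    return hold, flight
```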

  3. Architecture & Algorithm: The model is based on the LSTM TypeNet architecture adapted with a Siamese network to learn similarities and differences between pairs of keystroke sequences (free vs. fixed text). Each sequence passes through two LSTM layers (128-unit outputs), batch norm, tanh activations, dropout (rate 0.5), and fully connected layers mapping to 128D embeddings. Instead of Euclidean distance, cosine similarity is used in the loss function, optimized with binary cross-entropy to classify pairs as same-class (bona fide) or different-class (assisted). Input features include normalized timestamps, keycodes, and key actions.
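The cosine-similarity-based pair loss can be sketched as follows. The mapping of cosine similarity into a probability is an assumption: the paper only states that cosine similarity replaces Euclidean distance inside a binary cross-entropy loss, not the exact form of the mapping.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_loss(emb_a, emb_b, same_class, eps=1e-7):
    """Binary cross-entropy on the cosine similarity of a Siamese pair.

    Cosine similarity in [-1, 1] is linearly rescaled to a (0, 1)
    "same class" probability -- an assumed mapping -- and scored
    against the pair label with standard BCE.
    """
    p = (cosine_similarity(emb_a, emb_b) + 1.0) / 2.0
    p = min(max(p, eps), 1.0 - eps)  # clip away from 0/1 for log stability
    y = 1.0 if same_class else 0.0
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

Identical embeddings with a same-class label yield near-zero loss, while dissimilar embeddings labeled same-class are penalized, pulling bona fide pairs together in the 128D embedding space.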

  4. Training Regime: Sequence length M varied between 25 and 500 keystrokes; batch sizes ranged from 32 to 512. Training stabilized between 50 and 100 epochs using the Adam optimizer with learning rate 0.0001–0.005. Training data were balanced across pair labels to reduce bias, and L2 regularization was applied to prevent overfitting.
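The pair-label balancing step might look like the following sketch. The 50/50 same-class/cross-class split mirrors the stated balancing; the sampling scheme itself is an assumption.

```python
import random

def make_balanced_pairs(bona_fide, assisted, n_pairs, seed=0):
    """Sample an equal number of same-class and cross-class sequence pairs
    for Siamese training. Labels: 1 = same class, 0 = different class."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs // 2):
        pool = rng.choice([bona_fide, assisted])  # pick a class at random
        pairs.append((rng.choice(pool), rng.choice(pool), 1))       # same-class
        pairs.append((rng.choice(bona_fide), rng.choice(assisted), 0))  # cross-class
    rng.shuffle(pairs)
    return pairs
```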

  5. Evaluation Protocol: The model was evaluated under diverse scenarios: user-specific (train/test on the same users' sequences, split 80/20), user-agnostic (disjoint user sets), keyboard-specific and keyboard-agnostic (training/testing on same vs different keyboard models), context-specific and context-agnostic (training/testing on same or different typing topics), and dataset-specific and dataset-agnostic (train/test on same vs different datasets). Metrics include Accuracy, F1-score, False Acceptance Rate (FAR), and False Rejection Rate (FRR). The decision threshold was set at the Equal Error Rate (EER) point from ROC analysis.
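The EER-based threshold selection can be sketched as a sweep over candidate thresholds; this is one plausible reading of the ROC-based procedure, not the authors' implementation.

```python
import numpy as np

def eer_threshold(scores_same, scores_diff):
    """Find the decision threshold where FAR is closest to FRR (the EER point).

    `scores_same` are similarity scores for genuine (same-class) pairs and
    `scores_diff` for impostor (cross-class) pairs; higher means "more
    likely same class".
    """
    thresholds = np.sort(np.concatenate([scores_same, scores_diff]))
    best_t, best_gap = thresholds[0], float("inf")
    for t in thresholds:
        far = np.mean(scores_diff >= t)  # impostor pairs wrongly accepted
        frr = np.mean(scores_same < t)   # genuine pairs wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_t = gap, t
    return float(best_t)
```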

  6. Reproducibility: Code and dataset details are publicly available in the authors' GitHub repository. The SBU and Buffalo datasets are publicly known; the Proposed dataset was collected with IRB approval under carefully controlled stimuli and protocols.

Concrete example: In a keyboard-specific scenario for K0 (Lenovo keyboard), the model was trained on 80% of sequences and tested on 20%, achieving 84.64% accuracy and 83.45% F1, with FAR 25.38% and FRR 19.83%. Input keystroke sequences, normalized and transformed, were processed through the modified TypeNet Siamese network to extract embeddings compared via cosine similarity, with cross-entropy loss driving training. Performance was measured by classifying sequences as bona fide or AI-assisted based on typing-rhythm differences.

Technical innovations

  • Integration of a Siamese LSTM network architecture adapted from TypeNet to compare pairs of keystroke sequences representing bona fide and AI-assisted writing.
  • Use of cosine similarity in the loss function instead of Euclidean distance to better capture embedding proximity for classification.
  • Construction of a custom keystroke dynamics dataset specifically simulating AI-assisted academic writing under varying cognitive loads.
  • Evaluation across multi-faceted scenarios including user-, keyboard-, context-, and dataset-agnostic settings to reveal generalization challenges.

Datasets

  • SBU — 196 users — public dataset with truthful and deceptive writings in restaurant reviews and social issue essays
  • Buffalo — 148 users — public dataset with transcribed fixed texts and free responses across 4 keyboard types
  • Proposed — 40 university STEM students — custom dataset capturing bona fide and AI-assisted writing under controlled exam-like tasks

Baselines vs proposed

  • User-specific TypeNet model: Accuracy = 81.86% vs user-agnostic Accuracy = 63.56%–66.54%
  • Keyboard-specific models: Accuracy ranges 74.98% (K3) to 84.64% (K0) vs keyboard-agnostic Accuracy 78.11%–80.54%
  • Context-specific Accuracy 76.52%–80.24% vs context-agnostic 70.21%–78.67%
  • Dataset-specific Accuracy up to 85.72% (Buffalo) vs dataset-agnostic as low as 52.24%
  • Training on combined datasets improves accuracy to 73.23% vs single-dataset training results below 60% in some cases

Limitations

  • Significant drop in detection performance under user-agnostic and dataset-agnostic conditions reveals limited cross-user and cross-domain generalizability.
  • The Proposed dataset, although realistic, was relatively small (40 users) compared to public datasets, limiting broader conclusions.
  • Keystroke dynamics may be influenced by external factors such as fatigue, mood, or hardware variability not fully controlled in datasets.
  • No adversarial evaluation conducted to assess if sophisticated cheaters can mimic bona fide keystroke patterns to evade detection.
  • The model depends on precise, high-fidelity keystroke timing data that may not be universally available in all online exam platforms.
  • Scope limited to text production; it does not incorporate content analysis or multimodal signals, which could complement detection.

Open questions / follow-ons

  • How to improve generalizability of keystroke-dynamic models across diverse users and typing contexts to reduce performance drop in user- and dataset-agnostic scenarios?
  • Can adversarial training or detection-resistance mechanisms be developed to handle deliberate mimicry of bona fide typing behavior by sophisticated cheaters?
  • What is the impact of integrating multimodal signals (e.g., gaze, mouse movements, content features) with keystroke dynamics on detection accuracy?
  • How do external factors like device variability, user fatigue, or stress affect keystroke patterns and thus detection reliability over extended time?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, this work highlights the potential of behavioral biometrics beyond traditional content or challenge-response systems for distinguishing human versus AI-assisted behavior. Although focused on academic integrity, the approach of leveraging fine-grained typing rhythms via LSTM-based models could inspire enhanced bot-detection features in interactive text-input scenarios, such as login forms or high-value transaction platforms. The challenges identified around user and context generalization also underscore the difficulty of deploying behavioral detectors at scale without specialized training data per user or context. Integrating such dynamic keystroke biometrics can supplement standard CAPTCHAs by verifying natural human interaction patterns that are difficult for automated or AI-assisted bots to replicate. However, deployment would require addressing privacy, latency, and robustness considerations.

Cite

bibtex
@article{arxiv2406_15335,
  title={Keystroke Dynamics Against Academic Dishonesty in the Age of LLMs},
  author={Debnath Kundu and Atharva Mehta and Rajesh Kumar and Naman Lal and Avinash Anand and Apoorv Singh and Rajiv Ratn Shah},
  journal={arXiv preprint arXiv:2406.15335},
  year={2024},
  url={https://arxiv.org/abs/2406.15335}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution