Evaluating the long-term viability of eye-tracking for continuous authentication in virtual reality
Source: arXiv:2502.20359 · Published 2025-02-27 · By Sai Ganesh Grandhi, Saeed Samet
TL;DR
This paper asks a very specific question that many continuous-authentication systems sidestep: if gaze is a strong biometric today, does it still work months later in VR? The authors evaluate eye-tracking-based user identification on the longitudinal GazeBaseVR dataset, using three model families—Transformer Encoder, DenseNet, and XGBoost—under short-term (same-round) and long-term (Round 1 to Round 3, 26 months apart) conditions. The core contribution is not a new biometric signal, but an empirical stress test of temporal robustness.
The result is stark: models that look strong in the short term largely collapse under long-term drift. On Round 1 same-round splits, Transformer Encoder and DenseNet reach up to 97.77% and 97.09% accuracy respectively, while XGBoost is notably weaker. But when trained on Round 1 and evaluated on Round 3 after 26 months, accuracies fall to single digits for most settings, with the worst reported task/model at 1.78% (XGBoost on video viewing). The paper’s practical answer is periodic retraining with recent data; when Round 3 data are added back into training, performance rebounds to 93.25%–98.71% depending on model/task, suggesting that gaze-based continuous authentication is viable only if the system is continually refreshed.
Key findings
- On same-round Round 1 evaluation, Transformer Encoder reached 97.77% accuracy on the vergence (VRG) task and 97.20% on all tasks combined; DenseNet reached 97.09% on all tasks combined (Table 2).
- When trained on Round 1 and tested on Round 3 collected 26 months later, overall accuracy dropped to 3.01% for Transformer, 7.79% for DenseNet, and 4.85% for XGBoost (Table 3).
- The worst long-term result reported was XGBoost on video viewing (VID): 1.78% accuracy when trained on Round 1 and tested on Round 3 (Table 3).
- In the short-term setting, XGBoost underperformed neural models substantially: 79.31% on all tasks vs 97.20% for Transformer and 97.09% for DenseNet (Table 2).
- After retraining on Round 1 + Round 3 and testing on Round 3, performance recovered to 98.71% for DenseNet and 96.52% for Transformer on all tasks combined (Table 4).
- Task stability differed by behavior: TEX and VRG were relatively more robust long-term than VID, with long-term accuracies up to 12.11% (XGBoost on TEX) and 11.46% (XGBoost on VRG), while VID remained very poor across models (Table 3).
- The dataset used is GazeBaseVR: 407 participants, 5,020 recordings, 3 rounds over 26 months, sampled at 250 Hz with binocular 3D gaze vectors (Sections 2.1 and 3.1).
Threat model
The implicit adversary is an unauthorized person who has obtained a valid login session or is trying to remain undetected in a VR environment after initial authentication. The system is assumed to observe only eye-tracking streams from the headset and to distinguish legitimate users from impostors based on behavioral patterns. The paper does not model a sophisticated attacker who can manipulate the sensor, replay traces, or deliberately mimic gaze dynamics, and it does not evaluate privacy leakage from the biometric itself.
Methodology — deep read
Threat model and goal: the paper frames the problem as continuous authentication in VR, where the main security risk is session hijacking after a legitimate login. The attacker model is implicit rather than formal: an unauthorized person may gain access to a session after initial authentication, and the system should keep verifying the user based on behavior. The paper does not evaluate active spoofing, replay, adversarial mimicry, or sensor tampering. Instead, it focuses on whether a model trained on a user’s gaze patterns remains valid as those patterns drift over long time spans.
Data and splits: the authors use GazeBaseVR, a longitudinal VR eye-tracking dataset collected over three recording rounds spanning 26 months. The paper says the study began with 465 individuals and 58 were excluded, leaving 407 users for the experiments. Each round included two sessions about 30 minutes apart. The dataset has 5,020 recordings in total and covers five tasks: vergence (VRG), smooth pursuit (PUR), video viewing (VID), reading (TEX), and random saccades (RAN). The eye tracker sampled binocular gaze at a nominal 250 Hz and provides 3D unit vectors for each eye along with timestamps. The authors use features n, clx, cly, clz, crx, cry, crz, and add a user label column for multiclass classification. They do not report per-user balancing details, class counts per split, or any stratification logic beyond an 80:20 split for the short-term XGBoost experiment and train/test round pairings for the longitudinal experiments.
Preprocessing and representation: raw timestamps are converted from milliseconds to seconds. Spatial coordinates are min-max normalized to [-1, 1]. The continuous stream is segmented into fixed windows of 1,250 samples, which corresponds to 5 seconds at 250 Hz. Each window is reshaped into a 3D tensor organized as windows × coordinates × time points, with the intent of preserving sequential structure for the Transformer and DenseNet models. The paper does not specify whether windows overlap, how missing samples were handled, whether outliers were removed, or whether normalization statistics were computed globally or only on training data. Those omissions matter, because leakage in time-series normalization could inflate results, but the text does not clarify the protocol.
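A minimal sketch of that preprocessing, assuming non-overlapping windows and normalization statistics computed on the training split only (neither detail is specified in the paper); the array names, the six-channel layout, and the synthetic recording are ours, not the authors'.

```python
import numpy as np

SAMPLE_RATE_HZ = 250
WINDOW_SAMPLES = 1250          # 5 seconds at 250 Hz
GAZE_CHANNELS = ["clx", "cly", "clz", "crx", "cry", "crz"]  # binocular 3D gaze vectors

def minmax_to_unit_range(x, lo, hi):
    """Scale each column into [-1, 1] using precomputed (training-set) min/max."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def segment_recording(gaze, window=WINDOW_SAMPLES):
    """Cut a (time, channels) recording into non-overlapping (windows, channels, time) tensors."""
    n_windows = gaze.shape[0] // window
    gaze = gaze[: n_windows * window]
    # (windows, time, channels) -> (windows, channels, time), as described in the paper
    return gaze.reshape(n_windows, window, gaze.shape[1]).transpose(0, 2, 1)

# Example: one synthetic 60-second recording with the six gaze channels.
recording = np.random.randn(60 * SAMPLE_RATE_HZ, len(GAZE_CHANNELS)).astype(np.float32)
lo, hi = recording.min(axis=0), recording.max(axis=0)   # in practice: training-split statistics
windows = segment_recording(minmax_to_unit_range(recording, lo, hi))
print(windows.shape)  # (12, 6, 1250)
```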
Models and algorithms: three classifiers are compared. XGBoost uses flattened 2D input from the 3D gaze windows and is trained as a multi-class classifier with objective multi:softmax, learning rate 0.3, and max depth 6. The Transformer Encoder is the paper's proposed model: it uses dmodel=64, 4 attention heads, 2 encoder layers, positional encoding, multi-head self-attention, and feed-forward blocks. The paper presents the standard attention equation but not the exact input projection shape or whether the 7 raw features are embedded per timestep or per flattened channel sequence. DenseNet is adapted from EKYT-style eye-movement biometrics: the paper describes an initial convolutional layer followed by dense blocks with increasing dilation rates and cross-entropy loss. It does not provide the exact number of layers, growth rate, kernel sizes, or the embedding dimensionality used in DenseNet, so that architecture is only partially specified. The key novelty is not a new loss or objective; it is the use of a compact Transformer Encoder as a gaze-biometric classifier and the longitudinal comparison against EKYT/DenseNet-style modeling.
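A hedged PyTorch sketch of what such a compact Transformer Encoder classifier could look like, using only the reported hyperparameters (dmodel=64, 4 heads, 2 encoder layers, positional encoding). The per-timestep linear embedding, the mean-pooling readout, and the choice of 6 input channels are our assumptions; the paper does not specify the input projection or the classification head.

```python
import math
import torch
import torch.nn as nn

class GazeTransformerClassifier(nn.Module):
    def __init__(self, n_channels=6, n_users=407, d_model=64, n_heads=4,
                 n_layers=2, seq_len=1250):
        super().__init__()
        self.embed = nn.Linear(n_channels, d_model)   # per-timestep projection (assumed)
        self.register_buffer("pos_enc", self._positional_encoding(seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_users)       # closed-set user identification

    @staticmethod
    def _positional_encoding(seq_len, d_model):
        pos = torch.arange(seq_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, x):                    # x: (batch, channels, time)
        x = self.embed(x.transpose(1, 2))    # -> (batch, time, d_model)
        x = x + self.pos_enc
        x = self.encoder(x)
        return self.head(x.mean(dim=1))      # mean-pool over time (assumed readout)
```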
Training regime and evaluation: both Transformer and DenseNet are trained for 50 epochs with Adam at learning rate 0.001 and batch size 32; loss is cross-entropy. The paper does not mention validation splits, early stopping, seed control, or repeated runs, so reported numbers appear to be single-run accuracies rather than averages with confidence intervals. Evaluation is by multiclass user-identification accuracy across the five tasks and an aggregate “All” setting. The short-term evaluation trains and tests within Round 1 using an 80:20 split. The long-term test trains on Round 1 and tests on Round 3, which is the main stress test for temporal drift. The “updated data” experiment trains on Round 1 + Round 3 and then tests on previously unused Round 3 data; however, the text is not explicit about how the Round 3 data were partitioned to avoid training on the exact test samples, so there is some ambiguity in the reuse protocol.
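The reported training regime maps onto a standard supervised loop; the sketch below uses the stated settings (50 epochs, Adam at 0.001, batch size 32, cross-entropy) and deliberately omits validation, early stopping, and seeding because the paper does not describe them.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_identifier(model, X, y, epochs=50, lr=1e-3, batch_size=32, device="cpu"):
    """X: (N, channels, time) float tensor of gaze windows; y: (N,) long tensor of user labels."""
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
    return model
```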
Concrete example: consider the long-term TEX task. A Round 1 window is first normalized, segmented into a 5-second sequence, and passed to the classifier. In the Round 1→Round 3 setting, the learned user representation no longer matches the same user’s gaze distribution 26 months later; performance is only 8.40% for DenseNet, 3.71% for Transformer, and 12.11% for XGBoost. After adding Round 3 data into training, performance on the same task rises to 98.75% for DenseNet and 98.22% for Transformer (Table 4). That contrast is the paper’s main empirical claim: the biometric signal is strong, but only if the model is updated to track drift. Reproducibility remains limited because the paper does not report code, frozen checkpoints, exact preprocessing scripts, random seeds, or full hyperparameter sweeps.
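To make the three evaluation scenarios concrete, the sketch below encodes one plausible reading of the splits. The stratified 80:20 partition of Round 3 in the "updated" setting is our assumption; as noted above, the paper leaves the Round 3 reuse protocol ambiguous.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def short_term_split(X_r1, y_r1):
    """Scenario 1: train and test within Round 1 (80:20 split)."""
    return train_test_split(X_r1, y_r1, test_size=0.2, stratify=y_r1)

def long_term_split(X_r1, y_r1, X_r3, y_r3):
    """Scenario 2: train on Round 1, test on Round 3 collected 26 months later."""
    return X_r1, X_r3, y_r1, y_r3

def updated_split(X_r1, y_r1, X_r3, y_r3):
    """Scenario 3 (our reading): train on Round 1 plus part of Round 3, test on held-out Round 3."""
    X_r3_tr, X_r3_te, y_r3_tr, y_r3_te = train_test_split(
        X_r3, y_r3, test_size=0.2, stratify=y_r3)
    X_train = np.concatenate([X_r1, X_r3_tr])
    y_train = np.concatenate([y_r1, y_r3_tr])
    return X_train, X_r3_te, y_train, y_r3_te
```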
Technical innovations
- First longitudinal stress test of gaze-based continuous authentication in VR across a 26-month gap using GazeBaseVR rather than only same-session or short-gap evaluation.
- Introduces a compact Transformer Encoder (dmodel=64, 4 heads, 2 layers) for eye-tracking user identification and compares it directly with DenseNet/EKYT-style and XGBoost baselines.
- Shows that periodic retraining with recent rounds can recover accuracy from single digits back to the mid-to-high 90% range on the same dataset and tasks.
Datasets
- GazeBaseVR — 5,020 recordings from 407 users over 3 rounds across 26 months — public dataset (developed by Lohr et al.)
Baselines vs proposed
- DenseNet (short-term, all tasks): accuracy = 97.09% vs proposed Transformer = 97.20% (Table 2)
- XGBoost (short-term, all tasks): accuracy = 79.31% vs proposed Transformer = 97.20% (Table 2)
- DenseNet (long-term Round 1→Round 3, all tasks): accuracy = 7.79% vs proposed Transformer = 3.01% (Table 3)
- XGBoost (long-term Round 1→Round 3, VID): accuracy = 1.78% vs proposed Transformer = 2.45% (Table 3)
- DenseNet (updated Round 1+3 training, all tasks): accuracy = 98.71% vs proposed Transformer = 96.52% (Table 4)
- XGBoost (updated Round 1+3 training, all tasks): accuracy = 93.25% vs proposed Transformer = 96.52% (Table 4)
Limitations
- No formal attacker model or spoofing/replay evaluation; the paper measures longitudinal drift, not resistance to active imitation attacks.
- No confidence intervals, statistical significance tests, or multi-seed averages are reported, so robustness of the exact percentages is unclear.
- Preprocessing and split details are underspecified: overlap, leakage control, normalization scope, and class balancing are not fully described.
- The updated-data experiment is promising but the partitioning of Round 3 into train vs held-out test is not fully transparent in the text.
- DenseNet architecture details are incomplete compared with the Transformer description, making exact replication difficult.
- Accuracy is the only reported metric; no EER, FAR/FRR, ROC/AUC, or latency/compute trade-off is provided.
Open questions / follow-ons
- What retraining interval minimizes accuracy decay while keeping operational cost acceptable in a real VR deployment?
- How much of the long-term drop is due to behavioral drift versus hardware/session/context changes across rounds?
- Would combining gaze with head motion or hand motion reduce the need for frequent model updates?
- Can a formal open-set continuous-authentication evaluation replace closed-set user identification and yield a more realistic security estimate?
Why it matters for bot defense
For bot-defense and CAPTCHA practitioners, this paper is mainly a warning about temporal brittleness: a biometric signal that is highly separable today can become nearly useless after long gaps unless the model is refreshed. If gaze were used as a friction-reducing continuous-auth factor, operational design would need drift monitoring, scheduled retraining, and careful rollback controls, not just a one-time enrollment model.
A practical takeaway is that longitudinal evaluation should be a standard part of biometric risk assessment. In a CAPTCHA or bot-detection context, the paper suggests that user-behavior embeddings can look excellent in short-horizon benchmarks yet fail catastrophically under real-world time shift. Any deployment that relies on gaze for trust scoring should therefore test cross-month generalization, not just same-session classification, and should assume that retraining is part of the authentication protocol rather than an optional maintenance step.
Cite
@article{arxiv2502_20359,
  title={Evaluating the long-term viability of eye-tracking for continuous authentication in virtual reality},
  author={Sai Ganesh Grandhi and Saeed Samet},
  journal={arXiv preprint arXiv:2502.20359},
  year={2025},
  url={https://arxiv.org/abs/2502.20359}
}