Learning from a single labeled face and a stream of unlabeled data
Source: arXiv:2604.27564 · Published 2026-04-30 · By Branislav Kveton, Michal Valko
TL;DR
This paper addresses the challenging problem of face recognition in an extremely limited labeled-data regime: only one labeled image of a single person is available, and all other face images from many other people are unlabeled. This setting is motivated by practical scenarios such as personal device authentication, where only the owner's face is labeled and negative examples (other people) are unknown and unlabeled. The authors formalize the problem as one-class classification and propose an online manifold tracking (OMT) algorithm. OMT incrementally constructs a non-parametric model of the face manifold from a single labeled image and a stream of unlabeled faces. The method uses online k-center clustering to maintain a compact set of representative faces and semi-supervised graph-based random walk absorption probabilities to infer identity. Experimental evaluation on the VidTIMIT video dataset with 43 people shows OMT can recognize the labeled person with an average true positive rate (TPR) of 89% at a very low false positive rate (FPR) of 10^-4, significantly outperforming nearest neighbor baselines trained on the same labeled data. Additionally, OMT complements discriminative features such as Fisherfaces to further boost accuracy. Extensive sensitivity analyses demonstrate the robustness of the approach to hyperparameter choices such as the generalization radius and number of representative faces.
Key findings
- OMT achieves 89% true positive rate at 10^-4 false positive rate on VidTIMIT with only one labeled image of one person and a stream of unlabeled data.
- OMT improves TPR by 50% over the 1-nearest-neighbor baseline at the same FPR, demonstrating effective use of unlabeled data.
- The performance of OMT with one labeled face roughly matches that of a 5-nearest-neighbor classifier trained on 5 labeled faces.
- Using better features (Fisherfaces) improves performance of both OMT and 1-NN baseline; OMT with Fisherfaces outperforms both methods individually by double digits at low FPRs.
- Increasing the number of representative faces k from 150 to 600 improves TPR and reduces the cover radius r; the theoretical cost is cubic in k, but the observed cost is close to quadratic.
- A generalization radius of approximately R = 0.3 maximizes TPR at low FPR; a smaller R undergeneralizes, and a larger R increases false positives.
- OMT runs in near real time, averaging approximately 0.02 seconds per frame (45 seconds for a video of 2·1062 frames).
- OMT is robust to parameter tuning and can track evolving face manifolds online without offline training.
Threat model
The adversary is an unknown user appearing in the video stream whose face is unlabeled and treated as a negative example. They have access to the camera stream but do not have labeled images. The defender only has a single labeled image of the genuine user. The adversary cannot directly bypass the system or manipulate the labeling process, but may attempt to trigger false positives by presenting similar faces. The model's goal is robust one-class classification of the genuine user against unknown multiple negatives without explicit negative labels.
Methodology — deep read
The authors consider the threat model of authenticating a single person from a single labeled face image (the positive class). The adversary is essentially the presence of unknown faces from other people (negatives) appearing in a video stream, which are unlabeled and potentially adversarial. No labeled negative examples are provided. The problem is cast as one-class classification with only one labeled positive example.
The data come from the VidTIMIT dataset, which contains video sequences of 43 people reciting sentences across three sessions in an office-like environment. The videos are 512x384; grayscale 96x96 face crops are extracted with OpenCV face detection, and histogram equalization is applied. To simulate a challenging open-set scenario in which negatives constantly appear, the streams are augmented with outliers: face images of other people are inserted at random into each video stream.
The core algorithm, Online Manifold Tracking (OMT), builds a compact online summary of the unlabeled face manifold near the labeled example. It uses online k-center clustering to maintain at most k representative faces that are more than the cover radius r apart from one another. The generalization radius R controls how far from the labeled example an unlabeled face may lie and still be included.
Identity inference is performed using graph-based semi-supervised learning. A similarity graph W is constructed using a Gaussian kernel on Euclidean pixel-space distances. To avoid a trivial solution when only one face is labeled, a special 'sink' vertex is introduced that absorbs random walks with a probability controlled by the parameter gamma. The absorption probabilities computed from the harmonic solution serve as the recognition score.
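The absorption step described above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the kernel form exp(-d²/(2·sigma²)), the choice to connect every vertex to the sink with the same weight gamma, and the function names `gaussian_similarity` and `absorption_scores` are all assumptions introduced here. Under those assumptions, the probability that a random walk reaches the labeled vertex before the sink solves the linear system (L_uu + gamma·I) f_u = w_ul.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """Pairwise Gaussian-kernel weights from Euclidean distances, zero diagonal."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))  # assumed kernel form
    np.fill_diagonal(W, 0.0)              # no self-loops
    return W

def absorption_scores(X, sigma, gamma):
    """Row 0 of X is the labeled face; returns one score per remaining row.

    Each score is the probability that a random walk started at that vertex
    is absorbed at the labeled vertex rather than at the sink (assumed to be
    attached to every vertex with weight gamma).
    """
    X = np.asarray(X, dtype=float)
    W = gaussian_similarity(X, sigma)
    L = np.diag(W.sum(axis=1)) - W        # graph Laplacian
    L_uu = L[1:, 1:]                      # unlabeled-to-unlabeled block
    w_ul = W[1:, 0]                       # weights to the labeled vertex
    return np.linalg.solve(L_uu + gamma * np.eye(len(w_ul)), w_ul)
```

With gamma > 0 the system matrix is strictly diagonally dominant, so the solve is well posed. A vertex close to the labeled face gets a score near 1, while a distant vertex loses almost all of its walk mass to the sink and scores near 0.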
The OMT algorithm incrementally updates representative faces from incoming unlabeled frames if they lie within radius R of the labeled face but are sufficiently far (>r) from existing representatives. When the number of representatives hits k+1, the cover radius r is doubled and representatives reclustered to maintain spacing.
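The update rule above can be sketched in plain Python. This is a hedged illustration, not the published implementation: the names `dist`, `recluster`, and `omt_update` are hypothetical, the greedy reclustering pass is one simple way to restore spacing after r is doubled, and seeding the representative list with the labeled face is an assumption made here so that the labeled face is never dropped.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def recluster(reps, r):
    """Greedily keep representatives that are more than r from those already kept."""
    kept = []
    for x in reps:
        if all(dist(x, c) > r for c in kept):
            kept.append(x)
    return kept

def omt_update(reps, x_t, x_l, r, R, k):
    """Process one unlabeled face x_t and return the updated (reps, r).

    reps is assumed to start as [x_l]; r must be positive.
    """
    if dist(x_t, x_l) > R:
        return reps, r                    # outside the generalization radius
    if any(dist(x_t, c) <= r for c in reps):
        return reps, r                    # already covered by a representative
    reps = reps + [x_t]
    while len(reps) > k:                  # budget exceeded: coarsen the cover
        r *= 2.0
        reps = recluster(reps, r)
    return reps, r
```

Because the list is scanned in insertion order, the first element (the labeled face in this sketch) always survives reclustering. Each doubling of r can only shrink the representative set, so the loop terminates for any k of at least 1.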
Parameters such as the generalization radius R, cover radius r, number of representatives k, heat kernel bandwidth sigma, sink absorption weight gamma, and recognition threshold epsilon are set based on statistical heuristics and sensitivity analysis.
Training is fully online without offline pretraining. Experiments run on all 43 VidTIMIT videos, each with one labeled face and a stream of faces with unlabeled outliers inserted. Performance metrics are true positive rate (TPR) and false positive rate (FPR) measured on frames. Baselines include 1-NN and 5-NN classifiers in pixel and Fisherface spaces. ROC curves are obtained by varying epsilon (OMT) or radius (NN). No cross-validation folds are needed since testing is on the streaming data with known ground truth labels.
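The operating point reported throughout (TPR at a fixed low FPR) can be read off a threshold sweep like the one described above. The following is a minimal sketch under assumptions made here: the function name `tpr_at_fpr` is hypothetical, and a frame is taken to be accepted when its score is at least the threshold epsilon.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Best TPR over all thresholds whose FPR does not exceed max_fpr.

    scores: per-frame recognition scores (higher means more likely the owner).
    labels: per-frame ground truth, truthy for the labeled person.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative frame")
    best = 0.0
    for eps in sorted(set(scores)):       # candidate thresholds
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= eps)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= eps)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best
```

The sweep is quadratic in the number of frames, which is acceptable for illustration but not for long streams. Note that an FPR of 10^-4 can only be distinguished from zero when at least 10^4 negative frames are scored.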
One concrete example: For person 1, OMT starts with a single labeled face xl. As frames xt arrive, it checks if d(xt,xl) ≤ R. If so and xt is > r from existing representatives, xt is added as a new representative. The k-center clustering ensures no two reps are closer than r. To infer identity of xt, a similarity graph among reps plus xl plus sink vertex is constructed and absorption probabilities calculated. If the probability exceeds threshold epsilon, xt is accepted as the labeled individual; otherwise rejected. Over time, the evolving representative set captures variations in face appearance including expressions and minor pose changes, enabling robust low-FPR recognition.
Code and datasets are not explicitly released, but the underlying VidTIMIT dataset is publicly available. The method does not require offline training or labeled negatives, enabling practical real-time deployment.
Technical innovations
- Formulation of face recognition from a single labeled image of one person with a stream of unlabeled faces as a one-class classification and online semi-supervised learning problem.
- Online manifold tracking (OMT) algorithm that incrementally summarizes unlabeled face data using online k-center clustering to maintain a compact and adaptive representation.
- Introduction of a special sink vertex with tunable absorption probability in the similarity graph to distinguish positives from unknown negatives without labeled negative examples.
- Closed-form harmonic solution based identity inference on the evolving graph of representative faces, enabling efficient real-time recognition.
- Demonstration that unlabeled data streams can be used effectively to learn better appearance models in an online fashion without offline training.
Datasets
- VidTIMIT — 43 subjects — publicly available video dataset recorded in office environment with 3 recording sessions
Baselines vs proposed
- 1-NN pixel space: TPR = approx. 0.6 at FPR=10^-4 vs OMT: TPR=0.89
- 5-NN pixel space: ROC curve similar to OMT with one labeled face
- 1-NN Fisherfaces: lower TPR than OMT with Fisherfaces at low FPRs
- OMT Fisherfaces: improves TPR by about 2% vs OMT pixel at low FPRs
- Increasing number of representatives k from 150 to 600 increases TPR and reduces cover radius r
Figures from the paper
Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2604.27564.

Fig 2: Images and faces in the VidTIMIT dataset.

Fig 3: Representative faces learned by OMT for Person 1, 15, 22, and 42. The four leftmost faces are the labeled examples.

Fig 4: Comparison of the NN and OMT recognizers that are

Fig 5: Comparison of the NN and OMT recognizers that are

Fig 6: Varying the generalization radius R in OMT. For each

Fig 7: Varying the number of representative faces k in OMT.

Fig 8 (page 5).
Limitations
- Method relies heavily on availability of a large stream of unlabeled face images; performance may degrade if insufficient unlabeled data is available.
- Experimental evaluation is done on a relatively small dataset (43 people) and limited domain (office environment), so generalization to large-scale, unconstrained settings remains untested.
- No explicit adversarial evaluation or robustness tests against sophisticated impostor attacks are presented.
- The holistic image-based approach may be less robust than local-feature-based methods to extreme pose or illumination changes.
- Computational complexity scales cubically with number of representatives k in theory, which may limit scalability despite observed subcubic costs.
- The method assumes Euclidean pixel intensity space distances and Gaussian kernel similarities, which may not capture complex face variations optimally.
Open questions / follow-ons
- How would the method scale and perform on larger datasets with thousands of people and more diverse conditions?
- How robust is OMT to adversarial presentation attacks or spoofing attempts?
- Can the approach be combined with deep embedding features learned from large datasets to further improve accuracy and robustness?
- What are the performance and robustness trade-offs of incorporating local facial features or multi-modal biometric data?
Why it matters for bot defense
From a bot-defense perspective, this work provides an interesting instance of one-class classification from minimal labeled data combined with abundant unlabeled data streams. It demonstrates how to build adaptive user-specific models for biometric authentication where negative examples are unknown or impossible to enumerate. Methods like online manifold tracking (OMT) that leverage unlabeled data to improve recognition accuracy without explicit negative training samples could inspire novel adaptive CAPTCHA or bot-detection mechanisms that personalize challenge-response models per user. The use of graph-based semi-supervised learning with sink nodes to disambiguate positives from unknown negatives also offers a conceptual tool for handling open-world bot-detection scenarios where attackers are diverse and not fully labeled. However, the approach's reliance on face data and streaming sequences limits direct application to text/image CAPTCHAs. Still, the core idea of online adaptation from streaming unlabeled data with minimal supervision is broadly relevant for anti-bot systems aiming to handle evolving adversaries in a user-specific manner. Careful parameter tuning and computational resource considerations highlighted here are also applicable in practical bot-defense deployments.
Cite
@article{arxiv2604_27564,
title={Learning from a single labeled face and a stream of unlabeled data},
author={Branislav Kveton and Michal Valko},
journal={arXiv preprint arXiv:2604.27564},
year={2026},
url={https://arxiv.org/abs/2604.27564}
}