
GMGaze: MoE-Based Context-Aware Gaze Estimation with CLIP and Multiscale Transformer

Source: arXiv:2605.00799 · Published 2026-05-01 · By Xinyuan Zhao, Yihang Wu, Ahmad Chaddad, Sarah A. Alkhodair, Reem Kateb

TL;DR

GMGaze is a gaze-estimation model designed to address three recurring failure modes in prior appearance-based systems: late fusion of multi-scale features, entangled global representations that mix gaze cues with nuisance factors, and fixed-capacity dense transformers that waste compute on easy tokens. The proposed architecture combines a frozen CLIP image encoder, a ResNet-50 CNN branch, and a multiscale transformer. Its key idea is to split the CLIP global embedding into two context-biased global tokens using learned prototype banks for illumination, background, head pose, and textual gaze descriptions, then fuse those tokens early with CLIP patch tokens and CNN tokens before transformer processing.

The paper reports that this design improves both within-domain accuracy and cross-domain transfer on four public gaze benchmarks: MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze. The stated mean angular errors are 2.49°, 3.22°, 10.16°, and 1.44° respectively, with claims of outperforming prior baselines in all within-domain settings and achieving state-of-the-art results on two standard transfer routes. The method’s other novelty is a sparse Mixture-of-Experts transformer block that allocates capacity token-wise rather than uniformly, plus an adversarial domain adaptation objective with a feature-separation term that pushes the two global tokens to remain decorrelated.

Key findings

  • Reported within-domain mean angular error is 2.49° on MPIIFaceGaze, 3.22° on EYEDIAP, 10.16° on Gaze360, and 1.44° on ETH-XGaze.
  • The paper claims GMGaze outperforms previous baselines in all within-domain settings across the four benchmarks.
  • Cross-domain adaptation uses only 100 labeled target samples for refinement after adversarial alignment, selected by shuffling target indices once with a fixed random seed and taking the first 100.
  • Semantic prototype conditioning uses four banks: illumination (3 prompts), head pose (2), background (2), and description (8), for 15 total prompt phrases used to initialize the prototype vectors.
  • The model replaces dense FFNs with sparse MoE routing, where each token is routed independently through Top-K experts rather than applying one shared FFN to all tokens (see the routing sketch after this list).
  • The cross-domain objective adds a feature-separation loss that minimizes absolute cosine similarity between the two global semantic tokens, encouraging them toward orthogonality.
  • The authors state the code is publicly available at https://github.com/AIPMLab/GazeFormer-MoE.
  • The paper claims state-of-the-art results on two standard transfer routes, but the exact transfer-route names and numerical deltas are not visible in the provided excerpt.
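
To make the token-wise routing concrete, here is a minimal PyTorch sketch of a Top-K MoE feed-forward block. The expert count, hidden width, gate design, and the dense per-expert evaluation are illustrative assumptions; the paper's exact gating function, any load-balancing terms, and the optional modality-to-expert masks are not given in the excerpt.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEFFN(nn.Module):
    """Token-wise Top-K mixture-of-experts FFN (sketch, hyperparameters assumed)."""
    def __init__(self, d_model=256, d_hidden=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # per-token router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                               # x: (B, T, D)
        scores = self.gate(x)                           # (B, T, E)
        topk_val, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)           # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e).float()              # tokens routed to expert e: (B, T, K)
            w = (weights * mask).sum(-1, keepdim=True)  # gate weight per token: (B, T, 1)
            # dense evaluation for clarity; a real MoE dispatches only the routed tokens
            out = out + w * expert(x)
        return out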

Threat model

The relevant adversary is domain shift rather than a malicious attacker: the model must handle changes in illumination, background, head pose, subject appearance, and dataset-specific distribution statistics between source and target gaze benchmarks. In the cross-domain setting, the source domain is labeled, the target domain is unlabeled during adversarial alignment, and only 100 target samples are later labeled for supervised refinement. The model does not assume access to target gaze labels in the main adaptation stage, but it does assume the usual supervised source data and the ability to compute domain labels for source versus target during adversarial training.

Methodology — deep read

The setting is standard cross-domain appearance-based gaze estimation. For within-domain training, the model sees labeled face images from the same benchmark distribution at train and test time; for cross-domain transfer, it is trained on a labeled source domain and an unlabeled target domain, then refined with a small labeled target subset of 100 images. As outlined in the threat model above, the adversary is domain shift (illumination changes, background clutter, head-pose variation, and dataset-specific appearance statistics) rather than a malicious actor, and the model assumes access to the target image stream during adversarial alignment plus the small supervised target subset for refinement.

Data come from four public benchmarks: MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze. The excerpt does not provide dataset sizes, exact train/val/test splits, or preprocessing details such as face-cropping, normalization, or frame subsampling, so those should be treated as unspecified in the provided text. The prompt banks are hand-specified: illumination has three phrases (“bright light,” “low light,” “shadows”), head pose has two (“frontal,” “profile”), background has two (“bright background,” “dark background”), and description has eight gaze-direction phrases spanning left/right/up/down combinations. These prompts are encoded by a frozen CLIP text encoder only once to initialize the learnable prototype vectors; the textual vocabulary and CLIP text encoder itself are frozen thereafter. The CLIP visual encoder is also frozen in the described formulation, while the CNN backbone is trainable and specified as ResNet-50 in the problem formulation.
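
As a concrete illustration of the one-time prototype initialization, the sketch below encodes the prompt phrases with a frozen CLIP text encoder and stores them as learnable parameters. The `clip` package, the ViT-B/32 backbone, and the exact wording of the eight description phrases are assumptions; only the bank structure and the explicitly listed phrases come from the paper.

python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # CLIP variant assumed; text encoder stays frozen
model.eval()

prompt_banks = {
    "illumination": ["bright light", "low light", "shadows"],
    "head_pose":    ["frontal", "profile"],
    "background":   ["bright background", "dark background"],
    # eight gaze-direction phrases; exact wording is not given in the excerpt,
    # so these are illustrative placeholders
    "description":  ["looking left", "looking right", "looking up", "looking down",
                     "looking up-left", "looking up-right",
                     "looking down-left", "looking down-right"],
}

prototypes = {}
with torch.no_grad():
    for bank, phrases in prompt_banks.items():
        tokens = clip.tokenize(phrases).to(device)
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)       # unit-normalize each prompt embedding
        # encoded once, then kept as learnable prototype vectors
        prototypes[bank] = torch.nn.Parameter(emb.clone())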

Architecturally, GMGaze takes a face image and extracts three feature streams in parallel: a CLIP global embedding, CLIP patch tokens, and CNN feature tokens from a ResNet-50 backbone. The semantic-conditioning module is the main novelty: the global CLIP vector is normalized and compared by softmax similarity against four learnable prototype banks. For each bank, a hard one-hot choice is made with a straight-through estimator, so the forward pass uses a discrete prototype while gradients still flow through the soft similarity scores. Two global context-biased tokens are formed: f1 combines the global vector with the selected illumination and background prototypes, while f2 combines it with the selected description and head-pose prototypes. This is intended to create complementary representations rather than one entangled global vector. After that, f1, f2, patch tokens, and CNN tokens are each projected into a common model dimension and concatenated into a single token sequence. The transformer is then applied with multi-head self-attention followed by a sparse MoE feed-forward replacement. Each token is routed independently to Top-K experts; optionally, a modality-to-expert mask can restrict tokens from different token groups to preferred expert subsets. The stated purpose is to give conditional capacity to difficult tokens without scaling all parameters densely.
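
A hedged sketch of the conditioning and early fusion described above: softmax similarity against a prototype bank, a straight-through one-hot selection, two context-biased global tokens, and concatenation with patch and CNN tokens. The additive combination, the projection layers, and all dimensions are assumptions for illustration rather than the paper's exact operators.

python
import torch
import torch.nn.functional as F

def select_prototype(global_feat, bank):
    """Hard-select one prototype per sample with a straight-through estimator."""
    g = F.normalize(global_feat, dim=-1)                 # (B, D)
    p = F.normalize(bank, dim=-1)                        # (K, D)
    sim = g @ p.t()                                      # (B, K) similarities
    soft = F.softmax(sim, dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), bank.size(0)).float()
    st = hard + soft - soft.detach()                     # forward: hard choice, backward: soft gradients
    return st @ bank                                     # (B, D) selected prototype

def build_tokens(g_clip, patch_tok, cnn_tok, banks, proj):
    # two context-biased global tokens from complementary prototype pairs
    # (additive combination is an assumption, not the paper's stated operator)
    f1 = g_clip + select_prototype(g_clip, banks["illumination"]) \
               + select_prototype(g_clip, banks["background"])
    f2 = g_clip + select_prototype(g_clip, banks["description"]) \
               + select_prototype(g_clip, banks["head_pose"])
    # project every stream to the shared model width and concatenate early
    seq = torch.cat([proj["glob"](torch.stack([f1, f2], dim=1)),
                     proj["patch"](patch_tok),
                     proj["cnn"](cnn_tok)], dim=1)       # (B, 2 + P + C, d_model)
    return seq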

Training depends on the setting. In within-domain training, the objective is only the angular regression loss: 1 minus cosine similarity between predicted and ground-truth 3D gaze vectors. In cross-domain training, the objective becomes angular loss plus an adversarial domain-alignment loss plus the feature-separation loss. The domain discriminator is a multilayer perceptron trained with a gradient reversal layer to distinguish source from target from the concatenated token features; the gaze model is trained to confuse this discriminator. The separation term explicitly penalizes cosine similarity between the two normalized global tokens, encouraging orthogonality and reducing redundancy. The paper’s algorithms indicate a step-by-step loop: extract CLIP and CNN features, compute prototype similarities and hard selections, build the fused token sequence, run transformer layers with attention and MoE blocks, compute gaze loss on source, update the discriminator on source/target domain labels, and update the gaze model jointly with adversarial and separation losses. The excerpt does not specify optimizer, learning rate, batch size, epoch count, seed strategy, or hardware, so those details are not recoverable from the provided text.
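
The three loss terms can be written compactly. The sketch below assumes a standard gradient-reversal implementation, a pooled feature input to the discriminator, and binary domain labels; the loss-weighting coefficients and discriminator architecture are not specified in the excerpt.

python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

def gaze_loss(pred, target):
    """Angular regression loss: 1 minus cosine similarity of 3D gaze vectors."""
    return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()

def separation_loss(f1, f2):
    """Penalize absolute cosine similarity to push the two global tokens toward orthogonality."""
    return F.cosine_similarity(f1, f2, dim=-1).abs().mean()

def adversarial_loss(features, domain_labels, discriminator, lam=1.0):
    """Domain confusion via gradient reversal (source = 0, target = 1); weights assumed."""
    rev = GradReverse.apply(features, lam)
    logits = discriminator(rev)
    return F.binary_cross_entropy_with_logits(logits.squeeze(-1), domain_labels.float())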

Evaluation is reported on within-domain and cross-domain tasks. The within-domain metric is mean angular error in degrees, and the paper claims results of 2.49°, 3.22°, 10.16°, and 1.44° on MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze, respectively. The paper says these outperform prior baselines in all within-domain settings, but the excerpt does not enumerate the baselines or their exact scores. Cross-domain evaluation uses standard source-to-target transfer routes and includes a small labeled target subset of 100 samples for supervised refinement; the authors say the method achieves SOTA on two standard routes. The excerpt mentions ablation studies and qualitative visualizations, but does not expose the slice-by-slice numbers, so the relative contribution of semantic conditioning, MoE routing, feature separation, and adversarial alignment cannot be quantified here. Reproducibility is mixed: the code is publicly available, but the excerpt does not say whether pretrained weights, exact splits, or a fixed evaluation protocol are released.
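
For reference, mean angular error is typically computed as the angle between predicted and ground-truth 3D gaze vectors. The sketch below shows one common convention (including a pitch/yaw-to-vector helper, since several benchmarks store angles); the paper's exact evaluation code and angle conventions are not reproduced in the excerpt.

python
import numpy as np

def pitchyaw_to_vector(pitchyaw):
    """(N, 2) pitch/yaw in radians -> (N, 3) unit gaze vectors (one common convention)."""
    pitch, yaw = pitchyaw[:, 0], pitchyaw[:, 1]
    return np.stack([-np.cos(pitch) * np.sin(yaw),
                     -np.sin(pitch),
                     -np.cos(pitch) * np.cos(yaw)], axis=1)

def mean_angular_error_deg(pred_vec, true_vec):
    """Mean angle in degrees between predicted and ground-truth gaze vectors."""
    p = pred_vec / np.linalg.norm(pred_vec, axis=1, keepdims=True)
    t = true_vec / np.linalg.norm(true_vec, axis=1, keepdims=True)
    cos = np.clip(np.sum(p * t, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())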

Technical innovations

  • Semantic prototype conditioning splits one frozen CLIP global embedding into two hard-selected context-biased tokens using four learned prototype banks initialized from fixed text prompts.
  • Early unified fusion concatenates CLIP global tokens, CLIP patch tokens, and CNN tokens at the transformer input rather than merging them in late layers.
  • Token-wise sparse MoE replaces the standard FFN so different gaze samples and modalities can route through different expert subsets.
  • Cross-domain adaptation adds a feature-separation loss that explicitly discourages the two semantic tokens from becoming collinear while adversarially aligning source and target domains.

Datasets

  • MPIIFaceGaze — size not stated in provided excerpt — public benchmark
  • EYEDIAP — size not stated in provided excerpt — public benchmark
  • Gaze360 — size not stated in provided excerpt — public benchmark
  • ETH-XGaze — size not stated in provided excerpt — public benchmark

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.00799.

  • Fig 1: Gaze visualization of the proposed GMGaze on unseen … (caption truncated in the excerpt).
  • Fig 2: Flowchart of GMGaze; left: pre-defined prompt banks (illuminations, backgrounds, head poses, and descriptions) … (caption truncated in the excerpt).
  • Fig 3: (Left) example of a dense transformer layer … (caption truncated in the excerpt).
  • The remaining figures (pages 2–3) have no recoverable captions in the excerpt.

Limitations

  • The provided excerpt omits dataset sizes, exact splits, preprocessing, optimizer settings, batch size, epoch count, and backbone initialization details, so full reproducibility is incomplete from the text alone.
  • The main claimed gains are aggregate mean angular errors; the excerpt does not expose the ablation deltas that isolate semantic conditioning, MoE routing, or feature separation.
  • Cross-domain refinement uses 100 labeled target samples, which is practical but still assumes some target annotation budget.
  • The cross-domain threat model is dataset shift, not a stronger adaptive attacker; the method is not evaluated against adversarially chosen test-time corruptions or test-time prompt attacks.
  • The prompt-based prototype design is semantically hand-crafted, so performance may depend on the chosen vocabulary and may not transfer cleanly to other gaze datasets or languages.

Open questions / follow-ons

  • How much of the improvement comes from prototype conditioning versus the MoE transformer, and do the gains persist if the prototype banks are learned without hard selection?
  • Would the feature-separation loss still help if the two global tokens were conditioned on more than two semantic partitions, or if the prompt taxonomy were derived automatically rather than manually?
  • How sensitive is the method to the CLIP backbone choice, the number of experts, and the Top-K routing setting?
  • Can the same early-fusion + MoE design improve robustness under more severe shifts, such as sensor blur, compression artifacts, or synthetic-to-real transfer?

Why it matters for bot defense

For bot defense, the paper is relevant less as a direct CAPTCHA method and more as a robustness pattern: it shows how to decompose one global representation into context-specific subspaces, then fuse them early and route computation sparsely. A captcha or anti-bot system that relies on face/gaze signals could borrow the idea of explicit factor conditioning for nuisance variables like lighting, camera angle, and background, rather than hoping a single embedding will disentangle them implicitly.

The cross-domain setup is also a useful reminder that face- or gaze-based verification will be brittle if trained only on one capture environment. The 100-sample target refinement regime suggests a practical adaptation budget, but it also highlights the privacy and operational cost of collecting labeled target examples. For practitioners, the main takeaway is that if gaze is used as a behavioral signal, you likely need explicit domain-alignment and factor-separation machinery to avoid overfitting to venue-specific cues, screen geometry, or camera placement.

Cite

bibtex
@article{arxiv2605_00799,
  title={GMGaze: MoE-Based Context-Aware Gaze Estimation with CLIP and Multiscale Transformer},
  author={Xinyuan Zhao and Yihang Wu and Ahmad Chaddad and Sarah A. Alkhodair and Reem Kateb},
  journal={arXiv preprint arXiv:2605.00799},
  year={2026},
  url={https://arxiv.org/abs/2605.00799}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution