UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Source: arXiv:2606.20559 · Published 2026-06-18 · By Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das

TL;DR

UNIEGO addresses a core limitation of egocentric video understanding: wearable cameras capture a narrow perspective in a single modality. Its key contribution is a hierarchical multi-teacher knowledge distillation framework that learns a unified egocentric video representation from diverse teacher models spanning viewpoints (egocentric and exocentric), modalities (RGB, depth, skeleton), and foundation models. Distilling directly from heterogeneous, incompatible teacher feature spaces leads to conflicting gradients and poor optimization, so UNIEGO introduces proxy models as intermediate, representation-specific mediators. Each proxy translates one teacher’s knowledge into a common egocentric feature space, closing the representational gap. A subsequent selective proxy distillation (SPD) stage then chooses reliable proxy supervision on a per-sample basis, avoiding noisy or conflicting gradients. Finally, UNIEGO initializes the unified encoder as a learned convex combination of proxy parameters to stabilize training.

UNIEGO improves accuracy over strong baselines on three egocentric video benchmarks (EgoExo-Fitness, Assembly101, EgoExo4D) and on multiple tasks, including action recognition, video retrieval, and action segmentation. It outperforms naive multi-teacher distillation and single-modality approaches by up to +4.4 points of Top-1 accuracy in recognition. Analysis shows that proxies close representational gaps (higher CKA similarity) and that SPD lowers the gradient conflict rate from roughly 60% to 42-50%. The result is a deployable encoder that consolidates complementary cross-view and cross-modal knowledge while requiring only egocentric RGB input at inference.

Key findings

UNIEGO outperforms naive multi-teacher distillation by +4.4 points of Top-1 accuracy on EgoExo-Fitness and +3.1 points on Assembly101 (Table 2).
UNIEGO surpasses single-teacher egocentric backbones (TimeSformer) by +4.4, +3.1, and +1.2 points of Top-1 accuracy on EgoExo-Fitness, Assembly101, and EgoExo4D, respectively.
Proxy learning raises pairwise feature similarity (CKA) among proxies compared to disparate teachers, confirming representational gap closure (Fig. 3).
Selective Proxy Distillation (SPD) reduces gradient conflict rate from ~60% in naive distillation to ~42-50%, yielding more cooperative distillation gradients (Figs. 4 and 5).
Proxy merging initialization places UNIEGO in a well-conditioned region of the loss landscape, improving optimization stability and final accuracy (Table 6).
UNIEGO improves video retrieval mAP by +5.7 points on EgoExo-Fitness compared to naive multi-teacher distillation (0.486 to 0.543), indicating richer feature embeddings (Table 3).
UNIEGO features yield stronger temporal action segmentation metrics on Assembly101 than both training from scratch and naive distillation, e.g., +3.4 F1@10 (Table 4).
UNIEGO works effectively across multiple architectures, including compact UniFormer-S with only 22M parameters, confirming generality (Table 5).

Threat model

n/a - This work focuses on egocentric video representation learning rather than adversarial or security threats. The 'adversary' in the context of multi-teacher distillation is conceptualized as conflicting or noisy supervision signals arising from heterogeneous teacher modalities and viewpoints, which the method attempts to mitigate.

Methodology — deep read

Threat model and assumptions: The framework assumes access to multiple heterogeneous teacher models pretrained on different viewpoints (egocentric and exocentric), modalities (RGB, depth, skeleton), and foundation model representations. The adversary model is not explicitly defined, as this is an egocentric video representation learning contribution rather than a bot defense or security attack model. The focus is on reconciling heterogeneous teacher representations for improved egocentric embedding quality.
Data: The method is evaluated on three publicly available datasets: EgoExo-Fitness, Assembly101, and EgoExo4D. These datasets provide synchronized egocentric and exocentric video streams with multi-modal data and action classification labels (24 classes for Assembly101; details for others in Appendix). Splits are as per official protocol. The model only requires egocentric RGB video at inference.
Architecture/algorithm: The core architecture is a hierarchical multi-teacher distillation framework consisting of two levels:

Level-I: Proxy Learning. For each of R teachers, train an egocentric proxy model sharing the UNIEGO architecture (e.g., TimeSformer) that distills feature-level knowledge from teacher embeddings into egocentric proxy space. Proxies handle the representational heterogeneity by translating different modalities, viewpoints, and architectures into a homogeneous egocentric feature space.
Level-II: Proxy Merging and Selective Proxy Distillation (SPD). UNIEGO’s parameters θ_U are initialized as a learned convex combination of the proxy parameters, θ_U = Σ_r α_r θ_r with α_r ≥ 0 and Σ_r α_r = 1. For each training sample, SPD then distills only from the subset of proxies that are both correct (they predict the ground-truth label) and confident (lowest classification loss). The training loss combines a cosine distance on features, a KL divergence on logits, and a classification loss.
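
The per-sample selection rule and the combined loss described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function names (`select_proxies`, `spd_loss`), the KL direction (proxy to student), the absence of a softmax temperature, averaging over the selected proxies, and falling back to the classification loss alone when no proxy is correct are all choices made here for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def select_proxies(proxy_logits, label, top_k=2):
    """Indices of proxies that predict `label`, lowest loss first, at most top_k."""
    logp = log_softmax(proxy_logits)                 # shape (R, C)
    ce = -logp[:, label]                             # per-proxy classification loss
    correct = np.flatnonzero(logp.argmax(axis=1) == label)
    ranked = correct[np.argsort(ce[correct])]        # most confident first
    return ranked[:top_k].tolist()

def cosine_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two softmax distributions (assumed direction)."""
    log_p, log_q = log_softmax(teacher_logits), log_softmax(student_logits)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))

def spd_loss(student_feat, student_logits, proxy_feats, proxy_logits, label,
             top_k=2, beta_feat=5.0, beta_logit=1.0, beta_cls=1.0):
    """Weighted sum of feature, logit, and classification terms for one sample."""
    cls = -log_softmax(student_logits)[label]
    chosen = select_proxies(proxy_logits, label, top_k)
    if not chosen:                                   # assumption: no reliable proxy
        return float(beta_cls * cls)
    feat = np.mean([cosine_distance(student_feat, proxy_feats[r]) for r in chosen])
    logit = np.mean([kl_divergence(proxy_logits[r], student_logits) for r in chosen])
    return float(beta_feat * feat + beta_logit * logit + beta_cls * cls)
```

With three proxies where the third predicts the wrong class, `select_proxies` drops it and ranks the remaining two by loss, so only the kept proxies contribute feature and logit terms for that sample.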

Training regime: Proxies are trained independently for 15 epochs using SGD with lr = 0.005, momentum = 0.9, weight decay = 1e-4, and batch size 8 over 4 NVIDIA RTX A5000 GPUs. The proxy merging coefficients α* are optimized with Adam for 2 epochs with lr = 0.02. SPD uses β_feat = 5, β_cls = 1, β_logit = 1 or 5 depending on the dataset, and top-K = 1 or 2 selected proxies.
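
The convex combination used for initialization can be illustrated with a small NumPy sketch. It assumes one scalar coefficient per proxy and parameterizes the coefficients through a softmax so that they stay non-negative and sum to one; both the softmax parameterization and the names `merge_proxies` and `merge_logits` are assumptions made here, and the Adam optimization of the coefficients is not shown.

```python
import numpy as np

def softmax(w):
    """Map unconstrained scores to non-negative weights that sum to 1."""
    w = np.asarray(w, dtype=float)
    e = np.exp(w - w.max())
    return e / e.sum()

def merge_proxies(proxy_params, merge_logits):
    """Blend proxy parameters into one parameter set.

    proxy_params: list of dicts (name -> ndarray) with identical keys and shapes,
                  which holds when all proxies share one architecture.
    merge_logits: one unconstrained score per proxy (hypothetical learnable values).
    """
    alpha = softmax(merge_logits)
    merged = {
        name: sum(a * params[name] for a, params in zip(alpha, proxy_params))
        for name in proxy_params[0]
    }
    return merged, alpha

# Two toy "proxies" with equal scores are simply averaged.
p1 = {"w": np.ones((2, 2)), "b": np.zeros(2)}
p2 = {"w": np.zeros((2, 2)), "b": np.ones(2)}
theta_u, alpha = merge_proxies([p1, p2], merge_logits=[0.0, 0.0])
```

Because every proxy shares the unified model's architecture, the blend is well defined parameter by parameter; raising one score shifts the initialization toward that proxy.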
Evaluation protocol: Top-1 accuracy for action recognition; mAP and R@1 for video retrieval; F1 (at 10, 25, 50), Edit, and frame accuracy for action segmentation. Baselines include single egocentric backbones, naive multi-teacher distillation, and recent egocentric methods (π-ViT, EgoVLP, etc.). Ablations examine the individual components (proxy learning, merging initialization, SPD) and alternative proxy merging or selection strategies. The representational gap is measured with CKA similarity, and gradient conflicts with gradient cosine similarity and conflict rate.
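
The two diagnostics can be sketched as follows. The summary does not say which CKA variant or which conflict definition the paper uses, so this sketch assumes linear CKA and counts a pair of gradients as conflicting when their cosine similarity is negative; the function names are illustrative.

```python
import numpy as np
from itertools import combinations

def linear_cka(x, y):
    """Linear CKA between two feature matrices with one row per sample."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x - x.mean(axis=0, keepdims=True)   # center each feature dimension
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

def conflict_rate(grads):
    """Fraction of gradient pairs whose cosine similarity is negative."""
    grads = [np.asarray(g, dtype=float).ravel() for g in grads]
    pairs = list(combinations(range(len(grads)), 2))
    conflicts = 0
    for i, j in pairs:
        cos = grads[i] @ grads[j] / (np.linalg.norm(grads[i]) * np.linalg.norm(grads[j]))
        conflicts += int(cos < 0)
    return conflicts / len(pairs)

rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 5))
same = linear_cka(feats, 2.0 * feats)       # scaling leaves linear CKA unchanged
rate = conflict_rate([[1.0, 0.0], [-1.0, 0.0], [1.0, 1.0]])
```

A higher CKA between two proxies than between their teachers would indicate a smaller representational gap, and a lower conflict rate indicates that the distillation gradients point in more compatible directions.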
Reproducibility: Code and models are publicly released at https://github.com/Wenhao-Chi/UNIEGO. Datasets are public. The proxies share the backbone of the unified model (TimeSformer), so the full pipeline can be rebuilt from a single architecture. Training details and hyperparameters are disclosed.

Technical innovations

Proxy models as intermediate mediators translate heterogeneous teacher representations into a homogeneous egocentric embedding space, mitigating representational gaps.
Selective Proxy Distillation (SPD) dynamically selects, per sample, only correct and confident proxies for distillation, reducing noisy conflicting gradients.
Proxy merging initialization learns an optimal convex combination of proxy parameters to place the unified model in a well-conditioned optimization region before distillation.
Hierarchical two-level distillation framework explicitly disentangles modality, viewpoint, and architectural heterogeneity prior to unified model consolidation.
Demonstration that combining diverse viewpoints (ego-exo), modalities (RGB, depth, skeleton), and foundation models improves egocentric representation without requiring multimodal input at inference.

Datasets

EgoExo-Fitness: size not specified; public egocentric-exocentric videos of fitness actions
Assembly101: 46,202 train and 15,307 test samples; public; helmet-mounted egocentric cameras paired with exocentric cameras; 24 action classes
EgoExo4D: size not specified; public egocentric-exocentric dataset with multimodal egocentric video

Baselines vs proposed

TimeSformer baseline (egocentric inference, EgoExo-Fitness): Top-1 accuracy = 80.3% vs UNIEGO 84.7%
Multiteacher Distillation [34/40] baseline (egocentric inference, Assembly101): 48.2% vs UNIEGO 50.7%
π-ViT (multi-teacher distillation, EgoExo-Fitness): 80.1% vs UNIEGO 84.7%
Naive multi-teacher distillation (video retrieval, EgoExo-Fitness): mAP = 0.486 vs UNIEGO 0.543
Scratch TimeSformer temporal action segmentation (Assembly101): F1@10 = 16.2% vs UNIEGO 19.6%
UniFormer-S backbone (egocentric inference, EgoExo-Fitness): baseline 68.4% vs UNIEGO 73.5%
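
As a quick arithmetic check, the absolute gains implied by the pairs listed above can be recomputed directly. The dictionary keys are shorthand labels introduced here; the numbers are the ones quoted in this section, and the differences are absolute (percentage points for accuracy and F1, raw mAP for retrieval).

```python
# (baseline, UNIEGO) pairs as quoted in the list above.
results = {
    "TimeSformer / EgoExo-Fitness / Top-1 %": (80.3, 84.7),
    "Multiteacher Distillation / Assembly101 / Top-1 %": (48.2, 50.7),
    "pi-ViT / EgoExo-Fitness / Top-1 %": (80.1, 84.7),
    "Naive multi-teacher / EgoExo-Fitness / retrieval mAP": (0.486, 0.543),
    "Scratch TimeSformer / Assembly101 / F1@10 %": (16.2, 19.6),
    "UniFormer-S / EgoExo-Fitness / Top-1 %": (68.4, 73.5),
}

# Absolute gain of UNIEGO over each baseline, rounded to hide float noise.
gains = {name: round(ours - base, 3) for name, (base, ours) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```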

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.20559.

Fig 1: (a) Naive multi-teacher distillation with heterogeneous teachers for learning unified

Fig 2: Overview of UNIEGO. UNIEGO learns a unified egocentric encoder through a two-

Fig 3 (page 3).

Fig 4 (page 3).

Limitations

Proxy selection relies on a small-loss heuristic which may discard useful supervision signals; no learned adaptive weighting explored.
Framework requires multiple pretrained teachers with synchronized multi-modal egocentric and exocentric data at training time.
Evaluation covers action recognition, video retrieval, and action segmentation on three datasets; robustness to domain shifts or unseen viewpoints is not tested.
No explicit adversarial or attack robustness evaluation on distillation from possibly malicious or noisy teachers.
Two-stage training with proxies increases training complexity and resource requirements compared to end-to-end distillation.

Open questions / follow-ons

Can a learned, input-dependent proxy selection mechanism outperform the current small-loss heuristic to better exploit all teacher signals?
How does UNIEGO perform under domain shifts or with occluded/noisy egocentric inputs at inference?
Can the proxy-mediated hierarchical distillation framework be extended to continual or online learning scenarios with dynamic teacher sets?
What is the impact of scaling the number and diversity of teacher models beyond the current nine on unified representation quality and training complexity?

Why it matters for bot defense

For bot defense and CAPTCHA practitioners, UNIEGO represents a sophisticated approach to consolidating heterogeneous multi-modal signals through proxy-mediated distillation to produce a unified, robust representation from a single accessible modality (egocentric RGB video). While not directly related to bot detection, the proxy-based hierarchical distillation mechanism addresses core challenges of conflicting information and representation heterogeneity common in multi-source security contexts. This concept could inspire approaches to unify diverse behavioral signals or model ensembles for bot detection without requiring multi-channel data at inference. Additionally, the selective distillation technique illustrates how adaptive supervision can suppress noisy or adversarial information, a relevant principle for robust security model training. The detailed proxy merging initialization and gradient conflict analysis highlight methods to stabilize complex model fusion, potentially applicable in ensemble defenses involving heterogeneous classifiers. In summary, while focused on video representation, the proxy mediation and adaptive distillation ideas are broadly transferable to multi-teacher or multi-view learning problems in security and bot defense systems.

Cite

bibtex

@article{arxiv2606_20559,
  title={UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning},
  author={Wenhao Chi and Arkaprava Sinha and Dominick Reilly and Hieu Le and Srijan Das},
  journal={arXiv preprint arXiv:2606.20559},
  year={2026},
  url={https://arxiv.org/abs/2606.20559}
}
