Elastic Attention Cores for Scalable Vision Transformers

Source: arXiv:2605.12491 · Published 2026-05-12 · By Alan Z. Song, Yinjie Chen, Mu Nan, Rui Zhang, Jiahang Cao, Weijian Mai et al.

TL;DR

VECA (Visual Elastic Core Attention) is proposed as a replacement for dense patch-to-patch self-attention in Vision Transformers. The core idea is to route all patch interaction through a small, learned set of “core” tokens, so patches never attend directly to other patches. This changes the attention graph from all-to-all to a core-periphery structure, reducing the per-layer attention cost from quadratic in the number of patches to approximately linear for fixed core budget C.

What is new here is not just sparsity, but elasticity: the model is trained with nested dropout over the core axis so different prefixes of the learned core bank can be activated at test time without retraining. The paper claims this lets one trade accuracy for compute on demand. Empirically, the authors report that a ViT-B-sized VECA model stays close to DINOv3 on frozen-backbone classification and dense prediction, while using far fewer attention interactions. The strongest results they emphasize are 81.93 top-1 on ImageNet-1K and 57.46 mIoU on PASCAL Context at C=64, with substantially fewer attention interactions than dense self-attention.

Key findings

At ViT-B scale, Ours-B/16 (C=64) reaches 81.93 top-1 on ImageNet-1K vs DINOv3-B/16 at 83.56, a gap of 1.63 points while using core-mediated attention instead of dense patch self-attention.
On PASCAL Context, Ours-B/16 (C=64) gets 57.46 mIoU vs DINOv3-B/16 at 57.74, only 0.28 mIoU lower; the paper reports this despite 87.1% fewer attention interactions per layer.
On VOC 2012 segmentation, Ours-B/16 (C=64) achieves 87.07 mIoU vs DINOv3-B/16 at 87.50, a 0.43 mIoU gap.
On NYUv2 depth, Ours-B/16 (C=64) reports RMSE 0.3705 vs DINOv3-B/16 at 0.3684; the paper calls this a negligible 0.0021 RMSE gap.
Reducing the active budget to C=8 hurts dense tasks more than classification: Context drops to 53.30 mIoU from 57.46, while ImageNet-1K only drops to 79.99 from 81.93.
The paper states that C=8 corresponds to only 1.6% of the attention interactions of the full matrix at 512 resolution for dense prediction, yet the model remains competitive.
For classification, the authors report that C=8 still retains at least 96% of the full model’s performance on their evaluated benchmarks, indicating strong budget robustness for global recognition.
Dense prediction improves monotonically as core budget increases from C=8 to C=64 in Fig. 5(a), while classification remains comparatively stable across the same budget sweep.
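The interaction percentages quoted above can be sanity-checked with a few lines of arithmetic. The sketch below assumes the per-layer count 2NC + C^2 listed under Technical innovations, a dense baseline of N^2 patch-to-patch interactions, and N = (512/16)^2 = 1024 patches for a 512-resolution input with 16-pixel patches. The function names are illustrative and do not come from the paper.

```python
def core_interactions(n_patches: int, n_cores: int) -> int:
    """Per-layer query-key pairs in a core-periphery layout.

    Cores attend to all N + C tokens (C * (N + C) pairs) and patches attend
    only to the C cores (N * C pairs), giving 2NC + C^2 in total.
    """
    return n_cores * (n_patches + n_cores) + n_patches * n_cores


def dense_interactions(n_patches: int) -> int:
    """Query-key pairs for all-to-all patch self-attention (assumed baseline)."""
    return n_patches * n_patches


def interaction_fraction(n_patches: int, n_cores: int) -> float:
    """Core-periphery pairs as a fraction of the dense patch-to-patch matrix."""
    return core_interactions(n_patches, n_cores) / dense_interactions(n_patches)


# Assumption: 512x512 input with 16-pixel patches -> 32 * 32 = 1024 patches.
N = (512 // 16) ** 2

for C in (8, 64):
    frac = interaction_fraction(N, C)
    print(f"C={C}: {100 * frac:.1f}% of dense, {100 * (1 - frac):.1f}% fewer")
```

Under these assumptions the script reports 1.6% of the dense count for C=8 and 87.1% fewer interactions for C=64, which agrees with the figures the paper states.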

Threat model

Not a security/threat-model paper. The relevant assumption is computational scaling rather than an adversary: the model must handle high-resolution images efficiently, and the design eliminates direct patch-to-patch attention under the assumption that global visual representations can still be learned through a small learned core interface. The paper does not define an attacker, attacker knowledge, or attack capabilities.

Methodology — deep read

The setting is efficiency and representation quality for vision backbones, not security. The authors challenge the hypothesis that rich visual features require direct patch-to-patch interaction. Their setup assumes access to large-scale unlabeled image data, a strong frozen teacher, and downstream linear-probe evaluation. The obstacle being addressed is computational scaling: high-resolution inputs make dense self-attention too expensive. The paper does not model an attacker manipulating inputs or attempting model extraction.

For data, training is done by feature distillation from a frozen DINOv3 teacher on unlabeled Objects365 images. The paper says Stage 1 trains at a fixed 256×256 resolution while sampling active core budgets, and Stage 2 continues from that checkpoint while adapting to higher-resolution teacher targets from {384, 512, 768}. The text in the excerpt does not specify the exact number of training images used from Objects365, the split strategy, or whether any subset filtering is applied. Downstream evaluation uses frozen-backbone linear probing on PASCAL VOC 2012, PASCAL Context-60, ADE20K-150, COCO-Object, COCO-Stuff, Cityscapes, NYUv2, KITTI, ImageNet-1K, ImageNetV2, ImageNet-ReaL, Places365, Food101, SUN397, Oxford-Pets, and CUB-200. For dense benchmarks, the paper states feature extraction at 512 resolution for 16-pixel patches or 518 for 14-pixel patches.

Architecturally, VECA replaces standard self-attention with a block-sparse core-periphery pattern. Given N patch tokens Z and an ordered bank of M learned core tokens R_M, the model selects a budget C≤M and concatenates the prefix R_C=[r_1…r_C] with the patches to form the full sequence X=[R_C; Z]. The key computation is asymmetric: cores attend to the full sequence, R′=Attn(R_C, X, X), while patches attend only to cores, Z′=Attn(Z, R_C, R_C), and the outputs are concatenated. This means information can flow patch→core→patch in two hops, but direct patch→patch links are removed. The authors also attach a learned 2D coordinate to each core token, updated each layer by a lightweight coordinate head and tanh squashing, so core positions can move over the image plane across depth. They use 2D axial RoPE for both patch and core tokens. The novel part is not just the sparse graph but the fact that the full patch set is preserved and updated throughout, unlike Perceiver-style latent bottlenecks that only refine a small latent set.
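The attention pattern described above can be illustrated as a boolean mask. This is a minimal sketch, not the authors' implementation: it assumes a token order of cores first and patches second, uses toy sizes, and the helper names are hypothetical. It only models which query-key pairs are allowed, not the attention weights, the coordinate head, or RoPE.

```python
def core_periphery_mask(n_patches: int, n_cores: int) -> list[list[bool]]:
    """Boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Token order is [cores, patches] (an assumed layout for illustration).
    """
    total = n_cores + n_patches
    mask = []
    for q in range(total):
        if q < n_cores:
            # Core queries see the full sequence: every core and every patch.
            row = [True] * total
        else:
            # Patch queries see only the core tokens, never other patches.
            row = [k < n_cores for k in range(total)]
        mask.append(row)
    return mask


def reachable_in_two_hops(mask: list[list[bool]], src: int, dst: int) -> bool:
    """True if information can flow from token src to token dst via one relay.

    Information flows from key to query, so src -> mid -> dst requires
    mask[mid][src] (mid reads src) and mask[dst][mid] (dst reads mid).
    """
    return any(mask[mid][src] and mask[dst][mid] for mid in range(len(mask)))


N, C = 6, 2  # toy sizes, not from the paper
mask = core_periphery_mask(N, C)

# Number of allowed query-key pairs versus the closed form 2NC + C^2.
allowed = sum(sum(row) for row in mask)
print(allowed, 2 * N * C + C * C)

first_patch, last_patch = C, C + N - 1
print(mask[last_patch][first_patch])                         # direct patch-to-patch link
print(reachable_in_two_hops(mask, first_patch, last_patch))  # path through a core
```

In this toy setup the allowed-pair count equals 2NC + C^2, no patch row has a True entry in a patch column, and any two patches are still connected through a core in two hops.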

Training uses a distillation objective that combines global and dense feature matching: L(I,C)=L_cls(x_cls^(C), x_cls)+λ L_dense(x_dense^(C), x_dense), where x_cls^(C) is the final-layer feature of the first core token at budget C (used as the [CLS]-like embedding) and x_dense^(C) are the final-layer patch features; the unsuperscripted x_cls and x_dense are presumably the matching targets from the frozen teacher. The model is trained with nested dropout along the core axis: at each step, a core budget C is sampled from p_C(·), and only the first C cores of the bank are active. This induces an ordering in which early cores are active more often and therefore learn broadly useful information, while later cores capture finer detail. The excerpt does not give the exact distribution p_C, optimizer, learning rate schedule, batch size, epoch count, or seed protocol; the paper says these are in the appendix. Likewise, the exact VECA block depth, width, number of heads, and maximum core capacity M are not fully specified in the provided text, though results are reported for C∈{8,…,64}.
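To make the nested-dropout idea concrete, the following sketch simulates budget sampling and counts how often each core position is active. Several things here are assumptions, since the excerpt leaves them open: p_C is taken to be uniform over {8, 16, ..., 64}, the bank size M is taken to be 64, and the function names are hypothetical. The loss helper only shows how the two terms are combined; the actual L_cls and L_dense are not specified in the excerpt.

```python
import random

BUDGETS = list(range(8, 65, 8))  # assumed budget set {8, 16, ..., 64}
MAX_CORES = BUDGETS[-1]          # assumed bank size M; the excerpt does not state it


def sample_budget(rng: random.Random) -> int:
    """Draw an active core budget C.

    Uniform sampling is an assumption; the paper's p_C is not in the excerpt.
    """
    return rng.choice(BUDGETS)


def active_prefix(core_bank: list, budget: int) -> list:
    """Nested dropout keeps the first `budget` cores and drops the rest."""
    return core_bank[:budget]


def distill_loss(loss_cls: float, loss_dense: float, lam: float) -> float:
    """Combine the two terms as L = L_cls + lambda * L_dense."""
    return loss_cls + lam * loss_dense


rng = random.Random(0)
core_bank = list(range(MAX_CORES))  # integer stand-ins for learned core tokens
steps = 10_000
active_counts = [0] * MAX_CORES
for _ in range(steps):
    for core in active_prefix(core_bank, sample_budget(rng)):
        active_counts[core] += 1

# Cores early in the bank are active at every step; later ones less often.
for idx in (0, 7, 8, 31, 63):
    print(f"core {idx}: active in {active_counts[idx] / steps:.2f} of steps")
```

With this assumed budget set, the first eight cores are active at every step and the activation rate can only fall for later positions, which is the mechanism the text credits for the coarse-to-fine ordering of the core bank.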

Evaluation is a frozen-backbone linear-probe protocol. The authors compare VECA against CLIP, OpenCLIP, DFNCLIP, SigLIP 2, AM-RADIOv2.5, DINOv2, DINOv2-reg, and DINOv3. Dense prediction uses mIoU on segmentation datasets and RMSE on depth datasets; classification uses top-1 accuracy. The most important ablation is the core budget sweep: Table 2 reports C=8 and C=64 on classification, and Fig. 5(a) sweeps 8 to 64 in steps of 8 for dense prediction and depth. The paper also includes qualitative analyses: Fig. 4 shows patch-to-patch cosine-similarity maps from dense features, Fig. 6 visualizes distinct core tokens attending to different spatial regions, and Fig. 7 shows UMAP evolution from isotropic to semantically clustered core states. One concrete end-to-end example is the ImageNet-1K probe at C=64: the frozen VECA encoder produces its final first-core embedding, a linear classifier is trained on top, and the resulting top-1 accuracy is 81.93.
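For readers less familiar with the three metrics named above, here is a simplified sketch of how each is computed on toy data. It is not the paper's evaluation code: benchmark mIoU is normally accumulated over a whole dataset with ignore labels, and depth RMSE is usually computed over valid pixels only. The function names and the toy inputs are illustrative.

```python
import math


def top1_accuracy(pred_labels: list[int], true_labels: list[int]) -> float:
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(p == t for p, t in zip(pred_labels, true_labels))
    return correct / len(true_labels)


def mean_iou(pred: list[int], target: list[int], num_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union > 0:  # skip classes absent from both (a common convention)
            ious.append(inter / union)
    return sum(ious) / len(ious)


def rmse(pred: list[float], target: list[float]) -> float:
    """Root-mean-square error between predicted and reference values."""
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(target))


# Toy per-pixel labels for a 4-pixel "image" with two classes.
pred_pixels = [0, 0, 1, 1]
true_pixels = [0, 1, 1, 1]

acc = top1_accuracy(pred_pixels, true_pixels)
miou = mean_iou(pred_pixels, true_pixels, num_classes=2)
depth_err = rmse([1.0, 2.0], [1.0, 4.0])
print(acc, miou, depth_err)
```

Higher is better for top-1 and mIoU, while lower is better for RMSE, which is why the NYUv2 comparison reads in the opposite direction from the segmentation and classification rows.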

Reproducibility is partial. The paper mentions a project repository and says additional training details are in the appendix, but the excerpt does not provide code release specifics, frozen checkpoints, exact hyperparameters, or dataset licensing details. Since the main training uses unlabeled Objects365 plus a frozen DINOv3 teacher, exact reproduction depends on access to both the training pipeline and the teacher weights. The downstream evaluation is relatively standard and reproducible in principle, but the elastic-core training procedure and coordinate-head implementation are the critical pieces that need the missing appendix details.

Technical innovations

Replaces all-to-all patch self-attention with a core-periphery attention graph where patches only communicate through learned core tokens, yielding 2NC+C^2 attention interactions per layer, i.e. O(NC) for a fixed core budget, instead of O(N^2).
Uses ordered core banks with nested dropout over the core axis so a single trained model supports multiple inference budgets without retraining.
Maintains and updates the full patch set at every layer, avoiding the small-latent bottleneck common in Perceiver-like designs while still keeping linear-time scaling.
Adds learnable 2D core coordinates that move across layers via a lightweight coordinate head, allowing cores to act as mobile semantic interfaces rather than fixed slots.
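The scaling claim in this list can be illustrated by counting interactions as the input resolution grows. This sketch assumes 16-pixel patches, a fixed budget of C=64, and the resolutions mentioned for the two training stages (256, 384, 512, 768). It counts query-key pairs only, so it says nothing about wall-clock time or memory on real hardware.

```python
def dense_pairs(n_patches: int) -> int:
    """All-to-all patch self-attention: N^2 query-key pairs."""
    return n_patches ** 2


def core_pairs(n_patches: int, n_cores: int) -> int:
    """Core-periphery attention: 2NC + C^2 query-key pairs."""
    return 2 * n_patches * n_cores + n_cores ** 2


PATCH = 16  # pixels per patch side (assumed)
C = 64      # fixed core budget (assumed)

counts = {}
for res in (256, 384, 512, 768):
    n = (res // PATCH) ** 2
    counts[res] = (dense_pairs(n), core_pairs(n, C))
    print(f"{res}px: N={n:5d} dense={counts[res][0]:8d} core={counts[res][1]:7d}")

# Growth in pair count when the patch count rises 9x (256px -> 768px).
dense_growth = counts[768][0] / counts[256][0]
core_growth = counts[768][1] / counts[256][1]
print(f"256px -> 768px: dense x{dense_growth:.1f}, core-periphery x{core_growth:.1f}")
```

With these settings, going from 256 to 768 pixels multiplies the patch count by 9; the dense count grows 81x while the core-periphery count grows about 8.1x, which is the near-linear behavior the list describes.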

Datasets

Objects365 — size not specified in excerpt — public dataset, used without labels as the training source for distillation
PASCAL VOC 2012 — size not specified in excerpt — public benchmark
PASCAL Context-60 — size not specified in excerpt — public benchmark
ADE20K-150 — size not specified in excerpt — public benchmark
COCO-Object — size not specified in excerpt — public benchmark
COCO-Stuff — size not specified in excerpt — public benchmark
Cityscapes — size not specified in excerpt — public benchmark
NYUv2 — size not specified in excerpt — public benchmark
KITTI — size not specified in excerpt — public benchmark
ImageNet-1K — size not specified in excerpt — public benchmark
ImageNetV2 — size not specified in excerpt — public benchmark
ImageNet-ReaL — size not specified in excerpt — public benchmark
Places365 — size not specified in excerpt — public benchmark
Food101 — size not specified in excerpt — public benchmark
SUN397 — size not specified in excerpt — public benchmark
Oxford-Pets — size not specified in excerpt — public benchmark
CUB-200 — size not specified in excerpt — public benchmark

Baselines vs proposed

DINOv3-B/16: ImageNet-1K top-1 = 83.56 vs proposed (Ours-B/16, C=64): 81.93
DINOv3-B/16: PASCAL Context mIoU = 57.74 vs proposed (Ours-B/16, C=64): 57.46
DINOv3-B/16: VOC mIoU = 87.50 vs proposed (Ours-B/16, C=64): 87.07
DINOv3-B/16: NYUv2 RMSE = 0.3684 vs proposed (Ours-B/16, C=64): 0.3705
DINOv2-reg-B/14: ImageNet-1K top-1 = 83.44 vs proposed (Ours-B/16, C=64): 81.93
CLIP-B/16: ImageNet-1K top-1 = 79.33 vs proposed (Ours-B/16, C=64): 81.93
SigLIP 2-B/16: PASCAL Context mIoU = 47.51 vs proposed (Ours-B/16, C=64): 57.46
AM-RADIOv2.5-B/16: ImageNet-1K top-1 = 80.53 vs proposed (Ours-B/16, C=64): 81.93

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.12491.

Fig 1: Outline of our core-periphery attention structure. (a) Self-attention utilizes a fully [...]

Fig 2: Architecture of VECA. (a) Our attention matrix is defined using a core-periphery structure [...]

Fig 3: Learned dense representations. We compare the UMAP visualizations of dense [...]

Figs 4-8 (page 2): captions not included in the excerpt.

Limitations

The paper trains with a fixed maximum core capacity and a manually chosen budget set; budget selection is not content-aware or task-adaptive.
The excerpt does not specify exact training hyperparameters, optimizer, batch size, epochs, or random seed strategy, limiting reproducibility from the provided text alone.
Evaluation is mostly frozen-backbone linear probing; the paper does not report end-to-end fine-tuning results in the excerpt.
No adversarial robustness, OOD robustness, or distribution-shift stress test is reported beyond standard benchmark transfer.
The model is validated on classification, segmentation, and depth, but the authors explicitly note that detection, instance segmentation, video, and task-specific fine-tuning remain untested.
Core behavior analyses are qualitative (UMAPs, similarity maps) and the paper says more systematic analysis is needed to quantify redundancy, stability, and semantic consistency.

Open questions / follow-ons

Can core budgets be selected dynamically from image content, resolution, or downstream task rather than by a fixed manual schedule?
How should the maximum core capacity M scale with model width, training data scale, and target resolution to preserve accuracy/efficiency tradeoffs?
Do the learned core tokens remain stable under fine-tuning, distribution shift, or adversarial perturbations, or do they collapse/reorganize?
How does VECA perform on detection, instance segmentation, video, or open-vocabulary tasks where dense spatial fidelity may matter even more than in the reported benchmarks?

Why it matters for bot defense

For bot-defense and CAPTCHA practitioners, the main relevance is architectural rather than domain-specific: VECA suggests that strong visual features do not require full all-to-all patch mixing. If you are building vision-based challenge solvers, detector backbones, or user-behavior image classifiers, a core-periphery transformer could be a cheaper backbone for high-resolution inputs while preserving dense features. That matters when latency, throughput, and resolution are constrained.

From a defensive angle, the paper is also a reminder that reducing token interactions does not necessarily cripple semantic understanding. If a CAPTCHA system relies on the assumption that global context only emerges from dense self-attention, this result weakens that intuition. An attacker could plausibly use a VECA-like backbone to get most of the accuracy of a stronger ViT at lower compute. For defenders, that means evaluation should not assume quadratic attention is a meaningful barrier; harder tasks should be measured against efficient nonstandard vision backbones, not just classic ViTs.

Cite

bibtex

@article{arxiv2605_12491,
  title={Elastic Attention Cores for Scalable Vision Transformers},
  author={Alan Z. Song and Yinjie Chen and Mu Nan and Rui Zhang and Jiahang Cao and Weijian Mai and Muquan Yu and Hossein Adeli and Deva Ramanan and Michael J. Tarr and Andrew F. Luo},
  journal={arXiv preprint arXiv:2605.12491},
  year={2026},
  url={https://arxiv.org/abs/2605.12491}
}
