UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Source: arXiv:2605.06665 · Published 2026-05-07 · By Minbin Huang, Han Shi, Chuanyang Zheng, Yimeng Wu, Guoxuan Chen, Xintong Yu et al.

TL;DR

UniPool addresses a structural inefficiency in standard Mixture-of-Experts (MoE) transformers: the rigid convention that each transformer layer owns its own private set of expert FFNs, which forces total expert parameters to scale linearly with depth. The authors first establish—through a routing-randomization probe on three production MoE models (Qwen1.5-MoE, DeepSeek-V2-Lite, Qwen3-30B-A3B)—that replacing the learned top-k router in a single deep-layer with uniform random assignment drops downstream accuracy by only 1.0–1.6 points, suggesting deep-layer experts are largely substitutable and therefore redundant. This motivates a different structural premise: expert capacity should be treated as a global architectural budget, not a per-layer allocation.

UniPool replaces all layer-private expert sets with a single globally shared pool of M expert FFNs, while retaining independent per-layer routers. Two co-design choices make this stable: (1) a pool-level auxiliary loss that balances expert utilization across the entire shared pool rather than forcing each layer to independently spread traffic over all experts, and (2) NormRouter, an L2-normalize-then-ReLU gating function that produces scale-stable, naturally sparse routing scores across a larger candidate pool. The architecture is evaluated at five LLaMA-style scales from 182M to 978M active parameters, all trained for 30B tokens on the Pile.

Results show UniPool consistently outperforms vanilla MoE at every tested scale, reducing validation loss by 0.0172 to 0.0386 nats and improving zero-shot accuracy across seven benchmarks. More importantly, reduced-pool variants using only 41.6%–66.7% of the vanilla expert-parameter budget still match or beat layer-private MoE, making pool size an explicit sublinear depth-scaling hyperparameter. Ablations confirm that neither sharing alone nor NormRouter alone explains the gains—the full three-way combination (shared pool + pool-level loss + NormRouter) is required.

Key findings

Routing-randomization probe on three production MoE models (Qwen1.5-MoE, DeepSeek-V2-Lite, Qwen3-30B-A3B) shows that replacing the learned top-k router in a single deep-half layer with uniform random routing drops average downstream accuracy by only 1.0–1.6 points (Table 1), indicating deep-layer private experts are largely substitutable.
UniPool reduces validation loss over matched vanilla MoE by 0.0288 (182M), 0.0346 (469M), 0.0308 (650M), 0.0386 (830M), and 0.0172 (978M) nats after 30B training tokens on the Pile (Table 2), consistent across all five tested scales.
Reduced-pool UniPool variants beat layer-private vanilla MoE while using only 66.7% of expert parameters at 182M, 50% at 469M and 650M, and 41.6% at 830M (Fig 2a), demonstrating sublinear expert-parameter scaling with depth.
Routing-randomization applied to UniPool's own trained 469M and 978M models shows a drop of 4.1 average accuracy points versus only 1.3–1.5 points for matched vanilla MoE models under the same protocol (Table 4), indicating the shared pool converts depth-induced redundancy into genuine expert specialization.
Using a shared pool with the original per-layer auxiliary loss actually hurts performance relative to vanilla MoE at 182M (1.9480 vs. 1.9317 validation loss), while replacing it with pool-level auxiliary loss recovers to 1.9180, and the full UniPool combination (shared pool + pool-level aux + NormRouter) reaches 1.9029 (Table 5 ablation).
NormRouter alone in a layer-private MoE slightly worsens validation loss (1.9375 vs. 1.9317), confirming its benefit is specific to the shared-pool routing regime rather than being a free improvement over softmax routing (Table 5).
Unique expert weight reuse per token increases with depth in UniPool: 94.1% at 12 layers, 89.5% at 24 layers, and 82.7% at 36 layers, quantifying the cross-layer reuse that the shared pool enables (Section 6.2).
UniPool's gains compose with finer expert granularity: at 182M, UniPool outperforms vanilla MoE under 8E/top-1, 16E/top-2, and 32E/top-4 configurations, with absolute validation loss improving as expert count increases in both methods (Fig 2b, Table 3).

Methodology — deep read

Threat model and assumptions: This is not a security paper. The implicit adversarial assumption is that per-layer expert ownership is a structural prior that wastefully duplicates expert functions across depth. The authors assume that sparse MoE expert FFNs represent the dominant parameter cost and that cross-layer reuse is feasible because different residual stream depths can benefit from the same underlying transformations, so long as routing policies remain layer-specific.

Data provenance, size, and preprocessing: All models are trained on the Pile dataset, a large publicly available English-language text corpus. Training runs for 60,000 iterations with batch size 512 and sequence length 1,024, totaling approximately 30B tokens. No special preprocessing beyond standard tokenization is described; the paper does not specify the tokenizer or vocabulary size explicitly, but uses LLaMA-style backbones implying LLaMA tokenization conventions. Validation loss is measured on a held-out split of the Pile. Downstream zero-shot evaluation uses seven standard academic benchmarks: ARC-Easy, ARC-Challenge, PIQA, HellaSwag, WinoGrande, LAMBADA, and RACE, evaluated with the standard lm-evaluation-harness protocol.

Architecture and novel components: The backbone is a LLaMA-style transformer at five active-parameter scales (182M, 469M, 650M, 830M, 978M). In vanilla MoE, each of L layers owns E=8 private expert FFNs with top-1 softmax routing. UniPool replaces these L×E per-layer experts with a single global pool of M = L×E shared expert FFNs (e.g., M=96 for 12 layers, M=384 for 48 layers), keeping per-layer routers. All routers index into the same M-dimensional expert pool; expert parameters are instantiated once in memory and reused. The pool-level auxiliary loss aggregates token-to-expert assignment fractions across all L layers (computing the global mean fraction f_i per expert), then applies a single Switch-style correlation penalty over the pool rather than L independent per-layer penalties. To avoid cross-layer tensor dependencies during backprop, global token-distribution statistics are computed one micro-batch behind (a stop-gradient approximation described in Appendix G). NormRouter replaces softmax gating with L2-normalize-then-ReLU scoring: scores s_i = σ·c·max(0, z_i/(||z||_2 + ε)), where σ is a per-router learnable scalar initialized to 1, c is a fixed Monte Carlo calibration constant, and ε is a stability term. This makes routing scores invariant to hidden-state scale and produces approximately 50% sparsity naturally via ReLU before top-k selection, which is argued to reduce spurious competition among the large shared pool. The combination of shared pool + pool-level loss + NormRouter is the proposed system; each component is individually ablated.

Training regime: All models use AdamW with a cosine learning-rate schedule and bf16 precision in Megatron-LM. Full optimizer hyperparameters are in Appendix D (not fully reproduced in the main paper excerpt). For statistical reliability, 182M results are averaged over three random seeds; larger scales use one run per configuration due to compute cost, which the authors acknowledge as a limitation. Reduced-pool variants are trained at each scale by simply setting M to fractions of the matched budget (e.g., M=64, 48 at 182M for 66.7%/50% of vanilla expert parameters), keeping top-1 routing so active FLOPs remain matched.

Evaluation protocol: The primary metric is validation loss (nats) on held-out Pile text, with perplexity reported as exp(loss). Downstream accuracy is zero-shot on the seven benchmarks above, reported as unweighted mean accuracy. Ablation studies are conducted at 182M only, varying: (a) whether the expert pool is shared or private, (b) whether the auxiliary loss is per-layer or pool-level, (c) whether the router is softmax or NormRouter, and (d) the sharing scope G (number of independent sub-pools, from G=12 for fully private to G=1 for fully shared). The routing-randomization probe (Table 1 on production models, Table 4 on own models) is a held-out behavioral test: the learned router in deep-half layers is replaced by uniform random assignment, the model is evaluated on the five benchmark tasks, and the accuracy drop relative to the learned router is recorded. For UniPool, the cardinality-matched protocol samples from each layer's top-8 most-used shared experts (not the full pool) to ensure a fair comparison against top-8 random within vanilla's 8-expert pool.

Concrete end-to-end example: Consider the 182M model (12 layers, hidden size 768, 8E/top-1 configuration). Vanilla MoE allocates 8 private expert FFNs per layer = 96 total expert FFN parameter tensors. UniPool allocates one shared pool of 96 expert FFNs. During a forward pass at layer 5, the NormRouter at layer 5 computes logits z = W_5·h (where h is the layer-5 residual stream hidden state), L2-normalizes them, applies ReLU to zero out ~48 of 96 scores, scales by σ_5·c, and selects the top-1 expert from the non-zero scores. The selected expert (say, expert 37) is the same nn.Module that layer 2 might have also selected for a different token with a different context. The pool-level auxiliary loss accumulates f_i (mean token fraction across all 12 layers dispatched to expert i) and P_i (mean routing probability across all 12 layers for expert i) computed with a one-step delayed global statistic, then adds α_pool · 96 · Σ f_i · P_i to the total loss. This single scalar penalty prevents any expert from being globally ignored without forcing every layer to use every expert.

Reproducibility: Code is open-sourced at https://github.com/Centaurus-Alpha/UniPool, implemented in Megatron-LM. The Pile dataset is publicly available. The paper does not mention releasing frozen checkpoints. Appendices (B, D, G, H) provide architectural tables, optimizer settings, implementation details for the delayed statistic, and Monte Carlo calibration for NormRouter's constant c. Larger-scale results use single runs, limiting statistical confidence.

Technical innovations

Global shared expert pool: replaces L independent per-layer expert sets with a single M-expert pool accessed by all layers simultaneously, decoupling expert-parameter count from transformer depth—prior work (Mixtral, DeepSeek series, GShard) universally retains per-layer private expert ownership.
Pool-level auxiliary loss: redefines load-balancing over the globally owned parameter pool by aggregating token-fraction and routing-probability statistics across all L layers before applying the Switch-style correlation penalty, rather than applying L independent per-layer Switch losses that are structurally misaligned with global parameter ownership.
NormRouter for shared-pool routing: adopts L2-normalize-then-ReLU gating (from KERN/NormRouter, Zuo et al.) as a co-design choice for shared-pool MoE, where its scale invariance addresses layer-dependent hidden-state norm variation across depths and its natural ~50% sparsity sharpens competition over a larger candidate pool without explicit sparsification.
Sublinear expert-parameter scaling: empirically demonstrates that pool size is an independent depth-scaling hyperparameter—reduced-pool UniPool variants with 41.6%–66.7% of vanilla expert parameters match or outperform layer-private MoE, showing expert parameters can grow sublinearly with depth under cross-layer sharing.
Routing-redundancy diagnosis as architectural motivation: introduces a routing-randomization probe (replacing deep-layer learned routers with uniform random) as a diagnostic tool, applies it to both production models (Table 1) and own trained models (Table 4) to characterize and confirm redundancy before and after architectural intervention.

Datasets

The Pile — ~30B tokens used for training (full dataset is ~825GB) — EleutherAI, publicly available
ARC-Easy — standard benchmark, ~2.4K test questions — Allen AI, publicly available
ARC-Challenge — standard benchmark, ~1.17K test questions — Allen AI, publicly available
PIQA — standard benchmark, ~1.8K test questions — publicly available
HellaSwag — standard benchmark, ~10K validation examples — publicly available
WinoGrande — standard benchmark, ~1.27K test questions — publicly available
LAMBADA — standard benchmark, ~5K test examples — publicly available
RACE — standard benchmark, reading comprehension — publicly available

Baselines vs proposed

Vanilla MoE 182M (8E/top-1): val loss = 1.9317, PPL = 6.9012 vs UniPool: val loss = 1.9029, PPL = 6.7058
Vanilla MoE 469M (8E/top-1): val loss = 1.7982, PPL = 6.0388 vs UniPool: val loss = 1.7636, PPL = 5.8334
Vanilla MoE 650M (8E/top-1): val loss = 1.7568, PPL = 5.7940 vs UniPool: val loss = 1.7260, PPL = 5.6186
Vanilla MoE 830M (8E/top-1): val loss = 1.7309, PPL = 5.6458 vs UniPool: val loss = 1.6923, PPL = 5.4320
Vanilla MoE 978M (8E/top-1): val loss = 1.7171, PPL = 5.5683 vs UniPool: val loss = 1.6999, PPL = 5.4736
Dense baseline 182M: val loss = 2.042, PPL = 7.708 vs UniPool 182M: val loss = 1.9029, PPL = 6.7058
Vanilla MoE 182M zero-shot avg accuracy = 38.74% vs UniPool 182M = 39.61% (Table 3)
Vanilla MoE 830M zero-shot avg accuracy = 43.82% vs UniPool 830M = 45.67% (Table 3)
Shared pool + layer aux + softmax (ablation): val loss = 1.9480 vs vanilla MoE: 1.9317 (shared pool alone with wrong loss is worse)
Shared pool + pool aux + softmax: val loss = 1.9180 vs full UniPool (pool aux + NormRouter): 1.9029 (Table 5)
Vanilla MoE 469M routing randomization: avg accuracy drop = -1.3 pts vs UniPool 469M: -4.1 pts (Table 4)
Vanilla MoE 978M routing randomization: avg accuracy drop = -1.5 pts vs UniPool 978M: -4.1 pts (Table 4)
Reduced-pool UniPool 182M at 66.7% expert budget: val loss = 1.9215 vs vanilla MoE: 1.9317 (still outperforms with fewer parameters)
Reduced-pool UniPool 830M at 41.6% expert budget: Δval loss = -0.013 vs vanilla MoE (outperforms with 41.6% of the expert parameters)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.06665.

Fig 3

Fig 3: Expert utilization at the 182M scale: per-layer auxiliary loss leads to global expert collapse,

Fig 2

Fig 2 (page 10).

Limitations

Single-seed evaluation at 469M, 650M, 830M, and 978M scales: only the 182M results are averaged over three seeds, so the reported loss improvements at larger scales lack statistical confidence bounds and could reflect run-to-run variance rather than consistent gains.
Scale ceiling at ~1B active parameters: all experiments stay below 1B active parameters and use a relatively small 30B-token training budget; it is unclear whether the sublinear expert-parameter scaling benefit holds at the 7B, 13B, or 70B scales where MoE decisions have the most practical impact.
No inference/serving cost analysis: the paper focuses on parameter efficiency (training-time budget) but does not analyze memory bandwidth, KV-cache pressure, or expert-dispatch latency under shared-pool routing at inference time—where all-to-all communication patterns may differ from vanilla MoE.
Reduced-pool results tested only on the Pile, a single (though diverse) English-language corpus: generalization of the sublinear scaling finding to multilingual, code-heavy, or domain-specific training distributions is untested.
The one-micro-batch-delayed global statistic for the pool-level auxiliary loss introduces a subtle approximation whose effect on training stability at larger scales or with larger batch sizes is not studied; the paper acknowledges this implementation detail but provides no ablation over the delay.
NormRouter's fixed calibration constant c is estimated via Monte Carlo sampling and fixed before training; there is no analysis of sensitivity to this constant or whether it needs to be re-estimated for different pool sizes, hidden dimensions, or training stages.
The routing-randomization probe for UniPool uses a cardinality-matched top-8 protocol (sampling from each layer's most-used experts) rather than the full pool, which is a reasonable design choice but makes the vanilla vs. UniPool comparison not perfectly symmetric—the full-pool randomization results are deferred to an appendix table and not discussed in depth in the main paper.

Open questions / follow-ons

Does sublinear expert-parameter scaling hold at production MoE scales (7B–70B active parameters, 1T+ training tokens), or does the benefit saturate or reverse when the shared pool must serve a much larger number of layers with more heterogeneous representations?
What is the optimal pool-size-to-depth ratio as a function of model scale, and can a principled scaling law be derived relating pool size M, number of layers L, hidden dimension d, and downstream loss?
How does UniPool interact with expert parallelism and all-to-all communication in distributed training? Vanilla MoE's per-layer expert sets enable layer-local dispatch; a globally shared pool may require cross-layer coordination that changes communication topology and could bottleneck throughput at scale.
Can the routing-randomization probe—where a small drop indicates redundancy and a large drop indicates specialization—serve as a practical diagnostic during training to dynamically adjust pool size or routing temperature, effectively making pool allocation adaptive rather than fixed at architecture definition time?

Why it matters for bot defense

For bot-defense and CAPTCHA systems, the direct relevance of UniPool is architectural rather than adversarial: it is a paper about making language model experts more parameter-efficient, not about attacking or defending CAPTCHAs. However, bot-defense practitioners running LLM-based components—such as challenge generation, behavioral anomaly scoring, or NLU for abuse detection—should note the efficiency claim. If a production system uses a sparse MoE model for any of these tasks, UniPool's finding that 41.6%–66.7% of expert parameters can be eliminated without quality loss (and sometimes with improvement) is directly actionable for reducing serving costs. Smaller expert pools mean lower memory pressure, which matters when these models run in latency-sensitive inline request paths.

More broadly, the routing-redundancy findings have an indirect implication for adversarial ML against MoE-based classifiers. If deep-layer experts in vanilla MoE are largely substitutable (as shown by the 1.0–1.6 point drop under random routing), then an adversary probing or attacking a vanilla MoE abuse classifier may be able to exploit this redundancy—e.g., by crafting inputs that route to interchangeable experts, making adversarial transfers easier. UniPool's claim that shared-pool routing produces genuinely specialized experts (4.1-point drop under randomization vs. 1.3–1.5) suggests that, if validated at scale, a UniPool-based classifier might be harder to attack via routing manipulation. This is speculative at the tested scales and would require dedicated adversarial evaluation before influencing product decisions.

Cite

bibtex

@article{arxiv2605_06665,
  title={ UniPool: A Globally Shared Expert Pool for Mixture-of-Experts },
  author={ Minbin Huang and Han Shi and Chuanyang Zheng and Yimeng Wu and Guoxuan Chen and Xintong Yu and Yichun Yin and Hong Cheng },
  journal={arXiv preprint arXiv:2605.06665},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.06665}
}

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts ​

TL;DR ​

Key findings ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Figures from the paper ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​