Sparse MoE Routing

TL;DR

Routing is the half of MoE that is not the experts themselves: a small network plus a selection rule that decides which experts see which tokens.
Token-choice top-k (Switch Transformer, Mixtral) — each token picks its top-k experts. Simple, but prone to load imbalance.
Expert-choice (Zhou et al. 2022) — each expert picks its top-c tokens. Perfectly balanced by construction; tokens may be dropped.
Auxiliary-loss-free balancing (DeepSeek-V3, 2024) — heuristic bias adjustment that achieves balance without a competing training objective.

The Routing Problem

An MoE layer must answer two questions for every token: which experts should process it, and how should their outputs be combined? The first is the routing problem. The second is usually straightforward: a weighted sum of the selected experts' outputs, with weights taken from the normalised router scores (most often a softmax).

Routing decisions are discrete: the top-k selection itself has no useful gradient, so the router is trained only through the gate weights applied to the experts that were selected. Different routing schemes trade off differentiability, load balance, expressiveness and inference complexity.

Token-Choice Top-k

The dominant scheme. The router produces one logit per expert; each token keeps its k highest-scoring experts and typically applies a softmax over those k logits to obtain the combination weights. Used by Mixtral (k=2), DeepSeek-V3 (k=8 routed + 1 shared), Qwen3-MoE (k=8), GLaM (k=2).
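The selection and weighting steps described above can be sketched in a few lines. This is a minimal, single-token illustration in plain Python, assuming softmax gating over the selected logits; the function names `top_k_route` and `combine` are hypothetical, and real implementations operate on batched tensors.

```python
import math

def top_k_route(logits, k):
    """Route one token: return [(expert_index, weight), ...] for its top-k experts."""
    # Indices of the k largest logits (ties broken by the lower index).
    chosen = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    # Softmax restricted to the chosen logits, shifted by the max for stability.
    m = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - m) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def combine(expert_outputs, routes):
    """Weighted sum of the chosen experts' output vectors."""
    dim = len(next(iter(expert_outputs.values())))
    out = [0.0] * dim
    for i, w in routes:
        for d in range(dim):
            out[d] += w * expert_outputs[i][d]
    return out

# Four experts, k=2: experts 0 and 2 tie for the highest logit.
routes = top_k_route([2.0, 0.5, 2.0, -1.0], k=2)
# Toy expert outputs for the two selected experts only.
mixed = combine({0: [1.0, 0.0], 2: [0.0, 1.0]}, routes)
```

Only the selected experts need to be evaluated, which is where the sparsity saving comes from.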

Pros: simple, well-understood, works at scale. Cons: nothing forces tokens to spread evenly, so without an explicit balancing term a few experts can absorb most of the traffic while the rest are under-used. The top-1 special case (Switch Transformer) simplifies the layer further but is even more imbalance-prone.
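One common form of explicit balancing term is the Switch Transformer auxiliary loss, alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens whose top-1 expert is i and P_i is the mean router probability assigned to expert i. The sketch below assumes top-1 routing and a small batch of raw logits; `switch_aux_loss` is a hypothetical name and the default `alpha` is illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def switch_aux_loss(batch_logits, alpha=0.01):
    """alpha * N * sum_i f_i * P_i for top-1 routing over a batch of tokens."""
    num_tokens = len(batch_logits)
    num_experts = len(batch_logits[0])
    f = [0.0] * num_experts  # fraction of tokens dispatched to each expert
    P = [0.0] * num_experts  # mean router probability per expert
    for row in batch_logits:
        p = softmax(row)
        top1 = max(range(num_experts), key=lambda i: p[i])
        f[top1] += 1.0 / num_tokens
        for i in range(num_experts):
            P[i] += p[i] / num_tokens
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

balanced = switch_aux_loss([[5.0, 0.0], [0.0, 5.0]])   # one token per expert
collapsed = switch_aux_loss([[5.0, 0.0], [5.0, 0.0]])  # both tokens on expert 0
```

The loss is smallest (equal to `alpha`) when load and probability mass are uniform, and grows as routing collapses onto fewer experts.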

Expert-Choice Routing

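Zhou et al. (2022) inverted the choice: instead of each token choosing experts, each expert chooses its top-c tokens. With N experts each choosing c tokens, exactly N·c token-expert assignments are processed per layer and every expert receives exactly c tokens, so load balance is perfect by construction. A single token may be picked by several experts, or by none.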
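A minimal sketch of the selection rule, assuming a dense score matrix indexed as `scores[token][expert]`; the function name `expert_choice_route` and the toy scores are hypothetical. It illustrates both consequences of the scheme: fixed per-expert load, and tokens that no expert picks.

```python
def expert_choice_route(scores, capacity):
    """scores[t][e] is the router score of token t for expert e.

    Returns (assignments, unrouted), where assignments[e] lists the token
    indices expert e picked and unrouted lists tokens picked by no expert.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignments = []
    for e in range(num_experts):
        # Each expert ranks all tokens by its own score column and keeps the top c.
        ranked = sorted(range(num_tokens), key=lambda t: (-scores[t][e], t))
        assignments.append(ranked[:capacity])
    picked = {t for tokens in assignments for t in tokens}
    unrouted = [t for t in range(num_tokens) if t not in picked]
    return assignments, unrouted

scores = [
    [0.9, 0.8],  # token 0: attractive to both experts
    [0.7, 0.6],  # token 1
    [0.1, 0.2],  # token 2: low score everywhere
    [0.2, 0.1],  # token 3
]
assignments, unrouted = expert_choice_route(scores, capacity=2)
```

In this toy case both experts pick tokens 0 and 1, so each expert has exactly two tokens while tokens 2 and 3 bypass the layer.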

Pros: no auxiliary loss needed, no expert collapse. Cons: tokens not chosen by any expert skip that layer's experts entirely (the residual stream still carries them forward, so this is not catastrophic). Best suited to training; harder to deploy because each expert's selection depends on the other tokens in the batch, which is awkward for autoregressive decoding, where future tokens are not yet available.

Auxiliary-Loss-Free Balancing (DeepSeek-V3)

DeepSeek-V3 uses a token-choice variant in which a per-expert bias is added to the router scores before top-k selection. The biases are updated heuristically at each training step: experts with above-average load get their bias decreased, those with below-average load get theirs increased. The bias affects only which experts are selected; the combination weights are still computed from the unbiased scores. The biases nudge routing toward balance without an explicit auxiliary loss competing with the language-modelling objective.
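The update rule can be sketched as a fixed-step, sign-based adjustment. This is an illustration under stated assumptions rather than DeepSeek-V3's actual code: the names `select_top_k` and `update_bias`, the step size `gamma`, and the use of the mean as the balance target are all assumptions made for the example.

```python
def select_top_k(scores, bias, k):
    """Pick experts by biased score; the bias influences selection only."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: (-biased[i], i))[:k]

def update_bias(bias, load, gamma=0.001):
    """Move each bias one fixed step against its expert's load deviation."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)  # overloaded: make it less attractive
        elif n < mean_load:
            new_bias.append(b + gamma)  # underloaded: make it more attractive
        else:
            new_bias.append(b)
    return new_bias

# Expert 0 took 8 tokens, expert 1 none, expert 2 exactly the mean of 4.
bias = update_bias([0.0, 0.0, 0.0], load=[8, 0, 4], gamma=0.1)
# A previously dominant expert 0 loses the selection once its bias is negative.
chosen = select_top_k([0.6, 0.5, 0.4], [-0.2, 0.0, 0.0], k=1)
```

Because the biases are not trained by gradient descent, the update adds no term to the loss; gate weights would be computed from the original `scores`, not the biased ones.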

Empirically, this was reported to match the load balance of auxiliary-loss methods while improving final quality, and it is one of the contributions credited for DeepSeek-V3's strong benchmark performance.

If you are implementing MoE from scratch in 2026, start with DeepSeek-V3's bias-update scheme rather than the original Switch auxiliary loss. It is simpler and generally produces a better optimisation trajectory.

Capacity and Token Dropping

In practice, many routing implementations bound each expert's load with a capacity factor: a multiplier, typically 1.25-2.0, applied to the average per-expert load to give the maximum number of tokens an expert will accept per batch. Tokens that would push an expert over capacity are either re-routed to their next-choice expert or dropped (their MoE-layer output becomes zero, with the residual stream carrying them unchanged). Dropless implementations skip the cap and accept variable per-expert work instead.
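As a worked example of the arithmetic, the sketch below computes a per-expert capacity and applies it with the simplest policy, dropping overflow tokens in arrival order. The helper names `expert_capacity` and `apply_capacity` are hypothetical, and rounding the capacity up is an assumption; implementations differ on rounding and on drop versus re-route.

```python
import math

def expert_capacity(num_tokens, num_experts, k, capacity_factor):
    """Slots per expert: capacity_factor times the mean load, rounded up."""
    return math.ceil(capacity_factor * num_tokens * k / num_experts)

def apply_capacity(first_choice, num_experts, capacity):
    """first_choice[t] is token t's top-1 expert; tokens are admitted in order.

    Returns (kept, dropped), where kept[e] lists the tokens expert e accepted.
    """
    kept = [[] for _ in range(num_experts)]
    dropped = []
    for t, e in enumerate(first_choice):
        if len(kept[e]) < capacity:
            kept[e].append(t)
        else:
            dropped.append(t)  # over capacity: only the residual path carries it
    return kept, dropped

# 8 tokens, 4 experts, top-1: mean load is 2, so a factor of 1.25 gives 3 slots.
cap = expert_capacity(num_tokens=8, num_experts=4, k=1, capacity_factor=1.25)
kept, dropped = apply_capacity([0, 0, 0, 0, 1, 2, 3, 0], num_experts=4, capacity=cap)
```

Here five tokens want expert 0 but only three fit, so tokens 3 and 7 are dropped while experts 1 to 3 run well under capacity.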

Token dropping is mostly a training-time device. At serving time, dropping a token changes the model's output, and smaller batch sizes give the router less room to balance, so serving engines like vLLM and TensorRT-LLM generally compute every routed token and absorb imbalance as uneven per-expert work rather than bounding it with a hard capacity limit.

Communication Cost

With expert parallelism, routing produces all-to-all communication: every device must send tokens to wherever their chosen experts live and receive tokens from elsewhere for the experts it hosts, once to dispatch the tokens and once to return the expert outputs. This is often the dominant communication cost in distributed MoE training and serving.

Hopper's NVLink 4 (900 GB/s per GPU) and Blackwell's NVLink 5 (1.8 TB/s per GPU) make this viable inside a node. Across nodes, a fabric in the class of InfiniBand NDR/XDR (400/800 Gb/s per port, roughly 50/100 GB/s) is typically needed; with much less bandwidth, MoE becomes impractical at frontier scale.
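A back-of-envelope estimate shows why the interconnect matters. The sketch below gives a rough upper bound that assumes every routed activation leaves the device, uses 2-byte activations, and ignores latency, overlap with compute and link efficiency. The function names and the example sizes are illustrative assumptions, not measurements.

```python
def all_to_all_bytes(num_tokens, k, hidden_size, bytes_per_element=2):
    """Upper bound on bytes one device sends per MoE layer in the forward pass.

    Each token's activation goes to k experts (dispatch) and the k expert
    outputs come back (combine), hence the factor of 2.
    """
    one_way = num_tokens * k * hidden_size * bytes_per_element
    return 2 * one_way

def transfer_ms(num_bytes, gigabytes_per_second):
    """Idealised transfer time in milliseconds at a given bandwidth in GB/s."""
    return num_bytes / (gigabytes_per_second * 1e9) * 1e3

# Illustrative sizes: 8192 tokens per device, k=8, hidden size 7168.
volume = all_to_all_bytes(num_tokens=8192, k=8, hidden_size=7168)

# Note the units: link rates in Gb/s must be divided by 8 to get GB/s.
nvlink4_ms = transfer_ms(volume, 900)      # 900 GB/s
ib_ndr_ms = transfer_ms(volume, 400 / 8)   # 400 Gb/s = 50 GB/s
```

With these numbers the per-layer volume is about 1.9 GB, and the idealised transfer is roughly 18 times slower over a single 400 Gb/s link than over NVLink 4, which is why implementations try to keep most routed traffic inside a node.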

References

Mixture-of-Experts with Expert Choice Routing (Zhou et al., 2022) · arXiv
Switch Transformer (Fedus et al., 2021) · arXiv
DeepSeek-V3 Technical Report (2024) · arXiv
