TL;DR
- Mixture of Experts replaces a dense feed-forward layer with N expert sub-networks plus a router that selects k of them per token; total parameter count grows with N, compute per token grows only with k.
- Shazeer et al.'s 2017 'Outrageously Large Neural Networks' (arXiv:1701.06538) introduced sparsely-gated MoE; Switch Transformer (Fedus et al., 2021, arXiv:2101.03961) simplified it to top-1; Mixtral 8x7B (Dec 2023) brought it into the open-weights mainstream; DeepSeek-V3 (Dec 2024, 671B total / 37B active) made it the frontier-quality default.
- MoE breaks the dense-Transformer scaling wall: roughly 3-7x cheaper inference at equivalent dense-quality, with the same memory footprint as a dense model of the same total parameter count.
- Trades off training stability (router collapse, load imbalance), inference complexity (all-to-all communication across expert-parallel devices) and KV cache cost (full per token regardless of expert routing) for materially better parameter-to-FLOP economics.
- Standard 2026 production stack: dense attention + GQA + RoPE + RMSNorm + MoE SwiGLU FFN with top-8 routing, fine-grained experts (256+ per layer), auxiliary-loss-free balancing, expert parallelism across NVLink-connected GPUs.
Overview#
Mixture of Experts is the answer to a question dense Transformers cannot duck: how do you make a model bigger without paying for every parameter on every token? In a dense feed-forward layer, every input token passes through the full matrix multiplication. Whether the token is a punctuation mark or a multi-step reasoning anchor, it pays the same compute cost. MoE breaks that link. The layer holds N expert sub-networks, each smaller than a dense FFN; a tiny router decides which k of them should process each token. Only the chosen experts run. Parameters scale with N, FLOPs per token scale with k, and the gap between them is the architectural lever that lets a 671B-parameter model run with the per-token FLOPs of a 37B dense one.
The idea is older than the Transformer. Shazeer et al. published 'Outrageously Large Neural Networks' at ICLR 2017, months before 'Attention Is All You Need', and demonstrated 137 billion parameters on a sparsely-gated MoE stack — a scale that was unthinkable for dense models of the era. The technique sat as a research curiosity for four years until Switch Transformer (Fedus, Zoph and Shazeer, 2021) simplified it to top-1 routing, reached 1.6 trillion parameters, and made the engineering practical. Then Mixtral 8x7B (Mistral, December 2023) shipped an MoE model with weights anyone could download, with Llama-2-level quality at one-third the inference cost. The architecture went from experimental to default in roughly twelve months.
DeepSeek-V3 (December 2024) is the current high-water mark: 671 billion total parameters, 37 billion active per token, 256 routed experts plus 1 shared expert per layer, fine-grained expert size, auxiliary-loss-free balancing, multi-token prediction at training. It matched or beat GPT-4-class quality on most public benchmarks while costing roughly $5.6 million to train — an order of magnitude less than the rumoured dense-frontier budgets of the same era. Most of the frontier closed models (GPT-4, Claude 4, Gemini 2) are widely believed to be MoE based on serving cost economics, though only the open-weights side has confirmed the recipe in print.
This entry is the systems reference for engineers working with MoE in 2026: the routing options, the load-balancing schemes, the memory and FLOP arithmetic, the all-to-all communication patterns that dominate distributed training, the inference engine support, and the failure modes that derail unprepared training runs. This entry helps you understand MoE well enough to decide whether a sparse model fits your serving budget, size the multi-GPU footprint a DeepSeek-V3 or Mixtral 8x22B will actually need, and avoid the router-collapse failure mode that ends amateur MoE training runs around step 5k. If you are deploying MoE models on Yobibyte or training one on Yobitel NeoCloud, this matters because the catalogue's frontier MoEs (DeepSeek-V3, Mixtral 8x22B, Qwen3-MoE 235B, Llama 4 Scout) demand HBM and NVLink topology that the picker reasons about explicitly, and the per-token economics in this entry are why MoE is on offer at all.
Quick start: top-k MoE in PyTorch#
The shortest path to understanding sparse routing is to implement a small MoE layer from scratch. The snippet below runs today with `pip install torch` on CPU or GPU. It defines a router and a list of SwiGLU experts, routes 16 random tokens through four experts with top-2 routing, and prints which experts each token landed on so you can watch load shift as the input changes.
```python
# moe_minimal.py: run with `pip install torch && python moe_minimal.py`
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGLUFFN(d_model, d_ff) for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model)
        logits = self.router(x)                    # (tokens, num_experts)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = topk_vals.softmax(dim=-1)        # (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        # Return router logits too so you can compute the balancing loss.
        return out, logits


# Smoke test: route 16 tokens through 4 experts with top-2.
moe = MoELayer(d_model=64, d_ff=128, num_experts=4, top_k=2)
x = torch.randn(16, 64)
y, logits = moe(x)
print("token -> chosen experts (top-2):")
print(logits.topk(2, dim=-1).indices)
print("output shape:", y.shape)
# Re-run with different inputs to watch the assignments shift; the router here
# is untrained, so experts only specialise once you add a training loop.
```
The per-expert Python loop above is fine for understanding but is the slow path at scale. Production stacks (Megablocks, DeepSpeed MoE) use a single grouped GEMM that processes all experts in one CUDA kernel. On a single H100 SXM5, that change is the difference between 200 tokens/sec and 200,000 tokens/sec.
How it works: routing, experts, and the parameter-to-FLOP split#
A Mixture-of-Experts layer is a drop-in replacement for the position-wise feed-forward block in a Transformer. The attention sub-layer is unchanged. What changes is that the single dense FFN — typically two-thirds of the parameters in a dense block — becomes N expert FFNs plus a router. Almost every modern MoE architecture keeps dense attention; the sparsification is FFN-only.
The router is a single linear projection from the residual stream (dimension d_model) to N logits, one per expert. Token i's logits are passed through a top-k selector, the chosen logits are softmaxed (so the k weights sum to 1), and the token is dispatched to those k experts. Each expert is itself a SwiGLU FFN, and all experts in a layer share the same shape. In fine-grained designs each expert is narrower than the dense baseline FFN would be, so the k experts a token activates together cost about as much as one dense FFN while the layer as a whole holds many times more parameters.
Mathematically, MoE(x) = sum over chosen experts e of g_e(x) * Expert_e(x), where g_e(x) is the softmax weight from the router for expert e on token x. Tokens that did not pick e get zero weight and Expert_e never runs for them. The non-differentiability of the top-k selection is handled by passing gradient through the soft weights g_e and through whichever experts actually fired; experts that did not fire receive no gradient signal for that token. Over a batch of many tokens, every expert sees enough updates — provided the load balancing works.
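Written out, the routing rule described above takes the following form. The symbols $z$, $W_r$ and $\mathcal{T}(x)$ are notation introduced here for the router logits, the router weight matrix and the set of top-k expert indices; they are not names used elsewhere in this entry.

```latex
\begin{aligned}
z &= W_r\,x \in \mathbb{R}^{N}, \qquad \mathcal{T}(x) = \operatorname{TopK}(z,\,k) \\
g_e(x) &=
  \begin{cases}
    \dfrac{\exp(z_e)}{\sum_{j \in \mathcal{T}(x)} \exp(z_j)} & e \in \mathcal{T}(x) \\[6pt]
    0 & \text{otherwise}
  \end{cases} \\
\mathrm{MoE}(x) &= \sum_{e \in \mathcal{T}(x)} g_e(x)\,\mathrm{Expert}_e(x)
\end{aligned}
```

This matches the softmax-over-chosen-logits description in this section. Some shipped models normalise differently (for example a softmax over all N logits before the top-k selection), so treat it as one common form, not the only one.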
The parameter-to-FLOP split is the central economic claim. A dense SwiGLU FFN with d_ff = 2.67 * d_model has roughly 8 * d_model^2 parameters, and its forward pass costs roughly 2 FLOPs per parameter per token. An MoE FFN with N = 256 fine-grained experts, each at d_ff_expert = (2.67 * d_model) / 8, and top-k = 8 has roughly 256/8 = 32x the parameters of that dense FFN but the same FLOPs per token, because the 8 active experts together are exactly one dense FFN wide. That is the lever. Memory cost scales linearly with N (every expert has to live in HBM somewhere), so MoE is great when you have plenty of HBM and a constrained FLOP budget, which is exactly the situation for serving at large batch sizes on H100/H200/B200 clusters.
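The split can be checked with a few lines of arithmetic. The sketch below is a minimal accounting for a single FFN layer; the concrete sizes (`d_model=4096`, `d_ff_dense=10944`, chosen so that d_ff is about 2.67 * d_model and divisible by 8) are illustrative assumptions, not the dimensions of any particular model.

```python
def swiglu_params(d_model: int, d_ff: int) -> int:
    """SwiGLU has three weight matrices: gate and up (d_model x d_ff), down (d_ff x d_model)."""
    return 3 * d_model * d_ff


def ffn_accounting(d_model: int, d_ff_dense: int, num_experts: int,
                   top_k: int, shrink: int) -> dict:
    """Compare one dense FFN with an MoE FFN whose experts are `shrink` times narrower."""
    d_ff_expert = d_ff_dense // shrink
    dense = swiglu_params(d_model, d_ff_dense)
    expert = swiglu_params(d_model, d_ff_expert)
    router = d_model * num_experts  # one linear projection to N logits
    return {
        "dense_params": dense,
        "moe_total_params": num_experts * expert,
        "moe_active_params": top_k * expert,
        "router_params": router,
        # Forward pass: about 2 FLOPs (multiply + add) per weight per token.
        "dense_flops_per_token": 2 * dense,
        "moe_flops_per_token": 2 * (top_k * expert + router),
    }


acct = ffn_accounting(d_model=4096, d_ff_dense=10944,
                      num_experts=256, top_k=8, shrink=8)
print("total params vs dense :", acct["moe_total_params"] / acct["dense_params"])
print("active params vs dense:", acct["moe_active_params"] / acct["dense_params"])
print("router share of active:", acct["router_params"] / acct["moe_active_params"])
```

The router term is kept separate so its size relative to the active expert weights is visible alongside the two ratios.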
Inference treats the KV cache exactly like a dense model: the cache is per token, per layer, per attention head, not per expert. That is because attention is dense; only the FFN is sparse. So a 671B-parameter MoE has the same KV cache footprint per token as a dense model with the same depth and attention configuration, whatever that dense model's parameter count. The MoE saving is FLOPs and active weight bandwidth per token; it is not KV cache.
- Total parameters P_total: scales with number of experts N.
- Active parameters per token P_active: scales with top-k k. Roughly P_total * (k/N) for routed experts, plus the always-on shared expert if present.
- FLOPs per token: same as a dense model of P_active size (plus a small router-cost term, typically <1 %).
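- Weight bandwidth per token at inference: only the active experts' weights are streamed from HBM for a given token, so at small batch sizes memory bandwidth pressure tracks P_active, not P_total. At large batch sizes different tokens pick different experts, so the set of experts read per step grows towards all N.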
- KV cache: identical to a dense model of the same depth and attention configuration. MoE does not shrink it.
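The KV cache point can be made concrete with a short calculation. This is a sketch assuming a standard multi-head or grouped-query cache (one key and one value vector per KV head per layer); the `ModelConfig` class and the layer and head counts are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    num_layers: int
    num_kv_heads: int
    head_dim: int
    num_experts: int = 1  # 1 means a plain dense FFN
    top_k: int = 1


def kv_cache_bytes_per_token(cfg: ModelConfig, bytes_per_elem: int = 2) -> int:
    """Keys plus values for every layer. num_experts and top_k never enter the formula."""
    return 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * bytes_per_elem


# Hypothetical configurations: same attention stack, different FFN sparsity.
dense_cfg = ModelConfig(num_layers=60, num_kv_heads=8, head_dim=128)
moe_cfg = ModelConfig(num_layers=60, num_kv_heads=8, head_dim=128,
                      num_experts=256, top_k=8)

per_token = kv_cache_bytes_per_token(moe_cfg)
context_gib = per_token * 32_768 / 2**30  # one 32k-token sequence in BF16

print("dense bytes/token:", kv_cache_bytes_per_token(dense_cfg))
print("MoE bytes/token  :", per_token)
print("GiB for one 32k-token sequence:", context_gib)
```

Changing `num_experts` or `top_k` leaves the result untouched, which is the whole point: only depth, KV head count, head dimension and precision move the cache size.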
Variants and architectural choices: MoE configuration in shipped models#
The table below lists the configurations that have actually shipped in frontier models. The pattern shifted decisively between 2021 and 2024: from 'full-width experts, top-1' (Switch Transformer) to 'many fine-grained experts, top-2 to top-8, often with a shared expert and auxiliary-loss-free balancing' (DeepSeek-V3, Qwen3-MoE). The newer pattern produces better quality at fixed compute and is more robust during training.
| Model | Total params | Active params | Experts/layer | Top-k | Shared expert? | Balancing |
|---|---|---|---|---|---|---|
| Switch Transformer (2021) | 1.6 T | Variable | 2,048 | 1 | No | Aux load loss |
| GLaM (2022) | 1.2 T | 97 B | 64 | 2 | No | Aux load loss |
| Mixtral 8x7B (2023) | 46.7 B | 12.9 B | 8 | 2 | No | Aux load loss |
| Mixtral 8x22B (2024) | 141 B | 39 B | 8 | 2 | No | Aux load loss |
| DeepSeek-V2 (2024) | 236 B | 21 B | 160 + 2 shared | 6 + 2 | Yes (2) | Aux load loss |
| DeepSeek-V3 (2024) | 671 B | 37 B | 256 + 1 shared | 8 + 1 | Yes (1) | Auxiliary-loss-free bias |
| Qwen3-MoE 235B (2025) | 235 B | 22 B | 128 | 8 | No | Aux load + bias |
| Llama 4 Scout (2025) | 109 B | 17 B | 16 + 1 shared | 1 + 1 | Yes (1) | Aux load loss |
The fine-grained-expert pattern (DeepSeek-V3, Qwen3-MoE) gives the router more freedom to specialise and produces better quality per active parameter, at the cost of more all-to-all communication. The trade-off is worth it on NVLink-connected GPUs; less so on PCIe-only clusters.
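The 'auxiliary-loss-free bias' entry in the table can be illustrated with a toy simulation. The sketch below follows the general idea of adding a per-expert bias to the routing score for selection only, then nudging that bias down for overloaded experts and up for underloaded ones after each step. The function names, the uniform toy scores, the step size `gamma` and the artificial favouring of expert 0 are all assumptions made for illustration; this is not DeepSeek's code.

```python
import random


def route(scores, bias, k):
    """Pick k experts by biased score; gate weights use the raw (unbiased) scores."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    norm = sum(scores[e] for e in chosen)
    return chosen, [scores[e] / norm for e in chosen]


def update_bias(bias, counts, gamma):
    """Fixed-size nudge: overloaded experts go down, underloaded experts go up."""
    mean = sum(counts) / len(counts)
    updated = []
    for b, c in zip(bias, counts):
        if c > mean:
            updated.append(b - gamma)
        elif c < mean:
            updated.append(b + gamma)
        else:
            updated.append(b)
    return updated


def simulate(num_experts=8, top_k=2, tokens=1024, steps=150, gamma=0.005, seed=0):
    rng = random.Random(seed)
    bias = [0.0] * num_experts
    history = []  # max load divided by mean load, one entry per step
    for _ in range(steps):
        counts = [0] * num_experts
        for _ in range(tokens):
            # Toy positive affinities; expert 0 is artificially favoured by +0.3.
            scores = [rng.uniform(0.05, 1.0) + (0.3 if e == 0 else 0.0)
                      for e in range(num_experts)]
            chosen, _gates = route(scores, bias, top_k)
            for e in chosen:
                counts[e] += 1
        history.append(max(counts) / (sum(counts) / num_experts))
        bias = update_bias(bias, counts, gamma)
    return history, bias


history, bias = simulate()
print(f"max/mean load, first step: {history[0]:.2f}")
print(f"max/mean load, last step : {history[-1]:.2f}")
```

Because the bias only affects which experts are selected and not the gate weights, balancing does not add a gradient term that competes with the language-modelling loss, which is the motivation usually given for this scheme.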
Where it is used today: frontier open-weights MoEs and the serving stack around them#
Mixture-of-Experts has gone from research curiosity to default frontier shape inside a single product cycle. The serving stack around it is mature on Hopper and Blackwell GPUs, and the open-weights models are credible direct replacements for closed-API frontier services for many workloads.
Mixtral 8x22B (Mistral, April 2024) is the canonical 'fits in a single 8x H100 node' frontier MoE: 141 B total, 39 B active, 8 experts per layer, top-2 routing. The official Apache 2.0 weights make it the standard reference workload for MoE serving research. vLLM has shipped MoE kernels since 0.3 and is the most common production serving engine for it; the snippet below illustrates a typical 8x H100 tensor-parallel load — illustrative, not deployment-grade.
DeepSeek-V3 sits one tier up in scale: too large to fit on a single 8x H100 node in BF16 (the 671B model needs ~1.3 TB just for weights). The standard recipe is FP8 expert weights plus a small set of BF16 always-hot layers, sharded across two 8x H100 nodes connected by InfiniBand NDR. SGLang and the official DeepSeek inference image both ship pre-tuned configurations with expert parallelism (`--enable-ep-moe`) and a Hopper-tuned all-to-all kernel (`--moe-a2a-backend deepep`). On a typical NeoCloud deployment, DeepSeek-V3 lands on 16 GPUs spread across two physical nodes. Smaller MoEs need less: Qwen3-MoE 235B fits on one 8x H100 node, and Mixtral 8x7B fits on part of a node or even a single H200 in some configurations.
Continued pretraining and instruction tuning of MoE models is the third common production use, and it needs careful handling of the router. Standard LoRA fine-tuning works on the expert weights but is less natural for the router (which is already a tiny linear). Most practitioners freeze the router, apply LoRA to the expert SwiGLU matrices (target_modules of `w1`, `w2`, `w3` per expert in HuggingFace TRL), and train with a small auxiliary balancing loss to prevent drift. Fine-tuning a 4-bit quantised Mixtral 8x7B on a 2 % slice of ultrachat_200k takes a single 80 GB H100 and a few hours at typical batch sizes — enough to demonstrate the recipe before scaling to a full instruction set.
```python
# Illustrative: load Mixtral 8x22B on 8x H100 SXM5 with vLLM.
# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",
    tensor_parallel_size=8,  # one H100 per shard
    dtype="bfloat16",
    gpu_memory_utilization=0.90,
    max_model_len=32768,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    # vLLM uses fused grouped-GEMM MoE kernels by default on Hopper.
)
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)
out = llm.generate(["Explain mixture of experts in three paragraphs."], params)
print(out[0].outputs[0].text)
```
Trade-offs and known limitations#
MoE trades dense-Transformer's simplicity for a parameter-to-FLOP economic advantage that is real but conditional. Understanding the conditions matters before adopting it.
The good story first. Inference FLOPs per token scale with active parameters, not total: roughly 2 * P_active for a forward pass. DeepSeek-V3 is 2 * 37e9 = 74 GFLOPs per token, roughly one-eleventh of Llama 3.1 405B's 810 GFLOPs per token at comparable quality. Training compute follows the same shape (6 * P_active * T_tokens). On a fixed cluster a 37B-active MoE serves around 10x the tokens per second per dollar of a 405B dense model, as long as the cluster has the HBM to host all experts. That ratio is the central economic claim of the architecture.
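The per-token figures follow directly from the 2 * P_active and 6 * P_active * T rules of thumb. A minimal sketch of that arithmetic, with the function names and the 1-trillion-token training budget chosen here purely for illustration:

```python
def forward_flops_per_token(active_params: float) -> float:
    """About 2 FLOPs (multiply + add) per active weight in a forward pass."""
    return 2.0 * active_params


def training_flops(active_params: float, tokens: float) -> float:
    """About 6 FLOPs per active weight per token (forward plus backward)."""
    return 6.0 * active_params * tokens


moe_flops = forward_flops_per_token(37e9)     # DeepSeek-V3 active parameters
dense_flops = forward_flops_per_token(405e9)  # Llama 3.1 405B, all parameters active
ratio = dense_flops / moe_flops

print(f"MoE   : {moe_flops / 1e9:.0f} GFLOPs per token")
print(f"dense : {dense_flops / 1e9:.0f} GFLOPs per token")
print(f"dense / MoE ratio: {ratio:.1f}")

# Hypothetical training budget of 1e12 tokens, for scale only.
print(f"MoE training FLOPs at 1e12 tokens: {training_flops(37e9, 1e12):.2e}")
```

These are first-order estimates: they ignore attention FLOPs that grow with context length and the small router cost.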
The bad story is HBM and interconnect. Inference weight memory is P_total * bytes_per_param, not P_active * bytes_per_param: DeepSeek-V3 at FP8 needs 671 GB of HBM, at BF16 1,342 GB, regardless of how few experts fire per token. Each expert needs to live somewhere: spreading the 671 B total evenly over 256 expert slots gives ~2.6 B parameters per slot (summed across all layers), which is ~2.6 GB per slot in FP8 and, on 80 GB GPUs, a minimum of 16-rank expert parallelism. The KV cache is identical to a dense model of the same depth and head count; MoE does not shrink it. Routing also produces an all-to-all communication per layer: at 32k batch tokens, d_model = 7168, BF16, top-8 routing across 16 expert-parallel ranks, that is ~14 GB per layer per step. On NVLink 4 at 900 GB/s this is ~16 ms; on InfiniBand NDR at 400 Gb/s it is ~280 ms. PCIe-only clusters are not viable for >32-expert frontier MoE.
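The all-to-all estimate can be reproduced with a back-of-envelope model. The sketch below assumes that "32k" means 32 * 1024 tokens and that each token's activation crosses the interconnect four times per layer per training step (dispatch and combine in the forward pass, and again in the backward pass). The `passes=4` factor is an assumption made here to arrive at the ~14 GB figure; it is not stated in the text. The model also charges the full volume to a single link and ignores tokens that stay on their own rank, so it is a rough bound, not a measurement.

```python
def all_to_all_bytes(tokens: int, d_model: int, top_k: int,
                     bytes_per_elem: int = 2, passes: int = 4) -> int:
    """Activation bytes moved per layer per step under the assumptions above."""
    return tokens * d_model * top_k * bytes_per_elem * passes


def transfer_ms(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealised transfer time: volume divided by link bandwidth, in milliseconds."""
    return 1e3 * num_bytes / bandwidth_bytes_per_s


volume = all_to_all_bytes(tokens=32 * 1024, d_model=7168, top_k=8)
volume_gib = volume / 2**30

nvlink_ms = transfer_ms(volume, 900e9)          # NVLink 4: 900 GB/s
infiniband_ms = transfer_ms(volume, 400e9 / 8)  # NDR: 400 Gb/s = 50 GB/s

print(f"volume per layer per step: {volume_gib:.1f} GiB")
print(f"NVLink 4      : {nvlink_ms:.1f} ms")
print(f"InfiniBand NDR: {infiniband_ms:.1f} ms")
```

The results land in the same range as the rounded figures quoted above. The gap between the two interconnects, well over an order of magnitude, is the part that matters for placement decisions.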
Most teams arrive at MoE by trying to scale a dense model and running out of either compute budget or HBM. Within the family, the design choices have settled. Token-choice top-k (Mixtral, DeepSeek-V3) dominates: it is simpler than expert-choice and easier to deploy. Expert-choice (Zhou et al., 2022) achieves perfect load balance by construction but drops tokens that no expert picked, which is fine for training and awkward for serving. Dense-MoE hybrids (Llama 4 Maverick, 2025) interleave dense and MoE layers: the dense layers act as the always-on baseline and the MoE layers add specialisation. This pattern is growing in 2026 but is not yet the frontier default. Multi-Token Prediction (DeepSeek-V3) trains with a parallel prediction of the next 2 tokens, orthogonal to MoE but often shipped alongside.
If you start from a strong dense checkpoint, the upcycling path is well-trodden: take the dense FFN, copy it into each of N experts with mild perturbation, and continue training. Komatsuzaki et al. (2022) reported roughly 30-50 % of fresh tokens needed to reach MoE-quality, far cheaper than training from scratch. Most production MoE teams arrive this way rather than from a green field.
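The copy-and-perturb step can be sketched in a few lines. This toy uses plain Python lists as a stand-in for weight tensors; `upcycle_ffn`, `init_router`, the noise scale and the tiny weight vector are hypothetical names and values chosen for illustration, and a real implementation would operate on framework tensors per layer.

```python
import random


def upcycle_ffn(dense_weights, num_experts, noise_std=0.01, seed=0):
    """Copy one dense FFN's weights into num_experts experts, each with mild Gaussian noise."""
    rng = random.Random(seed)
    return [
        [w + rng.gauss(0.0, noise_std) for w in dense_weights]
        for _ in range(num_experts)
    ]


def init_router(d_model, num_experts, std=0.02, seed=1):
    """The router has no dense counterpart, so it starts from small random weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(d_model)] for _ in range(num_experts)]


dense = [0.5, -1.2, 0.3, 0.8]  # toy stand-in for a flattened dense FFN
experts = upcycle_ffn(dense, num_experts=4)
router = init_router(d_model=4, num_experts=4)

for i, expert in enumerate(experts):
    drift = max(abs(w - d) for w, d in zip(expert, dense))
    print(f"expert {i}: max deviation from dense = {drift:.4f}")
```

The perturbation exists to break symmetry: if every expert started as an exact copy, they would all receive near-identical updates early in continued training.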
Practical implementation notes#
Libraries that implement MoE well in 2026: vLLM ships fused grouped-GEMM MoE kernels by default on Hopper since 0.3 and is the most common serving engine for Mixtral and DeepSeek-V3; SGLang ships pre-tuned expert-parallel configurations with the Hopper-optimised DeepEP all-to-all kernel; TensorRT-LLM covers the same models with NVIDIA's kernel stack; Megablocks (Gale et al.) is the canonical research-grade kernel library; DeepSpeed-MoE and Megatron-Core MoE cover the training side. The per-expert Python loop in the quick-start above is fine for understanding but is the slow path — production stacks compile every expert into a single grouped GEMM that runs as one CUDA kernel.
Router collapse is the failure mode that kills more amateur MoE training runs than every other issue combined. The symptom is loss diverging to NaN around step ~5k; the underlying cause is one expert receiving the bulk of tokens and gradient overflowing in it. The fix is either a Switch-style auxiliary balancing loss with weight 0.01, or — better in 2026 — the DeepSeek-V3 auxiliary-loss-free bias-update scheme. Verify expert init variance matches the dense baseline. Monitor per-expert token count from step 100 onward; by the time loss diverges at step 5,000, the imbalance has been building for hours.
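A Switch-style auxiliary loss can be sketched without any framework. The version below implements the top-1 form, alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens whose top choice is expert i and P_i is the mean router probability for expert i. It uses plain Python lists so the arithmetic is visible; in a real training loop the same quantity would be computed on tensors so that gradients flow through P_i. The function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_aux_loss(router_logits, alpha=0.01):
    """alpha * N * sum_i f_i * P_i over a batch of per-token router logits (top-1 routing)."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    f = [0.0] * num_experts  # fraction of tokens dispatched to each expert
    p = [0.0] * num_experts  # mean router probability per expert
    for row in router_logits:
        probs = softmax(row)
        winner = max(range(num_experts), key=lambda e: probs[e])
        f[winner] += 1.0 / num_tokens
        for e in range(num_experts):
            p[e] += probs[e] / num_tokens
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced batch: each of 4 tokens strongly prefers a different expert.
balanced = [[10.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# Collapsed batch: every token prefers expert 0.
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]

print("balanced :", switch_aux_loss(balanced))
print("collapsed:", switch_aux_loss(collapsed))
```

The loss reaches its minimum of alpha under perfectly uniform routing and approaches alpha * N when one expert takes everything, so the penalty grows with the number of experts being starved.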
Several other failure modes have characteristic shapes:
- Router entropy stuck at log(2) regardless of input: the router is dead. W_router weights have underflowed or saturated; reinitialise with smaller variance (1/sqrt(d_model)) and raise the router learning rate to 2-3x the base LR.
- Capacity overflow >20 % at training: usually the capacity factor is too tight (raise from 1.25 to 1.5 or 2.0) or the training data has shifted distribution.
- Distributed training stalls at the expert-parallel all-to-all: almost always interconnect saturation or NCCL not using NVLink. Swap to the DeepEP kernel, verify NVLink topology with `nvidia-smi topo -m`, and check NCCL_IB_HCA and NCCL_NET_GDR_LEVEL.
- Throughput on B200 not 2x H100 as expected: typically the all-to-all is not yet using NVLink 5 generation kernels. Update NCCL to 2.23+, set NCCL_NVLS_ENABLE=1, and verify TransformerEngine 1.13+ MoE kernels are enabled.
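The capacity-factor adjustment is easy to reason about numerically. The sketch below assumes the common convention that each expert's capacity is capacity_factor * tokens * top_k / num_experts, rounded up, and that routed token slots beyond that capacity count as overflow. The per-expert counts are made-up numbers for illustration.

```python
import math


def expert_capacity(tokens: int, num_experts: int, top_k: int,
                    capacity_factor: float) -> int:
    """Maximum token slots one expert accepts per step."""
    return math.ceil(capacity_factor * tokens * top_k / num_experts)


def overflow_rate(counts, capacity: int) -> float:
    """Fraction of routed token slots that exceed their expert's capacity."""
    total = sum(counts)
    dropped = sum(max(0, c - capacity) for c in counts)
    return dropped / total if total else 0.0


# Hypothetical step: 1024 tokens, 8 experts, top-2 routing, one hot expert.
counts = [520, 300, 260, 240, 220, 200, 180, 128]  # sums to 1024 * 2 slots

for factor in (1.25, 1.5, 2.0):
    cap = expert_capacity(tokens=1024, num_experts=8, top_k=2, capacity_factor=factor)
    print(f"capacity_factor={factor}: capacity={cap}, "
          f"overflow={overflow_rate(counts, cap):.2%}")
```

Raising the factor reduces overflow at the price of larger per-expert buffers and more padding when experts are under-used, which is why fixing the underlying imbalance is preferable when possible.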
Fine-tuning MoE from a quantised checkpoint deserves caution. Fine-tune divergence from FP8 expert weights happens because FP8 cannot absorb fine-tune updates well and activations exceed the calibrated range. Either dequantise experts to BF16 for fine-tune and requantise after, or use QLoRA-style adapters on top of frozen FP8 experts.
Serving has its own characteristic failure modes:
- Inference quality drops sharply for code prompts only: this is a serving-time capacity issue. Code experts saturate and tokens get re-routed to non-code experts. Raise the capacity factor at serving, or pin code-detected requests to a separate replica with higher per-expert capacity.
- Per-token output reproducibility breaks under retry: capacity-driven token re-routing changes which expert ran. Pin the random seed for capacity routing, or accept non-determinism at overflow.
- Expert utilisation looks healthy at training but collapses at serving: this is the small-batch problem. Raise batch size, apply a higher capacity factor at serving, or accept the imbalance and continue.
- Cold-start latency is 3-5x higher than the dense baseline: this is the all-experts-must-load problem. Pre-load all experts at replica start, or use weight-streaming with a hot-set of frequently-routed experts pinned.
Operational signals worth tracking: per-expert token count per step (should be approximately uniform, within ~1.5x of average), router entropy (should sit near log(N); below log(N/2) signals collapse), capacity overflow rate (healthy <5 %, unhealthy >20 %), all-to-all kernel time per layer as a percentage of total step time (alert when >30 %), and per-expert activation max-abs drift if serving FP8 (re-calibrate quarterly if it shifts >2x).
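Two of these signals can be computed from per-expert token counts alone. The sketch below takes "router entropy" to mean the entropy of the empirical expert-usage distribution over a step, which is one reasonable reading but an assumption on this entry's part; the thresholds are the ones quoted above, and the function names and sample counts are illustrative.

```python
import math


def usage_entropy(counts) -> float:
    """Entropy (in nats) of the fraction of token slots each expert received."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


def load_ratio(counts) -> float:
    """Busiest expert's load relative to the mean load."""
    return max(counts) / (sum(counts) / len(counts))


def check_health(counts) -> dict:
    n = len(counts)
    entropy = usage_entropy(counts)
    ratio = load_ratio(counts)
    return {
        "entropy": entropy,
        "max_entropy": math.log(n),
        "load_ratio": ratio,
        "balanced": ratio <= 1.5,              # within ~1.5x of average
        "collapsed": entropy < math.log(n / 2),
    }


healthy = check_health([256] * 8)
skewed = check_health([1761] + [41] * 7)  # one expert takes most token slots

print("healthy:", healthy)
print("skewed :", skewed)
```

Logging these two numbers per layer per step from the start of a run is cheap, and it surfaces a building imbalance long before the loss curve does.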
Sizing arithmetic for a planning conversation: inference weight memory is P_total * bytes_per_param (FP8 expert weights with BF16 router and shared expert is the standard production trade-off, typically under 1 % quality loss). Mixtral 8x7B needs around 47 GB FP8 and fits on 2x H100 SXM5; Mixtral 8x22B around 141 GB on 4x H100; DeepSeek-V2 around 236 GB on 8x H100; Qwen3-MoE 235B around 235 GB on 8x H100; DeepSeek-V3 around 671 GB on 16x H100 (two nodes); Llama 4 Scout around 109 GB on 4x H100. Top-k larger than ~8-16 erodes the FLOP saving and increases all-to-all volume linearly; most frontier models settle at k = 2 or k = 8. Auxiliary loss weight is typically 0.01-0.1: higher stabilises balance but degrades language-modelling loss; lower risks collapse. Runtime-level throughput and $/token tables live in the inference-runtime entries (vllm, tensorrt-llm, sglang), not here.
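The GPU counts above can be approximated with a simple rule of thumb. The sketch below assumes 80 GB H100s, FP8 weights at one byte per parameter, a cap of 55 % of HBM for weights (leaving the rest for KV cache, activations and runtime overhead), and rounding up to a power of two for parallelism layouts. The 55 % figure is an illustrative assumption picked here, not a number from this entry or from any vendor; it happens to reproduce the counts listed above.

```python
import math


def weight_gb(total_params_billion: float, bytes_per_param: float = 1.0) -> float:
    """Billions of parameters times bytes per parameter gives gigabytes."""
    return total_params_billion * bytes_per_param


def gpus_needed(weights_gb: float, hbm_gb: float = 80.0,
                weight_fraction: float = 0.55) -> int:
    """Smallest power-of-two GPU count whose weight budget covers the model."""
    raw = math.ceil(weights_gb / (hbm_gb * weight_fraction))
    return 1 << (raw - 1).bit_length()  # round up to the next power of two


models = {  # total parameters in billions
    "Mixtral 8x7B": 46.7,
    "Mixtral 8x22B": 141,
    "DeepSeek-V2": 236,
    "Qwen3-MoE 235B": 235,
    "DeepSeek-V3": 671,
    "Llama 4 Scout": 109,
}

plan = {name: gpus_needed(weight_gb(p)) for name, p in models.items()}
for name, count in plan.items():
    print(f"{name:16s} {weight_gb(models[name]):6.1f} GB FP8 -> {count}x H100")
```

Real placements also depend on context length, batch size and interconnect topology, so treat the output as a starting point for a planning conversation.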
Model card discipline for MoE deserves an extra line item: total parameters, active parameters per token, expert configuration, routing scheme and auxiliary loss settings. Buyers and auditors need both parameter counts to reason about deployment cost and behaviour. Activation quantisation interacts with MoE because expert outputs vary in dynamic range — sensitive deployments (medical, financial) should validate per-expert calibration after any precision change.
Where MoE fits in the Yobitel stack#
Mixture-of-Experts models are a first-class citizen in Yobibyte's model catalogue. Customers pick Mixtral 8x22B, DeepSeek-V3, Qwen3-MoE 235B or Llama 4 Scout by name in their workspace, and Yobibyte routes inference through industry-standard runtimes selected per workload — with the MoE-specific details (expert parallelism layout, FP8 calibration, capacity factor at serving, KV cache management) handled transparently. The customer sees an OpenAI-compatible endpoint, not the all-to-all routing under it.
Omniscient Compute — Yobitel's compute-orchestration fabric — pays close attention to MoE workloads. The picker reasons about total weight footprint (drives HBM requirement), active parameters (drives FLOP requirement), and interconnect topology (drives all-to-all viability). DeepSeek-V3 lands on 2x 8x H100 SXM5 nodes with NVLink within each node and InfiniBand NDR between; smaller MoE models like Mixtral 8x7B may land on a single H200 to exploit the larger HBM. None of this is exposed to the customer; the workspace abstracts it.
InferenceBench publishes MoE serving benchmarks alongside dense ones, with the same methodology: tokens-per-second, time-to-first-token, p99 latency, cost-per-million-tokens. The MoE results consistently demonstrate the architectural argument made above — DeepSeek-V3 reaches Llama-3.1-405B-class quality at a fraction of the per-token cost on the same hardware. For teams selecting between dense and MoE, the InferenceBench data is the empirical complement to the architectural reasoning here.
References
- Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer (Shazeer et al., 2017) · arXiv
- Switch Transformer (Fedus et al., 2021) · arXiv
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (Du et al., 2022) · arXiv
- Mixture-of-Experts with Expert Choice Routing (Zhou et al., 2022) · arXiv
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (Komatsuzaki et al., 2022) · arXiv
- Mixtral of Experts (Jiang et al., 2024) · arXiv
- DeepSeek-V3 Technical Report (2024) · arXiv
- DeepSeekMoE: Toward Ultimate Expert Specialization (Dai et al., 2024) · arXiv