Curiosity · Concept
Mixture of Experts (MoE)
Mixture of Experts replaces a dense feed-forward block with many parallel experts plus a learned router. For each token, only the top-k experts fire, which decouples total capacity from active compute: DeepSeek-V3 holds 671B parameters but activates only ~37B per token, so its inference cost is closer to that of a ~37B dense model. DeepSeek-V3, Mixtral, and (reportedly) GPT-4 and Gemini use this design.
Quick reference
- Proficiency
- Advanced
- Also known as
- MoE, Sparse MoE, SMoE
- Prerequisites
- Transformer architecture, Feed-forward networks
Frequently asked questions
What is Mixture of Experts?
An architecture where the feed-forward layer of a Transformer is split into many parallel 'expert' sub-networks, and a learned router assigns each token to a small subset (usually 1 or 2). Only those experts run, so you get large total parameters with small per-token cost.
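The routing step can be sketched in a few lines. This is a toy numpy illustration, not any production implementation: the router weights, expert weights, and dimensions below are made-up placeholders, and each expert is just a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k, n_tokens = 16, 8, 2, 4

# Hypothetical toy weights: one linear router, one linear map per expert.
router_w = rng.normal(size=(d_model, n_experts))
expert_ws = rng.normal(size=(n_experts, d_model, d_model))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens):
    """Route each token to its top_k experts and mix their outputs."""
    probs = softmax(tokens @ router_w)            # (n_tokens, n_experts)
    top = np.argsort(probs, axis=-1)[:, -top_k:]  # indices of top-k experts
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        chosen = top[t]
        weights = probs[t, chosen] / probs[t, chosen].sum()  # renormalize
        for e, w in zip(chosen, weights):
            # Only the chosen top_k experts ever execute for this token.
            out[t] += w * (tokens[t] @ expert_ws[e])
    return out, top

tokens = rng.normal(size=(n_tokens, d_model))
out, top = moe_layer(tokens)
```

Note that the per-token loop is purely for clarity; real systems batch tokens by expert so each expert runs one dense matmul over the tokens routed to it.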
Why use MoE instead of a dense model?
Because it decouples capacity from compute. A dense 70B model activates all 70B parameters for every token. A 400B MoE with 2-of-64 routing might activate only ~12B parameters per token yet draw on all 400B parameters of stored knowledge, giving better quality at similar inference cost.
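The arithmetic behind that claim is simple enough to write down. The function and numbers below are illustrative (a hypothetical model whose expert layers hold 384B of its parameters), not the specs of any real system; shared layers such as attention are ignored when `shared_params=0`.

```python
def active_params(total_expert_params, n_experts, top_k, shared_params=0.0):
    """Parameters touched per token: shared layers plus top_k of n_experts,
    assuming experts are equally sized."""
    return shared_params + total_expert_params * top_k / n_experts

# Hypothetical model: 384B of expert parameters, 2-of-64 routing.
print(active_params(384e9, 64, 2) / 1e9)  # → 12.0 (≈12B active per token)
```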
What is load balancing in MoE?
Without a balancing signal, the router tends to send everything to a few favorite experts, leaving the rest undertrained and idle. An auxiliary loss penalizes imbalance so traffic spreads across experts. DeepSeek's 'auxiliary-loss-free balancing' instead adjusts per-expert bias terms on the routing scores, avoiding the gradient interference an extra loss term can cause.
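A common form of this auxiliary loss is the one from the Switch Transformers paper: n_experts times the sum over experts of (fraction of tokens routed to expert i) × (mean router probability on expert i). A minimal numpy sketch, assuming top-1 routing and a batch of precomputed router probabilities:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignments, n_experts):
    """Switch-Transformer-style aux loss: n_experts * sum_i f_i * P_i,
    where f_i is the fraction of tokens dispatched to expert i (top-1)
    and P_i is the mean router probability mass placed on expert i."""
    f = np.bincount(expert_assignments, minlength=n_experts) / len(expert_assignments)
    P = router_probs.mean(axis=0)
    return n_experts * float(np.sum(f * P))

# Perfectly uniform routing attains the minimum value of 1.0:
probs = np.full((8, 4), 0.25)          # 8 tokens, 4 experts, uniform probs
assign = np.array([0, 1, 2, 3] * 2)    # each expert receives 2 tokens
print(load_balancing_loss(probs, assign, 4))  # → 1.0
```

The loss is minimized (at 1.0) when routing is uniform and grows as traffic concentrates, so adding it (scaled by a small coefficient) to the training objective nudges the router toward balance.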
Which modern LLMs are MoE?
DeepSeek-V3 and DeepSeek-R1, Mixtral 8x7B and 8x22B, Qwen 2.5-MoE, Databricks DBRX, and reportedly GPT-4 and Gemini Ultra. Frontier-scale models have mostly moved toward sparse MoE designs.
Sources
- Shazeer et al. — Outrageously Large Neural Networks (Sparsely-Gated MoE) — accessed 2026-04-20
- Fedus et al. — Switch Transformers — accessed 2026-04-20
- DeepSeek-V3 Technical Report — accessed 2026-04-20
- Hugging Face — Mixture of Experts explained — accessed 2026-04-20