Curiosity · Concept

Mixture of Experts (MoE)

Mixture of Experts replaces a dense feed-forward block with many parallel experts plus a learned router. For each token, only the top-k experts fire. This decouples total capacity from active compute: DeepSeek-V3, for example, holds 671B total parameters but activates only 37B per token, so a ~600B-parameter MoE can run at roughly the inference cost of a ~37B-parameter dense model. DeepSeek-V3, Mixtral, and Google's Gemini all use this design.

Quick reference

Proficiency
Advanced
Also known as
MoE, Sparse MoE, SMoE
Prerequisites
Transformer architecture, Feed-forward networks

Frequently asked questions

What is Mixture of Experts?

An architecture where the feed-forward layer of a Transformer is split into many parallel 'expert' sub-networks, and a learned router assigns each token to a small subset (usually 1 or 2). Only those experts run, so you get large total parameters with small per-token cost.
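The routing step above can be sketched in a few lines. This is a minimal illustration, not any specific model's implementation; the function names and the toy linear "experts" are invented for the example. Per token, the router scores every expert, keeps the top-k, renormalizes their gate weights, and combines only those experts' outputs.

```python
import numpy as np

def top_k_route(x, w_router, k=2):
    """Route one token: score all experts, keep the top-k.

    x: (d,) token activation; w_router: (d, n_experts) learned router weights.
    Returns the indices and softmax-normalized gate weights of the k winners.
    """
    logits = x @ w_router                        # (n_experts,) affinity per expert
    top = np.argsort(logits)[-k:][::-1]          # indices of the k largest logits
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                         # renormalize over selected experts only
    return top, gates

def moe_forward(x, w_router, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    idx, gates = top_k_route(x, w_router, k)
    return sum(g * experts[i](x) for i, g in zip(idx, gates))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
w_router = rng.normal(size=(d, n_experts))
# Toy experts: each stands in for a feed-forward sub-network (here just linear).
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]

x = rng.normal(size=d)
idx, gates = top_k_route(x, w_router, k=2)
y = moe_forward(x, w_router, experts, k=2)
```

With k=2 of 4 experts, only half the expert parameters touch this token; the other experts are skipped entirely, which is where the compute saving comes from.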

Why use MoE instead of a dense model?

Because it decouples capacity from compute. A dense 70B model applies all 70B parameters to every token. A 400B MoE with 2-of-64 routing might run only ~12B active params per token yet have 400B of knowledge to draw on, giving better quality at similar speed.
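The ~12B figure follows from simple arithmetic. A rough back-of-envelope sketch (it ignores shared parameters such as attention and embeddings, which add to the true active count):

```python
# Hypothetical 400B-parameter MoE with top-2 routing over 64 experts.
total_params = 400e9
n_experts = 64
k = 2

# Crude estimate: assume parameters are split evenly across experts and
# ignore non-expert (attention, embedding) parameters.
per_expert = total_params / n_experts   # ~6.25B per expert
active = k * per_expert                 # ~12.5B active per token

print(f"~{active / 1e9:.1f}B active parameters per token")
```

So each token pays for two experts' worth of compute, not sixty-four.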

What is load balancing in MoE?

Without a balancing signal, the router tends to route everything to a few favorite experts, wasting the rest. An auxiliary loss penalizes imbalance so traffic spreads across experts. DeepSeek's 'auxiliary-loss-free balancing' uses per-expert bias terms instead.
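The classic auxiliary loss (in the style of Switch Transformers, not DeepSeek's bias-based variant) can be sketched as follows. For each expert it multiplies the fraction of tokens dispatched to it by the router's mean probability for it; the product is minimized when traffic is uniform. The function name and toy data here are illustrative.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, n_experts):
    """Switch-Transformer-style auxiliary loss (a sketch).

    router_probs: (tokens, n_experts) softmax outputs of the router.
    expert_assignment: (tokens,) top-1 expert index chosen per token.
    Returns n_experts * sum_i f_i * P_i, where f_i is the dispatch
    fraction and P_i the mean router probability for expert i.
    The minimum value 1.0 is reached under perfectly uniform routing.
    """
    f = np.bincount(expert_assignment, minlength=n_experts) / len(expert_assignment)
    P = router_probs.mean(axis=0)
    return n_experts * float(f @ P)

n, tokens = 4, 8

# Balanced case: uniform probabilities, round-robin assignment.
probs = np.full((tokens, n), 1 / n)
assign = np.arange(tokens) % n
balanced = load_balancing_loss(probs, assign, n)        # 1.0

# Collapsed case: the router sends every token to expert 0.
probs_bad = np.zeros((tokens, n)); probs_bad[:, 0] = 1.0
assign_bad = np.zeros(tokens, dtype=int)
collapsed = load_balancing_loss(probs_bad, assign_bad, n)  # 4.0
```

Scaling this loss into the training objective pushes the router away from the collapsed state, keeping all experts in use.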

Which modern LLMs are MoE?

DeepSeek-V3 and DeepSeek-R1, Mixtral 8x7B and 8x22B, Qwen 2.5-MoE, Databricks DBRX, and reportedly GPT-4 and Gemini Ultra. Frontier-scale models have mostly moved toward sparse MoE designs.

Sources

  1. Shazeer et al. — Outrageously Large Neural Networks (Sparsely-Gated MoE) — accessed 2026-04-20
  2. Fedus et al. — Switch Transformers — accessed 2026-04-20
  3. DeepSeek-V3 Technical Report — accessed 2026-04-20
  4. Hugging Face — Mixture of Experts explained — accessed 2026-04-20