Sparse Mixture-of-Experts (MoE) Gating is the routing mechanism that selects which expert networks process each token in an MoE model — enabling scaling to trillions of parameters while keeping per-token computation constant.
MoE Architecture Overview
- Replace each FFN layer with E parallel expert networks.
- For each token, a gating network selects the top-K experts.
- Only K experts run for each token; the rest are inactive.
- Parameter count scales with E; compute scales with K (not E).
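The scaling split in the last bullet can be checked with back-of-envelope arithmetic. The sizes below are illustrative (Mixtral-like widths, gated FFN with three weight matrices), not taken from any specific model card:

```python
# Per-layer FFN cost for a dense model vs. an MoE layer (illustrative sizes).
d_model, d_ff = 4096, 14336               # hidden and FFN widths (Mixtral-like)
E, K = 8, 2                               # experts total / experts active per token

ffn_params = 3 * d_model * d_ff           # gated FFN: three weight matrices

moe_params = E * ffn_params               # parameters scale with E ...
moe_flops_per_token = K * 2 * ffn_params  # ... but per-token FLOPs scale with K
dense_flops_per_token = 2 * ffn_params
```

With E=8 and K=2, the MoE layer stores 8x the parameters of the dense FFN but spends only 2x the per-token compute.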
Gating Mechanism
$$G(x) = \operatorname{Softmax}(\operatorname{TopK}(x \cdot W_g))$$
- $W_g$: learned routing weight matrix.
- Top-K: Keep only the K highest scores, zero the rest.
- Weighted sum of selected expert outputs.
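The gate above can be sketched in a few lines of NumPy (shapes and names are illustrative): keep the K highest router logits per token, mask the rest to $-\infty$, and softmax so the surviving weights sum to 1.

```python
import numpy as np

def topk_gate(x, W_g, k=2):
    """G(x) = Softmax(TopK(x @ W_g)): per-token expert weights, zero outside top-k."""
    logits = x @ W_g                                  # (tokens, E) routing scores
    kth = np.sort(logits, axis=-1)[:, -k][:, None]    # k-th largest score per token
    masked = np.where(logits >= kth, logits, -np.inf) # drop all but top-k (ties assumed absent)
    z = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)          # exp(-inf) = 0 for masked experts

rng = np.random.default_rng(0)
gates = topk_gate(rng.standard_normal((4, 8)), rng.standard_normal((8, 16)))
```

Each row of `gates` has exactly K nonzero entries that sum to 1; those weights are then used in the weighted sum of the selected experts' outputs.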
Load Balancing Problem
- Without regularization, the router collapses — all tokens go to a few popular experts.
- Other experts get no gradient signal and become useless.
- Solution: Auxiliary Load Balancing Loss — penalize imbalanced routing:
$L_{aux} = \alpha \sum_e f_e \cdot p_e$
where $f_e$ is the fraction of tokens routed to expert $e$ and $p_e$ is the mean gating probability assigned to expert $e$. The loss is minimized when routing is uniform across experts; the Switch Transformer formulation additionally scales by $E$ so the balanced value is independent of expert count.
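A minimal sketch of this loss exactly as written above (function and variable names are illustrative), comparing perfectly balanced routing against a collapsed router:

```python
import numpy as np

def load_balancing_loss(gates, topk_mask, alpha=0.01):
    """L_aux = alpha * sum_e f_e * p_e.

    gates:     (tokens, E) full softmax routing probabilities
    topk_mask: (tokens, E) 1 where expert e was selected for that token
    """
    f = topk_mask.mean(axis=0)   # f_e: fraction of tokens routed to expert e
    p = gates.mean(axis=0)       # p_e: mean gating probability for expert e
    return alpha * np.sum(f * p)

T, E = 8, 4
balanced_gates = np.full((T, E), 1 / E)        # uniform probabilities
balanced_mask = np.eye(E)[np.arange(T) % E]    # tokens cycle across experts
collapsed = np.eye(E)[np.zeros(T, dtype=int)]  # every token -> expert 0

l_bal = load_balancing_loss(balanced_gates, balanced_mask)  # minimum: alpha / E
l_col = load_balancing_loss(collapsed, collapsed)           # maximum: alpha
```

Because the collapsed router scores strictly higher, gradient descent on $L_{aux}$ pushes routing toward the uniform distribution.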
Expert Capacity
- Each expert has a fixed capacity (max tokens per batch).
- Overflow tokens are dropped; their layer output is just the residual-connection passthrough.
- Capacity factor CF scales the per-expert budget: capacity = CF · (tokens · K / E). CF=1.0 leaves no slack; CF=1.25 gives 25% headroom.
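The capacity rule can be sketched as follows (a hypothetical helper with first-come-first-served slot filling; real systems do this with batched tensor ops on accelerators):

```python
import numpy as np

def route_with_capacity(expert_ids, n_experts, capacity_factor=1.25, k=1):
    """Drop tokens that overflow each expert's fixed slot count.

    capacity = ceil(capacity_factor * tokens * k / n_experts)
    Returns a boolean keep-mask; dropped tokens fall back to the
    residual connection unchanged.
    """
    tokens = len(expert_ids)
    capacity = int(np.ceil(capacity_factor * tokens * k / n_experts))
    used = np.zeros(n_experts, dtype=int)
    keep = np.zeros(tokens, dtype=bool)
    for t, e in enumerate(expert_ids):   # fill slots in token order
        if used[e] < capacity:
            used[e] += 1
            keep[t] = True
    return keep

# 8 tokens, 4 experts, CF=1.0 -> capacity 2 slots each; expert 0 is oversubscribed
keep = route_with_capacity(np.array([0, 0, 0, 1, 1, 2, 3, 0]), 4, capacity_factor=1.0)
```

With CF=1.0 and four tokens assigned to expert 0, only the first two fit; the other two are dropped, which is exactly the slack a CF above 1.0 buys back.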
MoE Routing Variants
- Top-1 Routing (Switch Transformer): a single expert per token; simplest and cheapest, but more prone to load imbalance.
- Top-2 Routing (GShard, Mixtral): Two experts — better quality, manageable overhead.
- Expert Choice (Zhou et al., 2022): experts choose tokens rather than tokens choosing experts, giving perfect load balance by construction (though a token may be picked by several experts, or by none).
- Soft Routing: All experts compute, weighted combination (expensive but no dropped tokens).
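Expert Choice inverts the selection step: each expert picks its top-C tokens by routing score, so every expert is filled exactly to capacity. A minimal sketch (illustrative names and shapes):

```python
import numpy as np

def expert_choice(scores, capacity):
    """Each expert (column) picks its top-`capacity` tokens (rows) by score."""
    # Sort token indices per expert by descending score, keep the first `capacity`.
    return np.argsort(-scores, axis=0)[:capacity]   # (capacity, E) token indices

rng = np.random.default_rng(0)
scores = rng.standard_normal((10, 4))   # (tokens, experts) router scores
picks = expert_choice(scores, capacity=3)
# Every expert processes exactly 3 tokens: perfect load balance by construction.
# The trade-off: a token may be picked by several experts, or by no expert at all.
```

No auxiliary loss is needed here, since balance is guaranteed by the selection rule itself rather than encouraged by regularization.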
Production MoE Models
| Model | Experts | Active/Token | Total Params |
|-------|---------|-------------|----------|
| Mixtral 8x7B | 8 | 2 | 47B |
| DeepSeek-V3 | 256 | 8 | 671B |
| GPT-4 (estimated) | ~16 | 2 | ~1.8T |
MoE gating is the key to scaling LLMs beyond the memory/compute frontier: it decouples parameter count from inference cost, so trillion-parameter models can run at the per-token cost of a dense model an order of magnitude smaller (Mixtral 8x7B, for instance, stores 47B parameters but activates roughly 13B per token).