Mixture of Experts (MoE)

Mixture of Experts (MoE) is an architecture in which a model contains multiple specialized sub-networks ("experts") but activates only a subset of them for each input. This allows a much larger total parameter count at a per-token compute cost similar to that of a smaller dense model. MoE powers models such as Mixtral and, reportedly, GPT-4.

What Is Mixture of Experts?

Why MoE Matters

MoE Architecture

Standard Transformer:

Input → Attention → FFN → Output
                    ↑
                 Dense FFN
                 (all parameters used)

MoE Transformer:

Input → Attention → Router → Output
                       ↓
            ┌─────────────────────────┐
            │ Expert 1 │ Expert 2 │...│ Expert N
            └─────────────────────────┘
                  ↓ (select top-k)
            Weighted sum of selected experts

Components:

Router Mechanism

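# Simplified router logic (NumPy sketch)
import numpy as np

def route(x, expert_weights, experts, k=2):
    # x: input token embedding, shape [d_model]
    # expert_weights: learned routing matrix, shape [d_model, num_experts]
    # experts: list of callables, one per expert

    # Compute routing scores with a numerically stable softmax
    logits = x @ expert_weights  # [num_experts]
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()

    # Select the indices of the top-k experts
    top_k_experts = np.argsort(scores)[-k:]

    # Compute weighted output over the selected experts only
    # (some models renormalize the k selected scores so they sum to 1)
    output = sum(
        scores[i] * experts[i](x)
        for i in top_k_experts
    )
    return output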
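
The single-token router above can be extended to a batch of tokens. The following is a minimal NumPy sketch, not a reference implementation: the names `moe_forward`, `router_w`, and `expert_w` are illustrative, each expert is assumed to be a single linear map rather than a full FFN, and the gate weights are assumed to be a softmax over only the k selected logits (the formulation described for Mixtral).

```python
import numpy as np

def moe_forward(tokens, router_w, expert_w, k=2):
    """Route a batch of tokens through the top-k of several linear experts.

    tokens:   [num_tokens, d_model]
    router_w: [d_model, num_experts]
    expert_w: [num_experts, d_model, d_model] (one linear map per expert, an assumption)
    """
    logits = tokens @ router_w                      # [num_tokens, num_experts]
    top_idx = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k largest logits
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)

    # Softmax over the selected logits only, so the k gate weights sum to 1.
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates /= gates.sum(axis=-1, keepdims=True)      # [num_tokens, k]

    out = np.zeros_like(tokens)
    for e in range(expert_w.shape[0]):
        token_ids, slot = np.nonzero(top_idx == e)  # tokens that selected expert e
        if token_ids.size == 0:
            continue                                # unused expert: no compute spent
        expert_out = tokens[token_ids] @ expert_w[e]
        out[token_ids] += gates[token_ids, slot][:, None] * expert_out
    return out, top_idx, gates

# Usage with random data: 5 tokens, d_model=4, 8 experts, top-2 routing.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))
router_w = rng.normal(size=(4, 8))
expert_w = rng.normal(size=(8, 4, 4))
out, top_idx, gates = moe_forward(tokens, router_w, expert_w, k=2)
print(out.shape, top_idx.shape)
```

Looping over experts rather than tokens means each expert processes all of its assigned tokens in one matrix multiply, which is roughly how batched MoE layers group work.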

MoE Models Comparison

Model           | Total Params | Active Params | Experts | Top-k
----------------|--------------|---------------|---------|------
Mixtral 8x7B    | 47B          | 13B           | 8       | 2
Mixtral 8x22B   | 141B         | 39B           | 8       | 2
Switch-C        | 1.6T         | ~6B           | 2048    | 1
GPT-4 (rumored) | ~1.8T        | ~280B         | 16      | 2
DeepSeek-V2     | 236B         | 21B           | 160     | 6
Grok-1          | 314B         | ~86B          | 8       | 2
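
To make the table easier to compare, the fraction of parameters active per token can be computed directly from its two parameter columns. The sketch below only restates the table's figures (in billions); the helper name `active_fraction` is illustrative.

```python
# (total, active) parameter counts in billions, copied from the table above.
models = {
    "Mixtral 8x7B": (47, 13),
    "Mixtral 8x22B": (141, 39),
    "Switch-C": (1600, 6),
    "GPT-4 (rumored)": (1800, 280),
    "DeepSeek-V2": (236, 21),
    "Grok-1": (314, 86),
}

def active_fraction(total_b, active_b):
    """Fraction of all parameters that are used for a single token."""
    return active_b / total_b

for name, (total_b, active_b) in models.items():
    print(f"{name:16s} {active_fraction(total_b, active_b):6.1%}")
```

Note that the active fraction is not simply k divided by the number of experts. In Mixtral 8x7B, for example, 2 of 8 experts would suggest 25%, but attention and embedding parameters are shared by every token, so the table's figures give a slightly higher ratio.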

MoE Benefits

Computational Efficiency:

Specialization:

MoE Challenges

Memory Overhead:

Memory = All experts loaded (even if only k are used per token)
8x7B model (~47B total params): ~90GB for all weights in FP16
vs. 7B dense: ~14GB in FP16

Expert parallelism helps by distributing experts across GPUs
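
The figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes 16-bit weights by default and the roughly 47B total parameters listed for Mixtral 8x7B. It counts weights only, ignoring the KV cache and activations, and `weight_memory_gb` is an illustrative helper name.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, dtype="fp16"):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

moe_total_params = 47e9   # all experts must be resident, not just the active ones
dense_params = 7e9

print(weight_memory_gb(moe_total_params))          # 94.0
print(weight_memory_gb(dense_params))              # 14.0
print(weight_memory_gb(moe_total_params, "int4"))  # 23.5
```

The 94 GB result is in line with the ~90GB figure quoted above, and the last line shows why quantization is a common way to fit MoE weights on fewer devices.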

Training Complexity:

Routing Noise:

Inference Challenges:

Serving MoE Models

Expert Parallelism:

GPU 0: Experts 0-1
GPU 1: Experts 2-3
GPU 2: Experts 4-5
GPU 3: Experts 6-7

All-to-all communication sends each token to the GPU holding its selected experts and returns the results
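
The placement above can be sketched as a mapping from expert id to GPU, followed by grouping tokens by destination. This is a single-process illustration under stated assumptions, not a distributed implementation: `expert_to_gpu` and `dispatch` are hypothetical helpers, the number of experts is assumed to divide evenly across GPUs, and a real system would perform the exchange with a collective all-to-all operation.

```python
from collections import defaultdict

def expert_to_gpu(expert_id, num_experts=8, num_gpus=4):
    """Contiguous block placement: experts 0-1 on GPU 0, 2-3 on GPU 1, and so on."""
    experts_per_gpu = num_experts // num_gpus
    return expert_id // experts_per_gpu

def dispatch(token_expert_choices, num_experts=8, num_gpus=4):
    """Group (token_id, expert_id) pairs by the GPU that hosts the expert.

    token_expert_choices: list where entry t holds the expert ids chosen for token t.
    Returns {gpu_id: [(token_id, expert_id), ...]}.
    """
    per_gpu = defaultdict(list)
    for token_id, experts in enumerate(token_expert_choices):
        for expert_id in experts:
            gpu = expert_to_gpu(expert_id, num_experts, num_gpus)
            per_gpu[gpu].append((token_id, expert_id))
    return dict(per_gpu)

# Three tokens, each routed to two experts (top-2).
choices = [[0, 5], [1, 2], [6, 7]]
plan = dispatch(choices)
for gpu in sorted(plan):
    print(f"GPU {gpu}: {plan[gpu]}")
```

In this example token 0 is sent to two different GPUs while token 2 stays on one, which shows why communication volume and per-GPU load depend on the routing decisions.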

vLLM MoE Support:

MoE architecture is a key technique for scaling frontier AI models. By activating only a fraction of parameters per input, MoE enables models with up to trillions of parameters while keeping per-token inference compute manageable, although all experts must still be held in memory.
