Dropped tokens are tokens discarded in sparse Mixture of Experts (MoE) models when their selected expert has exceeded its processing capacity buffer, causing information loss, training instability, and inconsistent outputs. They are the most visible failure mode of discrete top-k routing in MoE architectures, and they have driven the development of alternative routing strategies (expert choice, Soft MoE, capacity-factor tuning) that eliminate or minimize this pathological behavior.
What Are Dropped Tokens?
- Definition: In top-k MoE routing, each token selects its preferred experts, but if an expert receives more tokens than its capacity buffer allows, the excess tokens are "dropped": their representations pass through only the residual connection, bypassing the expert FFN entirely (see the sketch after this list).
- Capacity Factor: The buffer multiplier (typically 1.0–1.5) controlling how many tokens each expert can accept: capacity = (tokens_per_batch × k / num_experts) × capacity_factor. With k = 1 and a capacity factor of 1.0, each expert can handle exactly its even share of (tokens_per_batch / num_experts) tokens, so any routing imbalance causes drops.
- Information Loss: Dropped tokens receive no expert processing — in tasks where every token matters (translation, code generation), dropped tokens introduce systematic errors.
- Non-Deterministic Behavior: Identical inputs can yield different outputs depending on batch composition, because which tokens get dropped is determined by the batch-wide routing distribution, not by the token alone.
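The mechanics are easiest to see in code. Below is a minimal sketch of capacity-based dropping under top-1 routing in PyTorch; the function and variable names (`route_top1`, `kept`) are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def route_top1(logits: torch.Tensor, capacity_factor: float = 1.0):
    """Top-1 routing with a capacity buffer; overflow tokens are dropped.

    logits: [num_tokens, num_experts] router scores (illustrative shapes).
    Returns each token's chosen expert and a boolean mask of kept tokens.
    """
    num_tokens, num_experts = logits.shape
    capacity = int(capacity_factor * num_tokens / num_experts)

    expert_idx = logits.argmax(dim=-1)              # each token picks its top expert
    one_hot = F.one_hot(expert_idx, num_experts)    # [num_tokens, num_experts]
    # Queue position of each token within its chosen expert (0, 1, 2, ...).
    position = ((one_hot.cumsum(dim=0) - 1) * one_hot).sum(dim=-1)
    kept = position < capacity                      # tokens past capacity are dropped
    return expert_idx, kept
```

Dropped tokens then flow through only the residual path, e.g. `y = x + kept.unsqueeze(-1) * expert_out`, which is exactly the information loss described above.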
Why Dropped Tokens Are a Problem
- Quality Degradation: Token drop rates of 5–15% are common in poorly tuned MoE training — this means 5–15% of tokens in every forward pass receive reduced processing, systematically degrading model quality.
- Training-Inference Mismatch: Drop rates during training differ from inference (different batch sizes) — the model learns to compensate for drops that don't occur at inference, or encounters drops at inference it never saw during training.
- Gradient Noise: Tokens dropped in the forward pass still generate gradients through the residual — but these gradients don't reflect the expert processing, introducing noise into the router's gradient signal.
- Unpredictable Quality: Drop rates vary with input distribution — batches with unusual token distributions experience higher drops, creating unpredictable quality variation in production.
- Fairness Concerns: Common tokens (that match popular expert specializations) are rarely dropped, while rare or out-of-distribution tokens are frequently dropped — systematically under-serving uncommon inputs.
Mitigation Strategies
Capacity Factor Tuning:
- Increase capacity factor from 1.0 to 1.5 or 2.0 — allows each expert to accept more tokens.
- Trade-off: higher capacity factors increase memory usage and reduce efficiency benefits of sparsity.
- Monitoring: track the actual drop rate during training and raise the capacity factor until drops fall below 1% (a monitoring sketch follows this list).
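A sketch of that monitoring loop is below; `drop_rate` is a hypothetical helper, the synthetic logits stand in for real router outputs, and the 1% target is the heuristic from this list.

```python
import torch
import torch.nn.functional as F

def drop_rate(logits: torch.Tensor, capacity_factor: float) -> float:
    """Fraction of tokens dropped under top-1 routing (illustrative helper)."""
    num_tokens, num_experts = logits.shape
    capacity = int(capacity_factor * num_tokens / num_experts)
    one_hot = F.one_hot(logits.argmax(dim=-1), num_experts)
    position = ((one_hot.cumsum(dim=0) - 1) * one_hot).sum(dim=-1)
    return (position >= capacity).float().mean().item()

# Sweep the capacity factor upward until drops fall below the 1% target.
logits = torch.randn(4096, 8)          # synthetic router scores for illustration
for cf in (1.0, 1.25, 1.5, 2.0):
    rate = drop_rate(logits, cf)
    print(f"capacity_factor={cf}: drop rate {rate:.2%}")
    if rate < 0.01:
        break
```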
Load Balancing Loss:
- Auxiliary loss encouraging uniform expert utilization reduces the routing imbalance that causes drops.
- Effective, but it doesn't guarantee zero drops: extreme batches can still overflow popular experts (a sketch of the loss follows this list).
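One common formulation is the Switch-Transformer-style auxiliary loss, num_experts × Σᵢ fᵢ·Pᵢ. A minimal sketch follows; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(logits: torch.Tensor) -> torch.Tensor:
    """Switch-style auxiliary loss: num_experts * sum_i(f_i * P_i).

    f_i: fraction of tokens whose top-1 choice is expert i (not differentiable).
    P_i: mean router probability mass on expert i (carries the gradient).
    Reaches its minimum of 1.0 when routing is perfectly uniform.
    """
    num_experts = logits.shape[-1]
    probs = logits.softmax(dim=-1)                        # [num_tokens, num_experts]
    f = F.one_hot(probs.argmax(dim=-1), num_experts).float().mean(dim=0)
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)
```

In practice this is added to the task loss with a small coefficient, typically on the order of 0.01.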
Expert Choice Routing:
- Invert routing direction — experts select tokens instead of tokens selecting experts.
- Each expert processes a fixed number of tokens (its capacity), so drops are eliminated by construction.
- Trade-off: the number of experts per token becomes variable; a token may be selected by several experts or by none (see the sketch below).
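A minimal sketch of the inverted selection, loosely following the expert-choice formulation; names and shapes are illustrative.

```python
import torch

def expert_choice_route(logits: torch.Tensor, capacity: int):
    """Expert-choice routing: each expert selects its top-`capacity` tokens.

    logits: [num_tokens, num_experts] router scores (illustrative shapes).
    Returns per-expert token indices and combine weights. No token is
    dropped for overflow, but a token may be picked by many experts or none.
    """
    scores = logits.softmax(dim=-1)                    # token-expert affinities
    weights, token_idx = scores.topk(capacity, dim=0)  # [capacity, num_experts]
    return token_idx, weights
```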
Soft MoE:
- Replace discrete routing with continuous soft weights — every token contributes to every expert.
- No discrete assignment means no capacity limits and no drops.
- Trade-off: loses the per-token sparsity benefit at inference, since every token now touches every expert (a sketch follows).
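A sketch in the spirit of Soft MoE (Puigcerver et al.), with one slot per expert for brevity; `phi` and the shapes are illustrative assumptions.

```python
import torch

def soft_moe(x: torch.Tensor, phi: torch.Tensor, experts):
    """Soft-MoE-style layer: soft dispatch/combine instead of hard routing.

    x:       [num_tokens, d] token representations.
    phi:     [d, num_slots] learned slot parameters (illustrative name).
    experts: one expert module per slot, for simplicity.
    """
    logits = x @ phi                        # [num_tokens, num_slots]
    dispatch = logits.softmax(dim=0)        # each slot is a weighted mix of tokens
    combine = logits.softmax(dim=-1)        # each token is a weighted mix of slots
    slots = dispatch.t() @ x                # [num_slots, d] slot inputs
    outs = torch.stack([f(s) for f, s in zip(experts, slots)])
    return combine @ outs                   # [num_tokens, d]

# Every token contributes to every slot, so there is no capacity to overflow.
y = soft_moe(torch.randn(16, 64), torch.randn(64, 4),
             [torch.nn.Linear(64, 64) for _ in range(4)])
```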
Dropped Token Impact Analysis
| Drop Rate | Quality Impact | Cause | Action |
|-----------|---------------|-------|--------|
| <1% | Negligible | Normal routing variance | Acceptable |
| 1–5% | Measurable degradation | Moderate imbalance | Increase capacity factor |
| 5–15% | Significant quality loss | Poor load balance | Add/tune balance loss |
| >15% | Training failure | Router collapse | Switch routing strategy |
Dropped tokens are the canary in the MoE coal mine: the most visible symptom of routing pathology, signaling expert underutilization, load imbalance, and wasted model capacity. They have driven the evolution from naive top-k routing toward more sophisticated routing mechanisms that achieve sparse computation without sacrificing tokens.