Tech Study Guide
MoE Inference
Mixture-of-Experts inference with active vs total parameters, expert routing, expert parallelism, all-to-all communication, load balance, capacity, KV cache, and serving tradeoffs.
Mixture-of-Experts models contain many expert feed-forward networks but route each token to only a subset. This can increase total parameter count without activating every parameter per token, but it adds routing, load balancing, communication, and serving complexity.
Command Examples
model:
total_parameters:
active_parameters_per_token:
num_experts:
experts_per_token:
layers_with_experts:
Example output and meaning:
| Command | Example output | What it does |
|---|---|---|
| Captured fields | Named fields with concrete values: IDs, scores, tokens, routes, states, timestamps, or errors. | Turns a capture template into evidence you can compare across runs. |
Always distinguish total parameters from active parameters. Total parameters drive storage and loading. Active parameters drive per-token compute.
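The distinction can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a simplified model in which every MoE layer holds equally sized experts and everything else (attention, embeddings, norms, routers) is lumped into one always-active total. The `MoEConfig` class, its field names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # All fields are illustrative assumptions, not values from a real model.
    shared_params: int        # attention, embeddings, norms, routers (always active)
    params_per_expert: int    # parameters in one expert MLP
    num_experts: int          # experts per MoE layer
    experts_per_token: int    # top-k experts selected per token
    layers_with_experts: int  # number of layers that contain experts


def total_parameters(cfg: MoEConfig) -> int:
    """Every expert must be stored and loaded, whether or not it is used."""
    expert_params = cfg.layers_with_experts * cfg.num_experts * cfg.params_per_expert
    return cfg.shared_params + expert_params


def active_parameters_per_token(cfg: MoEConfig) -> int:
    """Only the top-k experts in each MoE layer run for a given token."""
    expert_params = cfg.layers_with_experts * cfg.experts_per_token * cfg.params_per_expert
    return cfg.shared_params + expert_params


cfg = MoEConfig(
    shared_params=2_000_000_000,
    params_per_expert=150_000_000,
    num_experts=8,
    experts_per_token=2,
    layers_with_experts=32,
)
total = total_parameters(cfg)
active = active_parameters_per_token(cfg)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

With these made-up numbers the model stores 40.4B parameters but touches only 11.6B per token, so memory planning and compute planning start from different figures.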
MoE Concepts
| Concept | Serving Meaning |
|---|---|
| Router/gate | Chooses experts per token. |
| Expert | Usually an MLP block specialized by training. |
| Top-k experts | Number of experts activated per token. |
| Active parameters | Parameters actually used to process one token: shared layers plus the selected experts. |
| Total parameters | Full model storage and memory footprint. |
| Expert parallelism | Splits experts across devices. |
| All-to-all | Communication pattern for dispatching tokens to experts. |
| Load imbalance | Hot experts cause stalls, or dropped tokens when an expert's capacity is exceeded. |
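The router, top-k, capacity, and load-imbalance rows can be tied together in one small example. This is a hedged sketch in plain Python, assuming a softmax router, gate weights renormalized over the selected experts, and a fixed per-expert capacity with first-come-first-served dropping. Real serving stacks differ in all three choices; the function names and the logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k, capacity):
    """Assign each token to its top-k experts, respecting a per-expert capacity.

    router_logits: one list of per-expert logits per token.
    Returns (assignments, dropped) where assignments[expert] is a list of
    (token_index, gate_weight) and dropped is a list of (token_index, expert).
    """
    num_experts = len(router_logits[0])
    assignments = {e: [] for e in range(num_experts)}
    dropped = []
    for token_idx, logits in enumerate(router_logits):
        probs = softmax(logits)
        # Pick the k experts with the highest router probability.
        top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:k]
        # Renormalize gate weights over the selected experts (one common choice).
        norm = sum(probs[e] for e in top)
        for e in top:
            if len(assignments[e]) < capacity:
                assignments[e].append((token_idx, probs[e] / norm))
            else:
                # Expert is full: this token-expert pair is dropped.
                dropped.append((token_idx, e))
    return assignments, dropped


# Four tokens, three experts; expert 0 is "hot" for every token.
logits = [
    [3.0, 1.0, 0.0],
    [2.5, 0.0, 1.0],
    [4.0, 2.0, 1.0],
    [3.5, 0.5, 1.5],
]
assignments, dropped = route_top_k(logits, k=2, capacity=3)
load = {e: len(tokens) for e, tokens in assignments.items()}
print("load per expert:", load)
print("dropped (token, expert):", dropped)
```

Here every token prefers expert 0, so it fills its capacity of 3 and the fourth token's slot on that expert is dropped. Under expert parallelism, each entry in `assignments` would also imply sending that token's activations to whichever device hosts the expert, which is where the all-to-all traffic comes from.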
Performance Tradeoffs
| Benefit | Cost |
|---|---|
| More capacity without activating all weights. | More complex routing and communication. |
| Potentially strong quality per active FLOP. | Expert imbalance can hurt throughput. |
| Experts can specialize. | Debugging token routing is harder. |
| Sparse activation lowers compute. | Total weights still need storage/loading. |
Attention layers in MoE models still use a KV cache. Expert routing changes MLP compute and communication, not the fundamental need to cache attention keys and values during autoregressive decode.
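A rough size estimate shows why: the usual KV-cache formula has no term for experts. The sketch below assumes standard multi-head or grouped-query attention with 2-byte values; architectures that compress or restructure the cache would need a different formula. The function name and all shapes are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Approximate KV-cache size for standard multi-head or grouped-query attention.

    The factor of 2 covers keys and values. No expert count appears:
    routing affects the MLP blocks, not the attention cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical shapes; the same numbers apply to a dense or an MoE model
# with identical attention dimensions.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=4096, batch_size=16)
print(f"{size / 2**30:.1f} GiB")
```

For these assumed shapes the cache is 8.0 GiB regardless of how many experts the model has or how many are active per token.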
Operational Checks
| Check | Why |
|---|---|
| Active vs total params | Capacity and cost planning. |
| Expert utilization | Detect hot or unused experts. |
| All-to-all latency | Interconnect bottleneck signal. |
| Batch shape | Token routing imbalance can worsen with small batches. |
| Quantization compatibility | Experts and router may have different sensitivity. |
| Fallback route | MoE serving issues can require dense-model fallback. |
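The expert-utilization check can be reduced to a couple of numbers per layer. This is a minimal sketch, assuming the runtime can log the expert index chosen for each (token, expert) slot. The `expert_utilization` helper and the max-over-mean imbalance ratio are illustrative choices, not a standard metric from any particular serving framework.

```python
from collections import Counter


def expert_utilization(expert_ids, num_experts):
    """Summarize how evenly routed token slots are spread across experts.

    expert_ids: flat list of expert indices, one entry per (token, expert) slot.
    """
    counts = Counter(expert_ids)
    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean_load = sum(loads) / num_experts
    return {
        "loads": loads,
        # 1.0 means perfectly balanced; larger means a hotter busiest expert.
        "imbalance": max(loads) / mean_load if mean_load else 0.0,
        "unused": [e for e, n in enumerate(loads) if n == 0],
    }


# Hypothetical routing log for 4 experts: expert 1 is hot, expert 3 is idle.
stats = expert_utilization([1, 1, 0, 1, 2, 1, 1, 0], num_experts=4)
print(stats)
```

In this made-up log the busiest expert carries 2.5 times the mean load and one expert receives nothing. Tracking the same ratio across batch sizes is one way to see whether small batches make the skew worse.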
Incident Flow
- Confirm whether latency is attention, expert MLP, or communication bound.
- Compare expert utilization and all-to-all timing.
- Check batch size, sequence length mix, and routing skew.
- Compare dense baseline if available.
- Check quantization or runtime changes near the incident.
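The first two steps of this flow amount to comparing a timing breakdown against a known-good one. The following is a small sketch, assuming a profiler can attribute per-step time to attention, expert MLP, and all-to-all phases. The phase names and millisecond values are hypothetical.

```python
def dominant_phase(timings_ms):
    """Return (phase, share) for the phase with the largest share of step time.

    timings_ms maps a phase name to measured milliseconds per decode step.
    """
    total = sum(timings_ms.values())
    if total <= 0:
        raise ValueError("timings must sum to a positive value")
    phase = max(timings_ms, key=timings_ms.get)
    return phase, timings_ms[phase] / total


# Hypothetical per-step timings from a profiler; the phase names are made up.
baseline = {"attention": 4.0, "expert_mlp": 5.0, "all_to_all": 1.0}
incident = {"attention": 4.0, "expert_mlp": 5.5, "all_to_all": 10.5}

print(dominant_phase(baseline))
print(dominant_phase(incident))
```

In this example the baseline is expert-MLP bound while the incident is dominated by all-to-all time, which would point the investigation toward the interconnect or routing skew rather than toward attention.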
Study Cards
Why separate active and total parameters for MoE?
Total parameters affect storage/loading, while active parameters affect per-token compute.
What makes MoE inference operationally harder?
Expert routing, load imbalance, expert parallelism, and all-to-all communication add bottlenecks.
Does MoE remove KV-cache pressure?
No. MoE changes expert MLP compute, but attention still needs KV cache during autoregressive inference.