The Routing Penalty
Mixture-of-Experts (MoE) models achieve high quality with fewer active parameters by routing each token to a subset of experts. The promise: quality of a large model, cost of a small one. The qs inequality reveals the penalty the promise conceals.
Expert routing fragments microbatches. A dense model processes a batch of tokens through one set of weights — every token sees the same parameters, enabling maximal weight reuse across the batch. An MoE model splits the batch across experts, and each expert sees only its fraction. The weight reuse drops proportionally. This is the first penalty: fragmentation destroys the throughput advantage of batching.
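The fragmentation penalty can be made concrete with back-of-envelope arithmetic. The sketch below uses assumed numbers (a 4,096-token microbatch, 256 routed experts, top-8 routing — the expert counts are DeepSeek-V3-shaped, the batch size is illustrative) and assumes uniform routing, the best case for the MoE:

```python
# Illustrative sketch: how routing fragments a microbatch across experts
# and shrinks per-expert weight reuse. All numbers are assumptions.

batch_tokens = 4096          # tokens in one microbatch (assumed)
n_experts = 256              # routed experts (DeepSeek-V3-shaped)
top_k = 8                    # experts activated per token (assumed)

# Dense FFN: every token multiplies against the same weights, so each
# weight load from HBM is amortized over the entire batch.
dense_reuse = batch_tokens

# MoE: the batch fans out into top_k * batch_tokens expert assignments,
# spread (at best, uniformly) across n_experts experts. Each expert's
# weights are loaded but reused only over its local slice of tokens.
tokens_per_expert = batch_tokens * top_k / n_experts

print(f"dense reuse:      {dense_reuse} tokens per weight load")
print(f"MoE reuse/expert: {tokens_per_expert:.0f} tokens per weight load")
print(f"fragmentation:    {dense_reuse / tokens_per_expert:.0f}x less reuse")
```

With these assumed numbers, each expert sees 128 tokens instead of 4,096 — a 32x drop in weight reuse, before any routing imbalance makes it worse.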
The second penalty: the massive expert pools — DeepSeek-V3 has 256 experts — consume HBM capacity that would otherwise hold KV cache for long-context inference. At 128k context length, the KV cache for a quality-matched dense model fits comfortably. For the MoE model, the expert weights and the KV cache compete for the same memory. Something has to give.
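The memory competition can likewise be sketched with rough arithmetic. The figures below are assumptions for illustration, not the source's numbers: a 61-layer model with 256 routed experts at a hypothetical 44M parameters per expert in FP8, against the KV cache of a single 128k-token sequence with an assumed MHA-style layout (8 KV heads of dimension 128 in FP16; note DeepSeek-V3's actual MLA scheme compresses this substantially):

```python
# Back-of-envelope sketch of HBM competition. Every number here is an
# assumption chosen for illustration, not a source-confirmed spec.

GB = 1024**3

n_layers = 61                  # assumed layer count
n_experts = 256                # routed experts per MoE layer
expert_params = 44_000_000     # params per expert (hypothetical)
bytes_per_param = 1            # FP8 weights

expert_bytes = n_layers * n_experts * expert_params * bytes_per_param

# KV cache for one 128k-token sequence, assumed MHA-style layout:
# K and V, per layer, per token, per KV head, per head dim, in FP16.
seq_len = 128 * 1024
n_kv_heads = 8
head_dim = 128
kv_bytes = 2 * n_layers * seq_len * n_kv_heads * head_dim * 2

print(f"expert weights resident in HBM: {expert_bytes / GB:.0f} GiB")
print(f"KV cache per 128k sequence:     {kv_bytes / GB:.1f} GiB")
```

Under these assumptions the expert pool alone occupies hundreds of GiB, and every additional concurrent 128k sequence adds tens more — on a fixed HBM budget, each sequence admitted to the batch competes directly with the resident experts, capping batch size and thus throughput.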
The qs inequality formalizes this: DeepSeek-V3 at 128k context suffers a 4.5x throughput disadvantage versus a quality-matched dense model. Not because the model is bad — because the routing creates a structural tradeoff between model capacity and serving efficiency that worsens with context length.
The through-line: MoE's advantage is measured at training time (fewer FLOPs per unit of quality). Its disadvantage is revealed at inference time (lower throughput per dollar). These are different stages with different cost functions. A model that's cheap to train and expensive to serve optimizes the wrong metric for deployment at scale. The routing that enables quality destroys the batching that enables throughput. Both penalties are structural: they follow from the architecture, not the implementation.