How MoE Models Deploy Differently: A Self-Hosting Guide
Qwen3.5-9B fits in 5.3 GB of VRAM. Its MoE sibling, Qwen3.5-35B-A3B, needs 18.5 GB — yet only activates 3 billion parameters per token. What’s going on?
Mixture-of-Experts models confuse deployers because the “active parameter” count is misleading for hardware planning. You see “3B active” and reach for an 8 GB GPU. Then the model won’t load.
This guide explains the VRAM math, walks through every MoE model you can self-host today, and shows how to make them run on practical hardware. We’ll use the Qwen 3.5 family as a running example because it’s the clearest comparison available: same generation, same training data, same capabilities — only the architecture differs.
TL;DR: Mixture-of-Experts models like DeepSeek V3.1 (685B total, 37B active) and Qwen3.5-35B-A3B (36B total, 3B active) route each token through a fraction of their experts — but every expert’s weights must stay in memory. Qwen3.5-9B (dense) fits in 5.29 GB at Q4_K_M; its MoE sibling needs 18.49 GB at the same quant despite activating fewer params (Unsloth GGUF, 2026). Below Q4_K_M, routing quality degrades — so don’t go lower.
What Does “Mixture of Experts” Mean for Deployment?
Since early 2025, nearly all leading frontier AI models use MoE designs (NVIDIA Blog, 2025). In a dense model every parameter participates in every token. In an MoE model a router network selects a subset of “experts” per token — but all experts must be loaded in memory. That single fact explains almost every deployment surprise you’ll encounter.
Think of it this way. A dense model is a single chef who uses every ingredient in the kitchen for every dish. An MoE model is a restaurant with 256 chefs, each a specialist, but you still need the whole building rented even though only 8 chefs cook at any moment.
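Concretely, the router is just a small gate: a linear projection over the token's hidden state, a softmax, and a top-k pick. Here is a minimal sketch in plain Python (the 8-of-256 configuration mirrors Qwen3.5-35B-A3B; the tiny hidden size and random weights are purely illustrative):

```python
import math
import random

def route(hidden, router_w, k=8):
    """Pick the top-k experts for one token via a softmax gate."""
    # One logit per expert: dot product of the hidden state with that expert's gate row.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in router_w]
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    top_k = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    gate_sum = sum(probs[i] for i in top_k)
    gates = [probs[i] / gate_sum for i in top_k]  # renormalized mixing weights
    return top_k, gates

random.seed(0)
hidden = [random.gauss(0, 1) for _ in range(64)]                      # toy hidden size
router_w = [[random.gauss(0, 1) for _ in range(64)] for _ in range(256)]
experts, gates = route(hidden, router_w)
print(len(experts), max(experts) < 256)  # 8 True
```

Only the eight selected experts run a forward pass for this token, but the router must be free to pick any of the 256, which is why every expert's weights stay resident.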
Qwen 3.5: the perfect comparison
The Qwen 3.5 family gives us the cleanest possible side-by-side. Both variants share the same training data, the same capabilities (code, multilingual, thinking, tool calls, vision), and the same 262K context window. The only difference is architecture:
- Qwen3.5-9B — dense, 9.65B parameters. Every param fires on every token.
- Qwen3.5-35B-A3B — MoE, 35.95B total parameters, 256 experts, 8 active + 1 shared per token. Roughly 3B params fire per token.
Same family. Same generation. Wildly different deployment profiles. That’s what MoE does to your infrastructure.
Notice the pattern. As MoE models grow, the gap between total and active parameters widens dramatically. Kimi K2.5 has over a trillion parameters but only activates 32 billion per token — 3.1% of the total. Your GPU doesn’t care about that ratio. It needs to hold all of them.
VRAM Requirements — Why MoE Models Need More Than You Expect
Qwen3.5-35B-A3B needs 3.5× more VRAM than Qwen3.5-9B at Q4_K_M despite activating fewer parameters per token (Unsloth GGUF, 2026). The reason is simple: VRAM for model weights scales with total parameters, not active ones. Only the KV cache scales with active parameters.
The formula
For any quantized GGUF model, VRAM breaks down into two parts:
- Model weights = total_params × bits_per_weight / 8. All experts included.
- KV cache = proportional to active params × context length × batch size.
In a dense model, total params = active params, so the distinction doesn’t matter. In MoE, it matters enormously. Here’s what that looks like with real numbers from our model data:
| Model | Architecture | Total params | Active params | Q4_K_M size |
|---|---|---|---|---|
| Qwen3.5-9B | Dense | 9.65B | 9.65B | 5.29 GB |
| Qwen3.5-35B-A3B | MoE (256 experts) | 35.95B | ~3B | 18.49 GB |
| DeepSeek V3.1 | MoE (256 experts) | 684.53B | ~37B | 377.56 GB |
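You can sanity-check the formula against the table by inverting it to recover the effective bits per weight. A quick sketch (Q4_K_M mixes several bit-widths across tensors, so the results land near 4 bits rather than exactly on it):

```python
def effective_bits_per_weight(size_gb: float, total_params_billions: float) -> float:
    """Invert size = params * bits / 8 to recover the average bit-width."""
    return size_gb * 8 / total_params_billions

# Sizes and parameter counts from the table above.
models = [
    ("Qwen3.5-9B", 9.65, 5.29),
    ("Qwen3.5-35B-A3B", 35.95, 18.49),
    ("DeepSeek V3.1", 684.53, 377.56),
]
for name, params_b, size_gb in models:
    print(f"{name}: {effective_bits_per_weight(size_gb, params_b):.2f} bits/weight")
```

All three land in the same narrow band of roughly 4.1 to 4.4 bits, dense and MoE alike: file size tracks total parameters times bit-width, no matter how few experts fire per token.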
Our finding: When we built Prositronic’s deploy pages, MoE models needed entirely different VRAM calculations and warning notices. A model with 3B active params might suggest it’ll run on a Raspberry Pi — until you realize all 36 billion weights need to fit in memory. We had to add a dedicated `MOE_EXPERT_CPU_OFFLOAD` notice to every MoE deploy page.
The gap between active and total is revealing. If you could load only the 3B active parameters, you’d need roughly 1.6 GB. Instead, you need 18.49 GB because all 256 experts — 35.95 billion parameters — must be resident for the router to select from them.
For larger MoE models, multi-GPU setups become unavoidable. DeepSeek V3.1 at Q4_K_M weighs 377.56 GB. Kimi K2.5 exceeds 500 GB. No single consumer GPU comes close. Does this mean you can’t run them at all? Not quite — that’s where expert offloading comes in (we’ll cover that in a later section).
Your MoE Models — A Deployment Comparison
Single active expert configurations deliver 50–80% higher throughput than 8-active configurations (Chitty-Venkata et al., 2025). Six MoE model families are available for self-hosted deployment today, ranging from 36B to over a trillion total parameters. Here’s how they compare:
| Model | Total | Active | Experts (used/total) | Q4 size | Min VRAM |
|---|---|---|---|---|---|
| Qwen3.5-35B-A3B | 35.95B | ~3B | 8+1 / 256 | 18.49 GB | 24 GB |
| Qwen3-235B-A22B | 235B | ~22B | 8 / 64 | ~130 GB | Multi-GPU |
| DeepSeek V3.1 | 684.53B | ~37B | 8+1 / 256 | 377.56 GB | Multi-GPU |
| Kimi K2.5 | 1,016B | ~32B | 8 / 384 | ~550 GB | Multi-GPU |
| Llama 4 Scout | ~109B | ~17B | 1 / 16 | ~60 GB | 2× 48 GB |
| Llama 4 Maverick | ~400B | ~17B | 1 / 128 | ~220 GB | Multi-GPU |
Fine-grained vs coarse-grained routing
Notice the architectural split. DeepSeek V3.1 and Qwen3.5-35B-A3B use 256 small experts (fine-grained routing) — each expert is a narrow specialist. Llama 4 Scout uses just 16 large experts (coarse-grained routing) — each expert is a generalist that handles a broader range of tokens. What does that mean for you?
Fine-grained models can be more precise in how they allocate compute. But they need more memory management and their routing is more sensitive to quantization. Coarse-grained models are simpler to deploy but less flexible. Llama 4 Scout activates only 1 expert per token (not 8), which makes its memory access pattern more predictable but limits specialization.
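The ratios make the contrast concrete. A quick sketch using the expert counts from the table above:

```python
# Active vs. total expert counts per token, from the comparison table.
configs = {
    "Qwen3.5-35B-A3B (fine-grained)": (8 + 1, 256),  # 8 routed + 1 shared
    "DeepSeek V3.1 (fine-grained)":   (8 + 1, 256),
    "Llama 4 Scout (coarse-grained)": (1, 16),
}
for name, (active, total) in configs.items():
    print(f"{name}: {active}/{total} experts hot per token "
          f"({active / total:.1%}); {total - active} resident but idle")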
Quantization Impact on MoE Models
FP8 achieves 25–30% higher throughput than FP16 at the highest batch sizes on H100 GPUs (Chitty-Venkata et al., 2025). But don’t let that tempt you into extreme quantization. MoE models are more sensitive to aggressive quantization than dense models because the router weights must remain high-precision to correctly select experts.
The router is the brain of an MoE model. It examines each token and decides which experts should process it. When you quantize a dense model to Q2, every parameter degrades uniformly. When you quantize an MoE model to Q2, the router starts picking the wrong experts. The result isn’t a gradual quality decline — it’s a cliff.
State-of-the-art MoE models suffer from non-negligible accuracy loss with extreme quantization under 4 bits; researchers have developed methods like MiLo (Mixture of Low-rank compensators) to recover accuracy, but these add complexity (Huang et al., 2025). For practical deployment, our recommendation is simple: don’t go below Q4_K_M.
MXFP4: the exception
There is one exception. Qwen3.5-35B-A3B offers an MXFP4_MOE quant at 20.11 GB that applies 4-bit MX quantization specifically to the expert layers while keeping attention and routing layers at higher precision. This selective approach preserves routing quality while still compressing the bulk of the model. If your hardware supports MXFP4 (NVIDIA Blackwell and later), it’s a strong alternative to Q4_K_M.
Dynamic quantization: a smarter approach
Unsloth’s dynamic quantization strategy selectively compresses MoE expert layers to lower bit-widths while keeping attention and routing layers at higher precision. This is why you see the “UD” prefix on many quant filenames — it stands for “Unsloth Dynamic.” The approach exploits the fact that experts contribute unequally to model quality: shared experts and routing layers are high-value targets for preservation, while rarely-activated experts tolerate more compression.
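The idea amounts to a per-tensor bit-width policy keyed on layer names. The sketch below is illustrative only; the patterns and bit choices are hypothetical, not Unsloth’s actual rules:

```python
import re

# Hypothetical policy: protect the router and attention, compress routed experts hardest.
POLICY = [
    (r"ffn_gate_inp", 8),   # router gate: keep high precision
    (r"attn_", 8),          # attention layers
    (r"shexp", 6),          # shared expert: moderate compression
    (r"ffn_.*_exps", 4),    # routed experts: most aggressive
]

def bits_for(tensor_name: str, default: int = 6) -> int:
    """Return the bit-width for a tensor, first matching pattern wins."""
    for pattern, bits in POLICY:
        if re.search(pattern, tensor_name):
            return bits
    return default

print(bits_for("blk.0.ffn_gate_inp.weight"))   # 8 (router stays precise)
print(bits_for("blk.0.ffn_down_exps.weight"))  # 4 (routed expert, compressed)
```

The design choice mirrors the prose above: the layers that decide routing get the most bits, the rarely-activated bulk gets the fewest.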
Expert Offloading Strategies
DeepSeek V3.1 in its TQ1_0 quant runs on a single 24 GB GPU with MoE offloading plus 96–128 GB of system RAM, achieving roughly 1–2 tokens per second (community benchmarks, 2025). Expert offloading is the key technique for running large MoE models on limited hardware. It stores inactive expert weights in system RAM or NVMe and loads them to the GPU on demand.
CPU offloading with llama.cpp
The most practical approach for single-GPU setups. In llama.cpp, you can offload all MoE expert layers to CPU while keeping attention and routing layers on the GPU:
```shell
llama-server \
  --model Qwen3.5-35B-A3B-UD-Q4_K_M.gguf \
  -ot ".ffn_.*_exps.=CPU" \
  --n-gpu-layers 999 \
  --ctx-size 8192 \
  --jinja
```
The `-ot ".ffn_.*_exps.=CPU"` flag tells llama.cpp to place all expert feed-forward layers on the CPU while keeping everything else on the GPU. This is more effective than using `--n-gpu-layers` alone, which offloads entire transformer blocks rather than splitting expert layers specifically.
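You can check which tensors the override pattern catches with a quick script. The tensor names below follow llama.cpp’s GGUF naming; treat the exact list as illustrative:

```python
import re

OVERRIDE = r".ffn_.*_exps."  # the part of the -ot argument before "=CPU"

tensors = [
    "blk.0.attn_q.weight",         # attention
    "blk.0.ffn_gate_inp.weight",   # router
    "blk.0.ffn_gate_exps.weight",  # routed experts
    "blk.0.ffn_down_exps.weight",  # routed experts
]
for name in tensors:
    target = "CPU" if re.search(OVERRIDE, name) else "GPU"
    print(f"{name} -> {target}")
```

Only the two `_exps` tensors match, so the router and attention stay on the GPU; that split is what keeps routing fast while the bulky expert weights live in system RAM.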
Multi-GPU: tensor parallelism wins
Tensor parallelism achieves 2×+ performance gains from 1 to 4 GPUs on H100, outperforming both pipeline parallelism and expert parallelism (Chitty-Venkata et al., 2025). If you have multiple GPUs connected via NVLink, tensor parallelism (TP) splits each layer across GPUs. Expert parallelism (EP) assigns different experts to different GPUs. TP wins because NVLink bandwidth is high enough to make layer splitting efficient, while EP suffers from load-balancing overhead — some experts get more traffic than others.
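The load-balancing problem behind EP’s disadvantage is easy to see in a toy model. Assume routing is skewed toward a few hot experts (a common pattern in practice) and experts are partitioned in contiguous blocks across 4 GPUs; all numbers are illustrative:

```python
import random

random.seed(42)
NUM_EXPERTS, NUM_GPUS, TOKENS = 256, 4, 10_000
PER_GPU = NUM_EXPERTS // NUM_GPUS  # experts 0-63 on GPU 0, 64-127 on GPU 1, ...

def pick_expert():
    """Skewed router: half of all traffic hits eight hot experts."""
    if random.random() < 0.5:
        return random.randrange(8)
    return random.randrange(NUM_EXPERTS)

gpu_load = [0] * NUM_GPUS
for _ in range(TOKENS):
    gpu_load[pick_expert() // PER_GPU] += 1  # expert parallelism: one expert, one GPU

print(gpu_load)  # GPU 0 hosts the hot experts and does most of the work
```

Tensor parallelism sidesteps this entirely: every layer is split evenly, so each GPU does the same amount of work for every token regardless of which expert fires.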
Speculative decoding hides offloading latency
A recent technique called SpecMoEOff combines speculative decoding with expert offloading, achieving up to 2.5× decode throughput improvement by generating draft tokens while expert weights are being transferred from RAM to GPU. This is still experimental but points toward a future where even trillion-parameter MoE models run on workstation hardware.
Performance Characteristics — What to Expect
Inter-token latency varies by nearly 100% between the best and worst MoE LLM performers (Chitty-Venkata et al., 2025). Dense models have predictable per-token latency because every token follows the same compute path. MoE models don’t — routing decisions create variance. If your application needs consistent response times, this matters.
Cold expert latency spikes
When a token gets routed to a rarely-used expert, and that expert’s weights have been evicted from the GPU cache (or were never loaded in an offloading setup), you get a latency spike. The GPU stalls while it waits for the expert weights to arrive from RAM. These spikes are unpredictable — they depend on the content of the prompt and which experts it activates.
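A toy model shows the mechanism: treat VRAM as an LRU cache of expert weights, where a miss costs a RAM-to-GPU copy. All latencies here are illustrative, not measured:

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache of expert weights resident in VRAM."""

    def __init__(self, capacity, hit_ms=0.1, miss_ms=15.0):
        self.cache = OrderedDict()
        self.capacity = capacity
        self.hit_ms, self.miss_ms = hit_ms, miss_ms

    def fetch(self, expert_id):
        """Return the simulated latency (ms) to make this expert usable."""
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # refresh recency
            return self.hit_ms
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)     # evict least-recently-used expert
        self.cache[expert_id] = True
        return self.miss_ms                    # stall: copy weights up from RAM

cache = ExpertCache(capacity=32)
latencies = [cache.fetch(e) for e in [1, 2, 1, 1, 200]]
print(latencies)  # [15.0, 15.0, 0.1, 0.1, 15.0] -- the rare expert 200 causes a spike
```

The same token sequence on a dense model would show five identical latencies; the spike exists only because routing can reach cold weights.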
Throughput varies by prompt
Some prompts hit the same small set of experts repeatedly. Others spread load across many experts. This makes MoE throughput fundamentally less predictable than dense model throughput. Shorter sequences (128 tokens) achieve up to 30% higher throughput than 2048-token sequences in MoE models (Chitty-Venkata et al., 2025).
Batching is also less efficient. In a dense model, every token in a batch follows the same compute path. In an MoE model, different tokens in the same batch route to different experts, creating memory access patterns that GPUs handle less efficiently.
When dense wins
Contrast all of this with Qwen3.5-9B. It’s dense. Every token takes the same compute path. Latency is predictable. Throughput is consistent. There are no cold-expert spikes. No routing overhead. It won’t match Qwen3.5-35B-A3B on benchmarks, but for latency-sensitive applications — real-time chat, interactive coding assistants, voice interfaces — that predictability can matter more than raw capability. You can deploy and compare both on Prositronic using our hardware compatibility checker.
Frequently Asked Questions
Why does my MoE model need so much VRAM if only a few experts are active?
All expert weights must reside in memory for instant routing. Qwen3.5-35B-A3B loads 35.95 billion parameters but activates roughly 3 billion per token. The router must be able to select any expert at any time, so every expert stays loaded even though most sit idle on any given forward pass.
Can I run DeepSeek V3.1 on a single GPU?
Yes, with expert offloading. The TQ1_0 quant fits on one 24 GB GPU with 96–128 GB system RAM, but expect roughly 1–2 tokens per second (community benchmarks, 2025). For usable speeds you’ll want at least two 48 GB GPUs or four 24 GB GPUs with the Q4_K_M quant and tensor parallelism.
What’s the minimum quantization I should use for MoE models?
Q4_K_M. Below this threshold, router weight degradation causes experts to be selected incorrectly, reducing output quality more sharply than equivalent quantization on dense models. MoE models suffer non-negligible accuracy loss with extreme quantization under 4 bits (Huang et al., 2025).
Is an MoE model always better than a dense model of similar active size?
Not for latency-sensitive applications. Qwen3.5-9B (dense) delivers predictable per-token latency without routing overhead. Qwen3.5-35B-A3B (MoE) scores higher on benchmarks but has variable latency due to expert routing. Pick dense when you need consistent response times; pick MoE when you need maximum capability per compute euro.
What’s the difference between fine-grained and coarse-grained MoE?
DeepSeek V3.1 and Qwen3.5-35B-A3B use 256 small experts (fine-grained). Llama 4 Scout uses 16 larger experts (coarse-grained). Fine-grained routing enables more precise specialization but requires more memory management. Coarse-grained models are simpler to deploy but less flexible in how they allocate compute.
What to Do Next
Here’s what to take away:
- MoE ≠ less VRAM. All expert weights must be loaded regardless of how many are active per token.
- Don’t go below Q4_K_M for MoE models. Router quality degrades sharply below 4-bit quantization.
- Expert offloading is essential for single-GPU setups. Use `-ot ".ffn_.*_exps.=CPU"` in llama.cpp.
- Dense models win on latency predictability. Choose based on your application’s tolerance for variance.
- Tensor parallelism beats expert parallelism for multi-GPU deployments, especially with NVLink.
Deploy both Qwen3.5-9B and Qwen3.5-35B-A3B on Prositronic to see the differences firsthand. Start with the Qwen3.5-9B deploy page and the Qwen3.5-35B-A3B deploy page, then check the hardware compatibility page to find the right GPU for your workload.