NVIDIA Nemotron 3 Super 120B A12B
NVIDIA
Code · Multilingual · Thinking · Tool Calls
Nemotron 3 Super 120B A12B is a 123.61-billion-parameter hybrid Mamba-2/Transformer LatentMoE model from NVIDIA that activates 12 billion parameters per token, routing each token to 22 of 512 routed experts plus 1 shared expert. Trained on over 25 trillion tokens, it targets agentic reasoning, code generation, tool calling, and multilingual conversation in 7 languages. A 256K context window, a toggleable thinking mode, and multi-token prediction enable high-throughput inference for complex multi-agent workflows. Thanks to its MoE sparsity, it quantizes well to GGUF for self-hosted deployment on multi-GPU setups.
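As a quick back-of-the-envelope view of that sparsity, the sketch below derives the active-parameter and active-expert shares from the figures quoted in this card (123.61B total, 12B active per token, 22 of 512 routed experts plus 1 shared). The variable names are illustrative, not an official spec.

```python
# Sparsity figures for Nemotron 3 Super 120B A12B, using only the numbers
# quoted in this card. Illustrative back-of-the-envelope math, not an official spec.

TOTAL_PARAMS_B = 123.61   # total parameters, in billions
ACTIVE_PARAMS_B = 12.0    # parameters activated per token, in billions
ROUTED_EXPERTS = 512      # routed experts per MoE layer
ACTIVE_ROUTED = 22        # routed experts selected per token
SHARED_EXPERTS = 1        # always-on shared expert

active_param_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
active_expert_share = (ACTIVE_ROUTED + SHARED_EXPERTS) / (ROUTED_EXPERTS + SHARED_EXPERTS)

print(f"Active parameters per token: {active_param_share:.1%} of total")  # ~9.7%
print(f"Experts engaged per token: {ACTIVE_ROUTED + SHARED_EXPERTS} of "
      f"{ROUTED_EXPERTS + SHARED_EXPERTS} (~{active_expert_share:.1%})")
```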
Hardware Configuration
| Quantization | Quality | Size |
|---|---|---|
| MXFP4_MOE | Very high | 76.42 GB |
| Q8_0 | High | 119.65 GB |
| Q8_K_XL | High | 123.39 GB |
| Q6_K | High | 106.87 GB |
| Q6_K_XL | High | 109.75 GB |
| Q5_K_M | Medium | 99.97 GB |
| Q5_K_S | Medium | 83.56 GB |
| Q5_K_XL | Medium | 100.19 GB |
| Q4_K_M | Medium | 76.87 GB |
| Q4_K_S | Medium | 73.59 GB |
| Q4_K_XL | Medium | 78.02 GB |
| Q3_K_M | Low | 57.48 GB |
| Q3_K_S | Low | 57.48 GB |
| Q3_K_XL | Low | 58.33 GB |
| Q2_K_XL | Low | 50.90 GB |
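As a rough way to match one of the quantizations above to a multi-GPU setup, the sketch below filters the listed GGUF file sizes against a pooled-VRAM budget. The 15% headroom reserved for KV cache, activations, and runtime buffers is an illustrative assumption; real requirements grow with context length and depend on the serving stack.

```python
# Rough helper for matching a quantization from the table above to a VRAM budget.
# Sizes are the GGUF file sizes listed in this card; the headroom fraction is an
# illustrative assumption, not a measured figure.

QUANT_SIZES_GB = {
    "MXFP4_MOE": 76.42, "Q8_0": 119.65, "Q8_K_XL": 123.39,
    "Q6_K": 106.87, "Q6_K_XL": 109.75,
    "Q5_K_M": 99.97, "Q5_K_S": 83.56, "Q5_K_XL": 100.19,
    "Q4_K_M": 76.87, "Q4_K_S": 73.59, "Q4_K_XL": 78.02,
    "Q3_K_M": 57.48, "Q3_K_S": 57.48, "Q3_K_XL": 58.33,
    "Q2_K_XL": 50.90,
}

def quants_that_fit(total_vram_gb: float, headroom: float = 0.15) -> list[str]:
    """Return quantizations whose weights fit in the budget, largest first."""
    budget = total_vram_gb * (1.0 - headroom)
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return [name for size, name in sorted(fitting, reverse=True)]

if __name__ == "__main__":
    # Example: 4 x 24 GB GPUs pooled into a 96 GB budget.
    print(quants_that_fit(4 * 24))
```

With a 96 GB pool and 15% headroom, the helper selects the Q4-family and smaller quantizations; the Q8 and Q6 variants need a larger multi-GPU budget.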
Last updated: March 12, 2026