NVIDIA Nemotron 3 Super 120B A12B
NVIDIA
Code · Multilingual · Thinking · Tool Calls
Nemotron 3 Super 120B A12B is a 123.61-billion-parameter hybrid Mamba-2/Transformer LatentMoE model from NVIDIA that activates 12 billion parameters per token, routing each token to 22 of 512 routed experts plus 1 shared expert. Trained on over 25 trillion tokens, it targets agentic reasoning, code generation, tool calling, and multilingual conversation in 7 languages. A 256K context window, a toggleable thinking mode, and multi-token prediction enable high-throughput inference for complex multi-agent workflows. Thanks to its MoE sparsity, it quantizes well to GGUF for self-hosted deployment on multi-GPU setups.
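As a quick back-of-the-envelope view of that sparsity, the sketch below derives the active-parameter and active-expert shares from the figures quoted in this card (123.61B total, 12B active per token, 22 of 512 routed experts plus 1 shared). The variable names are illustrative, not an official spec.

```python
# Sparsity figures for Nemotron 3 Super 120B A12B, using only the numbers
# quoted in this card. Illustrative back-of-the-envelope math, not an official spec.

TOTAL_PARAMS_B = 123.61   # total parameters, in billions
ACTIVE_PARAMS_B = 12.0    # parameters activated per token, in billions
ROUTED_EXPERTS = 512      # routed experts per MoE layer
ACTIVE_ROUTED = 22        # routed experts selected per token
SHARED_EXPERTS = 1        # always-on shared expert

active_param_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
active_expert_share = (ACTIVE_ROUTED + SHARED_EXPERTS) / (ROUTED_EXPERTS + SHARED_EXPERTS)

print(f"Active parameters per token: {active_param_share:.1%} of total")  # ~9.7%
print(f"Experts engaged per token: {ACTIVE_ROUTED + SHARED_EXPERTS} of "
      f"{ROUTED_EXPERTS + SHARED_EXPERTS} (~{active_expert_share:.1%})")
```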
Hardware Configuration
| Quantization | Quality | Size |
|---|---|---|
| MXFP4_MOE | Very high | 76.42 GB |
| Q8_0 | High | 119.65 GB |
| Q8_K_XL | High | 123.39 GB |
| Q6_K | High | 106.87 GB |
| Q6_K_XL | High | 109.75 GB |
| Q5_K_M | Medium | 99.97 GB |
| Q5_K_S | Medium | 83.56 GB |
| Q5_K_XL | Medium | 100.19 GB |
| Q4_K_M | Medium | 76.87 GB |
| Q4_K_S | Medium | 73.59 GB |
| Q4_K_XL | Medium | 78.02 GB |
| Q3_K_M | Low | 57.48 GB |
| Q3_K_S | Low | 57.48 GB |
| Q3_K_XL | Low | 58.33 GB |
| Q2_K_XL | Low | 50.90 GB |
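As a rough way to match one of the quantizations above to a multi-GPU setup, the sketch below filters the listed GGUF file sizes against a pooled-VRAM budget. The 15% headroom reserved for KV cache, activations, and runtime buffers is an illustrative assumption; real requirements grow with context length and depend on the serving stack.

```python
# Rough helper for matching a quantization from the table above to a VRAM budget.
# Sizes are the GGUF file sizes listed in this card; the headroom fraction is an
# illustrative assumption, not a measured figure.

QUANT_SIZES_GB = {
    "MXFP4_MOE": 76.42, "Q8_0": 119.65, "Q8_K_XL": 123.39,
    "Q6_K": 106.87, "Q6_K_XL": 109.75,
    "Q5_K_M": 99.97, "Q5_K_S": 83.56, "Q5_K_XL": 100.19,
    "Q4_K_M": 76.87, "Q4_K_S": 73.59, "Q4_K_XL": 78.02,
    "Q3_K_M": 57.48, "Q3_K_S": 57.48, "Q3_K_XL": 58.33,
    "Q2_K_XL": 50.90,
}

def quants_that_fit(total_vram_gb: float, headroom: float = 0.15) -> list[str]:
    """Return quantizations whose weights fit in the budget, largest first."""
    budget = total_vram_gb * (1.0 - headroom)
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return [name for size, name in sorted(fitting, reverse=True)]

if __name__ == "__main__":
    # Example: 4 x 24 GB GPUs pooled into a 96 GB budget.
    print(quants_that_fit(4 * 24))
```

With a 96 GB pool and 15% headroom, the helper selects the Q4-family and smaller quantizations; the Q8 and Q6 variants need a larger multi-GPU budget.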
Last updated: March 12, 2026