NVIDIA Nemotron 3 Nano 4B
NVIDIA
Capabilities: Code, Thinking, Tool Calls
Nemotron 3 Nano 4B is a 3.97-billion-parameter hybrid Mamba-2/Transformer dense model from NVIDIA, compressed from the larger 9B Nano v2 model. It supports a toggleable thinking mode, tool calling, and code generation, making it well suited for agentic and reasoning workloads. A 262K context window and flash attention enable long-context workflows on modest hardware. GGUF quantizations range from 2 to 8 GB, making it ideal for edge devices and consumer GPUs where memory is limited.
Quantizations
| Quantization | Quality | Size |
|---|---|---|
| BF16 | Full precision | 7.96 GB |
| Q8_0 | High | 4.23 GB |
| Q8_K_XL | High | 5.63 GB |
| Q6_K | High | 4.06 GB |
| Q6_K_XL | High | 4.56 GB |
| Q5_K_M | Medium | 3.16 GB |
| Q5_K_S | Medium | 3.11 GB |
| Q5_K_XL | Medium | 3.31 GB |
| Q4_K_M | Medium | 2.90 GB |
| Q4_K_S | Medium | 2.83 GB |
| Q4_K_XL | Medium | 3.13 GB |
| IQ4_NL | Medium | 2.57 GB |
| IQ4_XS | Medium | 2.54 GB |
| Q4_0 | Medium | 2.53 GB |
| Q4_1 | Medium | 2.71 GB |
| Q3_K_M | Low | 2.46 GB |
| Q3_K_S | Low | 2.36 GB |
| Q3_K_XL | Low | 2.68 GB |
| IQ3_XXS | Low | 2.39 GB |
| Q2_K_XL | Low | 2.50 GB |
| IQ2_M | Low | 2.30 GB |
| IQ2_XXS | Low | 2.18 GB |
Last updated: March 17, 2026