GPT OSS 20B
OpenAI
Multilingual · Thinking · Tool Calls
GPT OSS 20B is a 21.51-billion-parameter Mixture-of-Experts model from OpenAI, optimized for low-latency reasoning on consumer hardware. With 32 experts and 4 active per token, it runs within 16 GB of memory using native MXFP4 quantization. The model offers configurable reasoning effort, function calling, and multilingual conversation across 12 languages. A 128K context window and flash attention support long-document tasks, while the Apache 2.0 license and small active footprint make it ideal for local and latency-sensitive deployments. GGUF quants are available for local inference with llama.cpp.
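A quick sanity check of the 16 GB claim: MXFP4 packs 4-bit values in blocks of 32 that share an 8-bit scale, for roughly 4.25 effective bits per weight. The sketch below is a back-of-the-envelope estimate, not an exact accounting (real GGUF files add tensor metadata and keep some tensors at higher precision):

```python
# Back-of-the-envelope check that native MXFP4 weights fit in 16 GB.
# MXFP4 stores 4-bit values in blocks of 32 sharing an 8-bit scale,
# i.e. about 4 + 8/32 = 4.25 effective bits per weight (an approximation;
# real files carry extra metadata and some higher-precision tensors).
TOTAL_PARAMS = 21.51e9        # total parameter count, from the model card
BITS_PER_WEIGHT = 4 + 8 / 32  # ~4.25 bits under MXFP4 block scaling

weight_bytes = TOTAL_PARAMS * BITS_PER_WEIGHT / 8
weight_gib = weight_bytes / 2**30

print(f"~{weight_gib:.1f} GiB of weights")  # ~10.6 GiB
assert weight_gib < 16  # leaves headroom for KV cache and runtime overhead
```

The result lands near the ~11 GB quant sizes listed below and comfortably under the 16 GB figure, which is what leaves room for the KV cache on 16 GB devices.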
GGUF Quantizations
| Quantization | Quality | File Size |
|---|---|---|
| FP16 | Full precision | 12.85 GB |
| Q8_0 | High | 11.28 GB |
| Q8_K_XL | High | 12.29 GB |
| Q6_K | High | 11.21 GB |
| Q6_K_XL | High | 11.21 GB |
| Q5_K_M | Medium | 10.91 GB |
| Q5_K_S | Medium | 10.91 GB |
| Q4_K_M | Medium | 10.83 GB |
| Q4_K_S | Medium | 10.82 GB |
| Q4_K_XL | Medium | 11.06 GB |
| Q4_0 | Medium | 10.71 GB |
| Q4_1 | Medium | 10.78 GB |
| Q3_K_M | Low | 10.72 GB |
| Q3_K_S | Low | 10.68 GB |
| Q2_K | Low | 10.68 GB |
| Q2_K_L | Low | 10.95 GB |
Last updated: March 5, 2026