GPT OSS 120B
OpenAI
Multilingual · Thinking · Tool Calls
GPT OSS 120B is a 120.41-billion-parameter Mixture-of-Experts model from OpenAI, trained with large-scale distillation and reinforcement learning for agentic reasoning. It routes each token through 4 of its 128 experts, and native MXFP4 quantization of the MoE weights lets it fit on a single 80 GB GPU. The model supports configurable reasoning effort, function calling, and multilingual conversation across 12 languages. Its 128K context window and flash attention enable long-document workflows, while the Apache 2.0 license allows unrestricted commercial use. GGUF quants are available for self-hosted inference with llama.cpp.
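As a minimal sketch of the self-hosted GGUF workflow mentioned above, the snippet below loads a local quant with the llama-cpp-python bindings and runs one chat completion. The file path and quant choice are illustrative, and the "Reasoning: high" system line mirrors the effort-control convention used by gpt-oss chat templates; verify the exact wording against your template before relying on it.

```python
# Minimal inference sketch using llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

# Path and quant are illustrative; substitute your downloaded GGUF file.
llm = Llama(
    model_path="./gpt-oss-120b-Q4_K_M.gguf",
    n_ctx=16384,      # raise toward 128K only if VRAM allows the larger KV cache
    n_gpu_layers=-1,  # offload every layer to the GPU
    flash_attn=True,  # flash attention, per the model's long-context support
)

response = llm.create_chat_completion(
    messages=[
        # Assumed effort-control convention; confirm against your chat template.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Summarize the tradeoffs of MoE models."},
    ],
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
```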
Hardware Configuration
| Quantization | Quality | Size |
|---|---|---|
| FP16 | Full precision | 60.88 GB |
| Q8_0 | High | 59.03 GB |
| Q8_K_XL | High | 60.04 GB |
| Q6_K | High | 58.93 GB |
| Q6_K_XL | High | 58.93 GB |
| Q5_K_M | Medium | 58.57 GB |
| Q5_K_S | Medium | 58.56 GB |
| Q4_K_M | Medium | 58.46 GB |
| Q4_K_S | Medium | 58.45 GB |
| Q4_K_XL | Medium | 58.69 GB |
| Q4_0 | Medium | 58.32 GB |
| Q4_1 | Medium | 58.41 GB |
| Q3_K_M | Low | 58.32 GB |
| Q3_K_S | Low | 58.27 GB |
| Q2_K | Low | 58.27 GB |
| Q2_K_L | Low | 58.54 GB |
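The sizes above cluster tightly because the MoE weights are natively MXFP4 and are largely untouched by requantization, so only the remaining tensors shrink. As a rough single-GPU fit check under that constraint, the sketch below adds an illustrative KV-cache and runtime allowance to the file size; the overhead figures are assumptions for demonstration, not measured values.

```python
# Rough fit heuristic: model file plus KV-cache and runtime allowances must
# fit in GPU memory. Overhead terms are illustrative assumptions.

def fits_on_gpu(quant_size_gb: float, vram_gb: float = 80.0,
                kv_cache_gb: float = 6.0, runtime_gb: float = 2.0) -> bool:
    """Return True if the quant plausibly fits with full GPU offload."""
    return quant_size_gb + kv_cache_gb + runtime_gb <= vram_gb

# Sizes taken from the table above.
quants = {"Q8_0": 59.03, "Q4_K_M": 58.46, "Q2_K": 58.27}
for name, size in quants.items():
    verdict = "fits" if fits_on_gpu(size) else "does not fit"
    print(f"{name}: {verdict} on an 80 GB GPU")
```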
Last updated: March 5, 2026