# GLM-4.7 Flash

By Zai Org · Capabilities: code, thinking, tool calls
GLM-4.7 Flash is a 31.22-billion-parameter Mixture-of-Experts model from the GLM team at Zai Org, optimized for fast inference on agentic and coding tasks. It activates 4 of 64 experts plus 1 shared expert per token, delivering strong performance in the 30B class while keeping compute costs low. The model supports code generation, extended thinking, and tool calling in English and Chinese. With a 198K context window and flash attention, it quantizes well to GGUF and pairs with speculative decoding for high-throughput self-hosted deployments.
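Tool calling on a self-hosted deployment is typically exercised through an OpenAI-compatible chat-completions API (for example, llama.cpp's `llama-server`). Below is a minimal sketch of such a request payload; the model id, the `run_python` tool, and its schema are illustrative assumptions, not part of the model card:

```python
import json

def build_tool_call_request(user_message: str) -> dict:
    """Build an OpenAI-style chat-completions payload with one tool.

    The model id and the run_python tool are hypothetical placeholders;
    adjust them to whatever your server actually exposes.
    """
    return {
        "model": "glm-4.7-flash",  # assumed served model id
        "messages": [{"role": "user", "content": user_message}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "run_python",  # hypothetical tool
                    "description": "Execute a Python snippet and return stdout.",
                    "parameters": {
                        "type": "object",
                        "properties": {"code": {"type": "string"}},
                        "required": ["code"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }

payload = build_tool_call_request("What is 2**10? Use the tool.")
print(json.dumps(payload, indent=2))
```

The server responds with a `tool_calls` entry when the model decides to invoke the tool; your client executes it and sends the result back as a `tool` role message.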
## GGUF Quantizations
| Quantization | Quality | File size |
|---|---|---|
| MXFP4_MOE | Very high | 15.8 GB |
| Q8_0 | High | 29.66 GB |
| Q8_K_XL | High | 32.71 GB |
| Q6_K | High | 23 GB |
| Q6_K_XL | High | 24.26 GB |
| Q5_K_M | Medium | 19.94 GB |
| Q5_K_S | Medium | 19.39 GB |
| Q5_K_XL | Medium | 20.13 GB |
| Q4_K_M | Medium | 17.05 GB |
| Q4_K_S | Medium | 16.08 GB |
| Q4_K_XL | Medium | 16.32 GB |
| Q4_0 | Medium | 16.03 GB |
| Q4_1 | Medium | 17.67 GB |
| Q3_K_M | Low | 13.61 GB |
| Q3_K_S | Low | 12.38 GB |
| Q3_K_XL | Low | 12.86 GB |
| Q2_K_XL | Low | 11.07 GB |
Last updated: March 12, 2026