
GLM-4.7 Flash

Zai Org
Capabilities: Code, Thinking, Tool Calls

GLM-4.7 Flash is a 31.22-billion-parameter Mixture-of-Experts model from the GLM team at Zai Org, optimized for fast inference on agentic and coding tasks. It activates 4 of 64 routed experts plus 1 shared expert per token, delivering strong performance in the 30B class while keeping compute costs low. The model supports code generation, extended thinking, and tool calling in English and Chinese. It offers a 198K context window with flash attention, quantizes well to GGUF, and pairs with speculative decoding for high-throughput self-hosted deployments.
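
For self-hosted use, the model is typically reached through an OpenAI-compatible chat endpoint such as the one llama.cpp's llama-server exposes. The sketch below shows a tool-calling request under that assumption; the URL, model identifier, and tool schema are illustrative placeholders, not values from this page.

```python
# A minimal sketch of calling a self-hosted GLM-4.7 Flash instance through an
# OpenAI-compatible endpoint (e.g. llama.cpp's llama-server). URL, model name,
# and tool definition are illustrative assumptions.
import json
import requests

# Hypothetical local endpoint; adjust host and port to your deployment.
URL = "http://localhost:8080/v1/chat/completions"

# One illustrative tool in the OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_file_contents",  # hypothetical tool name
        "description": "Read a file from the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

payload = {
    "model": "glm-4.7-flash",  # assumed model identifier
    "messages": [{"role": "user", "content": "Show me the contents of main.py"}],
    "tools": tools,
}

resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
message = resp.json()["choices"][0]["message"]

# If the model decided to call a tool, the call arrives as structured JSON:
# a function name plus a JSON-encoded arguments string.
for call in message.get("tool_calls", []):
    fn = call["function"]
    print(fn["name"], json.loads(fn["arguments"]))
```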

Quantization Options

| Quantization | Quality   | Size     |
|--------------|-----------|----------|
| MXFP4_MOE    | Very high | 15.8 GB  |
| Q8_0         | High      | 29.66 GB |
| Q8_K_XL      | High      | 32.71 GB |
| Q6_K         | High      | 23 GB    |
| Q6_K_XL      | High      | 24.26 GB |
| Q5_K_M       | Medium    | 19.94 GB |
| Q5_K_S       | Medium    | 19.39 GB |
| Q5_K_XL      | Medium    | 20.13 GB |
| Q4_K_M       | Medium    | 17.05 GB |
| Q4_K_S       | Medium    | 16.08 GB |
| Q4_K_XL      | Medium    | 16.32 GB |
| Q4_0         | Medium    | 16.03 GB |
| Q4_1         | Medium    | 17.67 GB |
| Q3_K_M       | Low       | 13.61 GB |
| Q3_K_S       | Low       | 12.38 GB |
| Q3_K_XL      | Low       | 12.86 GB |
| Q2_K_XL      | Low       | 11.07 GB |
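
As a rough guide for picking a variant, the sketch below checks each quant's weight file size against a VRAM budget. The fixed runtime overhead for KV cache and compute buffers is an assumed placeholder; the real figure depends on context length and batch size.

```python
# Rough fit check for the GGUF quants listed above. Sizes come from the table;
# the overhead_gb default is an illustrative assumption, not a measured value.
QUANT_SIZES_GB = {
    "MXFP4_MOE": 15.8, "Q8_0": 29.66, "Q8_K_XL": 32.71,
    "Q6_K": 23.0, "Q6_K_XL": 24.26,
    "Q5_K_M": 19.94, "Q5_K_S": 19.39, "Q5_K_XL": 20.13,
    "Q4_K_M": 17.05, "Q4_K_S": 16.08, "Q4_K_XL": 16.32,
    "Q4_0": 16.03, "Q4_1": 17.67,
    "Q3_K_M": 13.61, "Q3_K_S": 12.38, "Q3_K_XL": 12.86,
    "Q2_K_XL": 11.07,
}

def fitting_quants(vram_gb: float, overhead_gb: float = 4.0) -> list[str]:
    """Return quants whose weights plus an assumed runtime overhead fit."""
    return [q for q, size in QUANT_SIZES_GB.items()
            if size + overhead_gb <= vram_gb]

if __name__ == "__main__":
    # Example: under this assumed overhead, a 24 GB card fits the Q5_K_M/S,
    # Q4, Q3, Q2, and MXFP4_MOE variants.
    print(fitting_quants(24.0))
```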
Last updated: March 12, 2026