# GLM-4.7 Flash

By Zai Org · Capabilities: code, thinking, tool calls
GLM-4.7 Flash is a 31.22-billion-parameter Mixture-of-Experts model from the GLM team at Zai Org, optimized for fast inference on agentic and coding tasks. It activates 4 of 64 experts plus 1 shared expert per token, delivering strong performance in the 30B class while keeping compute costs low. The model supports code generation, extended thinking, and tool calling in English and Chinese. With a 198K context window and flash attention, it quantizes well to GGUF and pairs with speculative decoding for high-throughput self-hosted deployments.
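Tool calling on a self-hosted deployment is typically exercised through an OpenAI-compatible chat-completions API (for example, llama.cpp's `llama-server`). Below is a minimal sketch of such a request payload; the model id, the `run_python` tool, and its schema are illustrative assumptions, not part of the model card:

```python
import json

def build_tool_call_request(user_message: str) -> dict:
    """Build an OpenAI-style chat-completions payload with one tool.

    The model id and the run_python tool are hypothetical placeholders;
    adjust them to whatever your server actually exposes.
    """
    return {
        "model": "glm-4.7-flash",  # assumed served model id
        "messages": [{"role": "user", "content": user_message}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "run_python",  # hypothetical tool
                    "description": "Execute a Python snippet and return stdout.",
                    "parameters": {
                        "type": "object",
                        "properties": {"code": {"type": "string"}},
                        "required": ["code"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }

payload = build_tool_call_request("What is 2**10? Use the tool.")
print(json.dumps(payload, indent=2))
```

The server responds with a `tool_calls` entry when the model decides to invoke the tool; your client executes it and sends the result back as a `tool` role message.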
## GGUF Quantizations
| Quantization | Quality | File size |
|---|---|---|
| MXFP4_MOE | Very high | 15.8 GB |
| Q8_0 | High | 29.66 GB |
| Q8_K_XL | High | 32.71 GB |
| Q6_K | High | 23 GB |
| Q6_K_XL | High | 24.26 GB |
| Q5_K_M | Medium | 19.94 GB |
| Q5_K_S | Medium | 19.39 GB |
| Q5_K_XL | Medium | 20.13 GB |
| Q4_K_M | Medium | 17.05 GB |
| Q4_K_S | Medium | 16.08 GB |
| Q4_K_XL | Medium | 16.32 GB |
| Q4_0 | Medium | 16.03 GB |
| Q4_1 | Medium | 17.67 GB |
| Q3_K_M | Low | 13.61 GB |
| Q3_K_S | Low | 12.38 GB |
| Q3_K_XL | Low | 12.86 GB |
| Q2_K_XL | Low | 11.07 GB |
Last updated: March 12, 2026