Google's Quantized Gemma 4: Runs Locally on Phones & Thin Laptops
Google's new QAT-quantized Gemma 4 checkpoints run locally on phones and laptops, with the smallest model, E2B, needing about 1GB of memory.
Two months ago, Google released the Gemma 4 series of models. It hasn't slowed down since: it first introduced Multi-Token Prediction (MTP) to accelerate inference, and two days ago it released a 12B-parameter version that fills the gap between the E4B and 26B MoE models.
Today, Google launched new checkpoints built with Quantization-Aware Training (QAT), with a single goal: to let Gemma 4 run smoothly on everyday consumer hardware such as phones and laptops while keeping quality nearly on par with the unquantized models.
The most significant result is that the memory footprint of Gemma 4 E2B has been compressed to about 1GB.
Additionally, the 12B model could already run on devices with 16GB of RAM/VRAM, but it was too slow to be practical. With the new QAT quantization, I tested the 12B-QAT model on a 16GB M5 MacBook Air using LM Studio. It is noticeably more usable than the previous quantized 12B version, and token generation speed is now acceptable. For a truly usable local multimodal model, I recommend a machine with 32GB of RAM/VRAM.
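
To see why bit width matters so much on a 16GB machine, here is a rough back-of-the-envelope sketch of weight memory. It is only an estimate: the bit widths below are illustrative assumptions (the announcement as summarized here does not state the exact precision of the QAT checkpoints), the helper name `weight_memory_gb` is hypothetical, and the figure covers weights only, not the KV cache, activations, or runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes.

    Ignores KV cache, activations, and runtime overhead, so real usage
    will be higher than this number.
    """
    bytes_total = num_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


if __name__ == "__main__":
    # Assumed bit widths for illustration; not taken from Google's release notes.
    for bits in (16, 8, 4):
        gb = weight_memory_gb(12e9, bits)
        print(f"12B parameters at {bits}-bit: ~{gb:.1f} GB of weights")
```

Under these assumptions, a 12B model at 16-bit would need about 24GB for weights alone, which does not fit in 16GB, while a 4-bit version needs about 6GB and leaves headroom for the operating system and context. That is consistent with the 12B-QAT model being workable on a 16GB laptop but more comfortable on 32GB.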



