Google's Quantized Gemma 4: Runs Locally on Phones & Thin Laptops

Google's new QAT quantized Gemma 4 runs locally on phones and laptops, with memory under 1GB.

Jun 09, 2026


Gemma 4: Our most capable open models to date

Two months ago, Google released the Gemma 4 model series. It hasn't slowed down since: it first introduced Multi-Token Prediction (MTP) to accelerate inference, and two days ago it released a 12B-parameter version that fills the gap between the E4B and 26B MoE models.

Today, Google launched new checkpoints built with Quantization-Aware Training (QAT), with a single goal: letting Gemma 4 run smoothly on everyday consumer hardware such as phones and laptops while keeping quality nearly on par with the unquantized models.
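For readers new to QAT, the general idea is that the forward pass during training uses weights that have been rounded to the low-precision grid and mapped back to floats ("fake quantization"), so the model learns weights that tolerate the rounding error. Below is a minimal sketch of that rounding step only, assuming symmetric per-tensor 4-bit quantization. The post does not describe Google's actual scheme, and `fake_quantize` is a hypothetical helper, not part of any Gemma tooling.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a symmetric signed integer grid, then map back to floats.

    Assumption: one scale for the whole tensor (per-tensor, symmetric).
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)                # nothing to scale
    scale = max_abs / qmax                  # float value of one integer step
    quantized = []
    for w in weights:
        q = round(w / scale)                # nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        quantized.append(q * scale)         # back to float for the forward pass
    return quantized


weights = [0.7, -0.35, 0.1, 0.0]            # toy values, not real model weights
quantized = fake_quantize(weights, bits=4)
errors = [abs(w - q) for w, q in zip(weights, quantized)]
```

Each entry in `errors` is at most half a quantization step. In real QAT the gradient is typically passed through the rounding as if it were the identity (the straight-through estimator), which this sketch leaves out.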

The most significant result is that the memory footprint of Gemma 4 E2B has been compressed down to 1GB.

Additionally, although the 12B model could already run on devices with 16GB of RAM/VRAM, it was previously too slow to be practical. I tested the new 12B-QAT model on a 16GB M5 MacBook Air using LM Studio: it is noticeably more usable than the previous quantized 12B version, and token generation speed is now acceptable. For a truly usable local multimodal model, I recommend a machine with 32GB of RAM/VRAM.
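As a rough back-of-the-envelope check on why a 12B model can fit in 16GB, weight memory is approximately the parameter count times bits per weight, divided by 8. The sketch below assumes 4-bit weights for the quantized checkpoint (the post does not state the bit width) and ignores the KV cache, activations, and runtime overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_12B = 12e9                                  # 12B parameters, from the post

bf16_gb = weight_memory_gb(PARAMS_12B, 16)         # unquantized 16-bit weights
int4_gb = weight_memory_gb(PARAMS_12B, 4)          # assumed 4-bit weights

print(f"16-bit: {bf16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone (24 GB) would not fit in 16GB, while 4-bit weights (6 GB) leave headroom for the operating system and context. That fits the author's experience that 16GB works but 32GB is more comfortable.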

Continue reading this post for free, courtesy of Meng Li.


Top Python Libraries
