Gemma 4 QAT + MTP: Local Inference at 120 tok/s Challenges Cloud APIs

Google's release of the Quantization-Aware Training (QAT) variants for the Gemma 4 family, combined with the llama.cpp merge of Multi-Token Prediction (MTP) support, has changed the calculus for anyone who builds with LLMs. The numbers are no longer in the realm of "impressive for local" — they're genuinely competitive with cloud APIs. On a $600 RTX 4070 Super with 12GB of VRAM, Gemma 4 12B with MTP pushes 120 tokens per second. That's not a compromise. That's a viable production target.

Three architectural decisions converged to make this possible: QAT, MoE, and MTP. Each is useful on its own. Together, they represent a system-level breakthrough that rewrites the economics of inference.

What QAT Actually Changes

Quantization-Aware Training is not the same as post-training quantization. Standard quantization (GPTQ, AWQ, GGUF's K-quants) takes a trained model and compresses it afterward, accepting a degradation floor you cannot train your way out of. QAT builds the quantization directly into the training process — the model learns to produce high-quality outputs even at 4-bit precision.
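To make the difference concrete, here is a minimal sketch of the "fake quantization" step that QAT schemes generally place in the forward pass: weights are rounded to a low-bit grid and mapped back to floats, so the loss is computed on the values the deployed model will actually use, while gradients still update the full-precision copy. The symmetric per-tensor scheme and the function name `fake_quantize` are assumptions for illustration; the article does not describe Google's actual recipe.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a symmetric signed integer grid and map them back to floats.

    Illustrative only: a per-tensor symmetric scheme, assumed for this sketch.
    """
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
    qmin = -qmax - 1                      # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights), 1.0
    scale = max_abs / qmax                # float value of one integer step
    dequantized = []
    for w in weights:
        q = max(qmin, min(qmax, round(w / scale)))  # snap to grid and clamp
        dequantized.append(q * scale)
    return dequantized, scale


weights = [0.5, -1.2, 0.33, 0.9, -0.05]
rounded, step = fake_quantize(weights)
for original, quantized in zip(weights, rounded):
    print(f"{original:+.3f} -> {quantized:+.3f}")
```

Post-training quantization applies the same rounding once, after training has finished, so the model never gets a chance to adapt to the rounding error.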

Google released QAT variants across the entire Gemma 4 lineup: the dense 12B and 31B models, the 26B A4B MoE variant, and the smaller E2B/E4B on-device models. The GGUF versions from Unsloth load into llama.cpp with zero extra tooling. These models maintain close to bfloat16 benchmark scores — 77.2% on MMLU Pro, 77.5% on AIME 2026 for the 12B — while requiring a fraction of the memory.

The 12B QAT model at UD-Q4_K_XL fits in roughly 7GB of VRAM. That leaves 5GB on a 12GB card for KV cache, context, and the MTP draft model. This is the first time a frontier-competitive 12B-class model has fit so comfortably on mid-range consumer hardware.
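The 7GB figure is easy to sanity-check. The sketch below estimates weight memory from parameter count and average bits per weight. The 4.5 bits-per-weight value is an assumption chosen for illustration (4-bit K-quant files keep some tensors at higher precision and store scales alongside the weights), not a number from the Unsloth release.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


CARD_VRAM_GB = 12          # RTX 4070 Super, from the article
ASSUMED_BITS = 4.5         # assumed average for a 4-bit K-quant file

q4_gb = weight_memory_gb(12, ASSUMED_BITS)
bf16_gb = weight_memory_gb(12, 16)
print(f"12B at ~{ASSUMED_BITS} bits: {q4_gb:.2f} GB, headroom {CARD_VRAM_GB - q4_gb:.2f} GB")
print(f"12B at bfloat16: {bf16_gb:.2f} GB")
```

Under that assumption the weights take about 6.75GB, in line with the roughly 7GB quoted above, whereas the bfloat16 weights alone would need 24GB.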

MTP: The 2x Multiplier

Multi-Token Prediction is the sleeper in this release. Gemma 4 was trained with an auxiliary MTP head: a small, lightweight module that drafts several future tokens ahead of the main model during inference. The main model generates a token, and the MTP head drafts up to four tokens that might follow it. The main model then checks the whole draft in a single forward pass and keeps every drafted token up to the first one it disagrees with, so one pass can yield several tokens instead of one.
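The verification step can be sketched as a toy greedy loop. This is not llama.cpp's implementation: `verify_draft`, `toy_target`, and the fixed word sequence are hypothetical names invented for the example, and a real engine scores all draft positions in one batched forward pass rather than calling the model once per token.

```python
def verify_draft(context, draft, target_next):
    """Accept the longest draft prefix the target model agrees with.

    target_next(tokens) returns the target model's greedy next token.
    The result always ends with one token chosen by the target itself,
    so output is identical to decoding without a draft.
    """
    accepted = []
    for token in draft:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)     # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    accepted.append(target_next(context + accepted))  # bonus token after a full match
    return accepted


# Toy "target model": always continues a fixed sentence.
SEQUENCE = ["the", "cat", "sat", "on", "the", "mat"]

def toy_target(tokens):
    return SEQUENCE[len(tokens)]


print(verify_draft(["the"], ["cat", "sat", "in", "a"], toy_target))
print(verify_draft(["the"], ["cat", "sat", "on", "the"], toy_target))
```

In the first call two drafted tokens survive and the target supplies the third. In the second call all four survive and the target adds a fifth.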

The llama.cpp PR that merged MTP support unlocked this on consumer hardware.

Without MTP, the Gemma 4 12B QAT on an RTX 4070 Super runs at roughly 60 tok/s — already respectable. With MTP enabled and --spec-draft-n-max 4, that jumps to 120 tok/s aggregate across diverse workloads. The draft acceptance rate hovers around 58–60%, meaning the model accepts roughly 3 out of every 5 speculative tokens — consistent with the aggregate rates reported across multiple hardware configurations in the PR benchmarks. On code generation and summarization, the acceptance rate climbs above 75%, pushing speeds past 130 tok/s.
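A back-of-the-envelope model shows how an acceptance rate near 60% squares with a 2x speedup. It assumes every step drafts the full four tokens and that the acceptance rate is the fraction of drafted tokens kept. Both are simplifications made for this sketch, not details taken from the PR.

```python
def tokens_per_step(draft_len, acceptance_rate):
    """Average tokens emitted per verification step.

    Each step yields the accepted share of the draft plus one token
    that the main model produces itself.
    """
    return 1 + draft_len * acceptance_rate


def implied_step_cost(tokens_step, observed_speedup):
    """Cost of one draft-and-verify step relative to one plain decode step."""
    return tokens_step / observed_speedup


per_step = tokens_per_step(draft_len=4, acceptance_rate=0.59)
cost = implied_step_cost(per_step, observed_speedup=120 / 60)
print(f"tokens per step: {per_step:.2f}")
print(f"implied step cost: {cost:.2f}x a plain decode step")
```

Under these assumptions each step emits about 3.4 tokens, so a measured 2x speedup implies that drafting and verifying cost roughly 1.7 times a normal decode step. The gain comes from amortizing weight reads over several tokens, not from skipping work.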

The llama.cpp PR benchmarks — spanning multiple hardware configurations — tell the story. On the 31B model running across two high-end GPUs (RTX 4090 + RTX 5090), MTP pushes throughput from ~33 tok/s to ~94 tok/s on code tasks — a 2.8x gain. On dual RTX 3090s, the speedup is similar: ~30 tok/s without MTP to ~74 tok/s with it, a 2.5x improvement. MTP accelerates any configuration where the draft model fits in spare VRAM.

For anyone who has dismissed speculative decoding as a theoretical nicety, these numbers demand a reassessment. MTP is not a marginal gain. It doubles throughput on mid-range GPUs with zero quality degradation — the main model still validates every output.

MoE: CPU Inference Without a GPU

The Gemma 4 26B A4B MoE variant changes the hardware conversation entirely. With only 3.8B active parameters out of 25.2B total, this model runs comfortably on systems without a dedicated GPU. An old desktop with 32GB of RAM and a mid-range CPU can serve this model at 7 tok/s — not blazing, but fully usable for chat, coding assistance, and document analysis.

Seven tokens per second was the output of cloud-hosted GPT-3 in 2022. Today it runs on a used desktop workstation — often found for as little as $150 on the secondary market — with no GPU required.

This matters because it collapses the hardware entry point. Token generation is bound by memory bandwidth, and with 8 of 128 experts active per token, the weights read for each token scale with the model's active parameter count, not its total. That is what puts the 26B model within reach of ordinary system RAM, even though CPU memory bandwidth remains far below that of a data center GPU. The gap between "runs on a GPU" and "runs on anything" is narrowing.
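The bandwidth argument can be turned into a rough ceiling: if each generated token requires reading the active weights once, tokens per second cannot exceed memory bandwidth divided by bytes read per token. The 4.5 bits per weight and the 50 GB/s bandwidth below are assumed values for illustration, and the estimate ignores KV cache reads and compute time.

```python
def decode_ceiling_tok_s(active_params_billions, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed when weight reads dominate."""
    gb_read_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


ASSUMED_BITS = 4.5             # assumed average for a 4-bit quant
ASSUMED_BANDWIDTH_GB_S = 50    # hypothetical desktop memory bandwidth

moe = decode_ceiling_tok_s(3.8, ASSUMED_BITS, ASSUMED_BANDWIDTH_GB_S)
dense = decode_ceiling_tok_s(25.2, ASSUMED_BITS, ASSUMED_BANDWIDTH_GB_S)
print(f"MoE, 3.8B active: up to {moe:.1f} tok/s")
print(f"dense 25.2B equivalent: up to {dense:.1f} tok/s")
```

With these inputs the ceiling is about 23 tok/s for the MoE and about 3.5 tok/s for a dense model of the same total size. The 7 tok/s quoted above sits under the MoE ceiling, as expected for a real system, and above what the same machine could manage if it had to read all 25.2B parameters per token.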

For CPU inference, choose K-quants over I-quants — GGUF quantizations like Q4_K_XL outperform IQ4_XS because the I-quant format has some tensors that aren't CPU-friendly. Optimizing the prefill path (for example, by tuning batch sizes or thread counts) can reduce prompt processing latency, though token generation remains bandwidth-bound regardless of the inference engine.

The Cloud Compute Context

Two days before MTP merged into llama.cpp, Google disclosed it will pay SpaceX $920 million per month for access to 110,000 NVIDIA GPUs. That's $11 billion per year for compute capacity alone — not including the $180 billion Alphabet has already committed in capital expenditures for 2026.

Meanwhile, used RTX 3090s with 24GB of VRAM now sell for $1,300–$1,500 on the secondary market, more than double their price two years ago. Demand for local inference hardware is pushing prices up because more people are discovering that local models are usable.

The inversion is stark. Cloud compute costs are accelerating into the billions per month, while local inference hardware is a one-time purchase that keeps getting more capable with each software update. A $1,400 3090 purchased today can run Gemma 4 31B with MTP at 60+ tok/s. At $920 million per month, that purchase price buys roughly four seconds of Google's SpaceX cluster. After 18 months of ownership, the card is paid off and the inference keeps running.
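The arithmetic behind these figures is simple enough to check directly. The sketch below uses only numbers quoted in this section. The "break-even spend" is a derived quantity: the monthly API bill that an 18-month payoff period would imply, not a price taken from any provider.

```python
MONTHLY_CLUSTER_COST = 920e6   # USD per month, from the article
CLUSTER_GPUS = 110_000         # from the article
CARD_PRICE = 1_400             # USD, used RTX 3090, from the article
PAYOFF_MONTHS = 18             # from the article


def annual_cost(monthly):
    return monthly * 12


def cost_per_gpu_month(monthly, gpus):
    return monthly / gpus


def break_even_monthly_spend(card_price, months):
    """Monthly spend the card must replace to pay for itself in `months`."""
    return card_price / months


print(f"annual cluster cost: ${annual_cost(MONTHLY_CLUSTER_COST) / 1e9:.2f}B")
print(f"per GPU per month: ${cost_per_gpu_month(MONTHLY_CLUSTER_COST, CLUSTER_GPUS):,.0f}")
print(f"implied API spend replaced: ${break_even_monthly_spend(CARD_PRICE, PAYOFF_MONTHS):.2f}/month")
```

The cluster works out to about $11.04 billion per year, or roughly $8,400 per GPU per month. An 18-month payoff corresponds to replacing about $78 per month of API spend.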

Common Misconception: "Local is catching up"

The dominant framing of local inference is that it's "catching up" to cloud APIs. This misses the structural shift.

With cloud APIs, you pay for convenience: no hardware management, instant scaling, access to the largest models. With local inference, you pay for sovereignty: no network latency, no data leaving your machine, predictable costs that scale to zero when idle. These are different value propositions, and the quality gap has narrowed to the point where the choice is genuinely situational rather than obvious.

For interactive coding assistance, document processing, code review, and structured data extraction — the workloads that make up most practical LLM usage — a 12B QAT model with MTP on a $600 GPU matches or exceeds the throughput of cloud API calls while eliminating per-token costs and data exposure. The latency is actually better (no network round-trip), and there is no rate limiting.

The Takeaway

Gemma 4 QAT with MTP is not "almost as good as cloud." It is genuinely competitive for a wide range of production workloads, and strictly superior on cost, latency, and privacy dimensions.

If you are building a tool that calls an API for every LLM interaction, benchmark it against Gemma 4 12B QAT with MTP on a consumer GPU. The cloud API should win on convenience and scale. It should not win on speed, quality, or cost per token. And if those dimensions matter more to your use case than elastic scaling, the migration path is now open.
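A minimal harness for that comparison might look like the sketch below. `generate` is a hypothetical callable you would write yourself for each backend (one wrapping your local server, one wrapping your cloud API). It must run a full completion and return the number of tokens produced. Nothing here is specific to llama.cpp or to any provider's SDK.

```python
import time


def measure_tok_s(generate, prompt, runs=3):
    """Median tokens per second over several runs of generate(prompt)."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
        rates.append(n_tokens / elapsed)
    rates.sort()
    return rates[len(rates) // 2]


def fake_backend(prompt):
    """Stand-in backend: pretends to emit 10 tokens in about 10 ms."""
    time.sleep(0.01)
    return 10


print(f"{measure_tok_s(fake_backend, 'hello'):.0f} tok/s")
```

Timing the whole call means network round-trips and prompt processing count against the backend. That is usually the number an interactive tool cares about, but it is not directly comparable to the decode-only figures quoted from benchmarks.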

The software stack — llama.cpp, MTP, QAT GGUF quants — is mature enough for production use. The hardware is a single GPU purchase away. The only remaining question is whether your workload fits in 12GB of VRAM. For a growing majority of real-world tasks, it does.

We just crossed a threshold. Local inference went from "interesting experiment" to "deployable alternative" in a single release cycle. The implications for cloud AI pricing, hardware demand, and who gets to build with frontier models are only beginning to unfold.
