Google DeepMind dropped Gemma 4 on April 2, 2026. Four model sizes, full multimodal support, Apache 2.0 license, and benchmark scores that make models 20x its size nervous. The smallest variant runs on a Raspberry Pi 5. The largest, a 31B dense model, ranks #3 among all open models on the Arena AI leaderboard.
This isn’t incremental. This is a fundamentally different argument about how big a model needs to be.
I spent the last day pulling apart every technical detail from the Hugging Face integration blog, Google’s official announcement, the NVIDIA edge deployment guide, the DeepMind model card, the Chatbot Arena leaderboard, and 445 comments on Hacker News. This is everything I found.
The Full Gemma 4 Family: Four Models, One Architecture Lineage
Gemma 4 ships in four sizes, all available as both base and instruction-tuned checkpoints:
- E2B: 2.3B effective parameters (5.1B total with embeddings), 128K context, 512-token sliding window
- E4B: 4.5B effective parameters (7.9B total with embeddings), 128K context, 512-token sliding window
- 26B-A4B: Mixture-of-Experts, 3.8B active out of 25.2B total (128 experts, 8 active + 1 shared), 256K context
- 31B: Fully dense, 30.7B parameters, 256K context, 1024-token sliding window
All four handle text, images, and video. The E2B and E4B also support native audio input (speech recognition and understanding, up to 30 seconds). Every model includes native function calling, structured JSON output, system instructions, and thinking/reasoning modes. They support 140+ languages with a 262K-token vocabulary.
The “E” in E2B and E4B stands for “Effective.” These models use Per-Layer Embeddings (PLE), an architecture inherited from Gemma 3n, which makes the total parameter count higher than the effective count. The extra parameters are embedding tables, not decoder weights, so they don’t contribute to inference compute in the same way.
The Apache 2.0 Shift
Every previous Gemma release used a custom Google license. It technically allowed commercial use, but included clauses that let Google restrict usage under certain conditions. Developers building production systems on Gemma models had legal uncertainty baked into their stack.
Gemma 4 is Apache 2.0. Full stop. You can use these models commercially, modify them, redistribute them, create derivative works, merge them with proprietary code, and build products without asking permission or worrying about future license changes. For regulated industries (healthcare, finance, government) where legal review of model licenses can take months, this removes a significant adoption blocker.
The Apache 2.0 decision also opens the door to community modifications that were legally gray before: abliterated/uncensored variants, domain-specific fine-tunes with custom safety profiles, and integration into copyleft projects. At least one uncensored Gemma 4 variant appeared within hours of release.
The Dense 31B: The Model That Changed the Conversation
The Gemma 4 31B is a fully dense transformer. Every one of its 30.7 billion parameters activates on every token. No routing, no expert selection, no sparsity tricks. And it scores an estimated 1452 on Arena AI (text), placing it at #3 among all open-source models worldwide and making it the #1 U.S. open-source model.
To understand why this matters, look at what’s around it on the leaderboard:
- GLM-5 (Z.ai/Zhipu): Arena score 1452, MIT license, estimated ~750B parameters. 24x larger than Gemma 4 31B.
- Kimi-K2.5-Thinking (Moonshot AI): Arena score 1451, Modified MIT. Estimated 1T+ total parameters. 34x larger.
- Qwen 3.5-397B-A17B (Alibaba): Arena score 1450, Apache 2.0. 397B total parameters, 17B active. 13x larger in total params.
- Gemma 4 31B (Google): Arena score 1452, Apache 2.0. 31B dense. Ties or beats all of the above.
Below it: DeepSeek-R1-0528 at 1426, GPT-5-high at 1444, Grok-4 at 1442. These are proprietary API-only models from well-funded labs. A 31B open model matching them is not normal.
Benchmark Deep Dive
Here’s the full benchmark table for the instruction-tuned models with thinking enabled, compared against Gemma 3 27B (the previous generation):
Reasoning and Knowledge:
- MMLU-Pro (multilingual Q&A): Gemma 4 31B 85.2% | 26B-A4B 82.6% | E4B 69.4% | E2B 60.0% | Gemma 3 27B 67.6%
- AIME 2026 (competition math, no tools): 31B 89.2% | 26B 88.3% | E4B 42.5% | E2B 37.5% | Gemma 3 20.8%
- GPQA Diamond (PhD-level science): 31B 84.3% | 26B 82.3% | E4B 58.6% | E2B 43.4% | Gemma 3 42.4%
- Tau2 Bench (agentic tool use, avg of 3): 31B 76.9% | 26B 68.2% | E4B 42.2% | E2B 24.5% | Gemma 3 16.2%
- BigBench Extra Hard: 31B 74.4% | 26B 64.8% | E4B 33.1% | E2B 21.9% | Gemma 3 19.3%
- MMMLU (multilingual): 31B 88.4% | 26B 86.3% | E4B 76.6% | E2B 67.4% | Gemma 3 70.7%
Coding:
- LiveCodeBench v6: 31B 80.0% | 26B 77.1% | E4B 52.0% | E2B 44.0% | Gemma 3 29.1%
- Codeforces ELO: 31B 2150 | 26B 1718 | E4B 940 | E2B 633 | Gemma 3 110
- HLE (Humanity’s Last Exam, no tools): 31B 19.5% | 26B 8.7%
- HLE with search: 31B 26.5% | 26B 17.2%
Vision:
- MMMU Pro (multimodal reasoning): 31B 76.9% | 26B 73.8% | E4B 52.6% | E2B 44.2% | Gemma 3 49.7%
- MATH-Vision: 31B 85.6% | 26B 82.4% | E4B 59.5% | E2B 52.4% | Gemma 3 46.0%
- OmniDocBench 1.5 (edit distance, lower is better): 31B 0.131 | 26B 0.149 | E4B 0.181 | E2B 0.290 | Gemma 3 0.365
- MedXPertQA MM (medical multimodal): 31B 61.3% | 26B 58.1% | E4B 28.7% | E2B 23.5%
Audio (E2B and E4B only):
- CoVoST (speech translation): E4B 35.54 | E2B 33.47
- FLEURS (speech recognition, lower is better): E4B 0.08 | E2B 0.09
Long Context:
- MRCR v2 8-needle 128K (avg): 31B 66.4% | 26B 44.1% | E4B 25.4% | E2B 19.1% | Gemma 3 13.5%
The generation-over-generation jump from Gemma 3 27B to Gemma 4 31B is staggering. AIME goes from 20.8% to 89.2%. LiveCodeBench from 29.1% to 80.0%. Codeforces ELO from 110 to 2150. These aren’t small improvements. These are qualitative leaps into a different capability tier.
Gemma 4 31B vs Qwen 3.5 27B: Head-to-Head
The community wasted no time building side-by-side comparisons. Here’s the consolidated picture from multiple independent testers:
- MMLU-Pro: Gemma 4 85.2% vs Qwen 3.5 86.1% (Qwen wins slightly)
- GPQA Diamond: Gemma 4 84.3% vs Qwen 3.5 85.5% (Qwen wins slightly)
- LiveCodeBench v6: Gemma 4 80.0% vs Qwen 3.5 80.7% (tie)
- Codeforces ELO: Gemma 4 2150 vs Qwen 3.5 1899 (Gemma wins big)
- TAU2-Bench: Gemma 4 76.9% vs Qwen 3.5 79.0% (Qwen wins)
- MMMLU (multilingual): Gemma 4 88.4% vs Qwen 3.5 85.9% (Gemma wins)
- HLE (no tools): Gemma 4 19.5% vs Qwen 3.5 24.3% (Qwen wins)
On raw automated benchmarks, it’s close to a draw with Qwen 3.5 holding a slight edge on reasoning tasks and Gemma 4 winning on competitive coding and multilingual. But the Arena AI ELO (which measures human preference from millions of blind comparisons) favors Gemma 4: 1452 vs 1450. The gap between ELO and automated benchmarks suggests Gemma 4 produces responses that humans actually prefer even when raw accuracy numbers are similar. That’s a training data and RLHF quality signal, not an architecture signal.
One early adopter on dev.to put it well: “The honest take is Gemma 4 ties with Qwen, if not Qwen being slightly ahead. And Qwen 3.5 is more compute efficient too.” But another noted: “Gemma 4 makes translategemma feel outdated instantly” for non-English tasks. The multilingual story is where Gemma 4 clearly separates from the pack.
Architecture: How 31B Parameters Punch This Far Above Their Weight
Gemma 4 doesn’t achieve these numbers by simply scaling up a standard transformer. The Hugging Face integration blog details several architectural innovations that work together to squeeze maximum intelligence out of every parameter. Notably, Google removed complex or inconclusive features like Altup that appeared in earlier Gemma research, favoring a cleaner combination that’s highly compatible across inference libraries.
Dual Attention: Sliding Window + Global Full-Context
Gemma 4 alternates between two types of attention layers:
- Local sliding-window attention: Each token attends only to nearby tokens within a fixed window. The 31B and 26B models use 1024-token windows; the E2B and E4B use 512-token windows. This is cheap because the attention matrix is small.
- Global full-context attention: Standard attention across the entire sequence. Expensive at long contexts, but necessary for capturing long-range dependencies.
By alternating these, the model gets the benefit of full-context awareness (through the global layers) without paying the full quadratic cost on every layer. Each attention type also gets its own RoPE configuration: standard RoPE for sliding layers, proportional RoPE for global layers. The proportional variant enables the 256K context window on the larger models without the positional encoding breaking down at extreme positions.
Shared KV Cache
This is an efficiency optimization that directly reduces both compute and memory during inference. The last num_kv_shared_layers layers of the model don’t compute their own key and value projections. Instead, they reuse the K/V tensors from the last non-shared layer of the same attention type (sliding or full).
In practice, this means the model has fewer distinct KV caches to maintain during generation. For long-context use (the 256K window) and on-device deployment, this is significant. It’s the difference between fitting in memory and not fitting. The quality impact, according to Hugging Face’s testing, is minimal.
One Hacker News commenter pointed out that the 31B’s KV cache behavior isn’t bugged (as some initially suspected) but reflects a static sliding window cost of 3.6GB. With an IQ4_XS quantization at 15.2GB for weights, you’re looking at roughly 64K context on a 24GB GPU, or 100K+ with 8-bit KV quantization after a recent llama.cpp optimization landed.
Per-Layer Embeddings (PLE)
This is the most distinctive architectural feature in the smaller Gemma 4 models (E2B and E4B), inherited from Gemma 3n. Standard transformers give each token a single embedding vector at input. That single vector has to frontload everything the model might need across all layers. It’s a bottleneck.
PLE adds a parallel, lower-dimensional conditioning pathway alongside the main residual stream. For each token, PLE produces a small dedicated vector for every layer by combining two signals:
- A token-identity component: from a second embedding lookup table
- A context-aware component: from a learned projection of the main embeddings
Each decoder layer then uses its corresponding PLE vector to modulate the hidden states via a lightweight residual block after attention and feed-forward. This gives each layer its own channel to receive token-specific information exactly when it becomes relevant, rather than requiring everything to be packed into a single upfront embedding.
Because the PLE dimension is much smaller than the main hidden size, this adds meaningful per-layer specialization at modest parameter cost. This is why the total parameter count (5.1B for E2B) is higher than the effective count (2.3B): the PLE embedding tables add parameters, but they’re cheap at inference time.
For multimodal inputs (images, audio, video), PLE is computed before soft tokens are merged into the embedding sequence. Multimodal positions use the pad token ID, so they receive neutral per-layer signals. The model learns to distinguish text from multimodal content partly through this mechanism.
Vision Encoder
The image encoder uses learned 2D positional embeddings with multidimensional RoPE. Two critical improvements over Gemma 3:
- Variable aspect ratios: No forced square crops. Images are processed at their natural dimensions, preserving information that square cropping destroys.
- Configurable token budgets: The encoder can produce 70, 140, 280, 560, or 1120 image tokens per image. You pick your speed/memory/quality tradeoff per request. A quick thumbnail analysis might use 70 tokens; a detailed OCR task might use 1120.
The Hugging Face team tested all four model sizes on object detection, GUI element detection, image captioning, and HTML reproduction from screenshots. All models performed well, with the 31B producing the most detailed and accurate responses. Even the E2B correctly identified bounding boxes for UI elements and generated passable HTML from screenshots, which is remarkable for a 2.3B-effective model.
Audio Encoder
The E2B and E4B models include a USM-style conformer audio encoder, the same base architecture used in Gemma 3n. It handles speech recognition, speech-to-text transcription, and audio question-answering for clips up to 30 seconds. In Hugging Face’s tests, both models accurately transcribed an Obama speech excerpt and answered questions about a live concert video’s audio content (though the E2B hallucinated some audio details).
The larger 31B and 26B models do not include audio support. They handle video by processing the visual frames without the audio track. This is presumably a parameter budget decision: fitting a full audio encoder into a 31B model would add significant parameters without proportional benefit for the primary text/code/reasoning use case.
The E2B: Frontier AI on a Raspberry Pi 5
The most surprising member of the family is the smallest. Gemma 4 E2B has 2.3 billion effective parameters and fits comfortably on devices with 4GB+ RAM. Google explicitly lists Raspberry Pi, Android phones, and NVIDIA Jetson Orin Nano as target hardware.
Let that sink in. This is a model that:
- Handles text, images, audio, and video input
- Has a 128K token context window
- Supports native function calling for agentic workflows
- Supports 140+ languages
- Includes thinking/reasoning modes
- Runs completely offline with near-zero latency
- Scores 60% on MMLU-Pro and 37.5% on AIME 2026
For comparison, Gemma 3’s full 27B model (which required a beefy GPU) scored 67.6% on MMLU-Pro and 20.8% on AIME 2026. The E2B scores nearly as well on knowledge benchmarks and almost double on math compared to a model that was 12x its size from the previous generation. A 2.3B model from April 2026 outperforms a 27B model from 2025 on competition mathematics. The rate of improvement is dizzying.
Running it is trivial:
ollama run gemma4:e2b
That’s a 7.2GB quantized download (Q4_K_M) that gives you a multimodal AI with native function calling on an $80 single-board computer. Or on the phone in your pocket. Or in a browser tab via WebGPU.
What You Can Actually Do With the E2B
The Hugging Face blog includes extensive tests across modalities. Here’s what works out of the box with the smallest model:
Object detection and GUI pointing: Give it a screenshot and ask “What’s the bounding box for the ‘view recipe’ element?” It responds in JSON format with coordinates, no special prompting needed. The coordinates refer to a 1000×1000 image space relative to input dimensions.
Audio transcription: Feed it an MP3 and ask for a transcription. The E2B produced: “This week I traveled to Chicago to deliver my final farewell address to the nation following in the tradition of presidents before me It was an opportunity to say thank you whether we’ve seen eye to eye or rarely agreed at all…” Clean, accurate, no punctuation quirks.
Video understanding: The E2B correctly identified a live concert performance from video, including the setting (outdoor festival), performers, stage setup, and attempted to describe the lyrics, though it hallucinated some audio details. The E4B handled this more accurately.
Image captioning: “A medium shot captures a weathered seagull perched atop a stone pedestal in what appears to be a bustling European square, with a grand, classical-style building featuring ornate columns and architectural details dominating the right side of the frame.” That’s from the 2.3B model. Accurate, detailed, properly structured.
Multimodal function calling: Give it an image of a Thai temple and the prompt “What is the city in this image? Check the weather there right now” alongside a get_weather tool definition. The E2B correctly identifies Bangkok from the architectural style, reasons through its approach step by step, and generates the correct function call: get_weather(city="Bangkok").
Raspberry Pi 5 Deployment
The Raspberry Pi 5 has 4GB or 8GB RAM variants, a Broadcom BCM2712 quad-core Cortex-A76 at 2.4GHz, and no GPU in the traditional sense. Running a quantized E2B model (Q4_K_M at ~3.5GB in memory) on the 8GB Pi 5 leaves enough headroom for the OS and a reasonable context window.
Google and NVIDIA explicitly target this deployment. NVIDIA’s Jetson AI Lab provides containers and tutorials for running Gemma 4 E2B and E4B on Jetson Orin Nano, which has a comparable hardware profile to the Pi 5 but with a small GPU. The architecture features that enable this are the Per-Layer Embeddings (which can be cached for faster loading and reduced memory use) and the shared KV cache (which reduces memory overhead for context).
The implications for IoT, robotics, and edge computing are significant. A security camera with a Raspberry Pi can now do multimodal reasoning on its own video feed. A smart home hub can understand voice commands and process images from cameras without ever connecting to the internet. An agricultural drone can classify crop health from aerial photos using onboard compute.
iPhone and Android: Google AI Edge Gallery
Four days after Gemma 4 launched, Google shipped the Google AI Edge Gallery, a free app that runs Gemma 4 E2B entirely on your iPhone or Android phone. No cloud. No API key. No Google account required. The model downloads once (~3.5GB) and runs locally using Google’s LiteRT-LM runtime.
The app hit 719 points and 201 comments on Hacker News within 18 hours. And for good reason: it doesn’t just run a chatbot. It includes mobile actions, tool calls that let the on-device model trigger native phone functions. Turn on the flashlight, open Maps to a location, set a timer. All decided by the LLM, all executed locally.
Performance is surprisingly usable. Google’s official benchmarks show 56 tokens/sec on iPhone and 52 tok/s on Qualcomm. Community reports vary by device: ~40 tok/s via MLX on recent iPhones, 29 tok/s on a Samsung Galaxy S25 edge, and ~12 tok/s on an iPhone 14 (older A15 chip). One user built a real-time audio/video-in, voice-out demo on a MacBook with the E2B and estimated the same pipeline could run on an iPhone 17 Pro.
Early testers on HN are already imagining what this enables. One developer building privacy-first apps for teachers called it exactly what they need: local AI that respects stringent education privacy laws without any data leaving the device. Another noted that Apple is reportedly working with Google to integrate Gemma into a future version of Siri. The original poster summed it up: “This gives me hope for a future Siri, ‘Her’ style.”
The app is also available on Android, and it’s part of Google’s broader AI Edge initiative. Currently it runs on the GPU, but NPU support (Apple’s Neural Engine has 35 TOPS vs the GPU’s 7 TFLOPS) could bring another significant speed boost.
Before You Try: Check canirun.ai
Before attempting to run any Gemma 4 variant locally, check canirun.ai. The site detects your GPU, VRAM, memory bandwidth, RAM, and CPU cores, then shows exactly which models and quantization levels your hardware can actually run. It lists all Gemma variants with memory requirements at every quantization level from Q2_K through F16.
This is particularly valuable for Gemma 4 because the memory requirements vary dramatically across the family:
- E2B Q4_K_M: ~3.5GB (runs on a Raspberry Pi 5 8GB)
- E4B Q4_K_M: ~5GB (runs on phones with 8GB RAM)
- 26B-A4B Q4_K_XL: ~14GB (fits on a 16GB GPU with limited context)
- 31B Q4_K_XL: ~18GB (needs a 24GB GPU, tight on context)
- 31B BF16: ~62GB (needs an 80GB H100 or equivalent)
The difference between “runs fast” and “swaps to death” can be a few hundred megabytes of VRAM. canirun.ai takes the guesswork out. It also recommends Ollama (version 0.6+) as the default runtime and shows the exact commands to get started.
The MoE 26B-A4B: The Speed Variant
The 26B Mixture-of-Experts model is Gemma’s first MoE release. It activates only 3.8 billion parameters per token out of 25.2B total, using 128 experts with 8 active plus 1 shared expert per forward pass. It scores 1441 on Arena AI, just 11 points behind the dense 31B, ranking #6 among open models.
The 26B is designed for latency-sensitive interactive use. On a MacBook Air M4 with 32GB RAM, users report running the Q4_K_XL quantization with 32K context comfortably. One user on HN benchmarked a 24GB RX 7900 XTX and got impressive results:
llama-batched-bench -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL
| Context | TG tok/s |
|---------|-----------|
| 1K | 120.29 |
| 2K | 119.04 |
| 4K | 117.08 |
| 8K | 114.87 |
| 16K | 107.65 |
| 32K | 100.12 |
| 64K | 88.12 |
| 128K | 71.25 |
100+ tokens/sec at 32K context and 71 tok/s at the full 128K on a single consumer GPU. That’s fast enough for real-time interactive use, even at very long contexts. The prompt processing speed is even more impressive: 2,650 tokens/sec at 2K context, staying above 960 tok/s even at the full 128K.
Both the 31B and 26B unquantized bfloat16 weights fit on a single 80GB NVIDIA H100. Quantized versions run on consumer GPUs from the RTX 4060 up.
Multimodal Capabilities: What Actually Works
The Hugging Face team ran extensive tests across all modalities. Here’s a detailed breakdown of what works at each model size.
HTML Generation From Screenshots
They gave each model a screenshot of a landing page and asked it to “Write HTML code for this page” with thinking enabled and 4000 max tokens. All four models produced functional HTML. The 31B and 26B versions were nearly pixel-perfect reproductions. Even the E2B produced a recognizable approximation. This has obvious applications for design-to-code workflows and automated UI testing.
Video Understanding
The smaller models (E2B, E4B) process video with audio. The larger models process video frames only (no audio track). Despite not being explicitly post-trained on video data, all models demonstrate video comprehension. The E4B correctly described a concert scene including the song’s lyrical themes (“struggles and disillusionment of modern life, specifically the feeling of being stuck”). The 31B, working without audio, still accurately described the visual scene, performer actions, and stage setup.
Multimodal Function Calling With Thinking
This is where Gemma 4 gets interesting for AI agent builders. You can define tools, show the model an image, and ask it to use the tools based on what it sees. The thinking trace shows the model’s reasoning chain: “Analyze the image… identify the landmark… determine the city is Bangkok… formulate tool call.” Even the E2B follows this pattern, though its reasoning is more verbose.
Combined with the structured JSON output and native system instructions, this makes Gemma 4 a production-ready foundation for multimodal agents. The fact that function calling works on a model small enough to run on a phone means you can build agents that operate entirely offline.
Inference Ecosystem: Day-One Support Everywhere
One of Gemma 4’s strongest launch stories is the breadth of day-one integration. Google clearly invested in ensuring every major inference engine worked at release.
llama.cpp
Image+text support from day one. Start an OpenAI-compatible server with one command:
llama-server -hf ggml-org/gemma-4-E2B-it-GGUF
This works with LM Studio, Jan, and coding agents like Pi. Quantized GGUF checkpoints from both the official ggml-org and Unsloth are available. Unsloth’s “Dynamic 2.0” quantizations are particularly interesting: they analyze every layer and selectively adjust quantization type per layer, using a hand-curated calibration dataset of 1.5M+ tokens. Users report these deliver better quality than standard GGUF quants at the same bit width.
Coding Agent Configs
The Hugging Face blog includes complete configuration files for four coding agents:
- Hermes: just run
hermes modelafter starting the llama.cpp server - OpenClaw: run
openclaw onboard - Pi: define
~/.pi/agent/models.jsonpointing tolocalhost:8080/v1 - Open Code: define
~/.config/opencode/opencode.jsonwith the OpenAI-compatible provider
This means you can turn your laptop into a fully local AI coding assistant. No API keys, no cloud dependency, no data leaving your machine. Combined with the 256K context window on the 31B, you can pass entire repositories in a single prompt.
MLX (Apple Silicon)
Full multimodal support via mlx-vlm. The notable feature: TurboQuant, which delivers the same accuracy as uncompressed baseline while using ~4x less active memory and running significantly faster end-to-end. This makes long-context inference practical on Macs without sacrificing quality:
mlx_vlm.generate
--model "mlx-community/gemma-4-26B-A4B-it"
--prompt "Your prompt here"
--kv-bits 3.5
--kv-quant-scheme turboquant
transformers (Python)
The new AutoModelForMultimodalLM class and the any-to-any pipeline make inference straightforward. The built-in chat template handles formatting, thinking mode, audio extraction from video, and tool definitions. This is the path for fine-tuning with TRL, PEFT, and bitsandbytes.
WebGPU
The E2B runs in the browser via transformers.js. Hugging Face shipped a working demo. This opens up entirely new deployment patterns: AI applications that run client-side with zero backend infrastructure.
NVIDIA Infrastructure
NVIDIA published deployment guides across their full hardware stack:
- DGX Spark: GB10 Grace Blackwell Superchip with 128GB unified memory. Runs the 31B in BF16 natively. Includes NeMo Automodel recipes for fine-tuning.
- Jetson Orin Nano: E2B and E4B for robotics and edge AI. Containers available at Jetson AI Lab.
- RTX GPUs: Quantized inference via Ollama and llama.cpp. RTX Pro users can also use vLLM.
- NVIDIA NIM: Pre-packaged optimized microservices for production deployment. Free API available for prototyping at build.nvidia.com.
- NVFP4: 4-bit quantized checkpoint for Blackwell GPUs using NVIDIA Model Optimizer. 4-bit precision with near-identical accuracy to 8-bit.
Fine-Tuning: Day One, With Caveats
Fine-tuning support exists but had a rough launch. Multiple early adopters reported issues within hours:
- HuggingFace Transformers didn’t recognize the
gemma4architecture at first. Required installing from source. - PEFT couldn’t handle
Gemma4ClippableLinear, a new layer type in the vision encoder. Required a monkey-patch. - A new
mm_token_type_idsfield is required during training even for text-only data. Required a custom data collator.
All three issues had bug reports filed and responses within hours. Unsloth Studio provided day-one fine-tuning support with a UI that handles these edge cases. The HuggingFace TRL team also shipped a remarkable demo: training Gemma 4 to drive in the CARLA driving simulator, where the model sees the road through a camera, decides actions, and learns from outcomes. After training, it consistently changes lanes to avoid pedestrians.
For production fine-tuning, NVIDIA NeMo Automodel provides recipes for supervised fine-tuning (SFT) and memory-efficient LoRA directly from HuggingFace checkpoints, no format conversion needed.
What This Means for Qwen, Llama, and the “Bigger Is Better” Paradigm
The open model landscape in early April 2026 presents a stark efficiency picture. Here are the top open models by Arena AI score:
- GLM-5 (Z.ai): 1452, ~750B params, MIT
- Kimi-K2.5-Thinking (Moonshot): 1451, ~1T+ params, Modified MIT
- Gemma 4 31B (Google): 1452, 31B dense, Apache 2.0
- Qwen 3.5-397B-A17B (Alibaba): 1450, 397B MoE (17B active), Apache 2.0
- Qwen 3-235B-A22B-Instruct: 1418, 235B MoE (22B active), Apache 2.0
- DeepSeek-R1-0528: 1426, unknown size, MIT
Gemma 4 31B matches or beats every one of them with 10x to 30x fewer total parameters. The implications ripple through every layer of the AI stack.
Serving Cost and Infrastructure
Running a 31B dense model costs a fraction of serving a 397B MoE. Less VRAM, less bandwidth, less electricity, less money per token. The 31B fits on a single H100 in BF16; a 397B model needs a multi-GPU setup even with quantization. For companies deploying at scale, this is potentially the difference between a viable business model and a money-losing one.
This is especially relevant as the cost curve for frontier inference gets more attention. If two models produce equivalent output quality, the one that’s 13x smaller wins on unit economics every time.
Fine-Tuning Accessibility
You can fine-tune a 31B model with LoRA on a single consumer GPU (24GB VRAM is sufficient for Q4 + LoRA). Fine-tuning a 397B MoE requires a multi-GPU cluster with hundreds of gigabytes of aggregate VRAM. This isn’t just a cost difference; it’s an accessibility difference. Researchers at universities, startups without GPU clusters, and individual developers can now fine-tune a frontier-quality model on their workstation. That wasn’t possible with the larger models.
Edge Deployment
No MoE model with 397B total parameters is running on a phone. Period. Even with only 17B active parameters, the 397B model needs all its expert weights loaded (or frequently swapped) in memory. Gemma 4’s E2B runs on a phone, a Raspberry Pi, a Jetson Nano, and in the browser. The E4B runs on any mid-range laptop. The 26B MoE runs on a gaming GPU. The 31B dense runs on a workstation.
This size diversity means a single model family can serve every deployment tier, from cloud to edge, with consistent behavior and fine-tuning compatibility.
The Speed Counterargument
Gemma 4 is not without weaknesses. Community benchmarks reveal significant performance gaps against Qwen 3.5:
- Token generation speed: One user measured 11 tok/s for Gemma 4 26B-A4B vs 60+ tok/s for Qwen 3.5 35B-A3B on the same RTX 5060 Ti 16GB. The dense 31B achieves 18-25 tok/s on dual NVIDIA GPUs. This is reasonable but not fast.
- VRAM for context: One test showed Gemma 4 27B Q4 fitting only 20K context on a 5090, while Qwen 3.5 27B Q4 fit 190K on the same card. The KV cache is hungrier.
- Overall compute efficiency: Qwen 3.5-397B activates only 17B parameters per token (45% of Gemma 4’s 31B). At equivalent quality, Qwen 3.5 does less computation per token.
These are real issues for production deployments where latency and throughput matter. The Gemma 4 MoE model (26B-A4B) theoretically addresses the compute concern by activating only 3.8B parameters, but the current inference speed doesn’t reflect that advantage in practice.
Possible explanations: immature quantization support (QAT models haven’t been released yet), inference engine optimizations haven’t caught up to the new architecture, or the shared KV cache and PLE add overhead that offsets the parameter reduction. Google’s recommended sampling parameters (temperature 1.0, top_p 0.95, top_k 64) may also impact throughput.
The Trajectory Argument
Speed and VRAM issues tend to improve with time. Quantization-aware training models (QAT) arrived weeks after Gemma 3 and dramatically improved quantized inference quality. The inference engine ecosystem (llama.cpp, vLLM, SGLang) constantly optimizes for popular models. Architecture-specific optimizations for the shared KV cache and PLE are presumably in progress.
What’s harder to fix is raw intelligence per parameter. If Google’s training recipe can extract 1452-ELO performance from 31B parameters today, the question becomes: what happens when they apply the same recipe to 100B parameters? Or 200B? The linear extrapolation is uncomfortable for every other lab in the open model race.
The Community Verdict After 24 Hours
The Hacker News thread hit 1,678 points and 445 comments in a day. Here are the standout takes.
The Enthusiasts
Daniel Hanchen (Unsloth founder), who works closely with every major model lab, called Gemma 4 “sooooo good!!!” and said Google is “definitely hands down” the most collaborative lab to work with, followed by Qwen, Meta, and Mistral. When asked which open source model is best, he answered: “Tbh Gemma-4 haha.”
A user running OCR, full text search, embedding, and summarization of 1800s land records locally described the impact: “People are so excited that they can now search the records in multiple languages that a 1 minute wait to process the document seems nothing.” This was using previous-generation Qwen models; they planned to upgrade to Gemma 4 immediately.
Another user testing Nix code generation found Gemma 4 26B-A4B “significantly better than qwen3.5-35b-a3b” and shared their llama-cli configuration for a MacBook Air M4 32GB.
The Skeptics
A Hugging Face commenter called it a “somewhat disappointing release” and wished “larger tech companies would take a page out of OpenAI’s book and release actually competitive OSS.” They added: “But the models are stable. Definitely more consistency and token efficiency than Qwen.”
Multiple users noted the inference speed gap and questioned whether the Arena AI scores tell the full story. One dev.to analysis concluded: “For English-only, benchmark-optimized, speed-critical deployments, Qwen 3.5 is still the better choice.”
The Missing Models
Several commenters expressed disappointment about what wasn’t released:
- No 9-12B dense model. Gemma 3’s 12B was popular, and there’s no direct upgrade path. The gap between E4B (4.5B effective) and the 26B MoE is too wide for many use cases.
- No 100B+ model. There were rumors of a 120B that didn’t materialize. Several users noted this would have been transformative given the 31B’s performance.
- No QAT models at launch. Gemma 3’s QAT variants significantly improved quantized quality. The community expects these for Gemma 4 but they’re not available yet.
Real-World Use Cases Emerging
Within 24 hours, people were already deploying Gemma 4 for:
- Historical document OCR and translation (land records from the 1800s, multiple languages)
- Local PDF analysis with n8n workflow automation and Ollama
- Code generation as a local coding assistant (Nix, Python, general purpose)
- Kid education platforms (one user said “this exactly fits our kid education domain”)
- Privacy-sensitive document processing where cloud APIs require redaction
- Financial analysis (one user reported the E2B gives “significantly better answers than Qwen 3.5 4B” for finance)
What Comes Next
The community is waiting for several things that will determine whether Gemma 4 consolidates its position or gets overtaken:
QAT models (quantization-aware training): These arrived weeks after Gemma 3 and dramatically improved quantized inference. The E2B and E4B will benefit most, since edge devices rely entirely on quantized inference. Expect these within days to weeks.
Inference engine optimizations: The current speed gap against Qwen 3.5 may close as llama.cpp, vLLM, and other engines add architecture-specific optimizations for the shared KV cache, PLE, and MoE routing.
A 9-12B dense model: This would fill the biggest gap in the lineup and give Gemma 3 12B users a clean upgrade path.
Abliterated variants: Now fully legal under Apache 2.0. At least one exists already. More will follow as safety researchers and uncensored-model communities work through the architecture.
The bigger question: if Google can match 400B MoE models with a 31B dense model, what does the next generation look like? A 100B Gemma 5 using this training recipe could potentially match proprietary frontier models like Gemini 3 Pro (Arena 1492) or Claude Opus 4.6 (Arena 1490). That would be the first time an open model genuinely competed with the best proprietary offerings.
The Bottom Line
Gemma 4 is the most intelligence per parameter ever shipped in an open model. The 31B dense model ties the Arena AI leaderboard with models 10-30x its size. The E2B puts multimodal AI on an $80 Raspberry Pi. The entire family is Apache 2.0, which means no licensing landmines for commercial use.
It’s not perfect. Inference speed lags behind Qwen 3.5. VRAM consumption for context is worse. Fine-tuning tooling needed patches at launch. There’s no 9-12B model in the lineup.
But the trajectory is clear. The open model race just shifted from “who can train the most parameters” to “who can extract the most intelligence per parameter.” Google, drawing from the same Gemini 3 research that powers their proprietary models, appears to be winning that race convincingly.
For developers, the practical upshot: you can now run a frontier-quality multimodal AI model on your laptop, your phone, or a $80 single-board computer. It understands images, audio, video, and 140 languages. It can call functions, reason through multi-step problems, and generate production-quality code. It’s free, it’s open, and it’s Apache 2.0.
That combination didn’t exist 48 hours ago.
