Google Gemma 4 Review: Benchmarks, Features & How to Run It Locally
Google Gemma 4 is here — Apache 2.0 licensed, #3 globally on Arena AI, and running locally in minutes. This review covers every variant, real benchmark numbers, and step-by-step local setup.
Google Gemma 4 is the most capable open-model family Google has released to date — and for the first time, it ships under the Apache 2.0 license. Released on April 2, 2026, Gemma 4 covers the full deployment spectrum from mobile-edge inference to workstation-class reasoning, with the 31B dense model ranking #3 globally among open models on the Arena AI text leaderboard. This Google Gemma 4 review covers every variant, the real benchmark numbers, system requirements for local deployment, and how to get it running in under five minutes with Ollama.
What Is Google Gemma 4?
Gemma 4 is Google's fourth-generation open-weight language model, built from the same research foundation as Gemini 3. It launched on April 2, 2026 with four distinct size configurations covering edge devices through workstation-class hardware. All four models are natively multimodal — they understand images, text, and (on the two smaller variants) audio, with support for over 140 languages.
Architecturally, Gemma 4 uses alternating local sliding-window and global full-context attention. The workstation models add Per-Layer Embeddings (PLE), a parallel lower-dimensional conditioning pathway that lets each decoder layer modulate hidden states without a full residual stream — the mechanism largely responsible for Gemma 4's strong intelligence-per-parameter ratio.
Why Apache 2.0 Is the Real Headline
Every prior Gemma release shipped with a custom use policy that limited commercial scale. Gemma 4 breaks from that pattern: it is the first Gemma model under the Apache 2.0 license, matching the approach taken by Qwen and Mistral. In practice this means no monthly active user caps, no acceptable-use policy enforcement, and no legal friction for sovereign or enterprise AI deployments. For teams that need a commercially safe open model on-premise, the licensing change alone makes Gemma 4 worth evaluating.
Gemma 4 Model Variants: What E2B, E4B, 26B A4B, and 31B Actually Mean
Google released Gemma 4 in four configurations. The naming is not intuitive, so here is what each identifier actually refers to:
Edge Models: E2B and E4B
The "E" prefix stands for Effective — the number after it is the effective parameter count the model activates during inference. E2B has ~2.3B active parameters; E4B has ~4B. Both are designed for on-device and mobile deployment. They are the only two Gemma 4 variants with native audio input (a USM-style conformer encoder), making them suitable for real-time speech recognition and offline transcription. Context window is 128K tokens.
If you have run Gemma 3n locally before, the E2B and E4B are the conceptual successors — same Effective parameter framing, significantly more capable across the board.
Workstation Models: 26B A4B (MoE) and 31B Dense
The 26B A4B is a Mixture-of-Experts (MoE) architecture. Despite 25.2B total parameters, only 3.8B activate per token during inference, which in theory delivers near-31B quality at a fraction of the compute cost. In practice, early community benchmarks noted real-world inference throughput issues at launch; this is a known MoE trade-off (routing overhead, memory bandwidth). The 26B A4B is efficient for batch inference, but for interactive, latency-sensitive applications the 31B dense model often performs better.
The 31B is a conventional dense transformer. It requires more VRAM but provides the highest raw quality in the lineup and is the best candidate for fine-tuning. Context window on both workstation models is 256K tokens. Neither processes audio — image and video only.
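The "3.8B active out of 25.2B total" arithmetic comes from sparse routing: a small router scores every expert and only the top-k run for each token. A minimal sketch of top-k routing — expert count, k, and dimensions here are illustrative, not Gemma 4's real configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, d = 8, 2, 16  # illustrative sizes only
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router_w = rng.standard_normal((d, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-k experts, weighted by softmax scores."""
    logits = x @ router_w
    top = np.argsort(logits)[-k:]           # indices of the k best-scoring experts
    w = np.exp(logits[top]); w /= w.sum()   # softmax over the selected k only
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.standard_normal(d))
# Only 2 of 8 expert matrices were touched for this token: 2/8 of the
# expert parameters "activate", which is the source of the A4B-style naming.
```

The launch-day latency complaints follow directly from this structure: the FLOPs drop, but all expert weights must still be resident and the routing step adds memory-bandwidth pressure per token.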
Gemma 4 Benchmark Results
Arena AI Rankings and ELO Scores
Note: Arena AI ELO (community preference ranking) and BenchLM.ai aggregate scores (task benchmark average) are two different evaluation systems. The table below includes both — read the column header carefully.
- Gemma 4 31B: Arena AI ELO ~1452 (global #3 among open models), BenchLM score 73/100
- Gemma 4 26B A4B: Arena AI ELO ~1441 (global #6 among open models)
- GLM-5 (Reasoning): BenchLM score 82/100 (#1)
- Qwen3.5 397B: BenchLM score 77/100 (#2)
- DeepSeek V3.2: BenchLM score 66/100
- Llama 4 Maverick: BenchLM score 43/100
AIME, GPQA, and LiveCodeBench Results (31B)
- AIME 2026: 89.2% — competitive math olympiad problems
- GPQA Diamond: 84.3% — graduate-level science Q&A
- LiveCodeBench v6: 80.0% — real-world competitive coding
- BigBench Extra Hard: 74.4% — extreme multi-step reasoning
Where Gemma 4 Falls Short vs Llama 4, DeepSeek, and Qwen
Gemma 4 does not compete with the largest Chinese open models on complex reasoning. GLM-5 (Reasoning) and Qwen3.5 397B sit above it, and DeepSeek V3.2-Speciale took gold at IMO, IOI, and ICPC 2026 — a level of multi-step mathematical reasoning Gemma 4 at 31B cannot match.
For context on where the open-source LLM landscape stood before this release, the Gemma 3 vs Qwen 3 comparison traces how these competitive dynamics evolved.
If you need the strongest open-source reasoning model regardless of hardware cost: Qwen3.5 or DeepSeek win. If you need the best model that fits on a single GPU under Apache 2.0: Gemma 4 31B wins.
Key Features and Capabilities
- Native multimodal input: All four models accept images at variable aspect ratio and resolution. The 26B and 31B additionally process video up to 60 seconds at 1fps — practical for analyzing screen recordings, CI dashboard screenshots, or document scans. Example: pass a screenshot of a failing build log to Gemma 4 31B and ask it to diagnose the root cause.
- Native audio (E2B and E4B only): Real-time speech recognition and translation without a separate ASR pipeline. Useful for offline meeting transcription or voice-controlled agent interfaces on-device.
- Function calling as a first-class capability: Unlike models that treat tool-use as a prompt-engineering workaround, Gemma 4's function calling is a first-class training objective — resulting in fewer malformed JSON responses and more reliable tool invocations in agentic loops.
- Long context — 128K / 256K: Edge models handle 128K tokens; workstation models handle 256K. The 31B can ingest an entire mid-size codebase in a single context window. For extreme-context requirements at mono-repo scale, Llama 4 Scout's 10M token window remains the outlier.
- 140+ language support: Natively trained across 140 languages. For multilingual pipelines — contract analysis, customer support, localization — this removes the need for a separate translation step.
- Android Studio integration: Gemma 4 is the recommended local model for Android Studio's Agent Mode, supporting refactoring, new feature generation, and iterative bug fixing directly in the IDE.
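The function-calling round trip mentioned above works, in outline, like this: the client advertises a JSON schema for each tool, the model replies with a structured call, and the client dispatches it. The schema shape below follows the common OpenAI-style "tools" convention; the weather tool and the sample model reply are hypothetical illustrations, not part of Gemma 4's API.

```python
import json

# Tool schema advertised to the model (OpenAI-style "tools" format).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# A well-formed tool call, as a function-calling-trained model would return it.
model_reply = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'

def dispatch(reply: str) -> str:
    """Parse the model's tool call and route it to the matching local function."""
    call = json.loads(reply)
    if call["name"] == "get_weather":
        return f"weather({call['arguments']['city']})"
    raise ValueError(f"unknown tool {call['name']}")

print(dispatch(model_reply))  # → weather(Zurich)
```

The practical payoff of training function calling as a first-class objective is that `json.loads` in the dispatcher rarely throws — malformed or truncated JSON is exactly the failure mode that breaks agentic loops.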
Hardware Requirements for Running Gemma 4 Locally
- E2B: 4–6 GB VRAM recommended (Q4); CPU-only possible; any M-series Apple Silicon
- E4B: 6–8 GB VRAM recommended (Q4); CPU-only possible (slower); M2/M3/M4 base or better
- 26B A4B (Q4): 8–12 GB VRAM recommended; CPU possible; M3 Pro / M4 Pro (18 GB+ unified memory)
- 31B Dense (Q4): 20–24 GB VRAM recommended; CPU not recommended for interactive use; M3 Max / M4 Max (36 GB+)
⚠ Unverified: VRAM figures above are community estimates at Q4 quantization; verify against the Ollama model card for your target quantization level before provisioning hardware.
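The 31B figure tracks a simple back-of-envelope rule: 4-bit quantization costs roughly 0.5 bytes per parameter for the weights alone, before KV cache and activations. A rough estimator (the 0.5 bytes/param factor ignores quantization block scales, which add a few percent):

```python
def q4_weight_gb(params_billion: float) -> float:
    """Approximate weight storage at 4-bit quantization: ~0.5 bytes per parameter."""
    return params_billion * 0.5

print(q4_weight_gb(31))  # → 15.5
```

Weights alone for the 31B come to ~15.5 GB; KV cache and activations at long context push the practical recommendation toward the 20–24 GB listed above.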
For a platform-by-platform breakdown of running each Gemma 4 variant locally — including Windows, Linux, and macOS specifics — see the full Gemma 4 local deployment guide.
How to Run Gemma 4 Locally
Running Gemma 4 with Ollama (Fastest Path)
Gemma 4 has day-one Ollama support. Install Ollama from ollama.com, then pull and run:
# E4B — good default for most laptops
ollama pull gemma4:e4b
ollama run gemma4:e4b
# 26B A4B MoE (requires ~12 GB VRAM)
ollama pull gemma4:26b
# 31B Dense (requires ~24 GB VRAM)
ollama pull gemma4:31b
ollama run gemma4:31b

To expose Gemma 4 as an OpenAI-compatible API endpoint (for LangChain, agent frameworks, or any OpenAI SDK-compatible client):
ollama serve
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:e4b",
"messages": [{"role": "user", "content": "Explain sliding-window attention in 3 sentences."}]
}'

Running Gemma 4 with Hugging Face Transformers
For custom inference logic, fine-tuning pipelines, or integration into existing Python tooling:
pip install transformers accelerate torch

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Verify the exact model ID on huggingface.co/google before use
model_id = "google/gemma-4-e4b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{"role": "user", "content": "Write a Python function to parse a JSON log file and extract ERROR lines."}
]
inputs = tokenizer.apply_chat_template(
messages,
tokenize=True,
return_tensors="pt",
add_generation_prompt=True
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

For the 26B or 31B models under memory constraints, pass a BitsAndBytesConfig with load_in_4bit=True via the quantization_config argument (requires the bitsandbytes package). If you have used Gemma 3 or Gemma 3n locally before, the model IDs and quantization trade-offs map directly across generations.
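The Ollama endpoint shown earlier with curl can also be driven from Python using nothing but the standard library. A minimal sketch, assuming `ollama serve` is running on the default port with `gemma4:e4b` pulled; `chat` and `build_payload` are illustrative helper names, not part of any SDK:

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "gemma4:e4b") -> dict:
    """Assemble one OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str,
         url: str = "http://localhost:11434/v1/chat/completions") -> dict:
    """POST one chat turn to the local Ollama server and return the parsed JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With `ollama serve` running:
# reply = chat("Explain sliding-window attention in 3 sentences.")
# print(reply["choices"][0]["message"]["content"])
```

Because the endpoint speaks the OpenAI wire format, the same payload shape works unchanged if you later swap localhost for a hosted provider.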
Gemma 4 Pros and Cons
- Pro: Apache 2.0 licensing — removes commercial and sovereign deployment friction
- Pro: 31B ranks #3 globally on Arena AI — strongest single-GPU open model under this license
- Pro: Day-one support across Ollama, llama.cpp, LM Studio, Hugging Face, NVIDIA, AMD
- Pro: Native multimodal and function calling trained as first-class capabilities
- Pro: 256K context on workstation models — handles large codebases in one shot
- Con: Audio input limited to E2B and E4B — the two most capable models cannot process audio
- Con: 26B A4B MoE has community-reported inference latency issues at launch; 31B dense may feel faster for interactive use
- Con: Not competitive with Qwen3.5 397B or DeepSeek V3.2 on frontier multi-step reasoning
- Con: 256K context window, while large, is far below Llama 4 Scout's 10M for extreme-context workloads
Final Verdict: Which Gemma 4 Model Should You Run?
Gemma 4 is the most commercially deployable open model Google has shipped. The Apache 2.0 license, combined with the 31B's #3 global Arena AI ranking and day-one support across every major local inference stack, makes it a strong default for teams running open models in production or locally.
- Laptop or edge device: Run E4B — balances capability, portability, includes audio, and fits on most modern hardware without configuration overhead.
- Best single-GPU performance: Run 31B dense — simpler to operate than the MoE variant and competitive on latency for interactive workloads.
- Android app with local AI: Run E2B or E4B — both are first-party supported in Android Studio Agent Mode.
- Maximum reasoning for research or complex agents: Gemma 4 is not the answer — use Qwen3.5 or DeepSeek families instead.
For a direct comparison of Gemma 4 against its predecessor and the adjacent Gemma 3n architecture, the Gemma 4 vs Gemma 3 vs Gemma 3n breakdown covers every variant with switching guidance.