How to Run Gemma 4 with Ollama: Step-by-Step Setup Guide
Google's Gemma 4 is one of the best open-weight model families available in 2026 — and running Gemma 4 with Ollama is the fastest way to get it working locally. In this guide you'll complete the full Gemma 4 Ollama setup: install Ollama, pull the right model size for your hardware, configure context, and hit the local REST API. Everything works on Mac, Linux, and Windows.
If you want a full model overview before diving into setup, see our guide on running Gemma 4 on your PC and devices locally, which covers all available run methods including LM Studio and direct Python inference.
What Is Gemma 4?
Gemma 4 is Google's fourth generation of open-weight language models. The family spans four sizes — two compact edge models (E2B, E4B) and two larger variants (26B MoE, 31B Dense) — all supporting vision input, native function calling, and a massive context window (128K–256K tokens). Ollama wraps local inference into a single command-line tool and a local REST server, making it the lowest-friction path to run Gemma 4 locally.
Deciding which model generation to use? The Gemma 4 vs Gemma 3 vs Gemma 3n comparison breaks down what changed across generations and which variant to pick for your use case.
Hardware Requirements
Pick your variant based on available VRAM (GPU) or RAM (Apple Silicon / CPU-only):
- gemma4:e2b — 2.3B effective params, ~7.2 GB download, 4 GB VRAM min, 128K context. Best for low-end laptops and Raspberry Pi 5.
- gemma4:e4b (default) — 4.5B effective params, ~9.6 GB download, 6 GB VRAM min, 128K context. Recommended starting point for most developers.
- gemma4:26b — 3.8B active params (MoE), ~18 GB download, 8 GB VRAM min, 256K context. High quality on mid-range GPU (RTX 3080+).
- gemma4:31b — 31B dense, ~20 GB download, 20 GB VRAM min, 256K context. Best for RTX 4090 or M-series Mac with 32 GB+.
Apple Silicon: M1/M2/M3/M4 Macs use unified memory, so the VRAM figures map to your RAM. A 16 GB M2 Mac runs E4B comfortably; 32 GB handles the 26B MoE.
AMD GPUs: AMD announced day-0 support for all Gemma 4 variants — ROCm 6.x with an RX 7900 or better is the recommended configuration.
Step 1 — Install Ollama
macOS
# Download from ollama.com (recommended), or via Homebrew:
brew install ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the .exe installer from ollama.com and run it. The Ollama service starts automatically and listens on port 11434.
Confirm the install:
ollama --version
Step 2 — Pull a Gemma 4 Model
The default tag fetches E4B (the recommended starting point). Download sizes are large, so run this on a good connection:
# Default — downloads E4B (~9.6 GB)
ollama pull gemma4
# Specific variants:
ollama pull gemma4:e2b # ~7.2 GB — smallest, runs on almost anything
ollama pull gemma4:26b # ~18 GB — MoE, strong quality on mid-range GPU
ollama pull gemma4:31b # ~20 GB — dense, best quality, high hardware bar
Verify the download completed:
ollama list
The 26B uses a Mixture of Experts architecture — only 3.8B parameters activate per inference pass, making it faster than its 18 GB size suggests. It is the best choice if you have a 24 GB GPU and want maximum output quality.
Step 3 — Run Gemma 4 and Test It
Start an interactive session:
ollama run gemma4
At the prompt, test the model:
>>> Explain transformer attention in two sentences for a software engineer.
Exit the session:
>>> /bye
Run a single prompt non-interactively (useful for scripting):
ollama run gemma4 "Write a Python function that flattens a nested list."
Gemma 4 Ollama Setup: Configuration Options
The most important configuration step most guides skip: Ollama's default context window is 4K tokens, not 128K. This significantly limits Gemma 4's real capability. Always set num_ctx explicitly.
Temporary — set context in an interactive session
>>> /set parameter num_ctx 32768
Common values: 16384 (16K), 32768 (32K), 131072 (128K — full E4B capability). Larger context consumes more VRAM for the KV cache, so start at 32K and increase only if needed.
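To get a feel for why long contexts are expensive, a back-of-envelope KV-cache estimate helps. The default dimensions below are illustrative placeholders, not published Gemma 4 architecture numbers — substitute the real values if you know them:

```python
def kv_cache_gib(ctx_len, n_layers=34, n_kv_heads=8, head_dim=256, bytes_per_elem=2):
    """Rough KV-cache size in GiB. NOTE: the default dimensions are
    ASSUMED placeholders, not official Gemma 4 figures. Each token stores
    one key and one value vector (the factor of 2) per layer, per KV head,
    at bytes_per_elem precision (2 = fp16)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_len / 1024**3

print(kv_cache_gib(32768))   # 32K context
print(kv_cache_gib(131072))  # 128K context: four times the memory
```

The takeaway is the linear scaling: going from 32K to 128K quadruples the cache regardless of the exact model dimensions.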
Permanent — use a Modelfile
Create a file named Modelfile:
FROM gemma4
PARAMETER num_ctx 32768
PARAMETER num_gpu 99
num_gpu 99 tells Ollama to offload all model layers to the GPU — any value higher than the model's actual layer count simply means "offload everything." On CPU-only setups, omit this line.
Build and run your custom config:
ollama create gemma4-32k -f Modelfile
ollama run gemma4-32k
This persists across sessions — you will see gemma4-32k in ollama list alongside the base model.
Using the Ollama API with Gemma 4
Ollama exposes a REST API at http://localhost:11434. Use this to integrate Gemma 4 into scripts, tools, and applications.
Generate (non-streaming)
curl http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{
  "model": "gemma4",
  "prompt": "Summarize why local AI inference matters for developer privacy.",
  "stream": false
}'
Chat — OpenAI-compatible format
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gemma4",
  "messages": [
    {"role": "user", "content": "What are the top 3 use cases for a 128K context window?"}
  ]
}'
Python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4",
        "prompt": "List five practical uses for local LLM inference.",
        "stream": False,
    },
)
print(response.json()["response"])
The /v1/chat/completions endpoint is OpenAI-compatible, meaning existing tools and libraries that target the OpenAI SDK work against your local Ollama instance with only a base-URL change. For building full agent workflows on top of Ollama, see our OpenClaw + Ollama agent setup guide.
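When you leave "stream" unset (or set it to true), /api/generate returns one JSON object per line, each carrying a partial "response" string and a final line flagged "done": true. A small helper can reassemble the text — a sketch of the line format Ollama emits:

```python
import json

def join_stream(ndjson_lines):
    """Reassemble the full text from Ollama's streaming /api/generate
    output: one JSON object per line, partial "response" fragments,
    terminated by a line with "done": true."""
    out = []
    for line in ndjson_lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

# Example lines in the shape the API streams back:
chunks = [
    '{"model":"gemma4","response":"Local ","done":false}',
    '{"model":"gemma4","response":"inference.","done":false}',
    '{"model":"gemma4","response":"","done":true}',
]
print(join_stream(chunks))  # Local inference.
```

With requests you would pass stream=True to requests.post and feed response.iter_lines() into this helper instead of a hard-coded list.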
Troubleshooting Common Issues
GPU not detected — model falls back to CPU
A known issue with some Gemma 4 builds causes Flash Attention to misreport GPU usage. Check whether GPU is active:
ollama ps
If the GPU column is empty despite a compatible GPU being present, add PARAMETER num_gpu 99 to your Modelfile and recreate the model. On Linux, verify CUDA is visible with nvidia-smi.
Out of memory (OOM) error
You are targeting a model larger than your available VRAM. Switch to a smaller tag (gemma4:e4b instead of gemma4:26b), or reduce the context window: /set parameter num_ctx 8192.
Ollama service not responding
# Start manually (Mac/Linux):
ollama serve
# Confirm it is listening:
curl http://localhost:11434
Slow inference on Apple Silicon
Ensure you are using a recent Ollama release — newer builds ship faster Apple Silicon acceleration for M-series chips. Run ollama --version and update from ollama.com if you are on an older release.
Pull interrupted / model not found after pull
Re-run ollama pull gemma4 — Ollama automatically resumes interrupted downloads from where they stopped.
Performance Tips
- Default quantization is Q4_K_M: This is a good balance of speed and quality. Only switch to Q8 or bf16 if you have GPU headroom and need higher numerical precision.
- Keep the model loaded: Set the environment variable OLLAMA_KEEP_ALIVE=-1 to prevent Ollama from unloading the model after 5 minutes of inactivity. On Linux with systemd, add it via systemctl edit ollama as an environment override. On Mac and Windows, set it in your shell profile.
- Partial GPU offload: If VRAM is tight, use num_gpu 20 (or any value lower than the total layer count) to offload part of the model to the GPU and run the remaining layers on CPU.
- Right-size your context: A 128K context window is VRAM-hungry due to the KV cache. Start at 32K and only increase when your use case genuinely requires long documents or large codebases in context.
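Two of these tips can also be applied per request instead of globally: Ollama's /api/generate accepts a keep_alive field and an options object (including num_ctx) in the request body. A sketch of a body builder, assuming the default gemma4 tag:

```python
def generate_body(prompt, model="gemma4", keep_alive=-1, num_ctx=32768):
    """Build a /api/generate request body: keep_alive=-1 keeps the model
    resident in memory, and options.num_ctx overrides the context window
    for this request without needing a custom Modelfile."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
        "options": {"num_ctx": num_ctx},
    }

body = generate_body("Summarize this repo.")
print(body["options"])  # {'num_ctx': 32768}
```

Per-request options win over Modelfile parameters, which makes this handy for scripts that occasionally need a bigger window than your default.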
Next Steps: Agents and Integrations
Your Gemma 4 Ollama setup gives you a local inference endpoint ready to power tools, agents, and applications. Here is where to go next:
- Compare Gemma 4 against other open-source options: see our Gemma vs Qwen in-depth comparison to understand how the model families differ on benchmarks and practical tasks.
- Interested in Google's efficient model line? Read the guide on installing and running Gemma 3n locally to understand the N-series architecture and how it compares.
- Build agent workflows on top of your Ollama server: the OpenClaw + Ollama setup guide walks through a complete local AI agent stack.
The E4B model at 32K context covers the majority of developer use cases. Bump to the 26B MoE when you need higher quality and have the hardware for it. And remember: the single configuration step that unlocks Gemma 4's real capability is setting num_ctx — do not leave it at the 4K default.