Run Gemma 4 on Your PC and Devices Locally
Learn how to install, run, and benchmark Gemma 4 locally on PC, Mac, and edge devices with clear steps and real data.
Gemma 4 is Google’s newest open AI model and the successor to Gemma 3 and Gemma 3n. Like the rest of Google's open Gemma family, it works well on local hardware, from phones to PCs.
You can run it on your own machine, keep data on device, and avoid cloud latency.
This guide explains what Gemma 4 is, how to install it, and how to run it in practice.
What Is Gemma 4?
Gemma 4 is an open family of language models from Google DeepMind, released under the Apache 2.0 license.
“Open weights” means you can download the model files, run them on your own hardware, and fine‑tune them for your use cases within the license terms.
The family has four main sizes: Gemma 4 E2B, E4B, 26B A4B (Mixture‑of‑Experts), and 31B (dense).
E2B and E4B target phones, Raspberry Pi, and small PCs, while 26B and 31B target laptops with GPUs, workstations, and servers. Gemma 4 supports text and images for input, and it outputs text.
The smaller models support context windows up to 128K tokens, and the larger ones reach up to 256K tokens, which helps with long documents and codebases.
The models handle more than 140 languages and focus on reasoning, coding, and general assistant tasks.
Gemma 4 uses both dense and Mixture‑of‑Experts architectures, which trade off quality and speed in different ways. Google designed Gemma 4 to work as a local “agent” stack, not only as a plain chat model.
Key Features
- Open Apache 2.0 license: You can use Gemma 4 commercially, redistribute it, and modify it, under a standard Apache 2.0 open‑source license.
- Four model sizes for many devices: E2B and E4B focus on phones, Raspberry Pi, and small PCs, while 26B A4B and 31B target GPUs and servers.
- Long context windows: E2B and E4B support up to 128K tokens; 26B A4B and 31B support up to 256K tokens for long documents and code.
- Multimodal input: Gemma 4 accepts text and images and supports tasks like document parsing, UI screenshots, charts, OCR, and handwriting recognition.
- Strong reasoning and coding performance: The 31B model scores around 85.2 percent on MMLU Pro and 80.0 percent on LiveCodeBench v6, which are advanced reasoning and coding benchmarks.
- Edge‑ready runtime stack: LiteRT‑LM, Android AICore, and AI Edge Gallery provide ready runtimes and examples for phones, Raspberry Pi, Jetson, and PCs.
- Local agent support: Built‑in function calling and tool use let Gemma 4 act as the core of local agents, which can call APIs or local tools and then answer.
- Broad ecosystem support: Gemma 4 ships with support in Hugging Face Transformers, llama.cpp‑style runtimes, WebGPU demos, and NVIDIA’s RTX and Jetson stacks.
How to Install or Set Up
Choose a local runtime
You have three main paths for local use:
- Ollama desktop runtime (Windows, macOS, Linux) for chat and simple APIs.
- Python with Hugging Face Transformers for script and app integration.
- LiteRT‑LM CLI for edge and agent workflows on Linux, macOS, Windows via WSL, and Raspberry Pi.
Pick one path based on your skills and target device.
Path 1: Install Gemma 4 with Ollama
Ollama is a desktop app and CLI that downloads and runs models for you.
- Go to the Ollama website and download the installer for Windows, macOS, or Linux.
- Install Ollama and restart your system if prompted, so the service can start.
- Open a terminal and run `ollama --version` to confirm that Ollama works.
- In the same terminal, pull the default Gemma 4 model with `ollama pull gemma4`.
- After the download finishes, run `ollama list` to see available Gemma 4 variants and tags.
Ollama exposes different Gemma 4 sizes as tags:
- `gemma4:e2b` for the small edge model.
- `gemma4:e4b` for the edge model with more capacity.
- `gemma4:26b` for the 26B Mixture‑of‑Experts model.
- `gemma4:31b` for the 31B dense model.
Choose E2B or E4B if you have a laptop with shared memory or a low‑end GPU.
Use 26B or 31B only if you have at least 24 GB of GPU memory or a strong workstation.
Path 2: Install Gemma 4 with Hugging Face Transformers
Use this path if you want Python control, custom prompts, or integration into your own app.
- Install a recent Python 3.10+ environment.
- Install PyTorch with GPU support if you have a compatible NVIDIA or Apple GPU.
- Install Transformers and related tools: `pip install -U transformers torch accelerate`.
- Log in to Hugging Face if the repo requires an access token.
- In your script, load a Gemma 4 repo, such as `google/gemma-4-E2B` or `google/gemma-4-31B`.
The Gemma 4 E2B Hugging Face page includes ready example code for chat prompts and image inputs.
You pass a message list into a processor, create tensors, call model.generate, and then parse the response with the same processor.
Path 3: Install Gemma 4 with LiteRT‑LM CLI
LiteRT‑LM is Google’s open‑source inference framework for edge LLMs.
Its CLI makes it easy to run models from a terminal, with no extra code.
- Install LiteRT‑LM CLI for your platform using the instructions in the official docs.
- Make sure Python and required system libraries are present, as the CLI may rely on them.
- Download a pre‑converted Gemma 4 LiteRT‑LM model, such as `litert-community/gemma-4-E2B-it-litert-lm` from Hugging Face.
- Run a test prompt from your terminal:

```bash
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  --prompt="What is the capital of France?"
```

- Confirm that the model answers and check memory usage and latency on your device.
LiteRT‑LM supports function calling and local tools through a preset file, which you can pass with `--preset=preset.py`.
That file defines tools and routing logic, which turns Gemma 4 into a local agent for tasks such as file search or web lookups.
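As a sketch of what such a preset file might contain: the plain Python tool functions below are concrete, but the `TOOLS` registry shape is an assumption for illustration, since the exact preset schema is defined by LiteRT‑LM's own docs.

```python
# preset.py — hypothetical sketch of a LiteRT-LM tool preset.
# The TOOLS registry shape is an illustrative assumption; check the
# LiteRT-LM documentation for the real preset schema.
from datetime import datetime, timezone
from pathlib import Path


def get_current_time() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def search_files(pattern: str, root: str = ".") -> list[str]:
    """Return paths under `root` whose names match a glob pattern."""
    return sorted(str(p) for p in Path(root).rglob(pattern))


# Map tool names, as the model would call them, to functions.
TOOLS = {
    "get_current_time": get_current_time,
    "search_files": search_files,
}
```

The key idea is that each tool is an ordinary Python function, so you can unit‑test your tools before wiring them into the agent.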
How to Run or Use It
Using Gemma 4 with Ollama
After installation, you can start chatting with Gemma 4 using a single command.
- Open a terminal.
- Start a chat with `ollama run gemma4:e2b` (or another size tag).
- Type a prompt such as “Summarize this news article” and paste text.
- Press Enter and wait for the response.
Ollama maintains a conversation context so you can send follow‑up questions.
You can stop generation with Ctrl+C if needed.
To use Gemma 4 as a local API, run:
```bash
ollama serve
```

Then call the `http://localhost:11434/api/chat` endpoint from your app, with `model: "gemma4:e2b"` in the JSON body. This lets you connect desktop apps, scripts, or browser extensions to Gemma 4 without cloud calls.
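A minimal Python sketch of such a call, using only the standard library, might look like this. It assumes `ollama serve` is running on the default port 11434; the request and response shapes follow Ollama's chat API.

```python
# Minimal sketch of calling Ollama's local chat endpoint.
# Assumes `ollama serve` is running on the default port 11434.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_payload(model: str, user_text: str) -> dict:
    """Build a non-streaming /api/chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }


def chat(model: str, user_text: str) -> str:
    """Send one chat turn and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(model, user_text)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    # Non-streaming responses carry the full reply in one message.
    return data["message"]["content"]
```

Calling `chat("gemma4:e2b", "Summarize this article: ...")` would then return the model's answer as plain text.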
Using Gemma 4 with Python (Transformers)
A simple text‑only example with Gemma 4 E2B looks like this:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-4-E2B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Explain what a context window is in plain language."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For multimodal input, you use the Gemma 4 processor object, pass both image and text, and then decode the structured answer.
The Hugging Face page for Gemma 4 E2B shows complete examples for image questions and step‑by‑step response parsing.
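The message list for a multimodal prompt roughly follows the common Transformers chat-template shape: each message has a role and a list of typed content parts. The field names below follow Transformers conventions and are an assumption here; the model's Hugging Face page has the authoritative layout.

```python
# Sketch of a multimodal chat message list in the common
# Transformers chat-template shape. Field names are assumptions;
# confirm against the Gemma 4 model card on Hugging Face.
def build_image_question(image_path: str, question: str) -> list[dict]:
    """Build a single-turn message list pairing an image and a question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


# This list would then go to the processor, e.g. (not run here):
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt")
```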
Using Gemma 4 with LiteRT‑LM
LiteRT‑LM is useful when you need function calling, multiple tools, or low‑resource devices.
- Run a basic prompt as shown earlier and measure latency.
- Create a `preset.py` file that defines Python functions as tools, such as `get_current_time` or `search_files`.
- Run the CLI with `--preset=preset.py`; Gemma 4 will return `tool_call` events as JSON, the CLI will run your function, and then the model will finish the answer.
- Use LiteRT‑LM’s integrated web server option (if enabled in your version) to expose Gemma 4 as a local HTTP API for other apps.
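The round trip the CLI performs can be sketched in plain Python. The event shape below, a tool name plus an arguments dict, is an illustrative assumption, not the exact LiteRT‑LM wire format:

```python
# Hypothetical sketch of dispatching a model tool_call event to a
# registered Python function, as a tool-using runtime might do.
import json


def dispatch_tool_call(event_json: str, tools: dict) -> str:
    """Parse a tool_call event, run the matching tool, return JSON."""
    event = json.loads(event_json)
    fn = tools[event["name"]]          # KeyError -> unknown tool
    result = fn(**event.get("arguments", {}))
    return json.dumps({"name": event["name"], "result": result})


# Example with a trivial tool:
tools = {"add": lambda a, b: a + b}
event = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
print(dispatch_tool_call(event, tools))  # {"name": "add", "result": 5}
```

In a real agent loop, the JSON result would be fed back to the model so it can finish its answer.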
Google reports that LiteRT‑LM can process 4,000 input tokens across two skills in under three seconds on supported edge hardware, which gives a sense of real‑world response times.
Benchmark Results
The table below summarizes key quality benchmarks from the Gemma 4 model card and related sources.
Quality benchmarks for Gemma 4
These numbers show that even the small E2B and E4B models reach useful accuracy for many tasks, while 26B and 31B reach near‑frontier scores in reasoning and coding.
Performance on edge hardware
For local hardware, one published data point is Gemma 4 E2B on a Raspberry Pi 5 using LiteRT‑LM.
- Around 133 tokens per second in the prefill phase.
- Around 7.6 tokens per second during token‑by‑token decoding.
- 4,000 tokens across two skills processed in under three seconds in a local agent workflow.
These numbers show that useful local agents are possible even on small, low‑power boards when you choose the right model and runtime.
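Those two rates let you estimate end‑to‑end latency for a given workload. A back‑of‑the‑envelope helper, where the 133 and 7.6 tokens‑per‑second defaults come from the Raspberry Pi 5 data point above:

```python
def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tps: float = 133.0,
                       decode_tps: float = 7.6) -> float:
    """Rough latency estimate: prefill time plus decode time.

    Defaults use the published Gemma 4 E2B numbers on a Raspberry
    Pi 5 with LiteRT-LM; ignores runtime startup and I/O overhead.
    """
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# A 500-token prompt with a 50-token answer on the Pi 5 numbers:
print(round(estimate_latency_s(500, 50), 1))  # 10.3 (seconds)
```

As the example shows, decoding dominates on edge hardware, so shorter answers help far more than shorter prompts.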
Testing Details
The quality benchmarks above come from Google’s official Gemma 4 model cards, NVIDIA’s partner cards, and community summaries. They measure knowledge and reasoning (MMLU Pro, Tau2), math (AIME 2026), coding (LiveCodeBench v6), and science QA (GPQA Diamond).
Each score represents accuracy on a held‑out test set under standard evaluation setups, such as few‑shot prompts or chain‑of‑thought where defined.
The Raspberry Pi 5 performance numbers come from an analysis of Google’s local agent stack with Gemma 4 and LiteRT‑LM. The test used Gemma 4 E2B, quantized for edge use, and measured both prefill and decode rates under a local agent workload with two tools.
NVIDIA’s published model card for Gemma 4 31B IT with NVFP4 quantization shows that accuracy remains close to the full‑precision baseline, which supports practical GPU deployment at lower cost.
Comparison Table
This table compares Gemma 4 E4B with three popular open local models as of April 2026.
Gemma 4 vs other local models
*VRAM values are approximate community numbers and depend on quantization and runtime.
The table shows that Gemma 4 E4B offers long context and multimodal support at a memory cost close to other 7B‑class models, with a more permissive license than Llama 3.1.
Pricing Table
Gemma 4 itself has no license fees.
You pay for hardware, power, and any paid platform or cloud service you choose.
Public pricing today mostly covers older Gemma 3 models and competitor APIs, but it offers a useful baseline.
Example cost options around Gemma 4
For many users, the main “cost” of Gemma 4 is buying a GPU or using existing hardware. The open license makes it easier to keep per‑request cost low compared to cloud‑only models, especially at scale.
USP — What Makes Gemma 4 Different
Gemma 4 stands out because it ships not only as an open model family, but as a complete local agent stack across phones, PCs, and edge devices.
Google released open weights under Apache 2.0 along with Android AICore access, AI Edge Gallery “agent skills,” LiteRT‑LM runtimes, and day‑one support from NVIDIA and Hugging Face.
This tight integration means the gap between “announcement” and “working local setup” is small, compared with many earlier open models.
Pros and Cons
Pros
- Open Apache 2.0 license supports commercial local use and redistribution.
- Four sizes cover phones, Raspberry Pi, laptops, and workstations.
- Long context windows up to 256K tokens for large documents and codebases.
- Strong performance on reasoning, math, and coding benchmarks for its size classes.
- Multimodal input with image understanding, OCR, and UI or chart analysis.
- Good ecosystem support: Transformers, Ollama, LiteRT‑LM, NVIDIA stacks, and WebGPU demos.
Cons
- Larger models (26B, 31B) need strong GPUs or expensive cloud hardware for best speed.
- Small edge models trade some quality for speed and low memory use compared with large dense models.
- Tools and runtimes evolve fast, so commands, tags, or package names can change over time.
- Multimodal features require more complex code and higher memory use than text‑only runs.
Quick Comparison Chart
This quick chart focuses on the four Gemma 4 variants and how they fit local setups.
Gemma 4 variants at a glance
You can treat E2B as the entry point, E4B as the balanced local default, 26B A4B as the performance step‑up, and 31B as the quality‑first choice.
Demo or Real‑World Example
Example: Local coding assistant on a laptop with Ollama
This example builds a basic coding assistant using Gemma 4 E4B on a laptop with at least 16 GB RAM and a mid‑range GPU or fast CPU.
- Install Ollama
  Download and install Ollama for your operating system, then confirm it with `ollama --version`.
- Pull Gemma 4 E4B
  In a terminal, run `ollama pull gemma4:e4b` and wait for the download to finish.
- Start a dedicated model session
  Run `ollama run gemma4:e4b`. This opens an interactive prompt.
- Give it a system style
  Paste a short instruction like: “You are a careful coding assistant. Explain code changes in clear, short steps.” Then press Enter.
- Ask for code help
  Try prompts such as:
  - “Write a Python function that checks if a list has duplicates. Then explain the logic.”
  - “Refactor this function to improve readability,” then paste your code.
- Review and test outputs
  Copy the suggested code into your editor and run your test suite. If the output needs changes, ask follow‑up prompts, such as “Make this work with Python 3.8 only.”
- Connect to a local editor or IDE
  - Run `ollama serve` to expose a local HTTP API.
  - Use an editor extension or a small script to send file content and instructions to `http://localhost:11434/api/chat` with `model: "gemma4:e4b"`.
  - Display suggestions in a side panel or inline in your editor.
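For an editor script like the one described above, note that when streaming is enabled Ollama sends the reply as one JSON object per line, each carrying a chunk of the message. A small helper to reassemble the text, with illustrative sample lines:

```python
# Sketch: reassembling a streamed Ollama /api/chat reply.
import json


def join_stream(lines: list[str]) -> str:
    """Concatenate message chunks from streamed response lines."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Illustrative stream following Ollama's streaming format:
stream = [
    '{"message": {"role": "assistant", "content": "def "}, "done": false}',
    '{"message": {"role": "assistant", "content": "has_dupes(xs): ..."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(join_stream(stream))  # def has_dupes(xs): ...
```

Streaming like this lets the editor show partial suggestions as they arrive instead of waiting for the full answer.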
On a mid‑range GPU, Gemma 4 E4B should respond within a few seconds while keeping all code on your machine.
For slower laptops, you can switch to E2B by using the `gemma4:e2b` tag, which reduces memory use and speeds up responses.
Conclusion
Gemma 4 brings strong open models to devices that many people already own, from phones and Raspberry Pi boards to laptops and workstations.
Its Apache 2.0 license, long context windows, and strong reasoning and coding scores make it a solid base for local assistants and agents.
FAQ
1. Do I need a GPU to run Gemma 4?
No, you can run E2B or E4B on CPUs with runtimes like Ollama or LiteRT‑LM, although responses will be slower.
A GPU improves speed, especially for 26B and 31B.
2. Is Gemma 4 free for commercial use?
Yes, Gemma 4 uses the Apache 2.0 license, which allows commercial use, modification, and redistribution within its terms.
Cloud platforms that host Gemma models may still charge per token or per hour.
3. How much RAM or VRAM do I need?
E2B and E4B can run in around 3–4 GB of GPU memory or comparable system memory with quantization, while larger models need far more.
Always check your runtime’s docs for exact requirements for your quantization and batch size.
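As a rule of thumb, weight memory is roughly parameter count times bytes per weight. This is an approximation that ignores KV cache, activations, and runtime overhead, which add more on top, especially with long contexts:

```python
def approx_weight_memory_gb(params_billion: float,
                            bits_per_weight: int) -> float:
    """Approximate model weight memory in GB (1 GB = 1e9 bytes).

    Rule of thumb only: ignores KV cache, activations, and runtime
    overhead, which can add several GB with long contexts.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


# A 4B-parameter model quantized to 4 bits: about 2 GB of weights,
# consistent with the 3-4 GB figure above once overhead is added.
print(approx_weight_memory_gb(4, 4))  # 2.0
```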
4. Can Gemma 4 work offline?
Yes, once you download the weights and any needed runtime, Gemma 4 can run without an internet connection.
This is one of the main benefits of local deployment.
5. How does Gemma 4 compare to Llama 3.1 or Qwen2.5?
Gemma 4 offers competitive or better reasoning and coding scores at similar or smaller parameter counts, plus Apache 2.0 licensing and strong edge support.
Llama 3.1 and Qwen2.5 still offer strong alternatives, especially in existing ecosystems, but focus less on a unified local agent stack.