Run Gemma 4 on Your PC and Devices Locally
Learn how to install, run, and benchmark Gemma 4 locally on PC, Mac, and edge devices with clear steps and real data.
Gemma 4 is Google’s newest open AI model and the successor to Gemma 3 and Gemma 3n. Like the rest of Google's open Gemma family, it works well on local hardware, from phones to PCs.
You can run it on your own machine, keep data on device, and avoid cloud latency.
This guide explains what Gemma 4 is, how to install it, and how to run it in practice.
What Is Gemma 4?
Gemma 4 is an open family of language models from Google DeepMind, released under the Apache 2.0 license.
“Open weights” means you can download the model files, run them on your own hardware, and fine‑tune them for your use cases within the license terms.
The family has four main sizes: Gemma 4 E2B, E4B, 26B A4B (Mixture‑of‑Experts), and 31B (dense).
E2B and E4B target phones, Raspberry Pi, and small PCs, while 26B and 31B target laptops with GPUs, workstations, and servers. Gemma 4 supports text and images for input, and it outputs text.
The smaller models support context windows up to 128K tokens, and the larger ones reach up to 256K tokens, which helps with long documents and codebases.
The models handle more than 140 languages and focus on reasoning, coding, and general assistant tasks.
Gemma 4 uses both dense and Mixture‑of‑Experts architectures, which trade off quality and speed in different ways. Google designed Gemma 4 to work as a local “agent” stack, not only as a plain chat model.
Key Features
- Open Apache 2.0 license: You can use Gemma 4 commercially, redistribute it, and modify it, under a standard Apache 2.0 open‑source license.
- Four model sizes for many devices: E2B and E4B focus on phones, Raspberry Pi, and small PCs, while 26B A4B and 31B target GPUs and servers.
- Long context windows: E2B and E4B support up to 128K tokens; 26B A4B and 31B support up to 256K tokens for long documents and code.
- Multimodal input: Gemma 4 accepts text and images and supports tasks like document parsing, UI screenshots, charts, OCR, and handwriting recognition.
- Strong reasoning and coding performance: The 31B model scores around 85.2 percent on MMLU Pro and 80.0 percent on LiveCodeBench v6, which are advanced reasoning and coding benchmarks.
- Edge‑ready runtime stack: LiteRT‑LM, Android AICore, and AI Edge Gallery provide ready runtimes and examples for phones, Raspberry Pi, Jetson, and PCs.
- Local agent support: Built‑in function calling and tool use let Gemma 4 act as the core of local agents, which can call APIs or local tools and then answer.
- Broad ecosystem support: Gemma 4 ships with support in Hugging Face Transformers, llama.cpp‑style runtimes, WebGPU demos, and NVIDIA’s RTX and Jetson stacks.
How to Install or Set Up
Choose a local runtime
You have three main paths for local use:
- Ollama desktop runtime (Windows, macOS, Linux) for chat and simple APIs.
- Python with Hugging Face Transformers for script and app integration.
- LiteRT‑LM CLI for edge and agent workflows on Linux, macOS, Windows via WSL, and Raspberry Pi.
Pick one path based on your skills and target device.
Path 1: Install Gemma 4 with Ollama
Ollama is a desktop app and CLI that downloads and runs models for you.
- Go to the Ollama website and download the installer for Windows, macOS, or Linux.
- Install Ollama and restart your system if prompted, so the service can start.
- Open a terminal and run `ollama --version` to confirm that Ollama works.
- In the same terminal, pull the default Gemma 4 model with `ollama pull gemma4`.
- After the download finishes, run `ollama list` to see available Gemma 4 variants and tags.
Ollama exposes different Gemma 4 sizes as tags:
- `gemma4:e2b` for the small edge model.
- `gemma4:e4b` for the edge model with more capacity.
- `gemma4:26b` for the 26B Mixture‑of‑Experts model.
- `gemma4:31b` for the 31B dense model.
Choose E2B or E4B if you have a laptop with shared memory or a low‑end GPU.
Use 26B or 31B only if you have at least 24 GB of GPU memory or a strong workstation.
Path 2: Install Gemma 4 with Hugging Face Transformers
Use this path if you want Python control, custom prompts, or integration into your own app.
- Install a recent Python 3.10+ environment.
- Install PyTorch with GPU support if you have a compatible NVIDIA or Apple GPU.
- Install Transformers and related tools: `pip install -U transformers torch accelerate`.
- Log in to Hugging Face if the repo requires an access token.
- In your script, load a Gemma 4 repo, such as `google/gemma-4-E2B` or `google/gemma-4-31B`.
The Gemma 4 E2B Hugging Face page includes ready example code for chat prompts and image inputs.
You pass a message list into a processor, create tensors, call model.generate, and then parse the response with the same processor.
Path 3: Install Gemma 4 with LiteRT‑LM CLI
LiteRT‑LM is Google’s open‑source inference framework for edge LLMs.
Its CLI makes it easy to run models from a terminal, with no extra code.
- Install LiteRT‑LM CLI for your platform using the instructions in the official docs.
- Make sure Python and required system libraries are present, as the CLI may rely on them.
- Download a pre‑converted Gemma 4 LiteRT‑LM model, such as `litert-community/gemma-4-E2B-it-litert-lm` from Hugging Face.
- Run a test prompt from your terminal:

```bash
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  --prompt="What is the capital of France?"
```

- Confirm that the model answers and check memory usage and latency on your device.
LiteRT‑LM supports function calling and local tools through a preset file, which you can pass with `--preset=preset.py`.
That file defines tools and routing logic, which turns Gemma 4 into a local agent for tasks such as file search or web lookups.
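As a sketch of what such a preset file might contain: the plain Python tool functions below are concrete, but the `TOOLS` registry shape is an assumption for illustration, since the exact preset schema is defined by LiteRT‑LM's own docs.

```python
# preset.py — hypothetical sketch of a LiteRT-LM tool preset.
# The TOOLS registry shape is an illustrative assumption; check the
# LiteRT-LM documentation for the real preset schema.
from datetime import datetime, timezone
from pathlib import Path


def get_current_time() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def search_files(pattern: str, root: str = ".") -> list[str]:
    """Return paths under `root` whose names match a glob pattern."""
    return sorted(str(p) for p in Path(root).rglob(pattern))


# Map tool names, as the model would call them, to functions.
TOOLS = {
    "get_current_time": get_current_time,
    "search_files": search_files,
}
```

The key idea is that each tool is an ordinary Python function, so you can unit‑test your tools before wiring them into the agent.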
How to Run or Use It
Using Gemma 4 with Ollama
After installation, you can start chatting with Gemma 4 using a single command.
- Open a terminal.
- Start a chat with `ollama run gemma4:e2b` (or another size tag).
- Type a prompt such as “Summarize this news article” and paste text.
- Press Enter and wait for the response.
Ollama maintains a conversation context so you can send follow‑up questions.
You can stop generation with Ctrl+C if needed.
To use Gemma 4 as a local API, run:
```bash
ollama serve
```

Then call the `http://localhost:11434/api/chat` endpoint from your app, with `model: "gemma4:e2b"` in the JSON body. This lets you connect desktop apps, scripts, or browser extensions to Gemma 4 without cloud calls.
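A minimal Python sketch of such a call, using only the standard library, might look like this. It assumes `ollama serve` is running on the default port 11434; the request and response shapes follow Ollama's chat API.

```python
# Minimal sketch of calling Ollama's local chat endpoint.
# Assumes `ollama serve` is running on the default port 11434.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_payload(model: str, user_text: str) -> dict:
    """Build a non-streaming /api/chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }


def chat(model: str, user_text: str) -> str:
    """Send one chat turn and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(model, user_text)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    # Non-streaming responses carry the full reply in one message.
    return data["message"]["content"]
```

Calling `chat("gemma4:e2b", "Summarize this article: ...")` would then return the model's answer as plain text.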
Using Gemma 4 with Python (Transformers)
A simple text‑only example with Gemma 4 E2B looks like this:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-4-E2B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Explain what a context window is in plain language."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For multimodal input, you use the Gemma 4 processor object, pass both image and text, and then decode the structured answer.
The Hugging Face page for Gemma 4 E2B shows complete examples for image questions and step‑by‑step response parsing.
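The message list for a multimodal prompt roughly follows the common Transformers chat-template shape: each message has a role and a list of typed content parts. The field names below follow Transformers conventions and are an assumption here; the model's Hugging Face page has the authoritative layout.

```python
# Sketch of a multimodal chat message list in the common
# Transformers chat-template shape. Field names are assumptions;
# confirm against the Gemma 4 model card on Hugging Face.
def build_image_question(image_path: str, question: str) -> list[dict]:
    """Build a single-turn message list pairing an image and a question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


# This list would then go to the processor, e.g. (not run here):
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt")
```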
Using Gemma 4 with LiteRT‑LM
LiteRT‑LM is useful when you need function calling, multiple tools, or low‑resource devices.
- Run a basic prompt as shown earlier and measure latency.
- Create a `preset.py` file that defines Python functions as tools, such as `get_current_time` or `search_files`.
- Run the CLI with `--preset=preset.py`; Gemma 4 will return `tool_call` events as JSON, the CLI will run your function, and then the model will finish the answer.
- Use LiteRT‑LM’s integrated web server option (if enabled in your version) to expose Gemma 4 as a local HTTP API for other apps.
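The round trip the CLI performs can be sketched in plain Python. The event shape below, a tool name plus an arguments dict, is an illustrative assumption, not the exact LiteRT‑LM wire format:

```python
# Hypothetical sketch of dispatching a model tool_call event to a
# registered Python function, as a tool-using runtime might do.
import json


def dispatch_tool_call(event_json: str, tools: dict) -> str:
    """Parse a tool_call event, run the matching tool, return JSON."""
    event = json.loads(event_json)
    fn = tools[event["name"]]          # KeyError -> unknown tool
    result = fn(**event.get("arguments", {}))
    return json.dumps({"name": event["name"], "result": result})


# Example with a trivial tool:
tools = {"add": lambda a, b: a + b}
event = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
print(dispatch_tool_call(event, tools))  # {"name": "add", "result": 5}
```

In a real agent loop, the JSON result would be fed back to the model so it can finish its answer.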
Google reports that LiteRT‑LM can process 4,000 input tokens across two skills in under three seconds on supported edge hardware, which gives a sense of real‑world response times.
Benchmark Results
The table below summarizes key quality benchmarks from the Gemma 4 model card and related sources.
Quality benchmarks for Gemma 4
These numbers show that even the small E2B and E4B models reach useful accuracy for many tasks, while 26B and 31B reach near‑frontier scores in reasoning and coding.
Performance on edge hardware
For local hardware, one published data point is Gemma 4 E2B on a Raspberry Pi 5 using LiteRT‑LM.
- Around 133 tokens per second in the prefill phase.
- Around 7.6 tokens per second during token‑by‑token decoding.
- 4,000 tokens across two skills processed in under three seconds in a local agent workflow.
These numbers show that useful local agents are possible even on small, low‑power boards when you choose the right model and runtime.
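Those two rates let you estimate end‑to‑end latency for a given workload. A back‑of‑the‑envelope helper, where the 133 and 7.6 tokens‑per‑second defaults come from the Raspberry Pi 5 data point above:

```python
def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tps: float = 133.0,
                       decode_tps: float = 7.6) -> float:
    """Rough latency estimate: prefill time plus decode time.

    Defaults use the published Gemma 4 E2B numbers on a Raspberry
    Pi 5 with LiteRT-LM; ignores runtime startup and I/O overhead.
    """
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# A 500-token prompt with a 50-token answer on the Pi 5 numbers:
print(round(estimate_latency_s(500, 50), 1))  # 10.3 (seconds)
```

As the example shows, decoding dominates on edge hardware, so shorter answers help far more than shorter prompts.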
Testing Details
The quality benchmarks above come from Google’s official Gemma 4 model cards, NVIDIA’s partner cards, and community summaries. They measure knowledge and reasoning (MMLU Pro, Tau2), math (AIME 2026), coding (LiveCodeBench v6), and science QA (GPQA Diamond).
Each score represents accuracy on a held‑out test set under standard evaluation setups, such as few‑shot prompts or chain‑of‑thought where defined.
The Raspberry Pi 5 performance numbers come from an analysis of Google’s local agent stack with Gemma 4 and LiteRT‑LM. The test used Gemma 4 E2B, quantized for edge use, and measured both prefill and decode rates under a local agent workload with two tools.
NVIDIA’s published model card for Gemma 4 31B IT with NVFP4 quantization shows that accuracy remains close to the full‑precision baseline, which supports practical GPU deployment at lower cost.
Comparison Table
This table compares Gemma 4 E4B with three popular open local models as of April 2026.
Gemma 4 vs other local models
*VRAM values are approximate community numbers and depend on quantization and runtime.
The table shows that Gemma 4 E4B offers long context and multimodal support at a memory cost close to other 7B‑class models, with a more permissive license than Llama 3.1.
Pricing Table
Gemma 4 itself has no license fees.
You pay for hardware, power, and any paid platform or cloud service you choose.
Public pricing today mostly covers older Gemma 3 models and competitor APIs, but it offers a useful baseline.
Example cost options around Gemma 4
For many users, the main “cost” of Gemma 4 is buying a GPU or using existing hardware. The open license makes it easier to keep per‑request cost low compared to cloud‑only models, especially at scale.
USP — What Makes Gemma 4 Different
Gemma 4 stands out because it ships not only as an open model family, but as a complete local agent stack across phones, PCs, and edge devices.
Google released open weights under Apache 2.0 along with Android AICore access, AI Edge Gallery “agent skills,” LiteRT‑LM runtimes, and day‑one support from NVIDIA and Hugging Face.
This tight integration means the gap between “announcement” and “working local setup” is small, compared with many earlier open models.
Pros and Cons
Pros
- Open Apache 2.0 license supports commercial local use and redistribution.
- Four sizes cover phones, Raspberry Pi, laptops, and workstations.
- Long context windows up to 256K tokens for large documents and codebases.
- Strong performance on reasoning, math, and coding benchmarks for its size classes.
- Multimodal input with image understanding, OCR, and UI or chart analysis.
- Good ecosystem support: Transformers, Ollama, LiteRT‑LM, NVIDIA stacks, and WebGPU demos.
Cons
- Larger models (26B, 31B) need strong GPUs or expensive cloud hardware for best speed.
- Small edge models trade some quality for speed and low memory use compared with large dense models.
- Tools and runtimes evolve fast, so commands, tags, or package names can change over time.
- Multimodal features require more complex code and higher memory use than text‑only runs.
Quick Comparison Chart
This quick chart focuses on the four Gemma 4 variants and how they fit local setups.
Gemma 4 variants at a glance
You can treat E2B as the entry point, E4B as the balanced local default, 26B A4B as the performance step‑up, and 31B as the quality‑first choice.
Demo or Real‑World Example
Example: Local coding assistant on a laptop with Ollama
This example builds a basic coding assistant using Gemma 4 E4B on a laptop with at least 16 GB RAM and a mid‑range GPU or fast CPU.
- Install Ollama
  Download and install Ollama for your operating system, then confirm it with `ollama --version`.
- Pull Gemma 4 E4B
  In a terminal, run `ollama pull gemma4:e4b` and wait for the download to finish.
- Start a dedicated model session
  Run `ollama run gemma4:e4b`. This opens an interactive prompt.
- Give it a system style
  Paste a short instruction like: “You are a careful coding assistant. Explain code changes in clear, short steps.” Then press Enter.
- Ask for code help
  Try prompts such as:
  - “Write a Python function that checks if a list has duplicates. Then explain the logic.”
  - “Refactor this function to improve readability,” then paste your code.
- Review and test outputs
  Copy the suggested code into your editor and run your test suite. If the output needs changes, ask follow‑up prompts, such as “Make this work with Python 3.8 only.”
- Connect to a local editor or IDE
  - Run `ollama serve` to expose a local HTTP API.
  - Use an editor extension or a small script to send file content and instructions to `http://localhost:11434/api/chat` with `model: "gemma4:e4b"`.
  - Display suggestions in a side panel or inline in your editor.
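For an editor script like the one described above, note that when streaming is enabled Ollama sends the reply as one JSON object per line, each carrying a chunk of the message. A small helper to reassemble the text, with illustrative sample lines:

```python
# Sketch: reassembling a streamed Ollama /api/chat reply.
import json


def join_stream(lines: list[str]) -> str:
    """Concatenate message chunks from streamed response lines."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Illustrative stream following Ollama's streaming format:
stream = [
    '{"message": {"role": "assistant", "content": "def "}, "done": false}',
    '{"message": {"role": "assistant", "content": "has_dupes(xs): ..."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(join_stream(stream))  # def has_dupes(xs): ...
```

Streaming like this lets the editor show partial suggestions as they arrive instead of waiting for the full answer.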
On a mid‑range GPU, Gemma 4 E4B should respond within a few seconds while keeping all code on your machine.
For slower laptops, you can switch to E2B by using the `gemma4:e2b` tag, which reduces memory use and speeds up responses.
Conclusion
Gemma 4 brings strong open models to devices that many people already own, from phones and Raspberry Pi boards to laptops and workstations.
Its Apache 2.0 license, long context windows, and strong reasoning and coding scores make it a solid base for local assistants and agents.
FAQ
1. Do I need a GPU to run Gemma 4?
No, you can run E2B or E4B on CPUs with runtimes like Ollama or LiteRT‑LM, although responses will be slower.
A GPU improves speed, especially for 26B and 31B.
2. Is Gemma 4 free for commercial use?
Yes, Gemma 4 uses the Apache 2.0 license, which allows commercial use, modification, and redistribution within its terms.
Cloud platforms that host Gemma models may still charge per token or per hour.
3. How much RAM or VRAM do I need?
E2B and E4B can run in around 3–4 GB of GPU memory or comparable system memory with quantization, while larger models need far more.
Always check your runtime’s docs for exact requirements for your quantization and batch size.
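As a rule of thumb, weight memory is roughly parameter count times bytes per weight. This is an approximation that ignores KV cache, activations, and runtime overhead, which add more on top, especially with long contexts:

```python
def approx_weight_memory_gb(params_billion: float,
                            bits_per_weight: int) -> float:
    """Approximate model weight memory in GB (1 GB = 1e9 bytes).

    Rule of thumb only: ignores KV cache, activations, and runtime
    overhead, which can add several GB with long contexts.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


# A 4B-parameter model quantized to 4 bits: about 2 GB of weights,
# consistent with the 3-4 GB figure above once overhead is added.
print(approx_weight_memory_gb(4, 4))  # 2.0
```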
4. Can Gemma 4 work offline?
Yes, once you download the weights and any needed runtime, Gemma 4 can run without an internet connection.
This is one of the main benefits of local deployment.
5. How does Gemma 4 compare to Llama 3.1 or Qwen2.5?
Gemma 4 offers competitive or better reasoning and coding scores at similar or smaller parameter counts, plus Apache 2.0 licensing and strong edge support.
Llama 3.1 and Qwen2.5 still offer strong alternatives, especially in existing ecosystems, but focus less on a unified local agent stack.