Gemma 4N vs Gemma 4: Is There a Gemma 4N and What Should You Run Instead?

If you searched for Gemma 4N and expected to find a named model — the way Gemma 3N was a named model — you won't find one. Google did not release a Gemma 4N. What they released instead is a complete rethinking of the on-device efficiency model, renamed under the Effective (E) branding. This article explains what Gemma 4N would have been, what replaced it, and which Gemma 4 variant you should actually be running for your use case.

What Was Gemma 3N? (and Why the "N" Naming Matters)

To understand where Gemma 4N fits — or doesn't — you need to understand what Gemma 3N was. Introduced by Google DeepMind in mid-2025, Gemma 3N was the on-device branch of the Gemma 3 family. The "N" stood for the next-generation on-device architecture, built around two key innovations:

  • Per-Layer Embeddings (PLE): Instead of one shared embedding table across the model, Gemma 3N gave each decoder layer its own smaller embedding. These embedding tables are large in token count but use a reduced dimension (256 vs the standard 1536+ of full models), and they are memory-mapped rather than loaded into active compute. This means they add to the model's total parameter count without adding proportional inference cost.
  • MatFormer architecture: A nested transformer design that allows the model to dynamically use fewer parameters for simpler queries, reducing compute on constrained hardware.
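The memory-mapped lookup idea behind Per-Layer Embeddings can be sketched in a few lines. This is an illustrative toy, not Google's implementation: it uses NumPy's `memmap` with a made-up 1,000-token vocabulary, and only the 256-dimension width comes from the numbers above.

```python
import numpy as np

# Toy dimensions for illustration; only the 256-dim width matches the
# article's description of the real per-layer embedding tables.
VOCAB, DIM = 1000, 256

# Build a small on-disk table standing in for one layer's embedding file.
table = np.memmap("layer0_embed.bin", dtype=np.float16, mode="w+", shape=(VOCAB, DIM))
table[:] = np.random.default_rng(0).standard_normal((VOCAB, DIM)).astype(np.float16)
table.flush()

def lookup(token_ids):
    # Reopen read-only: the OS pages in only the rows we touch, so the
    # full table never has to occupy working RAM.
    t = np.memmap("layer0_embed.bin", dtype=np.float16, mode="r", shape=(VOCAB, DIM))
    return np.asarray(t[token_ids])  # copy out just the requested rows

rows = lookup([3, 17, 42])
print(rows.shape)  # (3, 256)
```

The key property is in the `lookup` function: only the requested rows are paged in from storage, which is why a model can carry billions of embedding parameters without them counting against working RAM.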

The result: Gemma 3N could run on 6–8 GB RAM mobile devices while delivering quality that punched well above its active parameter count. For a deeper look at how Gemma 3N compared to standard Gemma 3, our architecture breakdown covers the differences in full.

Gemma 3N signalled that Google was bifurcating the Gemma family: one branch for servers, one for devices. Gemma 4 continues that bifurcation — but drops the "N" label entirely.

Is There a Gemma 4N?

No. When Google launched the Gemma 4 family on April 2, 2026, no model named Gemma 4N appeared. The on-device efficiency concept that defined Gemma 3N is alive and well — but it has been absorbed into the main Gemma 4 release under a different naming convention: the Effective models.

The naming shift reflects a change in positioning. Rather than marketing on-device models as a separate "N" branch, Google folded the efficiency architecture into the core Gemma 4 release, giving the efficient variants the "E" prefix to signal effective parameter count rather than total parameter count. The result is the same idea — more intelligence per byte than a naive parameter count implies — with cleaner branding across the whole family.

If you were looking for Gemma 4N, look at Gemma 4 E2B and Gemma 4 E4B.

Meet the Real Successors: Gemma 4 E2B and E4B

Gemma 4 E2B and E4B are the on-device variants of the Gemma 4 family, directly analogous to what Gemma 3N was in the previous generation. The "E" prefix stands for Effective — a reference to the effective parameter count, which is substantially lower than the total parameter count due to how Per-Layer Embeddings are counted and stored.

Per-Layer Embeddings: The Architecture That Defines "Efficient"

In plain terms: the E-models have a large number of parameters on paper, but most of those parameters sit in storage rather than active compute. Here is how it works:

  • Each decoder layer has its own embedding table with a reduced dimensionality — 256 dimensions in E2B, compared to 1536+ in a standard full model.
  • The embedding tables cover all 262,144 tokens in the vocabulary, so each decoder layer adds 262,144 × 256 (roughly 67M) embedding parameters, and a full E2B deployment carries that cost once per layer on top of the active model weights.
  • These embedding tables are memory-mapped — they live on storage (SSD or flash) and are accessed on demand, not loaded into VRAM or working RAM during inference.
  • The compute-active part of E2B is only 2.3B parameters. The total parameter count (including all embeddings) rises to 5.1B — but those extra parameters cost minimal inference compute and do not require VRAM.

For E4B: 4.5B effective parameters, 8B total with embeddings. The same memory-mapping principle applies, and the same split between "what you load" and "what the spec sheet says" holds.
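As a sanity check, the arithmetic behind those figures is simple. The layer count below is inferred from the article's own numbers (total minus effective, divided by the per-layer table size); it is not an official spec.

```python
VOCAB, PLE_DIM = 262_144, 256

# Parameters in a single layer's embedding table.
per_layer_table = VOCAB * PLE_DIM
print(per_layer_table)  # prints 67108864, roughly 67M

# E2B figures as quoted in this article: 2.3B effective, 5.1B total.
effective, total = 2.3e9, 5.1e9
embedding_params = total - effective  # ~2.8B parked on storage

# Dividing the embedding overhead by the per-layer table size implies
# the decoder layer count. Inferred, not published.
implied_layers = embedding_params / per_layer_table
print(round(implied_layers))  # prints 42
```

The same arithmetic applied to E4B (8B total minus 4.5B effective) gives a larger embedding budget, consistent with a wider or deeper stack, but again the article only states the headline counts.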

The Full Gemma 4 Family Compared

Gemma 4 ships as four distinct models. Here is how the whole family stacks up, including the context window difference that often gets overlooked in comparisons:

  • Gemma 4 E2B: 2.3B effective / 5.1B total (with PLE) — Dense + PLE — 128K context — Mobile / Embedded
  • Gemma 4 E4B: 4.5B effective / 8B total (with PLE) — Dense + PLE — 128K context — Laptop / Edge
  • Gemma 4 26B A4B: 3.8B active / 26B total (MoE) — Mixture of Experts — 256K context — Consumer GPU
  • Gemma 4 31B: 31B / 31B — Dense — 256K context — Workstation / Server

The context window difference matters in practice: E-models cap at 128K tokens, while the 26B and 31B models support 256K. For most on-device tasks this is irrelevant, but for long-document pipelines, extended agentic workflows, or large codebase ingestion, the larger models have a meaningful advantage.

For context on how Gemma 4 compares with the previous generation overall, see our breakdown of Gemma 4 vs Gemma 3 vs Gemma 3N.

Hardware Requirements and Real-World Performance

One of the defining advantages of the E-models is how little hardware they require. Here are the concrete numbers:

  • Gemma 4 E2B (2-bit quantized): Under 1.5 GB RAM — runs on smartphones and Raspberry Pi-class hardware. Embedding tables are stored on disk and do not count toward active RAM.
  • Gemma 4 E2B (4-bit): ~5 GB RAM — runs on most modern phones and low-end laptops
  • Gemma 4 E4B (4-bit): ~5 GB RAM — same footprint as E2B at 4-bit, meaningfully higher output quality
  • Gemma 4 E4B (16-bit / full precision): ~15 GB RAM — suited to higher-end laptops or developer workstations
  • Gemma 4 26B A4B (4-bit): ~18 GB VRAM — targets RTX 4090, RTX 5060 Ti, RX 7900 XTX class GPUs
  • Gemma 4 31B (4-bit): ~20 GB VRAM — consumer workstation GPU or entry-level server
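A rough way to sanity-check these figures: the weights-only footprint is just parameter count × bits per weight ÷ 8. The gap between that and the RAM numbers above is KV cache, activations, and runtime overhead. The parameter counts below are the effective counts quoted in this article.

```python
def weight_gb(params_b, bits):
    """Weights-only footprint in GB: params x bits / 8.
    Ignores KV cache, activations, and runtime overhead."""
    return params_b * 1e9 * bits / 8 / 1e9

# Effective (compute-active) parameter counts from the article.
for name, params, bits in [
    ("E2B @ 2-bit", 2.3, 2),
    ("E2B @ 4-bit", 2.3, 4),
    ("E4B @ 4-bit", 4.5, 4),
    ("E4B @ 16-bit", 4.5, 16),
]:
    print(f"{name}: ~{weight_gb(params, bits):.1f} GB weights")
```

At 4-bit, E4B's weights alone come to only about 2.3 GB; the ~5 GB figure quoted above is therefore dominated by runtime overhead rather than the weights themselves, which is also why E2B and E4B land on similar totals.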

Google has published these inference benchmarks for the E2B on edge hardware:

  • Raspberry Pi 5 (CPU only): 7.6 decode tokens/second
  • Qualcomm Dragonwing IQ8 (NPU accelerated): 31 decode tokens/second

For most interactive use cases, 7+ tokens/second is usable in practice; 31 tokens/second on Qualcomm NPU is fast enough for real-time chat and edge inference pipelines. The E4B on a mid-range laptop GPU will deliver noticeably better output quality at the same RAM cost as E2B.
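To make those throughput numbers concrete, here is the arithmetic for a single reply. The 300-token reply length is an assumption chosen for illustration; the decode rates are the ones quoted above.

```python
def seconds_for(tokens, tok_per_s):
    # Decode time only; ignores prompt-processing (prefill) latency.
    return tokens / tok_per_s

reply_tokens = 300  # a typical mid-length chat reply (assumption)

for device, rate in [("Raspberry Pi 5 CPU", 7.6), ("Dragonwing IQ8 NPU", 31)]:
    print(f"{device}: {seconds_for(reply_tokens, rate):.0f}s for {reply_tokens} tokens")
```

At 7.6 tokens/second a ~300-token reply takes around 40 seconds, while the NPU path delivers it in under 10, which is why the NPU figure is the one that matters for real-time chat.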

The easiest deployment paths are Ollama (supports both E2B and E4B out of the box), llama.cpp (for precise quantization control), and MediaPipe LLM Inference (for Android and iOS). For the full setup walkthrough, see our guide on running Gemma 4 locally on your device.

Which Gemma 4 Model Should You Choose?

The right choice depends on where your code runs and what you need from the model:

  • Android / iOS app or embedded device: Gemma 4 E2B. Under 1.5 GB at 2-bit quantization, multimodal (text + vision + audio), and MediaPipe-compatible for on-device deployment. The direct upgrade from Gemma 3N.
  • Laptop or local dev machine (8–16 GB RAM): Gemma 4 E4B at 4-bit. Better quality than E2B with the same storage footprint, fast enough for interactive use even on integrated GPU.
  • Consumer gaming GPU (16–24 GB VRAM): Gemma 4 26B A4B. The Mixture-of-Experts design activates only 3.8B parameters per token despite the 26B total, delivering near-30B quality in roughly 18 GB of VRAM at 4-bit. Ranked #6 on the open model Arena AI leaderboard.
  • Workstation or server with 24 GB+ VRAM: Gemma 4 31B. The highest-quality model in the family, ranked #3 on Arena AI among all open models. Full 256K context window; best for production agentic pipelines and demanding reasoning tasks.

If you were running Gemma 3N for on-device work, the direct upgrade path is Gemma 4 E2B or E4B. The architecture is the same — Per-Layer Embeddings, memory-mapped storage — and the quality improvement is significant thanks to multimodal capabilities and the underlying Gemini 3 research base. For a refresher on the Gemma 3N architecture before migrating, the Gemma 3N local setup guide remains a useful reference.

The Gemma 4 family also marks an important licensing shift: all four models are now under the Apache 2.0 license, removing restrictions on commercial use that applied to earlier Gemma versions. For production workloads, this removes a meaningful legal friction point.

To summarise: search for Gemma 4N, and you will not find it — but you will find two models that are arguably better than Gemma 3N was at launch, with cleaner hardware targeting and a more coherent family structure behind them.