Run Gemma 4 E2B Locally with Ollama: Setup, API, and Real Usage
How to pull and run Google's Gemma 4 E2B model locally with Ollama, expose it as an OpenAI-compatible endpoint, and wire it into real workflows without touching a cloud API.

Most AI setups start with "sign up, get API key, add credit card." Gemma 4 E2B running through Ollama starts with one shell command. That's the actual pitch.
Google dropped Gemma 4 in April 2026 and the E2B variant is the one worth paying attention to for local dev. The "E" stands for effective parameters — 2.3B effective, 5.1B total with embeddings. It runs on 8GB RAM at 4-bit quantization and still handles multimodal input, 128K context, and native function calling. That's a lot of model for hardware you already own.
Here's how to get it running, expose it as a local API, and actually use it.
Install Ollama
Head to ollama.com/download and grab the installer for your OS.
On Linux, the one-liner works:
curl -fsSL https://ollama.com/install.sh | sh
On Mac, unpack the zip, drag Ollama into Applications, and launch it once; after that the server runs in the background automatically.
Check it's running:
ollama --version
If you're on an Apple Silicon Mac, Ollama uses the GPU through Metal automatically. You don't have to configure anything. On NVIDIA, it uses CUDA. CPU fallback works too, just slower.
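If you'd rather confirm the server from code, it answers HTTP on port 11434 out of the box. A minimal sketch using only the standard library, assuming the default port and Ollama's /api/version endpoint:

# Sanity check: is the local Ollama server up, and which version is it?
# Assumes a stock install listening on localhost:11434.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/version") as resp:
    info = json.load(resp)

print(f"Ollama is running, version {info['version']}")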
Pull Gemma 4 E2B
ollama pull gemma4:e2b
That's roughly a 2.5GB download at Q4 quantization. Ollama stores models in ~/.ollama/models and manages everything from there.
Want to verify it's there?
ollama list
You should see gemma4:e2b in the output.
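You can also check over HTTP: Ollama's /api/tags endpoint lists locally installed models. A small sketch, again assuming the default port:

# List installed models via the REST API and check for gemma4:e2b.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    tags = json.load(resp)

names = [m["name"] for m in tags.get("models", [])]
print(names)
print("gemma4:e2b installed:", "gemma4:e2b" in names)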
Run It
The quickest test is the CLI:
ollama run gemma4:e2b
That drops you into an interactive prompt. Type anything. Ctrl+D to exit.
For programmatic use, Ollama exposes a REST endpoint at http://localhost:11434. The /api/chat endpoint handles multi-turn conversations:
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e2b",
  "messages": [{"role": "user", "content": "Explain what a closure is in Python"}],
  "stream": false
}'
Set "stream": true if you want token-by-token output instead of waiting for the full response.
The OpenAI-Compatible API
This is where things get useful. Ollama also serves an OpenAI-compatible endpoint at /v1. That means any tool built for the OpenAI SDK works against your local model with one config change.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored
)

response = client.chat.completions.create(
    model="gemma4:e2b",
    messages=[
        {"role": "system", "content": "You are a code reviewer."},
        {"role": "user", "content": "Review this Python function for edge cases: def divide(a, b): return a / b"},
    ],
)

print(response.choices[0].message.content)
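Streaming works through the compatibility layer too. A short sketch with stream=True; the prompt is just a placeholder:

# Stream tokens through the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="gemma4:e2b",
    messages=[{"role": "user", "content": "Write a haiku about local inference"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

Anything else built on the OpenAI SDK gets the same treatment: point base_url at localhost:11434/v1 and leave the rest of the code untouched.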