Run Gemma 4 E2B Locally with Ollama: Setup, API, and Real Usage
How to pull and run Google's Gemma 4 E2B model locally with Ollama, expose it as an OpenAI-compatible endpoint, and wire it into real workflows without touching a cloud API.

Most AI setups start with "sign up, get API key, add credit card." Gemma 4 E2B running through Ollama starts with one shell command. That's the actual pitch.
Google dropped Gemma 4 in April 2026 and the E2B variant is the one worth paying attention to for local dev. The "E" stands for effective parameters — 2.3B effective, 5.1B total with embeddings. It runs on 8GB RAM at 4-bit quantization and still handles multimodal input, 128K context, and native function calling. That's a lot of model for hardware you already own.
Here's how to get it running, expose it as a local API, and actually use it.
Install Ollama
Head to ollama.com/download and grab the installer for your OS.
On Linux, the one-liner works:
curl -fsSL https://ollama.com/install.sh | sh
On Mac, unpack the zip, drag Ollama into Applications, and launch it once; after that the server runs in the background automatically.
Check it's running:
ollama --version
If you're on an Apple Silicon Mac, Ollama uses the GPU through Metal automatically. You don't have to configure anything. On NVIDIA, it uses CUDA. CPU fallback works too, just slower.
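If you'd rather confirm the server from code, it answers HTTP on port 11434 out of the box. A minimal sketch using only the standard library, assuming the default port and Ollama's /api/version endpoint:

# Sanity check: is the local Ollama server up, and which version is it?
# Assumes a stock install listening on localhost:11434.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/version") as resp:
    info = json.load(resp)

print(f"Ollama is running, version {info['version']}")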
Pull Gemma 4 E2B
ollama pull gemma4:e2b
That's roughly a 2.5GB download at Q4 quantization. Ollama stores models in ~/.ollama/models and manages everything from there.
Want to verify it's there?
ollama list
You should see gemma4:e2b in the output.
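You can also check over HTTP: Ollama's /api/tags endpoint lists locally installed models. A small sketch, again assuming the default port:

# List installed models via the REST API and check for gemma4:e2b.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    tags = json.load(resp)

names = [m["name"] for m in tags.get("models", [])]
print(names)
print("gemma4:e2b installed:", "gemma4:e2b" in names)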
Run It
The quickest test is the CLI:
ollama run gemma4:e2b
That drops you into an interactive prompt. Type anything. Ctrl+D to exit.
For programmatic use, Ollama exposes a REST endpoint at http://localhost:11434. The /api/chat endpoint handles multi-turn conversations:
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e2b",
  "messages": [{"role": "user", "content": "Explain what a closure is in Python"}],
  "stream": false
}'
Set "stream": true if you want token-by-token output instead of waiting for the full response.
The OpenAI-Compatible API
This is where things get useful. Ollama also serves an OpenAI-compatible endpoint at /v1. That means any tool built for the OpenAI SDK works against your local model with one config change.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored
)

response = client.chat.completions.create(
    model="gemma4:e2b",
    messages=[
        {"role": "system", "content": "You are a code reviewer."},
        {"role": "user", "content": "Review this Python function for edge cases: def divide(a, b): return a / b"},
    ],
)

print(response.choices[0].message.content)
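Streaming works through the compatibility layer too. A short sketch with stream=True; the prompt is just a placeholder:

# Stream tokens through the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="gemma4:e2b",
    messages=[{"role": "user", "content": "Write a haiku about local inference"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

Anything else built on the OpenAI SDK gets the same treatment: point base_url at localhost:11434/v1 and leave the rest of the code untouched.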