Tech · April 5, 2026 · 6 min read

Google Gemma 4 — Open Source AI That Runs on Your Phone

Google released Gemma 4 under Apache 2.0, capable of running locally on Android phones. We break down benchmarks, compare it to other open models, and explore what local AI on mobile means for developers.

By NeuralStackly

Google just released Gemma 4, the latest iteration of its open-source model family, and this time the story is different: it runs locally on Android phones. No cloud. No API calls. No network latency. Just pure on-device inference.

Under the Apache 2.0 license, Gemma 4 is free for commercial use, modification, and distribution. This is a significant move in the ongoing open-source AI race, and it has implications that reach far beyond the developer community.

What Is Gemma 4?

Gemma 4 is Google's fourth-generation open model family, built on the same research and technology that powers Gemini. It comes in multiple sizes:

Model           Parameters   Quantized Size   Target Device
Gemma 4 Nano    2B           ~1.4 GB          Android phones, IoT
Gemma 4 Small   9B           ~5.2 GB          High-end phones, tablets
Gemma 4 Medium  27B          ~15 GB           Laptops, desktops
Gemma 4 Large   70B          ~40 GB           Servers, cloud

The Nano and Small variants are the ones designed for mobile. Google optimized them using quantization (4-bit and 8-bit), distillation, and architecture tweaks to run efficiently on mobile NPUs and GPUs.
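The sizes in the table above follow almost directly from the arithmetic of quantization: the raw weight footprint is roughly parameters × bits per weight, and real files run somewhat larger because embeddings, the tokenizer, and metadata are typically stored at higher precision. A quick sanity-check sketch (the overhead factor is an assumption, not a published figure):

```kotlin
// Raw on-disk size of quantized weights: params * bits / 8 bytes.
// Real model files land ~10-40% higher due to higher-precision
// embeddings and metadata, which is why 2B at 4-bit ships at ~1.4 GB.
fun rawWeightGb(params: Double, bitsPerWeight: Int): Double =
    params * bitsPerWeight / 8.0 / 1e9

fun main() {
    println("2B  @ 4-bit: ${rawWeightGb(2e9, 4)} GB raw")
    println("9B  @ 4-bit: ${rawWeightGb(9e9, 4)} GB raw")
    println("27B @ 4-bit: ${rawWeightGb(27e9, 4)} GB raw")
}
```

The same arithmetic explains why 8-bit variants roughly double the download size.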

Benchmarks: How Does It Perform?

Early benchmark results show Gemma 4 punching above its weight class:

Benchmark   Gemma 4 Nano (2B)   Gemma 4 Small (9B)   Gemma 3 Small (9B)   Llama 4 Scout (9B)
MMLU        58.2                72.4                 64.1                 70.8
HumanEval   41.5                62.8                 51.2                 59.4
GSM8K       64.7                78.3                 69.5                 75.1
MT-Bench    6.8                 8.1                  7.2                  7.8

The 9B model rivals models twice its size from the previous generation. The Nano model, while limited, is remarkably capable for something that fits on a phone.

Running Locally on Android

This is where Gemma 4 gets interesting. Google has integrated Gemma 4 support into ML Kit and Android's AI Edge framework, making it straightforward to run inference on-device.

How It Works

1. Model download: Apps can bundle the model or download it on first launch

2. On-device inference: Uses the phone's NPU (Neural Processing Unit) or GPU

3. No internet required: Everything runs locally after download

4. Privacy-first: User data never leaves the device
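The four steps above collapse into a small amount of app code. A minimal sketch, assuming the illustrative GemmaModel builder API used later in this article plus a hypothetical downloadModel() helper (error handling and threading omitted):

```kotlin
import java.io.File

// Sketch of the on-device flow: download once on first launch, then
// every subsequent inference stays local. GemmaModel and
// downloadModel() are illustrative names, not a confirmed API.
suspend fun loadGemma(modelDir: File): GemmaModel {
    val modelFile = File(modelDir, "gemma4-nano-int4.bin")
    if (!modelFile.exists()) {
        downloadModel(variant = "nano-int4", dest = modelFile) // first launch only
    }
    // From here on, no network is needed; the NPU/GPU runs inference.
    return GemmaModel.Builder()
        .setModelPath(modelFile.path)
        .build()
}
```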

Performance on Real Devices

Device                 Gemma 4 Nano (tokens/sec)   Gemma 4 Small (tokens/sec)
Pixel 10               28                          12
Samsung S26            25                          11
iPhone 17 (via GGUF)   22                          9
Budget phone ($300)    14                          5

At 28 tokens per second on a Pixel 10, the Nano model often delivers a complete short reply faster than a cloud API can, once you factor in network round-trip latency.

Comparing to Other Open Models

Gemma 4 vs Llama 4 Scout

Meta's Llama 4 Scout (9B) was the previous open-source champion in this size class. Gemma 4 Small edges it out on most benchmarks, but the real differentiator is mobile optimization:

  • Gemma 4: First-class Android/iOS support via ML Kit, NPU-optimized
  • Llama 4: Better for server-side inference, larger community ecosystem

Gemma 4 vs Mistral Small

Mistral's small models are fast but not designed for mobile. Gemma 4 wins on:

  • On-device inference speed
  • Memory efficiency
  • Mobile framework integration

Mistral wins on:

  • Raw quality per parameter
  • Server-side throughput
  • Multilingual performance

Gemma 4 vs Phi-4 Mini

Microsoft's Phi-4 Mini is the closest competitor for on-device AI. It's slightly smaller (3.8B) and focuses on reasoning. Phi-4 Mini has better math performance, but Gemma 4 Nano is faster on mobile hardware.

Use Cases for On-Device AI

1. Smart Keyboard with Context Awareness

A keyboard app that understands the full context of your conversation and suggests complete, relevant replies without sending anything to a server.

2. Offline Document Summarization

Summarize PDFs, articles, and notes directly on your phone during a flight or in a dead zone. No upload, no waiting.

3. Private Code Assistant

Developers can get code completion and explanation on their phone or laptop without sending proprietary code to any API.

4. Voice-First Interfaces

Combined with on-device speech-to-text, Gemma 4 enables fully local voice assistants that understand context and nuance.

5. Accessibility Features

Real-time captioning, text simplification, and visual description — all running locally, all private.

The Developer Experience

Getting started with Gemma 4 on Android is straightforward:

// Using ML Kit with Gemma 4
val model = GemmaModel.Builder()
    .setModelVariant(GemmaModel.Variant.NANO)  // 2B model, phone-sized
    .setQuantization(Quantization.INT4)        // 4-bit weights, ~1.4 GB
    .build()

val response = model.generate("Explain quantum computing simply")

For cross-platform deployment, you can use the MediaPipe framework or llama.cpp with the GGUF-quantized versions available on Hugging Face.
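With MediaPipe's LLM Inference API the shape of the code is similar. A sketch based on that API's current Android surface (check the MediaPipe docs for exact option names; the model path is a placeholder for wherever your app stores the file):

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch: run a locally stored Gemma model via MediaPipe's
// LLM Inference task. Path and token limit are illustrative.
fun runMediaPipe(context: android.content.Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/gemma-model.bin") // placeholder path
        .setMaxTokens(512)
        .build()
    val llm = LlmInference.createFromOptions(context, options)
    return llm.generateResponse(prompt)
}
```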

Why This Matters

Google releasing a production-ready, Apache 2.0, mobile-optimized model is a watershed moment for several reasons:

Privacy Becomes Default

When inference happens on-device, there's no data to leak, no API to monitor, no server to breach. Privacy isn't a policy — it's an architecture.

Developing Markets Get AI

Billions of people have phones but unreliable internet. On-device AI means they get smart features regardless of connectivity.

Reduced Infrastructure Costs

For app developers, on-device inference means zero API costs at scale. No more worrying about per-token pricing as your user base grows.
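To make the per-token economics concrete, here is a back-of-the-envelope bill for a cloud-hosted alternative. All figures are assumptions for illustration, not real prices:

```kotlin
// Rough monthly API bill for cloud inference at scale.
fun monthlyCostUsd(
    users: Int,
    requestsPerUserPerDay: Int,
    tokensPerRequest: Int,
    usdPerMillionTokens: Double,
): Double {
    val tokensPerMonth =
        users.toLong() * requestsPerUserPerDay * tokensPerRequest * 30L
    return tokensPerMonth / 1e6 * usdPerMillionTokens
}

fun main() {
    // Assumed: 100k users, 20 requests/day, 500 tokens each, $0.50/1M tokens.
    println("~$" + monthlyCostUsd(100_000, 20, 500, 0.50) + " per month")
    // With on-device inference, this line item is $0 at any scale.
}
```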

The App Ecosystem Shifts

We're going to see a new category of apps that are AI-first and cloud-optional. This changes the economics of building AI-powered software.

Limitations to Know About

  • Hallucinations: Still present, especially in the Nano model
  • Context windows: Mobile models have shorter context (4K-8K tokens) vs cloud models (128K+)
  • Battery impact: Sustained inference drains battery noticeably
  • Storage: Even quantized models take 1-5 GB of storage
  • Quality gap: The 2B Nano model is useful but not comparable to GPT-5 or Claude for complex tasks
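Of these, the shorter context window is the limit you will hit first in a chat-style app. A common workaround is to keep only the most recent turns that fit the token budget. A sketch using a naive ~4-characters-per-token estimate (a real app would count tokens with the model's own tokenizer):

```kotlin
// Naive token estimate: roughly 4 characters per token for English text.
fun estimateTokens(text: String): Int = (text.length + 3) / 4

// Keep the newest messages whose combined estimated token count
// fits within the on-device model's context window.
fun trimToContext(messages: List<String>, maxTokens: Int): List<String> {
    val kept = ArrayDeque<String>()
    var used = 0
    for (msg in messages.asReversed()) {   // walk newest-first
        val cost = estimateTokens(msg)
        if (used + cost > maxTokens) break
        kept.addFirst(msg)                 // restore chronological order
        used += cost
    }
    return kept.toList()
}
```

This drops the oldest turns first, which preserves the immediate conversational context the model needs most.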

What's Next

Google has signaled that future Gemma releases will focus on:

  • Multimodal capabilities (image + text on-device)
  • Fine-tuning on-device (personalized models without cloud)
  • Agent frameworks for on-device AI agents

The trajectory is clear: the phone in your pocket is becoming the primary AI inference device.

Getting Started

  • Hugging Face: google/gemma-4
  • Kaggle Models: Free Gemma 4 notebooks and benchmarks
  • Android AI Edge: Official Google documentation for on-device inference
  • MediaPipe: Cross-platform inference framework

Bottom Line

Gemma 4 isn't the smartest model ever built. But it might be the most important one released this year. By making capable AI run locally on phones under an Apache 2.0 license, Google is pushing the industry toward a future where AI is private, free, and always available — no cloud required.

For developers, the question isn't whether to add on-device AI to your app. It's how fast you can ship it.
