If you’ve been waiting for open models that can actually compete with the big proprietary players, your patience just paid off. Google DeepMind announced Gemma 4 today, and this isn’t just an incremental update — it’s a fundamental shift in what’s possible with models you can actually run on your own hardware.
In this post, we’ll break down what makes Gemma 4 special, why the Apache 2.0 license matters, and how you can start using these models today — whether you’re running a workstation with beefy GPUs or just your phone.
The Numbers That Matter
Let’s cut straight to what everyone wants to know: how good are these models?
The 31B Dense model currently ranks as the #3 open model in the world on the Arena AI text leaderboard. The 26B Mixture of Experts (MoE) variant sits at #6. The kicker: Gemma 4 is outcompeting models 20 times its size.
Google is releasing four model sizes:
- E2B (Effective 2B) — Engineered for mobile and IoT devices
- E4B (Effective 4B) — A step up while still prioritizing edge deployment
- 26B MoE — Focuses on latency, activating only 3.8 billion parameters during inference
- 31B Dense — Maximum quality and a powerful foundation for fine-tuning
Since the first generation launched, developers have downloaded Gemma over 400 million times, spawning more than 100,000 variants in what Google calls the “Gemmaverse.” That’s not hype — that’s a genuine ecosystem.
Built for Reasoning and Agents
The Gemma 4 family was purpose-built for two things that matter enormously right now: advanced reasoning and agentic workflows.
What does that mean in practice?
Multi-step planning and deep logic. These models show significant improvements on math benchmarks and instruction-following tasks that require genuine reasoning chains. If you’ve been frustrated by models that lose the thread on complex problems, this is exactly the weakness Gemma 4 targets.
Native agentic capabilities. Function calling, structured JSON output, and native system instructions are built in. You can create autonomous agents that interact with tools and APIs, executing workflows reliably without wrestling with prompt engineering hacks. The model understands what it means to be an agent from the ground up.
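In practice, function calling means the model emits a structured JSON description of the call and your code executes it. Here is a minimal dispatch sketch; `get_weather` and the exact JSON shape are illustrative assumptions, not Gemma 4’s documented schema:

```python
import json

# Hypothetical tool the agent can call. The stub stands in for a real
# weather API so the sketch stays self-contained.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "sunny"}

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> dict:
    """Parse a structured tool call emitted by the model and run it."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]          # look up the registered tool
    return fn(**call["arguments"])    # invoke with the model's arguments

# A tool call as the model might emit it in structured-JSON mode:
result = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
print(result)
```

The point of native structured output is that the string fed to `dispatch` comes back reliably well-formed, so the loop above doesn’t need regex extraction or retry-and-repair prompt hacks.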
High-quality code generation. Google explicitly calls out the ability to turn your workstation into a “local-first AI code assistant.” For developers who want powerful code completion without sending everything to the cloud, this is significant.
Multimodal by Default
Every model in the Gemma 4 family natively processes video and images with variable resolution support. They excel at visual tasks like OCR and chart understanding — practical capabilities that matter for real applications.
The edge models (E2B and E4B) go further with native audio input for speech recognition and understanding. Multimodal isn’t an afterthought or a separate model variant; it’s baked into the architecture.
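In the Hugging Face chat-template convention, a multimodal turn is just a list of typed content parts. The field names below follow that convention but are assumptions until the official model card lands:

```python
# One user turn mixing an image with a question, in the list-of-parts
# shape used by Hugging Face multimodal chat templates (assumed here).
message = {
    "role": "user",
    "content": [
        {"type": "image", "url": "sales_chart.png"},  # chart understanding
        {"type": "text", "text": "Which quarter had the highest revenue?"},
    ],
}

# With released weights, processor.apply_chat_template([message], ...)
# would render this for the model; that call is omitted here.
print([part["type"] for part in message["content"]])
```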
Context windows are generous too. Edge models support 128K tokens, while the larger models offer up to 256K. That’s enough to pass entire codebases or long documents in a single prompt without chunking strategies.
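Before dropping a whole codebase into a prompt, it’s worth a cheap sanity check that it fits. A common rule of thumb is roughly four characters per token for English text and code; the real count depends on Gemma 4’s tokenizer, so treat this as an estimate only:

```python
CHARS_PER_TOKEN = 4  # rough heuristic; the actual tokenizer may differ

def estimated_tokens(text: str) -> int:
    """Cheap token estimate without loading a tokenizer."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_window(texts, window_tokens=256_000):
    """Estimate whether concatenated documents fit in a ~256K window."""
    total = sum(estimated_tokens(t) for t in texts)
    return total, total <= window_tokens

# A synthetic "codebase": 20 files of 100 small functions each.
files = ["def add(a, b):\n    return a + b\n" * 100 for _ in range(20)]
total, ok = fits_in_window(files)
print(total, ok)  # well under the 256K budget of the larger models
```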
The Apache 2.0 Shift
This is the headline that should make enterprise developers sit up: Gemma 4 ships under an Apache 2.0 license.
Google states this directly: “You gave us feedback, and we listened.” Previous Gemma releases had more restrictive terms. Apache 2.0 is a commercially permissive license that provides what Google calls “complete developer flexibility and digital sovereignty.”
What that means practically:
- Complete control over your data, infrastructure, and models
- Freedom to build and deploy across any environment — on-premises or cloud
- No restrictive barriers that would complicate commercial use
For enterprises worried about data sovereignty, compliance requirements, or simply wanting to avoid vendor lock-in, this is a major development. You get state-of-the-art capabilities with the licensing terms that legal and compliance teams actually like.
Running on Real Hardware
Google sized these models intentionally for the hardware developers actually have.
The 31B and 26B models are optimized for workstations and servers. Unquantized bfloat16 weights fit on a single 80GB NVIDIA H100 GPU. Quantized versions run on consumer GPUs, bringing frontier-level reasoning to your local IDE and coding assistants.
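Those sizing claims check out with back-of-the-envelope arithmetic: bfloat16 stores two bytes per parameter. The sketch below does the math; the 4-bit line assumes plain 4-bit quantization and ignores activation and KV-cache overhead:

```python
def weight_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (decimal) for a dense model."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

print(weight_gb(31, 16))  # bf16: 62.0 GB, fits a single 80GB H100
print(weight_gb(31, 4))   # 4-bit: 15.5 GB, within consumer-GPU range
```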
The 26B MoE variant is particularly clever. It activates only 3.8 billion of its total parameters during inference, which translates to exceptionally high tokens-per-second throughput while maintaining quality. If latency matters to your application, this is the one to watch.
The E2B and E4B models are where things get interesting for mobile developers. These run completely offline with near-zero latency on phones, Raspberry Pi, and NVIDIA Jetson Orin Nano. Google collaborated with its Pixel team, Qualcomm Technologies, and MediaTek to optimize for real edge hardware.
Android developers can prototype agentic flows in the AICore Developer Preview today, with forward-compatibility for Gemini Nano 4. If you’re building on-device AI experiences, the runway just got a lot longer.
140+ Languages Out of the Box
Gemma 4 was natively trained on over 140 languages. This isn’t translation as an afterthought — it’s multilingual capability built into the foundation.
For developers building applications for global audiences, this removes a significant barrier. You get high-performance language capabilities without needing separate models or complex translation pipelines.
The Ecosystem Play
Day-one support covers essentially every framework that matters:
- Hugging Face (Transformers, TRL, Transformers.js, Candle)
- Ollama, llama.cpp, MLX, LM Studio
- vLLM, SGLang for production serving
- NVIDIA NIM and NeMo
- Google’s own LiteRT-LM, Keras, Vertex AI
You can download weights from Hugging Face, Kaggle, or Ollama. Fine-tuning works on Google Colab, Vertex AI, or even your gaming GPU.
For production scale, Google Cloud offers deployment through Vertex AI, Cloud Run, and GKE, with TPU-accelerated serving and compliance guarantees for regulated workloads. The path from prototype to production is well-paved.
Hardware optimization spans NVIDIA infrastructure from Jetson Orin Nano to Blackwell GPUs, AMD GPUs via the ROCm stack, and Google’s Trillium and Ironwood TPUs.
What This Means for You
If you’ve been on the fence about open models versus proprietary APIs, Gemma 4 changes the calculus significantly.
You’re getting models that compete with the best proprietary offerings, under licensing terms that work for commercial use, optimized for hardware you probably already have. The gains in intelligence per parameter mean frontier-level capability with significantly less hardware overhead.
For on-device AI specifically, the E2B and E4B models open up use cases that simply weren’t practical before. Completely offline, near-zero latency, multimodal understanding — on a phone or a Raspberry Pi.
Google explicitly positions this as complementary to its proprietary Gemini models. In Google’s framing, that gives you the industry’s most powerful combination of open and proprietary tools, letting you choose the right approach for each use case.
The Gemma 4 Good Challenge on Kaggle is accepting entries for those who want to build products creating meaningful, positive change. If you’re looking for a project, that’s one place to start.
Getting Started
The fastest path to experimenting:
- Google AI Studio for the 31B and 26B MoE models
- AI Edge Gallery for E4B and E2B
- Hugging Face for model weights
- Ollama for one-line local deployment
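If the weights are in the Ollama library, local deployment really is a one-liner. The model tag below is a guess; check the Ollama library for the actual name:

```shell
# Model tag is assumed; substitute whatever the Ollama library lists.
ollama run gemma4 "Explain the difference between a dense and a MoE model."
```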
For Android development, check out Agent Mode in Android Studio and the ML Kit GenAI Prompt API.
The models are live, the license is permissive, and the ecosystem is ready. Time to build.