Gemma 4: Byte for byte, the most capable open models
Key Points
- Apache 2.0 open release
- E2B/E4B (edge) and 26B MoE / 31B dense (workstation)
- 128K–256K context, multimodal, function-calling
Summary
Gemma 4 is Google DeepMind's new family of open models, released under the Apache 2.0 license and optimized for high intelligence-per-parameter across edge, workstation, and accelerator hardware. The family comes in four sizes: E2B (effective 2B), E4B (effective 4B), 26B Mixture-of-Experts (MoE), and 31B Dense. The models are designed for advanced reasoning, agentic workflows, multimodal inputs (vision on all sizes, native audio on E2B/E4B), long contexts, and local-first code generation and inference.
Key Points
- Models: E2B and E4B (edge/mobile, native audio), 26B MoE (latency-optimized; ~3.8B parameters activated during inference), 31B Dense (highest raw quality).
- Context windows: up to 128K for edge models, up to 256K for larger models (repo/long-doc workflows).
- Core capabilities: multi-step reasoning, function-calling / structured JSON, system instructions, code generation, vision & OCR, audio (E2B/E4B), 140+ languages.
- Hardware & formats: the unquantized bfloat16 weights of the workstation models (26B MoE, 31B Dense) fit on a single 80GB H100; quantized variants target consumer GPUs and mobile/Jetson devices; optimized for NVIDIA GPUs (Blackwell, Jetson), AMD GPUs (via ROCm), and Google TPUs (Trillium, Ironwood).
- Developer experience: weights available via Hugging Face, Kaggle, Ollama; day-one runtime support (Transformers, vLLM, llama.cpp, Candle, LiteRT-LM, etc.); deploy via Vertex AI, Cloud Run, GKE, or on-device for offline use.
- Licensing & safety: commercially permissive Apache 2.0 license; the models undergo the same infrastructure security protocols as Google's proprietary models.
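To make the function-calling / structured JSON capability concrete, here is a minimal sketch of how an application might validate and dispatch a tool call returned by the model. The JSON shape (`name` plus `arguments`), the `get_weather` tool, and the `dispatch` helper are all assumptions made for illustration; the actual format Gemma 4 emits depends on the chat template and runtime in use.

```python
import json
from typing import Any, Callable


def get_weather(city: str) -> dict[str, Any]:
    """Hypothetical tool; returns canned data so the sketch runs offline."""
    return {"city": city, "forecast": "sunny"}


# Registry of tools the application is willing to execute.
TOOLS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch(model_output: str) -> Any:
    """Parse a JSON tool call (assumed shape) and run the matching tool."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    name = call.get("name")
    args = call.get("arguments", {})
    # Never execute a tool the application did not register.
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("'arguments' must be a JSON object")
    return TOOLS[name](**args)


# Stand-in for text generated by the model.
result = dispatch('{"name": "get_weather", "arguments": {"city": "Zurich"}}')
print(result)
```

Validating the tool name against an explicit registry before calling anything keeps malformed or unexpected model output from reaching application code.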
Practical notes for engineers
- If you need local high-quality inference/fine-tuning, the 31B model runs unquantized on a single 80GB H100; use quantized builds for consumer GPUs.
- For low-latency on-device apps, prototype with E2B/E4B (128K context, native audio) and test on target hardware such as Pixel phones, Raspberry Pi, or Jetson Orin Nano.
- Use the 26B MoE for latency- and throughput-sensitive workloads: it activates only about 3.8B of its parameters during inference, which yields more tokens per second.
- Fine-tuning and adaptation are supported across common platforms (Colab, Vertex AI, local GPUs); weights and tooling are open and portable for on-prem or cloud deployments.
- Start integrating via the supported runtimes (Hugging Face Transformers / TRL, vLLM, llama.cpp, etc.) and consider Vertex AI or TPU serving for production scale.
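The sizing advice above follows from simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement. The helper names are hypothetical, the bytes-per-parameter figures for int8 and int4 ignore quantization metadata such as scales, and KV cache and activation memory are not modeled.

```python
# Approximate bytes per parameter for common weight formats (assumed values;
# real quantized files carry some extra metadata).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB (10^9 bytes) for a parameter count in billions."""
    return params_billion * BYTES_PER_PARAM[dtype]


def fits(params_billion: float, dtype: str, device_gb: float) -> bool:
    """True if the weights alone fit in device memory (no KV cache headroom)."""
    return weight_gb(params_billion, dtype) <= device_gb


for dtype in ("bf16", "int8", "int4"):
    print(f"31B Dense @ {dtype}: ~{weight_gb(31, dtype):.1f} GB")

print("31B bf16 on 80 GB H100:", fits(31, "bf16", 80))   # 62 GB of weights
print("31B bf16 on 24 GB GPU:", fits(31, "bf16", 24))
print("31B int4 on 24 GB GPU:", fits(31, "int4", 24))    # 15.5 GB of weights
```

At bfloat16 the 31B Dense weights come to about 62 GB, which leaves limited headroom on an 80GB card for long-context KV cache. Note also that an MoE model typically needs all of its parameters resident in memory even though only a fraction is active per token, so the 26B MoE should be sized by its total parameter count, not by the ~3.8B activated.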