Workers AI - Google Gemma 4 26B A4B now available
Key Points
- Mixture-of-Experts (MoE): 26B parameters total, ~4B active per forward pass
- 256k-token context window
- Available through the env.AI.run() binding and REST endpoints
Summary
Cloudflare Workers AI now supports Google Gemma 4 26B A4B. Gemma 4 26B A4B is a Mixture-of-Experts (MoE) model with 26B total parameters, of which about 4B are active per forward pass, so it offers quality closer to a large model at roughly the compute cost of a 4B model. It provides a 256k-token context window, a reasoning ("thinking") mode, strong multimodal capabilities (vision and OCR), function calling for tool workflows, multilingual support, and improved coding assistance.
Key Points
- Architecture: Mixture-of-Experts that activates 8 of 128 routed experts, plus 1 shared expert, which reduces inference compute while retaining high performance.
- Context: a 256,000-token window that can hold long conversation histories, large documents, and tool definitions in a single request, which helps with long multi-turn sessions.
- Reasoning and tools: a built-in step-by-step thinking mode and native function calling for agentic, multi-step workflows.
- Vision: Object detection, document/PDF parsing, UI/screen understanding, chart comprehension, multilingual OCR, and handwriting recognition.
- Multilingual & coding: Pretrained on 140+ languages with out-of-the-box support for 35+ languages; optimized for code generation, completion, and correction.
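To make the "8 active experts out of 128 (+1 shared)" routing concrete, here is a toy top-k MoE layer in TypeScript. It is a generic illustration under stated assumptions (softmax gating over the selected router logits, tiny vectors, experts as plain functions), not Gemma 4's actual implementation; all names are hypothetical.

```typescript
// Toy top-k Mixture-of-Experts layer, for illustration only.
// ASSUMPTION: softmax gating over the selected router logits; Gemma 4's real router may differ.

type Expert = (x: number[]) => number[];

// Indices of the k largest router logits, highest first.
function topK(logits: number[], k: number): number[] {
  return logits
    .map((value, index) => ({ value, index }))
    .sort((a, b) => b.value - a.value)
    .slice(0, k)
    .map((entry) => entry.index);
}

// Numerically stable softmax, so the selected gate weights sum to 1.
function softmax(values: number[]): number[] {
  const max = Math.max(...values);
  const exps = values.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// One MoE layer applied to a single token vector `x`.
// Only the shared expert and the k routed experts run; all other experts are skipped.
function moeLayer(
  x: number[],
  routerLogits: number[],
  experts: Expert[],
  sharedExpert: Expert,
  k = 8,
): { output: number[]; active: number[] } {
  const active = topK(routerLogits, k);
  const gates = softmax(active.map((i) => routerLogits[i]));
  const output = [...sharedExpert(x)]; // the shared expert contributes on every token
  active.forEach((expertIndex, j) => {
    const y = experts[expertIndex](x);
    for (let d = 0; d < output.length; d++) {
      output[d] += gates[j] * y[d];
    }
  });
  return { output, active };
}
```

With 128 routed experts and k = 8, only 9 expert networks execute per token in this toy, which is the sense in which only a fraction of the total parameters is active on each forward pass.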
Usage
- Workers AI binding: call via env.AI.run().
- REST API: the native /run endpoint or the OpenAI-compatible /v1/chat/completions endpoint.
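A minimal sketch of both access paths, written as a TypeScript Worker. The model ID `@cf/google/gemma-4-26b-a4b-it` is an assumption, as are the helper names (`buildMessages`, `runUrl`, `chatCompletionsUrl`, `chatCompletionsInit`) and the local `Env` type; check the model page for the exact ID and input schema before relying on it. The REST URLs follow the general Workers AI pattern under `https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/`.

```typescript
// Sketch: calling Gemma 4 26B A4B through the Workers AI binding and over REST.
// ASSUMPTION: MODEL_ID is a placeholder guess; copy the exact ID from the model page.
const MODEL_ID = "@cf/google/gemma-4-26b-a4b-it";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Minimal stand-in for the AI binding; real projects should use the `Ai` type
// generated by Wrangler instead.
interface Env {
  AI: { run(model: string, inputs: { messages: ChatMessage[] }): Promise<unknown> };
}

// Plain object accepted as the second argument of fetch().
interface PostInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildMessages(prompt: string): ChatMessage[] {
  return [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: prompt },
  ];
}

// Native REST endpoint: the model ID is part of the URL path.
function runUrl(accountId: string, model: string = MODEL_ID): string {
  return `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`;
}

// OpenAI-compatible endpoint: the model ID goes in the JSON body instead.
function chatCompletionsUrl(accountId: string): string {
  return `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/v1/chat/completions`;
}

function chatCompletionsInit(apiToken: string, prompt: string): PostInit {
  return {
    method: "POST",
    headers: { Authorization: `Bearer ${apiToken}`, "Content-Type": "application/json" },
    body: JSON.stringify({ model: MODEL_ID, messages: buildMessages(prompt) }),
  };
}

const worker = {
  // Worker entry point: sends the `q` query parameter to the model via env.AI.run().
  async fetch(request: Request, env: Env): Promise<Response> {
    const prompt =
      new URL(request.url).searchParams.get("q") ?? "Explain Mixture-of-Experts in one sentence.";
    const result = await env.AI.run(MODEL_ID, { messages: buildMessages(prompt) });
    return new Response(JSON.stringify(result), {
      headers: { "Content-Type": "application/json" },
    });
  },
};

export default worker;
```

Outside a Worker, `fetch(chatCompletionsUrl(accountId), chatCompletionsInit(apiToken, "Hello"))` would call the OpenAI-compatible endpoint, assuming an API token with Workers AI permissions; existing OpenAI client libraries can also be pointed at that base URL.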
Practical Notes
- For many use cases, expect latency and cost similar to a 4B dense model, with accuracy closer to larger dense models; evaluate on your own workloads before relying on this.
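To act on the advice to evaluate on your own workloads, a small timing helper can compare median latency across models or prompts. This is a generic sketch, not a Workers AI API; `medianLatencyMs` is a hypothetical helper name, and it measures latency only, so output quality still needs its own evaluation.

```typescript
// Sketch: median wall-clock latency of an async call, for comparing models on your own prompts.
// `call` is any async function, for example a wrapper around env.AI.run() or a REST fetch.
async function medianLatencyMs(call: () => Promise<unknown>, runs = 5): Promise<number> {
  if (!Number.isInteger(runs) || runs < 1) {
    throw new RangeError("runs must be a positive integer");
  }
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await call(); // sequential, so requests do not compete with each other
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  // Odd count: middle sample; even count: mean of the two middle samples.
  return samples.length % 2 === 1 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2;
}
```

The median is used rather than the mean so that a single slow cold start does not dominate the comparison.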
- See the Gemma 4 26B A4B model page for model specifics, limits, and best practices.