Gemini 3.1 Flash-Lite: fast, low-cost inference for high-volume workloads
Key Points
- Preview via Gemini API and Vertex AI
- Low cost: $0.25/1M input, $1.50/1M output
- 2.5× faster time to first answer token (TTFAT) and 45% faster output than 2.5 Flash
Summary
Gemini 3.1 Flash-Lite is a new Gemini 3 series model, currently in preview, aimed at high-volume developer and enterprise workloads. It is available to developers via the Gemini API in Google AI Studio and to enterprises via Vertex AI. Flash-Lite targets low-latency, cost-sensitive inference, with multimodal capabilities and configurable "thinking levels" that trade speed against depth of reasoning.
Key Points
- Availability: preview in Gemini API (AI Studio) and enterprise access via Vertex AI.
- Pricing: $0.25 per 1M input tokens and $1.50 per 1M output tokens, aimed at high-throughput use.
- Performance: 2.5× faster time to first answer token and 45% faster output speed than Gemini 2.5 Flash, per the Artificial Analysis benchmark; Elo score of 1432 on the Arena.ai leaderboard.
- Benchmarks: 86.9% on GPQA Diamond, 76.8% on MMMU Pro; matches or exceeds prior Gemini models of similar tier.
- Controls: built-in thinking levels let engineers tune latency vs. depth of reasoning for high-frequency workflows.
- Strengths: suitable for translation, content moderation, UI/dashboard generation, simulations, bulk multimodal content analysis, and real-time experiences.
- Early adopters: companies like Latitude, Cartwheel, and Whering report strong efficiency and reasoning for complex inputs.
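To make the pricing concrete, the listed per-token rates can be turned into a rough budget estimate. The sketch below uses only the two prices quoted above; the workload (one million moderation requests per day at 400 input and 20 output tokens each) is a hypothetical illustration, not a figure from the announcement, and it ignores any discounts, caching, or tiering that may apply.

```python
# Prices from the Key Points above, in USD per 1M tokens.
INPUT_USD_PER_M = 0.25
OUTPUT_USD_PER_M = 1.50


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD for a given number of input and output tokens."""
    input_cost = (input_tokens / 1_000_000) * INPUT_USD_PER_M
    output_cost = (output_tokens / 1_000_000) * OUTPUT_USD_PER_M
    return input_cost + output_cost


# Hypothetical workload (assumed numbers, for illustration only).
requests_per_day = 1_000_000
input_tokens_per_request = 400
output_tokens_per_request = 20

daily = estimate_cost_usd(
    requests_per_day * input_tokens_per_request,
    requests_per_day * output_tokens_per_request,
)
print(f"${daily:.2f} per day")  # prints "$130.00 per day"
```

Because output tokens cost six times as much as input tokens, workloads with short outputs (classification, moderation labels) benefit most from this pricing.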
Recommendations for engineers
- Choose Flash-Lite for high-volume, low-cost inference (real-time UX, streaming moderation, bulk translation).
- Increase thinking level when tasks require deeper reasoning (UI generation, multi-step agents, simulations); lower it for high-frequency, latency-sensitive workloads.
- Validate output quality on your own task-specific evaluation set, in the spirit of benchmarks such as GPQA and MMMU, before full rollout.
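The last two recommendations can be combined: evaluate each thinking level on a small labeled set and keep the lowest level that meets your quality bar. The following is a minimal sketch under stated assumptions. The level names in `LEVELS`, the `Example` type, and the `model(prompt, level)` callable are all hypothetical stand-ins, not the Gemini API; in practice `model` would wrap a real API call and the level values should be taken from the official reference.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Assumed level names, ordered fastest first. Check the API reference
# for the values the model actually accepts.
LEVELS = ("minimal", "low", "medium", "high")


@dataclass
class Example:
    prompt: str
    expected: str


def accuracy(
    model: Callable[[str, str], str], level: str, examples: Sequence[Example]
) -> float:
    """Fraction of examples whose output exactly matches the expected answer."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(model(ex.prompt, level).strip() == ex.expected for ex in examples)
    return hits / len(examples)


def lowest_passing_level(
    model: Callable[[str, str], str],
    examples: Sequence[Example],
    threshold: float,
) -> Optional[str]:
    """Return the first (fastest) level whose accuracy meets the threshold."""
    for level in LEVELS:
        if accuracy(model, level, examples) >= threshold:
            return level
    return None


def fake_model(prompt: str, level: str) -> str:
    """Stand-in for a real API call: 'hard' prompts need 'medium' or above."""
    if prompt.startswith("hard:") and LEVELS.index(level) < LEVELS.index("medium"):
        return "unknown"
    return prompt.split(":", 1)[1].strip()


examples = [Example("easy: spam", "spam"), Example("hard: ok", "ok")]
chosen = lowest_passing_level(fake_model, examples, threshold=1.0)
print(chosen)  # prints "medium"
```

Exact-match scoring suits classification-style tasks such as moderation; generative tasks like UI generation would need a different scoring function, and a real harness would also record latency per level.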