Moonshot AI Kimi K2.5 now available on Workers AI
Key Points
- 256k-token context window
- prefix caching with discounted cached tokens
- pull-based async batch API (queueRequest: true)
Summary
Moonshot AI Kimi K2.5 (@cf/moonshotai/kimi-k2.5) is now available on Workers AI. It's a frontier-scale open-source model with a full 256k token context window, multi-turn tool calling, vision inputs, structured JSON outputs, function calling, and prefix caching. The model runs on Cloudflare's Developer Platform so you can operate the full agent lifecycle (inference, tools, and orchestration) in one place with lower cost and high throughput.
Key Points
- 256,000-token context window to retain full conversation history, tool definitions, and large codebases across long-running agent sessions.
- Multi-turn tool calling and function calling to build agents that invoke external tools/APIs across multiple turns.
- Vision inputs to process images alongside text and return structured outputs.
- Structured outputs with JSON mode and JSON Schema support for reliable downstream parsing.
- Prefix caching + session affinity: shared session context is cached to reduce prefill work, improving Time To First Token (TTFT) and Tokens Per Second (TPS); cached tokens are now surfaced as a usage metric and billed at a discount vs input tokens.
- Redesigned asynchronous, pull-based Batch API: submit batches with queueRequest: true, receive a request_id, and poll or use event notifications; internal tests typically complete async requests within ~5 minutes (depends on live traffic).
- Access routes: Workers AI binding (env.AI.run()), REST endpoints (/run or /v1/chat/completions), AI Gateway, or OpenAI-compatible endpoint. Session affinity can be set with the x-session-affinity header.
Practical notes for engineers
- Use prefix caching and session affinity for long-lived agents to reduce compute and cost.
- For high-volume non-real-time workloads (code scanning, research agents), use the async batch API to avoid capacity errors.
- Check the model page, pricing, and prompt-caching docs for token billing details and cached-token discounts before production rollout.