AI Search - New Workers AI models for text generation and embedding
Key Points
- Four new Workers AI models added to AI Search
- GLM-4.7-Flash supports a 131,072-token context window
- Embeddings: qwen3-embedding-0.6b accepts up to 4,096 input tokens; embeddinggemma-300m targets low-latency workloads
Summary
Cloudflare AI Search now includes four additional Workers AI models for text generation and embeddings. These models run on Workers AI, so no external provider keys are required, and they are available when creating or updating an AI Search instance through the dashboard or the API. The additions target long-context text generation, higher-capacity embeddings for indexing longer chunks, and low-latency embedding workloads.
Details
- New text-generation models:
  - @cf/zai-org/glm-4.7-flash: GLM-4.7-Flash, with a 131,072-token context window; well suited to long-document summarization and retrieval tasks.
  - @cf/qwen/qwen3-30b-a3b-fp8: Qwen3-30B-A3B, a mixture-of-experts (MoE) model that activates about 3B parameters per forward pass for faster inference while keeping strong response quality; 32,000-token context window.
- New embedding models:
  - @cf/qwen/qwen3-embedding-0.6b: produces 1,024-dimensional vectors and accepts up to 4,096 input tokens; suited to indexing longer text chunks. Cosine similarity is recommended.
  - @cf/google/embeddinggemma-300m: produces 768-dimensional vectors and is optimized for low-latency embedding workloads. Cosine similarity is recommended.
- Operational notes for engineers:
  - No additional provider keys are required; the models run on Workers AI.
  - Select these models in the AI Search dashboard, or specify them through the API when creating or updating an AI Search instance.
  - Use the high-context models for long-document summarization and retrieval; choose an embedding model by its vector size and input-token limit, based on your indexing chunk size and latency requirements.
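The selection rule above can be sketched in TypeScript. This is a minimal sketch, not part of the AI Search or Workers AI API: the limits are copied from this announcement, and `estimateTokens`, `pickGenerationModel`, and `chunkByTokenBudget` are hypothetical helpers. Token counts use an assumed heuristic of roughly four characters per token; each model's real tokenizer will differ, so the chunker defaults to a budget below the 4,096-token limit.

```typescript
// Limits are taken from the announcement; confirm them in the Workers AI
// model catalog before relying on them.
const GENERATION_MODELS = [
  // Listed fastest-first: the MoE model activates ~3B parameters per pass.
  { id: "@cf/qwen/qwen3-30b-a3b-fp8", contextTokens: 32_000 },
  { id: "@cf/zai-org/glm-4.7-flash", contextTokens: 131_072 },
] as const;

// qwen3-embedding-0.6b accepts up to 4,096 input tokens; the default budget
// leaves headroom because estimateTokens is only an approximation.
const DEFAULT_EMBEDDING_BUDGET = 3_500;

// ASSUMPTION: ~4 characters per token. Hypothetical heuristic, not a tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical helper: return the first (fastest) model whose context window
// holds the prompt plus the tokens reserved for the answer, or null if none do.
function pickGenerationModel(
  promptTokens: number,
  reservedOutputTokens: number,
): string | null {
  const needed = promptTokens + reservedOutputTokens;
  for (const model of GENERATION_MODELS) {
    if (needed <= model.contextTokens) return model.id;
  }
  return null; // too large for either model: chunk or summarize first
}

// Hypothetical helper: greedily pack whitespace-separated words into chunks
// whose estimated token count stays within maxTokens. A single word longer
// than the budget becomes its own oversized chunk.
function chunkByTokenBudget(
  text: string,
  maxTokens: number = DEFAULT_EMBEDDING_BUDGET,
): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const word of text.split(/\s+/).filter((w) => w.length > 0)) {
    const candidate = current ? `${current} ${word}` : word;
    if (current && estimateTokens(candidate) > maxTokens) {
      chunks.push(current); // current chunk is full; start a new one
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

For example, `pickGenerationModel(100_000, 4_000)` returns `@cf/zai-org/glm-4.7-flash`, because 104,000 tokens exceed the Qwen3 model's 32,000-token window but fit within GLM-4.7-Flash's 131,072.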
Actionable recommendations
- For long documents or RAG flows, prefer @cf/zai-org/glm-4.7-flash (131,072 tokens) or @cf/qwen/qwen3-30b-a3b-fp8 (32,000 tokens), depending on the latency/quality tradeoff.
- For embedding large text chunks, use @cf/qwen/qwen3-embedding-0.6b (4,096 input tokens, 1,024-d) to reduce chunking; use @cf/google/embeddinggemma-300m for lower latency and smaller (768-d) vectors.
- Test cosine similarity for nearest-neighbor search, and tune chunk sizes against each model's token limit.
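The cosine-similarity test recommended above can be sketched as a brute-force nearest-neighbor search in TypeScript. This is a hypothetical evaluation helper, assuming you already have query and chunk vectors produced by the same embedding model; `cosineSimilarity`, `topK`, and the `{ id, vector }` chunk shape are illustrative names, not an AI Search or Workers AI API.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Both vectors must come from the
// same embedding model (1,024-d for qwen3-embedding-0.6b, 768-d for
// embeddinggemma-300m); vectors from different models are not comparable.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: no direction to compare
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface ScoredChunk {
  id: string;
  score: number;
}

// Hypothetical helper: score every stored chunk against the query vector and
// return the k highest-scoring chunks, best first.
function topK(
  query: number[],
  chunks: { id: string; vector: number[] }[],
  k: number,
): ScoredChunk[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

With toy 2-d vectors, `topK([1, 0], [{ id: "a", vector: [1, 0] }, { id: "b", vector: [0, 1] }, { id: "c", vector: [0.7, 0.7] }], 2)` ranks `a` (score 1) ahead of `c` (about 0.707). Brute force costs O(n·d) per query, which is fine for spot-checking chunk sizes offline but is not a substitute for a vector index.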
Reference
Published: 2026-04-08
Models available in Workers AI: @cf/zai-org/glm-4.7-flash, @cf/qwen/qwen3-30b-a3b-fp8, @cf/qwen/qwen3-embedding-0.6b, @cf/google/embeddinggemma-300m