NVIDIA Nemotron 3 Super Model Now Available on Cloudflare Workers AI
Key Points
- 50% higher token generation throughput with hybrid Mamba-transformer architecture
- Multi-Token Prediction accelerates long-form text generation
- 32K token context window with prompt caching support
Summary
Cloudflare has partnered with NVIDIA to bring the Nemotron 3 Super model (@cf/nvidia/nemotron-3-120b-a12b) to Workers AI. This Mixture-of-Experts (MoE) model features a hybrid Mamba-transformer architecture with 120B total parameters and 12B active parameters per forward pass, optimized for multi-agent applications.
Key Features
- Hybrid Architecture: Mamba-transformer design delivers 50%+ higher token generation throughput compared to leading open models
- Tool Calling: Native support for building AI agents that can invoke tools across multiple conversation turns
- Multi-Token Prediction (MTP): Accelerates long-form text generation by predicting multiple future tokens simultaneously
- Large Context Window: 32,000 token context for maintaining conversation history and agent state
- Prompt Caching: Use the x-session-affinity header with a unique session ID for reduced latency and costs
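As a rough illustration of the prompt caching feature, the sketch below builds a REST request that sends the same x-session-affinity value on every turn of a conversation. The helper name buildSessionRequest, the ChatMessage type, and the placeholder account ID and token are hypothetical. The URL follows the usual Workers AI REST pattern and should be checked against the current API reference.

```typescript
// Hypothetical helper: builds (but does not send) a Workers AI REST request
// that carries a session ID in the x-session-affinity header.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

const MODEL = "@cf/nvidia/nemotron-3-120b-a12b";

function buildSessionRequest(
  accountId: string, // assumption: your Cloudflare account ID
  apiToken: string, // assumption: an API token with Workers AI access
  sessionId: string, // any unique ID, reused for every turn of one conversation
  messages: ChatMessage[],
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    // Assumed URL shape for the /run REST endpoint.
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${MODEL}`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
        // Reusing the same value across turns is what enables prompt caching.
        "x-session-affinity": sessionId,
      },
      body: JSON.stringify({ messages }),
    },
  };
}
```

A caller would pass the result straight to fetch, for example `const { url, init } = buildSessionRequest(accountId, token, "chat-42", history); const res = await fetch(url, init);`, keeping "chat-42" constant for the lifetime of that conversation.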
Access Methods
- Workers AI binding (env.AI.run())
- REST API endpoints (/run or /v1/chat/completions)
- OpenAI-compatible endpoint
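For the binding route, a minimal sketch of calling the model from inside a Worker might look like the following. The askNemotron function, the system prompt, and the hand-written AiBinding interface are illustrative assumptions. In a real project the binding type would normally come from Cloudflare's generated Worker types, and the shape of the returned value should be taken from the model's documentation rather than from this example.

```typescript
// Minimal structural type for the Workers AI binding, declared here only so
// the sketch is self-contained.
interface AiBinding {
  run(
    model: string,
    inputs: { messages: { role: string; content: string }[] },
  ): Promise<unknown>;
}

interface Env {
  AI: AiBinding; // assumption: the binding is named "AI" in the Worker config
}

const NEMOTRON_MODEL = "@cf/nvidia/nemotron-3-120b-a12b";

// Hypothetical helper: sends one user prompt to the model via env.AI.run().
async function askNemotron(env: Env, prompt: string): Promise<unknown> {
  return env.AI.run(NEMOTRON_MODEL, {
    messages: [
      { role: "system", content: "You are a concise assistant." },
      { role: "user", content: prompt },
    ],
  });
}
```

Inside a Worker's fetch handler, this could be used as `return Response.json(await askNemotron(env, "Summarize the release notes"));`.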