Workers AI - NVIDIA Nemotron 3 Super now available on Workers AI
Summary
NVIDIA Nemotron 3 Super is now available on Workers AI. It's a Mixture-of-Experts (MoE) model with a hybrid Mamba-transformer architecture (120B total parameters, 12B active per forward pass) optimized for multi-agent and multi-turn applications that require high throughput, long contexts, tool calling, and fast long-form generation.
Key Points
- Architecture: hybrid Mamba-transformer MoE with 120B total parameters, of which 12B are active per forward pass.
- Throughput: over 50% higher token-generation throughput than leading open models, which lowers latency in real-world applications.
- Context: 32,000 token context window for long conversations, plan states, and agent workflows.
- Multi-Token Prediction (MTP): predicts several future tokens per forward pass to accelerate long-form generation.
- Tool calling: supports multi-turn tool invocation for agent-style workflows.
- Prompt caching: use the `x-session-affinity` header with a unique session ID to route requests to the same model instance and reduce latency and cost.
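The multi-turn tool calling described above can be sketched as a small loop: send the history, run any tools the model asks for, append the results, and repeat until the model answers without requesting a tool. This is a minimal sketch, not the documented integration. It assumes the model returns tool calls in the OpenAI chat-completions shape, and `runToolLoop`, `ModelFn`, and `ToolFn` are hypothetical names introduced only for illustration.

```typescript
// Message shapes follow the OpenAI chat-completions convention (an assumption here).
type ToolCall = {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
};

type Message =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content: string; tool_calls?: ToolCall[] }
  | { role: "tool"; tool_call_id: string; content: string };

type AssistantMessage = Extract<Message, { role: "assistant" }>;

// Hypothetical: any function that sends the history to the model
// and resolves with the next assistant message.
type ModelFn = (history: Message[]) => Promise<AssistantMessage>;

// Hypothetical: a locally implemented tool that returns its result as text.
type ToolFn = (args: Record<string, unknown>) => Promise<string>;

async function runToolLoop(
  callModel: ModelFn,
  tools: Record<string, ToolFn>,
  history: Message[],
  maxTurns = 5,
): Promise<Message[]> {
  const messages: Message[] = [...history];
  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await callModel(messages);
    messages.push(reply);

    const calls = reply.tool_calls ?? [];
    if (calls.length === 0) return messages; // no tool requested: final answer

    // Run every requested tool and feed each result back as a "tool" message.
    for (const call of calls) {
      const tool = tools[call.function.name];
      const result = tool
        ? await tool(JSON.parse(call.function.arguments))
        : `unknown tool: ${call.function.name}`;
      messages.push({ role: "tool", tool_call_id: call.id, content: result });
    }
  }
  return messages; // turn budget exhausted; the caller decides what to do next
}
```

The `maxTurns` cap is a design choice to stop a misbehaving agent from looping indefinitely. Because every turn resends the growing history, this pattern is where the context window and prompt caching matter most.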
Usage
- Workers AI binding: call via `env.AI.run()` from your Worker.
- REST API: call `/run` or `/v1/chat/completions`.
- OpenAI-compatible endpoint also supported; see the Nemotron 3 Super model page for details.
- For optimal multi-turn performance, include the `x-session-affinity` header to enable prompt caching and session affinity.
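The two call paths above might look like the following sketch. It makes several assumptions: the full REST URL follows the usual Workers AI account-scoped pattern, `MODEL_ID` is a placeholder to be replaced with the ID from the model page, and `buildChatRequest`, `askViaBinding`, and the `AiBinding` interface are hypothetical helpers declared locally so the example stays self-contained.

```typescript
// Placeholder: copy the real model ID from the Nemotron 3 Super model page.
const MODEL_ID = "<nemotron-3-super-model-id>";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PreparedRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Builds (but does not send) a request to the OpenAI-compatible endpoint.
// The URL pattern is an assumption based on the standard Workers AI REST layout.
function buildChatRequest(
  accountId: string,
  apiToken: string,
  sessionId: string,
  messages: ChatMessage[],
): PreparedRequest {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${apiToken}`,
        "Content-Type": "application/json",
        // Same session ID on every turn keeps requests on one model instance.
        "x-session-affinity": sessionId,
      },
      body: JSON.stringify({ model: MODEL_ID, messages }),
    },
  };
}

// Minimal local stand-in for the Workers AI binding (env.AI) so this compiles
// outside a Worker; in a real Worker the binding type comes from the runtime.
interface AiBinding {
  run(model: string, inputs: { messages: ChatMessage[] }): Promise<unknown>;
}

// Calls the model through the binding, e.g. askViaBinding(env.AI, messages).
function askViaBinding(ai: AiBinding, messages: ChatMessage[]): Promise<unknown> {
  return ai.run(MODEL_ID, { messages });
}
```

To send the REST request, pass the result to `fetch`, for example `fetch(req.url, req.init)`. How the session header is supplied when calling through the binding is not covered here; check the Workers AI documentation for that path.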
Practical Recommendations
- For agent systems, pair tool calling with the model's MTP-accelerated long-form generation to reduce overall latency.
- Enable prompt caching for multi-turn flows to lower inference cost and response times.
- Monitor instance affinity and adjust session IDs to balance cache locality vs. horizontal scalability.
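One way to approach the cache-locality trade-off above is to derive the session ID from the conversation rather than from the user or the whole application: turns in one conversation share a cache, while separate conversations can spread across instances. The sketch below is an illustration under assumptions, not a documented scheme. `sessionIdFor`, the `ses_` prefix, and the `epoch` parameter are hypothetical, and the 32-bit hash can collide for large numbers of conversations.

```typescript
// FNV-1a (32-bit) over UTF-16 code units; returns 8 lowercase hex characters.
// Used only to get a short, stable ID that does not expose the raw conversation ID.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash.toString(16).padStart(8, "0");
}

// Hypothetical helper: one stable session ID per conversation.
// Bumping `epoch` yields a new ID for the same conversation, which is one way
// to "adjust" affinity if a conversation should stop pinning to its old session.
function sessionIdFor(conversationId: string, epoch = 0): string {
  return `ses_${fnv1a(`${conversationId}:${epoch}`)}`;
}
```

The resulting string would be sent as the `x-session-affinity` header on every turn of that conversation. Using a broader key (for example, one ID per tenant) would concentrate more traffic on a single session, while a narrower key (one ID per request) would defeat caching entirely.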
Links
- Refer to the Nemotron 3 Super model page and Workers AI documentation for API details, quotas, and example integrations.