Advancing voice intelligence with new models in the API
Key Points
- Three new realtime audio models with reasoning and translation
- Context window expanded to 128K tokens for complex conversations
- Live translation from 70+ input languages into 13 output languages, plus streaming transcription
Summary
OpenAI has released three new realtime audio models that enable developers to build sophisticated voice applications with reasoning, translation, and transcription capabilities. These models move beyond simple call-and-response interactions toward intelligent voice agents that can understand context, use tools, and take action during conversations.
Models
- GPT-Realtime-2: First voice model with GPT-5-class reasoning for complex requests, featuring adjustable reasoning levels (minimal to xhigh), preambles, parallel tool calls, and stronger recovery behavior
- GPT-Realtime-Translate: Live translation model that translates from 70+ input languages into 13 output languages, enabling real-time multilingual conversations
- GPT-Realtime-Whisper: Streaming speech-to-text model that transcribes speech as it is spoken
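To make the parallel tool calls and preambles mentioned for GPT-Realtime-2 more concrete, here is a minimal sketch of the application side of that pattern. It does not use the OpenAI API at all: the tools (check_inventory, get_shipping_estimate), the announce callback, and the spoken phrases are hypothetical stand-ins, and the sketch only assumes that an agent has requested two independent tool calls in a single turn.

```python
import asyncio

# Hypothetical local tools; real ones would call external services.
async def check_inventory(sku: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for network latency
    return {"sku": sku, "in_stock": True}

async def get_shipping_estimate(postcode: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for network latency
    return {"postcode": postcode, "days": 3}

async def run_tool_calls(calls, announce):
    """Run the requested tool calls concurrently and report status.

    calls: list of (async_function, argument) pairs.
    announce: callback that receives short status messages
              (assumed here to be turned into speech elsewhere).
    """
    announce("One moment while I check that for you.")  # preamble-style update
    # gather() runs the coroutines concurrently and keeps result order.
    results = await asyncio.gather(*(fn(arg) for fn, arg in calls))
    announce("Thanks for waiting, here is what I found.")
    return list(results)

def demo():
    spoken = []  # collects status messages instead of playing audio
    calls = [(check_inventory, "A-100"), (get_shipping_estimate, "94107")]
    results = asyncio.run(run_tool_calls(calls, spoken.append))
    return spoken, results
```

Because the two calls run concurrently, the user waits roughly as long as the slowest tool rather than the sum of both, and the status messages cover the silence in between.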
Capabilities
- Context window expanded from 32K to 128K tokens for longer, more coherent sessions
- Improved domain understanding for specialized terminology and proper nouns
- Controllable tone and delivery for appropriate emotional responses
- Parallel tool execution with audible status updates
- Better handling of corrections and interruptions
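As a rough illustration of what the larger context window means for session length, the sketch below keeps the most recent conversation turns that fit within a token budget. The figures are assumptions made for the example: 32K and 128K are treated as 32,000 and 128,000 tokens, and each turn is assumed to cost a flat 1,000 tokens. Real token counts depend on the model's tokenizer and on how audio is encoded, and trim_history is a hypothetical helper, not part of any SDK.

```python
from collections import deque

def trim_history(turns, budget):
    """Keep the most recent turns whose token counts fit within budget.

    turns: list of (text, token_count) tuples, oldest first.
    Returns (kept_turns_oldest_first, tokens_used).
    """
    kept = deque()
    used = 0
    # Walk backwards from the newest turn and stop at the first that overflows.
    for text, tokens in reversed(turns):
        if used + tokens > budget:
            break
        kept.appendleft((text, tokens))
        used += tokens
    return list(kept), used

# Assumed workload: 200 turns at 1,000 tokens each.
turns = [(f"turn {i}", 1_000) for i in range(200)]

kept_32k, used_32k = trim_history(turns, 32_000)
kept_128k, used_128k = trim_history(turns, 128_000)
```

Under these assumptions the larger budget retains four times as many turns (128 instead of 32), which is the kind of headroom that lets a long session keep earlier details in view.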
Performance
- GPT-Realtime-2 (high) scores 15.2% higher on Big Bench Audio for audio intelligence
- GPT-Realtime-2 (xhigh) scores 13.8% higher on Audio MultiChallenge for instruction following
Use Cases
Developers can build three kinds of applications: voice-to-action systems that reason through a request and act on it, systems-to-voice applications that offer proactive guidance, and voice-to-voice applications that provide multilingual support.