Advancing voice intelligence with new models in the API
Key Points
- GPT‑Realtime‑2 with GPT‑5‑class reasoning
- Live translation: 70+ input → 13 output languages
- 128K context, parallel tool calls
Summary
This release adds three realtime audio models to the API. Together they let interactive voice agents reason, translate, transcribe, and take actions while people speak:
- GPT-Realtime-2 — GPT‑5‑class reasoning for live conversations, better context handling, and controllable tone/delivery.
- GPT-Realtime-Translate — live translation from 70+ input languages into 13 output languages.
- GPT-Realtime-Whisper — streaming speech-to-text that transcribes as users talk.
Key details
- Longer context window: 128K tokens (up from 32K) to support extended agentic workflows and multi-step tasks.
- Realtime agent features: preambles (audible “checking…” prompts), parallel tool calls, and tool transparency so actions can be announced while processing.
- Improved robustness: stronger recovery behavior for interruptions or failures and better retention of domain-specific vocabulary and proper nouns.
- Controllable behavior: adjustable reasoning effort with five levels (minimal, low, medium, high, and xhigh, with low as the default) to balance latency against deliberation.
- Evaluation gains: compared with the prior Realtime-1.5, GPT-Realtime-2 scores +15.2% on Big Bench Audio at high reasoning effort and +13.8% on Audio MultiChallenge at xhigh reasoning effort.
- Primary patterns enabled: voice-to-action (agents that act), systems-to-voice (contextual spoken guidance), and voice-to-voice (live translation and multilingual conversation).
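The reasoning effort levels listed above can be treated as an ordered scale from fastest to most deliberate. The sketch below is one possible way to choose a level per turn. The function name `pick_effort`, the 0-4 complexity score, and the rule of capping latency-sensitive turns at the default are all assumptions made for illustration and are not part of the API.

```python
from typing import Literal

Effort = Literal["minimal", "low", "medium", "high", "xhigh"]

# Levels in the order the release lists them, assumed to run from
# fastest to most deliberate.
EFFORT_LEVELS: tuple[Effort, ...] = ("minimal", "low", "medium", "high", "xhigh")
DEFAULT_EFFORT: Effort = "low"


def pick_effort(task_complexity: int, latency_sensitive: bool) -> Effort:
    """Map a hypothetical 0-4 complexity score to a reasoning effort level.

    Latency-sensitive turns are capped at the default level ("low").
    """
    if not 0 <= task_complexity <= 4:
        raise ValueError("task_complexity must be between 0 and 4")
    effort_index = task_complexity
    if latency_sensitive:
        # Never exceed the default level when fast turn-taking matters.
        effort_index = min(effort_index, EFFORT_LEVELS.index(DEFAULT_EFFORT))
    return EFFORT_LEVELS[effort_index]
```

For example, `pick_effort(4, latency_sensitive=True)` stays at "low", while the same score without the latency constraint selects "xhigh".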
Engineering notes
- Choose the reasoning level from your latency budget: use low (the default) for fast turn-taking, and reserve medium, high, or xhigh for complex reasoning.
- Use preambles and tool transparency to improve perceived responsiveness during long-running calls.
- Design streaming integrations for progressive transcription and mid-response tool calls, and make tool handlers safe to run in parallel.
- Use the 128K context for long sessions or multi-step workflows; plan memory and token usage accordingly.
- Test domain terms and alphanumerics in realistic audio to validate improved retention and handling.
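The preamble and parallel tool call notes above can be sketched on the application side with plain `asyncio`. This is a minimal illustration, not the API's own interface: the tool functions, the `TOOLS` registry, the shape of each call (`name` plus `arguments`), and the `speak` callback are all hypothetical stand-ins for whatever your integration receives from the model.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical tool signature: an async function taking keyword arguments.
ToolFn = Callable[..., Awaitable[Any]]


async def lookup_order(order_id: str) -> dict:
    await asyncio.sleep(0.05)  # stands in for a slow backend call
    return {"order_id": order_id, "status": "shipped"}


async def check_inventory(sku: str) -> dict:
    await asyncio.sleep(0.05)  # stands in for a slow backend call
    return {"sku": sku, "in_stock": True}


# Hypothetical registry mapping tool names to implementations.
TOOLS: dict[str, ToolFn] = {
    "lookup_order": lookup_order,
    "check_inventory": check_inventory,
}


async def run_tool_calls(
    calls: list[dict], speak: Callable[[str], None]
) -> list[Any]:
    """Announce a preamble, then run all requested tools concurrently."""
    names = ", ".join(call["name"] for call in calls)
    speak(f"One moment, checking {names}.")  # preamble spoken before the wait
    tasks = [TOOLS[call["name"]](**call["arguments"]) for call in calls]
    # return_exceptions=True keeps one failed tool from discarding the others;
    # results come back in the same order as the calls.
    return await asyncio.gather(*tasks, return_exceptions=True)
```

Calling `asyncio.run(run_tool_calls(calls, print))` with two calls runs both tools at the same time, so the total wait is roughly that of the slowest tool rather than the sum.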
Where to start
- Try GPT-Realtime-2 for interactive voice agents that must reason and act in real time.
- Use GPT-Realtime-Translate for live cross-language voice experiences and GPT-Realtime-Whisper for low-latency speech-to-text.
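The starting points above amount to a small routing table from use case to model. The sketch below encodes that table. The `VoiceTask` enum and `pick_model` function are hypothetical names, and the strings are the display names used in this release, which may not match the identifiers the API expects.

```python
from enum import Enum


class VoiceTask(Enum):
    AGENT = "agent"            # reason and act during a live conversation
    TRANSLATE = "translate"    # live cross-language speech
    TRANSCRIBE = "transcribe"  # low-latency speech-to-text


# Display names from this release; real API identifiers may differ.
MODEL_FOR_TASK: dict[VoiceTask, str] = {
    VoiceTask.AGENT: "GPT-Realtime-2",
    VoiceTask.TRANSLATE: "GPT-Realtime-Translate",
    VoiceTask.TRANSCRIBE: "GPT-Realtime-Whisper",
}


def pick_model(task: VoiceTask) -> str:
    """Return the model suggested for a given kind of voice task."""
    return MODEL_FOR_TASK[task]
```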