WebSocket mode for Responses API: 40% faster agentic workflows
Key Points
- Up to 40% faster agentic workflows
- WebSocket mode reuses in-memory response state
- Achieves 1,000+ tokens/sec for fast models
Summary
OpenAI reduced agent rollout latency by introducing a WebSocket mode in the Responses API that keeps a persistent, connection-scoped in-memory cache of previous response state. Instead of rebuilding full conversation context for each follow-up, clients can continue with response.create and previous_response_id while the server reuses cached artifacts (rendered tokens, tool defs, sampling state). The change removes redundant work, shortens validation paths, and preserves a familiar API shape for developers.
Key Points
- Persistent connection: WebSocket mode maintains a long-lived connection and caches previous response state in memory for the lifetime of that connection.
- Familiar integration: Clients continue to call response.create with previous_response_id; no major shape changes required for common integrations.
- Prototype events (used conceptually): sampling loop can emit response.done and accept response.append to resume after client tool execution; production kept the simpler response.create/previous_response_id flow.
- Cached state includes: previous response object, prior input/output items, tool definitions/namespaces, and reusable sampling artifacts (e.g., rendered tokens).
- Performance optimizations enabled: skip tokenization by reusing rendered tokens, run safety/validation on only new input, reduce network hops, reuse model routing, and overlap postinference work (billing) with subsequent requests.
- Results: engineers reported up to ~40% faster end-to-end agentic workflows; the platform reached ~1,000 TPS for GPT-5.3-Codex-Spark with bursts to 4,000 TPS in production.
Practical guidance for engineers
- When building agentic or tool-using workflows, prefer WebSocket mode to minimize per-turn API overhead and tokenization costs.
- Keep using response.create and pass previous_response_id for follow-ups so the server can fetch cached state.
- Design client-side tool calls to send the minimal new data back over the socket (e.g., results of a local tool run) to avoid revalidation of the full history.
- Expect quicker time-to-first-token and lower end-to-end latency, but validate safety and state-handling in your integration (e.g., reconnect/recovery semantics for long-lived connections).
Bottom line
WebSocket mode trades a persistent transport for much lower redundant processing per turn. It’s the recommended pattern for high-throughput, low-latency agentic workflows using the Responses API.