WebSocket Support in Responses API Accelerates Agentic Workflows by 40%
Key Points
- 40% faster agentic workflows with persistent WebSocket connections
- 1,000+ tokens/second throughput for GPT-5.3-Codex-Spark
- Backward-compatible API using previous_response_id parameter
Summary
OpenAI has implemented WebSocket support in the Responses API to eliminate latency bottlenecks in agentic workflows. By maintaining persistent connections and caching conversation state, the API now achieves 40% faster end-to-end performance, enabling GPT-5.3-Codex-Spark to reach its 1,000+ tokens-per-second target.
Key Points
- Persistent Connection Architecture: Replaced synchronous HTTP requests with WebSocket connections that maintain in-memory state, eliminating redundant processing of full conversation history on each request
- Familiar Developer API: Implemented the previous_response_id parameter to maintain backward compatibility; developers use the same response.create interface without rewriting integrations
- Optimized State Caching: Connection-scoped cache stores previous response objects, input/output items, tool definitions, and rendered tokens, enabling incremental processing of only new information
- Measurable Impact: Codex ramped the majority of its traffic to WebSocket mode; Vercel AI SDK saw a 40% latency reduction; Cline workflows improved 39%; the Cursor integration achieved 30% faster performance
- Production Results: Achieved a baseline of 1,000 tokens per second (TPS) with bursts up to 4,000 TPS on GPT-5.3-Codex-Spark, demonstrating the API can keep pace with faster inference hardware
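The connection-scoped caching described in the list above can be illustrated with a small in-memory model. This is a minimal sketch, not OpenAI's implementation: the ConnectionState class, its create method, the resp_N id format, and the items_processed counter are all hypothetical names chosen for illustration. Only the previous_response_id parameter comes from the text. The point it shows is that a follow-up request carries only the new items, while earlier history is looked up from state held for the lifetime of the connection.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ConnectionState:
    """Hypothetical per-connection cache; it lives as long as the socket does."""

    # response id -> every item (inputs and outputs) seen up to that response
    history: dict[str, list[str]] = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))
    items_processed: int = 0  # counts items actually processed, to show the saving

    def create(self, new_items, previous_response_id=None):
        # Look up prior context from the cache instead of receiving it again.
        prior = [] if previous_response_id is None else self.history[previous_response_id]

        # Only the new items need processing; prior items were handled earlier.
        self.items_processed += len(new_items)

        # Stand-in for model sampling (assumption: a real server samples tokens here).
        output = f"reply to {new_items[-1]}"

        response_id = f"resp_{next(self._ids)}"
        self.history[response_id] = prior + list(new_items) + [output]
        return response_id, output


conn = ConnectionState()
first_id, _ = conn.create(["system prompt", "list the files"])
second_id, reply = conn.create(["now open README.md"], previous_response_id=first_id)
print(reply)                 # reply to now open README.md
print(conn.items_processed)  # 3
```

In this toy model the second call processes one item rather than re-reading all four items of accumulated history, which is the incremental behavior the caching bullet describes.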
Technical Approach
OpenAI selected WebSockets over gRPC bidirectional streaming for simplicity and minimal architectural disruption. The implementation treats local tool execution the same way as hosted tool calls: the inference loop blocks after sampling a tool call, sends it to the client over the WebSocket, and resumes when the client returns the result.
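The block-and-resume pattern described above can be sketched without any network code by letting a Python generator stand in for the socket: each yield plays the role of sending a message to the client and waiting for its reply. Everything here is an assumption made for illustration, including the inference_loop and run_client functions, the event dictionaries, and the read_file tool; none of it reflects the actual wire format.

```python
def inference_loop(prompt):
    """Hypothetical server-side loop; `yield` stands in for blocking on the socket."""
    # Assumption: sampling has just produced a tool call for a client-side tool.
    tool_call = {"type": "tool_call", "name": "read_file",
                 "arguments": {"path": "README.md"}}

    # Send the tool call to the client and pause until the result comes back.
    result = yield tool_call

    # Resume sampling with the tool result available in context.
    yield {"type": "message",
           "text": f"{prompt}: file has {len(result)} characters"}


def run_client(prompt, tools):
    """Hypothetical client: runs local tools and returns results on the same connection."""
    loop = inference_loop(prompt)
    event = next(loop)  # start the loop and receive the first event
    while event["type"] == "tool_call":
        result = tools[event["name"]](**event["arguments"])  # local execution
        event = loop.send(result)  # return the result; the loop resumes
    return event["text"]


# A fake local tool so the sketch runs without touching the filesystem.
local_tools = {"read_file": lambda path: "hello"}
print(run_client("summary", local_tools))  # summary: file has 5 characters
```

The design point is that the server keeps its loop state alive across the tool call, so the client's result resumes an in-flight turn instead of starting a fresh request that resends the full history.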