How to Build a Voice Agent with GPT-Realtime-2 (May 2026)
OpenAI’s Realtime API went GA on May 8, 2026 with three new models. GPT-Realtime-2 is the one to use for voice agents that need to reason and act. Here’s a step-by-step build guide for production voice apps in May 2026.
Last verified: May 11, 2026
What you’re building
A voice agent that:
- Listens to a user speak (microphone in)
- Reasons about the request (GPT-5-class model)
- Optionally calls tools (DB queries, API lookups, calendar checks)
- Responds in natural speech (voice out)
- Handles interruptions and turn-taking gracefully
GPT-Realtime-2 is the right model for this. GPT-Realtime-Whisper handles transcription only. GPT-Realtime-Translate handles cross-language voice. GPT-Realtime-2 is the reasoning voice agent.
Step 1: Prerequisites
- OpenAI API key with Realtime API access (GA as of May 8, 2026 — no allowlist required)
- A backend you can host (Node, Python, Go — any language with WebSocket/WebRTC libs)
- A frontend (browser, mobile app, or native client) that can capture and play audio
For browser-based agents, WebRTC is the lower-latency path. For server-to-server voice (telephony, IVR, call-center), WebSocket is the standard route.
Step 2: Open the Realtime connection
WebRTC (browser, lowest latency)
// Get an ephemeral session token from your backend
const tokenRes = await fetch('/api/realtime/session');
const { client_secret } = await tokenRes.json();

// Open a WebRTC peer connection to OpenAI
const pc = new RTCPeerConnection();
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
stream.getTracks().forEach(t => pc.addTrack(t, stream));

// Receive remote audio (the agent's voice)
pc.ontrack = (e) => {
  const audio = document.querySelector('#agent-audio');
  audio.srcObject = e.streams[0];
};

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const sdpRes = await fetch(
  'https://api.openai.com/v1/realtime?model=gpt-realtime-2',
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${client_secret.value}`,
      'Content-Type': 'application/sdp',
    },
    body: offer.sdp,
  }
);

await pc.setRemoteDescription({ type: 'answer', sdp: await sdpRes.text() });
The browser is now streaming microphone audio to OpenAI and playing the agent’s voice back. Sub-300ms round-trip is achievable when the user is regionally close to OpenAI’s endpoints.
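The /api/realtime/session route in that snippet is your own backend endpoint: it mints an ephemeral token so the browser never sees your real API key. A minimal Express sketch, assuming the GA API keeps the beta-era POST /v1/realtime/sessions shape and client_secret response field (verify against the current docs):

import express from 'express';

const app = express();

// Mint a short-lived client token and hand it to the browser.
app.get('/api/realtime/session', async (_req, res) => {
  const r = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: 'gpt-realtime-2', voice: 'alloy' }),
  });
  res.json(await r.json()); // contains the client_secret.value used in the browser fetch above
});

app.listen(3000);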
WebSocket (server-side, telephony)
import WebSocket from 'ws';

const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-realtime-2',
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v1',
    },
  }
);

ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      voice: 'alloy',
      instructions: 'You are a helpful voice assistant.',
      modalities: ['text', 'audio'],
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16',
    },
  }));
});
Then stream audio frames over the WebSocket as input_audio_buffer.append events.
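Each append event carries a base64-encoded chunk of PCM16 audio. A minimal sketch, assuming your telephony stack already hands you 24kHz mono PCM16 frames as Buffers and that the beta-era event names carry over to GA (sendAudioFrame is an illustrative name, not part of the API):

// Forward raw PCM16 frames from your telephony stack to the session.
function sendAudioFrame(ws, pcm16Frame) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: pcm16Frame.toString('base64'),
  }));
}

// With server-side voice activity detection enabled (the default), the API
// detects the end of the user's turn and responds on its own. Without it,
// commit the buffer and request a response explicitly:
// ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
// ws.send(JSON.stringify({ type: 'response.create' }));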
Step 3: Add tools
The big upgrade in GPT-Realtime-2 is full function calling mid-conversation. Define tools in session.update:
session: {
  voice: 'alloy',
  instructions: 'You are a customer support agent for Andrew.ooo.',
  tools: [
    {
      type: 'function',
      name: 'lookup_order',
      description: 'Look up an order by ID',
      parameters: {
        type: 'object',
        properties: {
          order_id: { type: 'string', description: 'The order ID' },
        },
        required: ['order_id'],
      },
    },
  ],
  tool_choice: 'auto',
}
When the model calls lookup_order, you receive a response.function_call_arguments.done event. Execute the function and send the result back:
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'function_call_output',
    call_id: event.call_id,
    output: JSON.stringify({ status: 'shipped', tracking: 'ABC123' }),
  },
}));

ws.send(JSON.stringify({ type: 'response.create' }));
The agent will incorporate the result into its spoken response.
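The receiving side of that exchange is a message handler that parses the arguments, runs your function, and then sends the output item shown above. A sketch, assuming the beta-era event names carry over to GA; lookupOrder stands in for your own DB or API lookup:

ws.on('message', async (raw) => {
  const event = JSON.parse(raw.toString());

  if (event.type === 'response.function_call_arguments.done' &&
      event.name === 'lookup_order') {
    // Arguments arrive as a JSON string once the model finishes the call.
    const { order_id } = JSON.parse(event.arguments);
    const result = await lookupOrder(order_id); // your own handler; keep it fast

    // Return the result and ask the model to continue speaking.
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));
    ws.send(JSON.stringify({ type: 'response.create' }));
  }
});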
Step 4: Handle interruptions (barge-in)
Users will interrupt. When the user starts speaking while the agent is talking, you must:
- Detect speech start (the API emits input_audio_buffer.speech_started).
- Cancel the in-progress response:
ws.send(JSON.stringify({ type: 'response.cancel' }));
- Truncate the in-progress audio output on the client to stop playback immediately.
- The model will pick up the user’s new turn.
This is the difference between a voice agent that feels human and one that feels robotic.
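A minimal server-side sketch of that sequence, assuming the beta-era event names carry over to GA; stopLocalPlayback is a placeholder for however your client flushes its audio queue:

ws.on('message', (raw) => {
  const event = JSON.parse(raw.toString());

  if (event.type === 'input_audio_buffer.speech_started') {
    // The user started talking over the agent: stop generating...
    ws.send(JSON.stringify({ type: 'response.cancel' }));

    // ...and drop any agent audio the client has buffered but not yet played.
    stopLocalPlayback(); // placeholder for your client-side playback control
  }
});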
Step 5: Tune for production
Short system prompts. Every byte adds latency on every turn. Keep instructions tight.
Tool handlers must be fast. Pre-warm DB connections, cache common lookups, return partial results when possible.
Audio format matters. PCM16 at 24kHz is the standard. Don’t transcode unnecessarily.
Voice selection. Pick the voice that fits the brand. Alloy, Echo, Shimmer, etc. — pick once, stay consistent.
Region. Run your backend in the same cloud region as OpenAI’s endpoint (US East / US West typically). Cross-region adds 50-150ms per round-trip.
Logging. Capture session IDs, turn IDs, tool calls, and latency timings. Voice agent debugging without logs is brutal.
Cost monitoring. Realtime API bills audio in + audio out + model inference. Set per-session caps. Long monologues from confused users will burn through budget fast.
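A per-session cap can be as blunt as a wall-clock timer that ends the session once the budget is spent. A sketch, with the 10-minute limit chosen purely as an example:

// Hard stop for runaway sessions: cancel any in-flight response and close.
const MAX_SESSION_MS = 10 * 60 * 1000; // example budget; tune per product

const sessionTimer = setTimeout(() => {
  ws.send(JSON.stringify({ type: 'response.cancel' }));
  ws.close();
}, MAX_SESSION_MS);

ws.on('close', () => clearTimeout(sessionTimer));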
Step 6: Route by job
Not every voice workload needs GPT-Realtime-2’s reasoning. Route by job:
- Live captions or transcripts? → gpt-realtime-whisper (cheaper)
- Cross-language voice? → gpt-realtime-translate (purpose-built)
- Reasoning + tools + conversation? → gpt-realtime-2
A typical multi-capability app uses all three behind a routing layer.
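In practice the routing layer can be as small as picking the model string before opening the connection. A hypothetical sketch (the job labels are illustrative, not API values):

// Pick the Realtime model per job before opening the session.
function realtimeModelFor(job) {
  switch (job) {
    case 'captions':  return 'gpt-realtime-whisper';   // transcription only
    case 'translate': return 'gpt-realtime-translate'; // cross-language voice
    default:          return 'gpt-realtime-2';         // reasoning + tools + conversation
  }
}

const url = `wss://api.openai.com/v1/realtime?model=${realtimeModelFor('captions')}`;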
Common pitfalls
Forgetting response.cancel on barge-in. Agent keeps talking over the user. Voice feels broken.
Slow tool handlers. Conversation stalls visibly. Users repeat themselves.
Long system prompts. They add 200-500ms to every turn.
Wrong audio format. Sample rate mismatch → garbled audio.
No session caps. Runaway costs from stuck sessions.
No mobile testing. Network jitter on cellular reveals problems desktop testing hides.
What to watch next
- WebRTC vs WebSocket performance as Realtime API matures.
- Pricing optimization — multi-tier audio quality, cached system prompt support.
- Native MCP server tool calls from Realtime sessions.
- Cross-model routing — agents that hand off between Realtime-2 and Realtime-Translate mid-conversation.
Related reading
- GPT-Realtime-Translate vs Whisper vs Realtime-2
- What is GPT-Realtime-2
- How to migrate Realtime API beta to GPT-Realtime-2
- GPT-Realtime-2 vs ElevenLabs vs Gemini Live
Last verified: May 11, 2026 — sources: OpenAI Realtime API GA release notes, OpenAI WebRTC docs, MarkTechPost Realtime models coverage, DataCamp GPT-Realtime-2 deep dive.