How to Build a Voice Agent with GPT-Realtime-2 (May 2026)
OpenAI’s Realtime API went GA on May 8, 2026 with three new models. GPT-Realtime-2 is the one to use for voice agents that need to reason and act. Here’s a step-by-step build guide for production voice apps in May 2026.
Last verified: May 11, 2026
What you’re building
A voice agent that:
- Listens to a user speak (microphone in)
- Reasons about the request (GPT-5-class model)
- Optionally calls tools (DB queries, API lookups, calendar checks)
- Responds in natural speech (voice out)
- Handles interruptions and turn-taking gracefully
GPT-Realtime-2 is the right model for this. GPT-Realtime-Whisper handles transcription only. GPT-Realtime-Translate handles cross-language voice. GPT-Realtime-2 is the reasoning voice agent.
Step 1: Prerequisites
- OpenAI API key with Realtime API access (GA as of May 8, 2026 — no allowlist required)
- A backend you can host (Node, Python, Go — any language with WebSocket/WebRTC libs)
- A frontend (browser, mobile app, or native client) that can capture and play audio
For browser-based agents, WebRTC is the lower-latency path. For server-to-server voice (telephony, IVR, call-center), WebSocket is the standard route.
Step 2: Open the Realtime connection
WebRTC (browser, lowest latency)
// Get an ephemeral session token from your backend
const tokenRes = await fetch('/api/realtime/session');
const { client_secret } = await tokenRes.json();

// Open a WebRTC peer connection to OpenAI
const pc = new RTCPeerConnection();
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
stream.getTracks().forEach(t => pc.addTrack(t, stream));

// Receive remote audio (the agent's voice)
pc.ontrack = (e) => {
  const audio = document.querySelector('#agent-audio');
  audio.srcObject = e.streams[0];
};

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const sdpRes = await fetch(
  'https://api.openai.com/v1/realtime?model=gpt-realtime-2',
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${client_secret.value}`,
      'Content-Type': 'application/sdp',
    },
    body: offer.sdp,
  }
);

await pc.setRemoteDescription({ type: 'answer', sdp: await sdpRes.text() });
The browser is now streaming microphone audio to OpenAI and playing the agent’s voice back. Sub-300ms round-trip is achievable when the user is regionally close to OpenAI’s endpoints.
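The /api/realtime/session route in that snippet is your own backend endpoint: it mints an ephemeral token so the browser never sees your real API key. A minimal Express sketch, assuming the GA API keeps the beta-era POST /v1/realtime/sessions shape and client_secret response field (verify against the current docs):

import express from 'express';

const app = express();

// Mint a short-lived client token and hand it to the browser.
app.get('/api/realtime/session', async (_req, res) => {
  const r = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: 'gpt-realtime-2', voice: 'alloy' }),
  });
  res.json(await r.json()); // contains the client_secret.value used in the browser fetch above
});

app.listen(3000);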
WebSocket (server-side, telephony)
import WebSocket from 'ws';

const ws = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-realtime-2',
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v1',
    },
  }
);

ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      voice: 'alloy',
      instructions: 'You are a helpful voice assistant.',
      modalities: ['text', 'audio'],
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16',
    },
  }));
});
Then stream audio frames over the WebSocket as input_audio_buffer.append events.
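Each append event carries a base64-encoded chunk of PCM16 audio. A minimal sketch, assuming your telephony stack already hands you 24kHz mono PCM16 frames as Buffers and that the beta-era event names carry over to GA (sendAudioFrame is an illustrative name, not part of the API):

// Forward raw PCM16 frames from your telephony stack to the session.
function sendAudioFrame(ws, pcm16Frame) {
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: pcm16Frame.toString('base64'),
  }));
}

// With server-side voice activity detection enabled (the default), the API
// detects the end of the user's turn and responds on its own. Without it,
// commit the buffer and request a response explicitly:
// ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
// ws.send(JSON.stringify({ type: 'response.create' }));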
Step 3: Add tools
The big upgrade in GPT-Realtime-2 is full function calling mid-conversation. Define tools in session.update:
session: {
  voice: 'alloy',
  instructions: 'You are a customer support agent for Andrew.ooo.',
  tools: [
    {
      type: 'function',
      name: 'lookup_order',
      description: 'Look up an order by ID',
      parameters: {
        type: 'object',
        properties: {
          order_id: { type: 'string', description: 'The order ID' },
        },
        required: ['order_id'],
      },
    },
  ],
  tool_choice: 'auto',
}
When the model calls lookup_order, you receive a response.function_call_arguments.done event. Execute the function and send the result back:
ws.send(JSON.stringify({
  type: 'conversation.item.create',
  item: {
    type: 'function_call_output',
    call_id: event.call_id,
    output: JSON.stringify({ status: 'shipped', tracking: 'ABC123' }),
  },
}));

ws.send(JSON.stringify({ type: 'response.create' }));
The agent will incorporate the result into its spoken response.
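The receiving side of that exchange is a message handler that parses the arguments, runs your function, and then sends the output item shown above. A sketch, assuming the beta-era event names carry over to GA; lookupOrder stands in for your own DB or API lookup:

ws.on('message', async (raw) => {
  const event = JSON.parse(raw.toString());

  if (event.type === 'response.function_call_arguments.done' &&
      event.name === 'lookup_order') {
    // Arguments arrive as a JSON string once the model finishes the call.
    const { order_id } = JSON.parse(event.arguments);
    const result = await lookupOrder(order_id); // your own handler; keep it fast

    // Return the result and ask the model to continue speaking.
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));
    ws.send(JSON.stringify({ type: 'response.create' }));
  }
});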
Step 4: Handle interruptions (barge-in)
Users will interrupt. When the user starts speaking while the agent is talking, you must:
- Detect speech start (the API emits input_audio_buffer.speech_started).
- Cancel the in-progress response:
ws.send(JSON.stringify({ type: 'response.cancel' }));
- Truncate the in-progress audio output on the client to stop playback immediately.
- The model will pick up the user’s new turn.
This is the difference between a voice agent that feels human and one that feels robotic.
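A minimal server-side sketch of that sequence, assuming the beta-era event names carry over to GA; stopLocalPlayback is a placeholder for however your client flushes its audio queue:

ws.on('message', (raw) => {
  const event = JSON.parse(raw.toString());

  if (event.type === 'input_audio_buffer.speech_started') {
    // The user started talking over the agent: stop generating...
    ws.send(JSON.stringify({ type: 'response.cancel' }));

    // ...and drop any agent audio the client has buffered but not yet played.
    stopLocalPlayback(); // placeholder for your client-side playback control
  }
});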
Step 5: Tune for production
Short system prompts. Every byte adds latency on every turn. Keep instructions tight.
Tool handlers must be fast. Pre-warm DB connections, cache common lookups, return partial results when possible.
Audio format matters. PCM16 at 24kHz is the standard. Don’t transcode unnecessarily.
Voice selection. Pick the voice that fits the brand. Alloy, Echo, Shimmer, etc. — pick once, stay consistent.
Region. Run your backend in the same cloud region as OpenAI’s endpoint (US East / US West typically). Cross-region adds 50-150ms per round-trip.
Logging. Capture session IDs, turn IDs, tool calls, and latency timings. Voice agent debugging without logs is brutal.
Cost monitoring. Realtime API bills audio in + audio out + model inference. Set per-session caps. Long monologues from confused users will burn through budget fast.
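A per-session cap can be as blunt as a wall-clock timer that ends the session once the budget is spent. A sketch, with the 10-minute limit chosen purely as an example:

// Hard stop for runaway sessions: cancel any in-flight response and close.
const MAX_SESSION_MS = 10 * 60 * 1000; // example budget; tune per product

const sessionTimer = setTimeout(() => {
  ws.send(JSON.stringify({ type: 'response.cancel' }));
  ws.close();
}, MAX_SESSION_MS);

ws.on('close', () => clearTimeout(sessionTimer));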
Step 6: Route by job
Not every voice workload needs GPT-Realtime-2’s reasoning. Route by job:
- Live captions or transcripts? → gpt-realtime-whisper (cheaper)
- Cross-language voice? → gpt-realtime-translate (purpose-built)
- Reasoning + tools + conversation? → gpt-realtime-2
A typical multi-capability app uses all three behind a routing layer.
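In practice the routing layer can be as small as picking the model string before opening the connection. A hypothetical sketch (the job labels are illustrative, not API values):

// Pick the Realtime model per job before opening the session.
function realtimeModelFor(job) {
  switch (job) {
    case 'captions':  return 'gpt-realtime-whisper';   // transcription only
    case 'translate': return 'gpt-realtime-translate'; // cross-language voice
    default:          return 'gpt-realtime-2';         // reasoning + tools + conversation
  }
}

const url = `wss://api.openai.com/v1/realtime?model=${realtimeModelFor('captions')}`;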
Common pitfalls
Forgetting response.cancel on barge-in. Agent keeps talking over the user. Voice feels broken.
Slow tool handlers. Conversation stalls visibly. Users repeat themselves.
Long system prompts. They add 200-500ms to every turn.
Wrong audio format. Sample rate mismatch → garbled audio.
No session caps. Runaway costs from stuck sessions.
No mobile testing. Network jitter on cellular reveals problems desktop testing hides.
What to watch next
- WebRTC vs WebSocket performance as Realtime API matures.
- Pricing optimization — multi-tier audio quality, cached system prompt support.
- Native MCP server tool calls from Realtime sessions.
- Cross-model routing — agents that hand off between Realtime-2 and Realtime-Translate mid-conversation.
Related reading
- GPT-Realtime-Translate vs Whisper vs Realtime-2
- What is GPT-Realtime-2
- How to migrate Realtime API beta to GPT-Realtime-2
- GPT-Realtime-2 vs ElevenLabs vs Gemini Live
Last verified: May 11, 2026 — sources: OpenAI Realtime API GA release notes, OpenAI WebRTC docs, MarkTechPost Realtime models coverage, DataCamp GPT-Realtime-2 deep dive.