What Is GPT-Realtime-2? OpenAI's New Voice Model (May 2026)

OpenAI released GPT-Realtime-2 on May 7, 2026, as the new flagship of a refreshed Realtime API that simultaneously exited beta. The headline upgrades — GPT-5-class reasoning, 128K context, parallel tool calls — change what’s practical to build with voice. Here’s what GPT-Realtime-2 actually is, what it can do, and how it fits in the May 2026 voice stack.

Last verified: May 10, 2026

The announcement at a glance

| Property | Value |
| --- | --- |
| Released | May 7, 2026 |
| Provider | OpenAI |
| Architecture | Speech-to-speech (audio in, audio out) |
| Reasoning class | GPT-5-class |
| Context window | 128,000 tokens (up from 32K) |
| Tool calling | Parallel function calls supported |
| Reasoning effort | minimal / low / medium / high / xhigh |
| Audio input price | $32 / 1M tokens |
| Audio output price | $64 / 1M tokens |
| API status | GA — Realtime API out of beta |

What GPT-Realtime-2 actually is

GPT-Realtime-2 is the speech-to-speech version of OpenAI’s GPT-5-class frontier model. “Speech-to-speech” means the model receives audio and emits audio directly — there’s no intermediate text round-trip through ASR + LLM + TTS. That’s the architectural choice that gives Realtime models lower latency and more natural prosody than turn-based stacks.
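The latency argument can be made concrete with a small sketch. The stage latencies below are illustrative assumptions, not published measurements: a cascaded stack pays each stage in sequence, while a speech-to-speech model has a single time-to-first-audio.

```python
# Sketch: why speech-to-speech beats a cascaded ASR + LLM + TTS stack
# on latency. All millisecond figures here are assumed for illustration.

CASCADE_MS = {
    "asr": 300,              # finish transcribing the user's turn
    "llm_first_token": 400,  # text model starts responding
    "tts_first_audio": 200,  # synthesizer emits first audio
}
DIRECT_MS = 500              # assumed time-to-first-audio, speech-to-speech

# Cascade stages run in sequence, so their latencies add up.
cascade_total = sum(CASCADE_MS.values())  # 900 ms vs 500 ms direct
```

Under these (assumed) numbers the cascade is nearly twice as slow to first audio, and it also loses prosody by flattening speech to text in the middle.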

The May 7, 2026 release shipped three audio models together:

  • GPT-Realtime-2 — the flagship reasoning voice agent.
  • GPT-Realtime-Translate — dedicated speech-to-speech translator (70+ input → 13 output languages, $0.034/min).
  • GPT-Realtime-Whisper — streaming transcription only.

GPT-Realtime-2 replaces the original GPT-Realtime as OpenAI’s recommended default for conversational voice agents.

The five upgrades that actually matter

1. GPT-5-class reasoning in the voice loop

The original GPT-Realtime was capable but reasoning-light. It struggled with multi-step requests, planning, and constraint-heavy tasks. GPT-Realtime-2 brings GPT-5-class reasoning into the speech-to-speech loop, while still maintaining conversational pacing.

In practice: voice agents can now handle “book me a flight from Boston to Tokyo, layover not in San Francisco, budget under $1500, leaving Tuesday morning” in one turn. The original model needed to be walked through it.

2. 128K context window (up from 32K)

The 4x context expansion removes the most common production workaround — chunking long calls or stripping conversation history to fit in 32K. With 128K you can:

  • Hold full call history for sessions over an hour.
  • Ground voice agents in product manuals, knowledge bases, or call scripts inline.
  • Keep multi-tool conversation state across many turns without aggressive pruning.
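A minimal sketch of the check that used to force pruning, assuming a rough 4-characters-per-token heuristic and a reserved response budget (both assumptions, not API behavior):

```python
# Sketch: decide whether full history + inline grounding fits in the
# window before falling back to the old truncation workaround.

CONTEXT_WINDOW = 128_000   # GPT-Realtime-2 (was 32_000)
RESPONSE_RESERVE = 4_000   # room left for the model's reply (assumption)

def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English."""
    return max(1, len(text) // 4)

def fits_in_context(history: list[str], grounding_docs: list[str]) -> bool:
    """True if the session needs no pruning at all."""
    used = sum(estimate_tokens(t) for t in history + grounding_docs)
    return used + RESPONSE_RESERVE <= CONTEXT_WINDOW
```

At 32K this check failed for most hour-long calls; at 128K it passes for all but the heaviest grounding payloads.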

3. Parallel tool calls

GPT-Realtime-2 can issue multiple tool calls in a single turn and narrate that work to the user (“let me check your order and look up the return policy at the same time”). For voice agents that hit multiple backends per turn (CRM + product DB + ticketing) this cuts latency by 30-50% in production benchmarks.
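On the application side, the win comes from running independent tool calls concurrently rather than one after another. A sketch with hypothetical stand-in backends (the tool names and latencies are invented for illustration):

```python
# Sketch: when the model emits several independent tool calls in one
# turn, dispatch them concurrently. Total wait is roughly the slowest
# call, not the sum of all of them.
import asyncio

async def lookup_order(order_id: str) -> dict:
    await asyncio.sleep(0.1)          # stand-in for a CRM round-trip
    return {"order": order_id, "status": "shipped"}

async def fetch_return_policy(sku: str) -> dict:
    await asyncio.sleep(0.1)          # stand-in for a product-DB round-trip
    return {"sku": sku, "window_days": 30}

async def handle_turn() -> list:
    # Both calls are independent, so gather them in parallel.
    return await asyncio.gather(
        lookup_order("A-1001"),
        fetch_return_policy("SKU-42"),
    )

results = asyncio.run(handle_turn())
```

Serially these two calls would take ~200 ms of backend wait; gathered, ~100 ms, which is where the 30-50% per-turn savings comes from when a turn touches several systems.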

4. Preambles and recovery

Two conversational behaviors that solve the most jarring voice agent failures:

  • Preambles — when the model knows it’s about to do work that will take a moment, it fills the silence. “Let me check that for you.” This is what makes the agent feel responsive rather than dead.
  • Recovery — when the user interrupts mid-sentence or speaks over a tool call, the agent recovers instead of crashing or repeating itself.

These are the “unsexy but essential” features that distinguish demo voice agents from production ones.

5. Adjustable reasoning effort

A new reasoning_effort parameter — minimal, low, medium, high, xhigh — trades latency for reasoning depth.

  • minimal — drive-thru order taking, FAQ lookup, simple commands. Sub-second time-to-first-audio.
  • medium (default) — most consumer voice agents. Good balance.
  • high / xhigh — complex enterprise workflows, troubleshooting, scheduling under constraints. Higher latency, materially better answers.

This is the same knob OpenAI exposes on text reasoning models, now plumbed through to voice.
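A sketch of wiring the knob per use case. The field names follow the article's description; the exact Realtime API session schema is an assumption here, not the documented shape:

```python
# Sketch: choose reasoning_effort by workload, defaulting to medium.

EFFORT_BY_USE_CASE = {
    "drive_thru": "minimal",              # sub-second time-to-first-audio
    "consumer_agent": "medium",           # default balance
    "enterprise_troubleshooting": "xhigh" # depth over latency
}

def session_config(use_case: str) -> dict:
    """Build a session config with an effort level for this workload."""
    return {
        "model": "gpt-realtime-2",
        "reasoning_effort": EFFORT_BY_USE_CASE.get(use_case, "medium"),
    }
```

The useful discipline is mapping effort to workload once, centrally, rather than hardcoding it per call site.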

Pricing breakdown

| Model | Audio input | Audio output | Notes |
| --- | --- | --- | --- |
| GPT-Realtime-2 | $32 / 1M tokens | $64 / 1M tokens | Default for conversational agents |
| GPT-Realtime-Translate | $0.034 / minute (flat) | included | Translation only, no dialog |
| GPT-Realtime-Whisper | Whisper streaming pricing | n/a | Transcription only |

Practical cost rule of thumb for GPT-Realtime-2: budget roughly $0.30-0.60 per minute of two-way conversation, depending on speech density and tool call frequency. Translation-only workloads are ~10x cheaper using GPT-Realtime-Translate.
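The rule of thumb reduces to simple arithmetic on the published rates. The tokens-per-minute figures below are assumptions (real usage varies with speech density, context reprocessing, and tool-call traffic):

```python
# Sketch: back-of-envelope cost for a two-way GPT-Realtime-2 call
# at the published audio-token rates.

INPUT_RATE = 32 / 1_000_000    # dollars per audio input token
OUTPUT_RATE = 64 / 1_000_000   # dollars per audio output token

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Total dollar cost for the given audio token counts."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Assumed: ~6,000 input and ~3,500 output tokens per minute of
# two-way conversation once repeated context processing is counted.
per_minute = call_cost(6_000, 3_500)   # lands inside the $0.30-0.60 band
```

Plugging in your own measured token counts per minute is the fastest way to sanity-check a voice product's unit economics.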

Migration from the GPT-Realtime beta

The migration path is short:

  1. Rename the model in your session config: gpt-realtime → gpt-realtime-2.
  2. Add reasoning_effort: "medium" as a default; tune per use case.
  3. Enable parallel tool calls if your tools are independent — measurable latency win.
  4. Render preamble text on the client UI while audio streams.
  5. Verify your context window assumptions — long sessions that previously had to truncate at 32K can now run native to 128K.
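Steps 1, 2, 3, and 5 reduce to a config change. A before/after sketch, assuming this session shape (field names mirror the article, not a documented schema):

```python
# Sketch: the session config delta most teams ship for the migration.

old_session = {
    "model": "gpt-realtime",
    # 32K window forced aggressive history truncation
    "max_history_tokens": 32_000,
}

new_session = {
    "model": "gpt-realtime-2",        # step 1: rename the model
    "reasoning_effort": "medium",     # step 2: sensible default, tune later
    "parallel_tool_calls": True,      # step 3: only if tools are independent
    "max_history_tokens": 128_000,    # step 5: run native to the new window
}
```

Step 4 (rendering preamble text while audio streams) is the only change that touches client code rather than config.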

Most teams ship the change in a single PR.

What GPT-Realtime-2 is not

  • Not a translation specialist. Use GPT-Realtime-Translate for translation-only workloads. 10x cheaper, purpose-built.
  • Not a transcription service. Use Whisper streaming.
  • Not the cheapest voice option. ElevenLabs + a smaller LLM still beats it on cost when voice quality is the product but reasoning isn’t critical.
  • Not on-device. Gemini Live runs natively on Pixel; GPT-Realtime-2 is API-only.

What to watch next

  • Production SLA numbers — first published uptime data post-GA.
  • Voice library expansion — OpenAI is expected to expand the voice catalog over Q3 2026.
  • Multimodal voice agents — combining camera/screen + voice + tool use is the next frontier.
  • Agent SDK support — how OpenAI’s Assistants API and the agentic stack take advantage of Realtime-2’s parallel tools.

Last verified: May 10, 2026 — sources: OpenAI Realtime API release notes, OpenAI cookbook for realtime translation, MarktechPost, TheNextWeb, Latent.Space, 9to5Mac, Microsoft Azure AI Foundry blog, OpenAI community forum.