
GPT-Realtime-Translate vs Whisper vs Realtime-2 (May 2026)

OpenAI shipped three Realtime audio models on May 8, 2026, and the Realtime API graduated to GA. Three models, three jobs. Here’s how to pick and what each one is for.

Last verified: May 11, 2026

At a glance

Property         GPT-Realtime-2                    GPT-Realtime-Translate       GPT-Realtime-Whisper
Released         May 8, 2026                       May 8, 2026                  May 8, 2026
Primary job      Voice agent + reasoning           Live speech translation      Streaming transcription
Reasoning level  GPT-5-class                       Translation-tuned            None (STT only)
Speaks back      Yes (voice out)                   Yes (voice out)              No
Tool use         Yes (full)                        Limited                      No
Best for         Voice assistants, support, demos  Multilingual conversations   Captions, transcripts, accessibility
Latency          Low                               Very low                     Very low
API surface      Realtime API (GA)                 Realtime API (GA)           Realtime API (GA)

What changed on May 8

Three things happened together:

1. Realtime API exits beta. The gpt-4o-realtime-preview endpoint is superseded. The API is now GA with stable contracts.

2. Three specialized audio models. Instead of one general “realtime voice” model, OpenAI split capabilities into three specialized models — one for voice agents, one for translation, one for transcription.

3. New developer surface. Tool use, function calls, and streaming audio are unified under a cleaner Realtime API. Existing integrations migrate by updating model identifiers and re-validating audio formats.
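For existing integrations, the change is mostly a matter of the payload you send when opening a session. A minimal sketch of building that payload — the event name, session fields, and the "gpt-realtime-2" identifier are assumptions for illustration, not confirmed GA surface:

```python
import json

def session_update(model: str, voice: str = None) -> str:
    """Build a hypothetical session.update event for the GA Realtime API."""
    session = {
        "model": model,                   # e.g. the assumed "gpt-realtime-2"
        "modalities": ["audio", "text"],  # voice in, voice/text out
    }
    if voice:
        session["voice"] = voice
    return json.dumps({"type": "session.update", "session": session})

payload = session_update("gpt-realtime-2", voice="alloy")
```

In a real integration this string would be sent over the Realtime API's streaming connection; check OpenAI's GA docs for the actual event schema.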

GPT-Realtime-2: the voice agent

What it is. A voice-in, voice-out model with GPT-5-class reasoning, full tool use, and conversational task execution.

Best for:

  • Customer support voice agents
  • Voice assistants in apps
  • Interactive demos at conferences and trade shows
  • Voice-driven onboarding
  • Conversational tutors and language partners

Strengths:

  • Reasons through multi-step tasks while you’re talking to it
  • Tool calls in the middle of conversation (look up an order, query a DB)
  • Maintains conversation state with natural turn-taking
  • Handles interruptions and barge-in gracefully

Limits:

  • More expensive per minute than Whisper or Translate
  • Not the right pick if you only need transcription
  • Not the right pick if you only need cross-language voice
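The mid-conversation tool calls are the part worth planning for. A sketch of declaring one tool and dispatching the resulting function-call event — the tool schema shape, event field names, and the lookup_order handler are all assumptions, with the backend stubbed out:

```python
import json

# Hypothetical tool declaration for a support voice agent.
TOOLS = [{
    "type": "function",
    "name": "lookup_order",
    "description": "Fetch an order's status by ID",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def handle_function_call(event: dict) -> dict:
    """Route a function-call event to a local handler and build the reply."""
    if event["name"] == "lookup_order":
        args = json.loads(event["arguments"])
        result = {"order_id": args["order_id"], "status": "shipped"}  # stub
        return {"type": "function_call_output", "output": json.dumps(result)}
    raise ValueError(f"unknown tool: {event['name']}")
```

The point of the pattern: the model pauses speech, emits a structured call, your code answers with JSON, and the conversation resumes with the result in context.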

GPT-Realtime-Translate: live speech-to-speech

What it is. A live speech-to-speech translation model — listens in one language, speaks in another, optimized for low-latency conversations.

Best for:

  • Multilingual customer support
  • Real-time interpretation in meetings
  • Travel assistants
  • Cross-language voice calls
  • Diplomatic and business interpretation aids

Strengths:

  • Skips text round-trip (faster than STT → translate → TTS pipeline)
  • Voice-to-voice latency tuned for conversation pacing
  • Handles many language pairs in a single endpoint
  • Preserves prosody and tone where possible

Limits:

  • Specialized for translation — not a general-purpose voice agent
  • Reasoning capability is translation-focused, not GPT-5-class
  • Doesn’t replace a transcription product
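Because the model skips the text round-trip, a translation session is mostly configuration. A sketch of what that config might look like — the model identifier and the input_language / output_language parameter names are assumptions:

```python
def translate_session(source: str, target: str) -> dict:
    """Hypothetical session config for live speech-to-speech translation."""
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-translate",  # assumed GA identifier
            "modalities": ["audio"],            # voice in, voice out
            "input_language": source,           # assumed parameter names
            "output_language": target,
        },
    }

config = translate_session("es", "en")  # Spanish in, English out
```

Contrast this with a pipeline approach, where you would configure and chain three separate services (STT, translation, TTS) and pay latency at each hop.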

GPT-Realtime-Whisper: streaming transcription only

What it is. A streaming transcription model — audio in, text out, in real time. The Realtime API descendant of OpenAI’s Whisper line.

Best for:

  • Live captions for video and broadcasts
  • Real-time meeting transcripts
  • Accessibility (deaf/HoH live caption services)
  • Call-center QA and compliance
  • Voice notes and dictation pipelines
  • Podcast/video production workflows

Strengths:

  • Lowest latency to first transcript token
  • Cheapest of the three for transcription-only workloads
  • High-accuracy multilingual transcription
  • Streaming-first design (don’t wait for end-of-utterance)

Limits:

  • No voice output (text only)
  • No reasoning or tool use
  • Translation is not its strength (use Realtime-Translate)
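The streaming-first design means your client consumes partial transcript events rather than waiting for a finished utterance. A sketch of folding those events into caption lines — the event type names "transcript.delta" and "transcript.done" are assumptions:

```python
def collect_transcript(events) -> list:
    """Fold partial transcript deltas into finalized caption lines."""
    lines, current = [], []
    for ev in events:
        if ev["type"] == "transcript.delta":
            current.append(ev["text"])      # render immediately for live captions
        elif ev["type"] == "transcript.done":
            lines.append("".join(current))  # utterance finalized
            current = []
    return lines

captions = collect_transcript([
    {"type": "transcript.delta", "text": "Hel"},
    {"type": "transcript.delta", "text": "lo there"},
    {"type": "transcript.done"},
])
```

For captions, you would paint each delta as it arrives and replace the line once the done event lands.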

Picking between them

Need a voice agent that thinks and acts?         → Realtime-2
Need live cross-language voice?                  → Realtime-Translate
Need live captions or transcripts?               → Realtime-Whisper

For multi-capability apps, wire the three together:

  • Captions tab → Realtime-Whisper
  • Language-switch UI → Realtime-Translate
  • “Talk to the assistant” → Realtime-2
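The decision table above is simple enough to encode as a routing function, which is handy in a multi-capability app that opens different sessions per feature. Model identifiers here are assumed:

```python
def pick_model(needs_voice_out: bool, needs_translation: bool,
               needs_reasoning: bool) -> str:
    """The article's decision table as code. Model IDs are assumptions."""
    if not needs_voice_out:
        return "gpt-realtime-whisper"      # captions / transcripts only
    if needs_translation and not needs_reasoning:
        return "gpt-realtime-translate"    # live cross-language voice
    return "gpt-realtime-2"                # voice agent that thinks and acts
```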

Migrating from gpt-4o-realtime-preview

The beta endpoint is being deprecated in favor of the GA Realtime API.

Migration steps:

  1. Identify which job your integration is actually doing.
  2. Pick the right new model — Realtime-2, Realtime-Translate, or Realtime-Whisper.
  3. Update model identifier in API calls.
  4. Validate audio formats (sample rates, codecs) against the GA spec.
  5. Re-test tool calling and function call surfaces — these moved under the GA Realtime API.
  6. Update prompt scaffolding (system messages, conversation init).
  7. Run pricing impact analysis — the GA prices differ from beta.
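Step 3 can often be a mechanical config rewrite. A sketch, assuming the beta endpoint's general-voice workloads map to Realtime-2 (the GA identifier shown is an assumption; translation- or transcription-only workloads would map to the other two models instead):

```python
def migrate_model_id(config: dict) -> dict:
    """Swap the deprecated beta identifier for the matching GA model."""
    replacements = {
        "gpt-4o-realtime-preview": "gpt-realtime-2",  # assumed GA name
    }
    config = dict(config)  # copy; leave the caller's config untouched
    config["model"] = replacements.get(config["model"], config["model"])
    return config
```

The identifier swap is the easy part; steps 4–7 (audio formats, tool surfaces, prompts, pricing) are where the real migration time goes.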

OpenAI’s migration guide covers the endpoint and format details. Plan for a few hours of integration time per existing app.

What to watch next

  • Pricing transparency — OpenAI’s published pricing for the three models should stabilize over the coming weeks.
  • Tool-use surface — independent benchmarks on Realtime-2’s tool-call reliability vs ElevenLabs Conversational AI.
  • Latency comparisons — Realtime-Translate vs traditional STT → translate → TTS pipelines.
  • Whisper model parity — whether OpenAI also updates the offline Whisper line.

Last verified: May 11, 2026 — sources: OpenAI Realtime API release notes, MarkTechPost, KnightLI, WindowsReport, DataCamp Realtime-2 deep dive, TheAITrack.