
GPT-Realtime-2 vs ElevenLabs vs Gemini Live (May 2026)

OpenAI moved the Realtime API out of beta on May 7, 2026, with three new models — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. ElevenLabs Conversational AI and Gemini Live are the other two stacks production voice teams evaluate. Here’s how the three compare for real workloads.

Last verified: May 10, 2026

The three at a glance

| Capability | GPT-Realtime-2 | ElevenLabs Conversational AI | Gemini Live |
| --- | --- | --- | --- |
| Provider | OpenAI | ElevenLabs | Google |
| Status (May 2026) | GA (May 7) | GA | GA |
| Architecture | Speech-to-speech | Turn-based ASR + LLM + TTS | Speech-to-speech |
| Reasoning model | GPT-5-class | Configurable (you bring the LLM) | Gemini 3.1 Pro |
| Context window | 128K | Depends on configured LLM | 2M (Gemini 3.1 Pro) |
| Tool calling | Parallel | Yes, via configured LLM | Yes |
| Voice quality | Excellent | Best-in-class | Very good |
| Pricing model | Per-token audio | Per-minute + LLM cost | Per-token |
| Best for | Reasoning + tool-use voice agents | Premium voice apps | Google ecosystem / on-device |

What each one actually is

GPT-Realtime-2: the new default for reasoning voice

OpenAI released GPT-Realtime-2 on May 7, 2026, simultaneously taking the Realtime API out of beta. The headline upgrades vs the original GPT-Realtime:

  • GPT-5-class reasoning in a speech-to-speech loop.
  • 128K context window (up from 32K). Long calls, document grounding, sustained agent tasks.
  • Parallel tool calls — multiple backend lookups at once, narrated to the user (“let me check that and look up your account at the same time”).
  • Preambles and recovery — the model says “let me check that” instead of going silent during tool calls; it gracefully recovers from interrupted speech.
  • Tone control — set conversational tone (warm, professional, energetic) via prompt.
  • Adjustable reasoning effort — minimal, low, medium, high, xhigh. Higher reasoning = better answers, longer latency.
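The knobs above can be sketched as a session config. This is a hypothetical sketch only — the field layout and `session.update` message shape are assumptions based on the features described, not the official schema:

```python
import json

# Hypothetical session config for GPT-Realtime-2 using the knobs the release
# describes: reasoning_effort levels and tone control via the prompt.
# The exact wire format is an assumption, not the official schema.
def build_session_update(effort: str = "medium", tone: str = "warm") -> str:
    allowed = {"minimal", "low", "medium", "high", "xhigh"}
    if effort not in allowed:
        raise ValueError(f"reasoning_effort must be one of {sorted(allowed)}")
    payload = {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",
            "reasoning_effort": effort,  # latency vs answer-quality trade-off
            "instructions": (
                f"Speak in a {tone} tone. Before any tool call, say a short "
                "preamble like 'let me check that' instead of going silent."
            ),
            "tools": [],  # declare independent tools here for parallel calls
        },
    }
    return json.dumps(payload)

print(build_session_update("high", "energetic"))
```

The point of validating `effort` client-side is that the level directly trades latency for answer quality, so it is worth failing fast on typos rather than discovering them mid-call.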

Pricing: $32 per million audio-input tokens, $64 per million audio-output tokens.
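At those rates, a quick back-of-envelope cost model looks like this (the token counts in the example are placeholders — how many audio tokens a minute of speech consumes varies):

```python
# GPT-Realtime-2 audio pricing from the release:
# $32 per 1M audio-input tokens, $64 per 1M audio-output tokens.
IN_RATE = 32 / 1_000_000   # dollars per audio-input token
OUT_RATE = 64 / 1_000_000  # dollars per audio-output token

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * IN_RATE + output_tokens * OUT_RATE

# e.g. a call that consumed 50k input and 30k output audio tokens:
print(f"${call_cost(50_000, 30_000):.2f}")  # $3.52
```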

Sibling models shipped same day:

  • GPT-Realtime-Translate — speech-to-speech interpretation only (NOT conversational), 70+ input languages, 13 output, flat $0.034/minute.
  • GPT-Realtime-Whisper — streaming transcription only.
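The Translate model’s flat per-minute rate makes its cost a function of wall-clock time only, independent of speech density — easy to budget:

```python
# GPT-Realtime-Translate bills a flat $0.034 per minute of audio,
# so cost depends only on duration, not on token counts.
TRANSLATE_RATE = 0.034  # dollars per minute

def translate_cost(minutes: float) -> float:
    return minutes * TRANSLATE_RATE

print(f"60-min meeting: ${translate_cost(60):.2f}")  # $2.04
```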

ElevenLabs Conversational AI: voice quality is the product

ElevenLabs Conversational AI takes the opposite architecture: you bring the LLM, ElevenLabs handles the voice. You configure the conversation graph and an LLM (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro); ElevenLabs handles ASR, latency optimization, and voice generation.
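The turn-based shape of that architecture can be sketched as a three-stage loop. All three stage functions below are hypothetical stand-ins, not real SDK calls — the sketch just shows where the configured LLM plugs in and why reasoning happens in text space rather than audio space:

```python
from typing import Callable

# Minimal sketch of a turn-based ASR -> LLM -> TTS pipeline, the shape
# ElevenLabs-style stacks use (versus a single speech-to-speech model).
# The asr/llm/tts callables are hypothetical stand-ins, not real SDK calls.
def run_turn(audio_in: bytes,
             asr: Callable[[bytes], str],
             llm: Callable[[str], str],
             tts: Callable[[str], bytes]) -> bytes:
    transcript = asr(audio_in)    # speech -> text
    reply_text = llm(transcript)  # reasoning happens here, in text space
    return tts(reply_text)        # text -> speech

# Wire in whichever LLM you configured (GPT-5.5, Claude Opus 4.7, ...):
audio_out = run_turn(
    b"...caller audio...",
    asr=lambda a: "what's my balance?",
    llm=lambda t: f"You asked: {t}",
    tts=lambda t: t.encode(),
)
print(audio_out)
```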

What ElevenLabs still wins on:

  • Voice realism. ElevenLabs voices sound the most natural. For voiceover-grade output, character voices, audiobooks, accessibility apps where the voice is the brand — nothing else is close.
  • Voice cloning. Best-in-class for cloning a specific voice (with consent). Critical for branded voice agents, IP-licensed character voices.
  • Voice library. Thousands of pre-made voices across languages and styles.
  • Latency profile. Tunable for streaming applications.

Where ElevenLabs trails GPT-Realtime-2: end-to-end speech-to-speech reasoning (the OpenAI model thinks in audio space; ElevenLabs goes through text), tool-call sophistication, and context window depth.

Gemini Live: the Google-ecosystem default

Gemini Live runs on Pixel phones (including the Pixel 10a, launched May 9, 2026 with Tensor G4), Google Home (which moved to Gemini 3.1 in May 2026), Android Auto, and the Gemini app. The model behind Live is Gemini 3.1 Pro.

Where Gemini Live wins:

  • On-device processing. Tensor G4/G5 silicon runs significant chunks of voice work locally. Privacy and latency benefits.
  • Multimodal grounding. Live integrates camera, screen content, and search natively in the conversation.
  • Google Home / Workspace integration. Multi-step home automation, Gmail, Calendar, Docs — first-class.
  • Free or bundled access for most consumer use.

Where Gemini Live trails: developer API maturity for production agent workloads vs OpenAI Realtime API.

Decision tree: which voice stack?

You’re building an enterprise voice agent (booking, customer support, sales) that needs tool calls and reasoning. → GPT-Realtime-2. The 128K context, parallel tool calls, and reasoning_effort knob make it the clearest production fit in May 2026.

You’re building a consumer app where voice is the brand (audiobooks, characters, branded assistants, accessibility tools). → ElevenLabs Conversational AI for voice + GPT-5.5 or Claude Opus 4.7 for the brain. Pay the premium per minute; voice quality drives retention.

You’re shipping into the Google ecosystem (Pixel, Android Auto, Google Home, Workspace). → Gemini Live is the only native option. Don’t fight it.

You need real-time multilingual translation (meetings, events, accessibility). → GPT-Realtime-Translate. Flat $0.034/min beats running translation prompts through any of the three above.

You’re transcribing only. → GPT-Realtime-Whisper for streaming, batch Whisper for offline.

What changed in May 2026

  • May 7: GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper released. Realtime API out of beta.
  • May 5: Google Home moved to Gemini 3.1 for Gemini for Home assistant.
  • May 9: Pixel 10a launched with Tensor G4 + on-device Gemini Live features.

Migration notes if you’re on the beta Realtime API

  1. Rename model in your session config from gpt-realtime to gpt-realtime-2.
  2. Verify your context — long sessions that previously had to chunk at 32K can now run to 128K natively.
  3. Add reasoning_effort parameter — set to medium as a default; bump to high for harder workflows; minimal for latency-sensitive simple flows.
  4. Use parallel tool calls if your existing tools are independent — measurable latency win.
  5. Add preamble handling on the client — render “let me check that” text while audio is generating.
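Steps 1–3 above can be sketched as a small migration helper over a session-config dict. The field names here (`max_context_chunk` in particular) are hypothetical illustrations of a beta-era chunking workaround, not official parameters:

```python
# Hypothetical helper applying migration steps 1-3 to a beta-era session
# config dict. Field names follow this article; "max_context_chunk" is an
# invented stand-in for a 32K chunking workaround, not a real parameter.
def migrate_session(cfg: dict) -> dict:
    out = dict(cfg)
    if out.get("model") == "gpt-realtime":
        out["model"] = "gpt-realtime-2"           # step 1: rename model
    out.pop("max_context_chunk", None)            # step 2: 128K removes chunking
    out.setdefault("reasoning_effort", "medium")  # step 3: sensible default
    return out

old = {"model": "gpt-realtime", "voice": "alloy", "max_context_chunk": 32_000}
print(migrate_session(old))
```

Because the new fields are additive, the helper never overwrites a `reasoning_effort` you already set — consistent with the feature-flag-style nature of the upgrade.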

Most teams report a single PR migration. The big upgrades (128K, parallel tools, reasoning_effort) are feature-flag-style additions, not breaking changes.

What to watch next

  • OpenAI Realtime API SLA — first published SLA numbers post-GA.
  • ElevenLabs response — counter-pricing or latency improvements.
  • Gemini Live developer API maturity — Google’s I/O 2026 (May 19) is likely to expand the Live developer story.
  • Multimodal voice agents — combining camera/screen + voice + tool use is the next frontier; all three providers are racing.

Last verified: May 10, 2026 — sources: OpenAI Realtime API release notes, OpenAI cookbook (realtime translation), TheNextWeb, MarktechPost, Latent.Space, Neowin, 9to5Mac, Microsoft Azure AI Foundry blog, Google Home update notes.