GPT-Realtime-2 vs ElevenLabs vs Gemini Live (May 2026)
OpenAI moved the Realtime API out of beta on May 7, 2026, with three new models — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. ElevenLabs and Gemini Live are the other names production voice teams evaluate. Here’s how the three stacks compare for real workloads.
Last verified: May 10, 2026
The three at a glance
| Capability | GPT-Realtime-2 | ElevenLabs Conversational AI | Gemini Live |
|---|---|---|---|
| Provider | OpenAI | ElevenLabs | Google |
| Status (May 2026) | GA (May 7) | GA | GA |
| Architecture | Speech-to-speech | Turn-based ASR + LLM + TTS | Speech-to-speech |
| Reasoning model | GPT-5-class | Configurable (you bring LLM) | Gemini 3.1 Pro |
| Context window | 128K | Depends on configured LLM | 2M (Gemini 3.1 Pro) |
| Tool calling | Parallel | Yes, via configured LLM | Yes |
| Voice quality | Excellent | Best-in-class | Very good |
| Pricing model | Per-token audio | Per-minute + LLM cost | Per-token |
| Best for | Reasoning + tool-use voice agents | Premium voice apps | Google ecosystem / on-device |
What each one actually is
GPT-Realtime-2: the new default for reasoning voice
OpenAI released GPT-Realtime-2 on May 7, 2026, and took the Realtime API out of beta the same day. The headline upgrades over the original GPT-Realtime:
- GPT-5-class reasoning in a speech-to-speech loop.
- 128K context window (up from 32K). Long calls, document grounding, sustained agent tasks.
- Parallel tool calls — multiple backend lookups at once, narrated to the user (“let me check that and look up your account at the same time”).
- Preambles and recovery — the model says “let me check that” instead of going silent during tool calls; it gracefully recovers from interrupted speech.
- Tone control — set conversational tone (warm, professional, energetic) via prompt.
- Adjustable reasoning effort — minimal, low, medium, high, xhigh. Higher reasoning = better answers, longer latency.
Pricing: $32 per million audio-input tokens, $64 per million audio-output tokens.
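The knobs above can be sketched as a session config. Field names here (`reasoning_effort`, `parallel_tool_calls`, the instruction phrasing) are illustrative assumptions for this article, not the published Realtime API schema — check OpenAI's release notes for the real parameter names.

```python
# Illustrative session config for a GPT-Realtime-2 voice agent.
# Field names are assumptions, not the official API schema.

def build_session_config(reasoning_effort: str = "medium",
                         tone: str = "professional") -> dict:
    """Assemble a Realtime session payload with the new knobs."""
    allowed = {"minimal", "low", "medium", "high", "xhigh"}
    if reasoning_effort not in allowed:
        raise ValueError(f"reasoning_effort must be one of {sorted(allowed)}")
    return {
        "model": "gpt-realtime-2",
        "reasoning_effort": reasoning_effort,  # quality vs latency trade-off
        "instructions": (
            f"Speak in a {tone} tone. Before any tool call, say a short "
            "preamble like 'let me check that' instead of going silent."
        ),
        "tools": [],                   # your tool schemas go here
        "parallel_tool_calls": True,   # independent lookups run at once
    }

config = build_session_config(reasoning_effort="high", tone="warm")
print(config["model"])  # gpt-realtime-2
```

The point of the `reasoning_effort` default being `medium` is that you only pay the latency cost of `high`/`xhigh` on the flows that need it.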
Sibling models shipped same day:
- GPT-Realtime-Translate — speech-to-speech interpretation only (NOT conversational), 70+ input languages, 13 output, flat $0.034/minute.
- GPT-Realtime-Whisper — streaming transcription only.
ElevenLabs Conversational AI: voice quality is the product
ElevenLabs Conversational AI takes the opposite approach: you bring the LLM, ElevenLabs handles the voice. You configure the conversation graph and an LLM (GPT-5.5, Claude Opus 4.7, or Gemini 3.1 Pro); ElevenLabs handles ASR, latency optimization, and voice generation.
What ElevenLabs still wins on:
- Voice realism. ElevenLabs voices sound the most natural. For voiceover-grade output, character voices, audiobooks, accessibility apps where the voice is the brand — nothing else is close.
- Voice cloning. Best-in-class for cloning a specific voice (with consent). Critical for branded voice agents, IP-licensed character voices.
- Voice library. Thousands of pre-made voices across languages and styles.
- Latency profile. Tunable for streaming applications.
Where ElevenLabs trails GPT-Realtime-2: end-to-end speech-to-speech reasoning (the OpenAI model thinks in audio space; ElevenLabs goes through text), tool-call sophistication, context window depth.
Gemini Live: the Google-ecosystem default
Gemini Live runs on Pixel phones (including the Pixel 10a, launched May 9, 2026 with Tensor G4), Google Home (which moved to Gemini 3.1 in May 2026), Android Auto, and the Gemini app. The model behind Live is Gemini 3.1 Pro.
Where Gemini Live wins:
- On-device processing. Tensor G4/G5 silicon runs significant chunks of voice work locally. Privacy and latency benefits.
- Multimodal grounding. Live integrates camera, screen content, and search natively in the conversation.
- Google Home / Workspace integration. Multi-step home automation, Gmail, Calendar, Docs — first-class.
- Free or bundled access for most consumer use.
Where Gemini Live trails: developer API maturity for production agent workloads vs OpenAI Realtime API.
Decision tree: which voice stack?
You’re building an enterprise voice agent (booking, customer support, sales) that needs tool calls and reasoning. → GPT-Realtime-2. The 128K context + parallel tool calls + reasoning_effort knob is the clearest production fit in May 2026.
You’re building a consumer app where voice is the brand (audiobooks, characters, branded assistants, accessibility tools). → ElevenLabs Conversational AI for voice + GPT-5.5 or Claude Opus 4.7 for the brain. Pay the premium per minute; voice quality drives retention.
You’re shipping into the Google ecosystem (Pixel, Android Auto, Google Home, Workspace). → Gemini Live is the only native option. Don’t fight it.
You need real-time multilingual translation (meetings, events, accessibility). → GPT-Realtime-Translate. Flat $0.034/min beats running translation prompts through any of the three above.
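A quick back-of-envelope check makes the flat-rate claim concrete. The tokens-per-minute figure below is an assumption (OpenAI's earlier realtime models metered roughly 600 audio tokens per minute of input); adjust it for your actual stack before trusting the comparison:

```python
# Back-of-envelope: flat per-minute translation vs token-priced
# audio through GPT-Realtime-2. TOKENS_PER_MIN is an assumption.

TRANSLATE_PER_MIN = 0.034   # GPT-Realtime-Translate flat rate, $/min
INPUT_PER_M_TOK = 32.0      # $ per 1M audio-input tokens
OUTPUT_PER_M_TOK = 64.0     # $ per 1M audio-output tokens
TOKENS_PER_MIN = 600        # assumed audio tokens per minute

def realtime2_cost(minutes: float) -> float:
    """Rough cost of the same audio through GPT-Realtime-2,
    assuming output roughly mirrors input length."""
    toks = minutes * TOKENS_PER_MIN
    return toks / 1e6 * (INPUT_PER_M_TOK + OUTPUT_PER_M_TOK)

def translate_cost(minutes: float) -> float:
    return minutes * TRANSLATE_PER_MIN

for m in (60, 480):  # a 1-hour meeting, a full event day
    print(f"{m} min: translate ${translate_cost(m):.2f} "
          f"vs realtime-2 ~${realtime2_cost(m):.2f}")
# 60 min: translate $2.04 vs realtime-2 ~$3.46
# 480 min: translate $16.32 vs realtime-2 ~$27.65
```

Under these assumptions the flat rate is a bit under 60% of the token-metered cost, and the gap widens if the general model narrates or reasons more than it strictly needs to.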
You’re transcribing only. → GPT-Realtime-Whisper for streaming, batch Whisper for offline.
What changed in May 2026
- May 7: GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper released. Realtime API out of beta.
- May 5: Google Home moved to Gemini 3.1 for Gemini for Home assistant.
- May 9: Pixel 10a launched with Tensor G4 + on-device Gemini Live features.
Migration notes if you’re on the beta Realtime API
- Rename the model in your session config from `gpt-realtime` to `gpt-realtime-2`.
- Verify your context — long sessions that previously had to chunk at 32K can now run to 128K natively.
- Add the `reasoning_effort` parameter — set to `medium` as a default; bump to `high` for harder workflows; `minimal` for latency-sensitive simple flows.
- Use parallel tool calls if your existing tools are independent — measurable latency win.
- Add preamble handling on the client — render “let me check that” text while audio is generating.
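The preamble-handling step can be sketched as a small client-side event handler: render the model's "let me check that" transcript as soon as it streams in, before the tool result or audio finishes. The event names here are illustrative, not the exact Realtime API event types:

```python
# Client-side preamble handling, sketched. Event names are
# illustrative, not the exact Realtime API event types.

def handle_event(event: dict, ui: list) -> None:
    """Append renderable text to a toy UI as events stream in."""
    kind = event.get("type")
    if kind == "response.text.delta":
        # Preamble text streams alongside the tool call --
        # show it immediately so the user never gets dead air.
        ui.append(event["delta"])
    elif kind == "tool_call.started":
        ui.append("[checking...]")
    elif kind == "tool_call.completed":
        ui.append("[done]")

ui: list = []
for ev in [
    {"type": "response.text.delta", "delta": "Let me check that. "},
    {"type": "tool_call.started"},
    {"type": "tool_call.completed"},
]:
    handle_event(ev, ui)
print("".join(ui))  # Let me check that. [checking...][done]
```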
Most teams report a single PR migration. The big upgrades (128K, parallel tools, reasoning_effort) are feature-flag-style additions, not breaking changes.
What to watch next
- OpenAI Realtime API SLA — first published SLA numbers post-GA.
- ElevenLabs response — counter-pricing or latency improvements.
- Gemini Live developer API maturity — Google’s I/O 2026 (May 19) is likely to expand the Live developer story.
- Multimodal voice agents — combining camera/screen + voice + tool use is the next frontier; all three providers are racing.
Related reading
- Best AI voice agent platforms
- Best AI voice agent stack for SMB service businesses
- GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro
Last verified: May 10, 2026 — sources: OpenAI Realtime API release notes, OpenAI cookbook (realtime translation), TheNextWeb, MarktechPost, Latent.Space, Neowin, 9to5Mac, Microsoft Azure AI Foundry blog, Google Home update notes.