
GPT-Realtime-2 vs ElevenLabs vs Gemini Live (May 2026)

OpenAI moved the Realtime API out of beta on May 7, 2026, with three new models — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. ElevenLabs Conversational AI and Gemini Live are the other two stacks production voice teams evaluate. Here’s how the three compare for real workloads.

Last verified: May 10, 2026

The three at a glance

| Capability | GPT-Realtime-2 | ElevenLabs Conversational AI | Gemini Live |
| --- | --- | --- | --- |
| Provider | OpenAI | ElevenLabs | Google |
| Status (May 2026) | GA (May 7) | GA | GA |
| Architecture | Speech-to-speech | Turn-based ASR + LLM + TTS | Speech-to-speech |
| Reasoning model | GPT-5-class | Configurable (you bring the LLM) | Gemini 3.1 Pro |
| Context window | 128K | Depends on configured LLM | 2M (Gemini 3.1 Pro) |
| Tool calling | Parallel | Yes, via configured LLM | Yes |
| Voice quality | Excellent | Best-in-class | Very good |
| Pricing model | Per-token audio | Per-minute + LLM cost | Per-token |
| Best for | Reasoning + tool-use voice agents | Premium voice apps | Google ecosystem / on-device |

What each one actually is

GPT-Realtime-2: the new default for reasoning voice

OpenAI released GPT-Realtime-2 on May 7, 2026, simultaneously taking the Realtime API out of beta. The headline upgrades vs the original GPT-Realtime:

  • GPT-5-class reasoning in a speech-to-speech loop.
  • 128K context window (up from 32K). Long calls, document grounding, sustained agent tasks.
  • Parallel tool calls — multiple backend lookups at once, narrated to the user (“let me check that and look up your account at the same time”).
  • Preambles and recovery — the model says “let me check that” instead of going silent during tool calls; it gracefully recovers from interrupted speech.
  • Tone control — set conversational tone (warm, professional, energetic) via prompt.
  • Adjustable reasoning effort — minimal, low, medium, high, xhigh. Higher reasoning = better answers, longer latency.
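The knobs above can be sketched as a session config. This is a hypothetical sketch only — the field layout and `session.update` message shape are assumptions based on the features described, not the official schema:

```python
import json

# Hypothetical session config for GPT-Realtime-2 using the knobs the release
# describes: reasoning_effort levels and tone control via the prompt.
# The exact wire format is an assumption, not the official schema.
def build_session_update(effort: str = "medium", tone: str = "warm") -> str:
    allowed = {"minimal", "low", "medium", "high", "xhigh"}
    if effort not in allowed:
        raise ValueError(f"reasoning_effort must be one of {sorted(allowed)}")
    payload = {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",
            "reasoning_effort": effort,  # latency vs answer-quality trade-off
            "instructions": (
                f"Speak in a {tone} tone. Before any tool call, say a short "
                "preamble like 'let me check that' instead of going silent."
            ),
            "tools": [],  # declare independent tools here for parallel calls
        },
    }
    return json.dumps(payload)

print(build_session_update("high", "energetic"))
```

The point of validating `effort` client-side is that the level directly trades latency for answer quality, so it is worth failing fast on typos rather than discovering them mid-call.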

Pricing: $32 per million audio-input tokens, $64 per million audio-output tokens.
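At those rates, a quick back-of-envelope cost model looks like this (the token counts in the example are placeholders — how many audio tokens a minute of speech consumes varies):

```python
# GPT-Realtime-2 audio pricing from the release:
# $32 per 1M audio-input tokens, $64 per 1M audio-output tokens.
IN_RATE = 32 / 1_000_000   # dollars per audio-input token
OUT_RATE = 64 / 1_000_000  # dollars per audio-output token

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * IN_RATE + output_tokens * OUT_RATE

# e.g. a call that consumed 50k input and 30k output audio tokens:
print(f"${call_cost(50_000, 30_000):.2f}")  # $3.52
```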

Sibling models shipped same day:

  • GPT-Realtime-Translate — speech-to-speech interpretation only (NOT conversational), 70+ input languages, 13 output, flat $0.034/minute.
  • GPT-Realtime-Whisper — streaming transcription only.
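The Translate model’s flat per-minute rate makes its cost a function of wall-clock time only, independent of speech density — easy to budget:

```python
# GPT-Realtime-Translate bills a flat $0.034 per minute of audio,
# so cost depends only on duration, not on token counts.
TRANSLATE_RATE = 0.034  # dollars per minute

def translate_cost(minutes: float) -> float:
    return minutes * TRANSLATE_RATE

print(f"60-min meeting: ${translate_cost(60):.2f}")  # $2.04
```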

ElevenLabs Conversational AI: voice quality is the product

ElevenLabs Conversational AI takes the opposite architecture: you bring the LLM, ElevenLabs handles the voice. You configure the conversation graph and an LLM (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro); ElevenLabs handles ASR, latency optimization, and voice generation.
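The turn-based shape of that architecture can be sketched as a three-stage loop. All three stage functions below are hypothetical stand-ins, not real SDK calls — the sketch just shows where the configured LLM plugs in and why reasoning happens in text space rather than audio space:

```python
from typing import Callable

# Minimal sketch of a turn-based ASR -> LLM -> TTS pipeline, the shape
# ElevenLabs-style stacks use (versus a single speech-to-speech model).
# The asr/llm/tts callables are hypothetical stand-ins, not real SDK calls.
def run_turn(audio_in: bytes,
             asr: Callable[[bytes], str],
             llm: Callable[[str], str],
             tts: Callable[[str], bytes]) -> bytes:
    transcript = asr(audio_in)    # speech -> text
    reply_text = llm(transcript)  # reasoning happens here, in text space
    return tts(reply_text)        # text -> speech

# Wire in whichever LLM you configured (GPT-5.5, Claude Opus 4.7, ...):
audio_out = run_turn(
    b"...caller audio...",
    asr=lambda a: "what's my balance?",
    llm=lambda t: f"You asked: {t}",
    tts=lambda t: t.encode(),
)
print(audio_out)
```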

What ElevenLabs still wins on:

  • Voice realism. ElevenLabs voices sound the most natural. For voiceover-grade output, character voices, audiobooks, accessibility apps where the voice is the brand — nothing else is close.
  • Voice cloning. Best-in-class for cloning a specific voice (with consent). Critical for branded voice agents, IP-licensed character voices.
  • Voice library. Thousands of pre-made voices across languages and styles.
  • Latency profile. Tunable for streaming applications.

Where ElevenLabs trails GPT-Realtime-2: end-to-end speech-to-speech reasoning (the OpenAI model thinks in audio space; ElevenLabs goes through text), tool-call sophistication, and context window depth.

Gemini Live: the Google-ecosystem default

Gemini Live runs on Pixel phones (including the Pixel 10a, launched May 9, 2026 with Tensor G4), Google Home (which moved to Gemini 3.1 in May 2026), Android Auto, and the Gemini app. The model behind Live is Gemini 3.1 Pro.

Where Gemini Live wins:

  • On-device processing. Tensor G4/G5 silicon runs significant chunks of voice work locally. Privacy and latency benefits.
  • Multimodal grounding. Live integrates camera, screen content, and search natively in the conversation.
  • Google Home / Workspace integration. Multi-step home automation, Gmail, Calendar, Docs — first-class.
  • Free or bundled access for most consumer use.

Where Gemini Live trails: developer API maturity for production agent workloads vs OpenAI Realtime API.

Decision tree: which voice stack?

You’re building an enterprise voice agent (booking, customer support, sales) that needs tool calls and reasoning. → GPT-Realtime-2. The 128K context, parallel tool calls, and reasoning_effort knob make it the clearest production fit in May 2026.

You’re building a consumer app where voice is the brand (audiobooks, characters, branded assistants, accessibility tools). → ElevenLabs Conversational AI for voice + GPT-5.5 or Claude Opus 4.7 for the brain. Pay the premium per minute; voice quality drives retention.

You’re shipping into the Google ecosystem (Pixel, Android Auto, Google Home, Workspace). → Gemini Live is the only native option. Don’t fight it.

You need real-time multilingual translation (meetings, events, accessibility). → GPT-Realtime-Translate. Flat $0.034/min beats running translation prompts through any of the three above.

You’re transcribing only. → GPT-Realtime-Whisper for streaming, batch Whisper for offline.

What changed in May 2026

  • May 7: GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper released. Realtime API out of beta.
  • May 5: Google Home moved to Gemini 3.1 for Gemini for Home assistant.
  • May 9: Pixel 10a launched with Tensor G4 + on-device Gemini Live features.

Migration notes if you’re on the beta Realtime API

  1. Rename model in your session config from gpt-realtime to gpt-realtime-2.
  2. Verify your context — long sessions that previously had to chunk at 32K can now run to 128K natively.
  3. Add reasoning_effort parameter — set to medium as a default; bump to high for harder workflows; minimal for latency-sensitive simple flows.
  4. Use parallel tool calls if your existing tools are independent — measurable latency win.
  5. Add preamble handling on the client — render “let me check that” text while audio is generating.
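Steps 1–3 above can be sketched as a small migration helper over a session-config dict. The field names here (`max_context_chunk` in particular) are hypothetical illustrations of a beta-era chunking workaround, not official parameters:

```python
# Hypothetical helper applying migration steps 1-3 to a beta-era session
# config dict. Field names follow this article; "max_context_chunk" is an
# invented stand-in for a 32K chunking workaround, not a real parameter.
def migrate_session(cfg: dict) -> dict:
    out = dict(cfg)
    if out.get("model") == "gpt-realtime":
        out["model"] = "gpt-realtime-2"           # step 1: rename model
    out.pop("max_context_chunk", None)            # step 2: 128K removes chunking
    out.setdefault("reasoning_effort", "medium")  # step 3: sensible default
    return out

old = {"model": "gpt-realtime", "voice": "alloy", "max_context_chunk": 32_000}
print(migrate_session(old))
```

Because the new fields are additive, the helper never overwrites a `reasoning_effort` you already set — consistent with the feature-flag-style nature of the upgrade.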

Most teams report a single PR migration. The big upgrades (128K, parallel tools, reasoning_effort) are feature-flag-style additions, not breaking changes.

What to watch next

  • OpenAI Realtime API SLA — first published SLA numbers post-GA.
  • ElevenLabs response — counter-pricing or latency improvements.
  • Gemini Live developer API maturity — Google’s I/O 2026 (May 19) is likely to expand the Live developer story.
  • Multimodal voice agents — combining camera/screen + voice + tool use is the next frontier; all three providers are racing.

Last verified: May 10, 2026 — sources: OpenAI Realtime API release notes, OpenAI cookbook (realtime translation), TheNextWeb, MarktechPost, Latent.Space, Neowin, 9to5Mac, Microsoft Azure AI Foundry blog, Google Home update notes.