How to Migrate to GPT-Realtime-2 from Realtime API Beta (2026)
OpenAI moved the Realtime API out of beta on May 7, 2026 with GPT-Realtime-2 as the new flagship voice model. The migration is short — typically a single PR — but the new features (128K context, parallel tool calls, reasoning_effort, preambles) are worth tuning. Here’s the step-by-step.
Last verified: May 10, 2026
Why migrate
| Capability | Beta GPT-Realtime | GPT-Realtime-2 |
|---|---|---|
| Reasoning class | Light reasoning | GPT-5-class |
| Context window | 32K | 128K |
| Tool calls | Sequential only | Parallel |
| Preambles / recovery | None | First-class |
| Reasoning effort | Fixed | Adjustable (5 levels) |
| API status | Beta (no SLA) | GA (production SLA) |
Three things justify the migration on their own:
- 128K context. Long calls and document grounding without chunking workarounds.
- Parallel tool calls. 30-50% latency reduction on multi-tool turns in production benchmarks.
- GA SLA. If you ship to customers, the production SLA matters.
The five-step migration
Step 1: Rename the model
In your session config, change:
{
  "session": {
-   "model": "gpt-realtime",
+   "model": "gpt-realtime-2",
    ...
  }
}
That’s the minimum viable migration. Everything else is upgrades that take advantage of new capabilities.
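If your client keeps a long-lived WebSocket open, the rename can also be applied mid-session. A minimal sketch of building the event, assuming the GA API keeps the beta's `session.update` event shape (verify against the release notes):

```python
import json

def build_session_update(model: str = "gpt-realtime-2") -> str:
    """Serialize a session.update event that switches the session's model.
    Event shape assumed from the beta Realtime API."""
    event = {
        "type": "session.update",
        "session": {"model": model},
    }
    return json.dumps(event)

# Send the serialized event over your existing WebSocket, e.g.:
# ws.send(build_session_update())
```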
Step 2: Set reasoning_effort
Add the new reasoning_effort parameter. Five levels:
| Level | Use case | Latency |
|---|---|---|
| minimal | FAQ lookup, drive-thru ordering, smart-home control | Sub-second TTFA |
| low | Basic IVR replacement, lightweight CS | Fast |
| medium (default) | Standard CS, scheduling, account lookups | Balanced |
| high | Enterprise workflows, complex troubleshooting | Slower, better answers |
| xhigh | Regulated workflows, hardest reasoning | Slowest, best answers |
{
  "session": {
    "model": "gpt-realtime-2",
    "reasoning_effort": "medium",
    ...
  }
}
Default to medium. Profile your top 20 conversation patterns. Bump specific patterns to high/xhigh only when correctness dominates speed. Bump down to minimal/low when latency dominates.
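Per-pattern tuning is easiest to maintain as a lookup table. A sketch, where the pattern names and the helper are illustrative (your profiling output defines the real keys):

```python
# Hypothetical pattern -> effort table built from profiling your top
# conversation patterns; these names are examples, not API values.
EFFORT_BY_PATTERN = {
    "faq_lookup": "minimal",
    "order_status": "low",
    "account_change": "medium",
    "billing_dispute": "high",
    "compliance_review": "xhigh",
}

def reasoning_effort_for(pattern: str) -> str:
    # Unprofiled patterns fall back to the recommended default.
    return EFFORT_BY_PATTERN.get(pattern, "medium")
```

Feed the result into the `reasoning_effort` field of your session config at call setup.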
Step 3: Enable parallel tool calls
If your tools are independent (don’t depend on each other’s outputs in the same turn), enable parallel:
{
  "session": {
    "model": "gpt-realtime-2",
    "reasoning_effort": "medium",
    "tools": [
      {"type": "function", "function": {...}},
      {"type": "function", "function": {...}}
    ],
    "tool_choice": "auto",
    "parallel_tool_calls": true
  }
}
Concrete win: if a customer-support agent fetches account info, recent orders, and policy docs in the same turn, parallel execution collapses three sequential round-trips (roughly 3x latency) into one (roughly 1x).
Don’t enable parallel tool calls if your tools have dependencies (the second call needs the result of the first). Keep those sequential.
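On the client side, parallel tool calls only pay off if your handlers actually run concurrently. A sketch using `asyncio.gather`, where the two tool handlers are stand-ins for your real implementations:

```python
import asyncio

# Placeholder tool handlers -- replace with your real implementations.
async def fetch_account(args: dict) -> dict:
    await asyncio.sleep(0.01)  # stands in for a backend call
    return {"account": "ok"}

async def fetch_orders(args: dict) -> dict:
    await asyncio.sleep(0.01)
    return {"orders": []}

TOOLS = {"fetch_account": fetch_account, "fetch_orders": fetch_orders}

async def run_tool_calls(calls: list) -> list:
    """calls: [{"name": ..., "arguments": ...}] emitted in a single turn.
    Runs all independent calls concurrently; results keep input order."""
    tasks = [TOOLS[c["name"]](c["arguments"]) for c in calls]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_tool_calls([
    {"name": "fetch_account", "arguments": {}},
    {"name": "fetch_orders", "arguments": {}},
]))
```

Total latency is roughly the slowest single call rather than the sum of all calls, which is where the multi-tool-turn reduction comes from.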
Step 4: Render preamble text on the client
GPT-Realtime-2 emits “preambles” — short utterances like “let me check that” while it generates the actual answer or runs tool calls. Render these on your client UI:
session.on('response.audio_transcript.delta', (delta) => {
if (delta.preamble) {
showPreambleText(delta.text); // "let me check that"
} else {
appendTranscript(delta.text);
}
});
Without rendering preambles, your UI looks like dead air during long tool calls. With preambles rendered, the agent feels alive even on slow turns.
Step 5: Verify your context window assumptions
The 32K → 128K context window expansion removes the most common production workaround. Audit your code for:
- Conversation truncation logic — if you were aggressively dropping old turns to stay under 32K, relax it. Hold full call history for sessions over an hour.
- Document grounding chunking — if you were chunking long manuals or knowledge bases to stay in budget, simplify. Inline grounding for typical knowledge bases now fits.
- Multi-turn agent state — long-running agent loops can keep more context across turns without manual summarization.
Don’t blindly send 128K every turn — you’ll burn audio token cost. But you no longer need workarounds for 30K+ contexts.
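Relaxed truncation can be as simple as trimming from the oldest end against a token budget. A sketch, where the budget constant and per-turn token counts are assumptions (use your tokenizer's real counts):

```python
# Assumed budget with headroom under the 128K context window.
CONTEXT_BUDGET = 120_000

def trim_history(turns: list) -> list:
    """turns: list of (text, token_count), oldest first.
    Keeps the newest turns that fit in the budget,
    dropping only from the oldest end."""
    kept, total = [], 0
    for text, tokens in reversed(turns):
        if total + tokens > CONTEXT_BUDGET:
            break
        kept.append((text, tokens))
        total += tokens
    return list(reversed(kept))
```

Under the old 32K window this loop fired constantly; at 128K, most hour-long sessions never hit the budget at all.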
Pricing changes to budget for
| Item | Price |
|---|---|
| GPT-Realtime-2 audio input | $32 / 1M tokens |
| GPT-Realtime-2 audio output | $64 / 1M tokens |
| GPT-Realtime-Translate | $0.034 / minute (flat) |
| GPT-Realtime-Whisper | Whisper streaming pricing |
Practical cost rule of thumb for GPT-Realtime-2: $0.30-0.60 per minute of two-way conversation, depending on speech density and tool call frequency.
If your alerts were pinned to beta pricing, update them.
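A back-of-envelope estimator against the price sheet above keeps alerts honest. The token rates come from the table; how many tokens a minute of audio produces is traffic-dependent, so measure before wiring this into billing:

```python
# Rates from the GA price sheet above ($ per 1M audio tokens).
AUDIO_INPUT_PER_M = 32.0
AUDIO_OUTPUT_PER_M = 64.0

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one GPT-Realtime-2 call from token counts."""
    return (input_tokens / 1e6) * AUDIO_INPUT_PER_M + \
           (output_tokens / 1e6) * AUDIO_OUTPUT_PER_M
```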
Add a small router for the three audio models
Most production teams now route between the three:
def route_audio_request(intent: str, content: str) -> str:
    if intent == "translation_only":
        return "gpt-realtime-translate"  # $0.034/min flat
    if intent == "transcription_only":
        return "gpt-realtime-whisper"
    return "gpt-realtime-2"  # default conversational
For multilingual meeting tools, this single router can cut translation costs by ~10x vs running everything through Realtime-2.
Common migration mistakes
- Setting reasoning_effort=xhigh globally. Burns money on simple turns. Default to medium, bump per-pattern.
- Enabling parallel tool calls when tools have dependencies. Causes wrong-order calls. Audit your tool dependency graph first.
- Forgetting to render preamble text. UI feels dead during long tool calls.
- Not updating budget alerts. New pricing differs from beta; existing budget alerts may misfire.
- Skipping the production SLA review. GA means SLA clauses now apply; legal/procurement may need a re-review.
What to do after migration
- Set up A/B logs comparing latency and customer satisfaction pre/post migration.
- Profile your top 20 conversation patterns and tune reasoning_effort per pattern.
- Add the audio model router so translation and transcription jobs go to cheaper purpose-built models.
- Update internal docs — this is now a GA dependency, not beta R&D.
What to watch next
- OpenAI Realtime API SLA numbers — first published uptime data post-GA.
- New voice catalog additions through Q3 2026.
- Agent SDK improvements that take advantage of parallel tool calls automatically.
- Multimodal voice agents — combining camera/screen + voice + tool use.
Related reading
- What is GPT-Realtime-2? OpenAI’s new voice model
- GPT-Realtime-2 vs ElevenLabs vs Gemini Live
- Best AI voice agent platforms
Last verified: May 10, 2026 — sources: OpenAI Realtime API release notes, OpenAI cookbook (realtime translation guide), MarktechPost, TheNextWeb, Latent.Space, OpenAI community forum.