How to Migrate to GPT-Realtime-2 from Realtime API Beta (2026)
OpenAI moved the Realtime API out of beta on May 7, 2026 with GPT-Realtime-2 as the new flagship voice model. The migration is short — typically a single PR — but the new features (128K context, parallel tool calls, reasoning_effort, preambles) are worth tuning. Here’s the step-by-step.
Last verified: May 10, 2026
Why migrate
| Capability | Beta GPT-Realtime | GPT-Realtime-2 |
|---|---|---|
| Reasoning class | Light reasoning | GPT-5-class |
| Context window | 32K | 128K |
| Tool calls | Sequential only | Parallel |
| Preambles / recovery | None | First-class |
| Reasoning effort | Fixed | Adjustable (5 levels) |
| API status | Beta (no SLA) | GA (production SLA) |
Three things justify the migration on their own:
- 128K context. Long calls and document grounding without chunking workarounds.
- Parallel tool calls. 30-50% latency reduction on multi-tool turns in production benchmarks.
- GA SLA. If you ship to customers, the production SLA matters.
The five-step migration
Step 1: Rename the model
In your session config, change:
{
  "session": {
-   "model": "gpt-realtime",
+   "model": "gpt-realtime-2",
    ...
  }
}
That’s the minimum viable migration. Everything else is upgrades that take advantage of new capabilities.
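If your client keeps a long-lived WebSocket open, the rename can also be applied mid-session. A minimal sketch of building the event, assuming the GA API keeps the beta's `session.update` event shape (verify against the release notes):

```python
import json

def build_session_update(model: str = "gpt-realtime-2") -> str:
    """Serialize a session.update event that switches the session's model.
    Event shape assumed from the beta Realtime API."""
    event = {
        "type": "session.update",
        "session": {"model": model},
    }
    return json.dumps(event)

# Send the serialized event over your existing WebSocket, e.g.:
# ws.send(build_session_update())
```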
Step 2: Set reasoning_effort
Add the new reasoning_effort parameter. Five levels:
| Level | Use case | Latency |
|---|---|---|
| minimal | FAQ lookup, drive-thru ordering, smart-home control | Sub-second TTFA |
| low | Basic IVR replacement, lightweight CS | Fast |
| medium (default) | Standard CS, scheduling, account lookups | Balanced |
| high | Enterprise workflows, complex troubleshooting | Slower, better answers |
| xhigh | Regulated workflows, hardest reasoning | Slowest, best answers |
{
  "session": {
    "model": "gpt-realtime-2",
    "reasoning_effort": "medium",
    ...
  }
}
Default to medium. Profile your top 20 conversation patterns. Bump specific patterns to high/xhigh only when correctness dominates speed. Bump down to minimal/low when latency dominates.
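Per-pattern tuning is easiest to maintain as a lookup table. A sketch, where the pattern names and the helper are illustrative (your profiling output defines the real keys):

```python
# Hypothetical pattern -> effort table built from profiling your top
# conversation patterns; these names are examples, not API values.
EFFORT_BY_PATTERN = {
    "faq_lookup": "minimal",
    "order_status": "low",
    "account_change": "medium",
    "billing_dispute": "high",
    "compliance_review": "xhigh",
}

def reasoning_effort_for(pattern: str) -> str:
    # Unprofiled patterns fall back to the recommended default.
    return EFFORT_BY_PATTERN.get(pattern, "medium")
```

Feed the result into the `reasoning_effort` field of your session config at call setup.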
Step 3: Enable parallel tool calls
If your tools are independent (don’t depend on each other’s outputs in the same turn), enable parallel:
{
  "session": {
    "model": "gpt-realtime-2",
    "reasoning_effort": "medium",
    "tools": [
      {"type": "function", "function": {...}},
      {"type": "function", "function": {...}}
    ],
    "tool_choice": "auto",
    "parallel_tool_calls": true
  }
}
Concrete win: if a customer-support agent fetches account info, recent orders, and policy docs in the same turn, parallel execution collapses three sequential round-trips (roughly 3x latency) into one (roughly 1x).
Don’t enable parallel tool calls if your tools have dependencies (the second call needs the result of the first). Keep those sequential.
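On the client side, parallel tool calls only pay off if your handlers actually run concurrently. A sketch using `asyncio.gather`, where the two tool handlers are stand-ins for your real implementations:

```python
import asyncio

# Placeholder tool handlers -- replace with your real implementations.
async def fetch_account(args: dict) -> dict:
    await asyncio.sleep(0.01)  # stands in for a backend call
    return {"account": "ok"}

async def fetch_orders(args: dict) -> dict:
    await asyncio.sleep(0.01)
    return {"orders": []}

TOOLS = {"fetch_account": fetch_account, "fetch_orders": fetch_orders}

async def run_tool_calls(calls: list) -> list:
    """calls: [{"name": ..., "arguments": ...}] emitted in a single turn.
    Runs all independent calls concurrently; results keep input order."""
    tasks = [TOOLS[c["name"]](c["arguments"]) for c in calls]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_tool_calls([
    {"name": "fetch_account", "arguments": {}},
    {"name": "fetch_orders", "arguments": {}},
]))
```

Total latency is roughly the slowest single call rather than the sum of all calls, which is where the multi-tool-turn reduction comes from.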
Step 4: Render preamble text on the client
GPT-Realtime-2 emits “preambles” — short utterances like “let me check that” while it generates the actual answer or runs tool calls. Render these on your client UI:
session.on('response.audio_transcript.delta', (delta) => {
if (delta.preamble) {
showPreambleText(delta.text); // "let me check that"
} else {
appendTranscript(delta.text);
}
});
Without rendering preambles, your UI looks like dead air during long tool calls. With preambles rendered, the agent feels alive even on slow turns.
Step 5: Verify your context window assumptions
The 32K → 128K context window expansion removes the most common production workaround. Audit your code for:
- Conversation truncation logic — if you were aggressively dropping old turns to stay under 32K, relax it. Hold full call history for sessions over an hour.
- Document grounding chunking — if you were chunking long manuals or knowledge bases to stay in budget, simplify. Inline grounding for typical knowledge bases now fits.
- Multi-turn agent state — long-running agent loops can keep more context across turns without manual summarization.
Don’t blindly send 128K every turn — you’ll burn audio token cost. But you no longer need workarounds for 30K+ contexts.
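Relaxed truncation can be as simple as trimming from the oldest end against a token budget. A sketch, where the budget constant and per-turn token counts are assumptions (use your tokenizer's real counts):

```python
# Assumed budget with headroom under the 128K context window.
CONTEXT_BUDGET = 120_000

def trim_history(turns: list) -> list:
    """turns: list of (text, token_count), oldest first.
    Keeps the newest turns that fit in the budget,
    dropping only from the oldest end."""
    kept, total = [], 0
    for text, tokens in reversed(turns):
        if total + tokens > CONTEXT_BUDGET:
            break
        kept.append((text, tokens))
        total += tokens
    return list(reversed(kept))
```

Under the old 32K window this loop fired constantly; at 128K, most hour-long sessions never hit the budget at all.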
Pricing changes to budget for
| Item | Price |
|---|---|
| GPT-Realtime-2 audio input | $32 / 1M tokens |
| GPT-Realtime-2 audio output | $64 / 1M tokens |
| GPT-Realtime-Translate | $0.034 / minute (flat) |
| GPT-Realtime-Whisper | Whisper streaming pricing |
Practical cost rule of thumb for GPT-Realtime-2: $0.30-0.60 per minute of two-way conversation, depending on speech density and tool call frequency.
If your alerts were pinned to beta pricing, update them.
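A back-of-envelope estimator against the price sheet above keeps alerts honest. The token rates come from the table; how many tokens a minute of audio produces is traffic-dependent, so measure before wiring this into billing:

```python
# Rates from the GA price sheet above ($ per 1M audio tokens).
AUDIO_INPUT_PER_M = 32.0
AUDIO_OUTPUT_PER_M = 64.0

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one GPT-Realtime-2 call from token counts."""
    return (input_tokens / 1e6) * AUDIO_INPUT_PER_M + \
           (output_tokens / 1e6) * AUDIO_OUTPUT_PER_M
```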
Add a small router for the three audio models
Most production teams now route between the three:
def route_audio_request(intent: str, content: str) -> str:
    if intent == "translation_only":
        return "gpt-realtime-translate"  # $0.034/min flat
    if intent == "transcription_only":
        return "gpt-realtime-whisper"
    return "gpt-realtime-2"  # default conversational
For multilingual meeting tools, this single router can cut translation costs by ~10x vs running everything through Realtime-2.
Common migration mistakes
- Setting reasoning_effort=xhigh globally. Burns money on simple turns. Default to medium, bump per-pattern.
- Enabling parallel tool calls when tools have dependencies. Causes wrong-order calls. Audit your tool dependency graph first.
- Forgetting to render preamble text. UI feels dead during long tool calls.
- Not updating budget alerts. New pricing differs from beta; existing budget alerts may misfire.
- Skipping the production SLA review. GA means SLA clauses now apply; legal/procurement may need a re-review.
What to do after migration
- Set up A/B logs comparing latency and customer satisfaction pre/post migration.
- Profile your top 20 conversation patterns and tune reasoning_effort per pattern.
- Add the audio model router so translation and transcription jobs go to cheaper purpose-built models.
- Update internal docs — this is now a GA dependency, not beta R&D.
What to watch next
- OpenAI Realtime API SLA numbers — first published uptime data post-GA.
- New voice catalog additions through Q3 2026.
- Agent SDK improvements that take advantage of parallel tool calls automatically.
- Multimodal voice agents — combining camera/screen + voice + tool use.
Related reading
- What is GPT-Realtime-2? OpenAI’s new voice model
- GPT-Realtime-2 vs ElevenLabs vs Gemini Live
- Best AI voice agent platforms
Last verified: May 10, 2026 — sources: OpenAI Realtime API release notes, OpenAI cookbook (realtime translation guide), MarktechPost, TheNextWeb, Latent.Space, OpenAI community forum.