On May 7, 2026, OpenAI rolled out the red carpet for GPT-Realtime-2: its first GPT-5 voice model with speech-to-speech reasoning, 128k token context, and a reasoning_effort lever with five levels.
The Realtime API exited beta on the same day, introducing two new voices (Cedar and Marin) and two companion models: Translate and Whisper.
For teams considering a migration from GPT-Realtime-1.5, the real question isn’t the press release, but what the code, billing, and latency will look like come Monday morning.
In brief
- GA on May 7, 2026: the Realtime API leaves beta, the gpt-4o-realtime preview models are discontinued the same day, and the default path becomes gpt-realtime-2.
- Context from 32k to 128k: 1 to 2 hours of dense conversation without reset, with output capped at 32k tokens.
- reasoning_effort in 5 levels: minimal, low (default), medium, high, xhigh, balancing latency vs quality at the session level.
- Audio pricing unchanged: 32 / 64 dollars / M tokens (input / output), cache at 0.40 dollars / M, Translate at 0.034 dollars / min, Whisper at 0.017 dollars / min.
- Field latency: 1.12 s in minimal, 2.33 s in high (Artificial Analysis), 2 to 4 s median in production.
- Targeted migration: model ID change, a new required type field in session.update, events renamed to output_text / output_audio.
What OpenAI announced for GPT-Realtime-2 on May 7, 2026
OpenAI isn’t launching a voice model from scratch: it’s moving to GA a stack that has been maturing in beta since February 2025.
The Realtime API shifts to general availability, and the gpt-4o-realtime-preview models are withdrawn the same day with no grace period; the default path becomes gpt-realtime-2.
Any request sent with the header OpenAI-Beta: realtime=v1 will fail after this transition, according to OpenAI’s deprecation page.
Cedar, Marin, and the Translate / Whisper family
Native speech-to-speech carries prosody, whereas an STT-to-LLM-to-TTS pipeline loses it between stages: it’s the difference between a phone call and postal mail.
Two new voices join the Realtime API exclusively: Cedar and Marin, tailored for support voice agents.
Three models are launched in parallel: GPT-Realtime-2 for conversation, GPT-Realtime-Translate for live translation (70 source languages, 13 targets), GPT-Realtime-Whisper for streaming transcription.
Translate is not a conversational model: it transforms one audio stream into another without any turn-taking logic, which makes the release an entire audio platform rather than an isolated model.
Official pricing per model
Audio pricing remains unchanged from GPT-Realtime-1.5: 32 dollars / M audio input tokens, 64 dollars / M audio output tokens, and 0.40 dollars / M for cached input.
For text: 4 dollars input, 24 dollars output, cache at 0.40 dollars / M, plus 5 dollars / M for images now accepted as input.
Translate is billed at 0.034 dollars / minute, Whisper at 0.017 dollars / minute.
The dual pricing unit (token versus minute) requires cost-per-conversation calculations: a typical session burns 800 audio input tokens and 1,200 audio output tokens per minute, costing 0.10 to 0.15 dollars / minute excluding cache.
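To make the arithmetic explicit, here is a quick back-of-the-envelope sketch; the per-minute token counts are the rough figures quoted above, not guaranteed constants, so plug in your own measured traffic.

```python
# Rough cost-per-minute estimate for a GPT-Realtime-2 voice session.
# The per-minute token counts are the ballpark figures from the article,
# not guarantees; adjust them to your own measured traffic.
AUDIO_IN_PER_MIN = 800      # audio input tokens per minute (estimate)
AUDIO_OUT_PER_MIN = 1_200   # audio output tokens per minute (estimate)

PRICE_IN = 32 / 1_000_000   # dollars per audio input token
PRICE_OUT = 64 / 1_000_000  # dollars per audio output token

cost_per_min = AUDIO_IN_PER_MIN * PRICE_IN + AUDIO_OUT_PER_MIN * PRICE_OUT
print(f"~${cost_per_min:.3f} / minute excluding cache")  # ~$0.102 / minute
```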
The real technical leaps of GPT-Realtime-2 compared to Realtime-1.5
Three levers stand out: the context window, the reasoning setting, and verbal preambles; together they determine what GPT-Realtime-2 retains in a session and what the same task costs.
Context 32k to 128k: what it unlocks
The window quadruples from 32k to 128k tokens, with output capped at 32k tokens per session, allowing 1 to 2 hours of dense conversation in active memory without reset.
The 128k isn’t a free gain: every token kept in context is billed at the audio rate of 32 / 64 dollars / M.
The truncation token limit, set at the session level, remains good practice beyond 10 minutes: it cuts old turns in bulk rather than by message.
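As a sketch, here is what a session-level cap could look like in a session.update payload; the truncation field name and shape are assumptions drawn from the announcement, not a confirmed schema.

```python
import json

# Sketch of a session-level truncation cap, so old turns are cut in bulk
# rather than message by message. The "truncation" field shape is an
# assumption based on the announcement, not a confirmed schema.
session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",                     # GA requires an explicit session type
        "model": "gpt-realtime-2",
        "truncation": {"max_tokens": 16_000},   # assumed field name and shape
    },
}
payload = json.dumps(session_update)
# ws.send(payload)  # over an already-open Realtime WebSocket
```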

reasoning_effort in 5 levels and verbal preambles
The reasoning_effort lever accepts five values: minimal, low (default), medium, high, xhigh, like a gearbox where you only shift to xhigh on a steep slope.
Artificial Analysis measures 1.12 s time-to-first-audio in minimal, 2.33 s in high: the decision to increase the setting directly impacts perceived latency.
Verbal preambles soften this cost: the model says “let me check that” while a tool call runs, instead of leaving the caller with the dead air of someone typing silently at a keyboard.
The optimal default is low for 80% of the flow, escalating to high if the turn requires a chain of inference (calculation, multi-step reasoning, document verification).
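In practice that policy boils down to a single session field; the sketch below assumes reasoning_effort sits directly in the session payload, which remains to be checked against the final reference docs.

```python
import json

def session_payload(effort: str = "low") -> str:
    """Build a session.update with the chosen reasoning_effort level.

    Valid levels per the announcement: minimal, low (default),
    medium, high, xhigh. Field placement is an assumption.
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "type": "realtime",
            "model": "gpt-realtime-2",
            "reasoning_effort": effort,
        },
    })

# Default flow stays on "low"; escalate for multi-step turns:
# ws.send(session_payload("high"))
```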
Barge-in, parallel tool calls, and MCP
GPT-Realtime-2 natively supports three capabilities that GPT-Realtime-1.5 only tolerated at the cost of heavy application plumbing.
Parallel tool calls run within the same turn: the model can query both calendar and CRM simultaneously without blocking the voice.
Asynchronous function calling keeps the conversation fluid while a long API call loads in the background.
session.update accepts a remote MCP server URL and the platform wires the tool calls for you: a shared closet for several agents rather than a drawer bolted to the app.
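A sketch of that wiring follows, with the tool block modeled on the MCP format OpenAI uses elsewhere; the exact Realtime shape, the server_label, and the URL are assumptions for illustration.

```python
import json

# Sketch: point the session at a remote MCP server so the platform
# wires tool calls itself. The tool block mirrors the MCP format used
# in OpenAI's other APIs; its exact Realtime shape is an assumption.
session_update = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "model": "gpt-realtime-2",
        "tools": [{
            "type": "mcp",
            "server_label": "crm",                    # hypothetical label
            "server_url": "https://mcp.example.com",  # your MCP server
            "require_approval": "never",              # assumed option
        }],
    },
}
# ws.send(json.dumps(session_update))
```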
Reading GPT-Realtime-2 benchmarks with perspective
On Big Bench Audio, GPT-Realtime-2 in high reasoning scores 96.6% versus 81.4% for Realtime-1.5: a 15.2-point gain.
On Audio MultiChallenge in xhigh, 48.5% versus 34.7%.
The Big Bench Audio leap is measurable, but a 96.6% score suggests the benchmark itself is saturating.
Audio MultiChallenge tells a different story: the multi-turn remains an open problem, with less than one in two turns passing in xhigh.
The previous generational jump, measured at the end of 2024 against GPT-Realtime-1.5, went from 20.6% to 30.5% on this bench, and from 49.7% to 66.5% on audio ComplexFuncBench (vocal function calling).
Regarding latency, OpenAI does not provide official figures: Artificial Analysis reports 1.12 s in minimal and 2.33 s in high, with field feedback up to 3.4 s on long sessions in xhigh.
This range contrasts with the Deepgram + Llama + Cartesia cascade (500-800 ms) or with Gemini 3.1 Flash Live (250-500 ms): native speech-to-speech doesn’t win the millisecond race; it wins on prosody and integration simplicity.
AI voice in May 2026: GPT-Realtime-2 against the competition
The AI voice market in May 2026 isn’t just about choosing a model; it’s an architectural decision sorted by latency, pricing, and vendor lock-in.
- GPT-Realtime-2 (OpenAI): native speech-to-speech, latency 1.1-3.4 s, 0.15 to 0.20 dollars / minute, high lock-in, 128k context, native MCP and SIP.
- Gemini 3.1 Flash Live (Google): native speech-to-speech, 250-500 ms in good conditions, economical audio token pricing at high volume.
- Cartesia Sonic-3: State Space Model, TTFA 90 ms, 46.70 dollars / M characters, unbeatable on pure latency in cascade.
- Deepgram Voice Agent: Nova-3 STT + Aura TTS bundle at 4.50 dollars / hour, sub-700 ms end-to-end.
- ElevenLabs Conversational AI: Flash v2.5 TTS at 75 ms, premium quality, credit-based pricing that skyrockets beyond 10,000 minutes / month.
- AWS Nova Sonic: alternative speech-to-speech positioned on pricing, latency close to Realtime-2.
GPT-Realtime-2 shines for projects requiring multi-turn reasoning and rich tool use, while the cascade remains more economical for basic containment at high volume.

Migrating from GPT-Realtime-1.5 to GPT-Realtime-2 without breaking everything
The migration from GPT-Realtime-1.5 to GPT-Realtime-2 is not wire-compatible with the beta: changes are localized but blocking if ignored.
Identifier and new session parameters
The first patch is two lines: replace gpt-4o-realtime-preview or gpt-realtime with gpt-realtime-2, and remove the header OpenAI-Beta: realtime=v1 from all requests.
The session.update payload now requires a type field with two possible values: realtime for speech-to-speech, transcription for pure transcription.
Events have been renamed: response.text.delta becomes response.output_text.delta, response.audio.delta becomes response.output_audio.delta, conversation.item.created is replaced by conversation.item.added and conversation.item.done.
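Assuming the GA event names quoted above, a minimal shim keeps an existing handler working across the rename during the migration window.

```python
# Map beta-era event names to their GA equivalents so one handler
# serves both during the migration window.
EVENT_RENAMES = {
    "response.text.delta": "response.output_text.delta",
    "response.audio.delta": "response.output_audio.delta",
    "conversation.item.created": "conversation.item.added",  # plus .done at completion
}

def normalize(event: dict) -> dict:
    """Rewrite a beta event type to its GA name (no-op if already GA)."""
    event["type"] = EVENT_RENAMES.get(event["type"], event["type"])
    return event
```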
Wiring Translate, Whisper, and transport
The GPT-Realtime-2 family remains modular: to add live translation, the audio stream goes through Translate via the dedicated endpoint /v1/realtime/translations, without a response.create call.
For pure transcription, Whisper streaming runs on a separate transcription session.
On the transport side, WebRTC remains the default for browsers and mobile, WebSocket serves server pipelines, SIP connects IP phones and the PSTN.
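As a sketch, opening a Translate session over WebSocket could look like this; the endpoint path comes from the announcement, but the model identifier, query parameter, and header handling are assumptions to verify against the final reference docs.

```python
import asyncio
import json
import websockets  # pip install websockets

async def open_translate_session() -> None:
    # Endpoint path per the announcement; model name and query parameter
    # are assumptions for illustration.
    url = "wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate"
    headers = {"Authorization": "Bearer YOUR_API_KEY"}  # use a real key
    # Note: older versions of websockets call this kwarg extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # No response.create needed: stream audio in, translated audio comes back.
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))

asyncio.run(open_translate_session())
```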
Creating ephemeral keys switches to POST /v1/realtime/client_secrets; WebRTC SDP exchange shifts to /v1/realtime/calls; SIP REFER enables programmed call transfers to a human without custom middleware.
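Minting an ephemeral key for a browser or mobile client could look like the sketch below; only the endpoint path comes from the announcement, while the request body and the response field name are assumptions.

```python
import os
import requests

# Mint a short-lived client secret for a browser/mobile client.
# Endpoint path per the announcement; the body shape and the
# response field name ("value") are assumptions.
resp = requests.post(
    "https://api.openai.com/v1/realtime/client_secrets",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"session": {"type": "realtime", "model": "gpt-realtime-2"}},
    timeout=10,
)
resp.raise_for_status()
ephemeral_key = resp.json()["value"]  # hand this to the WebRTC client
```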
GPT-Realtime-2 use cases in production and residual limits
Zillow reports a 26-point jump on its adversarial benchmark after prompt optimization on GPT-Realtime-2, the kind of operational metric that outweighs a demo clip.
OpenAI named four pilot clients during the May 7, 2026 livestream: Deutsche Telekom on a multilingual support agent with native German-English-Turkish code-switching, Zillow with its 26-point adversarial gain after prompt optimization, Priceline wired to Translate for four languages, and Vimeo on live note-taking for creator support.
The containment rate, the key metric for mature voicebots, ranges from 40% to 70% depending on verticals: below 40%, the ROI argument collapses.
For French, instruction following is sometimes ignored on Cedar and Marin (a flaw already noted on Realtime-1.5), and OpenAI’s legacy GPT voices already carried strained Francophone accents.
Vocal hallucinations and prompt injection (hidden instructions embedded in incoming audio) remain active risks: SplxAI documents several cases of textual safeguards being bypassed through vocal injection.
Conclusion: what GPT-Realtime-2 changes for your voice stack
GPT-Realtime-2 doesn’t erase Cartesia, Deepgram, or ElevenLabs: it reshuffles the deck for teams willing to accept the audio token ticket and who want native reasoning, tool use, and SIP without building the plumbing themselves.
The real novelty is less the model than the stack: GPT-Realtime-2 + Translate + Whisper form a coherent audio platform, each component billed separately.
Are you evaluating or migrating to GPT-Realtime-2?
Check out our comparison of AI voice assistants in 2026 to frame options beyond OpenAI, and align your cost assumptions on 50 timed conversations with your carrier before going live.
GPT-Realtime-2 FAQ
When did GPT-Realtime-2 go GA?
OpenAI moved the Realtime API to general availability on May 7, 2026, withdrawing the gpt-4o-realtime preview models the same day.
How much does GPT-Realtime-2 cost per minute of conversation?
A typical session burns roughly 800 audio input tokens and 1,200 audio output tokens per minute, costing 0.10 to 0.15 dollars / minute excluding cache (cached input at 0.40 dollars / M, a 98.75% discount).
What latency to expect in production over 30 to 60 turns?
Artificial Analysis measures 1.12 s in minimal and 2.33 s in high for time-to-first-audio; field feedback reports up to 3.4 s in xhigh on long sessions.
When to switch reasoning_effort from low to xhigh?
Keep low for 80% of the flow, switch to high for turns requiring a chain of inference, and reserve xhigh for when quality matters more than latency.
How to wire Translate and Whisper without doubling the bill?
Translate sessions (0.034 dollars / min) and Whisper (0.017 dollars / min) are independent, billed per minute, and add to the Realtime-2 cost only when the scenario calls for them.
Should you migrate immediately from GPT-Realtime-1.5?
The gpt-4o-realtime preview models have been discontinued since May 7, 2026, but gpt-realtime-1.5 deployments continue to function, with the canonical path now being gpt-realtime-2.
Is French on par with English on Cedar and Marin?
English-French code-switching works, but instruction following is sometimes ignored according to May 2026 feedback; parity is not guaranteed.
What are the differences between native speech-to-speech and pipeline cascade?
Native speech-to-speech ingests and emits audio without intermediate transcription (150-500 ms, prosody preserved), while the STT to LLM to TTS cascade chains three services (500-800 ms, cost 3 to 5 times lower).
For which use case does GPT-Realtime-2 surpass Cartesia or Gemini?
Multi-turn reasoning with rich tool use in a single API: Cartesia wins on pure latency, Gemini on cost at very high volume, GPT-Realtime-2 on tool integration and native SIP.
Is the 128k token context a free gain?
No: every token kept in context is billed at the audio rate of 32 / 64 dollars / M; the truncation token limit remains the recommended practice beyond 10 minutes.