GPT Realtime

GPT Realtime is OpenAI's production-grade speech-to-speech model and the successor to GPT-4o Realtime. It takes microphone audio in and emits generated voice out with ~300ms of round-trip latency, supports interruptions and tool calls, and streams over WebRTC, which suits it to voice agents, phone support, and real-time assistants.

Model specs

Vendor: OpenAI
Family: GPT Realtime
Released: 2025-08
Context window: 128,000 tokens
Modalities: text, audio, code
Input price: $32/M tok
Output price: $64/M tok
Pricing as of: 2026-04-20

Strengths

Speech in, speech out — no ASR + LLM + TTS pipeline to stitch together
Preserves tone, laughter, emotion, and non-lexical cues
Interruptions and barge-in work naturally
Function calling integrates with existing OpenAI tool schemas

Limitations

Audio tokens are expensive — cost per minute adds up fast
Less configurable than ASR + TTS pipelines for brand-specific voices
Quality of non-English voices varies; test before deployment
No streamed text output; pair with a text model for transcripts

Use cases

Voice agents for customer support and outbound calling
Real-time language tutoring and pronunciation coaches
In-car and wearable assistants where latency matters
Accessible interfaces — screen reader replacements, voice-first apps

Benchmarks

Benchmark	Result	As of
Big Bench Audio	≈82%	2025-08
MultiChallenge (audio)	≈30%	2025-08
End-to-end voice latency	≈300ms	2025-08

Frequently asked questions

What is GPT Realtime?

GPT Realtime is OpenAI's speech-to-speech model designed for voice agents. It takes audio in and produces audio out directly, with ~300ms end-to-end latency, support for interruptions, and tool calling. It was released in August 2025 as the generally available successor to GPT-4o Realtime.

How is GPT Realtime different from GPT-4o audio?

GPT-4o offered audio capabilities via the original Realtime preview; GPT Realtime is the production-grade successor with lower latency, more reliable tool calls, and upgraded voices. For new voice agent projects, GPT Realtime is the recommended choice.

How much does GPT Realtime cost?

As of April 2026, GPT Realtime audio tokens are priced at roughly USD 32 per million input tokens and USD 64 per million output tokens. Per-minute cost depends on speech density but typically runs in the low tens of cents per minute of conversation.
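
To make the arithmetic concrete, here is a minimal sketch of a cost estimator built on the two prices quoted above. The tokens-per-minute rates are illustrative assumptions, not figures from this page, and the function name is hypothetical. Measure real token usage from your own sessions before budgeting.

```python
# Prices quoted on this page (USD per million audio tokens, as of 2026-04-20).
INPUT_USD_PER_MTOK = 32.0
OUTPUT_USD_PER_MTOK = 64.0


def audio_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the audio-token cost in USD for the given token counts."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# ASSUMPTION: illustrative token rates for one minute of user speech and
# one minute of model speech. Replace with usage measured in your sessions.
ASSUMED_INPUT_TOKENS_PER_MIN = 600
ASSUMED_OUTPUT_TOKENS_PER_MIN = 1200

if __name__ == "__main__":
    per_min = audio_cost_usd(
        ASSUMED_INPUT_TOKENS_PER_MIN, ASSUMED_OUTPUT_TOKENS_PER_MIN
    )
    print(f"~${per_min:.3f} under the assumed token rates")
```

This counts only freshly spoken audio. If earlier turns are billed again as input context on each response, the effective per-minute cost of a long conversation will be higher than this estimate.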

Can GPT Realtime call tools or functions?

Yes. GPT Realtime supports OpenAI-style function calling inside a live conversation, letting a voice agent look up orders, control a device, or call an MCP server mid-sentence.
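
As a rough illustration of the order-lookup case, the sketch below declares one tool and turns a function call from the model into a result payload. It runs locally with no network calls. The `lookup_order` tool, its stubbed data, and the helper names are hypothetical, and the field names (`name`, `call_id`, `arguments`, `function_call_output`) are assumptions about the event shape that should be checked against the current Realtime API docs.

```python
import json

# ASSUMPTION: hypothetical tool declared in a flat function-tool layout,
# with its parameters described as a JSON Schema object.
LOOKUP_ORDER_TOOL = {
    "type": "function",
    "name": "lookup_order",
    "description": "Look up the status of a customer order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}


def lookup_order(order_id: str) -> dict:
    """Hypothetical backend lookup, stubbed with fixed data."""
    return {"order_id": order_id, "status": "shipped"}


# Map tool names the model may request to local Python callables.
HANDLERS = {"lookup_order": lookup_order}


def handle_function_call(call: dict) -> dict:
    """Run the requested tool and build the result item to send back.

    `call` is assumed to carry `name`, `call_id`, and a JSON-encoded
    `arguments` string.
    """
    handler = HANDLERS.get(call["name"])
    if handler is None:
        result = {"error": f"unknown tool: {call['name']}"}
    else:
        result = handler(**json.loads(call["arguments"]))
    return {
        "type": "function_call_output",
        "call_id": call["call_id"],
        "output": json.dumps(result),
    }
```

Returning an error payload for an unknown tool name, rather than raising, lets the model recover by apologizing or retrying instead of leaving the caller in silence.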

Sources

OpenAI — GPT Realtime — accessed 2026-04-20
OpenAI — Realtime API docs — accessed 2026-04-20