GPT Realtime
GPT Realtime is OpenAI's generally available speech-to-speech model and the successor to GPT-4o Realtime. It takes microphone audio in and returns generated speech with roughly 300 ms of round-trip latency, supports interruptions and tool calls, and streams over WebRTC (WebSocket and SIP connections are also available). That combination makes it a common choice for voice agents, phone support, and real-time assistants.
Model specs
- Vendor: OpenAI
- Family: GPT Realtime
- Released: 2025-08
- Context window: 128,000 tokens
- Modalities: text, audio, code
- Input price: $32 per 1M tokens
- Output price: $64 per 1M tokens
- Pricing as of: 2026-04-20
Strengths
- Speech in, speech out — no ASR + LLM + TTS pipeline to stitch together
- Preserves tone, laughter, emotion, and non-lexical cues
- Interruptions and barge-in work naturally
- Function calling integrates with existing OpenAI tool schemas
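As a rough illustration of the barge-in point above: over a raw WebSocket connection, the client is usually the one that knows how much assistant audio was actually played, so it has to tell the server where the caller cut in. The sketch below only builds the event payload; the `PlaybackTracker` class is hypothetical, and the event names (`input_audio_buffer.speech_started`, `conversation.item.truncate`) and fields are assumptions taken from the Realtime API docs that should be checked against the current reference.

```python
import time
from typing import Optional


class PlaybackTracker:
    """Hypothetical helper: remembers which assistant audio item is playing."""

    def __init__(self) -> None:
        self.item_id: Optional[str] = None
        self.started_at: Optional[float] = None

    def start(self, item_id: str, now: Optional[float] = None) -> None:
        # Call this when the first audio chunk of an assistant item starts playing.
        self.item_id = item_id
        self.started_at = time.monotonic() if now is None else now

    def on_speech_started(self, now: Optional[float] = None) -> Optional[dict]:
        """Handle a server `input_audio_buffer.speech_started` event.

        Returns a `conversation.item.truncate` client event (assumed shape),
        or None if nothing was playing when the caller started talking.
        """
        if self.item_id is None or self.started_at is None:
            return None
        now = time.monotonic() if now is None else now
        played_ms = round((now - self.started_at) * 1000)
        event = {
            "type": "conversation.item.truncate",
            "item_id": self.item_id,
            "content_index": 0,
            "audio_end_ms": played_ms,  # how much audio the caller actually heard
        }
        # Playback should also be stopped locally at this point.
        self.item_id = None
        self.started_at = None
        return event
```

The injected `now` argument exists only to make the helper deterministic in tests; a real client would rely on its audio player's own position counter rather than wall-clock time.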
Limitations
- Audio tokens are expensive — cost per minute adds up fast
- Less configurable than ASR + TTS pipelines for brand-specific voices
- Quality of non-English voices varies; test before deployment
- Transcripts of the caller's speech are not produced by the speech model itself; enable a separate transcription model if you need them
Use cases
- Voice agents for customer support and outbound calling
- Real-time language tutoring and pronunciation coaches
- In-car and wearable assistants where latency matters
- Accessible interfaces — screen reader replacements, voice-first apps
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| Big Bench Audio | ≈82% | 2025-08 |
| MultiChallenge (audio) | ≈30% | 2025-08 |
| End-to-end voice latency | ≈300ms | 2025-08 |
Frequently asked questions
What is GPT Realtime?
GPT Realtime is OpenAI's speech-to-speech model designed for voice agents. It takes audio in and produces audio out directly, with roughly 300 ms of end-to-end latency, support for interruptions, and tool calling. It was released in August 2025 as the generally available successor to GPT-4o Realtime.
How is GPT Realtime different from GPT-4o audio?
GPT-4o offered audio capabilities via the original Realtime preview; GPT Realtime is the production-grade successor with lower latency, more reliable tool calls, and upgraded voices. For new voice agent projects, GPT Realtime is the recommended choice.
How much does GPT Realtime cost?
As of April 2026, GPT Realtime audio tokens are priced at roughly USD 32 per million input tokens and USD 64 per million output tokens. Per-minute cost depends on speech density but typically runs in the low tens of cents per minute of conversation.
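The per-token prices can be turned into a rough per-minute figure with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not a billing calculator: the audio token rates (about 600 tokens per minute of caller audio and 1,200 per minute of model audio) are assumptions to verify against OpenAI's documentation, and the function counts only freshly spoken audio. It ignores conversation history that is re-sent as input on later turns, cached-token discounts, and text tokens, all of which change the real bill.

```python
# Prices quoted above, in USD per million audio tokens.
INPUT_USD_PER_M = 32.0
OUTPUT_USD_PER_M = 64.0

# Assumed audio token rates; verify against OpenAI's current documentation.
INPUT_TOKENS_PER_MIN = 600    # assumption: ~1 token per 100 ms of caller audio
OUTPUT_TOKENS_PER_MIN = 1200  # assumption: ~1 token per 50 ms of model audio


def audio_cost_usd(caller_minutes: float, model_minutes: float) -> float:
    """Estimate the cost of newly spoken audio only (no history, no caching)."""
    input_tokens = caller_minutes * INPUT_TOKENS_PER_MIN
    output_tokens = model_minutes * OUTPUT_TOKENS_PER_MIN
    total_micro_usd = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total_micro_usd / 1_000_000


# Example: a 10-minute call split evenly between caller and agent.
estimate = audio_cost_usd(caller_minutes=5, model_minutes=5)
print(f"${estimate:.2f}")
```

Because every turn re-sends earlier audio as input, long conversations cost more per minute than this fresh-audio estimate suggests.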
Can GPT Realtime call tools or functions?
Yes. GPT Realtime supports OpenAI-style function calling inside a live conversation, letting a voice agent look up orders, control a device, or call an MCP server mid-sentence.
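A minimal sketch of that tool-calling loop, assuming a connection over which JSON events are exchanged. It only builds and processes payloads, with no network code. The `lookup_order` tool and its backend are hypothetical, and the event shapes (`session.update`, a `function_call` output item, `conversation.item.create` with `function_call_output`, `response.create`) are assumptions based on the Realtime API docs; field names may differ between API versions, so check the current reference before relying on them.

```python
import json

# Hypothetical tool definition for illustration only.
LOOKUP_ORDER_TOOL = {
    "type": "function",
    "name": "lookup_order",
    "description": "Look up the status of a customer order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}


def session_update_event(tools: list) -> dict:
    """Client event that registers tools on the live session (assumed shape)."""
    return {"type": "session.update", "session": {"tools": tools, "tool_choice": "auto"}}


def lookup_order(order_id: str) -> dict:
    # Stand-in for a real backend call.
    return {"order_id": order_id, "status": "shipped"}


HANDLERS = {"lookup_order": lookup_order}


def handle_function_call(item: dict) -> list:
    """Answer a completed `function_call` item with the events the client sends back."""
    args = json.loads(item["arguments"])  # arguments arrive as a JSON string
    result = HANDLERS[item["name"]](**args)
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": item["call_id"],  # ties the result to the model's call
                "output": json.dumps(result),
            },
        },
        {"type": "response.create"},  # ask the model to speak the result
    ]
```

In use, the client would send `session_update_event([LOOKUP_ORDER_TOOL])` once after connecting, then pass each `function_call` item it receives to `handle_function_call` and send the returned events in order.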
Sources
- OpenAI — GPT Realtime — accessed 2026-04-20
- OpenAI — Realtime API docs — accessed 2026-04-20