Cartesia Sonic
Sonic is Cartesia's flagship text-to-speech (TTS) model, released in May 2024 by a team of former Stanford researchers who authored the S4 and Mamba state-space-model papers. Unlike transformer-based TTS systems, Sonic is built on a state-space-model (SSM) architecture based on linear recurrences. It reaches a streaming time-to-first-audio of about 90 ms, the fastest among production TTS systems at release. It offers instant voice cloning from a few seconds of reference audio, more than a dozen base voices across 15+ languages, and an API designed for real-time voice agents.
Model specs
- Vendor: Cartesia
- Family: Sonic
- Released: 2024-05
- Context window: 4,096 tokens
- Modalities: text, audio
- Input price: n/a
- Output price: n/a
- Pricing as of: 2026-04-20
Strengths
- Class-leading streaming latency
- Instant voice cloning from short samples
- 15+ languages supported
- Competitive pricing vs transformer-based TTS
Limitations
- Closed weights; available through the API only
- Expressive range still narrower than ElevenLabs' top tier
- SSM tooling less mature than transformer TTS
- Voice-cloning consent checks limit some use cases
Use cases
- Real-time voice agents and IVR
- Live call automation and customer support
- Multilingual podcast and audiobook production
- Voice cloning for accessibility tools
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| Time-to-first-audio (streaming) | ≈90 ms | 2024-05 |
| MOS (English neutral voices) | ≈4.3 / 5 | 2024-05 |
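Time-to-first-audio is the delay between sending a synthesis request and receiving the first non-empty audio chunk. The sketch below shows one way such a figure could be measured against any streaming source. It is an illustration only: `time_to_first_audio` and `fake_tts_stream` are hypothetical names, and the fake stream stands in for a real client rather than reproducing Cartesia's API.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_audio(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, bytes]:
    """Return (latency in ms, first non-empty chunk) for a streaming source.

    The clock starts before iteration begins, so with a lazy generator the
    measurement includes whatever work happens before the first chunk.
    """
    start = clock()
    for chunk in stream:
        if chunk:  # skip empty keep-alive chunks
            return (clock() - start) * 1000.0, chunk
    raise ValueError("stream ended before any audio arrived")


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (assumption)."""
    time.sleep(delay_s)      # simulated network + synthesis delay
    yield b"\x00\x01" * 160  # first audio chunk
    yield b"\x00\x01" * 160  # later chunks do not affect the measurement


latency_ms, first_chunk = time_to_first_audio(fake_tts_stream())
print(f"time to first audio: {latency_ms:.1f} ms ({len(first_chunk)} bytes)")
```

With a real client, the measured value also depends on network distance and on whether the connection was already open, so a client-side number may not match a vendor-reported one.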
Frequently asked questions
What is Cartesia Sonic?
Sonic is Cartesia's text-to-speech model, built on state-space-model architectures to achieve ~90 ms streaming time-to-first-audio. It targets real-time voice agents.
What is a state-space model?
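A state-space model (SSM) is a sequence model, such as S4 or Mamba, that carries a fixed-size hidden state and updates it with a linear recurrence at each step. Compute scales linearly with sequence length and the cost per step is constant, which suits streaming speech.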
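As a rough illustration of the linear recurrence idea, the sketch below runs a toy diagonal state-space model over a short input. It is a generic textbook-style example, not Sonic's architecture: the function name `ssm_scan` and the parameter values are assumptions chosen for illustration, and real models such as S4 and Mamba add learned parameters, discretization, and many channels.

```python
from typing import List, Sequence


def ssm_scan(
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
    inputs: Sequence[float],
) -> List[float]:
    """Toy diagonal linear state-space recurrence.

    For each input u:
        state[i] = a[i] * state[i] + b[i] * u
        y        = sum(c[i] * state[i])
    """
    state = [0.0] * len(a)
    outputs = []
    for u in inputs:
        # One fixed-cost update per step: the state size never grows,
        # so total work is linear in the number of inputs.
        state = [a_i * s_i + b_i * u for a_i, s_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * s_i for c_i, s_i in zip(c, state)))
    return outputs


# Illustrative parameters (assumed): two state dimensions that decay
# at different rates after a single input impulse.
y = ssm_scan(a=[0.5, 0.25], b=[1.0, 1.0], c=[1.0, 1.0], inputs=[1.0, 0.0, 0.0])
print(y)  # [2.0, 0.75, 0.3125]
```

Because each output depends only on the current input and the carried state, the model can emit output as soon as each input step arrives, without revisiting earlier steps.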
Does Sonic support voice cloning?
Yes. Sonic offers instant voice cloning from a few seconds of reference audio, subject to consent checks.
Sources
- Cartesia — Sonic launch — accessed 2026-04-20
- Cartesia product site — accessed 2026-04-20