Cartesia Sonic

Sonic is Cartesia's flagship text-to-speech model, released in May 2024 by a team of ex-Stanford authors of the Mamba and S4 state-space-model (SSM) papers. Unlike transformer-based TTS systems, Sonic is built on linear-recurrence SSM architectures and reaches roughly 90 ms time-to-first-audio when streaming, the fastest among production TTS systems at release. It offers instant voice cloning from a few seconds of reference audio, more than a dozen base voices across 15+ languages, and an API designed for real-time voice agents.

Model specs

Vendor: Cartesia
Family: Sonic
Released: 2024-05
Context window: 4,096 tokens
Modalities: text, audio
Input price: n/a
Output price: n/a
Pricing as of: 2026-04-20

Strengths

Class-leading streaming latency
Instant voice cloning from short samples
15+ languages supported
Competitive pricing vs transformer-based TTS

Limitations

Closed weights — API only
Expressive range still narrower than ElevenLabs' top tier
SSM tooling less mature than transformer TTS
Voice-cloning consent checks limit some use cases

Use cases

Real-time voice agents and IVR
Live call automation and customer support
Multilingual podcast and audiobook production
Voice cloning for accessibility tools

Benchmarks

Benchmark	Score	As of
Time-to-first-audio (streaming)	≈90 ms	2024-05
MOS (English neutral voices)	≈4.3 / 5	2024-05
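
Time-to-first-audio is the delay between issuing a synthesis request and receiving the first audio chunk. The sketch below shows one way such a figure could be measured. It is illustrative only: `fake_tts_stream` is a hypothetical stand-in that sleeps and yields dummy bytes, not Cartesia's client, and the delays are made-up values rather than measured results.

```python
import time


def fake_tts_stream(n_chunks=3, first_delay=0.05, chunk_delay=0.01):
    """Hypothetical stand-in for a streaming TTS client; yields dummy audio bytes."""
    time.sleep(first_delay)  # simulated model + network latency before the first chunk
    for i in range(n_chunks):
        if i:
            time.sleep(chunk_delay)  # simulated gap between later chunks
        yield b"\x00" * 320


def time_to_first_audio(stream_factory):
    """Return seconds from issuing the request to receiving the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in stream_factory():
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


ttfa = time_to_first_audio(fake_tts_stream)
print(f"time-to-first-audio: {ttfa * 1000:.0f} ms")
```

To measure a real service, `stream_factory` would be replaced by a function that opens the streaming request and returns an iterator of audio chunks. The timer starts before the request is issued so that connection setup is included in the figure.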

Frequently asked questions

What is Cartesia Sonic?

Sonic is Cartesia's text-to-speech model, built on state-space-model architectures to achieve ~90 ms streaming time-to-first-audio. It targets real-time voice agents.

What is a state-space model?

A state-space model is a class of sequence model (S4 and Mamba are examples) based on linear recurrences. Compute scales linearly with sequence length and each step updates a fixed-size state, which suits streaming speech.
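
The linear recurrence can be sketched with a toy diagonal model. This is a minimal illustration under stated assumptions, not Sonic's architecture: the class name `DiagonalSSM`, the helper `impulse_response`, and the coefficients `a`, `b`, `c` are all made up for the example. Each input sample triggers one update of a fixed-size state, so total work grows linearly with sequence length and output can be emitted as input arrives.

```python
class DiagonalSSM:
    """Toy diagonal linear SSM: x_k = a * x_{k-1} + b * u_k, y_k = c . x_k."""

    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c
        self.x = [0.0] * len(a)  # fixed-size state, independent of sequence length

    def step(self, u_k):
        # One recurrence step: constant work per input sample.
        self.x = [a_i * x_i + b_i * u_k
                  for a_i, x_i, b_i in zip(self.a, self.x, self.b)]
        return sum(c_i * x_i for c_i, x_i in zip(self.c, self.x))


def impulse_response(a, b, c, length):
    """Closed-form kernel K_j = sum_i c_i * a_i**j * b_i, used as a cross-check."""
    return [sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
            for j in range(length)]


# Illustrative coefficients (assumptions, chosen so the state decays stably).
a, b, c = [0.9, 0.5], [1.0, 1.0], [0.5, -0.25]

ssm = DiagonalSSM(a, b, c)
streamed = [ssm.step(u_k) for u_k in [1.0, 0.0, 0.0, 0.0]]  # feed a unit impulse
kernel = impulse_response(a, b, c, 4)
print(streamed)
print(kernel)
```

Feeding a unit impulse through the step-by-step recurrence reproduces the closed-form kernel, which is the equivalence between the recurrent and convolutional views of a linear SSM. The streaming path only ever holds `len(a)` state values, however long the input runs.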

Does Sonic support voice cloning?

Yes. Sonic offers instant voice cloning from a few seconds of reference audio, subject to consent checks.

Sources

Cartesia — Sonic launch — accessed 2026-04-20
Cartesia product site — accessed 2026-04-20