
Cartesia Sonic

Sonic is Cartesia's flagship text-to-speech (TTS) model, released in May 2024 by a team of former Stanford researchers who authored the S4 and Mamba state-space-model (SSM) papers. Unlike transformer-based TTS, Sonic is built on linear-recurrence SSM architectures and reaches roughly 90 ms time-to-first-audio when streaming, which made it the fastest production TTS at release. It offers instant voice cloning from a few seconds of reference audio, more than a dozen base voices across 15+ languages, and an API designed for real-time voice agents.

Model specs

Vendor: Cartesia
Family: Sonic
Released: 2024-05
Context window: 4,096 tokens
Modalities: text, audio
Input price: n/a
Output price: n/a
Pricing as of: 2026-04-20

Strengths

  • Class-leading streaming latency
  • Instant voice cloning from short samples
  • 15+ languages supported
  • Competitive pricing vs transformer-based TTS

Limitations

  • Closed weights — API only
  • Expressive range still narrower than ElevenLabs' top tier
  • SSM tooling less mature than transformer TTS
  • Voice-cloning consent checks limit some use cases

Use cases

  • Real-time voice agents and IVR
  • Live call automation and customer support
  • Multilingual podcast and audiobook production
  • Voice cloning for accessibility tools

Benchmarks

Benchmark | Score | As of
Time-to-first-audio (streaming) | ≈90 ms | 2024-05
MOS (English neutral voices) | ≈4.3 / 5 | 2024-05
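
Time-to-first-audio is the delay between starting a request and receiving the first audio chunk. The following is a minimal sketch of how that could be measured for any chunked stream. It does not call Cartesia's API: `fake_tts_stream` is a hypothetical stand-in that only simulates delay, and in a real measurement the timer would start when the request is sent.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (not Cartesia's API)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)   # simulate synthesis and network delay per chunk
        yield b"\x00" * 320   # placeholder audio bytes


def time_to_first_audio(stream: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = time.perf_counter()
    first_chunk_s = None
    total_bytes = 0
    for chunk in stream:
        # Record the elapsed time only once, at the first chunk with data.
        if first_chunk_s is None and chunk:
            first_chunk_s = time.perf_counter() - start
        total_bytes += len(chunk)
    if first_chunk_s is None:
        raise ValueError("stream produced no audio")
    return first_chunk_s, total_bytes


ttfa_s, received = time_to_first_audio(fake_tts_stream())
print(f"time-to-first-audio: {ttfa_s * 1000:.1f} ms, {received} bytes")
```

Passing a real client's chunk iterator in place of `fake_tts_stream()` would report the latency observed from the caller's side, which includes network round-trip time and may therefore differ from a vendor-reported figure.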

Frequently asked questions

What is Cartesia Sonic?

Sonic is Cartesia's text-to-speech model, built on state-space-model architectures to achieve ~90 ms streaming time-to-first-audio. It targets real-time voice agents.

What is a state-space model?

A class of sequence models (for example S4 and Mamba) built on linear recurrences. Their compute scales linearly with sequence length and they carry a fixed-size state from step to step, which suits streaming speech.
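
To make the idea of a linear recurrence concrete, here is a toy sketch of a diagonal state-space model processed one sample at a time. It is an illustration only and not Sonic's actual architecture; the function name `ssm_scan` and the parameter values are assumptions made for this example.

```python
from typing import Iterable, Iterator, List


def ssm_scan(
    a: List[float],
    b: List[float],
    c: List[float],
    inputs: Iterable[float],
) -> Iterator[float]:
    """Toy diagonal state-space model, processed one sample at a time.

    State update:  h[i] = a[i] * h[i] + b[i] * x   (a linear recurrence)
    Readout:       y    = sum(c[i] * h[i])
    """
    state = [0.0] * len(a)  # fixed-size state, independent of sequence length
    for x in inputs:
        # One constant-cost update per input sample, so total work is O(T).
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        # Each output is available as soon as its input arrives (streaming).
        yield sum(c_i * h_i for c_i, h_i in zip(c, state))


# Hypothetical parameters chosen only for illustration.
A = [0.5, 0.9]
B = [1.0, 1.0]
C = [1.0, 0.5]

outputs = list(ssm_scan(A, B, C, [1.0, 0.0, 0.0]))
```

Because each step touches only the fixed-size state, memory use does not grow with the length of the input, whereas self-attention in a transformer has to look back over all previous positions.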

Does Sonic support voice cloning?

Yes. Sonic offers instant voice cloning from a few seconds of reference audio, subject to consent checks.

Sources

  1. Cartesia — Sonic launch — accessed 2026-04-20
  2. Cartesia product site — accessed 2026-04-20