TL;DR
- Tortoise is an open-source text-to-speech system released by James Betker (neonbjb) in 2022. The name acknowledges its defining trait: very high quality, very slow inference.
- It pioneered the 'autoregressive token LM + diffusion decoder + CLVP reranker' template that XTTS, Bark, and many other open TTS systems built on.
- Tortoise emphasises expressive English speech and supports voice cloning from short reference clips. Inference on a single GPU typically takes tens of seconds to minutes per utterance — fine for podcasts and audiobooks, unusable for real-time UX.
- Released under the Apache 2.0 licence; the original repository (neonbjb/tortoise-tts) remains the canonical source and is widely forked for research.
What Made Tortoise Interesting
When Tortoise was released in early 2022, open TTS was dominated by fast, robotic-sounding models (Tacotron 2, FastSpeech 2, VITS). Tortoise traded latency for naturalness and demonstrated that an autoregressive Transformer over discrete audio tokens, paired with a diffusion decoder, could approach commercial-quality expressive speech without proprietary data.
The architecture became a template for many of the open TTS systems that followed. XTTS, Bark, MetaVoice, and Parler-TTS all share Tortoise's separation of the (text → audio tokens) and (audio tokens → waveform) stages.
Architecture
Tortoise stacks five components:
- Autoregressive Transformer: predicts a sequence of discrete VQ-VAE audio tokens conditioned on the text and a small set of reference clips from the target speaker.
- Contrastive Language-Voice Pretrained model (CLVP): scores candidate token sequences against the text and is used to rerank multiple AR samples.
- Conditional latent diffusion model: takes the chosen token sequence and refines it into a high-quality mel spectrogram.
- UnivNet vocoder: converts the mel spectrogram to a 24 kHz waveform.
- Random latent generator: adds stochasticity that lets Tortoise produce multiple plausible deliveries of the same line.
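The way these stages fit together can be sketched in a few lines of Python. This is a minimal illustration of the sample-then-rerank-then-decode flow, not Tortoise's actual code: every name below (sample_and_rerank, synthesize, and the toy_* stand-ins for the AR model, CLVP, the diffusion decoder, and the vocoder) is hypothetical, and the toy functions only mimic the shape of the data passed between stages.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[int]
Mel = List[List[float]]


@dataclass
class Candidate:
    tokens: Tokens
    score: float


def sample_and_rerank(
    text: str,
    sample_tokens: Callable[[str, random.Random], Tokens],
    score: Callable[[str, Tokens], float],
    num_candidates: int,
    seed: int = 0,
) -> Candidate:
    """Draw several token sequences and keep the one the scorer prefers."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    rng = random.Random(seed)
    candidates = []
    for _ in range(num_candidates):
        tokens = sample_tokens(text, rng)  # stand-in for the AR Transformer
        candidates.append(Candidate(tokens, score(text, tokens)))  # stand-in for CLVP
    return max(candidates, key=lambda c: c.score)


def synthesize(text, sample_tokens, score, decode_to_mel, vocode,
               num_candidates: int = 4, seed: int = 0) -> List[float]:
    """Chain the stages: tokens -> rerank -> mel -> waveform."""
    best = sample_and_rerank(text, sample_tokens, score, num_candidates, seed)
    mel = decode_to_mel(best.tokens)  # stand-in for the diffusion decoder
    return vocode(mel)                # stand-in for the vocoder


# Toy stand-ins (hypothetical): they carry no acoustic meaning.
def toy_sampler(text: str, rng: random.Random) -> Tokens:
    length = len(text.split()) + rng.randint(0, 3)
    return [rng.randrange(256) for _ in range(length)]


def toy_scorer(text: str, tokens: Tokens) -> float:
    # Prefer sequences whose length is close to the word count.
    return -float(abs(len(tokens) - len(text.split())))


def toy_decoder(tokens: Tokens) -> Mel:
    return [[t / 255.0] * 4 for t in tokens]  # 4 fake "mel bins" per token


def toy_vocoder(mel: Mel) -> List[float]:
    return [value for frame in mel for value in frame]


wave = synthesize("hello there world", toy_sampler, toy_scorer,
                  toy_decoder, toy_vocoder, num_candidates=8, seed=1)
print(len(wave), "samples")
```

The point of the sketch is the control flow: the expensive sampler runs num_candidates times, the scorer picks one winner, and only that winner is passed on to the decoder and vocoder.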
Quality vs Latency
Tortoise's defining trade-off is quality at the expense of latency. Generating a single sentence can take anywhere from 10 seconds to several minutes on a consumer GPU, depending on the preset (ultra_fast, fast, standard, high_quality). Higher-quality presets sample more candidates from the AR Transformer for CLVP to rerank and run more diffusion steps, multiplying the cost.
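A rough way to see why presets multiply the cost is to model each one as a pair of knobs: how many AR candidates are sampled and how many diffusion steps are run. The sketch below is a back-of-the-envelope model only. The Preset class, the relative_cost function, and the unit costs are hypothetical, and the per-preset numbers are assumptions for illustration; check the repository's preset table before relying on them. The dictionary keys mirror the four presets named above, written as Python-friendly identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    ar_samples: int       # candidates drawn from the AR Transformer
    diffusion_steps: int  # denoising iterations in the decoder


# Assumed values for illustration; verify against the repository.
PRESETS = {
    "ultra_fast": Preset(ar_samples=16, diffusion_steps=30),
    "fast": Preset(ar_samples=96, diffusion_steps=80),
    "standard": Preset(ar_samples=256, diffusion_steps=200),
    "high_quality": Preset(ar_samples=256, diffusion_steps=400),
}


def relative_cost(preset: Preset, ar_unit: float = 1.0,
                  diffusion_unit: float = 1.0) -> float:
    """Linear cost model: each AR sample and each diffusion step costs a fixed unit.

    Real cost depends on batching, sequence length, and hardware, so treat
    the result as a ranking aid rather than a timing prediction.
    """
    return preset.ar_samples * ar_unit + preset.diffusion_steps * diffusion_unit


baseline = relative_cost(PRESETS["ultra_fast"])
for name, preset in PRESETS.items():
    ratio = relative_cost(preset) / baseline
    print(f"{name}: about {ratio:.1f}x the ultra_fast cost under this model")
```

Adjusting ar_unit and diffusion_unit to measurements from your own GPU turns this into a crude planner for choosing a preset.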
In return, Tortoise produces some of the most expressive and characterful speech available from any open model, with strong intonation and convincing emotional range. For asynchronous use cases such as audiobooks, voice-overs, podcasts, and read-aloud accessibility content, the trade-off is often acceptable.
If you need both naturalness and sub-second latency, Tortoise is the wrong choice. Use XTTS-v2 or a commercial API (ElevenLabs) for interactive UX and reserve Tortoise for batch synthesis where quality dominates.
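The batch-versus-interactive distinction can be made concrete with the real-time factor (RTF): synthesis time divided by the duration of the audio produced. The helper names and all numbers below are hypothetical, chosen only to match the "tens of seconds per utterance" range described above; measure your own setup before drawing conclusions.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF > 1 means synthesis is slower than playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def batch_compute_hours(audio_hours: float, rtf: float) -> float:
    """GPU hours needed to render a given amount of finished audio."""
    return audio_hours * rtf


def suits_interactive(rtf: float, time_to_first_audio_s: float,
                      budget_s: float = 1.0) -> bool:
    """Interactive use needs faster-than-playback synthesis and a short wait."""
    return rtf < 1.0 and time_to_first_audio_s <= budget_s


# Hypothetical measurement: 45 s of compute for a 6 s utterance.
rtf = real_time_factor(synthesis_seconds=45.0, audio_seconds=6.0)
print("RTF:", rtf)
print("GPU hours for a 10-hour audiobook:", batch_compute_hours(10.0, rtf))
print("Usable for interactive UX:", suits_interactive(rtf, time_to_first_audio_s=45.0))
```

Under these assumed numbers, a long batch job is feasible if you can wait for it, while the same engine fails a sub-second interactive budget by a wide margin.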
Use Today
Tortoise is rarely used in fresh production deployments in 2026 — successors like XTTS-v2 match or exceed its quality at a fraction of the latency. It remains influential as a research baseline and as a reference architecture for anyone building a new diffusion-based TTS system. The repository is still active and widely cited in TTS papers.
References
- Better speech synthesis through scaling (Tortoise tech report) · arXiv
- neonbjb/tortoise-tts · GitHub
- Tortoise model card · Hugging Face