TL;DR
- XTTS is an open-source text-to-speech model from Coqui that synthesises speech in 17 languages and clones a target voice from a short reference audio sample (typically 6-15 seconds).
- XTTS-v2, released late 2023, is the most widely deployed version and is available under the Coqui Public Model Licence (CPML) — non-commercial by default with separate commercial terms.
- The model combines a discrete audio tokeniser, a GPT-style autoregressive Transformer that predicts audio tokens conditioned on text and speaker embedding, and a HiFi-GAN-style decoder that reconstructs the waveform.
- Since Coqui, the company, wound down in early 2024, the community fork (idiap/coqui-ai-TTS) and many derivative projects have kept XTTS actively maintained.
What XTTS Does
XTTS is a 'zero-shot' TTS model: given (1) a piece of text and (2) a few seconds of reference audio from a target speaker, it produces speech in the target voice reading the given text. No per-speaker fine-tuning is required. The reference clip may contain different content from the target text, be in a different language, and include a moderate amount of noise, although a clean 6-15 second sample produces the best results.
It supports 17 languages — English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Mandarin, Japanese, Hungarian, Korean, and Hindi — and can clone an English voice into any of them (cross-lingual cloning).
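As a rough illustration of the zero-shot workflow described above, the sketch below wraps the `TTS.api.TTS` interface from the Coqui TTS Python package. The helper names (`check_request`, `clone_and_speak`) and the file paths are hypothetical, and the sketch assumes the package is installed and the XTTS-v2 checkpoint can be downloaded. The language codes are the ones the XTTS-v2 checkpoint is generally documented to accept; treat the table as an assumption to check against the model card.

```python
from pathlib import Path

# Language codes assumed to be accepted by the XTTS-v2 checkpoint.
XTTS_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Mandarin", "ja": "Japanese", "hu": "Hungarian",
    "ko": "Korean", "hi": "Hindi",
}

MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def check_request(text: str, language: str) -> None:
    """Reject empty text and unsupported language codes before loading the model."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if language not in XTTS_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")


def clone_and_speak(text: str, reference_wav: str, language: str,
                    out_path: str = "output.wav") -> str:
    """Synthesise `text` in the voice of `reference_wav` (hypothetical helper)."""
    check_request(text, language)
    if not Path(reference_wav).is_file():
        raise FileNotFoundError(reference_wav)

    # Imported lazily so the validation helpers work without the package installed.
    from TTS.api import TTS

    tts = TTS(MODEL_NAME)
    tts.tts_to_file(
        text=text,
        speaker_wav=str(reference_wav),  # the short reference clip
        language=language,               # target language, may differ from the clip's
        file_path=out_path,
    )
    return out_path
```

A call such as `clone_and_speak("Bonjour tout le monde.", "reference_en.wav", "fr")` would correspond to the cross-lingual case: an English reference clip driving French output. The reference file name here is a placeholder.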
Architecture
XTTS-v2 follows the 'autoregressive token LM + neural vocoder' template that has dominated open TTS since Tortoise:
- Speaker encoder — extracts a speaker embedding from the reference clip.
- Text encoder — character or phoneme embeddings with language conditioning.
- Autoregressive Transformer — predicts discrete audio tokens (from a VQ-VAE-like tokeniser) conditioned on text tokens and speaker embedding.
- Diffusion / GAN decoder — converts the audio token sequence back to a waveform at 24 kHz. XTTS-v2 ships a HiFi-GAN-based decoder; earlier versions used a diffusion decoder borrowed from Tortoise.
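The data flow through the four components above can be sketched with toy stand-ins. Nothing below is the real XTTS code: every function is a deterministic placeholder, and the constants (`STOP_TOKEN`, `CODEBOOK_SIZE`, `SAMPLES_PER_TOKEN`) are hypothetical values chosen only to keep the example small. The point is the shape of the loop: encode the speaker and the text once, generate audio tokens one at a time until a stop token, then decode the token sequence to samples.

```python
from typing import List

STOP_TOKEN = 0          # hypothetical end-of-audio token id
CODEBOOK_SIZE = 1024    # hypothetical tokeniser vocabulary size
SAMPLES_PER_TOKEN = 4   # toy upsampling factor, far smaller than a real decoder's


def speaker_encoder(reference_audio: List[float]) -> List[float]:
    """Toy stand-in: summarise the reference clip as a two-number 'embedding'."""
    mean = sum(reference_audio) / len(reference_audio)
    peak = max(abs(x) for x in reference_audio)
    return [mean, peak]


def text_encoder(text: str, language: str) -> List[int]:
    """Toy stand-in: character ids preceded by a language id."""
    lang_id = sum(ord(c) for c in language) % 256
    return [lang_id] + [ord(c) % 256 for c in text]


def next_audio_token(text_ids: List[int], speaker: List[float],
                     generated: List[int]) -> int:
    """Toy stand-in for the Transformer: deterministic, not learned."""
    step = len(generated)
    if step >= len(text_ids):  # stop once every text id has been consumed
        return STOP_TOKEN
    bias = int(speaker[1] * 100)
    return 1 + (text_ids[step] + bias + step) % (CODEBOOK_SIZE - 1)


def generate_tokens(text_ids: List[int], speaker: List[float],
                    max_tokens: int = 500) -> List[int]:
    """Autoregressive loop: each step sees the conditioning and all prior tokens."""
    generated: List[int] = []
    while len(generated) < max_tokens:
        token = next_audio_token(text_ids, speaker, generated)
        if token == STOP_TOKEN:
            break
        generated.append(token)
    return generated


def decoder(tokens: List[int]) -> List[float]:
    """Toy stand-in for the waveform decoder: map each token to a few samples."""
    waveform: List[float] = []
    for token in tokens:
        value = token / CODEBOOK_SIZE * 2.0 - 1.0  # scale into (-1, 1)
        waveform.extend([value] * SAMPLES_PER_TOKEN)
    return waveform


def synthesise(text: str, language: str, reference_audio: List[float]) -> List[float]:
    speaker = speaker_encoder(reference_audio)
    text_ids = text_encoder(text, language)
    tokens = generate_tokens(text_ids, speaker)
    return decoder(tokens)
```

The `max_tokens` cap is there because an autoregressive generator that never emits the stop token would otherwise loop indefinitely.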
Quality, Latency, and Limitations
XTTS-v2 produces convincingly human prosody on most prompts, especially when the reference clip is high quality. It is noticeably faster than Tortoise and Bark (typically real-time or faster on a single L4 or L40S GPU for sentence-length utterances), though still slower than lighter non-autoregressive systems such as StyleTTS 2 or proprietary commercial APIs.
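"Real-time or faster" is usually quantified as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, with values below 1.0 meaning faster than real time. A minimal sketch, assuming the 24 kHz output rate mentioned above and hypothetical function names:

```python
SAMPLE_RATE_HZ = 24_000  # XTTS-v2 output sample rate


def audio_duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration of a mono waveform with `num_samples` samples."""
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    duration = audio_duration_seconds(num_samples, sample_rate)
    if duration <= 0:
        raise ValueError("waveform must contain at least one sample")
    return synthesis_seconds / duration
```

For example, producing 72,000 samples (3 seconds of audio at 24 kHz) in 1.5 seconds of wall-clock time gives `real_time_factor(1.5, 72_000) == 0.5`.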
Common failure modes: occasional skipped or repeated words on long inputs, audible artefacts at sentence boundaries when the reference voice is very expressive, and degraded cross-lingual cloning when the source and target phoneme inventories diverge sharply (e.g. English → Mandarin tone realisation).
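One common way to reduce skipped or repeated words on long inputs is to synthesise the text in sentence-sized chunks and concatenate the audio. The source does not prescribe this, so the sketch below is an assumption rather than a documented XTTS procedure. The function name, the 200-character default, and the punctuation-based splitting rule are all hypothetical, and the regular expression only recognises `.`, `!` and `?`, so languages with other sentence punctuation would need a different rule.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut mid-word.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesised separately with the same reference clip, which keeps every autoregressive pass short at the cost of possible prosody discontinuities at the joins.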
Voice cloning models carry direct misuse risk — impersonation, fraud, and non-consensual synthesis. Yobibyte deployments enforce reference-clip consent attestation and watermarking on all XTTS endpoints by default.
Licensing
XTTS-v2 weights are released under the Coqui Public Model Licence — free for non-commercial use, research, and evaluation, with a separate commercial licence required for production deployment. Several community forks (notably idiap/coqui-ai-TTS) continue to ship the original checkpoint and bugfixes, but the licence on the weights themselves is unchanged.
Teams that need fully unencumbered open licences typically use XTTS for development and prototyping and switch to a permissively licensed alternative (e.g. Parler-TTS, Kokoro, or in-house fine-tuned models) for production.
References
- XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model · arXiv
- idiap/coqui-ai-TTS (community fork) · GitHub
- XTTS-v2 model card · Hugging Face