TL;DR
- wav2vec 2.0, introduced by Baevski et al. in the 2020 paper 'wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations' (arXiv:2006.11477), pre-trains a Transformer on unlabelled audio with a contrastive objective and fine-tunes on small labelled sets with CTC.
- With 53,000 hours of unlabelled pre-training audio and only 10 minutes of labelled audio, it reached 4.8 / 8.2 WER on LibriSpeech test-clean / test-other. With one hour of labels it outperformed the previous state of the art trained on the 100-hour labelled subset.
- The architecture pairs a convolutional feature encoder (raw waveform → 50 Hz latent features) with a Transformer context network, and uses quantised target representations for the contrastive loss.
- Released under the MIT licence as part of fairseq by Facebook AI (now Meta AI). The XLS-R and MMS extensions scale the recipe to 128 and over 1,400 languages respectively, and remain widely used for low-resource ASR, forced alignment, and feature extraction in 2026.
The Self-Supervised Idea
Labelled speech data is expensive: every hour of audio needs transcription, ideally by a trained annotator. Unlabelled speech is essentially unlimited — podcasts, audiobooks, broadcast archives. Self-supervised learning exploits this by defining a pretext task that requires only the audio itself, learning representations that downstream tasks (ASR, language ID, emotion recognition, speaker verification) can fine-tune cheaply.
wav2vec 2.0's pretext task is contrastive: for each masked time step, the model must pick out the correct quantised latent from a set of distractors sampled from other masked positions in the same utterance. Solving this task forces the Transformer to encode phonetic and prosodic structure without ever seeing a transcript.
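The contrastive objective for a single masked position can be sketched as an InfoNCE-style loss over cosine similarities. This is a minimal pure-Python illustration, not the fairseq implementation: the function names are hypothetical, the vectors are plain lists, and the temperature of 0.1 is the value reported in the paper (which also uses 100 distractors per position).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Loss for one masked time step (hypothetical helper).

    context     -- Transformer output at the masked position
    target      -- the true quantised latent for that position
    distractors -- quantised latents taken from other masked positions
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    # Negative log-probability of the true target (index 0).
    return log_denominator - logits[0]

context = [1.0, 0.0]
easy = contrastive_loss(context, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
hard = contrastive_loss(context, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(easy < hard)  # the loss is lower when the context matches the target
```

In training this loss is averaged over all masked positions and combined with the codebook diversity term.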
Architecture
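wav2vec 2.0 has three components. The feature encoder feeds both the quantisation module, which produces the training targets, and the context Transformer, which produces the predictions: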
- Feature encoder — a stack of 1D convolutions that maps raw waveform at 16 kHz down to a sequence of 512-dim latent vectors at ~50 Hz (20 ms per frame).
- Quantisation module — a product-quantiser with two codebooks of 320 entries each, producing discrete targets for the contrastive loss using Gumbel-Softmax during training.
- Context Transformer — 12 or 24 Transformer layers (Base / Large) that consume the latent sequence and produce contextualised representations. The contrastive loss is applied to spans masked at this Transformer's input.
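The ~50 Hz frame rate and the codebook size above follow from simple arithmetic. The sketch below assumes the seven-layer convolution stack reported in the paper (kernel widths 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2) with no padding; the function names are illustrative only.

```python
# (kernel_width, stride) for each 1D convolution, as reported in the paper.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples, layers=CONV_LAYERS):
    """Number of latent frames produced for a waveform of num_samples."""
    length = num_samples
    for kernel, stride in layers:
        if length < kernel:
            return 0  # input too short to produce any frame
        length = (length - kernel) // stride + 1
    return length

def total_stride(layers=CONV_LAYERS):
    """Input samples between the starts of consecutive frames."""
    stride_product = 1
    for _, stride in layers:
        stride_product *= stride
    return stride_product

def receptive_field(layers=CONV_LAYERS):
    """Input samples that influence a single output frame."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field

print(num_frames(16_000))   # 49 frames for one second of 16 kHz audio
print(total_stride())       # 320 samples, i.e. 20 ms per frame
print(receptive_field())    # 400 samples, i.e. 25 ms of audio per frame
print(320 ** 2)             # 102400 codewords from two codebooks of 320
```

One second of audio therefore yields 49 frames rather than exactly 50, which is why the rate is usually quoted as approximate.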
| Model | Parameters | Transformer layers | Pre-training data |
|---|---|---|---|
| Base | 95M | 12 | LibriSpeech 960h |
| Large | 317M | 24 | LibriSpeech 960h / LibriLight 53k h |
| XLS-R 300M | 300M | 24 | 436k h, 128 languages |
| XLS-R 1B / 2B | 1B / 2B | 48 | 436k h, 128 languages |
| MMS-1B | ~1B | 48 | ~500k h, 1,406 languages |
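The Base and Large parameter counts in the table can be roughly sanity-checked from the Transformer dimensions. The sketch below assumes the widths reported in the paper (model dimension 768 with feed-forward dimension 3072 for Base, 1024 and 4096 for Large) and counts only the attention and feed-forward weight matrices, ignoring biases, layer norms, the feature encoder, and the quantiser.

```python
def transformer_params(layers, d_model, d_ffn):
    """Approximate weight count of a Transformer encoder stack."""
    attention = 4 * d_model * d_model    # query, key, value, output projections
    feed_forward = 2 * d_model * d_ffn   # the two linear maps of the FFN
    return layers * (attention + feed_forward)

base = transformer_params(layers=12, d_model=768, d_ffn=3072)
large = transformer_params(layers=24, d_model=1024, d_ffn=4096)

print(f"Base  ~{base / 1e6:.0f}M")   # Base  ~85M
print(f"Large ~{large / 1e6:.0f}M")  # Large ~302M
```

The remaining gap to the published 95M and 317M comes from the components this estimate leaves out.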
Pre-training and Fine-tuning
Pre-training samples about 6.5% of the latent frames as span starts and masks the 10 frames beginning at each start. Because spans overlap, roughly half of all frames end up masked. The model optimises a contrastive loss plus a diversity regulariser that encourages even codebook usage. The result is a Transformer that has never seen a transcript but whose representations can be mapped to characters or phonemes with very little labelled data.
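The masking step can be sketched in a few lines. This is an illustrative version, not the fairseq code: the function name is hypothetical, and the start probability of 0.065 and span length of 10 are the values reported in the paper.

```python
import random

def sample_span_mask(num_frames, start_prob=0.065, span_length=10, rng=None):
    """Return a list of booleans marking which latent frames are masked."""
    rng = rng or random.Random()
    # Choose span starts without replacement.
    num_starts = round(start_prob * num_frames)
    starts = rng.sample(range(num_frames), num_starts)
    mask = [False] * num_frames
    for start in starts:
        # Spans may overlap; spans running past the end are truncated.
        for t in range(start, min(start + span_length, num_frames)):
            mask[t] = True
    return mask

mask = sample_span_mask(10_000, rng=random.Random(0))
print(f"masked fraction: {sum(mask) / len(mask):.2f}")
```

With these settings the masked fraction should land close to 1 - 0.935^10, which is about 0.49, because overlapping spans keep the coverage well below 0.065 × 10 = 0.65.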
Fine-tuning adds a linear projection from the Transformer output to a character or phoneme vocabulary and trains with Connectionist Temporal Classification (CTC) loss. With 100 hours of labelled LibriSpeech, fine-tuned Large reaches ~1.9 / 3.9 WER on test-clean / test-other; with 10 minutes of labels and an external language model, it still achieves single-digit WER on test-clean.
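At inference time, the simplest way to read text off the CTC head is greedy decoding: take the best label per frame, collapse consecutive repeats, then drop the blank symbol. The sketch below is a minimal version with a toy vocabulary; the function name, the vocabulary, and the choice of index 0 for the blank are assumptions for illustration.

```python
from itertools import groupby

def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Greedy CTC decoding of per-frame scores into a string."""
    # 1. Best label for each frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__)
                for frame in frame_scores]
    # 2. Collapse runs of the same label into one.
    collapsed = [label for label, _ in groupby(best_ids)]
    # 3. Remove blanks, which only separate repeated characters.
    return "".join(vocab[i] for i in collapsed if i != blank_id)

vocab = ["_", "c", "a", "t"]          # "_" stands for the CTC blank
path = [1, 1, 0, 2, 2, 0, 3, 3]       # best label per frame
frames = [[1.0 if j == i else 0.0 for j in range(len(vocab))] for i in path]
print(ctc_greedy_decode(frames, vocab))  # cat
```

Decoding with an external language model, as in the low-label results above, replaces step 1 with a beam search over the same per-frame scores.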
Multilingual Scaling: XLS-R and MMS
XLS-R (Babu et al., 2021) scaled the recipe to 436,000 hours across 128 languages, producing 300M, 1B, and 2B parameter checkpoints that remain strong defaults for cross-lingual ASR and language ID transfer learning.
MMS (Massively Multilingual Speech, Pratap et al., 2023) extended pre-training to over 1,400 languages and released fine-tuned ASR and language-identification models alongside separately trained TTS models. For many lower-resource languages MMS remains the only open ASR model available in 2026, even if Whisper has overtaken it on the high-resource subset.
Beyond ASR
wav2vec 2.0 representations are used widely beyond direct transcription:
- Forced alignment — WhisperX and many subtitle tools use a wav2vec 2.0 CTC head to align words to precise timestamps.
- Speaker embedding and verification — fine-tuning the encoder for x-vector-style speaker representations.
- Emotion and intent classification — light heads on the frozen encoder.
- Pseudo-labelling for low-resource ASR — generating noisy transcripts to seed semi-supervised training of larger models.
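For the forced-alignment use case, the last step is turning a frame-level label path into timestamps. The sketch below covers only that step and assumes the path has already been produced (a real aligner obtains it with a Viterbi pass over the CTC emissions constrained to the known transcript). The function name is hypothetical, and the 20 ms frame duration is the encoder stride discussed earlier.

```python
from itertools import groupby

FRAME_MS = 20  # approximate duration of one encoder frame in milliseconds

def frames_to_segments(frame_tokens, blank="_", frame_ms=FRAME_MS):
    """Merge a per-frame token path into (token, start_ms, end_ms) segments."""
    segments = []
    index = 0
    for token, run in groupby(frame_tokens):
        length = len(list(run))
        if token != blank:
            segments.append((token, index * frame_ms, (index + length) * frame_ms))
        index += length
    return segments

path = ["_", "h", "h", "_", "i", "i", "i"]
print(frames_to_segments(path))
# [('h', 20, 60), ('i', 80, 140)]
```

Word-level timestamps follow by grouping character segments between word separators and taking the first start and last end.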
Whisper and wav2vec 2.0 follow opposite design philosophies: Whisper trains on a very large, weakly supervised corpus, while wav2vec 2.0 combines self-supervised pre-training with a small amount of clean supervision. Both are useful, and the choice depends on data availability for your target language and domain.