wav2vec 2.0

TL;DR

wav2vec 2.0, introduced by Baevski et al. in the 2020 paper 'wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations' (arXiv:2006.11477), pre-trains a Transformer on unlabelled audio with a contrastive objective and fine-tunes on small labelled sets with CTC.
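It showed that, after pre-training on 53,000 hours of unlabelled audio, fine-tuning on just 10 minutes of labelled audio still reaches 4.8 / 8.2 WER on LibriSpeech test-clean / test-other, and that one hour of labels is enough to outperform the previous state of the art trained on the 100-hour labelled subset.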
The architecture pairs a convolutional feature encoder (raw waveform → 50 Hz latent features) with a Transformer context network, and uses quantised target representations for the contrastive loss.
It was released by Meta AI under the MIT licence. The XLS-R and MMS extensions scale the recipe to 128 and more than 1,400 languages respectively, and remain widely used for low-resource ASR, forced alignment, and feature extraction in 2026.

The Self-Supervised Idea

Labelled speech data is expensive: every hour of audio needs transcription, ideally by a trained annotator. Unlabelled speech is essentially unlimited — podcasts, audiobooks, broadcast archives. Self-supervised learning exploits this by defining a pretext task that requires only the audio itself, learning representations that downstream tasks (ASR, language ID, emotion recognition, speaker verification) can fine-tune cheaply.

wav2vec 2.0's pretext task is contrastive: for each masked time step, the model must pick the true quantised latent out of a set of distractors drawn from other masked positions in the same utterance. Solving this task forces the Transformer to encode phonetic and prosodic structure without ever seeing a transcript.
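The contrastive objective for a single masked position can be sketched as a softmax over cosine similarities. This is a minimal illustration, not the fairseq implementation: the function names are hypothetical, vectors are plain Python lists assumed to be non-zero, and the temperature of 0.1 is an assumed default.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true target among target + distractors.

    context     -- Transformer output at one masked position
    target      -- quantised latent for that same position
    distractors -- quantised latents from other masked positions
    temperature -- assumed value; scales the similarities before the softmax
    """
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]
```

The loss is small when the context vector points towards the true target and large when it points towards a distractor, which is what pushes the Transformer to infer the masked content from its surroundings.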

Architecture

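wav2vec 2.0 has three components. The feature encoder feeds both the quantisation module, which produces the training targets, and the context Transformer: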

Feature encoder — a stack of 1D convolutions that maps raw waveform at 16 kHz down to a sequence of 512-dim latent vectors at ~50 Hz (20 ms per frame).
Quantisation module — a product-quantiser with two codebooks of 320 entries each, producing discrete targets for the contrastive loss using Gumbel-Softmax during training.
Context Transformer — 12 or 24 Transformer layers (Base / Large) that consume the latent sequence and produce contextualised representations. The contrastive loss is applied to spans masked at this Transformer's input.
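The ~50 Hz frame rate and the size of the discrete target space follow from simple arithmetic. The sketch below assumes the seven-layer kernel/stride configuration reported in the original paper (it is not listed in this entry) and unpadded convolutions; the function names are illustrative.

```python
# (kernel_width, stride) per convolutional layer.
# Assumption: configuration as reported in the wav2vec 2.0 paper.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

SAMPLE_RATE = 16_000  # Hz, as stated above

def num_frames(num_samples):
    """Number of latent frames produced for a waveform of num_samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1  # unpadded 1D convolution
    return length

def total_stride():
    """Hop between consecutive frames, in input samples."""
    hop = 1
    for _, stride in CONV_LAYERS:
        hop *= stride
    return hop

def receptive_field():
    """Number of input samples that influence a single frame."""
    field, hop = 1, 1
    for kernel, stride in CONV_LAYERS:
        field += (kernel - 1) * hop
        hop *= stride
    return field

# Product quantiser: one entry chosen from each of 2 codebooks of 320 entries.
NUM_CODEWORDS = 320 ** 2

print(total_stride() / SAMPLE_RATE * 1000)     # hop in ms
print(receptive_field() / SAMPLE_RATE * 1000)  # receptive field in ms
print(num_frames(SAMPLE_RATE))                 # frames for one second of audio
print(NUM_CODEWORDS)                           # distinct quantised targets
```

Under these assumptions the hop is 320 samples (20 ms), each frame sees 400 samples (25 ms), and one second of audio yields 49 frames, which is why the rate is usually quoted as roughly 50 Hz.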

Model	Parameters	Transformer layers	Pre-training data
Base	95M	12	LibriSpeech 960h
Large	317M	24	LibriSpeech 960h / LibriLight 53k h
XLS-R 300M	300M	24	436k h, 128 languages
XLS-R 1B / 2B	1B / 2B	48	436k h, 128 languages
MMS-1B	~1B	48	~500k h, 1,406 languages

Pre-training and Fine-tuning

Pre-training masks roughly half of the latent frames in spans of 10 frames each and optimises a contrastive loss together with a diversity regulariser that encourages even codebook usage. The result is a Transformer that has never seen a transcript but whose representations can be mapped to characters or phonemes by a single linear layer after fine-tuning.
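A span-masking scheme of this kind can be sketched as follows. The start probability of 0.065 is an assumption taken from the paper's description rather than from this entry, and `sample_span_mask` is a hypothetical helper name; spans are allowed to overlap, which is why the masked fraction ends up near one half rather than 0.065 × 10.

```python
import random

def sample_span_mask(num_frames, start_prob=0.065, span_length=10, seed=0):
    """Return a list of booleans marking which latent frames are masked.

    A fraction start_prob of frames is sampled without replacement as span
    starts; each start masks the next span_length frames (clipped at the end).
    """
    rng = random.Random(seed)
    num_starts = round(start_prob * num_frames)
    starts = rng.sample(range(num_frames), num_starts)
    mask = [False] * num_frames
    for start in starts:
        for t in range(start, min(start + span_length, num_frames)):
            mask[t] = True  # overlapping spans simply re-mask the same frame
    return mask

mask = sample_span_mask(20_000)
print(sum(mask) / len(mask))  # fraction of frames masked
```

With these settings the expected masked fraction is about 1 - (1 - 0.065)^10, or roughly 49%, matching the "roughly half" figure.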

Fine-tuning adds a linear projection from the Transformer output to a character or phoneme vocabulary and trains with Connectionist Temporal Classification (CTC) loss. With 100 hours of labelled LibriSpeech, fine-tuned Large reaches ~1.9 / 3.9 WER on test-clean / test-other; with 10 minutes of labels and an external language model, it still achieves single-digit WER on test-clean.
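At inference time, the simplest way to read a transcript off a CTC-trained model is greedy decoding: take the best label per frame, collapse consecutive repeats, then drop blanks. The sketch below assumes per-frame scores as nested lists and a blank at index 0; the vocabulary and function name are illustrative, and no language model is involved.

```python
def ctc_greedy_decode(frame_scores, vocab, blank=0):
    """Greedy CTC decoding.

    frame_scores -- list of per-frame score lists, one score per vocab entry
    vocab        -- list of output symbols; vocab[blank] is the CTC blank
    """
    # Best label index for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    decoded = []
    previous = None
    for index in best:
        # Emit only when the label changes and is not the blank symbol.
        if index != previous and index != blank:
            decoded.append(vocab[index])
        previous = index
    return "".join(decoded)
```

A blank between two identical labels is what lets CTC output genuine double letters: frames labelled `l, _, l` decode to "ll", while `l, l` decode to a single "l".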

Multilingual Scaling: XLS-R and MMS

XLS-R (Babu et al., 2021) scaled the recipe to 436,000 hours across 128 languages, producing 300M, 1B, and 2B parameter checkpoints that remain strong defaults for cross-lingual ASR and language ID transfer learning.

MMS (Massively Multilingual Speech, Pratap et al., 2023) extended pre-training to over 1,400 languages and released fine-tuned ASR and language identification models, plus separate TTS models. For many lower-resource languages MMS remains the only open ASR model available in 2026, even though Whisper has overtaken it on the high-resource subset.

Beyond ASR

wav2vec 2.0 representations are used widely beyond direct transcription:

Forced alignment — WhisperX and many subtitle tools use a wav2vec 2.0 CTC head to align words to precise timestamps.
Speaker embedding and verification — fine-tuning the encoder for x-vector-style speaker representations.
Emotion and intent classification — light heads on the frozen encoder.
Pseudo-labelling for low-resource ASR — generating noisy transcripts to seed semi-supervised training of larger models.
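Forced alignment with a CTC head can be sketched as a Viterbi search over the known transcript interleaved with blanks. This is a simplified illustration, not the WhisperX implementation: it assumes per-frame log-probabilities as nested lists, a blank at index 0, 20 ms per frame, and enough frames to fit the transcript. All names are hypothetical.

```python
import math

FRAME_SECONDS = 0.02  # 20 ms per encoder frame

def ctc_forced_align(log_probs, tokens, blank=0):
    """Return (token, start_seconds, end_seconds) for each transcript token."""
    # Extended state sequence: blank, tok1, blank, tok2, ..., blank.
    states = [blank]
    for tok in tokens:
        states += [tok, blank]
    num_frames, num_states = len(log_probs), len(states)
    score = [[-math.inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = log_probs[0][states[0]]
    if num_states > 1:
        score[0][1] = log_probs[0][states[1]]
    for t in range(1, num_frames):
        for s in range(num_states):
            candidates = [s]                      # stay in the same state
            if s >= 1:
                candidates.append(s - 1)          # advance by one state
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                candidates.append(s - 2)          # skip the blank in between
            prev = max(candidates, key=lambda p: score[t - 1][p])
            back[t][s] = prev
            score[t][s] = score[t - 1][prev] + log_probs[t][states[s]]
    # A valid path ends in the final blank or the final token.
    s = num_states - 1
    if num_states > 1 and score[-1][num_states - 2] > score[-1][s]:
        s = num_states - 2
    path = [s]
    for t in range(num_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()
    spans = []
    for i, tok in enumerate(tokens):
        frames = [t for t, st in enumerate(path) if st == 2 * i + 1]
        spans.append((tok, frames[0] * FRAME_SECONDS,
                      (frames[-1] + 1) * FRAME_SECONDS))
    return spans
```

Character-level spans like these are then merged into word-level timestamps; the 20 ms frame hop sets the resolution of the result.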

Whisper and wav2vec 2.0 follow opposite design philosophies: large-scale weak supervision versus self-supervised pre-training followed by fine-tuning on a small, clean labelled set. Both are useful, and the choice depends on data availability for your target language and domain.
