Byte-Pair Encoding (BPE)

TL;DR

BPE for NLP (Sennrich et al., 2015, arXiv:1508.07909) iteratively merges the most frequent pair of adjacent symbols in a training corpus until a target vocabulary size is reached.
The algorithm began as a data-compression technique (Gage, 1994); Sennrich et al. adapted it to split words into subword units so neural translation models could handle rare words.
Byte-level BPE (GPT-2, 2019) starts from raw bytes instead of Unicode characters, so any input can be encoded without unknown tokens; it is the standard for modern English/code LLMs.
Tokenisers built on BPE underlie GPT, Llama, Mistral, Qwen and DeepSeek; tiktoken, HuggingFace tokenizers and SentencePiece are the dominant implementations.

How BPE Works

Start with a vocabulary that contains every individual character (or byte) in the training corpus. Count adjacent symbol pairs across the corpus, weighting each pair by how often its word occurs; pairs are usually counted within words, not across word boundaries. Merge the most frequent pair into a new symbol, add it to the vocabulary, and re-count. Repeat until the vocabulary reaches the desired size, typically 32,000 to 256,000 tokens.

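At inference, the recorded merge rules are applied to incoming text in the order they were learned. Because the rule list is fixed, encoding is deterministic; because every token is a concatenation of base symbols, decoding recovers the original text exactly.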

python

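# Conceptual outline; production tokenisers use optimised Rust or C++.
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn BPE merges from `corpus`, a dict mapping each word to its count."""
    vocab = {ch for word in corpus for ch in word}
    splits = {word: list(word) for word in corpus}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, count in corpus.items():
            symbols = splits[word]
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += count
        if not pair_counts:
            break
        # On ties, max() keeps the pair that was counted first.
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        merged = best_pair[0] + best_pair[1]
        vocab.add(merged)
        # Rewrite splits to apply this merge.
        for word, symbols in splits.items():
            out, i = [], 0
            while i < len(symbols):
                if tuple(symbols[i:i + 2]) == best_pair:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            splits[word] = out
    return vocab, merges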
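
The training outline above returns an ordered merge list. As a minimal sketch of how such a list might be applied at inference, the function below replays the merges in the order they were learned. The name `encode_word` and the example merge list are hypothetical, chosen only for illustration; production tokenisers reach the same kind of result with rank lookups and faster data structures.

```python
def encode_word(word, merges):
    """Segment `word` by replaying learned merges in training order."""
    symbols = list(word)
    for left, right in merges:
        merged, out, i = left + right, [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge list, in the order a trainer might have learned it.
example_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
tokens = encode_word("lower", example_merges)
print(tokens)                      # ['low', 'er']
print("".join(tokens) == "lower")  # True: concatenation restores the input
```

A word that matches no merge, such as "xyz" here, simply stays as individual base symbols, which is why nothing becomes unknown as long as the base alphabet covers the input.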

Why Subwords Work

Word-level tokenisation has a closed-vocabulary problem: any word not seen in training becomes an unknown token. Character-level tokenisation has no vocabulary problem but produces very long sequences and poor positional generalisation. BPE sits between: common words become single tokens, rare words decompose into sub-units that often, though not always, align with morphemes (un- + break- + -able), and out-of-vocabulary words remain expressible.

The result is a fixed-size vocabulary (a hard requirement for the unembedding matrix) that covers any input — provided the base symbols cover all characters/bytes.

Byte-Level BPE

Standard BPE starts from Unicode characters, which means rare scripts, emoji and binary input may still produce unknowns. GPT-2's byte-level BPE starts from the 256 raw bytes, making the base alphabet trivially complete for any byte sequence. Every input is exactly representable; nothing is unknown.

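Llama 3, Qwen 2, DeepSeek-V3 and the GPT family use byte-level BPE. Llama 2 and the original Mistral models instead use SentencePiece BPE with byte fallback, which reaches the same coverage by emitting byte tokens for characters outside the vocabulary. The trade-off of a byte base is that a single multibyte character (an emoji, or a CJK character with no learned merge) may decompose into several tokens, which can inflate Chinese and Japanese sequence lengths relative to a character-based tokeniser such as SentencePiece's BPE mode.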
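
To make the byte-level base alphabet concrete, here is a small sketch using only Python's built-in UTF-8 encoding. It illustrates the base symbols alone, before any merges are applied, and `to_base_symbols` is a hypothetical helper name, not part of any tokeniser library.

```python
def to_base_symbols(text):
    """Map text to byte-level base symbols: integers in range(256)."""
    return list(text.encode("utf-8"))


# Escapes are used for the accented letter and the emoji to avoid ambiguity.
samples = ["hello", "na\u00efve", "日本語", "\U0001F642"]
for s in samples:
    ids = to_base_symbols(s)
    # Every symbol falls inside the 256-entry base alphabet ...
    assert all(0 <= b < 256 for b in ids)
    # ... and the byte sequence decodes back to the original text.
    assert bytes(ids).decode("utf-8") == s
    print(repr(s), "characters:", len(s), "bytes:", len(ids))
```

The ASCII word needs one byte per character, the accented letter two bytes, each CJK character three, and the emoji four. Unless learned merges collapse those bytes back into single tokens, non-Latin text therefore starts from a longer base sequence.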

Vocabulary Sizes

Vocabulary size trades off two things: a larger vocab means fewer tokens per text (cheaper training and inference) but a larger embedding/unembedding matrix (more parameters and memory). For Llama 3's jump from Llama 2's 32k vocabulary to 128k, Meta reports up to 15 per cent fewer tokens for the same text.
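
The parameter side of this trade-off is simple arithmetic: each of the embedding and unembedding matrices has vocab_size × d_model entries. The sketch below assumes a hidden width of 4,096 purely for illustration; the width and the helper name `embedding_params` are assumptions, not figures from any model card.

```python
def embedding_params(vocab_size, d_model, tied=False):
    """Parameters in the embedding matrix plus, if untied, the unembedding matrix."""
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix


D_MODEL = 4096  # assumed hidden width, for illustration only
for label, vocab in [("32,000 vocab", 32_000), ("128,256 vocab", 128_256)]:
    print(label, f"{embedding_params(vocab, D_MODEL):,}")
# 32,000 vocab 262,144,000
# 128,256 vocab 1,050,673,152
```

Under these assumptions the larger vocabulary adds roughly 0.79 billion parameters with untied matrices, which is the price paid for the shorter token sequences.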

Model	Vocab size	Notes
GPT-2	50,257	Byte-level BPE
GPT-3	50,257	Same tokeniser as GPT-2 (r50k_base)
GPT-3.5-turbo, GPT-4	100,277 (cl100k_base)	Byte-level BPE
Llama 2	32,000	SentencePiece BPE
Llama 3	128,256	Tiktoken-style byte BPE
Qwen 2	151,936	Byte-level BPE
DeepSeek-V3	129,280	Byte-level BPE

Implementations

Three implementations dominate: OpenAI's tiktoken (Rust core, byte-level, used for GPT-3.5/4/4o), HuggingFace tokenizers (Rust, supports BPE and other algorithms), and Google's SentencePiece (C++, supports BPE and Unigram). They do not necessarily produce identical encodings from the same training corpus: pre-tokenisation rules, text normalisation and the choice of byte or character base symbols all differ, alongside speed and library ergonomics.

Where BPE Falls Short

Greedy merging produces tokenisations that are deterministic but not necessarily optimal. BPE-dropout (Provilkov et al., 2020) randomly skips merges during training as a regulariser. The Unigram tokeniser (Kudo, 2018; used by T5 and ALBERT via SentencePiece) is a probabilistic alternative that scores candidate segmentations with a unigram language model.
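
The idea behind BPE-dropout can be sketched in a few lines. This is a simplified variant, assuming each applicable merge is skipped independently with probability `p` while the merge list is replayed; the published algorithm differs in detail, and the function name and merge list here are hypothetical.

```python
import random


def encode_with_dropout(word, merges, p, rng):
    """Replay merges in order, skipping each applicable merge with probability p."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            is_match = (
                i + 1 < len(symbols)
                and symbols[i] == left
                and symbols[i + 1] == right
            )
            # Apply the merge only if the random draw does not drop it.
            if is_match and rng.random() >= p:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # hypothetical merge list
rng = random.Random(0)
print(encode_with_dropout("lower", merges, p=0.0, rng=rng))  # ['low', 'er']
print(encode_with_dropout("lower", merges, p=1.0, rng=rng))  # ['l', 'o', 'w', 'e', 'r']
for _ in range(3):
    # Intermediate p gives segmentations that vary from call to call.
    print(encode_with_dropout("lower", merges, p=0.3, rng=rng))
```

With `p=0` the sketch reduces to ordinary deterministic encoding, and with `p=1` it falls back to base symbols. Values in between expose the model to several segmentations of the same word during training.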

Recent research on tokeniser-free models (ByT5, Charformer, MambaByte) shows that operating on raw bytes can match BPE on quality at the cost of longer sequences. As of 2026, BPE remains dominant — efficiency wins at scale.

GPT-4's tokeniser, cl100k_base, allocates many of its tokens to English and programming-language patterns, which is why it is roughly 1.6× more efficient than GPT-3's on code and English but only slightly better on other languages.
