TL;DR
- BPE for NLP (Sennrich et al., 2015, arXiv:1508.07909) iteratively merges the most frequent pair of adjacent symbols in a training corpus until a target vocabulary size is reached.
- BPE began as a data-compression algorithm (Gage, 1994); Sennrich et al. adapted it to split words into subword units so neural translation models could handle rare words.
- Byte-level BPE (GPT-2, 2019) starts from raw bytes instead of Unicode characters, so any input can be encoded without unknown tokens. It is the standard for modern English/code LLMs.
- Tokenisers built on BPE underlie GPT, Llama, Mistral, Qwen and DeepSeek; tiktoken, HuggingFace tokenizers and SentencePiece are the dominant implementations.
How BPE Works
Start with a vocabulary that contains every individual character (or byte) in the training corpus. Count adjacent symbol pairs across the corpus. Merge the most frequent pair into a new symbol, add it to the vocabulary, and re-count. Repeat until the vocabulary reaches the desired size — typically 32,000 to 256,000 tokens.
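Each merge is recorded as an ordered rule. At inference, the rules are applied to incoming text in the order they were learned, so encoding is deterministic. Decoding simply concatenates the token strings, so it recovers the input exactly (assuming no lossy normalisation is applied beforehand).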
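The encoding step described above can be sketched as follows. This is a minimal illustration, not how any production library implements it: `encode_word` and `decode` are hypothetical helper names, and the merge list is hand-written rather than learned from a corpus.

```python
def encode_word(word, merges):
    """Apply learned merges to one word, in the order they were learned."""
    # rank[pair] = position in the merge list; lower rank = learned earlier.
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = set(zip(symbols, symbols[1:]))
        # Pick the adjacent pair whose merge rule was learned earliest.
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no remaining pair has a merge rule
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(best[0] + best[1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def decode(tokens):
    """Decoding is plain concatenation, which is why encoding is reversible."""
    return "".join(tokens)


# Hypothetical merge list, in the order a trainer might have learned it.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
tokens = encode_word("lower", merges)
print(tokens)          # ['low', 'er']
print(decode(tokens))  # lower
```

A word with no applicable merges simply stays as individual characters, which is what keeps unseen words expressible.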
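    # Conceptual outline; production tokenisers use optimised Rust or C++.
    from collections import Counter

    def train_bpe(corpus, num_merges):
        """corpus maps each word to its frequency, e.g. {"low": 5, "lower": 2}."""
        vocab = {ch for word in corpus for ch in word}
        splits = {word: list(word) for word in corpus}
        merges = []
        for _ in range(num_merges):
            # Count adjacent symbol pairs, weighted by word frequency.
            pair_counts = Counter()
            for word, count in corpus.items():
                symbols = splits[word]
                for a, b in zip(symbols, symbols[1:]):
                    pair_counts[(a, b)] += count
            if not pair_counts:
                break
            best_pair = max(pair_counts, key=pair_counts.get)
            merges.append(best_pair)
            merged = best_pair[0] + best_pair[1]
            vocab.add(merged)
            # Rewrite splits to apply this merge.
            for word, symbols in splits.items():
                out, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                        out.append(merged)
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                splits[word] = out
        return vocab, merges

Why Subwords Work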
Word-level tokenisation has a closed-vocabulary problem: any word not seen in training becomes an unknown token. Character-level tokenisation avoids unknown tokens but produces very long sequences that are expensive to process and generalise poorly across positions. BPE sits between the two: common words become single tokens, rare words decompose into smaller sub-units that often (though not always) line up with morphemes, as in un + break + able, and out-of-vocabulary words remain expressible.
The result is a fixed-size vocabulary (a hard requirement for the unembedding matrix) that covers any input — provided the base symbols cover all characters/bytes.
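The three granularities can be compared on a toy sentence. The sketch below is an illustration under stated assumptions: both vocabularies are hand-picked, and the subword splitter uses a simple longest-match lookup rather than BPE merge order, purely to show the shape of the output.

```python
# Hypothetical vocabularies, chosen by hand for this example.
word_vocab = {"the", "model", "breaks"}
subword_vocab = {"the", "model", "is", "un", "break", "able"} | set(
    "abcdefghijklmnopqrstuvwxyz"
)


def word_level(text):
    """Unseen words collapse to a single unknown token."""
    return [w if w in word_vocab else "<unk>" for w in text.split()]


def char_level(text):
    """No unknowns, but one token per character."""
    return list(text)


def subword_level(text):
    """Longest-match split (not BPE merge order); illustration only."""
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in subword_vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                # Character outside the base alphabet: emit unknown, move on.
                tokens.append("<unk>")
                i += 1
    return tokens


text = "the model is unbreakable"
print(word_level(text))       # ['the', 'model', '<unk>', '<unk>']
print(len(char_level(text)))  # 24
print(subword_level(text))    # ['the', 'model', 'is', 'un', 'break', 'able']
```

The word-level output loses information, the character-level output is four times longer than the subword one, and the subword output keeps every word recoverable.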
Byte-Level BPE
Standard BPE starts from Unicode characters, which means rare scripts, emoji and binary input may still produce unknowns. GPT-2's byte-level BPE starts from the 256 raw bytes, making the base alphabet trivially complete for any byte sequence. Every input is exactly representable; nothing is unknown.
Llama 3, Qwen 2, DeepSeek-V3 and the GPT family use byte-level BPE; Llama 2 and the early Mistral models use SentencePiece BPE with byte fallback instead. The trade-off of working on bytes is that a single multibyte character (an emoji, or a CJK character) may decompose into several tokens, which can inflate Chinese and Japanese sequence lengths compared with a tokeniser that starts from Unicode characters, such as SentencePiece's BPE mode.
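The byte-level base alphabet is easy to inspect with Python's built-in UTF-8 encoding. The sketch below only shows the starting symbols before any merges are applied; `to_byte_symbols` and `from_byte_symbols` are hypothetical helper names.

```python
def to_byte_symbols(text):
    """Return the UTF-8 bytes of text as integers in range(256)."""
    return list(text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Invert to_byte_symbols; valid UTF-8 round-trips exactly."""
    return bytes(symbols).decode("utf-8")


for ch in ["a", "\u00e9", "日", "😀"]:
    symbols = to_byte_symbols(ch)
    # Every symbol is one of the 256 base tokens, so nothing is unknown.
    assert all(0 <= b < 256 for b in symbols)
    print(ch, len(symbols), symbols)

# a 1 [97]
# é 2 [195, 169]
# 日 3 [230, 151, 165]
# 😀 4 [240, 159, 152, 128]
```

A CJK character starts as three base symbols and an emoji as four, so unless the learned merges cover those byte sequences, such characters cost several tokens each.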
Vocabulary Sizes
Vocabulary size trades off two things: a larger vocab means fewer tokens per text (cheaper inference) but a larger embedding/unembedding matrix (more parameters and memory). Llama 3's jump from 32k to 128k cut average tokens-per-sequence by roughly 15-20 per cent on multilingual and code data.
| Model | Vocab size | Notes |
|---|---|---|
| GPT-2 | 50,257 | Byte-level BPE |
| GPT-3 | 50,257 | Same tokeniser as GPT-2 |
| GPT-3.5 Turbo, GPT-4 | 100,277 (cl100k_base) | Byte-level BPE |
| Llama 2 | 32,000 | SentencePiece BPE |
| Llama 3 | 128,256 | Tiktoken-style byte BPE |
| Qwen 2 | 151,936 | Byte-level BPE |
| DeepSeek-V3 | 129,280 | Byte-level BPE |
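The parameter side of the trade-off is simple arithmetic: each embedding or unembedding matrix holds vocab size × hidden size weights. The sketch below assumes a hidden size of 4096 and untied input/output matrices; both are illustrative assumptions rather than a statement about any particular model's configuration.

```python
def embedding_params(vocab_size, d_model, tied=False):
    """Parameters in the embedding (and, if untied, unembedding) matrix."""
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix


D_MODEL = 4096  # assumed hidden size, for illustration only

for label, vocab_size in [("32k vocab", 32_000), ("128k vocab", 128_256)]:
    print(label, f"{embedding_params(vocab_size, D_MODEL):,}")

# 32k vocab 262,144,000
# 128k vocab 1,050,673,152
```

Under these assumptions, quadrupling the vocabulary adds roughly 0.8 billion parameters, which matters far more for a small model than for a very large one.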
Implementations
Three implementations dominate: OpenAI's tiktoken (Rust core with Python bindings, byte-level, used for GPT-3.5/4/4o), HuggingFace tokenizers (Rust, supports BPE and other algorithms), and Google's SentencePiece (C++, supports BPE and Unigram). Their encodings agree only when the merge rules, normalisation and pre-tokenisation all match; defaults differ between the libraries, so the same corpus and vocabulary size do not guarantee identical tokens. Beyond that, the differences are speed and library ergonomics.
Where BPE Falls Short
Greedy merging produces tokenisations that are deterministic but not necessarily optimal. BPE-dropout (Provilkov et al., 2020) randomly skips merges during training as a regulariser. The Unigram tokeniser (Kudo, 2018; used by T5 and ALBERT via SentencePiece) is a probabilistic alternative that scores candidate segmentations with a unigram language model and keeps the most likely one.
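The idea behind BPE-dropout can be sketched in a few lines: at each step, every applicable merge is independently skipped with probability `p`, and the highest-priority survivor is applied. This is a simplified illustration, not the reference implementation; `encode_with_dropout` is a hypothetical name and the merge list is hand-written.

```python
import random


def encode_with_dropout(word, merges, p, rng):
    """Encode one word, dropping each candidate merge with probability p."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while True:
        # Candidates: adjacent pairs that have a rule and survive dropout.
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in rank and rng.random() >= p
        ]
        if not candidates:
            break
        # Apply the earliest-learned surviving merge (leftmost on ties).
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # hypothetical merge list
rng = random.Random(0)

print(encode_with_dropout("lower", merges, 0.0, rng))  # ['low', 'er']
print(encode_with_dropout("lower", merges, 1.0, rng))  # ['l', 'o', 'w', 'e', 'r']
```

With `p = 0` this reduces to ordinary deterministic encoding, and with `p = 1` it falls back to characters; intermediate values expose the model to several segmentations of the same word.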
Recent research on tokeniser-free models (ByT5, Charformer, MambaByte) shows that operating on raw bytes can match BPE on quality at the cost of longer sequences. As of 2026, BPE remains dominant — efficiency wins at scale.
GPT-4's tokeniser, cl100k_base, allocates many tokens to English text and common programming-language patterns, which is why it is roughly 1.6× more efficient than GPT-3's on code and English but only slightly better on other languages.