Byte-Pair Encoding (BPE) tokenization variants form a family of subword segmentation algorithms that decompose text into variable-length token units by iteratively merging frequent character or byte sequences. This enables open-vocabulary language modeling without out-of-vocabulary tokens while balancing vocabulary size against sequence length.
Classical BPE Algorithm
BPE (Sennrich et al., 2016) starts with a character-level vocabulary and iteratively merges the most frequent adjacent symbol pair into a new token. Training proceeds for a fixed number of merge operations (typically 32K-50K merges). The resulting vocabulary captures common subwords (e.g., "ing", "tion", "pre") while rare words decompose into smaller units. Encoding applies the learned merges to new text in the same priority order in which they were learned. GPT-2 and GPT-3 use byte-level BPE operating on raw UTF-8 bytes rather than Unicode characters, eliminating unknown characters entirely.
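As a concrete illustration, here is a minimal sketch of the merge-training loop on a toy corpus. The word frequencies, the merge count, and the omission of an end-of-word marker are all simplifications for readability, not the reference implementation:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

merges = []
for _ in range(10):  # real tokenizers run ~32K-50K merges
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges[:2])  # [('e', 's'), ('es', 't')]
```

Encoding new text then replays `merges` in priority order, which is why merge order matters for BPE but not for WordPiece's longest-match scheme described further below.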
SentencePiece and Language-Agnostic Tokenization
- SentencePiece: Treats the input as a raw character stream without pre-tokenization (no language-specific word-boundary assumptions)
- Whitespace handling: Replaces spaces with a visible meta symbol (▁, U+2581) so tokenization is fully reversible
- Training modes: Supports both BPE and Unigram algorithms within the same framework
- Normalization: Built-in Unicode NFKC normalization ensures consistent tokenization across scripts
- Adoption: Used by T5, LLaMA, PaLM, Gemma, and most multilingual models
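A minimal usage sketch with the `sentencepiece` Python package, assuming a hypothetical local text file `corpus.txt`; the parameter values are illustrative, not recommendations:

```python
import sentencepiece as spm

# Train a BPE-mode SentencePiece model; corpus.txt is a hypothetical plain-text file.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_bpe",
    vocab_size=8000,
    model_type="bpe",           # or "unigram" for the Unigram algorithm
    character_coverage=0.9995,  # fraction of characters covered before byte fallback
    normalization_rule_name="nfkc",
)

sp = spm.SentencePieceProcessor(model_file="spm_bpe.model")
pieces = sp.encode("New York is expensive.", out_type=str)
print(pieces)             # e.g. ['▁New', '▁York', '▁is', '▁expens', 'ive', '.']
print(sp.decode(pieces))  # spaces recovered from the ▁ markers, so decoding is lossless
```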
Unigram Language Model Tokenization
- Probabilistic approach: Starts with a large candidate vocabulary and iteratively removes tokens that least reduce the corpus likelihood
- Subword regularization: Samples from multiple valid segmentations during training (e.g., "unbreakable" → ["un", "break", "able"] or ["unbreak", "able"])
- EM algorithm: Expectation-Maximization optimizes token probabilities during training; Viterbi decoding finds the most probable segmentation at inference (a minimal Viterbi sketch follows this list)
- Advantages over BPE: More robust tokenization (not order-dependent), better handling of morphologically rich languages
- Vocabulary pruning: Removes roughly 20-30% of the candidate vocabulary per iteration until the target size is reached
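The Viterbi decoding step referenced above can be sketched as follows, using hypothetical token probabilities for a toy vocabulary:

```python
import math

# Hypothetical unigram log-probabilities for a tiny illustrative vocabulary.
logp = {"un": math.log(0.10), "break": math.log(0.05), "able": math.log(0.08),
        "unbreak": math.log(0.001), "u": math.log(0.02), "n": math.log(0.02),
        "b": math.log(0.01), "r": math.log(0.01), "e": math.log(0.03),
        "a": math.log(0.03), "k": math.log(0.01), "l": math.log(0.02)}

def viterbi_segment(text, logp, max_piece_len=10):
    """Return the most probable segmentation of `text` under a unigram LM."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (best log-prob, backpointer) per prefix length
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece_len), end):
            piece = text[start:end]
            if piece in logp and best[start][0] + logp[piece] > best[end][0]:
                best[end] = (best[start][0] + logp[piece], start)
    # Backtrack from the end of the string to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        if start is None:
            raise ValueError("no segmentation found")
        pieces.append(text[start:end])
        end = start
    return list(reversed(pieces))

print(viterbi_segment("unbreakable", logp))  # ['un', 'break', 'able']
```

Subword regularization replaces this single best path with segmentations sampled in proportion to their probabilities during training.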
WordPiece Tokenization
- Google's variant: Used in BERT, DistilBERT, and ELECTRA models
- Likelihood-based merging: Merges pairs that maximize the language model likelihood of the training corpus (not just frequency)
- Prefix markers: Uses ## prefix for continuation subwords (e.g., "playing" → ["play", "##ing"])
- Greedy longest-match: Encoding applies longest-match-first against the vocabulary rather than replaying a learned merge order (sketched after this list)
- Vocabulary size: English BERT uses 30,522 WordPiece tokens; multilingual BERT uses a larger vocabulary (~120K tokens) covering 104 languages
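A minimal sketch of the greedy longest-match-first encoding described above, using a hypothetical mini-vocabulary (real WordPiece implementations also enforce a maximum word length and apply normalization, both omitted here):

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece encoding of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:                   # try the longest substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                      # no valid segmentation for this word
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical mini-vocabulary; real BERT vocabularies hold ~30K entries.
vocab = {"play", "##ing", "##ed", "un", "##break", "##able"}
print(wordpiece_encode("playing", vocab))      # ['play', '##ing']
print(wordpiece_encode("unbreakable", vocab))  # ['un', '##break', '##able']
```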
Tokenization Impact on Model Performance
- Fertility rate: Average tokens per word varies by language (English ~1.2, Chinese ~1.8, Finnish ~2.5 for BPE-50K)
- Compression ratio: Better tokenizers produce shorter sequences, reducing compute cost and enabling longer effective context
- Tokenizer-model coupling: Changing tokenizers requires retraining; vocabulary mismatch degrades transfer learning
- Byte-level fallback: Models like LLaMA use byte-fallback BPE—unknown characters decompose to raw bytes rather than UNK tokens
- Tiktoken: OpenAI's fast BPE implementation used for GPT-4 with cl100k_base vocabulary (100,256 tokens)
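A short sketch using the tiktoken package (assuming it is installed); the fertility figure computed here is just the crude tokens-per-whitespace-word ratio mentioned above, so it only makes sense for space-delimited languages:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 encoding

text = "Tokenization quality varies widely across languages."
token_ids = enc.encode(text)
print(len(token_ids), token_ids[:5])
print(enc.decode(token_ids) == text)        # byte-level BPE round-trips exactly

# Rough fertility estimate: tokens per whitespace-separated word.
words = text.split()
print(f"fertility ~= {len(token_ids) / len(words):.2f} tokens/word")
```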
Emerging Tokenization Research
- Tokenizer-free models: ByT5 and MegaByte operate directly on bytes, eliminating tokenization artifacts at the cost of longer sequences (a byte-level sketch follows this list)
- Dynamic vocabularies: Adaptive tokenization adjusts vocabulary based on input domain or language
- Multilingual fairness: BPE vocabularies trained on English-heavy corpora under-represent other languages, causing fertility inflation and reduced effective context length
- Visual tokenizers: VQ-VAE and VQGAN discretize image patches into tokens for vision transformers
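For reference, a tokenizer-free byte-level baseline is trivial to sketch; real byte-level models such as ByT5 additionally reserve a handful of IDs for special tokens, which this sketch omits:

```python
def byte_tokens(text):
    """Tokenizer-free baseline: every UTF-8 byte becomes one token ID (0-255)."""
    return list(text.encode("utf-8"))

def byte_detokenize(ids):
    """Inverse operation: byte IDs back to text."""
    return bytes(ids).decode("utf-8")

en = "hello"   # 5 ASCII characters -> 5 byte tokens
zh = "深度学习"  # 4 CJK characters -> 12 byte tokens (3 bytes each in UTF-8)
print(len(byte_tokens(en)), len(byte_tokens(zh)))  # 5, 12
print(byte_detokenize(byte_tokens(zh)) == zh)      # lossless round trip
```

The length difference between the two inputs illustrates why byte-level models pay a sequence-length penalty, especially outside Latin scripts.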
Subword tokenization remains the foundational bridge between raw text and neural network computation, with tokenizer quality directly impacting model efficiency, multilingual equity, and downstream task performance across all modern language models.