History of LLM Evolution (6): Unveiling the Mystery of Tokenizers

Deeply understand how tokenizers work, learning about the BPE algorithm, the tokenization strategies of the GPT series, and implementation details of SentencePiece.

Jul 4, 2024 · 50 min read
Tags: LLM, AI, Tokenizer, BPE, NLP

Human-Crafted: Written directly by the author with no AI-generated sections.

History of LLM Evolution (6): Unveiling the Mystery of Tokenizers

The Tokenizer is a vital yet often overlooked component of LLMs. In our previous language models, tokenization was character-level, creating an embedding table for all 65 possible characters. In practice, modern language models use more sophisticated schemes, operating at the chunk level using algorithms like Byte Pair Encoding (BPE).

For instance, in the GPT-2 paper Language Models are Unsupervised Multitask Learners, researchers built a vocabulary of 50,257 tokens with a context length of 1024 tokens.

Attention context

In the Transformer network’s attention layers, each token attends to the preceding 1024 tokens.

Tokens are the atomic units of a language model, and tokenization is the process of converting raw text strings into token sequences.

Side note: Some research (like MegaByte) explores inputting bytes directly without tokenization, though this is not yet fully proven.

First Taste

Tokenization is the root cause of many peculiar LLM behaviors:

Tokenization problems

  • Why can’t LLMs spell correctly? Tokenization.
  • Why do LLMs struggle with simple string tasks like reversal? Tokenization.
  • Why is LLM performance worse for non-English languages (e.g., Japanese)? Tokenization.
  • Why do LLMs struggle with simple arithmetic? Tokenization.
  • Why did GPT-2 have unnecessary trouble with Python coding? Tokenization.
  • Why does my LLM stop abruptly when it sees “<|endoftext|>”? Tokenization.
  • What’s with the weird “trailing space” warnings? Tokenization.
  • Why does my LLM output nonsense for “SolidGoldMagikarp”? Tokenization.
  • Why should I prefer YAML over JSON with LLMs? Tokenization.
  • Why aren’t LLMs actually end-to-end language models? Tokenization.
  • What is the true source of suffering? Tokenization.

tiktokenizer.vercel.app provides visual tokenization. Let’s look at GPT-2:

GPT2 tokenization

Notice that a token often includes the preceding space.

English tokenization looks reasonable. However, in arithmetic, a single number might be split across two tokens. Case sensitivity affects tokenization. Non-English languages are wasteful, often one token per character. In Python, indentation consists of many space tokens, consuming context length. It’s as if non-English sequences are stretched relative to their meaning.

GPT-2’s poor Python performance wasn’t a model issue but a tokenization characteristic.

GPT4 tokenization

GPT-4’s tokenizer is much more “reasonable,” especially with whitespace. This intentional choice significantly contributed to GPT-4’s superior coding ability.

GPT-4 consumes nearly half the tokens for the same input, largely because its vocabulary doubled (~50k to 100k).

This compactness is beneficial as it roughly doubles the effective context length for the Transformer. However, increasing the embedding table size increases the cost of the final Softmax. We must balance compactness with computational efficiency.

Unicode

To feed strings into a neural network, we first tokenize them into integer sequences. These integers retrieve vectors from a lookup table for the Transformer input.

How do we support multiple languages and emojis?

Unicode example

In Python 3, all strings are Unicode strings, supporting characters, symbols, and emojis across languages without encoding worries:

Unicode codepoints

Use ord() to get Unicode codepoints for characters.
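
A quick sketch (chr() is the inverse of ord()):

print(ord("h"))                      # 104
print(ord("안"))                     # 50504
print(ord("🤔"))                     # 129300
print([ord(c) for c in "hello 🤔"])  # one codepoint per character
print(chr(104))                      # 'h'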

Why not use these codepoints directly as inputs?

Unicode reasons

One reason is the massive vocabulary (~150,000). Another is that the Unicode standard evolves; we want stability.

UTF-8

The Unicode Consortium defined UTF-8, UTF-16, and UTF-32. These convert Unicode text into binary data or byte streams. UTF-8 is the most common variable-length encoding.

UTF8 encoding

UTF-8 maps each codepoint to 1-4 bytes. UTF-32 is fixed-length.

Python strings have an encode method:

UTF8 encode method

This yields a bytes object. Converting to a list reveals the raw bytes:

UTF8 bytes

UTF-16 and UTF-32 are wasteful for simple ASCII/English, padding with many ‘0’ bytes:

UTF16 UTF32 comparison
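
A small sketch comparing the three encodings on the same string:

s = "hello 🤔"
for enc_name in ("utf-8", "utf-16", "utf-32"):
    b = s.encode(enc_name)
    print(enc_name, len(b), list(b))
# UTF-8 stays compact for ASCII; UTF-16/UTF-32 pad the same text with many zero bytes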

Directly using raw UTF-8 bytes limits the vocabulary to 256, resulting in excessively long sequences. While this keeps the embedding table small, it’s highly inefficient for Transformers due to limited attention context.

We don’t want raw bytes. We want a larger, adjustable vocabulary built on top of UTF-8.

Byte-Pair Encoding (BPE)

BPE iteratively compresses a byte sequence by replacing the most frequent pair of adjacent bytes with a new, unused byte. It's straightforward:

Given:

aaabdaaabac

The vocabulary size is 4 and the length is 11 bytes. The pair "aa" is most frequent; replace it with an unused byte like "Z":

ZabdZabac
Z=aa

Next, replace “ab” with “Y”:

ZYdZYac
Y=ab
Z=aa

Finally, replace “ZY” with “X”:

XdXac
X=ZY
Y=ab
Z=aa

No more pairs occur more than once. Length is now 5, vocabulary size 7.

BPE compression

Unicode characters (like emojis) take multiple bytes in UTF-8, so the initial sequence can be long.

Python implementation:

def get_stats(ids):
	counts = {}
	for pair in zip(ids, ids[1:]): # Pythonic consecutive iteration
		counts[pair] = counts.get(pair, 0) + 1
	return counts
 
stats = get_stats(tokens)

Byte pair stats

Sorted byte pair frequencies.

Use chr() to see the characters for the top pair:

Byte pair chars
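
For example, continuing from the stats dictionary above (a quick sketch):

top_pair = max(stats, key=stats.get)                 # the (byte, byte) pair with the highest count
print(top_pair, chr(top_pair[0]), chr(top_pair[1]))  # show the pair as characters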

Implementing the merge (minting a new token):

def merge(ids, pair, idx):
	# Replace all consecutive occurrences of pair with new token idx
	newids = []
	i = 0
	while i < len(ids):
		if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
			newids.append(idx)
			i += 2
		else:
			newids.append(ids[i])	
			i += 1
	return newids

Merge operation

Running on full tokens:

Merge result

Sequence reduced by 20 bytes (616 to 596).

Iterating this shrinks sequence length but grows vocabulary size. This iteration count is a hyperparameter.

Setting vocab size to 276 (256 + 20 iterations):

BPE iterations
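
A minimal training-loop sketch, assuming tokens still holds the raw UTF-8 bytes of the text and get_stats/merge are the functions above; new token ids start at 256:

vocab_size = 276                       # 256 raw bytes + 20 merges
num_merges = vocab_size - 256
ids = list(tokens)                     # keep the original byte sequence intact

merges = {}                            # (pair) -> new token id
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)   # most frequent remaining pair
    idx = 256 + i                      # mint a new token id
    ids = merge(ids, pair, idx)
    merges[pair] = idx

print(f"compression ratio: {len(tokens) / len(ids):.2f}X")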

New tokens can themselves be merged, building a binary forest.

Final compression ratio:

Compression ratio

Decode

The Tokenizer is a pre-processing stage independent of the LLM, with its own training set. It acts as a translation layer between raw text and token sequences:

Tokenizer pipeline

Tokenizer training involves many languages and code. If the training set is rich in Chinese, more Chinese tokens are merged, resulting in shorter Chinese sequences—making the model more “Chinese-friendly” within its context limit. GPT-4o’s tokenizer excels at this:

Chinese friendly

Source: GitHub - gpt-tokens. Issues arising from a mismatch between the tokenizer's training data and the model's training data are known as the “SolidGoldMagikarp problem,” discussed later.

Decoding implementation:

Decode implementation

However, this implementation has a potential flaw:

Decode error

Error when decoding token 128.

UTF-8 is variable-length and doesn’t automatically handle arbitrary byte sequences. In our case, the raw bytes violate the standard:

UTF8 format

Fix by changing errors='strict' to 'replace':

Decode error handling

The replacement character indicates the LLM output an invalid token sequence.
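
A sketch of this decode path, assuming merges is the (pair -> new id) dictionary built during training:

# vocab: token id -> bytes, built from the 256 raw bytes plus each merge in order
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    # concatenate the byte chunks, then decode UTF-8 leniently
    tokens = b"".join(vocab[idx] for idx in ids)
    return tokens.decode("utf-8", errors="replace")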

Encode

Encode implementation

Using Claude 3.5 Sonnet to explain the lambda + list comprehension step:

Claude explanation

Requiring len(tokens) >= 2 before looking for a merge ensures that when only a single byte (or an empty string) remains, the loop is skipped entirely, so min() is never called on an empty stats dict (see the encode sketch below).
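
A matching encode sketch under the same assumptions (get_stats, merge, and merges from above):

def encode(text):
    tokens = list(text.encode("utf-8"))            # start from raw UTF-8 bytes
    while len(tokens) >= 2:                        # need at least one pair to merge
        stats = get_stats(tokens)
        # pick the pair that was merged earliest during training
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                                  # nothing left to merge
        tokens = merge(tokens, pair, merges[pair])
    return tokens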

Testing other cases:

Encode test

We’ve built a simple Tokenizer where the core parameters are the merges dictionary. Now, let’s explore the complexities of advanced LLM Tokenizers.

GPT Series

In GPT-2, researchers didn't use our naive approach. Frequent words like “dog” are often followed by punctuation (“dog.”, “dog!”), and naive BPE would merge the word together with the punctuation, cluttering the merges dictionary with many redundant variants of the same word. This is suboptimal:

GPT2 word punctuation

The solution is prohibiting certain token combinations. GPT-2’s implementation: gpt-2/src/encoder.py#L53

Regex

GPT2 regex

Note: re here is the third-party regex module (imported as re), not the standard library module.

Analyzing this regex:

  1. 's|'t|'re|'ve|'m|'ll|'d: Common English contractions.
  2. (space)?\p{L}+: One or more letters, optionally preceded by a space.
  3. (space)?\p{N}+: One or more numbers, optionally preceded by a space.
  4. (space)?[^\s\p{L}\p{N}]+: One or more punctuation marks (non-space, non-letter, non-number), optionally preceded by a space.
  5. \s+(?!\S): One or more whitespaces not followed by a non-whitespace character (i.e., trailing spaces or pure whitespace strings). This subtle rule preserves extra spaces while correctly matching “space + word.”

Regex patterns

  6. \s+: Any remaining whitespace (a final catch-all).

Matching Result

Using findall on “Hello've world123 how's are you!!!?”:

Regex matching

  • “Hello” is matched as a word (no leading space at the start of the string); “ world”, “ how”, “ are”, and “ you” are matched as words together with their preceding space.
  • “123” is matched as a number.
  • “'ve” and “'s” are matched as contractions.
  • “!!!?” is matched as punctuation.
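
A sketch reproducing this split with the third-party regex module (the pattern is the GPT-2 one quoted in the Tiktoken section below):

import regex as re   # pip install regex; needed for \p{...} classes

gpt2_pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)
print(gpt2_pat.findall("Hello've world123 how's are you!!!?"))
# ['Hello', "'ve", ' world', '123', ' how', "'s", ' are', ' you', '!!!?']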

The text is split into chunks that are tokenized independently, preventing merges across chunk boundaries (like merging a word’s last letter with the next word’s space).

The original pattern misses uppercase contractions like “HOW'S” because the apostrophe alternatives are lowercase only; adding case-insensitive matching (re.IGNORECASE) changes how such text is split:

Python code tokenization

Case study with Python code:

Python code tokenization example

Not every chunk was run through BPE during GPT-2 tokenizer training: large space indentations often remain as individual space tokens (token 220), so OpenAI likely enforced extra rules against merging these spaces. Since the training code isn't open source, the exact process remains a mystery.

Python tokenization result

Tiktoken

tiktoken provides inference code, not training code.
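
A quick usage sketch (requires pip install tiktoken); per the note above, this is inference only:

import tiktoken

for name in ("gpt2", "cl100k_base"):           # GPT-2 vs GPT-4 vocabularies
    enc = tiktoken.get_encoding(name)
    ids = enc.encode("    print('hello world')")
    print(name, len(ids), ids)
    print(enc.decode(ids))                     # decoding round-trips the string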

Tiktoken GPT2 regex

Tiktokenizer uses tiktoken in JS. GPT-4’s tokenizer changed the chunking regex, merging multiple spaces.

Optimized regex in tiktoken:

Tiktoken optimized regex

# Original
pattern = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
 
# Optimized
pattern = r"'(?:[sdtm]|[rlve][re])| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

Non-capturing groups (?: ...) and character classes [ ... ] improve efficiency for contractions.

GPT-4’s tokenizer regex changed significantly:

GPT4 regex

GPT-4o’s explanation:

pat_str = r"""
    '(?i:[sdmt]|ll|ve|re)          # Case-insensitive contractions
    | [^\r\n\p{L}\p{N}]?+\p{L}+    # Optional punctuation followed by letters
    | \p{N}{1,3}                   # 1-3 digits
    | ?[^\s\p{L}\p{N}]++[\r\n]*    # Optional space, punctuation, and optional newlines
    | \s*[\r\n]                    # Whitespace followed by a newline
    | \s+(?!\S)                    # Multiple spaces not followed by non-whitespace
    | \s+                          # Multiple spaces
"""

Key takeaways:

  • (?i:...) enables case-insensitivity within the group.
  • Numbers never merge beyond three digits to prevent long number sequences from becoming tokens.
  • Vocabulary expanded from ~50k to ~100k.

encoder.py

GPT-2’s encoder.py loads encoder.json and vocab.bpe.

Encoder json

  1. encoder.json: Corresponds to our vocab (integer to bytes for decoding).
  2. vocab.bpe: Corresponds to our merges list.

With these, we have a complete tokenizer for encoding and decoding.

The merge implementation iterates until no more merges are possible:

BPE merge implementation

OpenAI also uses byte_encoder and byte_decoder for handling multi-lingual and special characters.

Special Tokens

We can insert special tokens to separate data parts or introduce specific structures.

Special tokens

GPT-2 has 50,257 tokens: 256 raw byte tokens plus 50,000 merges gives 50,256. What is the last one?

It’s the ENDOFTEXT token:

Endoftext token

It separates documents in the training set (“End of document, next content is unrelated”). Implementation is in tiktoken/src/lib.rs.
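
By default tiktoken refuses to encode special-token text found in user input; you must opt in explicitly (a sketch):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Hello <|endoftext|> world"

# Opt in: the marker becomes a single special token (id 50256 in GPT-2)
print(enc.encode(text, allowed_special={"<|endoftext|>"}))
# Opt out of the check: the marker is tokenized as ordinary text
print(enc.encode(text, disallowed_special=()))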

Special tokens are also crucial for fine-tuning to delineate “Assistant” and “User” turns.

GPT3.5 special tokens

GPT-3.5-Turbo uses markers like “<|im_start|>” (“im” for Imaginary) to track dialogue flow.

Token expansion:

Token expansion

GPT-4 adds four more special tokens (three FIM and one Control Token):

GPT4 special tokens

FIM = Fill in the Middle. See paper [2207.14255].

Adding a special token requires extending the Transformer’s embedding matrix and final classifier projection by one row—a “gentle surgery” that can even be done on a frozen base model.
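
A hedged PyTorch sketch of that surgery; the names tok_emb and lm_head are hypothetical, and a bias-free output projection is assumed:

import torch
import torch.nn as nn

def add_tokens(tok_emb: nn.Embedding, lm_head: nn.Linear, num_new: int):
    """Append num_new rows to the embedding table and the output projection."""
    old_vocab, dim = tok_emb.weight.shape
    new_emb = nn.Embedding(old_vocab + num_new, dim)
    new_head = nn.Linear(dim, old_vocab + num_new, bias=False)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = tok_emb.weight    # copy existing embedding rows
        new_head.weight[:old_vocab] = lm_head.weight   # copy existing logit rows
        # the new rows keep their random init; train only these if the base is frozen
    return new_emb, new_head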

minbpe

Visualizing GPT-4 merges vs. BPE trained on Taylor Swift’s Wikipedia entry reveals drastic differences in merge order:

Vocab visualization

GPT-4 prioritizes merging spaces, likely due to massive amounts of Python code in its training data.

SentencePiece

google/sentencepiece is another popular library (used by Llama, Mistral).

Key differences from Tiktoken:

  • Tiktoken: first UTF-8-encodes the string into bytes, then runs BPE merges on bytes (byte-level).
  • SentencePiece: runs BPE merges directly on Unicode codepoints. Rare codepoints either map to the UNK token or, if byte fallback is enabled, are UTF-8-encoded into byte tokens.

Byte fallback example

Byte fallback mechanism illustrated.

SentencePiece configuration is complex (e.g., Llama 2):

Sentencepiece config

Normalization (lowercase, removing double spaces, Unicode normalization) is common in translation/classification, but Andrej prefers avoiding it for language models to preserve raw data forms.

Byte Fallback

Training a 400-token BPE vocabulary on English text with byte fallback produces, in order:

  1. Special tokens.
  2. 256 byte tokens.

Byte fallback tokens

  3. Merged BPE tokens.

BPE tokens

  4. Remaining codepoint tokens.

Codepoint tokens

character_coverage=0.99995 ignores extremely rare codepoints to minimize UNK frequency.
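
A sketch of a comparable SentencePiece training call; toy.txt is a placeholder file name, and the options mirror the ones discussed here:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="toy.txt",              # placeholder: an English training text file
    model_prefix="tok400",
    model_type="bpe",
    vocab_size=400,
    byte_fallback=True,           # unknown codepoints fall back to byte tokens
    character_coverage=0.99995,   # ignore extremely rare codepoints
    add_dummy_prefix=True,        # prepend a space so "world" and " world" align
)

sp = spm.SentencePieceProcessor(model_file="tok400.model")
print(sp.encode("hello 안녕하세요", out_type=str))   # characters unseen in training fall back to byte tokens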

With byte fallback, Korean characters (absent from training) are UTF-8 encoded into known byte tokens instead of UNK:

Byte fallback korean

Korean characters use tokens 3-258.

Without byte fallback, the vocabulary gains more merge tokens at the cost of Korean becoming UNK:

No byte fallback result

Spaces are rendered as the ▁ (U+2581) metasymbol. Also, add_dummy_prefix=True adds a leading space. Motivation:

In Tiktoken, “world” at the start of a sentence differs from “ world” in the middle. The model must learn they are conceptually similar:

Tiktoken space handling

A dummy prefix ensures both use the same “(space)world” token.

vocab_size

Where does vocab_size appear in the network?

Vocab size locations

  1. Token Embedding Table: a 2D array with vocab_size rows; each row is a vector optimized during training. A larger vocabulary means a larger embedding table.
  2. lm_head Layer: the final linear layer that produces a logit for every vocabulary entry when predicting the next token; each additional token adds one more row, i.e. one more dot product per position (see the sketch below).
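
A minimal sketch of where these two appear in a GPT-style model definition (GPT-2-small sizes, for illustration):

import torch.nn as nn

vocab_size, n_embd = 50257, 768                       # GPT-2-small sizes

tok_emb = nn.Embedding(vocab_size, n_embd)            # (1) token embedding table: vocab_size rows
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # (2) one logit (one dot product) per vocabulary entry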

Vocab size considerations

Why not an infinite vocabulary?

  • Computational cost: Larger embedding tables and linear layers.
  • Under-training: Rare tokens appear less frequently, leading to poorly trained vectors.
  • Information overload: Too much compression might squeeze too much information into a single token, overwhelming the Transformer’s processing capacity.

State-of-the-art (SOTA) architectures usually choose ~10,000 to 100,000.

Extending Pre-trained Vocabularies

Common during fine-tuning (e.g., adding [USER] and [ASSISTANT] markers). Just append rows to the embedding table and extend the final linear layer, a mild modification; the base weights can even remain frozen.

Gist Tokens

[2304.08467] Learning to Compress Prompts with Gist Tokens introduces gist tokens to compress long prompts during instruction tuning, reducing costs while maintaining performance.

Multimodal

Multimodal models (text, image, audio) often keep the Transformer architecture unchanged and just use specialized tokenization to treat multimodal inputs as pseudo-text tokens:

Multimodal tokenization

Sora (OpenAI) tokenizes video into discrete patches for autoregressive processing:

Sora tokenization

Prompt Attack

Prompt attack

Avoid parsing special tokens within user input to prevent “prompt injection” or tampering.

SolidGoldMagikarp

SolidGoldMagikarp — LessWrong. Clustering reveals “glitch tokens”:

SolidGoldMagikarp token

Including these causes erratic behavior or safety violations. Why?

Reddit user “SolidGoldMagikarp” was mentioned so frequently that the name became a single BPE token. Since this token was never activated or optimized during model training (it appeared mainly in the tokenizer's dataset), inputting it leads to “undefined behavior,” much like reading uninitialized memory in C++.

Andrej’s Final Summary

  1. Don’t ignore tokenization: It’s full of potential bugs, security risks, and stability issues.
  2. Eliminating this step would be a great achievement.
  3. In practice:
    • Use TikToken for efficient inference.
    • Use SentencePiece BPE for training vocabularies from scratch (but watch the configs).
    • Switch to MinBPE once it matches SentencePiece efficiency :)