Curiosity · Concept
Tokenization
Tokenization converts raw text into a sequence of integer token IDs that a model can process. Modern LLMs use subword tokenizers (BPE, WordPiece, SentencePiece) that keep common words whole and split rare words into pieces. The choice of tokenizer affects context length, cost, multilingual quality, and even model behavior.
Quick reference
- Proficiency
- Beginner
- Also known as
- BPE, subword tokenization, text encoding
- Prerequisites
- Basic text processing
Frequently asked questions
What is tokenization in LLMs?
It is the step that turns a string of text into a list of integer token IDs that the model can embed. Tokens are usually subwords, so 'unhappiness' might become ['un', 'happi', 'ness'].
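The split can be sketched with a greedy longest-match subword tokenizer (WordPiece-style). The vocabulary below is hypothetical, chosen only to reproduce the 'unhappiness' example; real vocabularies are learned from data and contain tens of thousands of entries.

```python
# Hypothetical vocabulary mapping subword pieces to integer IDs.
VOCAB = {"un": 0, "happi": 1, "ness": 2, "happy": 3, "<unk>": 4}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest substring starting at i that is in the vocab.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return ["<unk>"]  # no piece matches: fall back to unknown token
    return pieces

def encode(word: str) -> list[int]:
    """Map pieces to their integer IDs -- what the model actually sees."""
    return [VOCAB[p] for p in tokenize(word)]

print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(encode("unhappiness"))    # [0, 1, 2]
```

Real tokenizers add details (byte fallback, continuation markers like `##`, merge priorities), but the core idea is the same: a string becomes a short list of integers.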
What's the difference between BPE and SentencePiece?
BPE starts from characters and iteratively merges the most frequent adjacent symbol pair until a target vocabulary size is reached. SentencePiece is a framework that can train BPE or unigram models directly on raw Unicode text, without requiring whitespace pre-tokenization, which makes it better suited to languages without word boundaries such as Japanese or Chinese.
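The BPE merge loop described above can be sketched in a few lines. This is a minimal toy with an assumed three-word corpus and a fixed number of merges in place of a real vocabulary-size budget:

```python
from collections import Counter

def most_frequent_pair(words: dict) -> tuple:
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words: dict, pair: tuple) -> dict:
    """Replace every occurrence of the pair with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency; each word starts as a tuple of characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # three merge steps; real training runs thousands
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))
# After three merges, 'low' has become a single symbol.
```

A real implementation also records the merge order, which is what lets the trained tokenizer segment unseen text deterministically.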
Why does tokenization matter for cost and limits?
LLM pricing and context windows are measured in tokens, not characters or words. A 1,000-word English document is roughly 1,300 tokens; the same content in Hindi or in code can take 2-4x as many tokens, which multiplies cost and eats into the context window.
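A back-of-envelope estimate follows directly from those numbers. The ~1.3 tokens-per-word ratio is the rough heuristic from the answer above, and the price per million tokens below is a placeholder, not any real model's rate:

```python
PRICE_PER_MILLION_TOKENS = 3.00  # hypothetical USD rate, not a real model's pricing

def estimate_tokens(n_words: int, tokens_per_word: float = 1.3) -> int:
    """Rough token count for English prose; code or Hindi can be 2-4x higher."""
    return round(n_words * tokens_per_word)

def estimate_cost(n_tokens: int) -> float:
    """Cost in USD at the assumed per-million-token rate."""
    return n_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

tokens = estimate_tokens(1000)  # a 1,000-word English document
print(tokens)                   # ~1300 tokens
print(estimate_cost(tokens))    # fractions of a cent at this rate
```

For real budgeting, count tokens with the target model's own tokenizer rather than a heuristic, since ratios vary widely by language and content type.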
Can tokenization cause model errors?
Yes. Tokenizer artifacts famously cause issues like LLMs struggling with character counting ('how many r's in strawberry'), arithmetic on multi-digit numbers, or handling of trailing whitespace. These are not reasoning failures — they are artifacts of how the text was split.
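The character-counting failure is easy to see once you look at what the model receives. With a hypothetical vocabulary that splits 'strawberry' into ['straw', 'berry'], the letters are buried inside two opaque IDs the model never gets to inspect:

```python
# Hypothetical vocabulary and IDs, chosen to illustrate the split.
VOCAB = {"straw": 101, "berry": 102}

def greedy_tokenize(word: str, vocab: dict) -> list[str]:
    """Greedy longest-match segmentation, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word!r}")
    return pieces

pieces = greedy_tokenize("strawberry", VOCAB)
ids = [VOCAB[p] for p in pieces]
print(pieces, ids)  # the model sees only the IDs, never the letters
print("actual 'r' count:", "strawberry".count("r"))  # 3, invisible at the ID level
```

The same mechanism explains the arithmetic issues: a number like 1234 may be one token while 1235 is two, so digit-level structure is inconsistent from the model's point of view.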
Sources
- Hugging Face — Tokenizers summary — accessed 2026-04-20
- Sennrich et al. — Neural Machine Translation of Rare Words with Subword Units (BPE) — accessed 2026-04-20
- Kudo & Richardson — SentencePiece — accessed 2026-04-20