Curiosity · Concept
Tokenization
Tokenization converts raw text into a sequence of integer token IDs that a model can process. Modern LLMs use subword tokenizers (BPE, WordPiece, SentencePiece) that keep common words whole and split rare words into pieces. The choice of tokenizer affects context length, cost, multilingual quality, and even model behavior.
Quick reference
- Proficiency
- Beginner
- Also known as
- BPE, subword tokenization, text encoding
- Prerequisites
- Basic text processing
Frequently asked questions
What is tokenization in LLMs?
It is the step that turns a string of text into a list of integer token IDs that the model can embed. Tokens are usually subwords, so 'unhappiness' might become ['un', 'happi', 'ness'].
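The split can be sketched with a greedy longest-match subword tokenizer (WordPiece-style). The vocabulary below is hypothetical, chosen only to reproduce the 'unhappiness' example; real vocabularies are learned from data and contain tens of thousands of entries.

```python
# Hypothetical vocabulary mapping subword pieces to integer IDs.
VOCAB = {"un": 0, "happi": 1, "ness": 2, "happy": 3, "<unk>": 4}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest substring starting at i that is in the vocab.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return ["<unk>"]  # no piece matches: fall back to unknown token
    return pieces

def encode(word: str) -> list[int]:
    """Map pieces to their integer IDs -- what the model actually sees."""
    return [VOCAB[p] for p in tokenize(word)]

print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(encode("unhappiness"))    # [0, 1, 2]
```

Real tokenizers add details (byte fallback, continuation markers like `##`, merge priorities), but the core idea is the same: a string becomes a short list of integers.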
What's the difference between BPE and SentencePiece?
BPE starts from characters and iteratively merges the most frequent adjacent symbol pair until a target vocabulary size is reached. SentencePiece is a framework that can train BPE or unigram models directly on raw Unicode text, without requiring whitespace pre-tokenization, which makes it better suited to languages without word boundaries such as Japanese or Chinese.
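The BPE merge loop described above can be sketched in a few lines. This is a minimal toy with an assumed three-word corpus and a fixed number of merges in place of a real vocabulary-size budget:

```python
from collections import Counter

def most_frequent_pair(words: dict) -> tuple:
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words: dict, pair: tuple) -> dict:
    """Replace every occurrence of the pair with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency; each word starts as a tuple of characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # three merge steps; real training runs thousands
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))
# After three merges, 'low' has become a single symbol.
```

A real implementation also records the merge order, which is what lets the trained tokenizer segment unseen text deterministically.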
Why does tokenization matter for cost and limits?
LLM pricing and context windows are measured in tokens, not characters or words. A 1,000-word English document is roughly 1,300 tokens; the same content in Hindi or in code can take 2-4x as many tokens, which multiplies cost and eats into the context window.
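A back-of-envelope estimate follows directly from those numbers. The ~1.3 tokens-per-word ratio is the rough heuristic from the answer above, and the price per million tokens below is a placeholder, not any real model's rate:

```python
PRICE_PER_MILLION_TOKENS = 3.00  # hypothetical USD rate, not a real model's pricing

def estimate_tokens(n_words: int, tokens_per_word: float = 1.3) -> int:
    """Rough token count for English prose; code or Hindi can be 2-4x higher."""
    return round(n_words * tokens_per_word)

def estimate_cost(n_tokens: int) -> float:
    """Cost in USD at the assumed per-million-token rate."""
    return n_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

tokens = estimate_tokens(1000)  # a 1,000-word English document
print(tokens)                   # ~1300 tokens
print(estimate_cost(tokens))    # fractions of a cent at this rate
```

For real budgeting, count tokens with the target model's own tokenizer rather than a heuristic, since ratios vary widely by language and content type.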
Can tokenization cause model errors?
Yes. Tokenizer artifacts famously cause issues like LLMs struggling with character counting ('how many r's in strawberry'), arithmetic on multi-digit numbers, or handling of trailing whitespace. These are not reasoning failures — they are artifacts of how the text was split.
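The character-counting failure is easy to see once you look at what the model receives. With a hypothetical vocabulary that splits 'strawberry' into ['straw', 'berry'], the letters are buried inside two opaque IDs the model never gets to inspect:

```python
# Hypothetical vocabulary and IDs, chosen to illustrate the split.
VOCAB = {"straw": 101, "berry": 102}

def greedy_tokenize(word: str, vocab: dict) -> list[str]:
    """Greedy longest-match segmentation, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word!r}")
    return pieces

pieces = greedy_tokenize("strawberry", VOCAB)
ids = [VOCAB[p] for p in pieces]
print(pieces, ids)  # the model sees only the IDs, never the letters
print("actual 'r' count:", "strawberry".count("r"))  # 3, invisible at the ID level
```

The same mechanism explains the arithmetic issues: a number like 1234 may be one token while 1235 is two, so digit-level structure is inconsistent from the model's point of view.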
Sources
- Hugging Face — Tokenizers summary — accessed 2026-04-20
- Sennrich et al. — Neural Machine Translation of Rare Words with Subword Units (BPE) — accessed 2026-04-20
- Kudo & Richardson — SentencePiece — accessed 2026-04-20