Overview
Tokenization is the fundamental first step in how AI "reads" text. It transforms human-readable text into sequences of numbers that neural networks can process. Without tokenization, language models like GPT-5.2, Claude 4.5, and Gemini 3 could not understand or generate text.
Why Tokenization Matters
Neural networks operate on numbers, not letters. Tokenization bridges this gap:
"Hello world!" → [15496, 995, 0] → Neural Network → Output
Tokenization Algorithms
Byte Pair Encoding (BPE)
The most common approach, used by GPT models:
- Start with characters: Split text into individual characters
- Find common pairs: Identify frequently occurring character pairs
- Merge iteratively: Replace common pairs with single tokens
- Build vocabulary: Repeat until reaching target vocabulary size
| Step | Text State | New Token |
|---|---|---|
| 0 | l o w e r | - |
| 1 | lo w e r | lo |
| 2 | low e r | low |
| 3 | lower | lower |
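The merge loop above can be sketched in a few lines of Python. This is a toy single-word version; real BPE counts pair frequencies across an entire training corpus, so ties are broken by corpus statistics rather than insertion order:

```python
from collections import Counter

def bpe_train(word, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of tokens."""
    tokens = list(word)                      # step 0: individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break                            # nothing left to merge
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # replace every occurrence of the winning pair with the merged token
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

print(bpe_train("lower", 2))   # (['low', 'e', 'r'], ['lo', 'low'])
```

The learned `merges` list is the tokenizer's real output: at inference time, those merge rules are replayed in order on new text.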
SentencePiece
Google's language-agnostic tokenizer:
- Works directly on raw text (no pre-tokenization)
- Handles any language without language-specific rules
- Used by Gemini 3, T5, and many multilingual models
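The "no pre-tokenization" idea can be illustrated with a minimal sketch: rather than splitting on spaces, SentencePiece keeps the text as one raw character stream and marks word boundaries with the U+2581 "lower one eighth block" symbol (▁). This shows only the whitespace handling; the real library then learns subword merges over these symbols:

```python
def sp_pretokenize(text):
    """SentencePiece-style whitespace handling: spaces become an ordinary
    symbol (U+2581) instead of a split point, so any script works the same."""
    marked = "\u2581" + text.replace(" ", "\u2581")
    return list(marked)

print(sp_pretokenize("new york"))
# ['▁', 'n', 'e', 'w', '▁', 'y', 'o', 'r', 'k']
```

Because the space survives as a token symbol, detokenization is lossless: joining the pieces and mapping ▁ back to a space reconstructs the exact input.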
WordPiece
BERT-style tokenization:
- Similar to BPE but uses likelihood-based merging
- Marks continuation subwords with a `##` prefix: "playing" → ["play", "##ing"]
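At inference time, WordPiece segments a word by greedy longest-match-first lookup against the vocabulary (the likelihood-based criterion applies only during training). A minimal sketch, with a hypothetical three-entry vocabulary:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece segmentation.
    Continuation pieces carry the '##' prefix; if no prefix of the
    remaining text is in the vocabulary, the whole word becomes [UNK]."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark non-initial subwords
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # shrink the match and retry
        if piece is None:
            return ["[UNK]"]                   # no valid segmentation
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed"}
print(wordpiece_tokenize("playing", vocab))   # ['play', '##ing']
```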
The Tokenization Pipeline
Input Text: "GPT-5.2 is amazing!"
↓
[1] Normalize: Handle unicode, case, whitespace
↓
[2] Pre-tokenize: Split on spaces/punctuation (optional)
↓
[3] Subword Tokenize: Apply BPE/SentencePiece
↓
[4] Vocabulary Lookup: Convert tokens to IDs
↓
Output: [38, 2898, 12, 20, 13, 17, 374, 8056, 0]
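The pipeline above can be condensed into a toy end-to-end function. The vocabulary and IDs here are made up for illustration, and the subword step [3] is skipped by assuming whole words are in the vocabulary:

```python
def tokenize_pipeline(text, vocab):
    """Toy pipeline: normalize -> pre-tokenize -> vocabulary lookup.
    A real tokenizer inserts a subword step (BPE/SentencePiece) before
    lookup; this sketch maps whole words, falling back to <UNK>."""
    normalized = text.lower().strip()                  # [1] normalize
    pieces = normalized.split()                        # [2] pre-tokenize on whitespace
    return [vocab.get(p, vocab["<UNK>"]) for p in pieces]  # [4] lookup

vocab = {"<UNK>": 0, "tokenization": 1, "is": 2, "fun": 3}
print(tokenize_pipeline("Tokenization is fun", vocab))   # [1, 2, 3]
print(tokenize_pipeline("Tokenization is hard", vocab))  # [1, 2, 0]
```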
Vocabulary Size Trade-offs
| Vocabulary Size | Pros | Cons |
|---|---|---|
| Small (8K-16K) | Compact embeddings, fast | Many tokens per word |
| Medium (32K-50K) | Balanced efficiency | Standard choice |
| Large (100K+) | Fewer tokens, better for multilingual | Large embedding matrix |
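The "large embedding matrix" cost in the table can be made concrete: the embedding layer stores one vector of size `d_model` per vocabulary entry, so its parameter count is `vocab_size × d_model`. The hidden size of 4096 below is illustrative, not tied to any specific model:

```python
def embedding_params(vocab_size, d_model):
    """Parameter count of the embedding matrix: one d_model vector per token ID."""
    return vocab_size * d_model

# Illustrative hidden size (an assumption for the sake of the example)
D_MODEL = 4096
print(embedding_params(32_000, D_MODEL))    # 131,072,000 parameters
print(embedding_params(128_000, D_MODEL))   # 524,288,000 parameters
```

Quadrupling the vocabulary quadruples the embedding (and output-projection) parameters, which is the main tension behind the trade-offs above.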
Token Economics
Tokens directly impact AI costs:
| Text | GPT-4 Tokens | Claude 4.5 Tokens |
|---|---|---|
| "Hello" | 1 | 1 |
| "Supercalifragilisticexpialidocious" | 7 | 6 |
| 100 words of English | ~130 | ~125 |
| 100 words of Chinese | ~180 | ~170 |
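Since APIs typically bill per token, the table translates directly into cost. A minimal estimator, using a hypothetical price of $3 per million input tokens (actual prices vary by provider and model):

```python
def api_cost(num_tokens, price_per_million_usd):
    """Estimate API cost in dollars for a given token count."""
    return num_tokens / 1_000_000 * price_per_million_usd

# Hypothetical rate: $3.00 per million input tokens
print(api_cost(130, 3.0))        # cost of ~100 English words
print(api_cost(1_000_000, 3.0))  # cost of one million tokens: 3.0
```

Note the practical consequence: at the rates in the table above, the same 100 words cost more in Chinese than in English simply because they tokenize into more pieces.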
Special Tokens
Models use reserved tokens for structure:
| Token | Purpose | Example |
|---|---|---|
| `<BOS>` | Beginning of sequence | Start of prompt |
| `<EOS>` | End of sequence | Generation stop |
| `<PAD>` | Padding | Batch alignment |
| `<UNK>` | Unknown token | Out-of-vocabulary words |
| `<\|im_start\|>` | Chat message start | ChatML-style chat turns |
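Two of these tokens work together whenever sequences of different lengths are batched: each sequence is wrapped in `<BOS>`/`<EOS>`, then right-padded with `<PAD>` to a common length. A sketch with hypothetical IDs (`<PAD>`=0, `<BOS>`=1, `<EOS>`=2):

```python
def pad_batch(sequences, pad_id, bos_id, eos_id):
    """Wrap each ID sequence in <BOS>/<EOS> and right-pad to equal length."""
    wrapped = [[bos_id] + seq + [eos_id] for seq in sequences]
    max_len = max(len(s) for s in wrapped)
    return [s + [pad_id] * (max_len - len(s)) for s in wrapped]

# Hypothetical special-token IDs: <PAD>=0, <BOS>=1, <EOS>=2
batch = pad_batch([[10, 11, 12], [20]], pad_id=0, bos_id=1, eos_id=2)
print(batch)   # [[1, 10, 11, 12, 2], [1, 20, 2, 0, 0]]
```

In training code the padding positions are also masked out of attention and the loss, so the model never learns from `<PAD>`.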
Tokenization Challenges
- Multilingual text: Some scripts need more tokens per concept
- Code: Programming syntax can tokenize inefficiently
- Numbers: "1000" vs. "1,000" tokenize differently
- Emojis: May require multiple tokens
- Typos: Create out-of-vocabulary tokens
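The emoji point is easy to see for byte-level tokenizers (such as byte-level BPE): a single emoji is several UTF-8 bytes, so before any merges apply it occupies several byte tokens:

```python
# A byte-level tokenizer's atomic units are UTF-8 bytes, so one emoji
# starts out as multiple tokens (merges may or may not recombine them).
emoji = "\N{THUMBS UP SIGN}"          # 👍
print(len(emoji))                     # 1 character
print(len(emoji.encode("utf-8")))     # 4 bytes -> up to 4 byte-level tokens
```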