🧠 AI & LLMs + 💻 Development beginner

Tokenization

The process of breaking text into smaller units called tokens (words, subwords, or characters) that AI models can process numerically, using algorithms like Byte Pair Encoding (BPE) or SentencePiece.

Overview

Tokenization is the fundamental first step in how AI "reads" text. It transforms human-readable text into sequences of numbers that neural networks can process. Without tokenization, language models like GPT-5.2, Claude 4.5, and Gemini 3 could not understand or generate text.

Why Tokenization Matters

Neural networks operate on numbers, not letters. Tokenization bridges this gap:

"Hello world!" → [15496, 995, 0] → Neural Network → Output
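The mapping above can be sketched with a toy greedy tokenizer. The vocabulary and IDs here are invented for illustration (borrowing the example IDs from above); a real tokenizer learns a vocabulary of tens of thousands of entries:

```python
# Toy vocabulary mapping text pieces to illustrative IDs.
toy_vocab = {"Hello": 15496, " world": 995, "!": 0}

def encode(text: str, vocab: dict) -> list[int]:
    """Greedily match the longest known vocabulary entry at each position."""
    ids = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"No token for text starting at {text[i:]!r}")
    return ids

print(encode("Hello world!", toy_vocab))  # [15496, 995, 0]
```

Note that " world" includes its leading space: modern tokenizers typically fold whitespace into the following token rather than discarding it.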

Tokenization Algorithms

Byte Pair Encoding (BPE)

The most common approach, used by GPT models:

  1. Start with characters: Split text into individual characters
  2. Find common pairs: Identify frequently occurring character pairs
  3. Merge iteratively: Replace common pairs with single tokens
  4. Build vocabulary: Repeat until reaching target vocabulary size
| Step | Text State | New Token |
|------|------------|-----------|
| 0    | l o w e r  | -         |
| 1    | lo w e r   | lo        |
| 2    | low e r    | low       |
| 3    | lower      | lower     |
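The merge loop above can be sketched in a few lines. This is a minimal single-word version for illustration; a real BPE trainer counts pair frequencies over an entire corpus, which is what breaks ties between equally common pairs:

```python
from collections import Counter

def learn_bpe(word: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges from one word (a real tokenizer uses a large corpus)."""
    tokens = list(word)  # step 0: individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])  # replace pair with one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges

print(learn_bpe("lower", 2))  # [('l', 'o'), ('lo', 'w')]
```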

SentencePiece

Google's language-agnostic tokenizer:

  • Works directly on raw text (no pre-tokenization)
  • Handles any language without language-specific rules
  • Used by Gemini 3, T5, and many multilingual models

WordPiece

BERT-style tokenization:

  • Similar to BPE but uses likelihood-based merging
  • Marks subwords with ## prefix
  • "playing" → ["play", "##ing"]
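WordPiece segmentation at inference time is a greedy longest-match-first search; a minimal sketch with a tiny hand-picked vocabulary:

```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """BERT-style greedy longest-match-first segmentation."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation subwords get the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate piece and retry
        if match is None:
            return ["[UNK]"]  # no segmentation found: fall back to unknown token
        tokens.append(match)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "run"}
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
```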

The Tokenization Pipeline

Input Text: "GPT-5.2 is amazing!"
     ↓
[1] Normalize: Handle unicode, case, whitespace
     ↓
[2] Pre-tokenize: Split on spaces/punctuation (optional)
     ↓
[3] Subword Tokenize: Apply BPE/SentencePiece
     ↓
[4] Vocabulary Lookup: Convert tokens to IDs
     ↓
Output: [38, 2898, 12, 20, 13, 17, 374, 8056, 0]
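The four stages can be sketched end to end with toy stand-ins for each step. The regex pre-tokenizer and the vocabulary here are illustrative only; a production tokenizer replaces stage [3] with a learned BPE or SentencePiece model:

```python
import re
import unicodedata

def tokenize_pipeline(text: str, vocab: dict) -> list[int]:
    # [1] Normalize: unicode normalization and whitespace cleanup
    text = unicodedata.normalize("NFC", text).strip()
    # [2] Pre-tokenize: split into words and punctuation
    words = re.findall(r"\w+|[^\w\s]", text)
    # [3] Subword tokenize: trivially keep whole words (stand-in for BPE)
    subwords = words
    # [4] Vocabulary lookup: map each piece to an ID, unknown pieces to 0
    return [vocab.get(piece, 0) for piece in subwords]

vocab = {"Hello": 15496, "world": 995, "!": 3}
print(tokenize_pipeline("Hello  world!", vocab))  # [15496, 995, 3]
```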

Vocabulary Size Trade-offs

| Vocabulary Size            | Pros                                   | Cons                   |
|----------------------------|----------------------------------------|------------------------|
| Small (8K-16K)             | Compact embeddings, fast               | Many tokens per word   |
| Medium (32K-50K, standard) | Balanced efficiency                    | Neither extreme's edge |
| Large (100K+)              | Fewer tokens, better for multilingual  | Large embedding matrix |
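The "large embedding matrix" cost is easy to quantify: the embedding table holds one model-dimension vector per vocabulary entry. The model dimension of 4096 below is a hypothetical value for illustration:

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameter count of the embedding matrix: one d_model vector per token."""
    return vocab_size * d_model

# Hypothetical model dimension of 4096:
for v in (16_000, 50_000, 100_000):
    print(f"{v:>7} tokens -> {embedding_params(v, 4096):,} embedding parameters")
```

At 100K vocabulary entries and d_model 4096, the embedding matrix alone holds about 410 million parameters.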

Token Economics

Tokens directly impact AI costs:

| Text                                 | GPT-4 Tokens | Claude 4.5 Tokens |
|--------------------------------------|--------------|-------------------|
| "Hello"                              | 1            | 1                 |
| "Supercalifragilisticexpialidocious" | 7            | 6                 |
| 100 words of English                 | ~130         | ~125              |
| 100 words of Chinese                 | ~180         | ~170              |
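Because providers bill per token, cost follows directly from the token count. A minimal sketch; the $3.00-per-million-tokens price below is hypothetical, not any provider's actual rate:

```python
def prompt_cost(num_tokens: int, price_per_million: float) -> float:
    """Cost of a prompt given a per-million-token price."""
    return num_tokens / 1_000_000 * price_per_million

# Hypothetical $3.00 per million input tokens:
print(f"${prompt_cost(130, 3.00):.6f}")  # ~100 English words
print(f"${prompt_cost(180, 3.00):.6f}")  # ~100 Chinese words
```

The same 100 words of Chinese costs roughly 40% more than English here purely because it tokenizes into more pieces.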

Special Tokens

Models use reserved tokens for structure:

| Token            | Purpose               | Example                |
|------------------|-----------------------|------------------------|
| <BOS>            | Beginning of sequence | Start of prompt        |
| <EOS>            | End of sequence       | Generation stop        |
| <PAD>            | Padding               | Batch alignment        |
| <UNK>            | Unknown token         | Out-of-vocabulary      |
| <\|im_start\|>   | Chat message start    | ChatML chat formatting |
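The <PAD> token's role in batch alignment is easy to show: sequences of different lengths are right-padded to a rectangle so they can be stacked into one tensor. The pad ID of 0 here is a common but model-specific convention:

```python
PAD_ID = 0  # hypothetical ID for the <PAD> token; varies by tokenizer

def pad_batch(sequences: list, pad_id: int = PAD_ID) -> list:
    """Right-pad token ID sequences to equal length for batching."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[15496, 995], [15496, 995, 0, 374]])
print(batch)  # [[15496, 995, 0, 0], [15496, 995, 0, 374]]
```

Models are typically told (via an attention mask) to ignore the padded positions.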

Tokenization Challenges

  1. Multilingual text: Some scripts need more tokens per concept
  2. Code: Programming syntax can tokenize inefficiently
  3. Numbers: "1000" vs "1,000" tokenize differently
  4. Emojis: May require multiple tokens
  5. Typos: Create out-of-vocabulary tokens
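Challenge 3 can be demonstrated with a toy pre-tokenizer: the comma in "1,000" splits the number into three pieces, while "1000" stays whole. Real tokenizers show the same effect for their own vocabulary-dependent reasons:

```python
import re

def pre_tokenize(text: str) -> list:
    """Toy pre-tokenizer: digit runs, word runs, and punctuation split apart."""
    return re.findall(r"\d+|\w+|[^\w\s]", text)

print(pre_tokenize("1000"))   # ['1000']
print(pre_tokenize("1,000"))  # ['1', ',', '000']
```

The two spellings of the same quantity produce different token sequences, which is one reason language models can be unreliable at digit-level arithmetic.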

Example Usage

When you type "Hello world" to GPT-5.2, the tokenizer first converts it to tokens like [15496, 995]—numerical IDs that the model can process through its neural network layers.