🧠 AI & LLMs + 💻 Development beginner

Tokenization

The process of breaking text into smaller units called tokens (words, subwords, or characters) that AI models can process numerically, using algorithms like Byte Pair Encoding (BPE) or SentencePiece.

Overview

Tokenization is the fundamental first step in how AI "reads" text. It transforms human-readable text into sequences of numbers that neural networks can process. Without tokenization, language models like GPT-5.2, Claude 4.5, and Gemini 3 could not understand or generate text.

Why Tokenization Matters

Neural networks operate on numbers, not letters. Tokenization bridges this gap:

"Hello world!" → [15496, 995, 0] → Neural Network → Output
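The mapping above can be sketched with a toy greedy tokenizer. The vocabulary and IDs here are invented for illustration (borrowing the example IDs from above); a real tokenizer learns a vocabulary of tens of thousands of entries:

```python
# Toy vocabulary mapping text pieces to illustrative IDs.
toy_vocab = {"Hello": 15496, " world": 995, "!": 0}

def encode(text: str, vocab: dict) -> list[int]:
    """Greedily match the longest known vocabulary entry at each position."""
    ids = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"No token for text starting at {text[i:]!r}")
    return ids

print(encode("Hello world!", toy_vocab))  # [15496, 995, 0]
```

Note that " world" includes its leading space: modern tokenizers typically fold whitespace into the following token rather than discarding it.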

Tokenization Algorithms

Byte Pair Encoding (BPE)

The most common approach, used by GPT models:

  1. Start with characters: Split text into individual characters
  2. Find common pairs: Identify frequently occurring character pairs
  3. Merge iteratively: Replace common pairs with single tokens
  4. Build vocabulary: Repeat until reaching target vocabulary size
| Step | Text State | New Token |
|------|------------|-----------|
| 0    | l o w e r  | -         |
| 1    | lo w e r   | lo        |
| 2    | low e r    | low       |
| 3    | lower      | lower     |
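The merge loop above can be sketched in a few lines. This is a minimal single-word version for illustration; a real BPE trainer counts pair frequencies over an entire corpus, which is what breaks ties between equally common pairs:

```python
from collections import Counter

def learn_bpe(word: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges from one word (a real tokenizer uses a large corpus)."""
    tokens = list(word)  # step 0: individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])  # replace pair with one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges

print(learn_bpe("lower", 2))  # [('l', 'o'), ('lo', 'w')]
```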

SentencePiece

Google's language-agnostic tokenizer:

  • Works directly on raw text (no pre-tokenization)
  • Handles any language without language-specific rules
  • Used by Gemini 3, T5, and many multilingual models

WordPiece

BERT-style tokenization:

  • Similar to BPE but uses likelihood-based merging
  • Marks subwords with ## prefix
  • "playing" → ["play", "##ing"]
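WordPiece segmentation at inference time is a greedy longest-match-first search; a minimal sketch with a tiny hand-picked vocabulary:

```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """BERT-style greedy longest-match-first segmentation."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation subwords get the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate piece and retry
        if match is None:
            return ["[UNK]"]  # no segmentation found: fall back to unknown token
        tokens.append(match)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "run"}
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
```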

The Tokenization Pipeline

Input Text: "GPT-5.2 is amazing!"
     ↓
[1] Normalize: Handle unicode, case, whitespace
     ↓
[2] Pre-tokenize: Split on spaces/punctuation (optional)
     ↓
[3] Subword Tokenize: Apply BPE/SentencePiece
     ↓
[4] Vocabulary Lookup: Convert tokens to IDs
     ↓
Output: [38, 2898, 12, 20, 13, 17, 374, 8056, 0]
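The four stages can be sketched end to end with toy stand-ins for each step. The regex pre-tokenizer and the vocabulary here are illustrative only; a production tokenizer replaces stage [3] with a learned BPE or SentencePiece model:

```python
import re
import unicodedata

def tokenize_pipeline(text: str, vocab: dict) -> list[int]:
    # [1] Normalize: unicode normalization and whitespace cleanup
    text = unicodedata.normalize("NFC", text).strip()
    # [2] Pre-tokenize: split into words and punctuation
    words = re.findall(r"\w+|[^\w\s]", text)
    # [3] Subword tokenize: trivially keep whole words (stand-in for BPE)
    subwords = words
    # [4] Vocabulary lookup: map each piece to an ID, unknown pieces to 0
    return [vocab.get(piece, 0) for piece in subwords]

vocab = {"Hello": 15496, "world": 995, "!": 3}
print(tokenize_pipeline("Hello  world!", vocab))  # [15496, 995, 3]
```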

Vocabulary Size Trade-offs

| Vocabulary Size            | Pros                                   | Cons                   |
|----------------------------|----------------------------------------|------------------------|
| Small (8K-16K)             | Compact embeddings, fast               | Many tokens per word   |
| Medium (32K-50K, standard) | Balanced efficiency                    | Neither extreme's edge |
| Large (100K+)              | Fewer tokens, better for multilingual  | Large embedding matrix |
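The "large embedding matrix" cost is easy to quantify: the embedding table holds one model-dimension vector per vocabulary entry. The model dimension of 4096 below is a hypothetical value for illustration:

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameter count of the embedding matrix: one d_model vector per token."""
    return vocab_size * d_model

# Hypothetical model dimension of 4096:
for v in (16_000, 50_000, 100_000):
    print(f"{v:>7} tokens -> {embedding_params(v, 4096):,} embedding parameters")
```

At 100K vocabulary entries and d_model 4096, the embedding matrix alone holds about 410 million parameters.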

Token Economics

Tokens directly impact AI costs:

| Text                                 | GPT-4 Tokens | Claude 4.5 Tokens |
|--------------------------------------|--------------|-------------------|
| "Hello"                              | 1            | 1                 |
| "Supercalifragilisticexpialidocious" | 7            | 6                 |
| 100 words of English                 | ~130         | ~125              |
| 100 words of Chinese                 | ~180         | ~170              |
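Because providers bill per token, cost follows directly from the token count. A minimal sketch; the $3.00-per-million-tokens price below is hypothetical, not any provider's actual rate:

```python
def prompt_cost(num_tokens: int, price_per_million: float) -> float:
    """Cost of a prompt given a per-million-token price."""
    return num_tokens / 1_000_000 * price_per_million

# Hypothetical $3.00 per million input tokens:
print(f"${prompt_cost(130, 3.00):.6f}")  # ~100 English words
print(f"${prompt_cost(180, 3.00):.6f}")  # ~100 Chinese words
```

The same 100 words of Chinese costs roughly 40% more than English here purely because it tokenizes into more pieces.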

Special Tokens

Models use reserved tokens for structure:

| Token            | Purpose               | Example                |
|------------------|-----------------------|------------------------|
| <BOS>            | Beginning of sequence | Start of prompt        |
| <EOS>            | End of sequence       | Generation stop        |
| <PAD>            | Padding               | Batch alignment        |
| <UNK>            | Unknown token         | Out-of-vocabulary      |
| <\|im_start\|>   | Chat message start    | ChatML chat formatting |
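The <PAD> token's role in batch alignment is easy to show: sequences of different lengths are right-padded to a rectangle so they can be stacked into one tensor. The pad ID of 0 here is a common but model-specific convention:

```python
PAD_ID = 0  # hypothetical ID for the <PAD> token; varies by tokenizer

def pad_batch(sequences: list, pad_id: int = PAD_ID) -> list:
    """Right-pad token ID sequences to equal length for batching."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[15496, 995], [15496, 995, 0, 374]])
print(batch)  # [[15496, 995, 0, 0], [15496, 995, 0, 374]]
```

Models are typically told (via an attention mask) to ignore the padded positions.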

Tokenization Challenges

  1. Multilingual text: Some scripts need more tokens per concept
  2. Code: Programming syntax can tokenize inefficiently
  3. Numbers: "1000" vs "1,000" tokenize differently
  4. Emojis: May require multiple tokens
  5. Typos: Create out-of-vocabulary tokens
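Challenge 3 can be demonstrated with a toy pre-tokenizer: the comma in "1,000" splits the number into three pieces, while "1000" stays whole. Real tokenizers show the same effect for their own vocabulary-dependent reasons:

```python
import re

def pre_tokenize(text: str) -> list:
    """Toy pre-tokenizer: digit runs, word runs, and punctuation split apart."""
    return re.findall(r"\d+|\w+|[^\w\s]", text)

print(pre_tokenize("1000"))   # ['1000']
print(pre_tokenize("1,000"))  # ['1', ',', '000']
```

The two spellings of the same quantity produce different token sequences, which is one reason language models can be unreliable at digit-level arithmetic.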

Example Usage

When you type "Hello world" to GPT-5.2, the tokenizer first converts it to tokens like [15496, 995]—numerical IDs that the model can process through its neural network layers.