🧠 AI & LLMs beginner

Token

The basic unit of text that LLMs process - typically a word, subword, or character.

Tokens are how LLMs see text: tokenization breaks text into pieces the model can process. Common tokenizers (like BPE, Byte Pair Encoding) build vocabularies of roughly 30K-100K tokens spanning whole words, subwords, and individual characters. "Hello world" might be 2 tokens, while a rarer word like "tokenization" might split into 3+ tokens. Token counts matter because they determine how much of the context window you use, API pricing is per token, and generation speed is measured in tokens per second. Different models use different tokenizers, so the same text may have a different token count depending on the model. Understanding tokenization helps you optimize prompts and estimate costs.
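To make the BPE idea concrete, here is a toy sketch (not any real model's tokenizer): start from individual characters and repeatedly merge the most frequent adjacent pair into a new token. Real tokenizers learn these merges once from a huge corpus and then reuse them; this simplified version learns merges from the input string itself.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def bpe_merge(tokens, pair):
    """Replace each occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Start from characters; greedily merge the most frequent pair."""
    tokens = list(text)
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        tokens = bpe_merge(tokens, most_frequent_pair(tokens))
    return tokens

# Frequent substrings collapse into single tokens after a few merges:
print(toy_bpe("aaabdaaabac", 2))  # ['aaa', 'b', 'd', 'aaa', 'b', 'a', 'c']
```

Each merge shrinks the token count for common patterns, which is why frequent words end up as one token while rare words stay split into several.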