Transformers revolutionized natural language processing when introduced in the 2017 paper "Attention Is All You Need." Unlike recurrent neural networks (RNNs), which process tokens one at a time, transformers process entire sequences in parallel using self-attention, enabling them to capture long-range dependencies efficiently. Key components include multi-head attention (letting the model focus on different parts of the input simultaneously), positional encoding (needed because self-attention has no inherent notion of sequence order), and layer normalization. The architecture consists of encoder and decoder stacks, though many modern models use only one: BERT is encoder-only, GPT is decoder-only. Transformers enabled the scaling laws that led to today's powerful LLMs.
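A minimal sketch of the scaled dot-product self-attention at the heart of the architecture, written with NumPy. The weight-matrix names (Wq, Wk, Wv) and the dimensions are illustrative assumptions, not taken from any particular model, and real implementations add multiple heads, masking, and learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project the input sequence into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every position attends to every other position in parallel;
    # scaling by sqrt(d_k) keeps the dot products in a stable range.
    scores = Q @ K.T / np.sqrt(d_k)      # shape: (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                  # toy sizes for illustration
X = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                         # (4, 8)
```

Because the attention scores are computed for all position pairs at once, the whole sequence is processed in a few matrix multiplications rather than a step-by-step recurrence, which is what makes the architecture so parallelizable.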
🧠 AI & LLMs · Intermediate
Transformer
Neural network architecture using self-attention mechanisms, the foundation of modern LLMs like GPT and Claude.
Related Terms
Token
The basic unit of text that LLMs process: typically a word, subword, or character.
LLM (Large Language Model)
AI models trained on massive text datasets to understand and generate human-like text.
Context Window
The maximum amount of text (measured in tokens) that an LLM can process in a single interaction.
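Because the context window caps how many tokens fit in a single interaction, applications commonly trim older conversation turns to stay within the budget. A minimal sketch of that trimming logic, assuming a hypothetical count_tokens callback in place of a real tokenizer (such as the model's own tokenizer library):

```python
def trim_to_context_window(messages, max_tokens, count_tokens):
    # Walk the history newest-first, keeping messages until the budget is spent.
    kept, total = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order

# Crude stand-in tokenizer: one token per whitespace-separated word.
def naive_count(text):
    return len(text.split())

history = [
    "hello there",
    "how are you today",
    "fine thanks",
    "tell me about transformers",
]
print(trim_to_context_window(history, max_tokens=8, count_tokens=naive_count))
# ['fine thanks', 'tell me about transformers']
```

Dropping the oldest turns first is the simplest policy; real systems often also preserve a system prompt or summarize the discarded history instead of deleting it outright.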