Overview
The Prompt Processing Pipeline describes the end-to-end journey of a user's text query through a modern AI system. Understanding this pipeline reveals how systems like GPT-5.2, Claude 4.5, and Gemini 3 transform human language into intelligent responses.
The Complete Journey
User Input: "What is quantum computing?"
↓
┌─────────────────────────────────────────┐
│ 1. TOKENIZATION │
│ Text → Token IDs │
│ [2061, 374, 31228, 25213, 30] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 2. EMBEDDING LOOKUP │
│ Token IDs → Dense Vectors │
│ Each token → 4096-dim vector │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 3. POSITIONAL ENCODING │
│ Add position information │
│ Token 1 at pos 0, Token 2 at pos 1 │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 4. TRANSFORMER LAYERS (×96+) │
│ Self-Attention → FFN → Normalize │
│ Each layer refines understanding │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 5. OUTPUT PROJECTION │
│ Final hidden state → Vocabulary │
│ Predict next token probabilities │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 6. SAMPLING & GENERATION │
│ Select token → Append → Repeat │
│ Until <EOS> or max length │
└─────────────────────────────────────────┘
↓
Response: "Quantum computing is a type of..."
Stage 1: Tokenization
The prompt is broken into processable units:
| Input | Token | Token ID |
|---|---|---|
| "What" | What | 2061 |
| " is" | is | 374 |
| " quantum" | quantum | 31228 |
| " computing" | computing | 25213 |
| "?" | ? | 30 |
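As a toy illustration of the mapping above, a greedy longest-match tokenizer over a tiny fixed vocabulary might look like the sketch below. The IDs come from the table, not from any production tokenizer; real systems learn a BPE or SentencePiece vocabulary from data.

```python
# Illustrative only: a hand-written 5-entry vocabulary using the IDs
# from the table above. Real tokenizers have 50K-200K learned entries.
TOY_VOCAB = {"What": 2061, " is": 374, " quantum": 31228,
             " computing": 25213, "?": 30}

def tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    ids = []
    while text:
        # Pick the longest vocabulary entry that prefixes the remaining text.
        match = max((t for t in vocab if text.startswith(t)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"untokenizable text: {text!r}")
        ids.append(vocab[match])
        text = text[len(match):]
    return ids

print(tokenize("What is quantum computing?", TOY_VOCAB))
# [2061, 374, 31228, 25213, 30]
```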
Stage 2: Embedding Lookup
Each token ID maps to a learned vector:
- Vocabulary size: 50K-200K tokens
- Embedding dimension: 4096-16384
- Total parameters: ~800M just for embeddings (e.g., 200K vocabulary × 4096 dimensions ≈ 819M)
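A scaled-down sketch of the lookup itself: embedding lookup is nothing more than row indexing into a big matrix. The sizes here are shrunk for illustration; the final line computes the parameter count at the full scale quoted above.

```python
import numpy as np

# Scaled-down illustration: real models use ~200K vocab x ~4096 dims.
VOCAB_SIZE, D_MODEL = 50_000, 64
rng = np.random.default_rng(0)
table = rng.standard_normal((VOCAB_SIZE, D_MODEL)).astype(np.float32)

token_ids = [2061, 374, 31228, 25213, 30]
vectors = table[token_ids]   # the "lookup" is just row indexing
print(vectors.shape)         # (5, 64) -- one dense vector per token

# At full scale, the table alone holds ~800M parameters:
print(200_000 * 4096)        # 819200000
```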
Stage 3: Positional Encoding
Transformers have no inherent sense of token order. Position is injected via one of:
- Sinusoidal encoding: Fixed mathematical patterns
- Learned positions: Trained position embeddings
- RoPE: Rotary Position Embedding (used by Llama, GPT-5.2)
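A minimal sketch of the sinusoidal variant, following the fixed sin/cos pattern from the original Transformer paper (toy dimensions for illustration):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal position encodings (sin on even dims, cos on odd)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

pe = sinusoidal_positions(5, 64)
print(pe.shape)   # (5, 64) -- added elementwise to the token embeddings
```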
Stage 4: Transformer Layers
The core computation, repeated 96-128 times:
Self-Attention
Q = input × W_Q (Query: "What am I looking for?")
K = input × W_K (Key: "What do I contain?")
V = input × W_V (Value: "What information do I have?")
Attention = softmax(Q × K^T / √d) × V
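The formula above can be sketched directly in NumPy. This is a single attention head with toy dimensions and random weights, purely to show the shapes and the softmax-over-keys step:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention = softmax(Q K^T / sqrt(d)) V, as in the formula above."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (seq, seq) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted mix of values

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))                 # 5 tokens, toy 16-dim states
W_Q, W_K, W_V = (rng.standard_normal((16, 16)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_Q, x @ W_K, x @ W_V)
print(out.shape)   # (5, 16) -- one updated vector per token
```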
Feed-Forward Network (FFN)
In Mixture of Experts models:
- Router selects 2-8 experts from ~64 available
- Experts are specialized FFN blocks
- Sparse activation keeps compute manageable
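A minimal sketch of the top-k routing step under these assumptions (random router weights, toy dimensions): real routers also add load-balancing losses and per-expert capacity limits, which are omitted here.

```python
import numpy as np

def route_tokens(hidden, router_w, top_k=2):
    """Pick the top-k experts per token and softmax their gate weights."""
    logits = hidden @ router_w                     # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of chosen experts
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)     # mixing weights sum to 1
    return top, gates

rng = np.random.default_rng(0)
hidden = rng.standard_normal((5, 16))     # 5 tokens, toy hidden size
router_w = rng.standard_normal((16, 64))  # 64 experts, as in the text
experts, gates = route_tokens(hidden, router_w)
print(experts.shape, gates.shape)  # (5, 2) (5, 2)
```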
Layer Normalization
Stabilizes training and inference at each layer.
Stage 5: Output Projection
The final hidden state (4096-dim) is projected to vocabulary size:
- Logits: Raw, unnormalized scores for every token in the vocabulary
- Temperature: Divides the logits before softmax; lower values make sampling more deterministic
- Top-p/Top-k: Filter out low-probability tokens before sampling
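A sketch of how temperature scaling and top-k filtering might be applied to the logits before sampling (toy vocabulary size; top-p works analogously by truncating the cumulative distribution instead of taking a fixed count):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
    """Temperature-scale the logits, keep the top-k, sample from the rest."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / max(temperature, 1e-6)   # lower temp -> sharper dist
    kth = np.sort(scaled)[-top_k]              # k-th largest score
    scaled = np.where(scaled < kth, -np.inf, scaled)  # drop everything else
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = rng.standard_normal(1000)   # toy vocabulary of 1000 tokens
token = sample_next_token(logits, rng=rng)
print(0 <= token < 1000)   # True
```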
Stage 6: Autoregressive Generation
Models generate one token at a time:
- Process full prompt
- Predict next token
- Append to sequence
- Repeat until done
This is why responses stream token-by-token.
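The four steps above can be sketched as a loop. `fake_model` is a hypothetical stand-in for a real transformer (not a real API), and the EOS token id of 4 is an assumption for the demo:

```python
import numpy as np

EOS = 4  # assumed end-of-sequence id for this toy example

def fake_model(ids):
    """Stand-in for a real model: deterministically favors token len(ids) % 5."""
    logits = np.full(5, -10.0)
    logits[len(ids) % 5] = 10.0
    return logits

def generate(prompt_ids, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = fake_model(ids)           # 1. process sequence, get scores
        next_id = int(np.argmax(logits))   # 2. predict next token (greedy)
        ids.append(next_id)                # 3. append to the sequence
        if next_id == EOS:                 # 4. repeat until done
            break
    return ids

print(generate([2061, 374, 31228, 25213, 30]))
# [2061, 374, 31228, 25213, 30, 0, 1, 2, 3, 4]
```

Each pass through the loop yields exactly one token, which is why responses stream out token by token.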
Pipeline Optimizations
| Technique | Purpose | Impact |
|---|---|---|
| KV Cache | Store key/value states | 10-100x faster generation |
| Flash Attention | Memory-efficient attention | Longer contexts possible |
| Speculative Decoding | Predict multiple tokens | 2-3x faster |
| Continuous Batching | Dynamic request handling | Higher throughput |
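A back-of-envelope model of the KV cache saving, counting "token passes" as a rough proxy for attention compute (real speedups depend on hardware, batch size, and sequence length):

```python
def decode_cost_without_cache(prompt_len, new_tokens):
    """Without a cache, each decode step reprocesses the whole sequence."""
    return sum(prompt_len + t for t in range(1, new_tokens + 1))

def decode_cost_with_cache(prompt_len, new_tokens):
    """With a cache, each step processes only the single new token."""
    return new_tokens

print(decode_cost_without_cache(100, 100))  # 15050 token-passes
print(decode_cost_with_cache(100, 100))     # 100 token-passes, ~150x fewer
```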
Latency Breakdown
For a typical GPT-5.2 API call:
| Stage | Time | Notes |
|---|---|---|
| Tokenization | <1ms | CPU-based |
| Prefill (prompt) | 50-200ms | Processes all input tokens |
| Per-token decode | 10-30ms | Each output token |
| Detokenization | <1ms | Tokens back to text |
Total for 100-token response: ~1-3 seconds
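A quick sanity check of that total, using the ranges from the table:

```python
# Back-of-envelope: prefill plus 100 decode steps (illustrative ranges only).
prefill_ms = (50, 200)
per_token_ms = (10, 30)
n_tokens = 100

low = prefill_ms[0] + per_token_ms[0] * n_tokens    # 1050 ms
high = prefill_ms[1] + per_token_ms[1] * n_tokens   # 3200 ms
print(low / 1000, "to", high / 1000, "seconds")     # ~1 to ~3 seconds
```

Per-token decode dominates, which is why shorter responses and faster decoding (e.g., speculative decoding) have the biggest effect on perceived latency.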