Overview
The Prompt Processing Pipeline describes the end-to-end journey of a user's text query through a modern AI system. Understanding this pipeline reveals how systems like GPT-5.2, Claude 4.5, and Gemini 3 transform human language into intelligent responses.
The Complete Journey
User Input: "What is quantum computing?"
                    ↓
┌───────────────────────────────────────────┐
│            1. TOKENIZATION                │
│            Text → Token IDs               │
│       [2061, 374, 31228, 25213, 30]       │
└───────────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────────┐
│           2. EMBEDDING LOOKUP             │
│         Token IDs → Dense Vectors         │
│       Each token → 4096-dim vector        │
└───────────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────────┐
│          3. POSITIONAL ENCODING           │
│         Add position information          │
│     Token 1 at pos 0, Token 2 at pos 1    │
└───────────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────────┐
│       4. TRANSFORMER LAYERS (×96+)        │
│      Self-Attention → FFN → Normalize     │
│      Each layer refines understanding     │
└───────────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────────┐
│           5. OUTPUT PROJECTION            │
│      Final hidden state → Vocabulary      │
│     Predict next token probabilities      │
└───────────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────────┐
│         6. SAMPLING & GENERATION          │
│       Select token → Append → Repeat      │
│        Until <EOS> or max length          │
└───────────────────────────────────────────┘
                    ↓
Response: "Quantum computing is a type of..."
Stage 1: Tokenization
The prompt is broken into processable units:
| Input | Token | Token ID |
|---|---|---|
| "What" | What | 2061 |
| " is" | is | 374 |
| " quantum" | quantum | 31228 |
| " computing" | computing | 25213 |
| "?" | ? | 30 |
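A minimal sketch of the lookup step, using a hypothetical hand-built vocabulary that mirrors the table above (real tokenizers learn a BPE or SentencePiece vocabulary from data rather than hard-coding one):

```python
# Toy greedy longest-match tokenizer. The vocabulary and IDs below simply
# mirror the example table; real systems learn ~50K-200K subword pieces.
VOCAB = {"What": 2061, " is": 374, " quantum": 31228,
         " computing": 25213, "?": 30}

def tokenize(text: str) -> list[int]:
    ids, i = [], 0
    while i < len(text):
        # Greedily take the longest vocabulary piece starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids

print(tokenize("What is quantum computing?"))
# [2061, 374, 31228, 25213, 30]
```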
Stage 2: Embedding Lookup
Each token ID maps to a learned vector:
- Vocabulary size: 50K-200K tokens
- Embedding dimension: 4096-16384
- Total parameters: ~800M just for embeddings
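The lookup itself is just row indexing into a learned matrix. A scaled-down sketch (the toy sizes here are illustrative; at a 200K vocabulary and 4096 dimensions the matrix alone is roughly the ~800M parameters quoted above):

```python
import numpy as np

# Each token ID selects one row of a (vocab_size, d_model) weight matrix.
vocab_size, d_model = 1000, 64              # toy sizes for illustration
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model)).astype(np.float32)

token_ids = [17, 42, 99]
vectors = embedding_table[token_ids]        # shape: (3, 64), one row per token
print(vectors.shape)
```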
Stage 3: Positional Encoding
Self-attention is permutation-invariant, so transformers have no inherent sense of token order. Position is injected via:
- Sinusoidal encoding: Fixed mathematical patterns
- Learned positions: Trained position embeddings
- RoPE: Rotary Position Embedding (used by Llama, GPT-5.2)
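The first option above, sinusoidal encoding, can be sketched directly from its defining formulas (PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos with the same angle):

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    # Fixed sinusoidal pattern: each position gets a unique vector, and
    # nearby positions get similar vectors. Even dims use sin, odd use cos.
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(5, 8)
# Position 0 encodes as sin(0)=0 on even dims and cos(0)=1 on odd dims.
```

RoPE works differently: instead of adding a position vector, it rotates the query/key vectors by a position-dependent angle inside attention.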
Stage 4: Transformer Layers
The core computation, repeated 96-128 times:
Self-Attention
Q = input × W_Q (Query: "What am I looking for?")
K = input × W_K (Key: "What do I contain?")
V = input × W_V (Value: "What information do I have?")
Attention = softmax(Q × K^T / √d) × V
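The formula above, as a single-head sketch in numpy (real models run many heads in parallel and add masking, but the core computation is this):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_Q, W_K, W_V):
    # Scaled dot-product attention, matching the formula above.
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (seq, seq): query-key similarity
    return softmax(scores, axis=-1) @ V  # weighted sum of values

rng = np.random.default_rng(0)
seq, d_model = 4, 8
x = rng.normal(size=(seq, d_model))
W = [rng.normal(size=(d_model, d_model)) for _ in range(3)]
out = self_attention(x, *W)              # shape: (4, 8)
```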
Feed-Forward Network (FFN)
In Mixture of Experts models:
- Router selects 2-8 experts from ~64 available
- Experts are specialized FFN blocks
- Sparse activation keeps compute manageable
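The routing step can be sketched as a linear scorer followed by a top-k selection (sizes here follow the ranges quoted above; production routers add load-balancing losses and capacity limits):

```python
import numpy as np

def route_top_k(hidden, router_weights, k=2):
    # Score all experts for this token, keep the k best, and softmax their
    # scores into mixing weights. Only the selected experts' FFNs run.
    logits = hidden @ router_weights           # (num_experts,)
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()                    # experts to run + weights

rng = np.random.default_rng(0)
hidden = rng.normal(size=16)                   # toy hidden state
router = rng.normal(size=(16, 64))             # 64 experts available
experts, weights = route_top_k(hidden, router, k=2)
```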
Layer Normalization
Stabilizes training and inference at each layer.
Stage 5: Output Projection
The final hidden state (4096-dim) is projected to vocabulary size:
- Logits: Raw scores for each possible next token
- Temperature: Controls randomness (lower = more deterministic)
- Top-p/Top-k: Filter low-probability tokens
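Temperature and top-p combine into one sampling step, sketched here (a minimal version; real implementations also support top-k cutoffs and repetition penalties):

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=None):
    # Temperature rescales logits (lower = sharper distribution). Nucleus
    # (top-p) sampling then keeps the smallest set of tokens whose
    # cumulative probability reaches top_p and samples among them.
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                   # most likely first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]                             # the "nucleus"
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

logits = np.array([2.0, 1.0, 0.1, -1.0])
token = sample_next_token(logits, rng=np.random.default_rng(0))
# With this distribution and top_p=0.9, only tokens 0 and 1 survive filtering.
```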
Stage 6: Autoregressive Generation
Models generate one token at a time:
- Process full prompt
- Predict next token
- Append to sequence
- Repeat until done
This is why responses stream token-by-token.
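The four steps above form a simple loop. Sketched with a hypothetical `next_token` stand-in (in a real system this is a full transformer forward pass over the sequence so far):

```python
EOS = -1   # end-of-sequence marker (stand-in for a real <EOS> token ID)

def next_token(sequence: list[int]) -> int:
    # Toy stand-in for the model: counts down from the last token, then stops.
    last = sequence[-1]
    return last - 1 if last > 0 else EOS

def generate(prompt: list[int], max_new_tokens: int = 16) -> list[int]:
    sequence = list(prompt)                # 1. process full prompt
    for _ in range(max_new_tokens):        # 4. repeat until done
        token = next_token(sequence)       # 2. predict next token
        if token == EOS:
            break
        sequence.append(token)             # 3. append to sequence
    return sequence

print(generate([3]))   # [3, 2, 1, 0]
```

Because each new token is available as soon as its loop iteration finishes, an API can emit it immediately rather than waiting for the whole response.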
Pipeline Optimizations
| Technique | Purpose | Impact |
|---|---|---|
| KV Cache | Store key/value states | 10-100x faster generation |
| Flash Attention | Memory-efficient attention | Longer contexts possible |
| Speculative Decoding | Predict multiple tokens | 2-3x faster |
| Continuous Batching | Dynamic request handling | Higher throughput |
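The KV cache deserves a closer look, since it is what makes the decode loop affordable: without it, every step would recompute keys and values for the entire sequence. A toy sketch of the cached version:

```python
import numpy as np

# With a KV cache, each decode step projects K/V only for the newest token
# and appends it; attention then reads the full cached matrix. Without the
# cache, step t would redo t projections (quadratic total work).
d_model = 8
rng = np.random.default_rng(0)
W_K = rng.normal(size=(d_model, d_model))

k_cache = []                               # grows by one row per token

def decode_step(new_token_vec):
    k_cache.append(new_token_vec @ W_K)    # project only the new token
    return np.stack(k_cache)               # full K matrix for attention

for _ in range(5):
    K = decode_step(rng.normal(size=d_model))
print(K.shape)   # (5, 8): five cached keys, none recomputed
```

The same structure holds for values; the cache's memory footprint (layers × heads × sequence length) is why long contexts are expensive to serve.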
Latency Breakdown
For a typical GPT-5.2 API call:
| Stage | Time | Notes |
|---|---|---|
| Tokenization | <1ms | CPU-based |
| Prefill (prompt) | 50-200ms | Processes all input tokens |
| Per-token decode | 10-30ms | Each output token |
| Detokenization | <1ms | Tokens back to text |
Total for 100-token response: ~1-3 seconds
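Plugging the table's midpoints into the obvious sum reproduces that total:

```python
# One prefill pass plus one decode step per output token dominates latency;
# tokenization and detokenization are negligible at <1ms each.
prefill_ms = 125        # midpoint of the 50-200ms range above
per_token_ms = 20       # midpoint of the 10-30ms range above
tokens = 100
total_s = (prefill_ms + tokens * per_token_ms) / 1000
print(total_s)          # 2.125 s — inside the ~1-3 s range
```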