Overview
The Prompt Processing Pipeline describes the end-to-end journey of a user's text query through a modern AI system. Understanding this pipeline reveals how systems like GPT-5.2, Claude 4.5, and Gemini 3 transform human language into intelligent responses.
The Complete Journey
User Input: "What is quantum computing?"
↓
┌─────────────────────────────────────────┐
│ 1. TOKENIZATION │
│ Text → Token IDs │
│ [2061, 374, 31228, 25213, 30] │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 2. EMBEDDING LOOKUP │
│ Token IDs → Dense Vectors │
│ Each token → 4096-dim vector │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 3. POSITIONAL ENCODING │
│ Add position information │
│ Token 1 at pos 0, Token 2 at pos 1 │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 4. TRANSFORMER LAYERS (×96+) │
│ Self-Attention → FFN → Normalize │
│ Each layer refines understanding │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 5. OUTPUT PROJECTION │
│ Final hidden state → Vocabulary │
│ Predict next token probabilities │
└─────────────────────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ 6. SAMPLING & GENERATION │
│ Select token → Append → Repeat │
│ Until <EOS> or max length │
└─────────────────────────────────────────┘
↓
Response: "Quantum computing is a type of..."
Stage 1: Tokenization
The prompt is broken into processable units:
| Input | Token | Token ID |
|---|---|---|
| "What" | What | 2061 |
| " is" | is | 374 |
| " quantum" | quantum | 31228 |
| " computing" | computing | 25213 |
| "?" | ? | 30 |
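As a toy illustration of the mapping above, a greedy longest-match tokenizer over a tiny fixed vocabulary might look like the sketch below. The IDs come from the table, not from any production tokenizer; real systems learn a BPE or SentencePiece vocabulary from data.

```python
# Illustrative only: a hand-written 5-entry vocabulary using the IDs
# from the table above. Real tokenizers have 50K-200K learned entries.
TOY_VOCAB = {"What": 2061, " is": 374, " quantum": 31228,
             " computing": 25213, "?": 30}

def tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    ids = []
    while text:
        # Pick the longest vocabulary entry that prefixes the remaining text.
        match = max((t for t in vocab if text.startswith(t)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"untokenizable text: {text!r}")
        ids.append(vocab[match])
        text = text[len(match):]
    return ids

print(tokenize("What is quantum computing?", TOY_VOCAB))
# [2061, 374, 31228, 25213, 30]
```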
Stage 2: Embedding Lookup
Each token ID maps to a learned vector:
- Vocabulary size: 50K-200K tokens
- Embedding dimension: 4096-16384
- Total parameters: ~800M just for embeddings (e.g., 200K vocabulary × 4096 dimensions ≈ 819M)
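A scaled-down sketch of the lookup itself: embedding lookup is nothing more than row indexing into a big matrix. The sizes here are shrunk for illustration; the final line computes the parameter count at the full scale quoted above.

```python
import numpy as np

# Scaled-down illustration: real models use ~200K vocab x ~4096 dims.
VOCAB_SIZE, D_MODEL = 50_000, 64
rng = np.random.default_rng(0)
table = rng.standard_normal((VOCAB_SIZE, D_MODEL)).astype(np.float32)

token_ids = [2061, 374, 31228, 25213, 30]
vectors = table[token_ids]   # the "lookup" is just row indexing
print(vectors.shape)         # (5, 64) -- one dense vector per token

# At full scale, the table alone holds ~800M parameters:
print(200_000 * 4096)        # 819200000
```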
Stage 3: Positional Encoding
Transformers have no inherent sense of token order. Position is injected via one of:
- Sinusoidal encoding: Fixed mathematical patterns
- Learned positions: Trained position embeddings
- RoPE: Rotary Position Embedding (used by Llama, GPT-5.2)
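A minimal sketch of the sinusoidal variant, following the fixed sin/cos pattern from the original Transformer paper (toy dimensions for illustration):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal position encodings (sin on even dims, cos on odd)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

pe = sinusoidal_positions(5, 64)
print(pe.shape)   # (5, 64) -- added elementwise to the token embeddings
```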
Stage 4: Transformer Layers
The core computation, repeated 96-128 times:
Self-Attention
Q = input × W_Q (Query: "What am I looking for?")
K = input × W_K (Key: "What do I contain?")
V = input × W_V (Value: "What information do I have?")
Attention = softmax(Q × K^T / √d) × V
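The formula above can be sketched directly in NumPy. This is a single attention head with toy dimensions and random weights, purely to show the shapes and the softmax-over-keys step:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention = softmax(Q K^T / sqrt(d)) V, as in the formula above."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (seq, seq) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted mix of values

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))                 # 5 tokens, toy 16-dim states
W_Q, W_K, W_V = (rng.standard_normal((16, 16)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_Q, x @ W_K, x @ W_V)
print(out.shape)   # (5, 16) -- one updated vector per token
```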
Feed-Forward Network (FFN)
In Mixture of Experts models:
- Router selects 2-8 experts from ~64 available
- Experts are specialized FFN blocks
- Sparse activation keeps compute manageable
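A minimal sketch of the top-k routing step under these assumptions (random router weights, toy dimensions): real routers also add load-balancing losses and per-expert capacity limits, which are omitted here.

```python
import numpy as np

def route_tokens(hidden, router_w, top_k=2):
    """Pick the top-k experts per token and softmax their gate weights."""
    logits = hidden @ router_w                     # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of chosen experts
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)     # mixing weights sum to 1
    return top, gates

rng = np.random.default_rng(0)
hidden = rng.standard_normal((5, 16))     # 5 tokens, toy hidden size
router_w = rng.standard_normal((16, 64))  # 64 experts, as in the text
experts, gates = route_tokens(hidden, router_w)
print(experts.shape, gates.shape)  # (5, 2) (5, 2)
```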
Layer Normalization
Stabilizes training and inference at each layer.
Stage 5: Output Projection
The final hidden state (4096-dim) is projected to vocabulary size:
- Logits: Raw, unnormalized scores for every token in the vocabulary
- Temperature: Divides the logits before softmax; lower values make sampling more deterministic
- Top-p/Top-k: Filter out low-probability tokens before sampling
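A sketch of how temperature scaling and top-k filtering might be applied to the logits before sampling (toy vocabulary size; top-p works analogously by truncating the cumulative distribution instead of taking a fixed count):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
    """Temperature-scale the logits, keep the top-k, sample from the rest."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / max(temperature, 1e-6)   # lower temp -> sharper dist
    kth = np.sort(scaled)[-top_k]              # k-th largest score
    scaled = np.where(scaled < kth, -np.inf, scaled)  # drop everything else
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = rng.standard_normal(1000)   # toy vocabulary of 1000 tokens
token = sample_next_token(logits, rng=rng)
print(0 <= token < 1000)   # True
```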
Stage 6: Autoregressive Generation
Models generate one token at a time:
- Process full prompt
- Predict next token
- Append to sequence
- Repeat until done
This is why responses stream token-by-token.
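The four steps above can be sketched as a loop. `fake_model` is a hypothetical stand-in for a real transformer (not a real API), and the EOS token id of 4 is an assumption for the demo:

```python
import numpy as np

EOS = 4  # assumed end-of-sequence id for this toy example

def fake_model(ids):
    """Stand-in for a real model: deterministically favors token len(ids) % 5."""
    logits = np.full(5, -10.0)
    logits[len(ids) % 5] = 10.0
    return logits

def generate(prompt_ids, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = fake_model(ids)           # 1. process sequence, get scores
        next_id = int(np.argmax(logits))   # 2. predict next token (greedy)
        ids.append(next_id)                # 3. append to the sequence
        if next_id == EOS:                 # 4. repeat until done
            break
    return ids

print(generate([2061, 374, 31228, 25213, 30]))
# [2061, 374, 31228, 25213, 30, 0, 1, 2, 3, 4]
```

Each pass through the loop yields exactly one token, which is why responses stream out token by token.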
Pipeline Optimizations
| Technique | Purpose | Impact |
|---|---|---|
| KV Cache | Store key/value states | 10-100x faster generation |
| Flash Attention | Memory-efficient attention | Longer contexts possible |
| Speculative Decoding | Predict multiple tokens | 2-3x faster |
| Continuous Batching | Dynamic request handling | Higher throughput |
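A back-of-envelope model of the KV cache saving, counting "token passes" as a rough proxy for attention compute (real speedups depend on hardware, batch size, and sequence length):

```python
def decode_cost_without_cache(prompt_len, new_tokens):
    """Without a cache, each decode step reprocesses the whole sequence."""
    return sum(prompt_len + t for t in range(1, new_tokens + 1))

def decode_cost_with_cache(prompt_len, new_tokens):
    """With a cache, each step processes only the single new token."""
    return new_tokens

print(decode_cost_without_cache(100, 100))  # 15050 token-passes
print(decode_cost_with_cache(100, 100))     # 100 token-passes, ~150x fewer
```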
Latency Breakdown
For a typical GPT-5.2 API call:
| Stage | Time | Notes |
|---|---|---|
| Tokenization | <1ms | CPU-based |
| Prefill (prompt) | 50-200ms | Processes all input tokens |
| Per-token decode | 10-30ms | Each output token |
| Detokenization | <1ms | Tokens back to text |
Total for 100-token response: ~1-3 seconds
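A quick sanity check of that total, using the ranges from the table:

```python
# Back-of-envelope: prefill plus 100 decode steps (illustrative ranges only).
prefill_ms = (50, 200)
per_token_ms = (10, 30)
n_tokens = 100

low = prefill_ms[0] + per_token_ms[0] * n_tokens    # 1050 ms
high = prefill_ms[1] + per_token_ms[1] * n_tokens   # 3200 ms
print(low / 1000, "to", high / 1000, "seconds")     # ~1 to ~3 seconds
```

Per-token decode dominates, which is why shorter responses and faster decoding (e.g., speculative decoding) have the biggest effect on perceived latency.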