
Prompt Processing Pipeline

The complete computational journey of a user prompt through an AI system: from text tokenization and embedding lookup, through transformer attention layers and expert routing, to final token prediction and response generation.

Overview

The Prompt Processing Pipeline describes the end-to-end journey of a user's text query through a modern AI system. Understanding this pipeline reveals how systems like GPT-5.2, Claude 4.5, and Gemini 3 transform human language into intelligent responses.

The Complete Journey

User Input: "What is quantum computing?"
     ↓
┌─────────────────────────────────────────┐
│  1. TOKENIZATION                        │
│     Text → Token IDs                    │
│     [2061, 374, 31228, 25213, 30]       │
└─────────────────────────────────────────┘
     ↓
┌─────────────────────────────────────────┐
│  2. EMBEDDING LOOKUP                    │
│     Token IDs → Dense Vectors           │
│     Each token → 4096-dim vector        │
└─────────────────────────────────────────┘
     ↓
┌─────────────────────────────────────────┐
│  3. POSITIONAL ENCODING                 │
│     Add position information            │
│     Token 1 at pos 0, Token 2 at pos 1  │
└─────────────────────────────────────────┘
     ↓
┌─────────────────────────────────────────┐
│  4. TRANSFORMER LAYERS (×96+)           │
│     Self-Attention → FFN → Normalize    │
│     Each layer refines understanding    │
└─────────────────────────────────────────┘
     ↓
┌─────────────────────────────────────────┐
│  5. OUTPUT PROJECTION                   │
│     Final hidden state → Vocabulary     │
│     Predict next token probabilities    │
└─────────────────────────────────────────┘
     ↓
┌─────────────────────────────────────────┐
│  6. SAMPLING & GENERATION               │
│     Select token → Append → Repeat      │
│     Until <EOS> or max length           │
└─────────────────────────────────────────┘
     ↓
Response: "Quantum computing is a type of..."

Stage 1: Tokenization

The prompt is broken into processable units:

Input          Token        Token ID
"What"         What         2061
" is"          is           374
" quantum"     quantum      31228
" computing"   computing    25213
"?"            ?            30

Stage 2: Embedding Lookup

Each token ID maps to a learned vector:

  • Vocabulary size: 50K-200K tokens
  • Embedding dimension: 4096-16384
  • Total parameters: ~800M just for embeddings
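Mechanically, the lookup is just row indexing into a big learned matrix. A minimal sketch (with random weights standing in for trained ones, and a truncated table to keep the demo small):

```python
import random

vocab_size, d_model = 200_000, 4096   # upper end of the ranges above

# The real embedding table is a (vocab_size x d_model) matrix of learned
# weights. Here we fill only 200 rows with random values to show the mechanics.
random.seed(0)
embedding_table = [[random.gauss(0, 0.02) for _ in range(d_model)]
                   for _ in range(200)]

def embed(token_ids):
    """Embedding lookup: each token ID selects one row of the table."""
    return [embedding_table[i] for i in token_ids]

# Parameter count for the full table matches the ~800M figure:
print(vocab_size * d_model)  # 819200000
```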

Stage 3: Positional Encoding

Transformer attention is order-invariant: without extra signal, it has no sense of where a token sits in the sequence. Position is injected via:

  • Sinusoidal encoding: Fixed mathematical patterns
  • Learned positions: Trained position embeddings
  • RoPE: Rotary Position Embedding (used by Llama, GPT-5.2)
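RoPE's core operation can be sketched in a few lines: rotate consecutive pairs of dimensions by a position-dependent angle. Real implementations vectorize this and may pair dimensions differently; this is a minimal illustration of the math, not any particular codebase's version:

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by a position-dependent angle."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # lower dims rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

The key property: the dot product between a rotated query and a rotated key depends only on their *relative* offset, which is why RoPE generalizes well across positions.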

Stage 4: Transformer Layers

The core computation, repeated 96-128 times:

Self-Attention

Q = input × W_Q  (Query: "What am I looking for?")
K = input × W_K  (Key: "What do I contain?")
V = input × W_V  (Value: "What information do I have?")

Attention = softmax(Q × K^T / √d) × V
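The formula above translates directly into code. A minimal sketch over small list-of-list matrices (real systems use batched GPU tensor ops, not Python loops):

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)         # one probability per key position
        out.append([sum(w * row[j] for w, row in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```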

Feed-Forward Network (FFN)

In Mixture of Experts models:

  • Router selects 2-8 experts from ~64 available
  • Experts are specialized FFN blocks
  • Sparse activation keeps compute manageable
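The routing step can be sketched as top-k gating: softmax over gate logits, keep the k strongest experts, renormalize their weights. Production routers add refinements (gating noise, load-balancing losses) that are omitted here:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_logits, top_k=2):
    """Select top_k experts; renormalize their gate weights to sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# One router decision for a token, over 8 hypothetical experts:
print(route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]))
```

Only the selected experts' FFN blocks run for this token, which is the "sparse activation" that keeps compute manageable.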

Layer Normalization

Stabilizes training and inference at each layer.

Stage 5: Output Projection

The final hidden state (4096-dim) is projected to vocabulary size:

  • Logits: Raw scores for each possible next token
  • Temperature: Controls randomness (lower = more deterministic)
  • Top-p/Top-k: Filter low-probability tokens
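How temperature and top-k combine can be sketched in a few lines (top-p works the same way but truncates by cumulative probability instead of count):

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=3):
    """Temperature-scale logits, keep the top_k, sample from that set."""
    scaled = [l / temperature for l in logits]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    m = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - m) for i in top]   # unnormalized softmax
    return random.choices(top, weights=weights, k=1)[0]
```

Lowering the temperature sharpens the distribution toward the argmax token; top-k guarantees no token outside the k most likely is ever emitted.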

Stage 6: Autoregressive Generation

Models generate one token at a time:

  1. Process full prompt
  2. Predict next token
  3. Append to sequence
  4. Repeat until done

This is why responses stream token-by-token.
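The four steps above form a simple loop. In this sketch, `predict_next` is a hypothetical stub standing in for the full forward pass through stages 1-5:

```python
def generate(prompt_ids, predict_next, eos_id, max_new_tokens=16):
    """Autoregressive decoding: predict, append, repeat until EOS or limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        nxt = predict_next(ids)        # in a real system: a full forward pass
        ids.append(nxt)                # the new token becomes part of the input
        if nxt == eos_id:
            break
    return ids
```

Each iteration can yield its token to the client immediately, which is exactly the token-by-token streaming users see.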

Pipeline Optimizations

Technique             Purpose                      Impact
KV Cache              Store key/value states       10-100x faster generation
Flash Attention       Memory-efficient attention   Longer contexts possible
Speculative Decoding  Predict multiple tokens      2-3x faster
Continuous Batching   Dynamic request handling     Higher throughput
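The KV cache deserves a closer look, since it is why decoding a token is so much cheaper than prefill: keys and values for past tokens are computed once and stored, so each decode step only computes K/V for the single new token. A minimal sketch of the data structure:

```python
class KVCache:
    """Per-layer store of past key/value vectors, appended once per token."""

    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, k, v):
        """Record the new token's key/value for one layer."""
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def get(self, layer):
        """Attention for the new token runs against all cached positions."""
        return self.keys[layer], self.values[layer]
```

Without the cache, generating token N would recompute attention inputs for all N-1 previous tokens at every layer, every step.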

Latency Breakdown

For a typical GPT-5.2 API call:

Stage             Time       Notes
Tokenization      <1ms       CPU-based
Prefill (prompt)  50-200ms   Processes all input tokens
Per-token decode  10-30ms    Each output token
Detokenization    <1ms       Tokens back to text

Total for 100-token response: ~1-3 seconds
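That total follows directly from the table, since decode dominates: one prefill pass plus one decode step per generated token (tokenization and detokenization are negligible):

```python
def total_latency_ms(prefill_ms, per_token_ms, n_tokens):
    """Total = one prefill pass + one decode step per generated token."""
    return prefill_ms + per_token_ms * n_tokens

# Using the table's ranges for a 100-token response:
low  = total_latency_ms(50, 10, 100)    # 1050 ms
high = total_latency_ms(200, 30, 100)   # 3200 ms
```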

Example Usage

When you send "Explain quantum computing" to Claude 4.5, the prompt travels through tokenization (5 tokens), embedding lookup, 96+ transformer layers with MoE routing, and autoregressive generation—all in about 2 seconds for a full response.