⚙️ AI Infrastructure · Intermediate

Quantization

Technique to reduce LLM memory usage by representing model weights with lower precision numbers (e.g., 4-bit instead of 16-bit).

Quantization compresses neural network models by reducing the numerical precision of their weights, dramatically lowering memory requirements and often improving inference speed.

Why Quantization Matters

A 70B-parameter model in full precision (FP16) requires ~140GB of memory, since each weight takes two bytes. With 4-bit quantization, the same model fits in ~35GB, bringing it within reach of consumer hardware (often with some CPU offloading).
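The arithmetic above is simple enough to sketch directly. This back-of-envelope helper (a hypothetical function, not from any library) estimates weight memory only; real deployments also need room for activations and the KV cache.

```python
def model_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate: parameters x bits per weight,
    converted to gigabytes. Ignores activations and KV cache."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

print(model_memory_gb(70, 16))  # FP16: 140.0 GB
print(model_memory_gb(70, 4))   # 4-bit: 35.0 GB
```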

Quantization Levels

From highest to lowest precision (the Q-prefixed names are GGUF/llama.cpp quantization schemes):

  • FP16/BF16 (16-bit): Full precision, best quality, highest memory
  • INT8 (8-bit): Half the memory of FP16, minimal quality loss
  • Q8_0: 8-bit quantization, excellent quality
  • Q6_K: 6-bit, very good quality/size balance
  • Q5_K_M: 5-bit medium, popular sweet spot
  • Q4_K_M: 4-bit medium, good for limited VRAM
  • Q4_0: Basic 4-bit, more quality loss
  • Q2_K: 2-bit, significant quality degradation
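To see why fewer bits cost quality, here is a minimal sketch of symmetric uniform quantization (a toy illustration, not any production scheme): weights are mapped to signed integers of a given width using one scale, then mapped back, and the round-trip error grows as the bit width shrinks.

```python
def quantize_dequantize(weights, bits):
    """Symmetric uniform quantization: scale floats into signed
    integers of the given width, then map them back to floats."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    quantized = [round(w / scale) for w in weights]
    return [q * scale for q in quantized]

weights = [0.12, -0.55, 0.90, -0.03, 0.41]
for bits in (8, 4, 2):
    restored = quantize_dequantize(weights, bits)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit worst-case error: {worst:.4f}")
```

At 8 bits the error is negligible; at 2 bits most small weights collapse to zero, which is why Q2_K shows visible quality degradation.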

Quantization Methods

  • GPTQ: GPU-focused, requires calibration data
  • AWQ: Activation-aware, preserves important weights
  • GGUF/llama.cpp: CPU-friendly, various K-quant methods
  • bitsandbytes: Easy integration with Hugging Face
  • EXL2: Optimized for ExLlamaV2, variable bit rates
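As a concrete example of the bitsandbytes route, the snippet below shows the Hugging Face `BitsAndBytesConfig` path for 4-bit loading. It is a configuration sketch, not a runnable demo: it needs `transformers`, `bitsandbytes`, a CUDA GPU, and access to the (gated) model repo, and the model ID is just an illustration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, as supported by bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # quantize the quantization constants too
)

# Illustrative model ID; requires accepting the model's license on the Hub
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto",                # spread layers across available devices
)
```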

Trade-offs

Lower precision means:

  • ✅ Less VRAM/RAM required
  • ✅ Faster inference (sometimes)
  • ✅ Run larger models on consumer hardware
  • ❌ Some quality/accuracy loss
  • ❌ May affect complex reasoning tasks

The K-quant methods (Q4_K_M, Q5_K_M, Q6_K) use mixed precision, keeping the most sensitive weights at higher bit widths, and typically deliver better quality than uniform quantization at a similar file size.
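One ingredient behind these schemes is blockwise quantization: each small block of weights gets its own scale, so a single outlier only degrades its own block instead of the whole tensor. The sketch below (toy code, not the actual GGUF format, which adds super-blocks, minimum offsets, and per-tensor bit-width choices) demonstrates just that per-block-scale idea.

```python
def blockwise_quantize(weights, bits=4, block_size=32):
    """Quantize each block with its own scale so outliers are contained."""
    qmax = 2 ** (bits - 1) - 1
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / qmax or 1.0  # guard all-zero blocks
        blocks.append((scale, [round(w / scale) for w in block]))
    return blocks

def blockwise_dequantize(blocks):
    return [q * scale for scale, qs in blocks for q in qs]

import random
random.seed(0)
weights = [random.uniform(-0.1, 0.1) for _ in range(64)]
weights[50] = 8.0   # a single large outlier

# block_size = len(weights) degenerates to one scale for the whole tensor
per_tensor = blockwise_dequantize(blockwise_quantize(weights, 4, len(weights)))
per_block = blockwise_dequantize(blockwise_quantize(weights, 4, 32))

mean_err = lambda restored: sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)
print("per-tensor error:", mean_err(per_tensor))
print("per-block error: ", mean_err(per_block))
```

With one global scale, the outlier forces every small weight to round to zero; per-block scales confine that damage to the outlier's block, which is the intuition behind the quality edge of K-quants.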

Example Usage

Running a Q4_K_M quantized Llama 3 70B model (roughly 40GB as a GGUF file) on an RTX 4090 (24GB VRAM) by offloading the layers that don't fit to system RAM, where the full-precision model would require ~140GB.