Quantization compresses neural network models by reducing the numerical precision of their weights, dramatically lowering memory requirements and often improving inference speed.
Why Quantization Matters
A 70B-parameter model in full precision (FP16) needs roughly 140GB for its weights alone (2 bytes per parameter). With 4-bit quantization, the same model fits in about 35GB, putting it within reach of consumer GPUs.
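The arithmetic behind these numbers is just bits-per-weight times parameter count. A minimal sketch (weights only; KV cache, activations, and per-block quantization scales add overhead, and the helper name is purely illustrative):

```python
# Rough weights-only memory math; KV cache, activations, and quantization
# scales add overhead on top of these figures.
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Decimal gigabytes needed to store the weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"70B @ {label:>5}: ~{weight_memory_gb(70, bits):.0f} GB")
```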
Quantization Levels
- FP16/BF16 (16-bit): Full precision, best quality, highest memory
- INT8 (8-bit): Half the memory, minimal quality loss
- Q8_0: 8-bit quantization, excellent quality
- Q6_K: 6-bit, very good quality/size balance
- Q5_K_M: 5-bit medium, popular sweet spot
- Q4_K_M: 4-bit medium, good for limited VRAM
- Q4_0: Basic 4-bit, more quality loss (mechanics sketched after this list)
- Q2_K: 2-bit, significant quality degradation
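To make the low-bit entries above concrete, here is a toy version of what a basic format like Q4_0 stores: 4-bit integers plus one scale per small block of weights. This is a simplified sketch, not the exact llama.cpp layout (which packs two 4-bit values per byte and keeps the scale in FP16):

```python
import numpy as np

def quantize_block(block: np.ndarray):
    """Map a block of float weights to 4-bit ints in [-8, 7] plus one scale."""
    amax = float(np.abs(block).max())
    scale = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the stored ints and scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=4096).astype(np.float32)

# Quantize in blocks of 32 weights, each block carrying its own scale.
blocks = weights.reshape(-1, 32)
recon = np.concatenate([dequantize_block(*quantize_block(b)) for b in blocks])

print("mean absolute reconstruction error:", float(np.abs(weights - recon).mean()))
```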
Quantization Methods
- GPTQ: GPU-focused, requires calibration data
- AWQ: Activation-aware, preserves important weights
- GGUF/llama.cpp: CPU-friendly, various K-quant methods
- bitsandbytes: Easy integration with Hugging Face (loading example after this list)
- EXL2: Optimized for ExLlamaV2, variable bit rates
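As a concrete example of the bitsandbytes route, this is roughly how 4-bit loading looks through Hugging Face Transformers. A minimal sketch: the model name is a placeholder, and the NF4/double-quantization settings are common choices rather than requirements:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-7b-model"  # placeholder: substitute a real checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits at load time
    bnb_4bit_quant_type="nf4",              # NormalFloat4, suited to normally distributed weights
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # spread layers across available GPUs/CPU
)
```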
Trade-offs
Lower precision means:
- ✅ Less VRAM/RAM required
- ✅ Faster inference (sometimes)
- ✅ Run larger models on consumer hardware
- ❌ Some quality/accuracy loss
- ❌ May affect complex reasoning tasks
The K-quant methods (Q4_K_M, Q5_K_M, Q6_K) use mixed precision, keeping the most important weights at higher precision, which yields better quality than uniform quantization at a comparable size.
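A toy illustration of why that helps: quantize most blocks at 4 bits, but spend 8 bits on the blocks with the largest values (a stand-in for "important" weights), then compare reconstruction error. This only sketches the idea; the real K-quant formats use super-blocks with quantized scales and assign different bit-widths per tensor, which this does not reproduce:

```python
import numpy as np

def fake_quant(block: np.ndarray, bits: int) -> np.ndarray:
    """Quantize and immediately dequantize one block, to measure the error."""
    qmax = 2 ** (bits - 1) - 1
    amax = float(np.abs(block).max())
    scale = amax / qmax if amax > 0 else 1.0
    return np.clip(np.round(block / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(1)
weights = rng.normal(scale=0.02, size=(256, 32))  # 256 blocks of 32 weights
weights[:16] *= 10                                # a few outlier-heavy "important" blocks

uniform = np.stack([fake_quant(b, 4) for b in weights])

# Mixed precision: give 8 bits to the 10% of blocks with the largest magnitudes.
keep_hi = np.argsort(np.abs(weights).max(axis=1))[-len(weights) // 10:]
mixed = uniform.copy()
for i in keep_hi:
    mixed[i] = fake_quant(weights[i], 8)

print("uniform 4-bit mean abs error:", np.abs(weights - uniform).mean())
print("mixed 4/8-bit mean abs error:", np.abs(weights - mixed).mean())
```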