Quantization compresses neural network models by reducing the numerical precision of their weights, dramatically lowering memory requirements and often improving inference speed.
Why Quantization Matters
A 70B-parameter model in full precision (FP16) needs roughly 140GB for its weights alone (2 bytes per parameter). With 4-bit quantization, the same model fits in about 35GB, putting it within reach of consumer GPUs.
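The arithmetic behind these numbers is just bits-per-weight times parameter count. A minimal sketch (weights only; KV cache, activations, and per-block quantization scales add overhead, and the helper name is purely illustrative):

```python
# Rough weights-only memory math; KV cache, activations, and quantization
# scales add overhead on top of these figures.
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Decimal gigabytes needed to store the weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"70B @ {label:>5}: ~{weight_memory_gb(70, bits):.0f} GB")
```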
Quantization Levels
- FP16/BF16 (16-bit): Full precision, best quality, highest memory
- INT8 (8-bit): Half the memory, minimal quality loss
- Q8_0: 8-bit quantization, excellent quality
- Q6_K: 6-bit, very good quality/size balance
- Q5_K_M: 5-bit medium, popular sweet spot
- Q4_K_M: 4-bit medium, good for limited VRAM
- Q4_0: Basic 4-bit, more quality loss (mechanics sketched after this list)
- Q2_K: 2-bit, significant quality degradation
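To make the low-bit entries above concrete, here is a toy version of what a basic format like Q4_0 stores: 4-bit integers plus one scale per small block of weights. This is a simplified sketch, not the exact llama.cpp layout (which packs two 4-bit values per byte and keeps the scale in FP16):

```python
import numpy as np

def quantize_block(block: np.ndarray):
    """Map a block of float weights to 4-bit ints in [-8, 7] plus one scale."""
    amax = float(np.abs(block).max())
    scale = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the stored ints and scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=4096).astype(np.float32)

# Quantize in blocks of 32 weights, each block carrying its own scale.
blocks = weights.reshape(-1, 32)
recon = np.concatenate([dequantize_block(*quantize_block(b)) for b in blocks])

print("mean absolute reconstruction error:", float(np.abs(weights - recon).mean()))
```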
Quantization Methods
- GPTQ: GPU-focused, requires calibration data
- AWQ: Activation-aware, preserves important weights
- GGUF/llama.cpp: CPU-friendly, various K-quant methods
- bitsandbytes: Easy integration with Hugging Face (loading example after this list)
- EXL2: Optimized for ExLlamaV2, variable bit rates
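As a concrete example of the bitsandbytes route, this is roughly how 4-bit loading looks through Hugging Face Transformers. A minimal sketch: the model name is a placeholder, and the NF4/double-quantization settings are common choices rather than requirements:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-7b-model"  # placeholder: substitute a real checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits at load time
    bnb_4bit_quant_type="nf4",              # NormalFloat4, suited to normally distributed weights
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # spread layers across available GPUs/CPU
)
```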
Trade-offs
Lower precision means:
- ✅ Less VRAM/RAM required
- ✅ Faster inference (sometimes)
- ✅ Run larger models on consumer hardware
- ❌ Some quality/accuracy loss
- ❌ May affect complex reasoning tasks
The K-quant methods (Q4_K_M, Q5_K_M, Q6_K) use mixed precision, keeping the most important weights at higher precision, which yields better quality than uniform quantization at a comparable size.
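A toy illustration of why that helps: quantize most blocks at 4 bits, but spend 8 bits on the blocks with the largest values (a stand-in for "important" weights), then compare reconstruction error. This only sketches the idea; the real K-quant formats use super-blocks with quantized scales and assign different bit-widths per tensor, which this does not reproduce:

```python
import numpy as np

def fake_quant(block: np.ndarray, bits: int) -> np.ndarray:
    """Quantize and immediately dequantize one block, to measure the error."""
    qmax = 2 ** (bits - 1) - 1
    amax = float(np.abs(block).max())
    scale = amax / qmax if amax > 0 else 1.0
    return np.clip(np.round(block / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(1)
weights = rng.normal(scale=0.02, size=(256, 32))  # 256 blocks of 32 weights
weights[:16] *= 10                                # a few outlier-heavy "important" blocks

uniform = np.stack([fake_quant(b, 4) for b in weights])

# Mixed precision: give 8 bits to the 10% of blocks with the largest magnitudes.
keep_hi = np.argsort(np.abs(weights).max(axis=1))[-len(weights) // 10:]
mixed = uniform.copy()
for i in keep_hi:
    mixed[i] = fake_quant(weights[i], 8)

print("uniform 4-bit mean abs error:", np.abs(weights - uniform).mean())
print("mixed 4/8-bit mean abs error:", np.abs(weights - mixed).mean())
```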