GGUF is the modern standard format for distributing quantized language models, replacing the older GGML format. It's the native format for llama.cpp and widely supported across the local AI ecosystem.
## Key Features
- Single-file distribution: Model weights, tokenizer, and metadata in one file
- Memory-mapped loading: Fast startup, efficient memory usage
- Flexible quantization: Supports various quantization levels (Q2 through Q8)
- Metadata storage: Includes model architecture, training info, tokenizer config
- Cross-platform: Works on CPU (x86, ARM) and GPU (CUDA, Metal, ROCm)
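The single-file, memory-mappable design rests on a simple binary layout: every GGUF file begins with a fixed header carrying magic bytes, a format version, a tensor count, and a metadata key/value count. A minimal sketch of parsing that header in Python, assuming the little-endian layout used by current GGUF versions:

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header from the start of a file's bytes.

    Assumes the little-endian layout: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count.
    """
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, tensor_count, metadata_kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }
```

In practice you would read only the first few dozen bytes of the file (or memory-map it) rather than loading the whole model to inspect the header.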
## File Naming Convention
GGUF files typically follow this pattern:
`modelname-size-quantization.gguf`
Examples:
- `llama-3-8b-instruct-Q4_K_M.gguf`
- `mistral-7b-v0.3-Q5_K_M.gguf`
- `phi-3-mini-Q8_0.gguf`
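Because the quantization tag sits at the end of the filename, it is easy to pull out programmatically. A hypothetical helper that splits such a filename into model name and quantization tag (the regex and function name are illustrative, not part of any library, and won't cover every naming scheme in the wild):

```python
import re

# Matches "<name>-<quant>.gguf", e.g. "llama-3-8b-instruct-Q4_K_M.gguf".
FILENAME_RE = re.compile(r"^(?P<name>.+)-(?P<quant>Q\d(?:_[A-Z0-9_]+)?)\.gguf$")

def parse_gguf_filename(filename: str):
    """Return (model_name, quant_tag) for a conventional GGUF filename, else None."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    return m.group("name"), m.group("quant")
```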
## Where to Find GGUF Models
- Hugging Face: Search for "GGUF" - TheBloke and bartowski are prolific quantizers
- Ollama Library: Downloads GGUF models automatically
- Create your own: Use llama.cpp's `convert.py` and `quantize` tools
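A typical conversion workflow might look like the following sketch. Note that the exact script and binary names vary by llama.cpp version (newer checkouts ship `convert_hf_to_gguf.py` and a `llama-quantize` binary), and the model paths here are placeholders:

```shell
# Sketch: convert a Hugging Face checkpoint to GGUF, then quantize it.
# Run from a llama.cpp checkout; paths and script names are placeholders.
python convert.py ./my-model-hf --outfile my-model-f16.gguf
./quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```

The intermediate f16 file is a full-precision GGUF; the second step produces the smaller quantized file you would actually distribute.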
## Compatibility
GGUF models work with:
- llama.cpp / llama-server: Native support
- Ollama: Uses GGUF internally
- LM Studio: GUI for running GGUF models
- GPT4All: Desktop application
- koboldcpp: Alternative inference server
- text-generation-webui: Popular web interface
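As one example from that list, serving a GGUF file with llama.cpp's `llama-server` and querying its OpenAI-compatible API might look like this sketch (the model filename and port are placeholders, and flag spellings can differ between llama.cpp releases):

```shell
# Sketch: serve a local GGUF model, then query it from another terminal.
./llama-server -m llama-3-8b-instruct-Q4_K_M.gguf --port 8080

# OpenAI-compatible chat endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```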
## GGUF vs. Other Formats
| Format | Use Case |
|---|---|
| GGUF | Local inference, llama.cpp ecosystem |
| SafeTensors | Hugging Face, full precision |
| GPTQ | GPU inference, AutoGPTQ |
| AWQ | GPU inference, vLLM |
| EXL2 | ExLlamaV2, variable quantization |
GGUF has become the de facto standard for local LLM deployment due to its ease of use and broad compatibility.