
GGUF (GPT-Generated Unified Format)

A file format for storing quantized large language models, designed for efficient loading and inference with llama.cpp.

GGUF is the modern standard format for distributing quantized language models, replacing the older GGML format. It's the native format for llama.cpp and widely supported across the local AI ecosystem.

Key Features

  • Single-file distribution: Model weights, tokenizer, and metadata in one file
  • Memory-mapped loading: Fast startup, efficient memory usage
  • Flexible quantization: Supports a range of quantization levels, roughly 2-bit through 8-bit (e.g. Q2_K through Q8_0)
  • Metadata storage: Includes model architecture, training info, tokenizer config
  • Cross-platform: Works on CPU (x86, ARM) and GPU (CUDA, Metal, ROCm)
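
The single-file layout starts with a small fixed header. Below is a minimal parsing sketch based on the published GGUF spec (magic bytes "GGUF", then version, tensor count, and metadata key/value count, all little-endian). To keep it self-contained, it builds a synthetic header in memory rather than reading a real model file:

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }

# Synthetic header: version 3, 2 tensors, 5 metadata keys.
header = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(read_gguf_header(header))
# {'version': 3, 'tensor_count': 2, 'metadata_kv_count': 5}
```

In a real file, the metadata key/value pairs (architecture, tokenizer config, etc.) follow immediately after this header, which is what lets loaders inspect a model without reading the full weights.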

File Naming Convention

GGUF files typically follow this pattern:

modelname-size-quantization.gguf

Examples:

  • llama-3-8b-instruct-Q4_K_M.gguf
  • mistral-7b-v0.3-Q5_K_M.gguf
  • phi-3-mini-Q8_0.gguf
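
Scripts often need to pull the quantization level out of a filename. A best-effort helper sketch (parse_gguf_filename is a hypothetical function, not a standard API; real filenames in the wild vary) that splits a name following the convention above into model stem and quantization suffix:

```python
import re

# Matches "<model-stem>-<quant>.gguf", where quant looks like
# Q4_K_M, Q8_0, IQ2_XS, F16, etc. Best-effort only.
PATTERN = re.compile(
    r"^(?P<name>.+?)-(?P<quant>(?:IQ|Q)\d[^.]*|F16|F32)\.gguf$",
    re.IGNORECASE,
)

def parse_gguf_filename(filename: str):
    """Split a GGUF filename into model stem and quantization tag,
    or return None if it doesn't follow the convention."""
    m = PATTERN.match(filename)
    if not m:
        return None
    return {"model": m.group("name"), "quant": m.group("quant")}

print(parse_gguf_filename("llama-3-8b-instruct-Q4_K_M.gguf"))
# {'model': 'llama-3-8b-instruct', 'quant': 'Q4_K_M'}
```

Note the parameter-count segment ("8b", "7b") is left inside the model stem, since not all published filenames include it in a consistent position.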

Where to Find GGUF Models

  • Hugging Face: Search for "GGUF" - TheBloke and bartowski are prolific quantizers
  • Ollama Library: Downloads GGUF models automatically
  • Create your own: Use llama.cpp's conversion script (convert_hf_to_gguf.py, formerly convert.py) and the llama-quantize tool

Compatibility

GGUF models work with:

  • llama.cpp / llama-server: Native support
  • Ollama: Uses GGUF internally
  • LM Studio: GUI for running GGUF models
  • GPT4All: Desktop application
  • koboldcpp: Alternative inference server
  • text-generation-webui: Popular web interface

vs Other Formats

| Format | Use Case |
| --- | --- |
| GGUF | Local inference, llama.cpp ecosystem |
| SafeTensors | Hugging Face, full precision |
| GPTQ | GPU inference, AutoGPTQ |
| AWQ | GPU inference, vLLM |
| EXL2 | ExLlamaV2, variable quantization |

GGUF has become the de facto standard for local LLM deployment due to its ease of use and broad compatibility.
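
The quantization level largely determines file size: bytes ≈ parameters × bits-per-weight ÷ 8. A back-of-the-envelope sketch; the bits-per-weight figures below are approximations (block scales add overhead, and exact K-quant sizes vary slightly by model):

```python
# Approximate bits per weight for common GGUF quantization levels.
# Q8_0 is exactly 8.5 (8-bit weights + fp16 scale per 32-weight block);
# the K-quant values are rough averages.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q2_K": 2.6,
}

def estimate_size_gb(n_params: float, quant: str) -> float:
    """Rough GGUF file size in GB: params * bits-per-weight / 8."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("F16", "Q8_0", "Q4_K_M"):
    print(f"7B model at {quant}: ~{estimate_size_gb(7e9, quant):.1f} GB")
# 7B model at F16: ~14.0 GB
# 7B model at Q8_0: ~7.4 GB
# 7B model at Q4_K_M: ~4.2 GB
```

This is why Q4_K_M is such a popular default: it fits a 7B model comfortably in 8 GB of RAM with modest quality loss compared to full precision.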

Example Usage

A typical workflow: download mistral-7b-instruct-v0.3-Q5_K_M.gguf from Hugging Face, then load it with Ollama or llama-server for local inference.