
GGUF (GPT-Generated Unified Format)

A file format for storing quantized large language models, designed for efficient loading and inference with llama.cpp.

GGUF is the modern standard format for distributing quantized language models, replacing the older GGML format. It's the native format for llama.cpp and widely supported across the local AI ecosystem.

Key Features

  • Single-file distribution: Model weights, tokenizer, and metadata in one file
  • Memory-mapped loading: Fast startup, efficient memory usage
  • Flexible quantization: Supports a range of quantization levels, roughly 2-bit through 8-bit (e.g. Q2_K through Q8_0)
  • Metadata storage: Includes model architecture, training info, tokenizer config
  • Cross-platform: Works on CPU (x86, ARM) and GPU (CUDA, Metal, ROCm)
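
The single-file layout starts with a small fixed header. Below is a minimal parsing sketch based on the published GGUF spec (magic bytes "GGUF", then version, tensor count, and metadata key/value count, all little-endian). To keep it self-contained, it builds a synthetic header in memory rather than reading a real model file:

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }

# Synthetic header: version 3, 2 tensors, 5 metadata keys.
header = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(read_gguf_header(header))
# {'version': 3, 'tensor_count': 2, 'metadata_kv_count': 5}
```

In a real file, the metadata key/value pairs (architecture, tokenizer config, etc.) follow immediately after this header, which is what lets loaders inspect a model without reading the full weights.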

File Naming Convention

GGUF files typically follow this pattern:

modelname-size-quantization.gguf

Examples:

  • llama-3-8b-instruct-Q4_K_M.gguf
  • mistral-7b-v0.3-Q5_K_M.gguf
  • phi-3-mini-Q8_0.gguf
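
Scripts often need to pull the quantization level out of a filename. A best-effort helper sketch (parse_gguf_filename is a hypothetical function, not a standard API; real filenames in the wild vary) that splits a name following the convention above into model stem and quantization suffix:

```python
import re

# Matches "<model-stem>-<quant>.gguf", where quant looks like
# Q4_K_M, Q8_0, IQ2_XS, F16, etc. Best-effort only.
PATTERN = re.compile(
    r"^(?P<name>.+?)-(?P<quant>(?:IQ|Q)\d[^.]*|F16|F32)\.gguf$",
    re.IGNORECASE,
)

def parse_gguf_filename(filename: str):
    """Split a GGUF filename into model stem and quantization tag,
    or return None if it doesn't follow the convention."""
    m = PATTERN.match(filename)
    if not m:
        return None
    return {"model": m.group("name"), "quant": m.group("quant")}

print(parse_gguf_filename("llama-3-8b-instruct-Q4_K_M.gguf"))
# {'model': 'llama-3-8b-instruct', 'quant': 'Q4_K_M'}
```

Note the parameter-count segment ("8b", "7b") is left inside the model stem, since not all published filenames include it in a consistent position.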

Where to Find GGUF Models

  • Hugging Face: Search for "GGUF" - TheBloke and bartowski are prolific quantizers
  • Ollama Library: Downloads GGUF models automatically
  • Create your own: Use llama.cpp's conversion script (convert_hf_to_gguf.py, formerly convert.py) and the llama-quantize tool

Compatibility

GGUF models work with:

  • llama.cpp / llama-server: Native support
  • Ollama: Uses GGUF internally
  • LM Studio: GUI for running GGUF models
  • GPT4All: Desktop application
  • koboldcpp: Alternative inference server
  • text-generation-webui: Popular web interface

vs Other Formats

| Format | Use Case |
| --- | --- |
| GGUF | Local inference, llama.cpp ecosystem |
| SafeTensors | Hugging Face, full precision |
| GPTQ | GPU inference, AutoGPTQ |
| AWQ | GPU inference, vLLM |
| EXL2 | ExLlamaV2, variable quantization |

GGUF has become the de facto standard for local LLM deployment due to its ease of use and broad compatibility.
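
The quantization level largely determines file size: bytes ≈ parameters × bits-per-weight ÷ 8. A back-of-the-envelope sketch; the bits-per-weight figures below are approximations (block scales add overhead, and exact K-quant sizes vary slightly by model):

```python
# Approximate bits per weight for common GGUF quantization levels.
# Q8_0 is exactly 8.5 (8-bit weights + fp16 scale per 32-weight block);
# the K-quant values are rough averages.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q2_K": 2.6,
}

def estimate_size_gb(n_params: float, quant: str) -> float:
    """Rough GGUF file size in GB: params * bits-per-weight / 8."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("F16", "Q8_0", "Q4_K_M"):
    print(f"7B model at {quant}: ~{estimate_size_gb(7e9, quant):.1f} GB")
# 7B model at F16: ~14.0 GB
# 7B model at Q8_0: ~7.4 GB
# 7B model at Q4_K_M: ~4.2 GB
```

This is why Q4_K_M is such a popular default: it fits a 7B model comfortably in 8 GB of RAM with modest quality loss compared to full precision.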

Example Usage

A typical workflow: download mistral-7b-instruct-v0.3-Q5_K_M.gguf from Hugging Face, then load it with Ollama or llama-server for local inference.