An inference server loads large language models into memory and exposes an API for generating text. This is how you run LLMs locally or on your own infrastructure rather than using cloud APIs.
Core Components:
- Model Loading: Loads model weights into GPU VRAM or system RAM
- Tokenization: Converts input text to tokens the model understands
- KV Cache: Stores the attention keys and values computed for previous tokens so they aren't recomputed at every generation step
- Sampling: Applies temperature, top-p, top-k to control output randomness
- Batching: Processes multiple requests efficiently (continuous batching)
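The sampling step above can be sketched in plain Python. This is an illustrative pipeline, not any particular server's implementation; the default values loosely mirror common llama.cpp settings but are assumptions:

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.95, rng=None):
    """Pick one token id from raw logits via temperature, top-k, then top-p.

    `logits` is a list of raw scores, one per vocabulary entry.
    Illustrative sketch only; real servers vectorize this on the GPU.
    """
    rng = rng or random.Random()
    # 1. Temperature: divide the logits; lower values sharpen the distribution.
    scaled = [l / temperature for l in logits]
    # 2. Top-k: keep only the k highest-scoring token ids.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # 3. Softmax over the survivors (subtract max for numerical stability).
    m = max(scaled[i] for i in order)
    exps = [math.exp(scaled[i] - m) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    cum, cut = 0.0, len(order)
    for n, p in enumerate(probs, start=1):
        cum += p
        if cum >= top_p:
            cut = n
            break
    order, probs = order[:cut], probs[:cut]
    # 5. Renormalize and draw one token from what remains.
    r = rng.random() * sum(probs)
    for i, p in zip(order, probs):
        r -= p
        if r <= 0:
            return i
    return order[-1]
```

Setting `top_k=1` (or a very low temperature) collapses this to greedy decoding, which is why those knobs trade diversity against determinism.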
Popular Inference Servers:
- llama.cpp / llama-server: C++ implementation, runs on CPU and GPU, supports GGUF quantized models
- vLLM: High-throughput server with PagedAttention, excellent for production
- Ollama: User-friendly wrapper, easy model management, great for getting started
- TGI (Text Generation Inference): Hugging Face's production server
- LocalAI: OpenAI-compatible API for various model types
Key Considerations:
- VRAM requirements (a quantized 7B model needs ~4-8GB; a 70B model needs ~40GB+)
- Quantization reduces memory usage (Q4, Q5, Q8 formats)
- Context length affects memory: the KV cache grows linearly with context, so longer contexts need more VRAM
- Throughput vs latency tradeoffs
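The memory considerations above can be turned into a back-of-the-envelope estimate. The function below is a rough sketch; the default layer count, head configuration, and overhead factor are assumptions (roughly shaped like a modern 7-8B model), not measured values:

```python
def estimate_vram_gb(n_params_billion, bits_per_weight=4, n_layers=32,
                     n_kv_heads=8, head_dim=128, context_len=8192,
                     kv_bits=16, overhead=1.2):
    """Rough VRAM estimate in GB for serving a model.

    Weights:  n_params * bits_per_weight / 8 bytes (Q4 ~= 4 bits/weight).
    KV cache: 2 (keys and values) * layers * kv_heads * head_dim
              * context_len * bytes per element -- linear in context.
    `overhead` loosely covers activations and runtime buffers.
    """
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bits / 8
    return (weight_bytes + kv_cache_bytes) * overhead / 1e9
```

For a 7B model at 4-bit quantization this lands in the ~4-8GB range quoted above, and doubling `context_len` visibly grows the KV-cache term while the weight term stays fixed.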
Most inference servers provide OpenAI-compatible endpoints, making it easy to swap between local and cloud models in your applications.
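A minimal sketch of that swap, using only the standard library. The base URL, port, and model name are illustrative (llama-server defaults to port 8080; vLLM and others differ), but the `/v1/chat/completions` payload shape is the same against a local server or the cloud API:

```python
import json
import urllib.request

# Hypothetical local endpoint -- adjust host/port for your server.
LOCAL_BASE = "http://localhost:8080/v1"

def build_chat_request(base_url, model, prompt, temperature=0.7):
    """Build an OpenAI-style chat completion request.

    Swapping local for cloud means changing `base_url` (and adding an
    Authorization header for the cloud API); the payload is unchanged.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# To actually send (requires a running server):
# with urllib.request.urlopen(build_chat_request(LOCAL_BASE, "llama3", "Hi")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```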