⚙️ AI Infrastructure intermediate

Inference Server

A server that hosts large language models and processes requests to generate text responses, enabling local or self-hosted AI.

An inference server loads large language models into memory and exposes an API for generating text. This is how you run LLMs locally or on your own infrastructure rather than using cloud APIs.

Core Components:

  • Model Loading: Loads model weights into GPU VRAM or system RAM
  • Tokenization: Converts input text to tokens the model understands
  • KV Cache: Stores computed attention values to speed up generation
  • Sampling: Applies temperature, top-p, top-k to control output randomness
  • Batching: Processes multiple requests efficiently (continuous batching)
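The sampling step above can be sketched in a few lines of Python. This is a minimal illustration of temperature scaling followed by top-p (nucleus) filtering, not any particular server's implementation (top-k filtering works the same way, keeping a fixed number of tokens instead of a probability mass):

```python
import math
import random

def sample(logits, temperature=0.8, top_p=0.9):
    # Temperature scaling: lower values sharpen the distribution,
    # higher values flatten it toward uniform randomness.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [(tok, e / total) for tok, e in enumerate(exps)]

    # Top-p filtering: keep the smallest set of highest-probability
    # tokens whose cumulative probability reaches top_p.
    probs.sort(key=lambda x: x[1], reverse=True)
    kept, cum = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break

    # Renormalize over the kept tokens and draw one at random.
    z = sum(p for _, p in kept)
    r = random.random() * z
    for tok, p in kept:
        r -= p
        if r <= 0:
            return tok
    return kept[-1][0]
```

With `temperature=0` effectively replaced by greedy argmax in real servers, these two knobs cover most of the "randomness" controls an inference API exposes.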

Popular Inference Servers:

  • llama.cpp / llama-server: C++ implementation, runs on CPU and GPU, supports GGUF quantized models
  • vLLM: High-throughput server with PagedAttention, excellent for production
  • Ollama: User-friendly wrapper, easy model management, great for getting started
  • TGI (Text Generation Inference): Hugging Face's production server
  • LocalAI: OpenAI-compatible API for various model types

Key Considerations:

  • VRAM requirements (a quantized 7B model needs ~4-8GB; a 70B model ~40GB+)
  • Quantization reduces memory usage (Q4, Q5, Q8 formats)
  • Context length affects memory (longer contexts = more VRAM)
  • Throughput vs latency tradeoffs
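The first two considerations combine into a rough rule of thumb: weight memory is parameter count times bytes per weight, plus headroom for the KV cache and activations. A back-of-the-envelope sketch (the 20% overhead factor is an assumption for illustration, not a fixed constant, and real usage grows with context length):

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    # Weights: parameters * (bits / 8) bytes each.
    weight_bytes = params_billion * 1e9 * (bits_per_weight / 8)
    # Overhead factor (assumed ~20%) stands in for KV cache and
    # activations; longer contexts push this higher.
    return weight_bytes * overhead / 1e9
```

For example, `estimate_vram_gb(7, 4)` gives ~4.2 GB and `estimate_vram_gb(70, 4)` gives ~42 GB, consistent with the figures above; the same 7B model at 8-bit roughly doubles to ~8.4 GB.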

Most inference servers provide OpenAI-compatible endpoints, making it easy to swap between local and cloud models in your applications.
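Because of that compatibility, a client can talk to a local server over plain HTTP. A minimal sketch using only the Python standard library; the `localhost:8080` address and `"local"` model name are placeholders for whatever your server actually exposes:

```python
import json
import urllib.request

def build_chat_payload(prompt, model="local", temperature=0.7):
    # Request body in the OpenAI chat-completions format that most
    # inference servers accept.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt, base_url="http://localhost:8080/v1"):
    # POST to the server's OpenAI-compatible endpoint and return
    # the generated text from the first choice.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Since the request shape is the standard OpenAI one, swapping between a local server and a cloud provider is typically just a change of `base_url` (plus an API key header for hosted services).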

// Example Usage

Running llama-server on a machine with an RTX 4090 (24GB VRAM) to serve a quantized Llama 3 70B model — offloading as many layers as fit to the GPU and keeping the rest in system RAM, since the ~40GB of Q4 weights exceed the card's VRAM — providing an OpenAI-compatible API at localhost:8080 for your applications.