⚙️ AI Infrastructure intermediate

Inference Server

A server that hosts large language models and processes requests to generate text responses, enabling local or self-hosted AI.

An inference server loads large language models into memory and exposes an API for generating text. This is how you run LLMs locally or on your own infrastructure rather than using cloud APIs.

Core Components:

  • Model Loading: Loads model weights into GPU VRAM or system RAM
  • Tokenization: Converts input text to tokens the model understands
  • KV Cache: Stores computed attention values to speed up generation
  • Sampling: Applies temperature, top-p, top-k to control output randomness
  • Batching: Processes multiple requests efficiently (continuous batching)
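The sampling step above can be sketched in a few lines of Python. This is a minimal illustration of temperature scaling followed by top-p (nucleus) filtering, not any particular server's implementation (top-k filtering works the same way, keeping a fixed number of tokens instead of a probability mass):

```python
import math
import random

def sample(logits, temperature=0.8, top_p=0.9):
    # Temperature scaling: lower values sharpen the distribution,
    # higher values flatten it toward uniform randomness.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [(tok, e / total) for tok, e in enumerate(exps)]

    # Top-p filtering: keep the smallest set of highest-probability
    # tokens whose cumulative probability reaches top_p.
    probs.sort(key=lambda x: x[1], reverse=True)
    kept, cum = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break

    # Renormalize over the kept tokens and draw one at random.
    z = sum(p for _, p in kept)
    r = random.random() * z
    for tok, p in kept:
        r -= p
        if r <= 0:
            return tok
    return kept[-1][0]
```

With `temperature=0` effectively replaced by greedy argmax in real servers, these two knobs cover most of the "randomness" controls an inference API exposes.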

Popular Inference Servers:

  • llama.cpp / llama-server: C++ implementation, runs on CPU and GPU, supports GGUF quantized models
  • vLLM: High-throughput server with PagedAttention, excellent for production
  • Ollama: User-friendly wrapper, easy model management, great for getting started
  • TGI (Text Generation Inference): Hugging Face's production server
  • LocalAI: OpenAI-compatible API for various model types

Key Considerations:

  • VRAM requirements (a quantized 7B model needs ~4-8GB; a 70B model ~40GB+)
  • Quantization reduces memory usage (Q4, Q5, Q8 formats)
  • Context length affects memory (longer contexts = more VRAM)
  • Throughput vs latency tradeoffs
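The first two considerations combine into a rough rule of thumb: weight memory is parameter count times bytes per weight, plus headroom for the KV cache and activations. A back-of-the-envelope sketch (the 20% overhead factor is an assumption for illustration, not a fixed constant, and real usage grows with context length):

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    # Weights: parameters * (bits / 8) bytes each.
    weight_bytes = params_billion * 1e9 * (bits_per_weight / 8)
    # Overhead factor (assumed ~20%) stands in for KV cache and
    # activations; longer contexts push this higher.
    return weight_bytes * overhead / 1e9
```

For example, `estimate_vram_gb(7, 4)` gives ~4.2 GB and `estimate_vram_gb(70, 4)` gives ~42 GB, consistent with the figures above; the same 7B model at 8-bit roughly doubles to ~8.4 GB.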

Most inference servers provide OpenAI-compatible endpoints, making it easy to swap between local and cloud models in your applications.
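Because of that compatibility, a client can talk to a local server over plain HTTP. A minimal sketch using only the Python standard library; the `localhost:8080` address and `"local"` model name are placeholders for whatever your server actually exposes:

```python
import json
import urllib.request

def build_chat_payload(prompt, model="local", temperature=0.7):
    # Request body in the OpenAI chat-completions format that most
    # inference servers accept.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt, base_url="http://localhost:8080/v1"):
    # POST to the server's OpenAI-compatible endpoint and return
    # the generated text from the first choice.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Since the request shape is the standard OpenAI one, swapping between a local server and a cloud provider is typically just a change of `base_url` (plus an API key header for hosted services).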

// Example Usage

Running llama-server on a machine with an RTX 4090 (24GB VRAM) to serve a quantized Llama 3 70B model — offloading as many layers as fit to the GPU and keeping the rest in system RAM, since the ~40GB of Q4 weights exceed the card's VRAM — providing an OpenAI-compatible API at localhost:8080 for your applications.