⚙️ AI Infrastructure · Intermediate

Embedding Server

A specialized server that converts text into vector representations (embeddings) for semantic search and RAG applications.

An embedding server hosts embedding models that transform text into high-dimensional vectors capturing semantic meaning. These vectors enable similarity search: finding text with similar meaning rather than text that merely matches keywords.

How It Works:

  1. Receive text input (sentence, paragraph, or document)
  2. Tokenize and process through embedding model
  3. Output a fixed-size vector (e.g., 384, 768, or 1536 dimensions)
  4. Vectors are stored in a vector database for later retrieval
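The pipeline above can be sketched end to end with a toy stand-in for the model. The `toy_embed` function below is purely illustrative (it hashes tokens into a small fixed-size vector, nothing like a real trained model), but it shows the essential shape: text in, fixed-length normalized vector out, then nearest-neighbor lookup by cosine similarity.

```python
import hashlib
import math

def toy_embed(text: str, dims: int = 8) -> list[float]:
    # Illustrative stand-in for a real embedding model: hash each token
    # into a bucket of a fixed-size vector, then L2-normalize. A real
    # model would produce e.g. 384- or 768-dimensional learned vectors.
    vec = [0.0] * dims
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # For unit-length vectors, the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Step 4: "store" vectors alongside their source text (a real system
# would use a vector database), then retrieve by similarity to a query.
docs = ["cats chase mice", "dogs chase cats", "stock prices fell"]
index = [(doc, toy_embed(doc)) for doc in docs]

query_vec = toy_embed("cats chase mice")
best_doc, _ = max(index, key=lambda pair: cosine(query_vec, pair[1]))
```

Note the structure, not the scoring: production systems swap `toy_embed` for a call to the embedding server and `index` for a vector database.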

Popular Embedding Servers:

  • TEI (Text Embeddings Inference): Hugging Face's optimized server, supports many models
  • Ollama: Also serves embedding models (e.g., nomic-embed-text, mxbai-embed-large)
  • Infinity: Fast embedding server with dynamic batching
  • sentence-transformers: Python library, can be wrapped as a server

Common Embedding Models:

  • nomic-embed-text: Good balance of quality and speed (768 dimensions)
  • mxbai-embed-large: High quality, larger model (1024 dimensions)
  • all-MiniLM-L6-v2: Fast and lightweight (384 dimensions)
  • text-embedding-3-small/large: OpenAI's cloud embeddings

Key Considerations:

  • Batch processing improves throughput significantly
  • Model choice affects quality and dimensionality
  • Embedding models are much smaller than LLMs (run easily on CPU)
  • Same model must be used for indexing and querying
  • Consider chunking strategy for long documents
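The last point above can be made concrete with a minimal chunker. This is a sketch of one common strategy, fixed-size word windows with overlap so that meaning spanning a boundary is not lost; the sizes are illustrative, and production systems often split on sentences or model tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    # Split a long document into overlapping word-window chunks.
    # chunk_size/overlap are illustrative defaults, not recommendations.
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk is then embedded and stored individually; at query time, retrieval returns the most relevant chunks rather than whole documents.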

Embedding servers are essential infrastructure for RAG (Retrieval-Augmented Generation) pipelines, powering semantic search, document Q&A, and knowledge base applications.

Example Usage:

Running TEI with the nomic-embed-text model to embed 10,000 documents into pgvector, enabling semantic search where users find relevant content by meaning rather than keywords.
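A minimal client for that setup might look like the sketch below, assuming a TEI instance on `localhost:8080` (the URL and port are assumptions; TEI's `/embed` route accepts a JSON body with an `inputs` field). Writing the resulting vectors into pgvector is left out here, since it depends on your database setup.

```python
import json
import urllib.request

TEI_URL = "http://localhost:8080/embed"  # assumed local TEI endpoint

def build_embed_payload(texts: list[str]) -> bytes:
    # TEI's /embed route takes {"inputs": [...]} and returns
    # one embedding vector per input text.
    return json.dumps({"inputs": texts}).encode("utf-8")

def embed(texts: list[str]) -> list[list[float]]:
    # Requires a running TEI server; batch your documents into one
    # request where possible, since batching improves throughput.
    req = urllib.request.Request(
        TEI_URL,
        data=build_embed_payload(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The same `embed` function must be used both when indexing the 10,000 documents and when embedding user queries, so that both live in the same vector space.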