An embedding server hosts embedding models that transform text into high-dimensional vectors capturing semantic meaning. These vectors enable similarity search: finding text with similar meaning rather than text that merely shares keywords.
How It Works:
- Text input is received (a sentence, paragraph, or document)
- The text is tokenized and processed through the embedding model
- The model outputs a fixed-size vector (e.g., 384, 768, or 1536 dimensions)
- Vectors are stored in a vector database for later retrieval
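The similarity search these vectors enable is typically cosine similarity between the query vector and stored vectors. A minimal sketch in plain Python — the 3-dimensional vectors here are made-up toy values standing in for real embeddings, which have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction
    # (very similar meaning), 0.0 = orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for real model output
v_cat = [0.9, 0.1, 0.2]
v_kitten = [0.85, 0.15, 0.25]
v_invoice = [0.1, 0.9, 0.3]

print(cosine_similarity(v_cat, v_kitten))   # high (semantically close)
print(cosine_similarity(v_cat, v_invoice))  # much lower
```

This is why embedding-based search finds "kitten" when you query "cat" even though the strings share no keywords.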
Popular Embedding Servers:
- TEI (Text Embeddings Inference): Hugging Face's optimized server, supports many models
- Ollama: Serves embedding models in addition to LLMs (e.g., nomic-embed-text, mxbai-embed-large)
- Infinity: Fast embedding server with dynamic batching
- sentence-transformers: Python library, can be wrapped as a server
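Most of these servers expose a simple HTTP API. As one example, a sketch of a client for Ollama's `/api/embeddings` endpoint using only the standard library — the default port 11434 and the request/response shape follow Ollama's documented API, but check the docs of whichever server you deploy (newer Ollama releases also offer `/api/embed`):

```python
import json
import urllib.request

def build_embed_request(text, model="nomic-embed-text",
                        base_url="http://localhost:11434"):
    # Build the HTTP request for Ollama's /api/embeddings endpoint
    # (default local port; adjust base_url for your deployment).
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def embed(text, **kwargs):
    # Send the request and return the vector; requires a running server.
    with urllib.request.urlopen(build_embed_request(text, **kwargs)) as resp:
        return json.loads(resp.read())["embedding"]
```

Calling `embed("hello world")` against a local Ollama instance returns a list of floats whose length depends on the model (768 for nomic-embed-text).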
Common Embedding Models:
- nomic-embed-text: Good balance of quality and speed (768 dimensions)
- mxbai-embed-large: High quality, larger model (1024 dimensions)
- all-MiniLM-L6-v2: Fast and lightweight (384 dimensions)
- text-embedding-3-small/large: OpenAI's cloud embeddings
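Because each model has a fixed output dimensionality, it is worth validating vectors before they enter your index; mixing models silently corrupts search results. A small sketch (the `check_dims` helper is hypothetical; the dimensions come from the list above):

```python
# Output dimensionality per model, taken from the list above
MODEL_DIMS = {
    "nomic-embed-text": 768,
    "mxbai-embed-large": 1024,
    "all-MiniLM-L6-v2": 384,
    "text-embedding-3-small": 1536,
}

def check_dims(model, vector):
    # Guard against mixing models: a vector indexed with one model
    # cannot be meaningfully compared to queries embedded with another.
    expected = MODEL_DIMS.get(model)
    if expected is None:
        raise ValueError(f"unknown model: {model}")
    if len(vector) != expected:
        raise ValueError(
            f"{model} produces {expected}-d vectors, got {len(vector)}"
        )
    return vector
```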
Key Considerations:
- Batch processing improves throughput significantly
- Model choice affects quality and dimensionality
- Embedding models are much smaller than LLMs (most run well on CPU)
- Same model must be used for indexing and querying
- Consider chunking strategy for long documents
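On the last point, a common baseline chunking strategy is fixed-size word windows with overlap, so that context straddling a boundary appears in two chunks. A minimal sketch — the chunk size and overlap values are illustrative, and real pipelines often chunk by tokens or sentences instead of words:

```python
def chunk_text(text, chunk_size=200, overlap=40):
    # Split a long document into overlapping word-based chunks so each
    # piece fits comfortably in the embedding model's context window.
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk is embedded and stored separately, so a query can retrieve just the relevant passage of a long document rather than the whole thing.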
Embedding servers are essential infrastructure for RAG (Retrieval-Augmented Generation) pipelines, powering semantic search, document Q&A, and knowledge base applications.
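The retrieval half of such a pipeline can be sketched end-to-end in a few lines. The "embedding" below is a deliberately crude letter-frequency vector that only exists to make the sketch runnable without a server; in a real pipeline you would replace `toy_embed` with a call to your embedding server, using the same model for indexing and querying:

```python
import math
from collections import Counter

def toy_embed(text):
    # Stand-in for a real embedding server call: a 26-d letter-frequency
    # vector. It captures spelling, not semantics, and exists purely to
    # make this sketch self-contained.
    counts = Counter(c for c in text.lower() if c.isalpha())
    total = sum(counts.values()) or 1
    return [counts.get(chr(ord("a") + i), 0) / total for i in range(26)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

class MemoryIndex:
    """Minimal in-memory vector store: index documents, query by similarity."""

    def __init__(self, embed=toy_embed):
        self.embed = embed  # same model must be used to index and query
        self.docs = []

    def add(self, text):
        self.docs.append((text, self.embed(text)))

    def search(self, query, k=2):
        qv = self.embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(qv, d[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]
```

In a RAG pipeline, the top-k chunks returned by `search` are then pasted into the LLM prompt as grounding context.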