Vector Databases and Embeddings: The Complete Technical Guide
Every modern AI application — RAG pipelines, semantic search, recommendation engines, anomaly detection — runs on vector embeddings stored in a vector database. Here is exactly how they work under the hood.
When a user asks your chatbot "what's your refund policy?" and it finds the right answer buried in page 47 of your documentation, a vector database made that possible. When Spotify decides which song to play next, it's doing a similarity search over embeddings. When your fraud detection system flags an unusual transaction, it's comparing a vector against a learned distribution. Vector databases are the infrastructure layer that makes semantic AI search work at scale.
This guide explains everything from what an embedding actually is, to how HNSW indexes find nearest neighbours in milliseconds, to which vector database you should pick for your production workload.
What Is an Embedding?
An embedding is a dense numerical vector that represents meaning. Given a piece of text, an image, or any other data, an embedding model produces an array of floating-point numbers — typically 384 to 3072 dimensions — where the position of that vector in high-dimensional space encodes semantic meaning.
The key property: things with similar meaning end up close together in vector space.
import openai

client = openai.OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",  # 1536 dimensions
    input="What is your return policy?"
)
vector = response.data[0].embedding
# [-0.0234, 0.0891, -0.0445, ... 1536 values total]
print(len(vector))  # 1536

Now embed a semantically similar sentence:

response2 = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I get a refund?"
)
vector2 = response2.data[0].embedding

These two vectors will be very close together in 1536-dimensional space, even though no words overlap. That's the magic: the model has learned that "return policy" and "refund" are semantically related from training on billions of text examples.
How Similarity Is Measured
The standard metric for comparing embeddings is cosine similarity — it measures the angle between two vectors, ignoring their magnitude:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Returns a value between -1 and 1
# 1.0 = same direction (near-identical meaning)
# 0.0 = orthogonal (unrelated)
# -1.0 = opposite direction
similarity = cosine_similarity(vector, vector2)
print(similarity)  # clearly higher for related sentences than for unrelated ones; the exact range depends on the model

Other metrics you'll encounter:
- Euclidean distance (L2) — absolute distance in space. Use when magnitude matters.
- Dot product — fast but biased toward higher-magnitude vectors. Used in some retrieval systems where document importance correlates with magnitude.
- Manhattan distance (L1) — sum of absolute differences. Rarely used for text.
Most text embedding models are trained to work with cosine similarity. OpenAI's text-embedding-3 models also return L2-normalised (unit-length) vectors, so cosine similarity and dot product give identical results, which is a useful property for indexing.
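To see why normalisation matters, here is a minimal sketch in plain Python. The two 3-dimensional vectors are made-up stand-ins for real embeddings, and the helper names are only for illustration. It checks that on unit-length vectors cosine similarity equals the dot product, and that squared Euclidean distance is just 2 - 2 × cosine, so all three metrics produce the same ranking.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalise(a):
    n = norm(a)
    return [x / n for x in a]

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# Toy 3-dimensional "embeddings" (made-up values, not real model output)
a = normalise([3.0, 4.0, 0.0])
b = normalise([1.0, 2.0, 2.0])

cos_ab = cosine(a, b)
dot_ab = dot(a, b)
sq_l2_ab = sum((x - y) ** 2 for x, y in zip(a, b))

# On unit vectors: cosine == dot product, and squared L2 == 2 - 2 * cosine
print(abs(cos_ab - dot_ab) < 1e-12)
print(abs(sq_l2_ab - (2 - 2 * cos_ab)) < 1e-12)
```

In practice this means you can pick whichever of the three metrics your index computes fastest, as long as every stored vector really is normalised.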
The Naive Approach and Why It Doesn't Scale
The brute-force way to find the most similar vector in a collection is to compare your query vector against every stored vector:
def find_most_similar(query_vec, all_vecs):
    similarities = [cosine_similarity(query_vec, v) for v in all_vecs]
    return np.argmax(similarities)

# O(n * d): linear in the number of vectors
# At 1M vectors × 1536 dims: ~1.5 billion multiply-adds per query for the dot products alone
# (more if the norms are recomputed on every comparison, as they are here)

For small collections (under ~100k vectors) this works fine. At millions of vectors, a single query can take seconds. This is why approximate nearest neighbour (ANN) indexes exist.
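If you need the top k results rather than a single best match, the brute-force scan only changes slightly. The sketch below uses the standard library only, with tiny made-up 2-dimensional vectors. `top_k_exact` is a hypothetical helper name, and a real implementation would normalise the stored vectors once and use a single NumPy matrix multiply instead of a Python loop.

```python
import heapq
import math

def normalise(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k_exact(query, vectors, k):
    """Exact (brute-force) top-k by cosine similarity: O(n * d) per query."""
    q = normalise(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalise(v))), idx)
        for idx, v in enumerate(vectors)
    )
    # nlargest keeps only k candidates at a time instead of sorting everything
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]

docs = [
    [1.0, 0.0],   # 0
    [0.9, 0.1],   # 1
    [0.0, 1.0],   # 2
    [-1.0, 0.0],  # 3
]
hits = top_k_exact([1.0, 0.05], docs, k=2)
print([idx for idx, _ in hits])  # [0, 1]
```

This exact scan is also what you compare an ANN index against when you want to measure its recall.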
HNSW: How Vector Search Actually Runs Fast
Hierarchical Navigable Small World (HNSW) is the algorithm behind fast vector search in Qdrant, Weaviate, and pgvector, and a common default across vector databases. It builds a multi-layer graph where:
- Layer 0 contains all vectors, with edges to nearby neighbours
- Higher layers contain exponentially fewer vectors, acting as highway shortcuts
Search works by entering at the top layer (few nodes, fast traversal), greedily moving toward the query, then descending to lower layers for increasingly fine-grained search. The result is approximate: it might miss the single closest vector, but with well-tuned parameters it can find the top-k with ~99% recall at millisecond latency.
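The descent through layers is easier to follow in code. This is a deliberately simplified sketch, not how any particular database implements it: the graph is built by hand, the function names are hypothetical, and each layer tracks a single candidate, whereas real HNSW keeps a list of `ef` candidates per layer.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(graph, vectors, query, entry):
    """Walk one graph layer, always moving to the neighbour closest to the query."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: dist(vectors[n], query), default=current)
        if dist(vectors[best], query) >= dist(vectors[current], query):
            return current  # local minimum: no neighbour is closer
        current = best

def layered_search(layers, vectors, query, entry):
    """Descend from the sparsest layer to layer 0, reusing each result as the next entry point."""
    current = entry
    for graph in layers:  # ordered top layer first, layer 0 last
        current = greedy_search(graph, vectors, query, current)
    return current

# Hypothetical toy index: 8 points on a line
vectors = {i: [float(i)] for i in range(8)}
# Layer 0: every node, short edges to immediate neighbours
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
# Layer 1: a few nodes, long "highway" edges
layer1 = {0: [4], 4: [0, 7], 7: [4]}

nearest = layered_search([layer1, layer0], vectors, query=[5.2], entry=0)
print(nearest)  # 5
```

Starting at node 0, the top layer jumps straight to node 4 in one hop, and layer 0 only has to walk one more step to reach node 5. That shortcut is where the speed comes from.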
# HNSW configuration in Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,  # must match your embedding model
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,  # edges per node: higher = better recall, more memory
        ef_construction=200,  # build-time search width: higher = better index quality
        full_scan_threshold=10_000  # in kilobytes of vector data: segments below this size are scanned by brute force
    )
)
# The query-time search width is not part of the collection config.
# Pass it per request instead, e.g. search_params=SearchParams(hnsw_ef=128)
# (higher = better recall, slower).

Key HNSW tradeoffs:
- m: 16 is a safe default. Range 8-64. Doubling m roughly doubles the memory used by graph edges (the vectors themselves are unaffected).
- ef_construction: Set to at least 2×m. Higher values improve index quality but slow insertion.
- ef (query time; hnsw_ef in Qdrant): Increase if recall is too low. 64-128 is typical for production.
IVF+PQ: Scaling to Hundreds of Millions of Vectors
HNSW keeps the full vectors in memory — at 1536 dims × 4 bytes × 10M vectors = 61GB RAM just for vectors. For very large collections, IVF+PQ (Inverted File Index + Product Quantization) compresses vectors dramatically:
- IVF clusters vectors into Voronoi cells (k-means). At query time, only cells near the query are searched.
- PQ splits each vector into sub-vectors and quantises each one to an 8-bit code. With 192 sub-vectors, for example, 1536 × 4 = 6,144 bytes shrink to 192 bytes (32x compression).
import faiss
import numpy as np

d = 1536  # dimensions
nlist = 1024  # number of IVF cells
m = 32  # PQ sub-vectors (d must be divisible by m); 32 one-byte codes = 32 bytes per vector
bits = 8  # bits per sub-vector code
quantiser = faiss.IndexFlatIP(d)  # coarse quantiser: inner product (cosine on normalised vecs)
index = faiss.IndexIVFPQ(quantiser, d, nlist, m, bits)
# Must train on a representative sample (at least 39 * nlist vectors).
# Random data here is only a stand-in for real embeddings.
training_vectors = np.random.randn(50000, d).astype("float32")
faiss.normalize_L2(training_vectors)
index.train(training_vectors)
index.nprobe = 32  # search this many cells at query time (recall vs speed tradeoff)

IVF+PQ enables 100M+ vector collections on a single server. The tradeoff is lower recall (typically 90-95% vs 99%+ for HNSW) and a training step required before indexing.
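To make the recall-versus-speed effect of nprobe concrete, here is a toy IVF search in plain Python. It is a sketch under simplifying assumptions: the centroids are fixed by hand (FAISS learns them with k-means), there is no PQ compression, and `build_ivf` and `ivf_search` are hypothetical names.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the cell of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for idx, v in enumerate(vectors):
        nearest_cell = min(cells, key=lambda c: dist(v, centroids[c]))
        cells[nearest_cell].append(idx)
    return cells

def ivf_search(query, vectors, centroids, cells, nprobe, k):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(cells, key=lambda c: dist(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in cells[c]]
    return sorted(candidates, key=lambda idx: dist(query, vectors[idx]))[:k]

# Hypothetical data: three hand-picked centroids and five vectors
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
vectors = [[1.0, 1.0], [4.9, 0.0], [9.0, 1.0], [5.2, 0.0], [0.0, 9.0]]
cells = build_ivf(vectors, centroids)

query = [4.8, 0.0]
print(ivf_search(query, vectors, centroids, cells, nprobe=1, k=2))  # [1, 0]
print(ivf_search(query, vectors, centroids, cells, nprobe=2, k=2))  # [1, 3]
```

With nprobe=1 the search misses vector 3, the true second-nearest neighbour, because it sits just across the cell boundary. Probing a second cell recovers it at the cost of scanning more candidates.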
Embedding Models: Which One to Use
| Model | Dims | Context | Best For | Cost |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 tokens | Most use cases | $0.02/1M tokens |
| text-embedding-3-large | 3072 | 8191 tokens | High-accuracy retrieval | $0.13/1M tokens |
| nomic-embed-text | 768 | 8192 tokens | Local / open source | Free |
| mxbai-embed-large | 1024 | 512 tokens | Local, high quality | Free |
| bge-m3 | 1024 | 8192 tokens | Multilingual | Free |
| voyage-3 | 1024 | 32000 tokens | Long documents, code | $0.06/1M tokens |
Critical rule: always use the same model at index time and query time. Embeddings from different models are not comparable. If the dimensions differ, the database will reject the vector, but if they happen to match, mixing models produces nonsense results with no error message.
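Since the database can't catch a same-dimension model mismatch, one option is to catch it yourself. The sketch below is only an illustration: `EmbeddingSpec` and `check_compatible` are hypothetical names, and where you persist the spec (collection metadata, a config table, an environment variable) is up to you.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    model: str
    dims: int

def check_compatible(index_spec, query_spec):
    """Raise before searching if the query was embedded with a different model."""
    if index_spec != query_spec:
        raise ValueError(
            f"index built with {index_spec}, but query embedded with {query_spec}"
        )

# Recorded once, when the index is built
INDEX_SPEC = EmbeddingSpec("text-embedding-3-small", 1536)

# Same model: passes silently
check_compatible(INDEX_SPEC, EmbeddingSpec("text-embedding-3-small", 1536))

try:
    # Same dimensionality, different model: the database would accept this vector
    check_compatible(INDEX_SPEC, EmbeddingSpec("text-embedding-ada-002", 1536))
except ValueError as err:
    print(err)
```

A check like this costs nothing at query time and turns a silent quality regression into a loud failure during a model migration.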
Building a Production Vector Search Pipeline
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, Filter, FieldCondition, MatchValue
import openai
import uuid

client = QdrantClient(url="http://localhost:6333")
oai = openai.OpenAI()

def embed(text: str) -> list[float]:
    return oai.embeddings.create(
        model="text-embedding-3-small",
        input=text.replace("\n", " ")  # newlines degrade quality
    ).data[0].embedding

# ── Indexing ─────────────────────────────────────────────────────────
def index_documents(docs: list[dict]):
    points = []
    for doc in docs:
        vector = embed(doc["content"])
        points.append(PointStruct(
            id=str(uuid.uuid4()),
            vector=vector,
            payload={
                "content": doc["content"],
                "title": doc["title"],
                "source": doc["source"],
                "tenant_id": doc["tenant_id"],  # for multi-tenant filtering
            }
        ))
    client.upsert(collection_name="documents", points=points)

# ── Retrieval ─────────────────────────────────────────────────────────
def search(query: str, tenant_id: str, top_k: int = 5) -> list[dict]:
    query_vector = embed(query)
    results = client.search(
        collection_name="documents",
        query_vector=query_vector,
        limit=top_k,
        query_filter=Filter(  # metadata pre-filter
            must=[FieldCondition(
                key="tenant_id",
                match=MatchValue(value=tenant_id)
            )]
        ),
        with_payload=True,
        score_threshold=0.7  # drop low-confidence results
    )
    return [
        {"content": r.payload["content"], "score": r.score, "title": r.payload["title"]}
        for r in results
    ]

Chunking Strategy: The Most Underrated Variable
How you split documents before embedding often has a bigger impact on retrieval quality than which embedding model you choose. Common mistakes:
- Chunks too large: The embedding averages the meaning of too many ideas. The cosine score for the relevant chunk is diluted by irrelevant content.
- Chunks too small: Individual sentences lack enough context. "It usually takes 3-5 days" embedded alone loses the surrounding context that it refers to shipping times.
- Splitting mid-sentence: Breaks grammatical units and degrades embedding quality.
# In newer LangChain releases: from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,  # characters, roughly 100-150 tokens
    chunk_overlap=64,  # overlap preserves context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # tries these in order
)
chunks = splitter.split_text(document_text)
# Parent-child chunking: index small chunks, return larger parent context
# Small chunk (128 tokens) → high precision retrieval
# Return parent chunk (512 tokens) → enough context for the LLM

Recommended defaults: 256-512 token chunks with 10-15% overlap for most knowledge base use cases. Note that the splitter above counts characters, so 256-512 tokens corresponds to roughly 1,000-2,000 characters of English text. For code: split by function/class boundaries. For PDFs: split by section heading, not by character count.
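The parent-child idea mentioned in the comments above can be sketched in a few lines. This is a simplified illustration: it splits on words rather than tokens, the sizes are tiny so the example stays readable, and `split_words` and `build_parent_child` are hypothetical helpers rather than part of any library.

```python
def split_words(text, size):
    """Split text into chunks of at most `size` words (a stand-in for token-based splitting)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_parent_child(document, parent_size, child_size):
    """Return (parents, children); each child records the index of its parent."""
    parents = split_words(document, parent_size)
    children = []
    for parent_id, parent in enumerate(parents):
        for child in split_words(parent, child_size):
            children.append({"text": child, "parent_id": parent_id})
    return parents, children

document = (
    "Orders ship from our warehouse within one business day. "
    "It usually takes 3-5 days to arrive. "
    "Refunds are accepted within 30 days of delivery."
)
parents, children = build_parent_child(document, parent_size=16, child_size=8)

# Embed and index the children; at query time, swap the matched child for its parent
matched = next(c for c in children if "3-5 days" in c["text"])
print(matched["text"])                # the small chunk that matched
print(parents[matched["parent_id"]])  # the larger context handed to the LLM
```

The matched child on its own never mentions shipping, but its parent does, which is exactly the context the LLM needs to answer correctly.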
Hybrid Search: Combining Dense and Sparse Vectors
Pure vector search fails on exact lookups: product codes, names, technical identifiers. A user searching for "SKU-47821" gets semantically similar results instead of the exact product. The solution is hybrid search — combining dense embeddings with sparse BM25 keyword scores:
# Qdrant sparse + dense hybrid search
from qdrant_client import models
from qdrant_client.models import PointStruct, SparseVector

# At index time: store both dense and sparse representations
client.upsert(
    collection_name="hybrid_docs",
    points=[PointStruct(
        id=point_id,
        vector={
            "dense": dense_embedding,  # semantic
            "sparse": SparseVector(  # keyword weights (e.g. from BM25 or a SPLADE model)
                indices=bm25_indices,
                values=bm25_values
            )
        },
        payload=metadata
    )]
)
# At query time: search both, then fuse the two rankings with Reciprocal Rank Fusion
results = client.query_points(
    collection_name="hybrid_docs",
    prefetch=[
        models.Prefetch(query=query_dense_vec, using="dense", limit=20),
        models.Prefetch(query=SparseVector(...), using="sparse", limit=20),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=5
)

Hybrid search is often reported to outperform pure vector search by 5-15% on retrieval benchmarks, though the gain varies with the dataset and query mix. It's the default recommendation for production RAG systems.
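Reciprocal Rank Fusion itself is simple enough to write by hand, which helps when you need to debug fused results. The sketch below is a plain-Python version of the standard formula. The document ids are made up, and k=60 is a commonly used constant, not necessarily the value any particular database applies internally.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: score(id) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical result lists for the query "SKU-47821 return policy"
dense_hits = ["returns-faq", "shipping-faq", "sku-47821"]  # semantic ranking
sparse_hits = ["sku-47821", "returns-faq", "sku-47822"]    # keyword ranking

print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# ['returns-faq', 'sku-47821', 'shipping-faq', 'sku-47822']
```

Because RRF only looks at ranks, it sidesteps the problem that cosine scores and BM25 scores live on completely different scales. Documents that appear in both lists rise to the top.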
Choosing a Vector Database
| Database | Best For | Weakness | Scale |
|---|---|---|---|
| Qdrant | Production RAG, multi-tenant SaaS, hybrid search | Newer ecosystem | 100M+ vectors |
| Pinecone | Fastest managed option, no ops burden | Cost at scale, vendor lock-in | Unlimited (managed) |
| pgvector | Already on Postgres, simple use cases | Slow at >1M vectors without HNSW | ~5M vectors reasonably |
| Weaviate | Multi-modal (text + image), GraphQL API | Complex ops, higher memory | Billions (distributed) |
| Chroma | Local dev, prototyping | Not production-ready | <1M vectors |
| Milvus | Massive scale, self-hosted | Complex to operate | Billions of vectors |
Decision rule: Start with pgvector if you're already on Postgres and have fewer than 500k vectors. Use Qdrant for production RAG applications. Use Pinecone if you want zero infrastructure management and can afford ~$70/month minimum.
pgvector: Production Setup
-- Enable extension
CREATE EXTENSION vector;

-- Store embeddings alongside your data
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    embedding VECTOR(1536),  -- matches text-embedding-3-small
    tenant_id UUID NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index (faster queries, slower inserts vs IVFFlat)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Set query-time recall parameter
SET hnsw.ef_search = 100;

-- Semantic search with metadata filter
SELECT content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE tenant_id = $2
  AND 1 - (embedding <=> $1::vector) > 0.7
ORDER BY embedding <=> $1::vector
LIMIT 5;

In Laravel, use the pgvector-php package, store the embedding as JSON and cast it, or use raw DB queries with parameter binding for the vector literal.
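Whatever client you use, the `$1::vector` parameter in the query above is ultimately bound as text in pgvector's bracketed format. Here is a minimal sketch of building that literal in Python. `to_pgvector_literal` is a hypothetical helper, and the database driver you pass the string to is left out on purpose.

```python
import math

def to_pgvector_literal(vector):
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    values = [float(x) for x in vector]
    if not all(math.isfinite(x) for x in values):
        raise ValueError("pgvector does not accept NaN or infinite components")
    return "[" + ",".join(repr(x) for x in values) + "]"

literal = to_pgvector_literal([0.1, -2, 3.5])
print(literal)  # [0.1,-2.0,3.5]
```

Bind the resulting string as an ordinary query parameter and let the `::vector` cast in the SQL do the conversion. Never interpolate it into the SQL string directly.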
Metadata Filtering and Multi-Tenancy
Almost every production vector search system needs to filter results by metadata before or after the ANN search. The two approaches have very different performance characteristics:
- Pre-filtering: Filter the candidate set before ANN search. More accurate (searches only within the filtered set) but slower — the HNSW index becomes less effective on small subsets. Required for tenant isolation.
- Post-filtering: Run ANN search on all vectors, then filter results by metadata. Faster, but you may get fewer than k results if many are filtered out.
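The "fewer than k results" problem is easy to reproduce. The sketch below uses exact scoring in place of a real ANN index, and the points, tenant names, and function names are all made up for illustration.

```python
def score(query, vec):
    return sum(a * b for a, b in zip(query, vec))

def post_filter_search(query, points, tenant_id, k):
    """Take the global top-k first, then drop other tenants' points."""
    top = sorted(points, key=lambda p: score(query, p["vector"]), reverse=True)[:k]
    return [p["id"] for p in top if p["tenant_id"] == tenant_id]

def pre_filter_search(query, points, tenant_id, k):
    """Restrict to the tenant's points first, then take the top-k among them."""
    allowed = [p for p in points if p["tenant_id"] == tenant_id]
    top = sorted(allowed, key=lambda p: score(query, p["vector"]), reverse=True)[:k]
    return [p["id"] for p in top]

points = [
    {"id": "a1", "tenant_id": "acme", "vector": [0.9, 0.1]},
    {"id": "a2", "tenant_id": "acme", "vector": [0.2, 0.8]},
    {"id": "b1", "tenant_id": "globex", "vector": [1.0, 0.0]},
    {"id": "b2", "tenant_id": "globex", "vector": [0.95, 0.05]},
]
query = [1.0, 0.0]

print(post_filter_search(query, points, "acme", k=2))  # []
print(pre_filter_search(query, points, "acme", k=2))   # ['a1', 'a2']
```

Post-filtering returns nothing for acme because both global top-2 slots went to another tenant. For tenant isolation this is more than a recall problem: another tenant's data is shaping the result, which is why pre-filtering is the safe choice.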
Qdrant applies filters during the search itself, using payload indexes to restrict the candidates. This gives pre-filtering semantics and is the right default for multi-tenant systems. Create a payload index on any field you filter by:
client.create_payload_index(
    collection_name="documents",
    field_name="tenant_id",
    field_schema="uuid"  # indexed for fast filtering; use "keyword" if your tenant ids are not UUIDs
)

Evaluating Retrieval Quality
Before shipping a RAG system, measure whether your retrieval is actually working:
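from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# Build a test dataset: questions + ground truth relevant documents
test_questions = [
    {"question": "What is the refund window?", "ground_truth_context": ["Refunds are accepted within 30 days..."]},
    # ...
]
# Run your retrieval pipeline on each question
# (retrieve() stands for your own function returning a list of chunk texts)
results = [retrieve(q["question"]) for q in test_questions]
# Assemble the dataset ragas evaluates. These column names assume the
# ragas 0.1 schema and may differ in other versions.
test_dataset = Dataset.from_dict({
    "question": [q["question"] for q in test_questions],
    "contexts": results,
    "ground_truth": [" ".join(q["ground_truth_context"]) for q in test_questions],
})
# Measure
scores = evaluate(
    dataset=test_dataset,
    metrics=[context_precision, context_recall]
)
# context_recall < 0.7 → chunking or embedding problem
# context_precision < 0.7 → too many irrelevant chunks retrieved

Target: context_recall > 0.85, context_precision > 0.75 before wiring the LLM in. Fixing retrieval is always easier than prompting your way around bad context.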
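If you have labelled which chunk ids are relevant for each question, you can get a quick first signal without an LLM-based evaluator at all. The sketch below computes plain id-based precision and recall at k. These are not the same metrics ragas reports, and the chunk ids are made up.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision and recall of the top-k retrieved ids against a labelled relevant set."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Hypothetical labelled example
retrieved = ["refund-1", "shipping-3", "refund-2", "pricing-7", "faq-4"]
relevant = {"refund-1", "refund-2", "refund-5"}

precision, recall = precision_recall_at_k(retrieved, relevant, k=5)
print(round(precision, 2), round(recall, 2))  # 0.4 0.67
```

Averaging these over a few dozen labelled questions is cheap enough to run on every change to chunking or indexing parameters.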
Memory Sizing Reference
Planning infrastructure for a vector collection:
- text-embedding-3-small (1536 dims, float32): 6KB per vector
- 1M documents → ~6GB RAM just for vectors (before HNSW graph edges)
- HNSW with m=16 adds ~30% overhead → ~8GB for 1M vectors
- Practical rule: 1M vectors at 1536 dims needs a 16GB RAM server with room to spare
- Use int8 scalar quantisation (supported in Qdrant) to cut memory 4x with ~1% recall loss; pgvector's halfvec type stores half-precision floats for a 2x saving
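The arithmetic behind these numbers fits in a small calculator you can adapt to your own model and collection size. This is a rough sketch: the function names are hypothetical, sizes are in decimal gigabytes, and the 30% index overhead is the estimate from the list above rather than a measured figure.

```python
def vector_bytes(dims, bytes_per_component=4):
    """Raw storage for one vector; float32 uses 4 bytes per component."""
    return dims * bytes_per_component

def collection_gb(num_vectors, dims, bytes_per_component=4, index_overhead=0.0):
    """Estimated size in decimal gigabytes, with an optional fractional index overhead."""
    raw = num_vectors * vector_bytes(dims, bytes_per_component)
    return raw * (1 + index_overhead) / 1e9

def pq_code_bytes(num_subvectors, bits_per_code=8):
    """Size of one product-quantised code."""
    return num_subvectors * bits_per_code // 8

print(vector_bytes(1536))                                               # 6144 bytes, about 6KB
print(round(collection_gb(1_000_000, 1536), 2))                         # 6.14 (float32, vectors only)
print(round(collection_gb(1_000_000, 1536, index_overhead=0.30), 2))    # 7.99 (with ~30% HNSW overhead)
print(round(collection_gb(1_000_000, 1536, bytes_per_component=1), 2))  # 1.54 (int8 quantised)
print(vector_bytes(1536) // pq_code_bytes(192))                         # 32 (compression with 192 PQ codes)
```

Treat the output as a lower bound. Payloads, payload indexes, and the operating system all need memory on top of the vectors.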
The Full Picture
Embeddings and vector databases are not magic — they are specific tools for specific problems. They excel at semantic similarity: finding meaning, not exact matches. They struggle with precise lookups, numerical comparisons, and structured queries. The best production systems combine vector search with traditional databases: Postgres for structured data and transactions, a vector store for semantic retrieval, and a search engine like Elasticsearch for full-text keyword queries. Knowing when to reach for each is the skill that separates a good AI engineer from someone who put a vector database in front of everything and wondered why the quality was poor.