Vector Databases and Embeddings: The Complete Technical Guide

When a user asks your chatbot "what's your refund policy?" and it finds the right answer buried in page 47 of your documentation, a vector database made that possible. When Spotify decides which song to play next, it's doing a similarity search over embeddings. When your fraud detection system flags an unusual transaction, it's comparing a vector against a learned distribution. Vector databases are the infrastructure layer that makes semantic AI search work at scale.

This guide explains everything from what an embedding actually is, to how HNSW indexes find nearest neighbours in milliseconds, to which vector database you should pick for your production workload.

What Is an Embedding?

An embedding is a dense numerical vector that represents meaning. Given a piece of text, an image, or any other data, an embedding model produces an array of floating-point numbers — typically 384 to 3072 dimensions — where the position of that vector in high-dimensional space encodes semantic meaning.

The key property: things with similar meaning end up close together in vector space.

import openai

client = openai.OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",  # 1536 dimensions
    input="What is your return policy?"
)

vector = response.data[0].embedding
# [-0.0234, 0.0891, -0.0445, ... 1536 values total]
print(len(vector))  # 1536

Now embed a semantically similar sentence:

response2 = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I get a refund?"
)
vector2 = response2.data[0].embedding

These two vectors will be very close together in 1536-dimensional space, even though no words overlap. That's the magic: the model has learned that "return policy" and "refund" are semantically related from training on billions of text examples.

How Similarity Is Measured

The standard metric for comparing embeddings is cosine similarity — it measures the angle between two vectors, ignoring their magnitude:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Returns a value between -1 and 1
# 1.0 = identical meaning
# 0.0 = unrelated
# -1.0 = opposite meaning

similarity = cosine_similarity(vector, vector2)
print(similarity)  # related sentences score well above unrelated ones; the exact range depends on the model

Other metrics you'll encounter:

Euclidean distance (L2) — absolute distance in space. Use when magnitude matters.
Dot product — fast but biased toward higher-magnitude vectors. Used in some retrieval systems where document importance correlates with magnitude.
Manhattan distance (L1) — sum of absolute differences. Rarely used for text.

Most text embedding models are trained to work with cosine similarity. OpenAI's text-embedding-3 models also return L2-normalised (unit-length) vectors, so cosine similarity and dot product give identical scores. That is a useful property for indexing, because the cheaper dot product can be used instead.
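To see why normalisation makes the two metrics interchangeable, here is a minimal sketch using short Python lists in place of real embeddings. The helper names and the toy vectors are illustrative assumptions, not part of any library.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for embeddings (hypothetical values)
a = l2_normalise([3.0, 4.0, 0.0])
b = l2_normalise([1.0, 2.0, 2.0])

# On unit-length vectors both norms are 1, so the denominator of the
# cosine formula disappears and the dot product alone gives the same score.
print(abs(cosine(a, b) - dot(a, b)) < 1e-12)
```

If your embedding model does not normalise its output, applying a step like `l2_normalise` once at index time and once at query time gives the same effect.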

The Naive Approach and Why It Doesn't Scale

The brute-force way to find the most similar vector in a collection is to compare your query vector against every stored vector:

def find_most_similar(query_vec, all_vecs):
    similarities = [cosine_similarity(query_vec, v) for v in all_vecs]
    return np.argmax(similarities)

# O(n * d) per query: linear in the number of vectors
# At 1M vectors × 1536 dims: ~1.5 billion multiply-adds (about 3 billion floating-point operations) for the dot products alone

For small collections (under ~100k vectors) this works fine. At millions of vectors, a single query takes seconds. This is why approximate nearest neighbour (ANN) indexes exist.
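Exact search stays useful even after you adopt an ANN index, because it provides the ground truth against which approximate results are judged. A minimal top-k sketch in pure Python follows; `exact_top_k` and the toy vectors are illustrative names and values, not from any library.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_top_k(query, vectors, k=3):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = ((i, cosine(query, v)) for i, v in enumerate(vectors))
    # heapq.nlargest avoids sorting the full list when k is small
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy 2-dimensional collection (hypothetical values)
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
top = exact_top_k([1.0, 0.0], vectors, k=2)
print([i for i, _ in top])
```

The cost is still linear in the collection size, so this is a baseline for small collections and evaluation runs rather than a serving path.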

HNSW: How Vector Search Actually Runs Fast

Hierarchical Navigable Small World (HNSW) is the algorithm behind fast vector search in Qdrant, Weaviate, Pinecone, and pgvector. It builds a multi-layer graph where:

Layer 0 contains all vectors, with edges to nearby neighbours
Higher layers contain exponentially fewer vectors, acting as highway shortcuts

Search works by entering at the top layer (few nodes, fast traversal), greedily moving toward the query, then descending to lower layers for increasingly fine-grained search. The result is approximate — it might miss the single closest vector — but finds the top-k with ~99% recall at millisecond speed.
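The "exponentially fewer vectors" property comes from how each node's top layer is drawn at insert time. The sketch below assumes the rule from the original HNSW formulation, floor(-ln(U) × mL) with mL = 1 / ln(m); individual databases may tune this differently, and the node count and seed are arbitrary.

```python
import math
import random
from collections import Counter

def assign_layer(m, rng):
    """Draw a node's top layer as floor(-ln(U) * mL), with mL = 1 / ln(m)."""
    m_l = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # value in (0, 1], which avoids log(0)
    return int(-math.log(u) * m_l)

rng = random.Random(42)   # fixed seed so the run is repeatable
n, m = 100_000, 16        # hypothetical collection size and edges per node
top_layers = Counter(assign_layer(m, rng) for _ in range(n))

# A node whose top layer is L is present in every layer from 0 up to L
max_layer = max(top_layers)
nodes_per_layer = [
    sum(count for top, count in top_layers.items() if top >= layer)
    for layer in range(max_layer + 1)
]
for layer, count in enumerate(nodes_per_layer):
    print(f"layer {layer}: {count} nodes")
```

With m=16, each layer holds roughly one sixteenth of the nodes in the layer below it, which is why the upper layers are cheap to traverse.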

# HNSW configuration in Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,           # must match your embedding model
        distance=Distance.COSINE
    ),
    # Query-time search width is not part of the index config in Qdrant; pass it
    # per request via search_params=SearchParams(hnsw_ef=128): higher = better recall, slower
    hnsw_config=HnswConfigDiff(
        m=16,                # edges per node: higher = better recall, more memory
        ef_construct=200,    # build-time search width: higher = better index quality
        full_scan_threshold=10_000  # size in KB below which Qdrant prefers a full scan over HNSW
    )
)

Key HNSW tradeoffs:

m: 16 is a safe default. Range 8-64. Doubling m roughly doubles the memory used by graph links.
ef_construction (ef_construct in Qdrant): Set to at least 2×m. Higher values improve index quality but slow insertion.
ef (query time; hnsw_ef in Qdrant, hnsw.ef_search in pgvector): Increase if recall is too low. 64-128 is typical for production.
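Tuning these parameters requires a recall number to tune against. One way to get it is to compare the ids an ANN index returns with the ids from an exact search over the same queries. The sketch below assumes you already have both result lists; the ids shown are made up.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbours that the ANN index returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

def mean_recall(approx_results, exact_results):
    """Average recall over a batch of queries."""
    pairs = list(zip(approx_results, exact_results))
    return sum(recall_at_k(a, e) for a, e in pairs) / len(pairs)

# Hypothetical ids for two queries, top-5 each
exact = [[1, 2, 3, 4, 5], [10, 11, 12, 13, 14]]
approx = [[1, 2, 3, 4, 9], [10, 11, 12, 13, 14]]
print(mean_recall(approx, exact))
```

Running this over a few hundred sampled queries while sweeping the query-time search width shows where recall stops improving and latency keeps growing.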

IVF+PQ: Scaling to Hundreds of Millions of Vectors

HNSW keeps the full vectors in memory — at 1536 dims × 4 bytes × 10M vectors = 61GB RAM just for vectors. For very large collections, IVF+PQ (Inverted File Index + Product Quantization) compresses vectors dramatically:

IVF clusters vectors into Voronoi cells (k-means). At query time, only cells near the query are searched.
PQ splits each vector into sub-vectors and quantises each to an 8-bit code. With 192 sub-vectors, 1536 × 4 = 6144 bytes shrinks to 192 bytes (32x compression); fewer sub-vectors compress further at the cost of recall.

import faiss
import numpy as np

d = 1536      # dimensions
nlist = 1024  # number of IVF cells
m = 32        # PQ sub-vectors (d must be divisible by m)
bits = 8      # bits per sub-vector code

quantiser = faiss.IndexFlatL2(d)  # coarse quantiser; IndexIVFPQ defaults to L2, which ranks like cosine on normalised vectors
index = faiss.IndexIVFPQ(quantiser, d, nlist, m, bits)

# Must train on representative sample (at least 39 * nlist vectors)
training_vectors = np.random.randn(50000, d).astype("float32")
faiss.normalize_L2(training_vectors)
index.train(training_vectors)

index.nprobe = 32  # search this many cells at query time (recall vs speed tradeoff)

IVF+PQ enables 100M+ vector collections on a single server. The tradeoff is lower recall (typically 90-95% vs 99%+ for HNSW) and a training step required before indexing.
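The compression arithmetic behind product quantisation can be sketched as follows. The function names are illustrative, and the figures cover only the PQ codes themselves, ignoring per-vector ids and the IVF list structures.

```python
def pq_code_bytes(num_subvectors, bits_per_code=8):
    """Size of one PQ-compressed vector in bytes."""
    return num_subvectors * bits_per_code // 8

def compression_ratio(dims, num_subvectors, bits_per_code=8, bytes_per_float=4):
    """Raw float32 size divided by PQ code size."""
    raw = dims * bytes_per_float
    return raw / pq_code_bytes(num_subvectors, bits_per_code)

d = 1536
for m in (32, 96, 192):
    assert d % m == 0  # dimensions must divide evenly into sub-vectors
    print(f"m={m}: {pq_code_bytes(m)} bytes per vector, {compression_ratio(d, m):.0f}x smaller")
```

The choice of m is the main lever: fewer sub-vectors means smaller codes but coarser approximation of each vector.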

Embedding Models: Which One to Use

Model	Dims	Context	Best For	Cost
text-embedding-3-small	1536	8191 tokens	Most use cases	$0.02/1M tokens
text-embedding-3-large	3072	8191 tokens	High-accuracy retrieval	$0.13/1M tokens
nomic-embed-text	768	8192 tokens	Local / open source	Free
mxbai-embed-large	1024	512 tokens	Local, high quality	Free
bge-m3	1024	8192 tokens	Multilingual	Free
voyage-3	1024	32000 tokens	Long documents, code	$0.06/1M tokens

Critical rule: always use the same model at index time and query time. Embeddings from different models are not comparable — mixing them will produce nonsense results with no error message.
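Because a model mismatch fails silently, it can help to store the model name next to the index and check it before every query. The sketch below is one possible guard; `EmbeddingSpec` and `check_compatible` are hypothetical names, and where you persist the spec (collection metadata, a config table) is up to you.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    """Identifies the model that produced a set of vectors."""
    model: str
    dims: int

def check_compatible(index_spec, query_spec):
    """Raise if the query was embedded with a different model than the index."""
    if index_spec != query_spec:
        raise ValueError(
            f"index built with {index_spec}, query embedded with {query_spec}"
        )

index_spec = EmbeddingSpec("text-embedding-3-small", 1536)

# Same model on both sides: passes silently
check_compatible(index_spec, EmbeddingSpec("text-embedding-3-small", 1536))

# Different model: rejected before any search runs
try:
    check_compatible(index_spec, EmbeddingSpec("nomic-embed-text", 768))
except ValueError as err:
    print("rejected:", err)
```

Comparing the model name and not just the dimension matters, since two different models can share a dimension and still produce incomparable vectors.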

Building a Production Vector Search Pipeline

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, Filter, FieldCondition, MatchValue
import openai
import uuid

client = QdrantClient(url="http://localhost:6333")
oai = openai.OpenAI()

def embed(text: str) -> list[float]:
    return oai.embeddings.create(
        model="text-embedding-3-small",
        input=text.replace("\n", " ")  # newlines degrade quality
    ).data[0].embedding

# ── Indexing ─────────────────────────────────────────────────────────
def index_documents(docs: list[dict]):
    points = []
    for doc in docs:
        vector = embed(doc["content"])
        points.append(PointStruct(
            id=str(uuid.uuid4()),
            vector=vector,
            payload={
                "content": doc["content"],
                "title": doc["title"],
                "source": doc["source"],
                "tenant_id": doc["tenant_id"],  # for multi-tenant filtering
            }
        ))
    client.upsert(collection_name="documents", points=points)

# ── Retrieval ─────────────────────────────────────────────────────────
def search(query: str, tenant_id: str, top_k: int = 5) -> list[dict]:
    query_vector = embed(query)

    results = client.search(
        collection_name="documents",
        query_vector=query_vector,
        limit=top_k,
        query_filter=Filter(           # metadata pre-filter
            must=[FieldCondition(
                key="tenant_id",
                match=MatchValue(value=tenant_id)
            )]
        ),
        with_payload=True,
        score_threshold=0.7            # drop low-confidence results
    )

    return [
        {"content": r.payload["content"], "score": r.score, "title": r.payload["title"]}
        for r in results
    ]
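The indexing function above makes one embedding request per document, which becomes slow for large imports. The OpenAI embeddings endpoint accepts a list of strings as `input`, so requests can be batched. The sketch below shows only the batching logic; `embed_batch` is a hypothetical callable you would implement around your embedding client, and `fake_embed_batch` is a stand-in so the example runs without network access.

```python
from typing import Callable, Iterator

def batches(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_all(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed texts batch by batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batches(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    """Stand-in for a real embedding call: one 1-d 'vector' per text."""
    return [[float(len(text))] for text in batch]

vectors = embed_all(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```

The batch size of 64 is an arbitrary starting point; the practical limit depends on your provider's request size and rate limits.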

Chunking Strategy: The Most Underrated Variable

How you split documents before embedding has a bigger impact on retrieval quality than which embedding model you choose. Common mistakes:

Chunks too large: The embedding averages the meaning of too many ideas. The cosine score for the relevant chunk is diluted by irrelevant content.
Chunks too small: Individual sentences lack enough context. "It usually takes 3-5 days" embedded alone loses the surrounding context that it refers to shipping times.
Splitting mid-sentence: Breaks grammatical units and degrades embedding quality.

from langchain_text_splitters import RecursiveCharacterTextSplitter  # older releases: from langchain.text_splitter import ...

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # characters — roughly 100-150 tokens
    chunk_overlap=64,     # overlap preserves context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # tries these in order
)

chunks = splitter.split_text(document_text)

# Parent-child chunking: index small chunks, return larger parent context
# Small chunk (128 tokens) → high precision retrieval
# Return parent chunk (512 tokens) → enough context for the LLM

Recommended defaults: 256-512 token chunks with 10-15% overlap for most knowledge base use cases. For code: split by function/class boundaries. For PDFs: split by section heading, not by character count.
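Parent-child chunking, mentioned in the comments above, can be sketched in a few lines. This version counts whitespace-separated words as a rough proxy for tokens and ignores sentence boundaries and overlap, so treat it as an illustration of the bookkeeping rather than a production splitter. All function names are hypothetical.

```python
def split_words(text, size):
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def parent_child_chunks(text, parent_size=400, child_size=100):
    """Return (parents, children); each child records the index of its parent."""
    parents = split_words(text, parent_size)
    children = []
    for parent_id, parent in enumerate(parents):
        for child in split_words(parent, child_size):
            children.append({"text": child, "parent_id": parent_id})
    return parents, children

# Synthetic 1000-word document
text = " ".join(f"word{i}" for i in range(1000))
parents, children = parent_child_chunks(text)

# Embed and search over the small children, then hand the parent to the LLM
hit = children[5]
context_for_llm = parents[hit["parent_id"]]
print(len(parents), len(children), hit["parent_id"])
```

In a vector store, `parent_id` would live in each child's payload so the parent text can be fetched after retrieval.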

Hybrid Search: Combining Dense and Sparse Vectors

Pure vector search fails on exact lookups: product codes, names, technical identifiers. A user searching for "SKU-47821" gets semantically similar results instead of the exact product. The solution is hybrid search — combining dense embeddings with sparse BM25 keyword scores:

# Qdrant sparse + dense hybrid search
from qdrant_client import models
from qdrant_client.models import PointStruct, SparseVector

# At index time: store both dense and sparse representations
client.upsert(
    collection_name="hybrid_docs",
    points=[PointStruct(
        id=point_id,
        vector={
            "dense": dense_embedding,     # semantic
            "sparse": SparseVector(       # sparse keyword weights (e.g. BM25 or a SPLADE model)
                indices=bm25_indices,
                values=bm25_values
            )
        },
        payload=metadata
    )]
)

# At query time: search both, then fuse scores with Reciprocal Rank Fusion
results = client.query_points(
    collection_name="hybrid_docs",
    prefetch=[
        models.Prefetch(query=query_dense_vec, using="dense", limit=20),
        models.Prefetch(query=SparseVector(...), using="sparse", limit=20),  # sparse query vector elided
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=5
)

Hybrid search typically outperforms pure vector search by 5-15% on retrieval benchmarks (MTEB). It's the default recommendation for production RAG systems.
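Reciprocal Rank Fusion itself is simple enough to sketch directly: each document scores the sum of 1 / (k + rank) over every result list it appears in. The constant k=60 used here is a commonly cited default and is an assumption; this is not a claim about the value any particular database uses internally. The document ids are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from the dense and sparse searches
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because RRF only looks at ranks, it sidesteps the problem that cosine scores and BM25 scores live on different scales.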

Choosing a Vector Database

Database	Best For	Weakness	Scale
Qdrant	Production RAG, multi-tenant SaaS, hybrid search	Newer ecosystem	100M+ vectors
Pinecone	Fastest managed option, no ops burden	Cost at scale, vendor lock-in	Unlimited (managed)
pgvector	Already on Postgres, simple use cases	Slow at >1M vectors without HNSW	~5M vectors reasonably
Weaviate	Multi-modal (text + image), GraphQL API	Complex ops, higher memory	Billions (distributed)
Chroma	Local dev, prototyping	Not production-ready	<1M vectors
Milvus	Massive scale, self-hosted	Complex to operate	Billions of vectors

Decision rule: Start with pgvector if you're already on Postgres and have fewer than 500k vectors. Use Qdrant for production RAG applications. Use Pinecone if you want zero infrastructure management and can afford ~$70/month minimum.

pgvector: Production Setup

-- Enable extension
CREATE EXTENSION vector;

-- Store embeddings alongside your data
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    embedding VECTOR(1536),    -- matches text-embedding-3-small
    tenant_id UUID NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index (faster queries, slower inserts vs IVFFlat)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Set query-time recall parameter
SET hnsw.ef_search = 100;

-- Semantic search with metadata filter
SELECT content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE tenant_id = $2
  AND 1 - (embedding <=> $1::vector) > 0.7
ORDER BY embedding <=> $1::vector
LIMIT 5;

In Laravel, use the pgvector-php package, store the embedding as JSON and cast it, or use raw DB queries with parameter binding for the vector literal.
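Whatever the client language, the bound parameter for `$1::vector` is pgvector's text form: a bracketed, comma-separated list of numbers. The sketch below is in Python for consistency with the rest of this guide, and the helper names are hypothetical; it also shows the conversion from the cosine distance returned by `<=>` back to a similarity.

```python
def to_pgvector_literal(vec):
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

def cosine_distance_to_similarity(distance):
    """The <=> operator returns cosine distance, so similarity is 1 - distance."""
    return 1.0 - distance

# Value to bind as $1 in the query above (toy 3-dimensional vector)
query_param = to_pgvector_literal([0.25, -0.5, 1])
print(query_param)
```

Passing the literal as a bound parameter, never by string concatenation into the SQL, keeps the query safe and lets the driver handle quoting.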

Metadata Filtering and Multi-Tenancy

Almost every production vector search system needs to filter results by metadata before or after the ANN search. The two approaches have very different performance characteristics:

Pre-filtering: Filter the candidate set before ANN search. More accurate (searches only within the filtered set) but slower — the HNSW index becomes less effective on small subsets. Required for tenant isolation.
Post-filtering: Run ANN search on all vectors, then filter results by metadata. Faster, but you may get fewer than k results if many are filtered out.
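The difference between the two approaches shows up even in a toy example. The sketch below uses a brute-force search as a stand-in for the ANN index; the points, tenants, and function names are all made up for illustration.

```python
def top_k(query, points, k):
    """Brute-force stand-in for an ANN search: best dot-product scores first."""
    def score(point):
        return sum(q * x for q, x in zip(query, point["vector"]))
    return sorted(points, key=score, reverse=True)[:k]

points = [
    {"id": 1, "tenant": "a", "vector": [1.0, 0.0]},
    {"id": 2, "tenant": "b", "vector": [0.9, 0.1]},
    {"id": 3, "tenant": "b", "vector": [0.8, 0.2]},
    {"id": 4, "tenant": "a", "vector": [0.1, 0.9]},
]
query, k = [1.0, 0.0], 2

# Pre-filter: restrict candidates to the tenant first, then search
pre = top_k(query, [p for p in points if p["tenant"] == "a"], k)

# Post-filter: search everything, then drop other tenants' results
post = [p for p in top_k(query, points, k) if p["tenant"] == "a"]

print([p["id"] for p in pre], [p["id"] for p in post])
```

Here the post-filtered search returns a single result for tenant "a" even though two exist, because another tenant's vector occupied a slot in the global top-k.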

Qdrant applies the filter during the HNSW traversal itself and relies on payload indexes to do so efficiently, which makes it a good default for multi-tenant systems. Create a payload index on any field you filter by:

client.create_payload_index(
    collection_name="documents",
    field_name="tenant_id",
    field_schema="uuid"  # indexed for fast filtering
)

Evaluating Retrieval Quality

Before shipping a RAG system, measure whether your retrieval is actually working:

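from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# Build a test set: questions + ground-truth answers
test_questions = [
    {"question": "What is the refund window?", "ground_truth": "Refunds are accepted within 30 days..."},
    # ...
]

# Run your retrieval pipeline on each question
# (retrieve() stands for your own function returning a list of chunk texts)
retrieved = [retrieve(q["question"]) for q in test_questions]

# Assemble the dataset that evaluate() scores. Column names differ between
# ragas releases; these follow the 0.1-style schema, so check your version.
test_dataset = Dataset.from_dict({
    "question": [q["question"] for q in test_questions],
    "contexts": retrieved,
    "ground_truth": [q["ground_truth"] for q in test_questions],
})

# Measure (both metrics call an LLM judge, so API credentials are required)
scores = evaluate(
    dataset=test_dataset,
    metrics=[context_precision, context_recall]
)
# context_recall < 0.7 → chunking or embedding problem
# context_precision < 0.7 → too many irrelevant chunks retrieved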

Target: context_recall > 0.85, context_precision > 0.75 before wiring the LLM in. Fixing retrieval is always easier than prompting your way around bad context.
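Before reaching for an LLM-judged metric, a cheaper sanity check is possible if your test set labels which chunk ids are relevant to each question. The sketch below computes hit rate and mean reciprocal rank; these are complementary to the ragas metrics above, not replacements, and the chunk ids are made up.

```python
def hit_rate(retrieved_ids, relevant_ids, k=5):
    """Share of questions with at least one relevant chunk in the top k."""
    hits = sum(
        1 for got, want in zip(retrieved_ids, relevant_ids)
        if set(got[:k]) & set(want)
    )
    return hits / len(relevant_ids)

def mean_reciprocal_rank(retrieved_ids, relevant_ids):
    """Average of 1 / rank of the first relevant chunk (0 when none is found)."""
    total = 0.0
    for got, want in zip(retrieved_ids, relevant_ids):
        for rank, chunk_id in enumerate(got, start=1):
            if chunk_id in want:
                total += 1.0 / rank
                break
    return total / len(relevant_ids)

# Hypothetical results for three test questions
retrieved = [["c7", "c2", "c9"], ["c4", "c1", "c3"], ["c8", "c5", "c6"]]
relevant = [{"c2"}, {"c4"}, {"c0"}]

print(hit_rate(retrieved, relevant), mean_reciprocal_rank(retrieved, relevant))
```

A low hit rate points at chunking or embedding problems; a decent hit rate with a low reciprocal rank suggests the right chunks are found but ranked too far down.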

Memory Sizing Reference

Planning infrastructure for a vector collection:

text-embedding-3-small (1536 dims, float32): 6KB per vector
1M documents → ~6GB RAM just for vectors (before HNSW graph edges)
HNSW with m=16 adds ~30% overhead → ~8GB for 1M vectors
Practical rule: 1M vectors at 1536 dims needs a 16GB RAM server with room to spare
Use int8 scalar quantisation (supported in Qdrant) to cut vector memory 4x with ~1% recall loss; pgvector offers half-precision halfvec columns for a 2x reduction
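The sizing figures above can be reproduced with a small calculator. The 30% index overhead is the rule of thumb quoted in this section and is treated here as an adjustable assumption; the function names are illustrative.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Memory for the vectors alone (float32 by default)."""
    return num_vectors * dims * bytes_per_value

def with_index_overhead(raw_bytes, overhead=0.30):
    """Add a rule-of-thumb allowance for HNSW graph edges."""
    return raw_bytes * (1 + overhead)

GB = 1e9
float32_raw = raw_vector_bytes(1_000_000, 1536)
int8_raw = raw_vector_bytes(1_000_000, 1536, bytes_per_value=1)

print(f"float32 vectors: {float32_raw / GB:.2f} GB")
print(f"float32 + HNSW:  {with_index_overhead(float32_raw) / GB:.2f} GB")
print(f"int8 vectors:    {int8_raw / GB:.2f} GB")
```

Payload storage, the operating system, and headroom for segment rebuilds come on top of these figures, which is why the practical recommendation is well above the raw number.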

The Full Picture

Embeddings and vector databases are not magic — they are specific tools for specific problems. They excel at semantic similarity: finding meaning, not exact matches. They struggle with precise lookups, numerical comparisons, and structured queries. The best production systems combine vector search with traditional databases: Postgres for structured data and transactions, a vector store for semantic retrieval, and a search engine like Elasticsearch for full-text keyword queries. Knowing when to reach for each is the skill that separates a good AI engineer from someone who put a vector database in front of everything and wondered why the quality was poor.
