What is RAG (Retrieval-Augmented Generation)? Complete Technical Guide
RAG solves the two biggest problems with LLMs — outdated knowledge and hallucination. This guide explains the full RAG pipeline: chunking, embedding, vector search, context injection, and how to build it in production.
The Problem RAG Solves
Large Language Models have two fundamental limitations that make them unreliable for real-world business applications.
First, knowledge cutoff. Every LLM is trained on data up to a fixed cutoff date — for the GPT-4 family, late 2023. Ask it about a policy change, a new product feature, or last week's meeting notes — it either refuses to answer or hallucinates a plausible fiction.
Second, no access to private data. Your internal documentation, customer history, product specs, and business processes don't exist in an LLM's weights. The model was trained on publicly available internet text. It cannot know your specific business.
Retrieval-Augmented Generation (RAG) solves both problems by retrieving relevant information from an external knowledge base at inference time and injecting it into the model's context window before asking the model to answer. The model reasons over your specific, up-to-date documents rather than relying on memorised training data.
The term was introduced in a 2020 paper by Lewis et al. at Meta AI: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".
The Complete RAG Pipeline
RAG consists of two distinct phases: an offline indexing phase where documents are processed and stored, and an online retrieval and generation phase that runs on every user query.
Phase 1: Indexing (Offline)
Step 1: Document Loading
The pipeline begins by loading documents from their sources: PDFs, Word documents, web pages, Notion databases, Confluence spaces, Google Docs, database records, Markdown files. Each source requires a loader that extracts the raw text content. LangChain provides loaders for 50+ sources out of the box.
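At its core, a loader just walks a source and keeps the raw text alongside source metadata. The sketch below is a hypothetical stand-in for a framework loader (it is not a LangChain API), limited to Markdown files on disk:

```python
from pathlib import Path

def load_markdown_documents(root: str) -> list[dict]:
    """Load all Markdown files under a directory into text + metadata records.
    Minimal stand-in for a framework loader; real loaders also handle
    PDFs, HTML, and API-backed sources."""
    docs = []
    for path in Path(root).rglob("*.md"):
        docs.append({
            "text": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return docs
```

The metadata attached here travels with every downstream chunk, which is what makes source citation possible at answer time.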
Step 2: Chunking
Raw documents are split into smaller pieces called chunks. This is one of the most consequential design decisions in a RAG system. Chunks that are too large waste context window space and may dilute the relevant content. Chunks that are too small may lose essential context, breaking mid-sentence or cutting off critical information.
Common chunking strategies:
- Fixed-size chunking: Split every 512 tokens with a 50-token overlap. Simple, predictable, context-unaware.
- Recursive character splitting: Try splitting on paragraph breaks, then sentence breaks, then word breaks — preserving natural language structure.
- Semantic chunking: Use embedding similarity to identify natural topic boundaries in the text. More expensive but produces more coherent chunks.
- Document-structure-aware chunking: Respect headings, sections, and table boundaries. A section with a heading, body, and table should stay together.
In practice, 512–1024 tokens per chunk with 10–15% overlap is a reasonable default for most use cases. Always include metadata in each chunk: source document name, page number, section title, creation date.
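The simplest of these strategies, fixed-size chunking with overlap, fits in a few lines. The sketch below operates on a pre-tokenised list, so any tokenizer (whitespace words, tiktoken tokens) can feed it:

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into fixed-size chunks, each sharing `overlap`
    tokens with the previous chunk so no sentence is lost at a boundary."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

The overlap means the last 50 tokens of one chunk reappear as the first 50 of the next — cheap insurance against cutting a sentence in half at exactly the wrong place.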
Step 3: Embedding
Each chunk is converted into a high-dimensional vector called an embedding using an embedding model. An embedding is a dense numerical representation of the chunk's semantic meaning — typically 768 to 3072 floating-point numbers.
The critical property of embeddings is that semantically similar text produces similar vectors. "How do I reset my password?" and "I forgot my login credentials" will have very similar embedding vectors, even though they share no words. This is what enables semantic search.
Popular embedding models:
- OpenAI text-embedding-3-large: 3072 dimensions, excellent quality, $0.13 per million tokens.
- OpenAI text-embedding-3-small: 1536 dimensions, good quality, $0.02 per million tokens.
- Cohere embed-v3: Specialised for retrieval, supports 100+ languages.
- BAAI/bge-m3: Open-source, multilingual, runs locally, competitive with OpenAI on benchmarks.
- Nomic embed-text: 768 dimensions, open-source, fast, good for local deployments.
Embedding 10,000 chunks of 512 tokens each with OpenAI text-embedding-3-small costs approximately $0.10. Compare that to the ongoing cost of LLM inference — embeddings are cheap.
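The arithmetic is easy to sanity-check:

```python
chunks = 10_000
tokens_per_chunk = 512
price_per_million = 0.02  # USD per million tokens, text-embedding-3-small

total_tokens = chunks * tokens_per_chunk            # 5,120,000 tokens
cost = total_tokens / 1_000_000 * price_per_million
print(f"${cost:.2f}")  # → $0.10
```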
Step 4: Vector Storage
The resulting vectors, along with the original chunk text and metadata, are stored in a vector database. The vector database builds an index over the vectors that enables fast approximate nearest-neighbour (ANN) search — finding the K chunks whose embeddings are most similar to a query embedding.
Options: pgvector (PostgreSQL extension), Pinecone (fully managed), Qdrant (open-source, high-performance), Weaviate, Chroma (local development). More on vector databases in the next article.
Phase 2: Retrieval and Generation (Online, per query)
Step 5: Query Embedding
When a user submits a query — "What is the return policy for damaged items?" — the query is embedded using the same embedding model used during indexing. This produces a query vector in the same semantic space as the document chunks.
Step 6: Vector Similarity Search
The vector database performs an approximate nearest-neighbour search, finding the K chunks whose embedding vectors are most similar to the query vector. Similarity is typically measured using cosine similarity — the cosine of the angle between two vectors in the embedding space. Values range from -1 (opposite) to 1 (identical).
A cosine similarity above 0.85 is generally a strong semantic match. Below 0.70, the retrieved content may be irrelevant. Setting a similarity threshold prevents the model from receiving garbage context when the knowledge base has no relevant information.
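As an illustrative sketch with toy 2-dimensional vectors (real embeddings have hundreds to thousands of dimensions), similarity search with a threshold reduces to scoring, sorting, and filtering. A vector database does the same thing with an ANN index instead of a full scan:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, k=3, threshold=0.70):
    """Score every stored (vector, chunk) pair against the query and keep
    the top K results that clear the similarity threshold."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for vec, chunk in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, chunk) for score, chunk in scored[:k] if score >= threshold]
```

With no chunk above the threshold, `search` returns an empty list — which the application should translate into "I don't have information about that" rather than forcing an answer.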
Step 7: Re-ranking (Optional but recommended)
Vector similarity search retrieves semantically similar chunks but is not always optimally ordered by relevance to the specific query. A re-ranker (cross-encoder model) takes the query and each retrieved chunk as a pair and scores their relevance more precisely. Cohere Rerank, BGE-Reranker, and Jina Reranker are popular options.
Re-ranking adds latency (~100-300ms) but significantly improves the quality of the top-K results, especially when the query is ambiguous or the knowledge base is large.
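Structurally, re-ranking is just a re-ordering pass over the retrieved candidates. In the sketch below, `score_fn` stands in for a real cross-encoder such as Cohere Rerank; the lexical-overlap scorer is only a placeholder to make the example self-contained:

```python
def rerank(query: str, chunks: list[str], score_fn, top_k: int = 3) -> list[str]:
    """Re-order retrieved chunks by a query-specific relevance score
    and keep the best top_k."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_k]

def overlap_score(query: str, chunk: str) -> float:
    """Placeholder scorer: fraction of query words present in the chunk.
    A real cross-encoder scores the (query, chunk) pair jointly."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)
```

The key design point is that the re-ranker sees the query and chunk together, so it can catch relevance signals that two independently computed embeddings miss.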
Step 8: Context Assembly and Prompt Construction
The top-K retrieved chunks (after re-ranking) are assembled into a context block that is inserted into the LLM prompt. A standard RAG prompt template looks like:
System: You are a helpful assistant. Answer questions using ONLY the provided context.
If the answer is not in the context, say "I don't have information about that."
Context:
[CHUNK 1 TEXT] (Source: returns-policy.pdf, Page 3)
[CHUNK 2 TEXT] (Source: faq.md, Section: Returns)
[CHUNK 3 TEXT] (Source: terms.pdf, Page 8)
User: What is the return policy for damaged items?
The instruction "Answer using ONLY the provided context" is critical. Without it, the model may supplement retrieved information with hallucinated training knowledge.
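Prompt assembly itself is plain string building. The sketch below produces an OpenAI-style message list, assuming each retrieved chunk carries a `source` field from its indexing-time metadata:

```python
def build_rag_prompt(question: str, chunks: list[dict]) -> list[dict]:
    """Assemble retrieved chunks into a grounded chat prompt."""
    context = "\n\n".join(f"{c['text']} (Source: {c['source']})" for c in chunks)
    system = (
        "You are a helpful assistant. Answer questions using ONLY the provided context.\n"
        'If the answer is not in the context, say "I don\'t have information about that."\n\n'
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

Keeping the source label next to each chunk lets the model cite its sources in the answer with no extra machinery.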
Step 9: Generation
The LLM generates an answer grounded in the retrieved context. Because the relevant information is present in the context window, the model does not need to rely on memorised training data. Hallucinations drop dramatically — in production RAG systems, hallucination rates on in-scope queries typically fall below 5%.
Advanced RAG Patterns
HyDE (Hypothetical Document Embedding)
Instead of embedding the raw user query, ask the LLM to generate a hypothetical answer first, then embed that hypothetical answer. The hypothetical answer will use the same vocabulary and phrasing as real documents, producing a better query vector. This significantly improves retrieval recall for vague or short queries.
Multi-Query Retrieval
Generate 3-5 paraphrases of the user query using an LLM, retrieve chunks for each paraphrase, and take the union (deduplicated). Different phrasings retrieve different relevant chunks, improving recall.
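The merge step can be sketched as follows, assuming a `retrieve(query)` function that wraps query embedding plus vector search:

```python
def multi_query_retrieve(queries: list[str], retrieve) -> list[str]:
    """Retrieve chunks for each paraphrase and union the results,
    deduplicating while preserving first-seen order."""
    seen, merged = set(), []
    for q in queries:
        for chunk in retrieve(q):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```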
Parent Document Retrieval
Index small chunks (for precise semantic matching) but retrieve larger parent chunks (for richer context). When a small chunk matches, return the surrounding 2-3 chunks for better context.
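Assuming chunks are stored in document order, expanding a hit to its neighbours is a window slice — one simple way to implement the pattern:

```python
def expand_to_parent(chunks: list[str], hit_index: int, window: int = 1) -> str:
    """Return the matched chunk joined with `window` neighbours on each
    side, clamped to the document boundaries."""
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return " ".join(chunks[start:end])
```

The small chunk gives precise matching; the expanded window gives the LLM enough surrounding text to actually answer from.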
Self-RAG
The LLM itself decides when to retrieve, what to retrieve, and whether the retrieved context is sufficient — using special reflection tokens. Produces more targeted retrieval with less unnecessary context.
Graph RAG
Build a knowledge graph from documents, where entities (people, products, concepts) are nodes and their relationships are edges. Query the graph for entity relationships, then retrieve relevant sub-graphs. Microsoft's GraphRAG paper shows significant improvements over naive RAG on complex, multi-hop questions.
RAG Evaluation: How to Know if Your System Works
RAG systems must be evaluated rigorously. Key metrics:
- Context Recall: Was the relevant information actually retrieved? Measures retrieval quality.
- Context Precision: How much of the retrieved context was relevant? Measures retrieval noise.
- Answer Faithfulness: Is the answer grounded in the retrieved context, or did the model hallucinate?
- Answer Relevance: Does the answer actually address the user's question?
RAGAS (RAG Assessment) is the standard open-source framework for automated RAG evaluation. It uses an LLM to score each of these metrics over a test set of questions and ground-truth answers.
Real-World RAG Architecture
In production, a RAG system serving a business chatbot might look like this:
- Documents stored in S3 or Google Drive, synced via webhook to a processing queue
- Python worker (FastAPI + Celery) that chunks, embeds, and upserts to Qdrant on each document change
- Laravel API endpoint that receives user queries
- Query embedding via OpenAI, vector search via Qdrant, re-ranking via Cohere
- GPT-4o generates the answer with sources cited
- Response streamed back to the UI via Server-Sent Events
- All queries and responses logged to PostgreSQL for quality review and RAGAS evaluation
This is the architecture I use in production RAG chatbots. Latency from query to first streamed token is typically 800ms–1.5s, which feels instant to users.
Senior Full Stack Developer — Laravel, Vue.js, Nuxt.js & AI. Available for freelance projects.