What is RAG? Retrieval-Augmented Generation Explained (2026)
RAG (Retrieval-Augmented Generation) gives a language model access to your own data before it answers. The model retrieves matching documents from a vector database, adds them to its prompt as context, and generates a response grounded in those documents. Here is how it works, what it costs, and when to use it.
On this page
- TL;DR — The 60-Second Pitch
- Prerequisites & Setup
- Why RAG Exists (The Story)
- The Mental Model
- Anatomy: Follow One Query
- Hands-On Level 1 — Toy Example
- Hands-On Level 2 — Realistic
- Hands-On Level 3 — Production
- Eight Pitfalls (Lessons from Production)
- RAG vs Fine-Tuning vs Long Context
- Cost & Performance Reality
- Going Further
- Curated Resources
- Self-Check
- FAQ
AI Fundamentals · Updated May 2026 · Reviewed by IITS Azure data engineering team · Den Haag
RAG = LLM + Your Data.
RAG (Retrieval-Augmented Generation) is a technique that gives a language model access to your own data before it answers. The model retrieves matching documents from a vector database, adds them to its prompt as context, and generates a response grounded in those documents. Introduced by Lewis et al. (May 2020), it is now the most common way to ship LLM applications that need to be correct, current, and customer-specific.
What it is
Search + LLM in one pipeline.
What it solves
LLMs do not know your private, current data.
Who needs it
Anyone shipping LLM apps with company data.
After this guide
You will ship a working RAG system in a day.
The RAG Pipeline
Five steps. The vector database does the heavy lifting — it finds documents by meaning, not keywords. The LLM never sees the whole knowledge base, only the slices that matter.
Prerequisites & Setup
You need basic Python (functions, virtual environments, pip), the ability to make an HTTP request, and an API key for OpenAI or Azure OpenAI. That is it.
Install once
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install openai chromadb sentence-transformers cohere langchain numpy
Set your API keys as environment variables (OPENAI_API_KEY, plus COHERE_API_KEY if you use re-ranking). Get an OpenAI key (or use Azure OpenAI in an EU region for production). Optionally, get a Cohere key for re-ranking.
Why RAG Exists (The Story)
In 2019, if you asked a language model "what is our company's refund policy?", you had three bad options:
- Fine-tune the model on your policy. Expensive ($1,000s), slow (days), and every policy update meant re-training the model.
- Paste the entire policy into every prompt. Worked for one document. Fell apart at the first thousand-page knowledge base.
- Live with hallucinations. The model invented refund policies that did not exist.
In May 2020, Lewis et al. at Facebook AI published "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". The idea was elegant: bolt a search engine onto a language model. For every question, search a knowledge base first, then feed the top results to the model along with the question.
The breakthrough insight
You do not need to teach the model your data. You just need to show it the relevant slice at answer time. That separation — between what the model knows and what you give it — is what makes RAG cheap, current, and auditable.
By 2024 every serious LLM product used RAG: ChatGPT for Business, Microsoft Copilot, Perplexity, Notion AI, Glean, Klarna's customer service, Stripe's docs assistant. By 2026 it is the default architecture for LLM apps with private data.
The Mental Model
The open-book exam analogy
An LLM without RAG is a student taking a closed-book exam — they can only use what they memorised during training. An LLM with RAG is taking an open-book exam — they still need to write a good answer, but they can look up the facts in your reference materials. Better notes (chunks), better search (vector DB), better answers.
Five terms you must internalise:
Embedding
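Technical: a fixed-length vector (typically 384 to 3,072 numbers) that captures the meaning of a piece of text. Friendly: a coordinate in "meaning space" where similar topics live close together. Common misconception: that embeddings are safe to share because they are not human-readable. They are not readable, but published inversion attacks can recover much of the original text from them, so treat embeddings as being as sensitive as the documents themselves.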
Vector database
Technical: a database optimised for nearest-neighbour search across millions of vectors using algorithms like HNSW or IVF. Friendly: a search engine that finds documents by meaning, not by keywords. Common misconception: you do not need a separate database — you can add vector search to Postgres with pgvector.
Chunking
Technical: splitting documents into smaller pieces (200–1,000 tokens each) before embedding. Friendly: cutting your knowledge base into bite-sized pages so search can find the right paragraph, not the whole book. Common misconception: bigger chunks are not better — they dilute meaning and stuff irrelevant content into the prompt.
Retrieval
Technical: the search step — finding the top-K chunks whose embeddings are closest (by cosine similarity) to the query embedding. Friendly: the "look it up" phase before the model answers. Common misconception: retrieval quality is what makes or breaks RAG, not the LLM. Most failures are retrieval failures.
Re-ranking
Technical: a second-pass model (usually a cross-encoder) that re-scores retrieved candidates by actual relevance, then keeps the top 3–5. Friendly: retrieval is the rough sort; re-ranking is the fine sort. Common misconception: re-ranking is optional — in production it is the single highest-leverage upgrade you can make.
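To make the chunking term concrete, here is a minimal sketch of fixed-size chunking with overlap. It is only an illustration: the `chunk_words` helper is hypothetical, and it counts whitespace-separated words as a rough stand-in for tokens, whereas a real splitter would count tokens and respect paragraph and sentence boundaries.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence cut at a
    boundary still appears whole in one of the two neighbouring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

Each chunk starts with the last word of the previous one. That shared edge is what the overlap setting buys you.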
Anatomy: Follow One Query
Let us trace a single query end-to-end. The user asks: "What is IITS's refund policy?"
Step 1 — Embed the question
The question is sent to an embedding model (e.g. text-embedding-3-small). Output: a 1,536-dimensional vector like [0.012, -0.831, 0.445, ...]. This vector represents the meaning of the question.
Step 2 — Search the vector database
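The vector database looks for the chunk vectors closest to the query vector. A brute-force search would compare the query against every chunk in your index; an approximate index such as HNSW visits only a small part of its graph, which is why the search takes milliseconds even on millions of chunks. It returns the top-K matches, e.g. K=20, ordered by cosine similarity.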
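The search step can be sketched as an exact, brute-force top-K lookup. This is what an index like HNSW approximates, not how a vector database is implemented. The three-dimensional vectors and chunk IDs below are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3):
    """Score every chunk against the query and return the k best (id, score) pairs."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: hypothetical chunk ids mapped to made-up vectors.
index = {
    "refund": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "support": [0.0, 0.2, 0.9],
}
hits = top_k([1.0, 0.2, 0.0], index, k=2)
for chunk_id, score in hits:
    print(chunk_id, round(score, 3))
```

The cost of this exact version grows linearly with the number of chunks, which is the reason production systems switch to an approximate index.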
Step 3 — Re-rank the candidates
A cross-encoder (e.g. Cohere Rerank or BGE Reranker) scores the 20 chunks by actual relevance to the question. Keep the top 3–5.
Step 4 — Build the augmented prompt
The final prompt is [system instructions] + [retrieved chunks] + [user question]. The system instructions usually say: "Answer only using the provided context. If the answer is not there, say so."
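As a rough sketch, the prompt assembly could look like the function below. The `build_messages` name, the chunk numbering, and the `---` separator are illustrative choices, not a fixed standard; the message layout follows the chat format used by the OpenAI client elsewhere in this guide.

```python
def build_messages(question: str, chunks: list[str]) -> list[dict]:
    """Assemble [system instructions] + [retrieved chunks] + [user question]."""
    system = (
        "Answer only using the provided context. "
        "If the answer is not there, say so. "
        "Cite the chunk numbers you used, e.g. [1]."
    )
    # Number the chunks so the model has something concrete to cite.
    context = "\n\n---\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


if __name__ == "__main__":
    messages = build_messages(
        "What is IITS's refund policy?",
        ["Returns are accepted within 30 days.", "Refunds are paid within 5 working days."],
    )
    print(messages[1]["content"])
```

Keeping the grounding rule in the system message and the retrieved text in the user message makes it easy to log exactly which chunks produced a given answer.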
Step 5 — Generate
The LLM (GPT-4o, Claude Sonnet, Gemini Pro, etc.) produces an answer grounded in the retrieved chunks — and crucially, can cite them. Without RAG: hallucination or "I don't know." With RAG: a quote from your actual policy.
Hands-On Level 1 — Toy Example
The minimum viable RAG. Twenty lines. In-memory. No vector database. Run it on your laptop right now.
import os, numpy as np
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
DOCS = [
"Our refund policy allows returns within 30 days of purchase.",
"Shipping is free for orders over 50 euros within the Netherlands.",
"Customer support is available Monday to Friday, 9:00 to 17:00 CET.",
"We accept iDEAL, credit card, and SEPA bank transfers.",
]
def embed(text):
    return client.embeddings.create(input=text, model="text-embedding-3-small").data[0].embedding
doc_vecs = np.array([embed(d) for d in DOCS])
question = "Can I return something I bought last week?"
q_vec = np.array(embed(question))
# Cosine similarity — find the most relevant doc
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
top_doc = DOCS[int(np.argmax(scores))]
prompt = f"Context: {top_doc}\n\nQuestion: {question}\n\nAnswer using only the context above."
answer = client.chat.completions.create(model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
print(answer.choices[0].message.content)
Expected output
Yes — our refund policy allows returns within 30 days of purchase, so a return for an item bought last week is within the allowed window.
What just happened
You embedded four documents, embedded a question, found the closest document by cosine similarity, and asked the LLM to answer using only that document. That is RAG in 20 lines. If you stop here, you understand more than most people who use the term.
Hands-On Level 2 — Realistic
Level 1 breaks down in four ways: (a) a real corpus does not fit in memory, (b) there is no chunking, (c) there is no re-ranking, (d) the prompt does not enforce grounding. Level 2 fixes all four with ChromaDB and Cohere Rerank.
import os, chromadb, cohere
from openai import OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
openai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
co = cohere.Client(os.environ["COHERE_API_KEY"])
db = chromadb.PersistentClient(path="./rag_db").get_or_create_collection("docs")
# 1. Chunk a real document (recursive splitter respects paragraphs and sentences)
with open("policy.md") as f:
    text = f.read()
# chunk_size and chunk_overlap are measured in characters by default
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(text)
# 2. Embed and store (skipped when the collection on disk is already populated)
if db.count() == 0:
    vectors = [openai.embeddings.create(input=c, model="text-embedding-3-small").data[0].embedding for c in chunks]
    db.add(ids=[str(i) for i in range(len(chunks))], embeddings=vectors, documents=chunks)
def ask(question):
    # 3. Retrieve top-20 candidates
    q_vec = openai.embeddings.create(input=question, model="text-embedding-3-small").data[0].embedding
    results = db.query(query_embeddings=[q_vec], n_results=20)
    candidates = results["documents"][0]
    # 4. Re-rank to top-3
    reranked = co.rerank(query=question, documents=candidates, top_n=3, model="rerank-english-v3.0")
    top_chunks = [candidates[r.index] for r in reranked.results]
    # 5. Augmented prompt with grounding instruction (chunks are numbered so the model can cite them)
    context = "\n\n---\n\n".join(f"[{i}] {c}" for i, c in enumerate(top_chunks, start=1))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say 'I don't have that information.' "
        f"Cite the relevant chunk number.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
print(ask("What is our refund policy for digital products?"))
What changed from Level 1
- Persistent storage: ChromaDB writes to disk so you do not re-embed on every restart.
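- Chunking: documents are split at natural boundaries with a 64-character overlap to preserve context. RecursiveCharacterTextSplitter counts characters by default; use its from_tiktoken_encoder constructor if you want sizes in tokens.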
- Two-stage retrieval: 20 fast candidates, re-ranked to 3 precise ones.
- Grounding instruction: the prompt explicitly tells the LLM not to make things up.
- Temperature 0: near-deterministic answers that stay closer to the context. It reduces variation but does not by itself prevent hallucinations.
Hands-On Level 3 — Production Pattern
In production you separate concerns, add observability, evaluate quality, and add hybrid (vector + keyword) search. Sketch of a production-shaped module:
# rag/pipeline.py
from dataclasses import dataclass
from typing import Sequence
import logging, time

@dataclass
class RetrievedChunk:
    text: str
    score: float
    source: str
    chunk_id: str
    metadata: dict

class RAGPipeline:
    def __init__(self, embedder, vector_store, keyword_store, reranker, llm, *, top_k=20, top_n=4):
        self.embedder = embedder
        self.vector_store = vector_store    # e.g. Azure AI Search, pgvector
        self.keyword_store = keyword_store  # e.g. BM25 over the same corpus
        self.reranker = reranker            # cross-encoder
        self.llm = llm
        self.top_k = top_k
        self.top_n = top_n
        self.log = logging.getLogger("rag")

    def ask(self, question: str, *, tenant_id: str, trace_id: str) -> dict:
        t0 = time.perf_counter()
        # 1. Hybrid retrieval: vector + keyword in parallel
        q_vec = self.embedder.embed(question)
        vec_hits = self.vector_store.search(q_vec, top_k=self.top_k, filter={"tenant": tenant_id})
        kw_hits = self.keyword_store.search(question, top_k=self.top_k, filter={"tenant": tenant_id})
        merged = self._reciprocal_rank_fusion(vec_hits, kw_hits)
        # 2. Re-rank
        reranked = self.reranker.rerank(question, merged, top_n=self.top_n)
        # 3. Fallback if nothing relevant
        if not reranked or reranked[0].score < 0.3:
            self.log.info("low_confidence_retrieval", extra={"trace_id": trace_id})
            return {"answer": "I don't have information on that.", "sources": [], "trace_id": trace_id}
        # 4. Augmented prompt
        context = self._format_context(reranked)
        prompt = self._build_prompt(question, context)
        answer = self.llm.generate(prompt, temperature=0, max_tokens=600)
        # 5. Observability
        self.log.info(
            "rag_query",
            extra={
                "trace_id": trace_id,
                "tenant_id": tenant_id,
                "latency_ms": int((time.perf_counter() - t0) * 1000),
                "n_retrieved": len(merged),
                "n_used": len(reranked),
                "top_score": reranked[0].score,
            },
        )
        return {
            "answer": answer,
            "sources": [{"id": c.chunk_id, "source": c.source, "score": c.score} for c in reranked],
            "trace_id": trace_id,
        }
Why hybrid search
Vector search can miss exact terms (SKU codes, error messages). BM25 catches them. Reciprocal Rank Fusion merges both rankings cheaply.
Why metadata filtering
Filter by tenant, date, permission level before vector search. Critical for multi-tenant or regulated apps.
Why a fallback
If the top score is below a threshold (e.g. 0.3), say "I don't know" instead of letting the LLM improvise.
Why structured logging
Every query logs latency, scores, chunk IDs, tenant. You will need this the first time a customer asks "why did your AI tell me X?"
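The pipeline sketch calls a `_reciprocal_rank_fusion` helper without showing it. One plausible version is below, written as a standalone function over lists of chunk IDs rather than hit objects, which is a simplification. The constant `k=60` is the value commonly used for Reciprocal Rank Fusion, and the alphabetical tie-break is an arbitrary choice to keep the output stable.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of ids into one ranking.

    Each id scores 1 / (k + rank) in every list it appears in (rank starts
    at 1); the scores are summed and ids are returned best first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]   # ids ranked by vector similarity
    keyword_hits = ["c", "a", "d"]  # ids ranked by BM25
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Only ranks are used, never raw scores, so the two retrievers do not need comparable scoring scales. That is why the fusion is cheap to add.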
Add an evaluation harness with RAGAS measuring faithfulness (is the answer grounded in the context?), answer relevancy, and context precision/recall. Run it on every change to your chunking, embeddings, or prompt.
Eight Pitfalls (Lessons from Production)
1. Chunk size hell
The trap: people pick "chunk every 1,000 characters" because it sounds round. The result: sentences cut in half, code blocks shredded, tables destroyed. The fix: use a recursive splitter that respects paragraphs, headings, sentences. Start at 512 tokens with 64 overlap. For markdown, use MarkdownHeaderTextSplitter.
2. Skipping re-ranking
The trap: "vector search is fast and free, I'll just retrieve top-5." The result: the right chunk is at position 7 and you miss it. The fix: retrieve top-20, re-rank to top-3 with Cohere Rerank or BGE Reranker. Typical quality lift: 20–30%.
3. Wrong embeddings for your domain
The trap: "OpenAI embeddings are the default, so they must be best." The result: mediocre retrieval on legal, medical, code, or multilingual data. The fix: benchmark Voyage AI (legal/code), BGE-M3 (multilingual), Cohere Embed v3 (general), and Azure-hosted models. Test on your queries.
4. No metadata filtering
The trap: one big vector index for all customers. The result: tenant A sees tenant B's data. Or last year's price list shows up in answers about current pricing. The fix: filter by tenant_id, valid_from, permission_level before the vector search. Every modern vector DB supports this.
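The filter-first rule from pitfall 4 can be sketched with a toy in-memory index. The field names `tenant_id` and `vector` and the `search` helper are hypothetical; in a real vector database you would pass the tenant as a metadata filter on the query instead of filtering in Python.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], chunks: list[dict], tenant_id: str, k: int = 3) -> list[dict]:
    """Restrict to the caller's tenant first, then rank what is left."""
    allowed = [c for c in chunks if c["tenant_id"] == tenant_id]
    ranked = sorted(allowed, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]


# Tenant b owns the closest vector, but tenant a must never see it.
chunks = [
    {"id": "a-1", "tenant_id": "a", "vector": [0.9, 0.1]},
    {"id": "b-1", "tenant_id": "b", "vector": [1.0, 0.0]},
    {"id": "a-2", "tenant_id": "a", "vector": [0.1, 0.9]},
]
results = search([1.0, 0.0], chunks, tenant_id="a", k=2)
print([c["id"] for c in results])
```

Because the filter runs before ranking, another tenant's chunk can never reach the prompt, however similar it is to the query.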
5. Treating retrieval quality as a black box
The trap: "the answer looks fine, ship it." The result: you cannot tell if a regression came from chunking, embeddings, re-ranking, or the prompt. The fix: build a golden set of 50–200 question/expected-answer pairs. Measure recall@K (did the right chunk show up at all?) and precision@K (how many of the K were useful?) on every change.
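The two metrics from pitfall 5 can be computed with a few lines of Python. This is a sketch under assumptions: the chunk IDs are invented, and `recall_at_k` here is the fraction of relevant chunks found, which reduces to the yes/no "did the right chunk show up?" check when each golden question has exactly one relevant chunk.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k slots that hold a relevant chunk."""
    if k <= 0:
        return 0.0
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


if __name__ == "__main__":
    # One golden-set entry: what the retriever returned vs. what a human marked relevant.
    retrieved = ["c7", "c2", "c9", "c4", "c1"]
    relevant = {"c2", "c4", "c8"}
    for k in (3, 5):
        print(k, recall_at_k(retrieved, relevant, k), precision_at_k(retrieved, relevant, k))
```

Average both numbers over the whole golden set and track them per change. A drop in recall points at chunking or embeddings, while a drop in precision points at ranking.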
6. Stuffing too many chunks into the prompt
The trap: "more context = better answer." The result: the model loses the thread, latency doubles, cost triples. The fix: 3–5 high-quality chunks beat 20 mediocre ones. If you need more, summarise first or use hierarchical retrieval.
7. No fallback when retrieval finds nothing
The trap: sending an empty context to the LLM with "answer this question." The result: the model improvises a plausible-sounding wrong answer. The fix: if the top retrieval score is below threshold, return a polite "I don't have that information." Or escalate to a human.
8. Forgetting that embedding model = locked-in cost
The trap: pick an embedding model, embed millions of chunks, ship. The result: a year later you want a better model, and switching means re-embedding every chunk, rebuilding the index, and re-running your evaluations. On a large corpus or with a premium model that is a real bill and days of work. The fix: benchmark embeddings before you scale. Once you commit, version your index so you can run two models side-by-side during migration.
Skim summary if you stop here
RAG is search + LLM. Most failures are retrieval failures, not LLM failures. Chunk well, re-rank always, evaluate constantly, fall back gracefully. You now know more about RAG than 95% of people shipping it.
RAG vs Fine-Tuning vs Long Context vs Cache-Augmented
| Dimension | RAG | Fine-tuning | Long context | Cache-augmented (CAG) |
|---|---|---|---|---|
| Best for | Knowledge that changes; large corpora; citations needed | Style, format, domain-specific reasoning | Single document Q&A; one-shot summarisation | Small, stable knowledge base (< 100K tokens) |
| Worst for | Skills that need new behaviours (tone, format) | Frequently changing facts | Corpora > 200K tokens; cost-sensitive workloads | Anything that updates often |
| Setup cost | Days | Weeks + dataset curation | Hours | Hours |
| Per-query cost | Low (roughly 0.2–1 cent) | Low | High (long context tokens add up fast) | Low (cached) |
| Update latency | Seconds (re-embed a doc) | Days (re-train) | Instant (just change the prompt) | Minutes (rebuild cache) |
| Auditability | High — cite chunks | Low — knowledge baked into weights | Medium — context visible but huge | Medium |
Honest verdict
Choose RAG for almost every business case involving private, changing, citation-worthy data. Choose fine-tuning when you need the model to behave differently (tone, structured output, domain reasoning) — and combine it with RAG. Choose long context when you have one big document and a small query volume. Choose CAG only for tiny, stable knowledge bases where re-using the same cached context across queries pays off.
Cost & Performance Reality
| Component | Cost (2026) | Latency | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02 / 1M tokens | ~50 ms | Default choice; 1,536 dims |
| OpenAI text-embedding-3-large | $0.13 / 1M tokens | ~80 ms | 3,072 dims; ~5% better quality |
| Voyage voyage-3-large | $0.12 / 1M tokens | ~60 ms | Strong on code, legal, finance |
| Cohere Rerank v3 | $2 / 1K searches | ~150 ms | Reranks 100 docs per search |
| pgvector on your Postgres | Free (compute only) | 5–50 ms | Best up to ~10M vectors |
| Qdrant Cloud (managed) | From €25 / month | 10–30 ms | Strong filtering, open source |
| Pinecone Standard | From $70 / month | 10–30 ms | Serverless options available |
| Azure AI Search (Basic) | From €70 / month | 20–80 ms | EU-region, integrates with Azure OpenAI |
| GPT-4o-mini answer | $0.15 / $0.60 per 1M tokens (in/out) | 300–800 ms | Default for most RAG |
| Claude Sonnet 4 answer | $3 / $15 per 1M tokens (in/out) | 500–1500 ms | Best for nuanced grounded reasoning |
Typical end-to-end query (production)
Embed (50 ms) + vector search (20 ms) + rerank (150 ms) + LLM (600 ms) = 820 ms, so roughly 800 ms p50 latency, at ~$0.002–$0.01 per query. The hidden cost most teams miss: re-embedding the corpus when you upgrade models. Embedding 10M chunks at 200 tokens each = 2B tokens × $0.02/1M = ~$40 per re-embed with text-embedding-3-small, or ~$260 at the $0.13/1M rate of text-embedding-3-large, plus the time to rebuild the index. Budget for it.
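The arithmetic above can be reproduced with a short script, which makes it easy to plug in your own corpus size. Prices and latencies come from the table; the 2,000 input tokens and 300 output tokens per query are assumed figures for illustration, not measurements.

```python
def embedding_cost_usd(n_chunks: int, tokens_per_chunk: int, usd_per_million_tokens: float) -> float:
    """Cost of embedding the whole corpus once."""
    total_tokens = n_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


def query_cost_usd(input_tokens: int, output_tokens: int,
                   usd_in_per_million: float, usd_out_per_million: float,
                   rerank_usd_per_search: float) -> float:
    """LLM tokens plus one rerank call; query embedding cost is negligible and ignored."""
    llm = (input_tokens * usd_in_per_million + output_tokens * usd_out_per_million) / 1_000_000
    return llm + rerank_usd_per_search


# Latency budget from the table: embed + vector search + rerank + LLM.
latency_ms = 50 + 20 + 150 + 600

# Re-embedding 10M chunks of 200 tokens with the small and the large model.
reembed_small = embedding_cost_usd(10_000_000, 200, 0.02)
reembed_large = embedding_cost_usd(10_000_000, 200, 0.13)

# Assumed query shape: 2,000 prompt tokens, 300 answer tokens, GPT-4o-mini, one rerank.
per_query = query_cost_usd(2_000, 300, 0.15, 0.60, 2 / 1_000)

print(f"latency ~{latency_ms} ms")
print(f"re-embed: ${reembed_small:,.0f} (small) vs ${reembed_large:,.0f} (large)")
print(f"per query: ${per_query:.4f}")
```

With these assumptions the rerank call, not the LLM, dominates the per-query cost. Check that against your own traffic before optimising prompts.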
Going Further
Tomorrow
Run Level 1 against 50 of your team's documents. Measure: does it find the right one for 10 real questions? You will learn more from one afternoon of testing than a week of reading.
Next month
Ship Level 2 to staging. Add a RAGAS evaluation harness. A/B test two embedding models on your real queries. Decide between pgvector (if you already have Postgres) and Qdrant/Pinecone (if you need scale).
Long-term
Explore GraphRAG (Microsoft), Contextual Retrieval (Anthropic), HyDE, RAG-Fusion, and Adaptive RAG. Read the original Lewis et al. paper and follow Eugene Yan's RAG patterns.
Curated Resources
- Lewis et al. (2020) — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (the original paper)
- Anthropic — Contextual Retrieval (a 2024 technique that cuts retrieval failures by 49%)
- Eugene Yan — Patterns for Building LLM-based Systems & Products (the most-cited practical guide)
- RAGAS — RAG evaluation library
- LlamaIndex docs (the most RAG-focused framework)
- Azure AI Search RAG overview (production patterns for Microsoft stack)
Self-Check: Have You Learned It?
1. Why does RAG usually beat fine-tuning for company knowledge?
Answer: Company knowledge changes constantly (policies, prices, employees). Fine-tuning bakes facts into model weights — every change means re-training. RAG lets you re-embed a single document in seconds and the model sees the new version on the next query.
2. What is the difference between retrieval and re-ranking?
Answer: Retrieval is fast and approximate — it finds ~20 candidate chunks via vector similarity in tens of milliseconds. Re-ranking is slower and precise — it uses a cross-encoder to score actual relevance of those candidates, then keeps the top 3–5. Without re-ranking, the most relevant chunk is often in your top-20 but not at position 1.
3. Your RAG system returns chunks but the answers are wrong. Where do you look first?
Answer: Inspect the actual chunks being passed to the LLM. 80% of the time the answer is: the right chunk was not retrieved (chunking too aggressive, wrong embedding model, missing metadata filter, no re-ranking). 15%: the chunks are right but the prompt does not enforce grounding. Only 5% is "the LLM hallucinated despite good context."
4. You serve 10 customers from one RAG. How do you prevent customer A from seeing customer B's documents?
Answer: Metadata filter by tenant_id applied before vector search. Every chunk's metadata includes the tenant; every query passes the requesting tenant; the vector DB only searches within that tenant's vectors. Never rely on the LLM to filter — filter at the database level.
5. What is wrong with this code?
chunks = db.query(query_embeddings=[q], n_results=50)
context = "\n".join(chunks)
prompt = f"Answer: {question}\nContext: {context}"
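Answer: Three main issues. (1) 50 chunks is too many, and the LLM loses focus. Retrieve 20, re-rank to 3–5. (2) No grounding instruction: the LLM is not told to answer only from the context. (3) No fallback if scores are low, so the model will improvise. Also: with a Chroma collection, db.query returns a dict of result lists rather than a list of strings, so the chunk texts must be read from chunks["documents"][0] before joining; and "\n".join destroys chunk boundaries, so use "\n\n---\n\n".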
6. When would you choose long-context LLMs over RAG?
Answer: When you have one document (under 200K tokens), low query volume, and need the model to reason across the whole thing. Examples: summarising a single annual report, answering questions about one long contract, code review of a single repo. For anything beyond that, RAG is cheaper, faster, and more accurate.
7. Design a RAG for an internal HR assistant that handles 200 employees' personal data. What three things would you do differently from a generic RAG?
Answer: (1) Per-employee metadata filtering — every chunk tagged with employee_id, query filtered so the LLM only sees the asking employee's data. (2) Run everything in your own Azure tenant in EU-region — Azure OpenAI for the LLM, Azure AI Search for vectors, no third-party APIs. (3) Strict audit logging — every query logs the user, the chunks retrieved, the prompt, and the answer, retained per AVG/GDPR rules. Plus a deny-list for sensitive actions ("never reveal another employee's salary").
Frequently Asked Questions
What does RAG stand for?
RAG stands for Retrieval-Augmented Generation. It was introduced by Lewis et al. at Facebook AI Research in May 2020.
When should I use RAG instead of fine-tuning?
Use RAG when data changes often, you need citations, or you have more than a few thousand documents. Use fine-tuning for behaviour the model needs to learn (style, format, domain reasoning).
Do I need a vector database to do RAG?
For prototypes, no — NumPy in memory works. For production, yes — pgvector, Qdrant, Pinecone, Azure AI Search, Weaviate, Milvus, or Chroma.
How much does RAG cost?
$0.02 per 1M tokens for OpenAI embeddings (2026), from about €25 to $70 per month for hosted vector DBs, $0.002–$0.01 per query end-to-end. Hidden cost: re-embedding when you upgrade models.
Why does RAG sometimes give wrong answers?
Usually retrieval failure: wrong chunk retrieved, no re-ranking, no metadata filter, or no grounding instruction in the prompt. Inspect the chunks before blaming the LLM.
What is the difference between RAG and an AI agent?
RAG is a retrieval pattern. An agent is a system that decides which tools (including RAG) to call. Most production agents use RAG as one of their tools.
What is chunking in RAG?
Splitting documents into smaller pieces (typically 200–1,000 tokens) before embedding. Good chunking respects paragraphs, headings, and code blocks.
What is re-ranking and do I need it?
A second-pass model that scores top-20 retrievals for actual relevance and keeps the best 3–5. Lifts retrieval quality 20–30% in production. Yes, you almost always need it.
Can RAG work with private or compliance-sensitive data?
Yes — that is its strongest use case. Run embeddings and vector DB in your own Azure/AWS region, keep documents in your tenant, route LLM through Azure OpenAI or a self-hosted model.
What is the difference between RAG and long-context LLMs?
Long-context lets you paste one big document into a prompt. RAG scales to millions of documents, costs less per query, and stays accurate beyond 100K tokens of context.
Next read: how RAG fits into the full agent stack
RAG is one of seven skills a production agent engineer needs. To see how retrieval connects with tool contracts, reliability, security, and observability, read From Prompt Engineer to Agent Engineer: The 7 Skills That Actually Matter.