What is RAG? Retrieval-Augmented Generation Explained (2026)

RAG (Retrieval-Augmented Generation) gives a language model access to your own data before it answers. The model retrieves matching documents from a vector database, adds them to its prompt as context, and generates a response grounded in those documents. Here is how it works, what it costs, and when to use it.

25 min read·2026-05-26·Foundation

AI Fundamentals · Updated May 2026 · Reviewed by IITS Azure data engineering team · Den Haag

RAG = LLM + Your Data.

Introduced by Lewis et al. (May 2020), RAG is now the most common way to ship LLM applications that need to be correct, current, and customer-specific.

What it is

Search + LLM in one pipeline.

What it solves

LLMs do not know your private, current data.

Who needs it

Anyone shipping LLM apps with company data.

After this guide

You will ship a working RAG system in a day.

The RAG Pipeline

1. Query: the user asks "What's our refund policy?"
2. Embed: the text becomes a vector, e.g. [0.12, -0.83, ...]
3. Vector database (search by meaning): find the top-K chunks closest to the query vector (pgvector · Qdrant · Pinecone · Azure AI Search)
4. Top-K chunks: the 3–5 most relevant document chunks
5. LLM: generate an answer grounded in those chunks and return it to the user

Five steps. The vector database does the heavy lifting — it finds documents by meaning, not keywords. The LLM never sees the whole knowledge base, only the slices that matter.

Prerequisites & Setup

You need basic Python (functions, virtual environments, pip), the ability to make an HTTP request, and an API key for OpenAI or Azure OpenAI. That is it.

Install once

python -m venv .venv
source .venv/bin/activate     # Windows: .venv\Scripts\activate
pip install openai chromadb sentence-transformers cohere langchain-text-splitters numpy

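Set your API keys as environment variables: the examples below read OPENAI_API_KEY and, for re-ranking in Level 2, COHERE_API_KEY. Get an OpenAI key (or use Azure OpenAI in an EU region for production). The Cohere key is only needed once you add re-ranking.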

Why RAG Exists (The Story)

In 2019, if you asked a language model "what is our company's refund policy?", you had three bad options:

  • Fine-tune the model on your policy. Expensive ($1,000s), slow (days), and every policy update meant re-training the model.
  • Paste the entire policy into every prompt. Worked for one document. Fell apart at the first thousand-page knowledge base.
  • Live with hallucinations. The model invented refund policies that did not exist.

In May 2020, Lewis et al. at Facebook AI published "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". The idea was elegant: bolt a search engine onto a language model. For every question, search a knowledge base first, then feed the top results to the model along with the question.

The breakthrough insight

You do not need to teach the model your data. You just need to show it the relevant slice at answer time. That separation — between what the model knows and what you give it — is what makes RAG cheap, current, and auditable.

By 2024 RAG had become standard in LLM products that answer from private or current data: ChatGPT for Business, Microsoft Copilot, Perplexity, Notion AI, Glean, Klarna's customer service, Stripe's docs assistant. By 2026 it is the default architecture for LLM apps with private data.

The Mental Model

The open-book exam analogy

An LLM without RAG is a student taking a closed-book exam — they can only use what they memorised during training. An LLM with RAG is taking an open-book exam — they still need to write a good answer, but they can look up the facts in your reference materials. Better notes (chunks), better search (vector DB), better answers.

Five terms you must internalise:

Embedding

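Technical: a fixed-length vector (typically 384 to 3,072 numbers) that captures the meaning of a piece of text. Friendly: a coordinate in "meaning space" where similar topics live close together. Common misconception: that you can read an embedding. You cannot: the numbers are not human-readable, and no single number stands for a word or topic. Treat them as sensitive data all the same, since research has shown that text can sometimes be approximately reconstructed from embeddings.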

Vector database

Technical: a database optimised for nearest-neighbour search across millions of vectors using algorithms like HNSW or IVF. Friendly: a search engine that finds documents by meaning, not by keywords. Common misconception: that you need a separate, dedicated database. You can add vector search to Postgres with pgvector.

Chunking

Technical: splitting documents into smaller pieces (200–1,000 tokens each) before embedding. Friendly: cutting your knowledge base into bite-sized pages so search can find the right paragraph, not the whole book. Common misconception: that bigger chunks are better. They dilute meaning and stuff irrelevant content into the prompt.
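
To make the overlap idea concrete, here is a minimal sketch of a sliding-window chunker. It counts words as a rough stand-in for tokens, and the function name and default sizes are illustrative assumptions, not part of any library. A production splitter would also respect paragraph and heading boundaries.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into windows of `chunk_size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input: twenty placeholder words, "w1 w2 ... w20".
sample = " ".join(f"w{i}" for i in range(1, 21))
chunks = chunk_words(sample)
for chunk in chunks:
    print(chunk)
```

The last two words of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully present in at least one chunk.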

Retrieval

Technical: the search step, which finds the top-K chunks whose embeddings are closest (by cosine similarity) to the query embedding. Friendly: the "look it up" phase before the model answers. Common misconception: that the LLM makes or breaks RAG. Retrieval quality does; most failures are retrieval failures.
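
As a rough illustration of top-K retrieval, the sketch below ranks a handful of hand-made vectors by cosine similarity using only the standard library. The three-dimensional vectors and chunk texts are invented for the example; real embeddings come from a model and have hundreds or thousands of dimensions, and a vector database replaces this brute-force loop with an index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return (index, score) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (assumed values, not produced by any real embedding model).
chunk_texts = ["refund policy", "shipping costs", "support hours"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.2], [0.0, 0.2, 0.9]]
query_vec = [0.8, 0.2, 0.1]

results = top_k(query_vec, chunk_vecs, k=2)
for index, score in results:
    print(f"{chunk_texts[index]}: {score:.3f}")
```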

Re-ranking

Technical: a second-pass model (usually a cross-encoder) that re-scores retrieved candidates by actual relevance, then keeps the top 3–5. Friendly: retrieval is the rough sort; re-ranking is the fine sort. Common misconception: re-ranking is optional — in production it is the single highest-leverage upgrade you can make.

Anatomy: Follow One Query

Let us trace a single query end-to-end. The user asks: "What is IITS's refund policy?"

Step 1 — Embed the question

The question is sent to an embedding model (e.g. text-embedding-3-small). Output: a 1,536-dimensional vector like [0.012, -0.831, 0.445, ...]. This vector represents the meaning of the question.

Step 2 — Search the vector database

The vector database finds the chunk vectors closest to the query vector. With an approximate index such as HNSW it does not compare the query against every vector; it walks a graph of neighbours, which takes milliseconds even on millions of chunks. It returns the top-K matches, e.g. K=20, ordered by cosine similarity.

Step 3 — Re-rank the candidates

A cross-encoder (e.g. Cohere Rerank or BGE Reranker) scores the 20 chunks by actual relevance to the question. Keep the top 3–5.

Step 4 — Build the augmented prompt

The final prompt is [system instructions] + [retrieved chunks] + [user question]. The system instructions usually say: "Answer only using the provided context. If the answer is not there, say so."
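
The prompt assembly described above can be sketched as a small helper that returns chat-style messages. The function name, the chunk numbering scheme, and the separator are assumptions made for illustration; the system text reuses the grounding instruction quoted above.

```python
def build_messages(question, chunks):
    """Combine system instructions, numbered context chunks, and the user question."""
    system = (
        "Answer only using the provided context. "
        "If the answer is not there, say so. Cite chunk numbers like [1]."
    )
    # Number each chunk so the model has something concrete to cite.
    context = "\n\n---\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_messages(
    "What is the refund window?",
    ["Returns are accepted within 30 days of purchase.", "Shipping is free over 50 euros."],
)
print(messages[1]["content"])
```

Keeping the instructions in the system message and the retrieved text in the user message makes it easier to log and inspect exactly which chunks the model saw.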

Step 5 — Generate

The LLM (GPT-4o, Claude Sonnet, Gemini Pro, etc.) produces an answer grounded in the retrieved chunks — and crucially, can cite them. Without RAG: hallucination or "I don't know." With RAG: a quote from your actual policy.

Hands-On Level 1 — Toy Example

The minimum viable RAG. Twenty lines. In-memory. No vector database. Run it on your laptop right now.

import os, numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Shipping is free for orders over 50 euros within the Netherlands.",
    "Customer support is available Monday to Friday, 9:00 to 17:00 CET.",
    "We accept iDEAL, credit card, and SEPA bank transfers.",
]

def embed(text):
    return client.embeddings.create(input=text, model="text-embedding-3-small").data[0].embedding

doc_vecs = np.array([embed(d) for d in DOCS])
question = "Can I return something I bought last week?"
q_vec = np.array(embed(question))

# Cosine similarity — find the most relevant doc
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
top_doc = DOCS[int(np.argmax(scores))]

prompt = f"Context: {top_doc}\n\nQuestion: {question}\n\nAnswer using only the context above."
answer = client.chat.completions.create(model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
print(answer.choices[0].message.content)

Expected output

Yes — our refund policy allows returns within 30 days of purchase, so a return for an item bought last week is within the allowed window.

What just happened

You embedded four documents, embedded a question, found the closest document by cosine similarity, and asked the LLM to answer using only that document. That is RAG in 20 lines. If you stop here, you understand more than most people who use the term.

Hands-On Level 2 — Realistic

Level 1 collapses at four problems: (a) the documents do not fit in memory, (b) there is no chunking, (c) there is no re-ranking, (d) the prompt does not enforce grounding. Level 2 fixes all four with ChromaDB and Cohere Rerank.

import os, chromadb, cohere
from openai import OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

openai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
co = cohere.Client(os.environ["COHERE_API_KEY"])
db = chromadb.PersistentClient(path="./rag_db").get_or_create_collection("docs")

# 1. Chunk a real document (recursive splitter respects paragraphs and sentences)
with open("policy.md") as f:
    text = f.read()
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(text)

# 2. Embed and store (only when the collection is empty; later runs reuse the index on disk)
if db.count() == 0:
    vectors = [openai.embeddings.create(input=c, model="text-embedding-3-small").data[0].embedding for c in chunks]
    db.add(ids=[str(i) for i in range(len(chunks))], embeddings=vectors, documents=chunks)

def ask(question):
    # 3. Retrieve top-20 candidates
    q_vec = openai.embeddings.create(input=question, model="text-embedding-3-small").data[0].embedding
    results = db.query(query_embeddings=[q_vec], n_results=20)
    candidates = results["documents"][0]

    # 4. Re-rank to top-3
    reranked = co.rerank(query=question, documents=candidates, top_n=3, model="rerank-english-v3.0")
    top_chunks = [candidates[r.index] for r in reranked.results]

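    # 5. Augmented prompt with grounding instruction (chunks are numbered so the model can cite them)
    context = "\n\n---\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(top_chunks, start=1))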
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say 'I don't have that information.' "
        f"Cite the relevant chunk number.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(ask("What is our refund policy for digital products?"))

What changed from Level 1

  • Persistent storage: ChromaDB writes to disk so you do not re-embed on every restart.
  • Chunking: documents are split at natural boundaries with a 64-character overlap to preserve context (this splitter measures chunk_size and chunk_overlap in characters by default, not tokens).
  • Two-stage retrieval: 20 fast candidates, re-ranked to 3 precise ones.
  • Grounding instruction: the prompt explicitly tells the LLM not to make things up.
  • Temperature 0: more repeatable answers and less creative drift, although output is not guaranteed to be identical across runs.

Hands-On Level 3 — Production Pattern

In production you separate concerns, add observability, evaluate quality, and add hybrid (vector + keyword) search. Sketch of a production-shaped module:

# rag/pipeline.py
from dataclasses import dataclass
from typing import Sequence
import logging, time

@dataclass
class RetrievedChunk:
    text: str
    score: float
    source: str
    chunk_id: str
    metadata: dict

class RAGPipeline:
    def __init__(self, embedder, vector_store, keyword_store, reranker, llm, *, top_k=20, top_n=4):
        self.embedder = embedder
        self.vector_store = vector_store      # e.g. Azure AI Search, pgvector
        self.keyword_store = keyword_store    # e.g. BM25 over the same corpus
        self.reranker = reranker              # cross-encoder
        self.llm = llm
        self.top_k = top_k
        self.top_n = top_n
        self.log = logging.getLogger("rag")

    def ask(self, question: str, *, tenant_id: str, trace_id: str) -> dict:
        t0 = time.perf_counter()

        # 1. Hybrid retrieval: vector + keyword (sequential in this sketch; issue both calls concurrently in production)
        q_vec = self.embedder.embed(question)
        vec_hits = self.vector_store.search(q_vec, top_k=self.top_k, filter={"tenant": tenant_id})
        kw_hits = self.keyword_store.search(question, top_k=self.top_k, filter={"tenant": tenant_id})
        merged = self._reciprocal_rank_fusion(vec_hits, kw_hits)

        # 2. Re-rank
        reranked = self.reranker.rerank(question, merged, top_n=self.top_n)

        # 3. Fallback if nothing relevant
        if not reranked or reranked[0].score < 0.3:
            self.log.info("low_confidence_retrieval", extra={"trace_id": trace_id})
            return {"answer": "I don't have information on that.", "sources": [], "trace_id": trace_id}

        # 4. Augmented prompt
        context = self._format_context(reranked)
        prompt = self._build_prompt(question, context)
        answer = self.llm.generate(prompt, temperature=0, max_tokens=600)

        # 5. Observability
        self.log.info(
            "rag_query",
            extra={
                "trace_id": trace_id,
                "tenant_id": tenant_id,
                "latency_ms": int((time.perf_counter() - t0) * 1000),
                "n_retrieved": len(merged),
                "n_used": len(reranked),
                "top_score": reranked[0].score,
            },
        )

        return {
            "answer": answer,
            "sources": [{"id": c.chunk_id, "source": c.source, "score": c.score} for c in reranked],
            "trace_id": trace_id,
        }

Why hybrid search

Vector search misses exact terms (SKU codes, error messages). BM25 catches them. Reciprocal Rank Fusion merges both rankings cheaply.
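
For reference, Reciprocal Rank Fusion fits in a few lines. This is a minimal sketch that fuses lists of chunk IDs rather than the hit objects used in the pipeline above; the function name and sample IDs are illustrative, and k=60 is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs: each list adds 1 / (k + rank) to an item's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results from the two retrievers, best match first.
vector_hits = ["c3", "c1", "c7", "c2"]
keyword_hits = ["c9", "c3", "c2", "c5"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Because only ranks are used, the two retrievers' raw scores never need to be on the same scale. Chunks that appear in both lists (here c3 and c2) rise above chunks that only one retriever found.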

Why metadata filtering

Filter by tenant, date, permission level before vector search. Critical for multi-tenant or regulated apps.

Why a fallback

If the top score is below a threshold (e.g. 0.3), say "I don't know" instead of letting the LLM improvise.

Why structured logging

Every query logs latency, scores, chunk IDs, tenant. You will need this the first time a customer asks "why did your AI tell me X?"
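
One practical detail: Python's default log formatter does not print fields passed through `extra`, so they are easy to lose. The sketch below shows one possible JSON formatter using only the standard library; the class name, logger name, and field list are assumptions chosen to mirror the pipeline sketch above.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object, including selected `extra` fields."""

    # Field names mirror the pipeline sketch; adjust to whatever you log.
    EXTRA_FIELDS = ("trace_id", "tenant_id", "latency_ms", "n_retrieved", "n_used", "top_score")

    def format(self, record):
        payload = {"event": record.getMessage(), "level": record.levelname}
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):  # `extra` keys become attributes on the record
                payload[field] = getattr(record, field)
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("rag_demo")  # hypothetical logger name
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

log.info(
    "rag_query",
    extra={"trace_id": "t-123", "tenant_id": "tenant_a", "latency_ms": 812, "top_score": 0.71},
)
log_line = stream.getvalue().strip()
print(log_line)
```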

Add an evaluation harness with RAGAS measuring faithfulness (is the answer grounded in the context?), answer relevancy, and context precision/recall. Run it on every change to your chunking, embeddings, or prompt.

Eight Pitfalls (Lessons from Production)

1. Chunk size hell

The trap: people pick "chunk every 1,000 characters" because it sounds round. The result: sentences cut in half, code blocks shredded, tables destroyed. The fix: use a recursive splitter that respects paragraphs, headings, sentences. Start at 512 tokens with 64 overlap. For markdown, use MarkdownHeaderTextSplitter.

2. Skipping re-ranking

The trap: "vector search is fast and free, I'll just retrieve top-5." The result: the right chunk is at position 7 and you miss it. The fix: retrieve top-20, re-rank to top-3 with Cohere Rerank or BGE Reranker. Typical quality lift: 20–30%.

3. Wrong embeddings for your domain

The trap: "OpenAI embeddings are the default, so they must be best." The result: mediocre retrieval on legal, medical, code, or multilingual data. The fix: benchmark Voyage AI (legal/code), BGE-M3 (multilingual), Cohere Embed v3 (general), and Azure-hosted models. Test on your queries.

4. No metadata filtering

The trap: one big vector index for all customers. The result: tenant A sees tenant B's data. Or last year's price list shows up in answers about current pricing. The fix: filter by tenant_id, valid_from, permission_level before the vector search. Every modern vector DB supports this.
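
The ordering matters: filter first, rank second. The in-memory sketch below is only an illustration of that ordering, with made-up chunk records and a plain dot product as the score; a real vector database applies the same filter inside the index query.

```python
def filtered_search(query_vec, chunks, *, tenant_id, k=3):
    """Drop chunks from other tenants first, then rank the survivors by dot product."""
    allowed = [c for c in chunks if c["tenant"] == tenant_id]

    def score(chunk):
        return sum(q * v for q, v in zip(query_vec, chunk["vec"]))

    return sorted(allowed, key=score, reverse=True)[:k]


# Hypothetical index: b1 is the closest match overall but belongs to another tenant.
chunks = [
    {"id": "a1", "tenant": "tenant_a", "vec": [1.0, 0.0]},
    {"id": "a2", "tenant": "tenant_a", "vec": [0.0, 1.0]},
    {"id": "b1", "tenant": "tenant_b", "vec": [0.6, 0.8]},
]

hits = filtered_search([0.6, 0.8], chunks, tenant_id="tenant_a")
print([h["id"] for h in hits])
```

If the filter ran after ranking instead, b1 would occupy a top-K slot and either leak to tenant A or silently shrink the result set.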

5. Treating retrieval quality as a black box

The trap: "the answer looks fine, ship it." The result: you cannot tell if a regression came from chunking, embeddings, re-ranking, or the prompt. The fix: build a golden set of 50–200 question/expected-answer pairs. Measure recall@K (did the right chunk show up at all?) and precision@K (how many of the K were useful?) on every change.
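
Both metrics are simple to compute once each golden question lists the chunk IDs that should be retrieved. The sketch below uses the standard set-based definitions (recall as the share of relevant chunks found, which is slightly stricter than "showed up at all"); the golden-set records are invented for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top k retrieved chunks that are relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


# Hypothetical golden set: what the retriever returned vs. what it should have found.
golden = [
    {"retrieved": ["c4", "c1", "c9", "c2"], "relevant": ["c1", "c2"]},
    {"retrieved": ["c7", "c8", "c3", "c5"], "relevant": ["c3"]},
]

k = 3
mean_recall = sum(recall_at_k(g["retrieved"], g["relevant"], k) for g in golden) / len(golden)
mean_precision = sum(precision_at_k(g["retrieved"], g["relevant"], k) for g in golden) / len(golden)
print(f"recall@{k} = {mean_recall:.2f}, precision@{k} = {mean_precision:.2f}")
```

Tracking these two numbers per change tells you whether a regression came from retrieval before you look at the generated answers at all.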

6. Stuffing too many chunks into the prompt

The trap: "more context = better answer." The result: the model loses the thread, latency doubles, cost triples. The fix: 3–5 high-quality chunks beat 20 mediocre ones. If you need more, summarise first or use hierarchical retrieval.

7. No fallback when retrieval finds nothing

The trap: sending an empty context to the LLM with "answer this question." The result: the model improvises a plausible-sounding wrong answer. The fix: if the top retrieval score is below threshold, return a polite "I don't have that information." Or escalate to a human.

8. Forgetting that embedding model = locked-in cost

The trap: pick an embedding model, embed millions of chunks, ship. The result: a year later you want a better model, and switching means paying to embed every token again, rebuilding the index, and re-running your evaluation, which can take days at scale. The fix: benchmark embeddings before you scale. Once you commit, version your index so you can run two models side-by-side during migration.

Skim summary if you stop here

RAG is search + LLM. Most failures are retrieval failures, not LLM failures. Chunk well, re-rank always, evaluate constantly, fall back gracefully. You now know more about RAG than 95% of people shipping it.

RAG vs Fine-Tuning vs Long Context vs Cache-Augmented

| Dimension | RAG | Fine-tuning | Long context | Cache-augmented (CAG) |
|---|---|---|---|---|
| Best for | Knowledge that changes; large corpora; citations needed | Style, format, domain-specific reasoning | Single document Q&A; one-shot summarisation | Small, stable knowledge base (< 100K tokens) |
| Worst for | Skills that need new behaviours (tone, format) | Frequently changing facts | Corpora > 200K tokens; cost-sensitive workloads | Anything that updates often |
| Setup cost | Days | Weeks + dataset curation | Hours | Hours |
| Per-query cost | Low (1–5 cents) | Low | High (long context tokens add up fast) | Low (cached) |
| Update latency | Seconds (re-embed a doc) | Days (re-train) | Instant (just change the prompt) | Minutes (rebuild cache) |
| Auditability | High (cite chunks) | Low (knowledge baked into weights) | Medium (context visible but huge) | Medium |

Honest verdict

Choose RAG for almost every business case involving private, changing, citation-worthy data. Choose fine-tuning when you need the model to behave differently (tone, structured output, domain reasoning) — and combine it with RAG. Choose long context when you have one big document and a small query volume. Choose CAG only for tiny, stable knowledge bases where re-using the same cached context across queries pays off.

Cost & Performance Reality

| Component | Cost (2026) | Latency | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02 / 1M tokens | ~50 ms | Default choice; 1,536 dims |
| OpenAI text-embedding-3-large | $0.13 / 1M tokens | ~80 ms | 3,072 dims; ~5% better quality |
| Voyage voyage-3-large | $0.12 / 1M tokens | ~60 ms | Strong on code, legal, finance |
| Cohere Rerank v3 | $2 / 1K searches | ~150 ms | Reranks 100 docs per search |
| pgvector on your Postgres | Free (compute only) | 5–50 ms | Best up to ~10M vectors |
| Qdrant Cloud (managed) | From €25 / month | 10–30 ms | Strong filtering, open source |
| Pinecone Standard | From $70 / month | 10–30 ms | Serverless options available |
| Azure AI Search (Basic) | From €70 / month | 20–80 ms | EU-region, integrates with Azure OpenAI |
| GPT-4o-mini answer | $0.15 / $0.60 per 1M tokens (in/out) | 300–800 ms | Default for most RAG |
| Claude Sonnet 4 answer | $3 / $15 per 1M tokens | 500–1500 ms | Best for nuanced grounded reasoning |

Typical end-to-end query (production)

Embed (50ms) + vector search (20ms) + rerank (150ms) + LLM (600ms) = ~800ms p50 latency, ~$0.002–$0.01 per query. The hidden cost most teams miss: re-embedding the corpus when you upgrade models. Embedding 10M chunks at 200 tokens each = 2B tokens × $0.02/1M = ~$40 per re-embed with text-embedding-3-small, or ~$260 at the $0.13/1M rate of text-embedding-3-large. The API bill is the small part; the pipeline time, reindexing, and re-evaluation are what hurt. Budget for it.
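
The arithmetic above is easy to rerun for your own corpus. The sketch below reproduces it with the prices and latencies quoted in this section; the function name is an illustrative assumption, and the corpus size and token counts should be replaced with your own.

```python
def reembed_cost_usd(n_chunks, tokens_per_chunk, price_per_million_tokens):
    """API cost of embedding the whole corpus once, in US dollars."""
    total_tokens = n_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens


# 10M chunks at 200 tokens each, priced per the table above.
cost_small = reembed_cost_usd(10_000_000, 200, 0.02)  # text-embedding-3-small
cost_large = reembed_cost_usd(10_000_000, 200, 0.13)  # text-embedding-3-large

# Typical per-stage latencies quoted above: embed, vector search, rerank, LLM.
latency_ms = 50 + 20 + 150 + 600

print(f"re-embed with small model: ${cost_small:,.0f}")
print(f"re-embed with large model: ${cost_large:,.0f}")
print(f"end-to-end latency: ~{latency_ms} ms")
```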

Going Further

Tomorrow

Run Level 1 against 50 of your team's documents. Measure: does it find the right one for 10 real questions? You will learn more from one afternoon of testing than a week of reading.

Next month

Ship Level 2 to staging. Add a RAGAS evaluation harness. A/B test two embedding models on your real queries. Decide between pgvector (if you already have Postgres) and Qdrant/Pinecone (if you need scale).

Long-term

Explore GraphRAG (Microsoft), Contextual Retrieval (Anthropic), HyDE, RAG-Fusion, and Adaptive RAG. Read the original Lewis et al. paper and follow Eugene Yan's RAG patterns.

Self-Check: Have You Learned It?

1. Why does RAG usually beat fine-tuning for company knowledge?

Answer: Company knowledge changes constantly (policies, prices, employees). Fine-tuning bakes facts into model weights — every change means re-training. RAG lets you re-embed a single document in seconds and the model sees the new version on the next query.

2. What is the difference between retrieval and re-ranking?

Answer: Retrieval is fast and approximate — it finds ~20 candidate chunks via vector similarity in tens of milliseconds. Re-ranking is slower and precise — it uses a cross-encoder to score actual relevance of those candidates, then keeps the top 3–5. Without re-ranking, the most relevant chunk is often in your top-20 but not at position 1.

3. Your RAG system returns chunks but the answers are wrong. Where do you look first?

Answer: Inspect the actual chunks being passed to the LLM. 80% of the time the answer is: the right chunk was not retrieved (chunking too aggressive, wrong embedding model, missing metadata filter, no re-ranking). 15%: the chunks are right but the prompt does not enforce grounding. Only 5% is "the LLM hallucinated despite good context."

4. You serve 10 customers from one RAG. How do you prevent customer A from seeing customer B's documents?

Answer: Metadata filter by tenant_id applied before vector search. Every chunk's metadata includes the tenant; every query passes the requesting tenant; the vector DB only searches within that tenant's vectors. Never rely on the LLM to filter — filter at the database level.

5. What is wrong with this code?
chunks = db.query(query_embeddings=[q], n_results=50)
context = "\n".join(chunks)
prompt = f"Answer: {question}\nContext: {context}"

Answer: Four issues. (1) 50 chunks is too many: the LLM loses focus. Retrieve 20, re-rank to 3–5. (2) No grounding instruction: the LLM is not told to answer only from the context. (3) No fallback if scores are low: the model will improvise. (4) "\n".join blurs chunk boundaries; use "\n\n---\n\n". Note also that with ChromaDB, db.query returns a dict, so the chunk texts live in chunks["documents"][0], not in chunks itself.

6. When would you choose long-context LLMs over RAG?

Answer: When you have one document (under 200K tokens), low query volume, and need the model to reason across the whole thing. Examples: summarising a single annual report, answering questions about one long contract, code review of a single repo. For anything beyond that, RAG is cheaper, faster, and more accurate.

7. Design a RAG for an internal HR assistant that handles 200 employees' personal data. What three things would you do differently from a generic RAG?

Answer: (1) Per-employee metadata filtering — every chunk tagged with employee_id, query filtered so the LLM only sees the asking employee's data. (2) Run everything in your own Azure tenant in EU-region — Azure OpenAI for the LLM, Azure AI Search for vectors, no third-party APIs. (3) Strict audit logging — every query logs the user, the chunks retrieved, the prompt, and the answer, retained per AVG/GDPR rules. Plus a deny-list for sensitive actions ("never reveal another employee's salary").

Frequently Asked Questions

What does RAG stand for?

RAG stands for Retrieval-Augmented Generation. It was introduced by Lewis et al. at Facebook AI Research in May 2020.

When should I use RAG instead of fine-tuning?

Use RAG when data changes often, you need citations, or you have more than a few thousand documents. Use fine-tuning for behaviour the model needs to learn (style, format, domain reasoning).

Do I need a vector database to do RAG?

For prototypes, no — NumPy in memory works. For production, yes — pgvector, Qdrant, Pinecone, Azure AI Search, Weaviate, Milvus, or Chroma.

How much does RAG cost?

$0.02 per 1M tokens for OpenAI embeddings (2026), from roughly €25 to $70 per month for hosted vector DBs, and about $0.002–$0.01 per query end-to-end. Hidden cost: re-embedding when you upgrade models.

Why does RAG sometimes give wrong answers?

Usually retrieval failure: wrong chunk retrieved, no re-ranking, no metadata filter, or no grounding instruction in the prompt. Inspect the chunks before blaming the LLM.

What is the difference between RAG and an AI agent?

RAG is a retrieval pattern. An agent is a system that decides which tools (including RAG) to call. Most production agents use RAG as one of their tools.

What is chunking in RAG?

Splitting documents into smaller pieces (typically 200–1,000 tokens) before embedding. Good chunking respects paragraphs, headings, and code blocks.

What is re-ranking and do I need it?

A second-pass model that scores top-20 retrievals for actual relevance and keeps the best 3–5. Lifts retrieval quality 20–30% in production. Yes, you almost always need it.

Can RAG work with private or compliance-sensitive data?

Yes — that is its strongest use case. Run embeddings and vector DB in your own Azure/AWS region, keep documents in your tenant, route LLM through Azure OpenAI or a self-hosted model.

What is the difference between RAG and long-context LLMs?

Long-context lets you paste one big document into a prompt. RAG scales to millions of documents, costs less per query, and stays accurate beyond 100K tokens of context.

Next read: how RAG fits into the full agent stack

RAG is one of seven skills a production agent engineer needs. To see how retrieval connects with tool contracts, reliability, security, and observability, read From Prompt Engineer to Agent Engineer: The 7 Skills That Actually Matter.
