What is an Embedding? Vectors for Meaning, Explained (2026)

An embedding is a list of numbers that captures the meaning of text, an image, or audio in a fixed-length vector. Similar things have similar vectors. Embeddings power semantic search, RAG, clustering, classification and recommendation. Here is how they work, what they cost, and which one to use.

22 min read·2026-05-27·Foundation

Agentic AI · Updated May 2026 · Reviewed by IITS Azure data engineering team · Den Haag

Embedding = Meaning, as a list of numbers.

An embedding is a fixed-length vector (a list of numbers) that captures the meaning of a piece of text, an image, or audio. Sentences about similar topics produce similar vectors; unrelated topics produce far-apart vectors. Embeddings are the foundation of semantic search, RAG, clustering, classification, and recommendation. They were popularised for words by Mikolov et al. (word2vec, 2013) and are now produced by transformer encoders that embed full sentences and documents in a single forward pass.

What it is

A vector that represents meaning.

What it solves

Comparing text by meaning, not by exact words.

Who needs it

Anyone building search, RAG, or recommendation.

After this guide

You will embed text, compare it, and pick a model.

From Text to Meaning

[Figure: the text input "What is our refund policy?" passes through an embedding model (transformer encoder, text-embedding-3-small) and comes out as a 1,536-dimensional vector, [0.012, -0.831, 0.445, 0.117, ...]. Plotted in meaning space, "refund policy", "return goods", and "money back" sit close together, while "shipping rates" and "delivery time" form a separate group. Similar meanings cluster together; unrelated topics drift apart.]

Text in. Vector out. The vector is a coordinate in a high-dimensional "meaning space" where things that mean similar things live close together.

Prerequisites & Setup

You need basic Python (functions, virtual environments, pip), the ability to make an HTTP request, and an API key for OpenAI or Azure OpenAI. NumPy handles the vector maths.

Install once

python -m venv .venv
source .venv/bin/activate     # Windows: .venv\Scripts\activate
pip install openai sentence-transformers numpy tenacity

Set OPENAI_API_KEY as an environment variable. Get a key at platform.openai.com/api-keys, or use Azure OpenAI in an EU region for production.

Why Embeddings Exist (The Story)

Before 2013, computers did not understand meaning. They understood exact strings. "The cat sat on the mat" and "A feline rested on the rug" looked completely unrelated, even though they mean the same thing. Search engines worked on keyword matching: type "refund" and you only found documents containing the literal word "refund."

  • 2013 — word2vec (Mikolov et al., Google): showed that words could be mapped to vectors where similar words cluster together. The famous demo: king − man + woman ≈ queen.
  • 2018 — BERT (Devlin et al., Google): contextual embeddings. The meaning of "bank" now depended on whether you were talking about a river or a financial institution.
  • 2019 — Sentence-BERT (Reimers & Gurevych): made it efficient to embed full sentences, not just words.
  • 2024–2026: OpenAI text-embedding-3, Voyage AI, Cohere Embed v3, BGE-M3 — instruction-tuned, multilingual, multimodal embeddings became the standard.

The breakthrough insight

If you can turn meaning into geometry, you can compare meaning with maths. Two sentences? Two vectors. Their similarity? The angle between them. That single move unlocked semantic search, RAG, clustering, recommendation, deduplication, and most modern NLP.

By 2026, embeddings are the foundation of every serious LLM application. Every RAG system computes them. Every recommendation engine uses them. Every semantic search builds on them.

The Mental Model

The map of meaning analogy

Imagine a giant 1,536-dimensional map. Every piece of text that has ever been written is a single dot on that map. Sentences about refunds cluster in one neighbourhood. Sentences about shipping live in another. Sentences about cooking are in a completely different region. The model's job is to drop each new sentence at the right coordinate. Embeddings are those coordinates.

Five terms you must internalise:

Vector / Dimension

Technical: a fixed-length list of floating-point numbers, typically 384 to 3,072 long. Friendly: a list of numbers that locates one piece of text in meaning-space. Concrete example: [0.012, -0.831, 0.445, ...]. Common misconception: bigger is not always better — 3,072 dims costs 2× the storage and search time of 1,536 dims for only ~5% quality gain.

Cosine similarity

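Technical: dot product of two vectors divided by the product of their magnitudes, giving a value between -1 and 1. Friendly: a measurement of how similar two pieces of text are: close to 1 means near-identical meaning, around 0 means unrelated, and negative values (rare for text embeddings in practice) point in opposite directions. Common misconception: that Euclidean distance is interchangeable with cosine similarity. On unnormalised vectors it is not, because Euclidean distance also reacts to vector length; on unit-length vectors the two produce the same ranking. For embeddings, cosine similarity is almost always what you want.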
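To make the formula concrete, here is a minimal sketch of cosine similarity in plain Python. The three vectors are made-up 3-dimensional toys, not real embeddings, and the function name is only illustrative.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two magnitudes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (invented for illustration, not model output)
refund = [0.9, 0.1, 0.0]
returns = [0.8, 0.2, 0.1]
shipping = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(refund, returns)
sim_far = cosine_similarity(refund, shipping)
print(f"refund vs returns:  {sim_close:.3f}")
print(f"refund vs shipping: {sim_far:.3f}")
```

The first pair points in nearly the same direction and scores close to 1; the second pair is nearly orthogonal and scores close to 0. Scaling either vector does not change the result, which is why cosine ignores vector length.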

Embedding model

Technical: a neural network (usually a transformer encoder) that takes tokens as input and outputs a single fixed-length vector. Friendly: the function that turns text into its meaning-coordinate. Common misconception: the embedding model is not the same as the chat LLM. ChatGPT does not produce embeddings — you call a separate embedding endpoint.

Semantic space

Technical: the high-dimensional vector space (1,536-D for text-embedding-3-small) in which all embeddings from one model live. Friendly: the meaning-map itself, with sentences plotted as dots. Common misconception: two models do NOT share a semantic space — an OpenAI vector and a BGE vector cannot be compared.

Caching

Technical: persisting computed embeddings to disk or a vector database so you never re-embed the same text twice. Friendly: embeddings are effectively deterministic: the same text and the same model give the same vector (hosted APIs can differ in the last few decimal places, which does not matter for search). Pay once, use forever. Common misconception: that embeddings are too cheap to bother caching. Forgetting to cache is the #1 way to blow up your embedding bill. Even at $0.02/1M tokens, embedding a 1M-document corpus on every restart is a real waste.

Anatomy: Follow One Sentence

Let us trace one sentence through an embedding model. Input: "What is our refund policy?"

Step 1 — Tokenize

The sentence is split into tokens. With OpenAI's tokenizer, "What is our refund policy?" becomes 7 tokens. Each token is mapped to an integer ID via a lookup table.

Step 2 — Encode

The token IDs flow through a transformer encoder. Each token gets a contextual representation that depends on every other token (this is what makes "bank account" different from "river bank").

Step 3 — Pool

The encoder produces one vector per token (7 vectors for our sentence). A pooling step (usually averaging) compresses them into one single vector that represents the whole sentence.

Step 4 — Normalise

The vector is scaled to have magnitude 1. After this, cosine similarity equals plain dot product (faster to compute, identical answer).

Step 5 — Return

The output is a 1,536-dimensional vector like [0.012, -0.831, 0.445, ...]. This is the coordinate of "What is our refund policy?" in OpenAI's meaning space. Two semantically similar sentences will produce two vectors with cosine similarity close to 1.
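Steps 3 and 4 above can be sketched in a few lines of plain Python. This assumes mean pooling, which is one common choice; the pooling used inside hosted models such as OpenAI's is not public, and the token vectors below are invented 4-dimensional toys.

```python
import math

def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average the per-token vectors dimension by dimension."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def l2_normalise(vector: list[float]) -> list[float]:
    """Scale a vector so its magnitude is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]

# Three made-up token vectors with 4 dimensions each (real encoders emit hundreds)
token_vectors = [
    [0.2, 0.4, 0.0, 0.6],
    [0.4, 0.0, 0.2, 0.2],
    [0.0, 0.2, 0.4, 0.4],
]
pooled = mean_pool(token_vectors)
sentence_vector = l2_normalise(pooled)
magnitude = math.sqrt(sum(x * x for x in sentence_vector))
print(len(sentence_vector), round(magnitude, 6))
```

After normalisation the magnitude is 1, so the dot product of two such vectors is already their cosine similarity.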

Hands-On Level 1 — Toy Example

The smallest useful embedding example. Twelve lines. In-memory. Embed three sentences, compute cosine similarity, see which two are most alike.

import os, numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def embed(text):
    return np.array(client.embeddings.create(input=text, model="text-embedding-3-small").data[0].embedding)

a = embed("Our refund policy allows returns within 30 days.")
b = embed("Customers may return items in the first month.")
c = embed("Shipping is free for orders above 50 euros.")

# Cosine similarity (vectors are already normalised by OpenAI, so dot product = cosine)
print(f"a vs b: {a @ b:.3f}")
print(f"a vs c: {a @ c:.3f}")

Expected output (approximate; exact values can vary slightly)

a vs b: 0.812
a vs c: 0.318

What just happened

You turned three English sentences into three 1,536-dimensional vectors. The cosine similarity between the refund sentence and the return sentence is about 0.81: the model recognises that they mean almost the same thing even though they share almost no words. The shipping sentence scores about 0.32, clearly a different topic. If you got this running, you understand what an embedding does.

Hands-On Level 2 — Realistic

Level 1 has three problems for real use: (a) every restart re-pays the embedding bill, (b) one-by-one calls are slow, (c) no error handling. Level 2 fixes the first two with a persistent cache and batching; error handling (retries) arrives in Level 3. The same caching pattern also works with a local Sentence Transformers model if you need an offline-capable alternative.

import os, json, hashlib, numpy as np
from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
CACHE = Path("./embeddings_cache.jsonl")
MODEL = "text-embedding-3-small"

def _key(text: str) -> str:
    return hashlib.sha256(f"{MODEL}::{text}".encode()).hexdigest()

# Load cache once
cache: dict[str, list[float]] = {}
if CACHE.exists():
    for line in CACHE.read_text().splitlines():
        row = json.loads(line)
        cache[row["key"]] = row["vec"]

def embed_many(texts: list[str]) -> np.ndarray:
    """Embed a batch. Returns (n, 1536) array. Caches by text+model hash."""
    # Deduplicate (order-preserving) so repeated texts are embedded only once
    missing = list(dict.fromkeys(t for t in texts if _key(t) not in cache))
    if missing:
        # Batch up to 2,048 inputs per call (OpenAI limit)
        for i in range(0, len(missing), 2048):
            batch = missing[i:i+2048]
            resp = client.embeddings.create(input=batch, model=MODEL)
            with CACHE.open("a") as f:
                for text, item in zip(batch, resp.data):
                    cache[_key(text)] = item.embedding
                    f.write(json.dumps({"key": _key(text), "vec": item.embedding}) + "\n")
    return np.array([cache[_key(t)] for t in texts])

# Use it
docs = [
    "Our refund policy allows returns within 30 days.",
    "Customers may return items in the first month.",
    "Shipping is free for orders above 50 euros.",
]
vecs = embed_many(docs)
print(vecs.shape)           # (3, 1536)
print(vecs @ vecs.T)        # 3x3 similarity matrix

What changed from Level 1

  • Persistent cache: embeddings are hashed by model::text and stored to disk. Re-runs are free.
  • Batching: up to 2,048 texts per API call (50× faster than one-at-a-time).
  • Pre-normalised: OpenAI returns unit vectors, so vecs @ vecs.T directly gives the cosine similarity matrix.
  • Model versioned in the cache key: if you ever switch models, the old cache is automatically bypassed (no stale-vector contamination).

Hands-On Level 3 — Production Pattern

In production, embeddings go through a dedicated service with retries, observability, and a real cache backend (Redis or pgvector, not JSON files). Sketch:

# embeddings/service.py
from dataclasses import dataclass
from typing import Sequence
import hashlib, logging, time
from tenacity import retry, stop_after_attempt, wait_exponential

@dataclass
class EmbeddingResult:
    vector: list[float]
    model: str
    cached: bool

class EmbeddingService:
    def __init__(self, client, cache, model: str = "text-embedding-3-small", batch_size: int = 1024):
        self.client = client
        self.cache = cache  # e.g. Redis or pgvector
        self.model = model
        self.batch_size = batch_size
        self.log = logging.getLogger("embeddings")

    def _key(self, text: str) -> str:
        return hashlib.sha256(f"{self.model}::{text}".encode()).hexdigest()

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=8))
    def _call_api(self, batch: list[str]) -> list[list[float]]:
        resp = self.client.embeddings.create(input=batch, model=self.model)
        return [d.embedding for d in resp.data]

    def embed(self, texts: Sequence[str], *, trace_id: str = "") -> list[EmbeddingResult]:
        t0 = time.perf_counter()
        keys = [self._key(t) for t in texts]
        cached = self.cache.mget(keys)
        results: list[EmbeddingResult | None] = [
            EmbeddingResult(vector=v, model=self.model, cached=True) if v else None
            for v in cached
        ]
        missing_idx = [i for i, r in enumerate(results) if r is None]
        if missing_idx:
            # Batch the API calls
            for batch_start in range(0, len(missing_idx), self.batch_size):
                batch_idx = missing_idx[batch_start:batch_start + self.batch_size]
                batch_texts = [texts[i] for i in batch_idx]
                vectors = self._call_api(batch_texts)
                # Write through cache + populate result
                self.cache.mset({self._key(t): v for t, v in zip(batch_texts, vectors)})
                for i, v in zip(batch_idx, vectors):
                    results[i] = EmbeddingResult(vector=v, model=self.model, cached=False)
        # Observability
        n_cached = sum(1 for r in results if r and r.cached)
        self.log.info(
            "embeddings.batch",
            extra={
                "trace_id": trace_id,
                "total": len(texts),
                "cached": n_cached,
                "fresh": len(texts) - n_cached,
                "latency_ms": int((time.perf_counter() - t0) * 1000),
                "model": self.model,
            },
        )
        return results  # type: ignore[return-value]

Why retries with backoff

OpenAI rate limits trigger 429s under load. Exponential backoff with 3 attempts handles ~99% of transient failures.
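For readers who want to see the idea without a library, here is a rough standard-library sketch of retry with exponential backoff. It is not tenacity's implementation; the function name, the jitter amount, and the injectable sleep parameter are assumptions made for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_backoff(
    fn: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing waits."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small jitter so many clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("attempts must be at least 1")

# Simulated flaky call: fails twice, then succeeds
calls = {"n": 0}
waits: list[float] = []

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated 429")
    return "ok"

result = call_with_backoff(flaky, sleep=waits.append)
print(result, calls["n"], len(waits))
```

In a real service you would probably retry only on transient errors (rate limits, timeouts, 5xx responses) and let client errors such as a malformed request fail immediately.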

Why model-versioned cache keys

If you upgrade from text-embedding-3-small to large, every old cache key changes automatically. No stale vectors leak into the new index.

Why log cache-hit rate

A healthy production embedding service has > 90% cache-hit ratio after warm-up. If yours drops, you know something is wrong (cache eviction, key collisions, model change).

Why batch by 1,024

OpenAI's max is 2,048 but smaller batches give better p99 latency. 1,024 is the sweet spot for production.

Eight Pitfalls (Lessons from Production)

1. Wrong model for the domain

The trap: "OpenAI is the default, so it must be best." The result: mediocre retrieval on legal, medical, code, or multilingual data. The fix: benchmark on YOUR domain via the MTEB leaderboard or a 100-query golden set. Voyage wins for legal/finance/code; BGE-M3 wins for multilingual.

2. No caching

The trap: "It is only $0.02 per 1M tokens, so who cares." The result: a 5M-document re-embed every deploy costs $200 (which implies roughly 2,000 tokens per document at that price), takes hours, and adds latency. The fix: hash model::text and persist. Once you have a vector, you have it forever (until you change models).

3. Comparing vectors across different models

The trap: you migrate from OpenAI to BGE but only re-embed half the corpus. The result: the unmigrated vectors are silently wrong — they live in a different semantic space. Search quality collapses without any error message. The fix: tag every vector with its model name; refuse cross-model comparisons; re-embed the full corpus on a migration.
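One way to enforce the "tag and refuse" rule is to carry the model name with every vector and check it before any comparison. The class and function names below are hypothetical, and the 2-dimensional vectors are toys.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedVector:
    model: str                 # name of the model that produced the vector
    values: tuple[float, ...]  # assumed to be unit length

def dot_similarity(a: TaggedVector, b: TaggedVector) -> float:
    """Dot product of two vectors; refuses vectors from different models."""
    if a.model != b.model:
        raise ValueError(f"cannot compare {a.model!r} with {b.model!r}")
    if len(a.values) != len(b.values):
        raise ValueError("dimension mismatch")
    return sum(x * y for x, y in zip(a.values, b.values))

old = TaggedVector("text-embedding-3-small", (0.6, 0.8))
same_model = TaggedVector("text-embedding-3-small", (0.8, 0.6))
migrated = TaggedVector("bge-m3", (0.6, 0.8))

score = dot_similarity(old, same_model)
print(round(score, 2))
try:
    dot_similarity(old, migrated)
except ValueError as err:
    print("refused:", err)
```

A half-migrated index then fails loudly at query time instead of silently returning poor results.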

4. Forgetting to normalise

The trap: some models output unnormalised vectors. You use cosine similarity but treat it as dot product (or vice versa). The result: similarity scores are subtly wrong; rankings shift; you cannot tell why. The fix: always normalise to unit length after the API call. Then cosine == dot product (faster) and you can never be inconsistent.

5. Truncating long text silently

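The trap: text-embedding-3-small has an 8,192-token input limit. You send a 50-page PDF. Depending on your stack, the request is either rejected with an error (the OpenAI API's behaviour for over-length input) or the text is silently cut to the model's maximum length (common with self-hosted models and wrapper libraries that truncate for you). The result: in the silent case, half your document is invisible. The fix: chunk before embedding (see the RAG guide). One vector per chunk; reconstruct meaning at retrieval time.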
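As a rough illustration of the fix, the sketch below splits text into overlapping windows. It counts words as a stand-in for tokens, which is only an approximation; a real pipeline would count tokens with the model's tokenizer. The function name and the default sizes are assumptions.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks

# A synthetic 450-word document
document = " ".join(f"word{i}" for i in range(450))
chunks = chunk_words(document, max_words=200, overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding a few words twice.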

6. English-only model on multilingual data

The trap: using a model trained on English on Dutch / German / French content. The result: embeddings cluster by language, not by meaning. A Dutch refund query does not match the English refund policy in your knowledge base. The fix: use BGE-M3, Cohere multilingual-v3, or Voyage multilingual.

7. Storing raw float32 vectors at scale

The trap: 1M vectors × 1,536 dims × 4 bytes = 6 GB of RAM/disk. Fine until you hit 100M vectors. The result: infrastructure cost balloons. The fix: use binary quantisation (1 bit per dim, 32× smaller, ~95% retrieval quality preserved) or scalar quantisation (int8, 4× smaller, ~99% preserved). Most modern vector DBs support both natively.
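The two quantisation schemes can be sketched in plain Python to show where the savings come from. This is a simplified illustration, not how any particular vector database implements it; production systems usually add calibration and rescoring, and the function names are hypothetical.

```python
def quantise_int8(vector: list[float]) -> tuple[list[int], float]:
    """Scalar quantisation: map each float to an integer in [-127, 127]."""
    scale = max(abs(x) for x in vector) / 127 or 1.0  # 1.0 guards an all-zero vector
    return [round(x / scale) for x in vector], scale

def dequantise_int8(codes: list[int], scale: float) -> list[float]:
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

def quantise_binary(vector: list[float]) -> bytes:
    """Binary quantisation: keep only the sign of each dimension, 8 per byte."""
    out = bytearray((len(vector) + 7) // 8)
    for i, x in enumerate(vector):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing sign bits; smaller means more similar."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# An 8-dimensional toy vector (real embeddings have hundreds of dimensions)
vector = [0.012, -0.831, 0.445, 0.117, -0.2, 0.3, -0.05, 0.6]
codes, scale = quantise_int8(vector)
restored = dequantise_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
bits = quantise_binary(vector)
print(codes, f"max error {max_error:.4f}", bits.hex())
```

Each int8 code takes 1 byte instead of 4, and the binary form packs 8 dimensions into a single byte, so a 1,536-dimensional vector shrinks to 192 bytes.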

8. Re-embedding everything when you change models

The trap: commit to a model, embed 50M chunks, then a better model comes out. Re-embedding costs $1,000+ and takes a day. The result: you stay on an old model longer than you should. The fix: design your index to support side-by-side model versions. Run new model on incoming traffic while gradually re-embedding the back-catalogue.

Skim summary if you stop here

Embedding = vector that captures meaning. Pick a model for your domain, normalise, cache aggressively, version your indexes, never compare across models. The eight pitfalls above account for ~80% of production embedding failures.

Embedding Model Comparison

Model | Dimensions | Cost / 1M tokens | Best for | Open weights
OpenAI text-embedding-3-small | 1,536 | $0.02 | Default English | No
OpenAI text-embedding-3-large | 3,072 | $0.13 | Higher-quality English | No
Voyage voyage-3-large | 1,024 | $0.12 | Code, legal, finance | No
Cohere embed-multilingual-v3 | 1,024 | $0.10 | 100+ languages | No
BGE-M3 (BAAI) | 1,024 | Free (self-host) | Multilingual, on-prem | Yes (MIT)
sentence-transformers/all-MiniLM-L6-v2 | 384 | Free (self-host) | Small / fast / CPU | Yes (Apache 2.0)
CLIP (OpenAI) | 512 | Free (self-host) | Image + text together | Yes

Honest verdict

Default to OpenAI text-embedding-3-small for new English projects — best price/performance, zero infra. Switch to text-embedding-3-large only if benchmarking proves it (5% gain costs 6× more). Use Voyage for specialised domains. Use BGE-M3 when you need on-prem / EU-only / multilingual / free. Use CLIP when you embed images.

Cost & Performance Reality

Embeddings are the cheapest part of any RAG system — until they are not. The cost is dominated by how many tokens you embed, not by which model (unless you switch to text-embedding-3-large or run a big self-hosted model).

Embedding a 10M-chunk corpus (200 tokens each)

Total tokens: 2 billion. Cost:

  • text-embedding-3-small: ~$40
  • text-embedding-3-large: ~$260
  • Voyage voyage-3-large: ~$240
  • BGE-M3 on your own A100 GPU: ~$10 of compute time + you keep the GPU
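The hosted-API figures above are simple arithmetic, sketched here so you can plug in your own corpus size. The prices are copied from the comparison table and may be out of date; the function name is illustrative.

```python
def embedding_cost_usd(n_chunks: int, tokens_per_chunk: int, usd_per_million_tokens: float) -> float:
    """Total tokens divided by one million, times the per-million price."""
    total_tokens = n_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens

# USD per 1M tokens, taken from the comparison table in this guide
PRICES = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "voyage-3-large": 0.12,
}

costs = {
    model: embedding_cost_usd(10_000_000, 200, price)
    for model, price in PRICES.items()
}
for model, cost in costs.items():
    print(f"{model}: ${cost:,.0f}")
```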

Storage at 1M vectors

  • 1,536 dims × float32 = ~6 GB
  • 1,536 dims × int8 (scalar quantised) = ~1.5 GB (~99% quality)
  • 1,536 dims × binary (1 bit) = ~190 MB (~95% quality)
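The storage numbers follow from dimensions times bits per dimension. A small sketch of that arithmetic, counting raw vector bytes only (index structures and metadata add overhead on top):

```python
def index_size_bytes(n_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw vector storage only; index structures and metadata are extra."""
    bytes_per_vector = (dims * bits_per_dim + 7) // 8  # round up to whole bytes
    return n_vectors * bytes_per_vector

sizes = {
    label: index_size_bytes(1_000_000, 1536, bits)
    for label, bits in [("float32", 32), ("int8", 8), ("binary", 1)]
}
for label, size in sizes.items():
    print(f"{label}: {size / 1e9:.2f} GB")
```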

Latency: a single OpenAI embedding call takes 50–100ms; a batch of 1,000 takes ~500ms, which is about 0.5ms per item (roughly 100–200× faster per item than one call each). Self-hosted BGE-M3 on a modern GPU does ~20,000 sentences/second. CPU-only sentence-transformers do ~100 sentences/second.

Going Further

Tomorrow

Embed 200 of your team's documents with Level 1 code. Compute a similarity matrix. Find the top 5 near-duplicates you did not know existed. You will learn more in an hour than from a week of reading.

Next month

Build a semantic search over your documents using Level 2 with pgvector or Qdrant. Compare 3 embedding models on your real queries. Visualise the meaning-space with UMAP.

Long-term

Explore fine-tuning embeddings on your own data, multimodal embeddings (CLIP, SigLIP), binary quantisation at scale, and Matryoshka embeddings (variable dimensions from one model). Read the Sentence-BERT paper.


Self-Check: Have You Learned It?

1. Why are two embeddings from different models not directly comparable?

Answer: Each model has its own learned semantic space. OpenAI's coordinate for "refund" lives in OpenAI-space; BGE's coordinate for "refund" lives in BGE-space. They have no common origin or axes, even if both vectors are 1,536-dimensional. Cosine similarity between them is mathematically valid but semantically meaningless.

2. What is the difference between cosine similarity and dot product?

Answer: Cosine similarity = dot product / (magnitude of a × magnitude of b). When both vectors are normalised to unit length, their magnitudes are both 1, so cosine similarity equals plain dot product. That is why most production embedding APIs return pre-normalised vectors: you can use the faster dot product everywhere and get the same answer.

3. Your semantic search returns weird results. The model is OpenAI text-embedding-3-small. Where do you look first?

Answer: Check three things in order. (1) Language: is your corpus English? text-embedding-3-small is strongest on English; non-English or mixed-language queries may perform worse. (2) Domain: is it specialist text (legal, code, medical)? OpenAI is mediocre on those. Benchmark Voyage. (3) Chunking: are you embedding very long documents in one shot? Anything beyond the 8,192-token limit is rejected or truncated, and even text that fits loses detail when squeezed into a single vector. Chunk first.

4. You serve a multilingual EU app and need to keep all data in the EU. Which embedding setup?

Answer: Either (a) Azure OpenAI in West-Europe region with text-embedding-3-small/large — same model as OpenAI but data stays in EU under your tenant; or (b) self-host BGE-M3 on a small GPU in your own infrastructure. BGE-M3 is multilingual, open-weights, and free — perfect for sovereignty requirements. Avoid hitting api.openai.com directly: that routes through US infrastructure.

5. What is wrong with this code?
def search(query, docs):
    q_vec = embed(query)
    scores = [np.dot(q_vec, embed(d)) for d in docs]
    return sorted(zip(docs, scores), key=lambda x: -x[1])[:5]

Answer: Two big problems. (1) embed(d) is called inside the loop, so every search re-embeds the entire corpus. The corpus should be embedded once, persisted, and re-used. (2) embed is called one document at a time instead of batched. For 10K docs that is 10K API calls instead of 5. Both are fixed the same way: embed and cache the corpus once at startup, in batches; only embed the query at search time.
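A possible shape for the corrected version is sketched below. The CorpusIndex class is hypothetical, and fake_embed_many is a keyword-counting stand-in so the example runs without an API key; in practice you would pass a batched, cached embedding function.

```python
from typing import Callable, Sequence

Vector = list[float]
EmbedMany = Callable[[Sequence[str]], list[Vector]]

class CorpusIndex:
    """Embeds the corpus once; each search embeds only the query."""

    def __init__(self, docs: Sequence[str], embed_many: EmbedMany):
        self.docs = list(docs)
        self.embed_many = embed_many
        self.doc_vecs = embed_many(self.docs)  # one batched call, done up front

    def search(self, query: str, k: int = 5) -> list[tuple[str, float]]:
        q_vec = self.embed_many([query])[0]
        # Dot product; this equals cosine similarity only for unit-length vectors
        scores = [sum(q * d for q, d in zip(q_vec, vec)) for vec in self.doc_vecs]
        ranked = sorted(zip(self.docs, scores), key=lambda pair: -pair[1])
        return ranked[:k]

calls: list[int] = []

def fake_embed_many(texts: Sequence[str]) -> list[Vector]:
    """Stand-in for a real embedding call: counts two keywords per text."""
    calls.append(len(texts))
    return [
        [float(t.lower().count("refund")), float(t.lower().count("shipping"))]
        for t in texts
    ]

index = CorpusIndex(
    ["Refund policy: 30 days.", "Shipping is free above 50 euros."],
    fake_embed_many,
)
top = index.search("How do I get a refund?", k=1)
print(top, calls)
```

The calls list shows the difference: one batched call for the whole corpus at construction, then one call per search for the query alone.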

6. When should you fine-tune an embedding model instead of using off-the-shelf?

Answer: When off-the-shelf models conflate distinctions that matter to you. Examples: two product SKUs that look near-identical to OpenAI but mean different things; internal jargon ("the Phoenix project" vs "the Apollo project") that should be far apart but ends up clustered; legal terms with specific meanings in your jurisdiction. Fine-tuning needs ~5K labelled pairs of similar/dissimilar texts and typically gives a 5–20% retrieval lift. Only worth it when you have measured the gap.

7. Design an embedding service for a multi-tenant SaaS where each tenant has private data. What three things matter most?

Answer: (1) Per-tenant cache keys — never let one tenant's cached vector be returned to another. Include tenant_id in the cache key. (2) Per-tenant rate limits — one tenant should not be able to exhaust your OpenAI quota for everyone else. (3) Per-tenant model versioning — let tenant A migrate to a new model independently of tenant B; store the model version next to each vector and refuse cross-version comparisons at query time.
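Point (1) can be as small as one extra field in the hash. A minimal sketch, assuming tenant IDs are strings; the function name and the length-prefix scheme are illustrative choices, not something prescribed by this guide.

```python
import hashlib

def tenant_cache_key(tenant_id: str, model: str, text: str) -> str:
    """Hash tenant, model, and text together so tenants never share entries."""
    h = hashlib.sha256()
    for part in (tenant_id, model, text):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

key_a = tenant_cache_key("tenant-a", "text-embedding-3-small", "refund policy")
key_b = tenant_cache_key("tenant-b", "text-embedding-3-small", "refund policy")
print(key_a != key_b)
```

The same text under two tenants now maps to two different cache entries, and changing the model changes the key as well.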

Frequently Asked Questions

What is an embedding in simple terms?

A list of numbers (a vector) that represents the meaning of a piece of text, image, or audio. Things that mean similar things get similar vectors.

How many dimensions does an embedding have?

Typically 384 to 3,072 depending on the model. OpenAI text-embedding-3-small is 1,536; large is 3,072; BGE-M3 is 1,024.

How do you compare two embeddings?

Cosine similarity: closer to 1 means more similar. Most production models return normalised vectors, so cosine similarity equals plain dot product.

What is the best embedding model in 2026?

No single answer. Default to OpenAI text-embedding-3-small for English; Voyage for code/legal/finance; BGE-M3 for multilingual or on-prem. Always benchmark on YOUR queries.

How much do embeddings cost?

~$0.02 per 1M tokens for OpenAI's cheapest model. Embedding a 10M-chunk corpus at 200 tokens per chunk (2 billion tokens) costs about $40.

Can I mix embeddings from different models?

No. Different models live in incompatible vector spaces. Pick one per index; tag every vector with the model name; never compare across.

What is the difference between an embedding and a token?

A token is a chunk of text (input). An embedding is the meaning vector (output). The embedding model converts tokens into a single vector.

Why normalise embeddings?

Normalisation lets you use fast dot product instead of full cosine similarity. Most APIs already return normalised vectors.

Can embeddings represent images and audio?

Yes. CLIP and SigLIP embed images and text into the same space (search images by text). CLAP does the same for audio.

What is fine-tuning an embedding model?

Training an existing model further on your own labelled pairs so it captures domain-specific distinctions better. Typically 5–20% retrieval improvement.

Next read: how embeddings power Retrieval-Augmented Generation

Embeddings are the foundation of RAG — the technique that gives language models access to your own data. To see how they fit together with vector databases, re-ranking, and prompt engineering, read What is RAG? Retrieval-Augmented Generation Explained (2026).
