TL;DR
- Dense embeddings (BERT-derived, E5, BGE, GTE, OpenAI text-embedding-3) map text into a fixed-dimensional vector where similar meanings cluster geometrically.
- Sparse embeddings (BM25, SPLADE, TF-IDF) represent text as high-dimensional sparse vectors keyed on vocabulary terms, preserving exact-match signal.
- Hybrid retrieval — dense for semantic recall, sparse for lexical precision, fused by reciprocal rank fusion or learned reranking — is now standard in RAG systems.
- Dense vector dimensions range from 384 (small efficient models) to 4096 (large frontier models); MTEB is the canonical benchmark for English text embeddings.
Dense Embeddings
Dense text embeddings are produced by an encoder model, usually a Transformer encoder, that maps a piece of text into a single vector of fixed dimension. Two texts with similar meaning produce vectors with high cosine similarity; unrelated texts produce vectors with markedly lower similarity, though in practice rarely exactly orthogonal ones.
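As a minimal sketch of the similarity measure described above, the snippet below computes cosine similarity in plain Python. The three 4-dimensional vectors are made-up stand-ins for real model outputs, chosen only to illustrate the geometry; actual embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not produced by any real model).
cancel_subscription = [0.9, 0.1, 0.3, 0.0]
ending_your_plan = [0.8, 0.2, 0.4, 0.1]
weather_report = [0.0, 0.9, 0.0, 0.4]

# Related texts score close to 1, the unrelated one much lower.
print(cosine_similarity(cancel_subscription, ending_your_plan))
print(cosine_similarity(cancel_subscription, weather_report))
```

If vectors are L2-normalised at indexing time, the same ranking can be obtained from a plain dot product, which is what many vector indexes compute internally.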
The training recipe is contrastive: pairs of related texts (query/document, sentence/paraphrase) are pulled together, and unrelated pairs are pushed apart. InfoNCE loss over large batches is the workhorse. Curated datasets like MS MARCO, NLI and the BGE training corpus drive most public English models.
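The contrastive objective can be sketched as follows. This is an illustrative pure-Python version of InfoNCE with in-batch negatives, assuming the vectors are already L2-normalised; the function name and the `temperature` default are choices made for this example, and real training code would use a tensor library with automatic differentiation.

```python
import math

def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss where documents[i] is the positive for queries[i]
    and every other document in the batch acts as a negative.
    Vectors are assumed to be L2-normalised."""
    losses = []
    for i, q in enumerate(queries):
        # Similarity of this query to every document in the batch.
        logits = [sum(x * y for x, y in zip(q, d)) / temperature for d in documents]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability assigned to the true pair.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own document
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # positives swapped

print(info_nce_loss(queries, aligned))   # near zero
print(info_nce_loss(queries, shuffled))  # large
```

Because every other item in the batch serves as a negative, larger batches supply more negatives per step, which is one reason large batches matter for this recipe.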
The Modern Embedding Model Stack
Matryoshka representation learning (Kusupati et al., 2022) trains a single model whose vector can be truncated to smaller dimensions with graceful quality loss — useful for storage-constrained deployments.
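On the consumer side, using a Matryoshka-trained vector at a smaller size usually amounts to keeping a prefix and re-normalising. The helper below is a hypothetical sketch of that step; it only gives useful results for models actually trained with a Matryoshka objective, since truncating an ordinary embedding discards information arbitrarily.

```python
import math

def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalise
    return [x / norm for x in head]

# Toy 3-dimensional vector truncated to 2 dimensions.
print(truncate_embedding([3.0, 4.0, 12.0], 2))
```

Re-normalising matters because cosine or dot-product search assumes unit-length vectors, and a truncated prefix no longer has norm 1.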
| Model | Dim | MTEB avg (English) | Notes |
|---|---|---|---|
| text-embedding-3-large (OpenAI) | 3072 (Matryoshka) | ~64.6 | Closed weights, API only |
| BGE-M3 | 1024 | ~64.0 | Open, multilingual, multi-functional |
| E5-Mistral-7B-instruct | 4096 | ~66.6 | LLM-initialised, very strong |
| GTE-Qwen2-7B-instruct | 3584 | ~70.2 | Top open model at release (mid-2024) |
| nomic-embed-text-v2 | 768 | ~62.4 | Open, multilingual |
| all-MiniLM-L6-v2 | 384 | ~56.5 | Small, fast, ubiquitous baseline |
Sparse Embeddings
Sparse embeddings represent text as vectors with one dimension per vocabulary term. The classical approach (TF-IDF, BM25) uses term frequencies and inverse document frequencies; neural sparse approaches (SPLADE, uniCOIL) learn term weights from a Transformer encoder.
Sparse vectors typically have tens to hundreds of non-zero entries in a vocabulary-sized space: roughly 30,000 dimensions for WordPiece-based models such as SPLADE, and up to millions for word-level BM25 indexes over large corpora. They are stored efficiently as inverted indexes, the same data structure search engines have used for fifty years, and support exact-match queries that dense embeddings notoriously struggle with (model numbers, error codes, names).
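To make the inverted-index idea concrete, here is a small sketch of BM25 over a term-to-postings map. It assumes whitespace tokenisation and the common `k1=1.2`, `b=0.75` defaults with a Lucene-style IDF; production engines add analysers, stemming and compressed posting lists, none of which is modelled here. The documents and error codes are invented for the example.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    postings = defaultdict(dict)
    lengths = []
    for doc_id, text in enumerate(docs):
        terms = text.lower().split()
        lengths.append(len(terms))
        for term, tf in Counter(terms).items():
            postings[term][doc_id] = tf
    return postings, lengths

def bm25_search(query, postings, lengths, k1=1.2, b=0.75):
    """Score documents with BM25; only postings of query terms are visited."""
    n_docs = len(lengths)
    avg_len = sum(lengths) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        plist = postings.get(term, {})
        if not plist:
            continue  # term absent from the corpus
        idf = math.log(1 + (n_docs - len(plist) + 0.5) / (len(plist) + 0.5))
        for doc_id, tf in plist.items():
            length_norm = k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + length_norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = [
    "error E4021 raised when the printer firmware is outdated",
    "error E4012 raised when the paper tray is empty",
    "how to update printer firmware",
]
postings, lengths = build_index(docs)
results = bm25_search("E4021", postings, lengths)
print(results)  # only document 0 contains the exact code
```

The near-identical code `E4012` never enters the result set, which is the exact-match behaviour that dense retrieval tends to blur.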
Why Hybrid Wins
Dense embeddings excel at semantic similarity: 'how do I cancel my subscription' will retrieve documents about 'ending your plan' even if no words overlap. Sparse embeddings excel at lexical precision: searching for 'CVE-2024-3094' will retrieve the exact CVE document, where dense models often diffuse to nearby but unrelated CVEs.
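Hybrid retrieval queries both indexes in parallel and fuses the results. Reciprocal rank fusion (RRF) is the standard cheap method: it needs only each document's rank in each result list, so dense and sparse scores never have to be put on a common scale. Learned cross-encoder rerankers (bge-reranker, Cohere Rerank) applied to the fused candidates produce the strongest final ordering. Hybrid retrieval is now the common default in production RAG pipelines.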
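A minimal sketch of reciprocal rank fusion follows. Each result list contributes `1 / (k + rank)` to a document's score; `k=60` is the constant from the original RRF paper, while the function name and the document ids are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense results
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents found by both retrievers rise to the top, and because only ranks are used, no score calibration between the two indexes is required.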
If your RAG system is missing obvious exact-match results, you almost certainly need sparse retrieval (BM25) in the pipeline. Dense-only retrieval is a common but avoidable failure mode.
Storage and Search
Dense vectors are stored in vector databases and extensions (pgvector, Qdrant, Weaviate, Milvus, Pinecone) that implement approximate nearest-neighbour search via HNSW or IVF-PQ indexes. A typical RAG corpus of 10 million chunks at dim 1024 takes ~20 GB of raw vector storage in float16 (10M × 1024 × 2 bytes); quantisation to int8 brings it to ~10 GB, and binary quantisation (1 bit per dimension) to ~1.3 GB with modest quality loss. Index structures such as HNSW graphs add overhead on top of these figures.
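The storage figures can be checked with a few lines of arithmetic. The helper below is a back-of-the-envelope sketch that counts raw vector bytes only, in decimal gigabytes, and ignores index structures and metadata.

```python
def index_size_gb(num_vectors, dim, bits_per_dim):
    """Raw vector storage in decimal gigabytes, excluding index overhead."""
    return num_vectors * dim * bits_per_dim / 8 / 1e9

for label, bits in [("float16", 16), ("int8", 8), ("binary", 1)]:
    print(f"{label}: {index_size_gb(10_000_000, 1024, bits):.2f} GB")
# float16: 20.48 GB
# int8: 10.24 GB
# binary: 1.28 GB
```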
Sparse vectors are stored in inverted indexes (Elasticsearch, OpenSearch, Vespa, Tantivy). The same 10M chunks may take 1-5 GB depending on vocabulary and posting list compression. BM25 scoring is cheap at query time: it needs no model inference and touches only the posting lists of the query terms.
Long Documents and Chunking
Most embedding models have a context limit of 512 to 8192 tokens. Long documents must be chunked. Naive fixed-size chunking (256-1024 tokens) is the baseline; structural chunking (per-section, per-paragraph) usually improves recall when documents have hierarchical structure. ColBERT-style late interaction (see ColBERT entry) is the alternative for retaining per-token granularity.
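The fixed-size baseline can be sketched as a sliding window over a token list. The overlap between neighbouring chunks is an assumption added for this example (a common way to avoid cutting a sentence off from its context), and the function expects tokens already produced by whatever tokeniser the embedding model uses.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

# Ten toy "tokens", windows of 4 with 1 token of overlap.
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Structural chunking would replace the fixed `step` with boundaries taken from headings or paragraphs, falling back to this window only for sections that exceed the model's context limit.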