TL;DR
- BM25 (Best Matching 25) is a bag-of-words ranking function from the Okapi project at City University London, refined throughout the 1990s by Stephen Robertson, Karen Spärck Jones and colleagues.
- Definitive modern treatment is Robertson and Zaragoza's 2009 monograph 'The Probabilistic Relevance Framework: BM25 and Beyond'.
- Scores a document by summing, over the query terms, an inverse-document-frequency weight multiplied by a saturated, length-normalised term frequency.
- Still the standard training-free sparse retriever in 2026 and the usual sparse leg of hybrid retrieval systems.
The Formula
BM25 scores a document D against a query Q as the sum, over each query term q_i, of an IDF weight multiplied by a saturated, length-normalised term-frequency factor. The IDF component down-weights common terms; the saturation (controlled by k1, typically 1.2 to 2.0) makes each additional occurrence of a term progressively less valuable; the length normalisation (controlled by b, typically 0.75) penalises long documents that accumulate matches by sheer size.
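Written out, one common form of the scoring function looks like the following. The notation is an assumption made here for illustration: f(q_i, D) is the number of times q_i occurs in D, |D| is the document length in tokens, avgdl is the mean document length in the collection, N is the number of documents, and n(q_i) is the number of documents containing q_i. The IDF shown is the non-negative variant that Lucene uses; other implementations differ in small details.

```latex
\[
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]
\[
\mathrm{IDF}(q_i) = \ln\!\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)
\]
```

With b = 0 the length term disappears, and as f(q_i, D) grows the fraction approaches k_1 + 1, which is the saturation ceiling for a single term.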
The two hyperparameters k1 and b are normally left at their defaults. Tuning them rarely moves nDCG by more than a percentage point on general corpora, though specialised collections (very long or very short documents) sometimes benefit from adjustment.
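To make the roles of k1 and b concrete, here is a minimal sketch of a BM25 scorer in plain Python. It assumes documents and queries are already tokenised into lists of strings, and it uses the non-negative IDF variant found in Lucene; the function name `bm25_scores` and the toy corpus are illustrative, not taken from any library.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenised document in `docs` against the tokenised `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalisation; equals 1.0 for a document of average length.
        norm = 1.0 - b + b * len(d) / avgdl
        score = 0.0
        for term in query:
            f = tf.get(term, 0)
            if f == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * f * (k1 + 1.0) / (f + k1 * norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical data, for illustration only).
docs = [
    "the cat sat on the mat".split(),
    "the cat chased the other cat around the garden all afternoon".split(),
    "dogs bark".split(),
]
scores = bm25_scores(["cat"], docs)
scores_no_length_norm = bm25_scores(["cat"], docs, b=0.0)
```

Setting `b=0.0` switches length normalisation off entirely, which is a quick way to see how much of a ranking difference on a given corpus comes from document length rather than term counts.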
Why It Has Lasted
BM25 is over thirty years old and still difficult to beat on out-of-domain corpora. Dense retrievers trained on MS MARCO routinely outperform BM25 on MS MARCO; the same retrievers often lose to BM25 on BEIR, the benchmark of eighteen datasets spanning nine retrieval tasks that Thakur et al. (2021) assembled specifically to test zero-shot generalisation. The reason is that BM25 has no learned parameters tied to a training distribution, so it cannot overfit one.
It is also cheap to serve. An inverted index runs on CPU, and an index that fits on a single machine typically answers queries in milliseconds. No GPU, no embedding model, no warm-up.
BM25 should be the baseline every retrieval system is benchmarked against. If your fancy dense retriever does not beat BM25 on your own labelled set, something is wrong with the dense pipeline.
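A baseline comparison of this kind can be sketched in a few lines. The example below computes nDCG@k for one query with linear gains, which is one of several common nDCG conventions; the document ids, relevance labels, and the two ranked lists are made up for illustration and do not come from any real system.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query; `relevance` maps doc id -> graded label."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical labels and rankings for a single query.
relevance = {"d1": 2, "d4": 1}
bm25_run = ["d1", "d4", "d2"]
dense_run = ["d2", "d1", "d3"]

bm25_ndcg = ndcg_at_k(bm25_run, relevance)
dense_ndcg = ndcg_at_k(dense_run, relevance)
```

In practice the per-query values would be averaged over the whole labelled set before comparing the two systems.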
BM25F and BM25+
Two well-known variants extend the base formula. BM25F (field-weighted) scores documents that have multiple fields (title, body, anchor text) with per-field length normalisation and per-field weights, which matters for web search and structured corpora. BM25+ (Lv and Zhai, 2011) fixes a known issue where length normalisation over-penalises very long documents, so that a long document containing a query term can score barely higher than a document that lacks the term entirely. It does so by adding a small constant δ to the term-frequency component of every matching term.
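The difference between the variants is easiest to see in the term-frequency component alone. The sketch below shows the base component, the BM25+ lower bound, and one common simple formulation of BM25F in which field counts are normalised, weighted, and summed into a pseudo-frequency before saturation. All function names, field names, weights, and lengths are assumptions chosen for illustration.

```python
def bm25_tf(f, doc_len, avgdl, k1=1.2, b=0.75):
    """Saturated, length-normalised term-frequency component of base BM25."""
    return f * (k1 + 1.0) / (f + k1 * (1.0 - b + b * doc_len / avgdl))


def bm25_plus_tf(f, doc_len, avgdl, k1=1.2, b=0.75, delta=1.0):
    """BM25+ component: any matching term earns at least `delta`."""
    if f == 0:
        return 0.0
    return bm25_tf(f, doc_len, avgdl, k1, b) + delta


def bm25f_tf(tfs, lens, avg_lens, weights, bs, k1=1.2):
    """Simple BM25F: combine per-field counts, then saturate once."""
    pseudo_tf = 0.0
    for name, f in tfs.items():
        # Each field has its own length normalisation and its own weight.
        norm = 1.0 - bs[name] + bs[name] * lens[name] / avg_lens[name]
        pseudo_tf += weights[name] * f / norm
    return pseudo_tf / (k1 + pseudo_tf)


# One occurrence in a document 100x longer than average.
base_long = bm25_tf(1, doc_len=10_000, avgdl=100)
plus_long = bm25_plus_tf(1, doc_len=10_000, avgdl=100)

# Hypothetical two-field setup: a title match versus a body match.
lens = {"title": 5, "body": 100}
avg_lens = {"title": 5, "body": 100}
weights = {"title": 3.0, "body": 1.0}
bs = {"title": 0.5, "body": 0.75}
title_hit = bm25f_tf({"title": 1, "body": 0}, lens, avg_lens, weights, bs)
body_hit = bm25f_tf({"title": 0, "body": 1}, lens, avg_lens, weights, bs)
```

In the long-document case the base component is close to zero while the BM25+ component stays above `delta`, which is the lower bound the variant is designed to provide.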
Tokenisation Matters More Than the Formula
The single largest source of BM25 quality variation between implementations is tokenisation. Lowercasing, stemming (Porter, Snowball, Krovetz), stop-word lists, punctuation handling, n-gram generation, and language-specific analysers (CJK segmentation, Arabic root extraction) all change which terms enter the inverted index. Two BM25 implementations on the same corpus with different analysers can differ by 10+ points of nDCG.
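A small sketch shows how analyser choices change which terms can match at all. Both analysers below are deliberately simplistic and hypothetical: one splits on whitespace only, the other lowercases, splits on non-alphanumeric characters, and drops a tiny illustrative stop-word list. Neither performs stemming.

```python
import re

STOP_WORDS = {"the", "a", "of", "and"}  # illustrative list, not a standard one


def whitespace_analyser(text):
    """Split on whitespace only; case and punctuation are preserved."""
    return text.split()


def simple_analyser(text):
    """Lowercase, keep alphanumeric runs, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


doc = "The Okapi BM25 ranking-function, explained."
query = "bm25 ranking function"

raw_terms = set(whitespace_analyser(doc))
clean_terms = set(simple_analyser(doc))

# Query terms that would find this document under each analyser.
raw_hits = [t for t in whitespace_analyser(query) if t in raw_terms]
clean_hits = [t for t in simple_analyser(query) if t in clean_terms]
```

Under the whitespace analyser none of the query terms match, because `BM25` differs in case and `ranking-function,` is indexed as a single token. Under the second analyser all three match, with an identical scoring formula in both cases.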
Implementations
- Lucene / Elasticsearch / OpenSearch: the reference production implementation. BM25 has been Lucene's default similarity since version 6.0 (2016).
- Tantivy: Rust search library inspired by Lucene, used inside Quickwit and ParadeDB.
- rank_bm25: pure-Python library, fine for prototyping, too slow for production-scale corpora.
- BM25S: 2024 Python library that precomputes scores into sparse matrices; its authors report speedups of up to 500x over rank_bm25.
- PostgreSQL via ParadeDB / pg_search: brings Tantivy-backed BM25 ranking to Postgres, where it can sit alongside pgvector.
References
- The Probabilistic Relevance Framework: BM25 and Beyond · Robertson & Zaragoza, Foundations and Trends in IR (2009)
- BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models · arXiv (Thakur et al., 2021)
- Lucene Similarity Documentation · Apache Lucene Docs