BM25 (Okapi BM25)
Okapi BM25, developed in the 1990s from probabilistic relevance theory, is still the workhorse of keyword search: it is the default ranking function in Lucene, Elasticsearch, and OpenSearch, and the usual sparse retriever in hybrid RAG systems. For each (query term, document) pair it multiplies an IDF weight by a saturating term-frequency (TF) component that is normalized for document length, then sums the results over the query terms. Two constants tune it: k1 sets how fast TF saturates and b sets the strength of length normalization. Despite being decades old, BM25 is hard to beat on exact-match and long-tail queries and remains an essential baseline in any IR benchmark.
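For reference, the scoring function described above is commonly written as follows. The symbol names are notation chosen here rather than taken from this page: f(q_i, D) is the count of query term q_i in document D, |D| is the document length in tokens, avgdl is the average document length in the collection, N is the number of documents, and n(q_i) is the number of documents containing q_i. The IDF shown is one common non-negative variant; other formulations exist.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q_i) = \ln\!\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)
```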
Quick reference
- Proficiency: Intermediate
- Also known as: Okapi BM25, Best Matching 25
- Prerequisites: TF-IDF, tokenization
Frequently asked questions
What is BM25?
BM25 is a probabilistic ranking function that scores how well a document matches a keyword query using term frequency, inverse document frequency, and document length normalization. It powers Lucene, Elasticsearch, and most keyword search engines.
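To make the three ingredients concrete, here is a minimal sketch of a BM25 scorer in plain Python. It is an illustration under stated assumptions, not how Lucene or Elasticsearch implement it: the function names `bm25_scores` and `tokenize`, the whitespace tokenizer, and the three-sentence corpus are all hypothetical, and the IDF is one common non-negative variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase/whitespace tokenizer (assumption); real engines use analyzers.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document.

    `query` is a list of tokens and `docs` is a list of token lists.
    """
    n_docs = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Length normalization: 1.0 for an average-length document when b > 0.
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query:
            f = tf[term]
            if f == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    "the quick brown fox",
    "the lazy dog sleeps all day",
    "quick quick quick fox jumps over the lazy dog",
]
docs = [tokenize(text) for text in corpus]
scores = bm25_scores(tokenize("quick fox"), docs)
# Document indices ordered from best to worst match.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

In this toy corpus the second document contains neither query term, so it scores zero. The short first document narrowly outranks the third even though the third repeats "quick" three times, because repeated occurrences saturate and the longer document is penalized for its length.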
How is BM25 different from TF-IDF?
BM25 saturates term frequency (a term appearing 50 times isn't 50x better than once) via the k1 parameter, and normalizes by document length via b. Plain TF-IDF does neither, so it tends to over-reward keyword stuffing and long documents.
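The saturation effect can be seen by comparing a raw term count with BM25's TF component. This is a small sketch with length normalization switched off (b = 0) so that only saturation is visible; `raw_tf` and `bm25_tf` are hypothetical helper names, and k1 = 1.2 is an assumed value.

```python
def raw_tf(f):
    # Unsaturated term frequency: grows linearly with the count.
    return float(f)


def bm25_tf(f, k1=1.2):
    # BM25 TF component with b = 0; bounded above by k1 + 1.
    return f * (k1 + 1) / (f + k1)


for f in (1, 2, 5, 50):
    print(f"count={f:>2}  raw={raw_tf(f):>5.1f}  bm25={bm25_tf(f):.3f}")
```

With k1 = 1.2 the BM25 component equals 1 for a single occurrence and can never exceed k1 + 1 = 2.2, so 50 occurrences are worth only about 2.15 times one occurrence rather than 50 times.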
What are k1 and b?
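k1 controls how quickly TF saturates (typically 1.2 to 2.0); higher values saturate more slowly, so extra occurrences keep adding to the score for longer. b controls length normalization (0 to 1, typically 0.75); 0 disables it and 1 normalizes fully by the ratio of document length to average length. The common defaults (k1 = 1.2, b = 0.75, as in Lucene and Elasticsearch) are a reasonable starting point for most corpora.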
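A quick way to build intuition for the two parameters is to vary each one while holding everything else fixed. The sketch below is illustrative only: `tf_component` is a hypothetical helper, and the document lengths and the average length of 100 tokens are made-up values.

```python
def tf_component(f, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 term-frequency component for one term in one document."""
    length_norm = 1 - b + b * doc_len / avgdl
    return f * (k1 + 1) / (f + k1 * length_norm)


AVGDL = 100  # hypothetical average document length, in tokens

# Effect of b: the same term count (3) in an average-length document
# and in a document twice as long.
penalty_by_b = {}
for b in (0.0, 0.75, 1.0):
    average_doc = tf_component(3, 100, AVGDL, b=b)
    long_doc = tf_component(3, 200, AVGDL, b=b)
    penalty_by_b[b] = long_doc / average_doc
    print(f"b={b}: long/average score ratio = {penalty_by_b[b]:.3f}")

# Effect of k1: how much 10 occurrences outscore 1 occurrence
# in an average-length document.
gain_by_k1 = {}
for k1 in (0.5, 1.2, 2.0):
    ten = tf_component(10, 100, AVGDL, k1=k1)
    one = tf_component(1, 100, AVGDL, k1=k1)
    gain_by_k1[k1] = ten / one
    print(f"k1={k1}: 10-vs-1 occurrence ratio = {gain_by_k1[k1]:.3f}")
```

With b = 0 the long document is not penalized at all, and the penalty grows as b approaches 1. Raising k1 widens the gap between 10 occurrences and 1, which is the sense in which higher k1 makes extra occurrences matter more.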
Is BM25 obsolete now that we have embeddings?
No. BM25 remains a strong, often hard-to-beat baseline on exact-match and rare-term queries, is cheap to index and serve, and is the standard sparse component in hybrid search. Many production RAG systems run it alongside a dense retriever.
Sources
- Robertson & Zaragoza — The Probabilistic Relevance Framework: BM25 and Beyond — accessed 2026-04-20
- Wikipedia — Okapi BM25 — accessed 2026-04-20