BM25 (Okapi BM25)

Okapi BM25, developed in the 1990s from probabilistic relevance theory, is still the workhorse of keyword search: Lucene, Elasticsearch, and OpenSearch use it, and most hybrid RAG systems use it as their sparse retriever. It scores each (query term, document) pair with three parts: a term-frequency (TF) component that saturates at a rate set by k1, an inverse document frequency (IDF) weight, and a document-length normalization whose strength is set by b. Despite being decades old, BM25 is hard to beat on exact-match and long-tail queries and remains a standard baseline in IR benchmarks.
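
For reference, one common way to write the scoring function is sketched below. The symbols are notation assumed for this sketch, not taken from the text above: f(q_i, D) is the number of times query term q_i occurs in document D, |D| is the document length in tokens, avgdl is the average document length in the corpus, N is the number of documents, and n(q_i) is the number of documents containing q_i. The IDF shown is the smoothed variant with +1 inside the logarithm, which keeps the weight non-negative; other variants exist.

```latex
\[
\operatorname{score}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i)\,
\frac{f(q_i, D)\,(k_1 + 1)}
     {f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]
\[
\operatorname{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
\]
```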

Quick reference

Proficiency: Intermediate
Also known as: Okapi BM25, Best Matching 25
Prerequisites: TF-IDF, tokenization

Frequently asked questions

What is BM25?

BM25 is a probabilistic ranking function that scores how well a document matches a keyword query using term frequency, inverse document frequency, and document length normalization. It powers Lucene, Elasticsearch, and most keyword search engines.
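
To make the three ingredients concrete, here is a minimal sketch of a BM25 scorer in plain Python. It assumes documents are already tokenized into lists of strings, uses the smoothed IDF variant with +1 inside the logarithm, and the function name bm25_scores and the toy corpus are illustrative only. It is not how Lucene or Elasticsearch implement scoring internally.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document for a tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: above 1 for long documents, below 1 for short ones.
        length_norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical data, whitespace tokenization).
docs = [
    "the quick brown fox".split(),
    "the lazy dog sleeps all day".split(),
    "quick quick quick fox jumps over the lazy dog".split(),
]
scores = bm25_scores("quick fox".split(), docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {' '.join(doc)}")
```

The second document contains neither query term, so its score is zero. The other two score positively, with the repeated "quick" in the third document partly offset by that document's greater length.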

How is BM25 different from TF-IDF?

BM25 saturates term frequency (a term appearing 50 times isn't 50x better than once) via the k1 parameter, and normalizes by document length via b. Plain TF-IDF does neither, so it tends to over-reward keyword stuffing and long documents.
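
The saturation effect is easy to see numerically. The sketch below compares a raw term count with BM25's TF component, assuming a document of exactly average length (so the length normalization factor is 1) and k1 = 1.2. The helper name bm25_tf is hypothetical.

```python
def bm25_tf(tf, k1=1.2):
    """BM25 TF component for a document of average length (length factor = 1)."""
    return tf * (k1 + 1) / (tf + k1)


for tf in (1, 2, 5, 10, 50):
    print(f"tf={tf:>2}  raw={tf:>2}  bm25={bm25_tf(tf):.3f}")
```

Under these assumptions the component equals 1 for a single occurrence and can never exceed k1 + 1 = 2.2, no matter how often the term repeats, whereas a raw count keeps growing linearly.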

What are k1 and b?

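k1 controls how quickly TF saturates (typically 1.2 to 2.0); a higher k1 saturates more slowly, so extra occurrences of a term keep adding to the score for longer. b controls length normalization (range 0 to 1, typically 0.75); b = 0 disables it, and b = 1 normalizes fully by the document's length relative to the corpus average. The defaults work well on most corpora.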
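
A small sketch of how the two constants behave, using only the TF part of the score (IDF is left out because it does not depend on k1 or b). The function name tf_component and the document lengths are made up for illustration.

```python
def tf_component(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """TF part of BM25, including document-length normalization."""
    length_norm = 1 - b + b * doc_len / avgdl
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# Effect of b: same term count in a short and a long document.
for b in (0.0, 0.75, 1.0):
    short = tf_component(tf=3, doc_len=50, avgdl=100, b=b)
    long_ = tf_component(tf=3, doc_len=400, avgdl=100, b=b)
    print(f"b={b:<4}  short doc={short:.3f}  long doc={long_:.3f}")

# Effect of k1: how much more 10 occurrences are worth than 1.
for k1 in (0.5, 1.2, 2.0):
    one = tf_component(tf=1, doc_len=100, avgdl=100, k1=k1)
    ten = tf_component(tf=10, doc_len=100, avgdl=100, k1=k1)
    print(f"k1={k1:<4}  ratio tf=10 / tf=1 = {ten / one:.3f}")
```

With b = 0 the short and long documents get the same value; as b rises, the long document is penalized more. The ratio in the second loop grows with k1, which is what "extra occurrences matter more" means in practice.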

Is BM25 obsolete now that we have embeddings?

No. BM25 remains hard to beat on many exact-match and rare-term queries, is cheap to index and serve, and is the usual sparse component in hybrid search. Many production RAG systems run it alongside a dense retriever.
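
As an illustration of running BM25 alongside a dense retriever, the sketch below merges two ranked lists with reciprocal rank fusion (RRF). RRF is one common fusion method chosen here as an assumption; the text above does not prescribe a method. The document IDs are invented, and k = 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each list is ordered best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-3 results from each retriever.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers return rise to the top, while a document found by only one retriever can still appear in the fused list. Because RRF uses ranks rather than raw scores, BM25 scores and embedding similarities do not need to be put on the same scale.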

Sources

Robertson & Zaragoza — The Probabilistic Relevance Framework: BM25 and Beyond — accessed 2026-04-20
Wikipedia — Okapi BM25 — accessed 2026-04-20
