Cross-Encoder Reranking

TL;DR

A cross-encoder concatenates query and candidate document into a single sequence and feeds them jointly to a transformer that outputs a relevance score.
Typically more accurate than the bi-encoders used for first-stage retrieval, because every query token can attend to every document token.
Much slower: each (query, document) pair needs a full forward pass, so a cross-encoder cannot score millions of documents directly; it reranks a small top-k from cheaper retrieval.
Standard production pattern: retrieve 50-200 candidates with hybrid search, rerank to top 5-10 with a cross-encoder, pass those to the generator.

The Bi-Encoder vs Cross-Encoder Split

A bi-encoder is two separate forward passes — one over the query, one over the document — producing two vectors that are compared with a cheap similarity function. A cross-encoder is one forward pass over the concatenation '[CLS] query [SEP] document [SEP]', producing a single relevance score. The cross-encoder is more accurate because the transformer can model term-level interactions between query and document; the bi-encoder is faster because document encoding happens once at index time.
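
The concatenated input described above can be sketched as follows. This is a minimal illustration assuming BERT-style special tokens and segment ids; the function name `pack_pair`, the whitespace tokenization, and the 512-token limit are assumptions for the example, not the behavior of any specific tokenizer.

```python
from typing import List, Tuple


def pack_pair(
    query_tokens: List[str], doc_tokens: List[str], max_len: int = 512
) -> Tuple[List[str], List[int]]:
    """Build '[CLS] query [SEP] document [SEP]' plus BERT-style segment ids."""
    # Three special tokens are always present; the document absorbs any truncation.
    budget = max_len - len(query_tokens) - 3
    if budget < 0:
        raise ValueError("query alone exceeds max_len")
    doc_tokens = doc_tokens[:budget]
    tokens = ["[CLS]", *query_tokens, "[SEP]", *doc_tokens, "[SEP]"]
    # Segment 0 covers [CLS], the query, and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids


# Hypothetical whitespace-tokenized inputs, for illustration only.
tokens, segments = pack_pair(
    "what is reranking".split(),
    "reranking reorders retrieved candidates".split(),
)
print(tokens)
print(segments)
```

Because query and document share one sequence, self-attention runs across both segments, which is where the term-level interaction comes from. It is also why nothing can be precomputed per document: the input changes with every query.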

The two are used in series. The bi-encoder fetches a candidate pool (tens to hundreds of documents per query) from an index of millions. The cross-encoder then scores each (query, candidate) pair to produce the final ranking. This two-stage architecture is a common default for retrieval systems where quality matters.
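
A minimal sketch of the two stages in series, under stated assumptions: `retrieve_then_rerank` is a hypothetical helper, the document vectors stand in for a precomputed bi-encoder index, and `token_overlap` is a toy stand-in for a real cross-encoder's pair scorer.

```python
from typing import Callable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    docs: List[str],
    doc_vecs: List[Sequence[float]],
    pair_score: Callable[[str, str], float],
    pool_size: int = 100,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap similarity against vectors computed once at index time.
    by_similarity = sorted(
        range(len(docs)), key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True
    )
    pool = by_similarity[:pool_size]
    # Stage 2: one expensive scorer call per (query, candidate) pair.
    scored = [(docs[i], pair_score(query, docs[i])) for i in pool]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def token_overlap(query: str, doc: str) -> float:
    """Toy pair scorer; a real system would call a cross-encoder here."""
    return float(len(set(query.split()) & set(doc.split())))


# Toy corpus with hand-written 2-d vectors (illustrative, not real embeddings).
docs = [
    "cats purr",
    "reranking reorders candidates",
    "dogs bark",
    "rerankers score query document pairs",
]
doc_vecs = [[0.1, 0.9], [0.8, 0.2], [0.0, 1.0], [0.7, 0.1]]
query = "how do rerankers score pairs"
query_vec = [1.0, 0.0]

result = retrieve_then_rerank(
    query, query_vec, docs, doc_vecs, token_overlap, pool_size=2, top_n=1
)
print(result)
```

In this toy run the first stage ranks "reranking reorders candidates" highest, and the pair scorer promotes the other pooled document above it. The cost structure is the point: stage 1 touches every document with a cheap operation, while stage 2 makes only `pool_size` expensive calls.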

Production Reranker Families

Model	Author	Size	Property
ms-marco-MiniLM-L-6-v2	sentence-transformers	22M params	Fast baseline, CPU-friendly
bge-reranker-v2-m3	BAAI	568M params	Multilingual, long context
Cohere Rerank 3	Cohere	Proprietary API	Strong multilingual, low latency
Voyage rerank-2	Voyage AI	Proprietary API	Domain-tuned variants available
Jina Reranker v2	Jina AI	Open weights	Multilingual, function-calling aware
Qwen3-Reranker	Alibaba	0.6B / 4B / 8B	Open weights, top of BEIR in 2025

Latency Budget

A small cross-encoder (MiniLM class) scores roughly 1000-2000 (query, document) pairs per second on a single modern GPU. A 568M-parameter reranker scores 100-300 pairs per second. The implication is direct: at those rates, 50 candidates take roughly 25-50 ms on the small model and roughly 170-500 ms on the 568M model. If you need a P95 reranking latency under 200 ms, the 568M model is already marginal and a 7B-parameter reranker is out of reach; pick the smaller model or reduce the candidate count.
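
The budget arithmetic can be sketched as a back-of-the-envelope calculation. The helper names are hypothetical, and the model assumes steady-state throughput with no tokenization, batching, or queueing overhead, so it approximates average compute time rather than a measured P95.

```python
def rerank_latency_ms(num_candidates: int, pairs_per_second: float) -> float:
    """Approximate time to score every candidate once, in milliseconds."""
    return 1000.0 * num_candidates / pairs_per_second


def max_candidates(budget_ms: float, pairs_per_second: float) -> int:
    """Largest candidate count that fits inside the latency budget."""
    return int(budget_ms * pairs_per_second // 1000)


# Throughput figures are the rough ranges quoted in the text above.
for label, rate in [
    ("MiniLM class, low end", 1000),
    ("MiniLM class, high end", 2000),
    ("568M params, low end", 100),
    ("568M params, high end", 300),
]:
    latency = rerank_latency_ms(50, rate)
    fits = max_candidates(200, rate)
    print(f"{label}: 50 candidates take {latency:.0f} ms; 200 ms fits {fits} candidates")
```

Real P95 latency will sit above these figures, so it is worth leaving headroom and measuring on the target hardware with realistic document lengths.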

Hosted APIs (Cohere, Voyage, Jina) make a different trade-off: they are often slightly more accurate, but add a network round-trip and a per-query cost. For latency-sensitive applications, a self-hosted reranker on the same machine as the generator (for example, next to a vLLM server) usually wins on total system latency.

Rerankers are often the highest-leverage quality improvement in a basic RAG stack. Adding a decent cross-encoder typically lifts answer faithfulness by 10-20 percentage points before any other tuning.

LLM-as-Reranker

A different pattern uses a general-purpose LLM as the reranker, prompted to score or sort the candidate list. Listwise approaches (RankGPT, RankZephyr) feed the entire candidate list at once and ask the model to return a permutation (Sun et al., 2023). Quality can match or beat trained cross-encoders on out-of-domain corpora, at much higher latency and cost. This is useful when you cannot fine-tune a dedicated reranker, or for offline evaluation pipelines.
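
A listwise reranker needs two pieces around the LLM call: a prompt that numbers the candidates, and a parser that turns the reply into a permutation. The sketch below assumes a RankGPT-style reply of the form `[2] > [1] > [3]`; the prompt wording and both function names are illustrative, and the LLM call itself is left out (the `reply` string is a hypothetical model output).

```python
import re
from typing import List


def build_listwise_prompt(query: str, passages: List[str]) -> str:
    lines = [
        f"Rank the {len(passages)} passages below by relevance to the query.",
        f"Query: {query}",
        "",
    ]
    lines += [f"[{i}] {p}" for i, p in enumerate(passages, start=1)]
    lines += ["", "Answer with identifiers only, most relevant first, e.g. [2] > [1] > [3]."]
    return "\n".join(lines)


def parse_permutation(response: str, n: int) -> List[int]:
    """Return 0-based passage indices in ranked order, tolerating messy output."""
    order: List[int] = []
    for match in re.findall(r"\[(\d+)\]", response):
        idx = int(match) - 1
        # Skip out-of-range identifiers and duplicates.
        if 0 <= idx < n and idx not in order:
            order.append(idx)
    # Passages the model left out keep their original relative order at the end.
    order += [i for i in range(n) if i not in order]
    return order


passages = ["cats purr", "rerankers score pairs", "dogs bark", "bi-encoders embed text"]
prompt = build_listwise_prompt("how do rerankers work", passages)
reply = "[2] > [4] > [2] > [9]"  # hypothetical LLM output with a repeat and a bad id
ranking = parse_permutation(reply, len(passages))
print([passages[i] for i in ranking])
```

The defensive parsing matters in practice, since a general-purpose LLM may repeat, omit, or invent identifiers. Falling back to the first-stage order for anything missing keeps the output a valid permutation.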

When You Don't Need Reranking

Reranking buys you precision at the top of the ranked list. If your downstream LLM has a long context and you are happy passing 30+ chunks into the prompt, the marginal value of a reranker shrinks. Conversely, if you are passing 3-5 chunks (typical for short-context or cost-constrained generation), reranking is essential because the top of the retrieval ranking is usually noisy.
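
A toy precision@k calculation illustrates why the benefit depends on how many chunks you pass downstream. The document ids, relevance labels, and both orderings are made up for the example.

```python
from typing import List, Set


def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


# Hypothetical labels and orderings; reranking reorders the same candidate pool.
relevant = {"d2", "d5", "d7"}
first_stage = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
reranked = ["d2", "d5", "d7", "d1", "d3", "d4", "d6", "d8"]

for k in (3, 8):
    before = precision_at_k(first_stage, relevant, k)
    after = precision_at_k(reranked, relevant, k)
    print(f"P@{k}: first stage {before:.2f}, reranked {after:.2f}")
```

At k=3 the reranked order puts only relevant chunks in the prompt, while the first-stage order includes one. At k=8 both orderings pass the same set, so the reranker changes nothing about which chunks the generator sees.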

References

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks · arXiv (Reimers & Gurevych, 2019)
Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents · arXiv (Sun et al., 2023)
BGE Reranker Documentation · BAAI on Hugging Face
