TL;DR
- A cross-encoder concatenates query and candidate document into a single sequence and feeds them jointly to a transformer that outputs a relevance score.
- Generally more accurate than the bi-encoders used for first-stage retrieval, because every token of the query can attend to every token of the document.
- Much slower at query time: each (query, document) pair is a full forward pass, so cross-encoders cannot be used to score millions of documents directly; they rerank a small top-k from cheaper retrieval.
- Standard production pattern: retrieve 50-200 candidates with hybrid search, rerank to top 5-10 with a cross-encoder, pass those to the generator.
The Bi-Encoder vs Cross-Encoder Split
A bi-encoder is two separate forward passes — one over the query, one over the document — producing two vectors that are compared with a cheap similarity function. A cross-encoder is one forward pass over the concatenation '[CLS] query [SEP] document [SEP]', producing a single relevance score. The cross-encoder is more accurate because the transformer can model term-level interactions between query and document; the bi-encoder is faster because document encoding happens once at index time.
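To make the concatenated input concrete, here is a minimal sketch of how a BERT-style cross-encoder packs a pair into one sequence. The helper name `pack_pair`, the whitespace tokenization, and the `max_len` default are illustrative assumptions; real models use a subword tokenizer, and some model families use different special tokens or no segment ids at all.

```python
from typing import List, Tuple


def pack_pair(
    query_tokens: List[str],
    doc_tokens: List[str],
    max_len: int = 512,  # assumed limit; the real value depends on the model
) -> Tuple[List[str], List[int]]:
    """Pack a (query, document) pair into one BERT-style input sequence."""
    # Reserve three positions for [CLS] and the two [SEP] markers,
    # then truncate the document (not the query) to fit.
    budget = max_len - len(query_tokens) - 3
    if budget < 0:
        raise ValueError("query alone exceeds max_len")
    doc_tokens = doc_tokens[:budget]

    tokens = ["[CLS]", *query_tokens, "[SEP]", *doc_tokens, "[SEP]"]
    # Segment ids: 0 for [CLS] + query + first [SEP],
    # 1 for the document + final [SEP].
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids


# Whitespace splitting stands in for a real subword tokenizer.
tokens, segment_ids = pack_pair(
    "what is a reranker".split(),
    "a reranker reorders retrieved candidates".split(),
)
```

Because query and document share one sequence, self-attention runs across both, which is where the term-level interactions come from. It also means nothing about the document can be precomputed at index time.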
The two are used in series. The bi-encoder fetches a candidate pool (hundreds of documents per query) from an index of millions. The cross-encoder then scores each (query, candidate) pair to produce the final ranking. This two-stage architecture is now the default for any retrieval system where quality matters.
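The two-stage flow can be sketched in a few lines. Everything here is a stand-in: `word_overlap` plays the role of the cheap first stage and `bigram_overlap` plays the role of the cross-encoder, so the example runs without any model. The function names, the toy corpus, and the default `k_retrieve` and `k_final` values are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of distinct shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def bigram_overlap(query: str, doc: str) -> float:
    """Toy second-stage scorer: number of distinct shared word bigrams."""
    def bigrams(text: str) -> set:
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return float(len(bigrams(query) & bigrams(doc)))


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage: Scorer,
    reranker: Scorer,
    k_retrieve: int = 100,
    k_final: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap score over the whole corpus, keep the top k_retrieve.
    candidates = sorted(
        corpus, key=lambda d: first_stage(query, d), reverse=True
    )[:k_retrieve]
    # Stage 2: expensive score, but only over the candidate pool.
    scored = [(d, reranker(query, d)) for d in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k_final]


CORPUS = [
    "rerankers reorder retrieved candidates by relevance",
    "a cross encoder scores a query and document jointly",
    "bi encoders embed query and document separately",
    "bananas are rich in potassium",
]
QUERY = "how does a cross encoder score a query"

top = retrieve_then_rerank(QUERY, CORPUS, word_overlap, bigram_overlap,
                           k_retrieve=2, k_final=1)
```

The point of the structure is that the expensive scorer is called `k_retrieve` times per query regardless of corpus size. In a real system the first stage would be an index lookup rather than a full sort over the corpus.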
Production Reranker Families
| Model | Author | Size | Property |
|---|---|---|---|
| ms-marco-MiniLM-L-6-v2 | sentence-transformers | 22M params | Fast baseline, CPU-friendly |
| bge-reranker-v2-m3 | BAAI | 568M params | Multilingual, long context |
| Cohere Rerank 3 | Cohere | Proprietary API | Strong multilingual, low latency |
| Voyage rerank-2 | Voyage AI | Proprietary API | Domain-tuned variants available |
| Jina Reranker v2 | Jina AI | Open weights | Multilingual, function-calling aware |
| Qwen3-Reranker | Alibaba | 0.6B / 4B / 8B | Open weights, top of BEIR in 2025 |
Latency Budget
A small cross-encoder (MiniLM class) scores roughly 1000-2000 (query, document) pairs per second on a single modern GPU. A 568M-parameter reranker scores 100-300 pairs per second. The implication is direct. Reranking 50 candidates takes about 25-50 ms with the small model but roughly 170-500 ms with the 568M-parameter one, so if you need a P95 latency under 200 ms for reranking, the mid-sized model is already marginal and a 7B-parameter reranker is out of reach: pick a smaller model or reduce the candidate count.
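The budget arithmetic is simple enough to script. This is a back-of-envelope sketch that treats the throughput figures above as steady-state rates and ignores batching effects, tokenization, and fixed per-request overhead, so real P95 numbers will be worse. The function names are illustrative.

```python
def rerank_latency_ms(num_candidates: int, pairs_per_second: float) -> float:
    """Estimated time to score one query's candidates, in milliseconds."""
    return 1000.0 * num_candidates / pairs_per_second


def max_candidates(budget_ms: float, pairs_per_second: float) -> int:
    """Largest candidate count that fits inside the latency budget."""
    return int(budget_ms * pairs_per_second // 1000)


# Throughput ranges quoted in the text above (pairs per second).
small_fast, small_slow = 2000, 1000   # MiniLM-class reranker
mid_fast, mid_slow = 300, 100         # 568M-parameter reranker

small_range = (rerank_latency_ms(50, small_fast), rerank_latency_ms(50, small_slow))
mid_range = (rerank_latency_ms(50, mid_fast), rerank_latency_ms(50, mid_slow))

# How many candidates a 200 ms budget allows at the pessimistic rates.
small_cap = max_candidates(200, small_slow)   # 200
mid_cap = max_candidates(200, mid_slow)       # 20
```

At the pessimistic end of its range, the 568M-parameter model fits only 20 candidates into 200 ms, which is why the candidate count and the model size have to be chosen together.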
Hosted APIs (Cohere, Voyage, Jina) make a different trade-off: often slightly better accuracy, in exchange for a network round-trip and a per-query cost. For latency-sensitive applications, a self-hosted reranker running on the same machine as the generator's inference server (vLLM, for example) usually wins on total system latency.
Rerankers are the single highest-leverage quality improvement in a basic RAG stack. Adding any decent cross-encoder typically lifts answer faithfulness 10-20 percentage points before tuning anything else.
LLM-as-Reranker
A different pattern uses a general-purpose LLM as the reranker, prompted to score or sort the candidate list. Listwise approaches (RankGPT, RankZephyr) feed the entire candidate list at once and ask the model to return a permutation. Quality can match or beat trained cross-encoders on out-of-domain corpora, at much higher latency and cost. This is useful when you cannot fine-tune a dedicated reranker, or for offline evaluation pipelines.
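A listwise reranker needs two pieces of glue code: a prompt that numbers the candidates, and a parser that turns the model's answer back into an ordering. The sketch below assumes the model replies with bracketed identifiers such as `[2] > [1] > [3]`. The prompt wording and both function names are illustrative, not the exact prompt of any published system, and the LLM call itself is left out.

```python
import re
from typing import List


def build_listwise_prompt(query: str, passages: List[str]) -> str:
    """Number the passages and ask for a ranking of their identifiers."""
    lines = [
        f"Rank the {len(passages)} passages below by relevance to the query.",
        f"Query: {query}",
        "",
    ]
    for i, passage in enumerate(passages, start=1):
        lines.append(f"[{i}] {passage}")
    lines.append("")
    lines.append("Answer with identifiers only, most relevant first, "
                 "for example: [2] > [1] > [3]")
    return "\n".join(lines)


def parse_permutation(response: str, num_passages: int) -> List[int]:
    """Turn a reply like '[2] > [3] > [1]' into 0-based passage indices."""
    order: List[int] = []
    for match in re.findall(r"\[(\d+)\]", response):
        idx = int(match) - 1
        # Drop out-of-range identifiers and repeats.
        if 0 <= idx < num_passages and idx not in order:
            order.append(idx)
    # Anything the model forgot keeps its original relative order at the end.
    missing = [i for i in range(num_passages) if i not in order]
    return order + missing


prompt = build_listwise_prompt("what is a reranker", ["p one", "p two", "p three"])
clean = parse_permutation("[2] > [3] > [1]", 3)      # [1, 2, 0]
messy = parse_permutation("[2] > [2] > [9]", 3)      # [1, 0, 2]
```

The defensive parsing matters in practice because a generative model can repeat, skip, or invent identifiers, and the pipeline still needs a full ordering back.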
When You Don't Need Reranking
Reranking buys you precision at the top of the ranked list. If your downstream LLM has a long context and you are happy passing 30+ chunks into the prompt, the marginal value of a reranker shrinks. Conversely, if you are passing 3-5 chunks (typical for short-context or cost-constrained generation), reranking is essential because the top of the retrieval ranking is usually noisy.
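One way to check this on your own data is to compare precision@k before and after reranking, at the k you actually pass to the generator. The sketch below uses made-up document ids and relevance labels purely for illustration.

```python
from typing import List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hypothetical rankings of the same five candidates.
relevant = {"d1", "d4"}
before_rerank = ["d7", "d2", "d9", "d1", "d4"]
after_rerank = ["d1", "d4", "d7", "d2", "d9"]

p2_before = precision_at_k(before_rerank, relevant, 2)   # 0.0
p2_after = precision_at_k(after_rerank, relevant, 2)     # 1.0
p5_before = precision_at_k(before_rerank, relevant, 5)   # 0.4
p5_after = precision_at_k(after_rerank, relevant, 5)     # 0.4
```

In this toy case the reranker changes everything at k=2 and nothing at k=5, because reordering within the set you already pass to the generator does not change which chunks it sees.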
References
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks · arXiv (Reimers & Gurevych, 2019)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents · arXiv (Sun et al., 2023)
- BGE Reranker Documentation · BAAI on Hugging Face