RAG Architecture — Retrieval-Augmented Generation

TL;DR

RAG was introduced by Patrick Lewis and colleagues at Facebook AI Research (now Meta FAIR), UCL and NYU in 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks' (arXiv:2005.11401, May 2020). The original paper fine-tuned a Dense Passage Retriever query encoder and a BART generator jointly, keeping the document index fixed; modern usage of the term has loosened to mean any architecture that conditions an LLM on documents fetched at inference time.
The premise: parametric weights are a frozen snapshot of the training corpus and cannot answer questions about events after the cutoff, cannot cite sources, and cannot enforce per-document access control. RAG moves the knowledge out of the weights into a queryable store that is updated, audited and authorised independently.
Standard production stack in 2026 — chunk and embed the corpus, store dense vectors in an ANN index (HNSW or IVF) alongside a BM25 inverted index, retrieve hybrid top-k at query time, rerank with a cross-encoder, then condition the LLM on the surviving passages via a templated prompt with explicit citation slots.
Variants matter: HyDE rewrites the query; hybrid search recovers rare-term recall; ColBERT keeps token-level interactions; GraphRAG (Microsoft, 2024) answers multi-document questions that vector retrieval cannot; agentic RAG (LangGraph, LlamaIndex) treats retrieval as a tool the model calls. None of them replaces the five-stage backbone; they sit on top of it.
Retrieval is almost always the bottleneck rather than generation. Build retrieval evaluation (Recall@k, nDCG, MRR) and answer evaluation (faithfulness, answer relevance) separately with frameworks like Ragas, TruLens or DeepEval before tuning prompts or upgrading the LLM.

Overview

Retrieval-Augmented Generation is the architectural answer to a question every team eventually asks: how do you make a language model say correct, current and source-grounded things about a corpus it was never trained on? Fine-tuning teaches the model a style or skill but is expensive to refresh and cannot expose private documents on a per-user basis. Stuffing every document into the context window is bounded by both window size and inference cost. RAG splits the responsibility — knowledge lives in an external store, the model performs reasoning and synthesis over whatever the retriever surfaces.

The pattern was formalised by Patrick Lewis and colleagues at Facebook AI Research (now Meta FAIR), University College London and NYU in May 2020. Their paper paired a Dense Passage Retriever with a BART sequence-to-sequence generator and fine-tuned the query encoder and the generator jointly on open-domain question answering. The result outperformed both extractive QA systems and parametric-only generators on Natural Questions, TriviaQA and WebQuestions. The name stuck even as the architecture loosened: modern RAG almost never trains retriever and generator together, and the generator is usually a frozen instruction-tuned LLM (GPT-4, Claude, Llama 3.1, Mixtral) rather than a custom seq2seq model.

By 2026, RAG is the dominant pattern for production LLM applications anywhere private knowledge is involved: customer-support copilots, internal knowledge-base assistants, clinical decision-support tools, legal research, regulated due-diligence systems, e-commerce semantic search, code assistants reading internal monorepos. The Stack Overflow Developer Survey (2025) and a16z's enterprise LLM telemetry both show RAG in well over half of production GenAI workloads, dwarfing fine-tuning by deployment count.

This entry is the concept reference for engineers building or operating a RAG system. It covers the five-stage pipeline, the dominant variants (naive, advanced, hybrid, HyDE, agentic, graph, ColBERT late-interaction), the long-context-vs-RAG trade-off as context windows reach the million-token range, the evaluation discipline that distinguishes retrieval failures from generation failures, and the practical implementation choices (chunking, embedding model, reranker, index, prompt template) that determine whether a RAG system actually answers questions or only sounds like it does. Yobitel AI Applications like MediQuery use this pattern in production; the same pattern is what customers build on Yobibyte when they assemble their own retrieval pipeline against the platform's managed embedding, reranker and inference endpoints. This entry helps you design a RAG pipeline that performs in production rather than only demos well.

How it works: the five-stage pipeline

Almost every production RAG system, regardless of framework or vendor, decomposes into the same five stages. Quality wins and operational pain both live at the boundaries between them; understanding the data shape at each boundary is the difference between a system that improves under tuning and one that collapses every time a knob is touched.

Stage 1 — Ingestion and chunking. Documents arrive in heterogeneous formats (PDF, HTML, Markdown, Word, slide decks, audio transcripts, source code). A parser extracts plain text and minimal structural metadata (headings, page numbers, source URI); a chunker splits that text into passages sized for the embedding model's context window. Chunk size is the most underestimated lever in the entire system — too small and individual chunks lose context; too large and the embedding loses precision. See chunking-strategies for the full discussion. Output of this stage: a stream of (chunk_text, metadata) records.

Stage 2 — Embedding. Each chunk is passed through a bi-encoder embedding model (BGE, E5, Nomic Embed, OpenAI text-embedding-3, Cohere Embed v4, Voyage 3) that produces a fixed-dimensional dense vector. The same model is used at query time to encode the user query, so the two vectors live in a comparable space. Many production systems also generate a sparse vector at this stage (BM25 term weights or SPLADE) for hybrid retrieval. Output: (chunk_id, dense_vector, sparse_terms, metadata).

Stage 3 — Indexing. Vectors are written to an approximate-nearest-neighbour index. HNSW is the default for in-memory, high-recall workloads (the standard inside Qdrant, Weaviate, Elastic, pgvector since 0.5). IVF with Product Quantisation is preferred at billion-scale on FAISS GPU. Sparse terms go into a parallel inverted index (Lucene, Tantivy, ParadeDB). Metadata is stored in a relational table indexed for the filter columns most queries use. The two indices must stay in sync with the source corpus — idempotent ingestion pipelines that write both stores in a single transaction are the safest pattern.

Stage 4 — Retrieval and reranking. At query time, the user question is embedded, the ANN index returns top-k candidates (typically 20-200), the BM25 index returns its own top-k, and the two lists are fused (Reciprocal Rank Fusion is the default; weighted-score fusion when scores are calibrated). The fused list is then re-scored by a cross-encoder reranker (bge-reranker-v2-m3, Cohere Rerank 3, Voyage rerank-2, Qwen3-Reranker) that reads each (query, candidate) pair jointly and produces a high-fidelity relevance score. The top 3-10 survivors are the working set for generation.

Stage 5 — Generation. The surviving chunks are inserted into a prompt template alongside the user query, with explicit slots for citation back to the source chunks. The LLM produces a grounded answer; well-engineered prompts force the model to refuse when retrieved chunks do not support an answer rather than hallucinate around the gap. The output is post-processed (citation extraction, hallucination check, safety classifier) and returned to the user.

Ingestion + chunking — parses heterogeneous source documents, splits into embedding-sized passages, attaches structural metadata.
Embedding — bi-encoder maps each chunk and each query into a shared vector space; sparse terms generated in parallel for hybrid retrieval.
Indexing — dense vectors into an ANN index (HNSW, IVF, ScaNN), sparse terms into BM25 inverted index, metadata into a filterable store.
Retrieval + reranking — hybrid top-k fetch fused by RRF, then re-scored by a cross-encoder to produce the working set.
Generation — prompt template with citation slots; LLM conditioned on the surviving chunks, output classified and returned.
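The data shape at each stage boundary can be made explicit with a few typed records. The sketch below is illustrative only: the class and field names (Chunk, EmbeddedChunk, Candidate, working_set) are assumptions chosen to mirror the stage outputs described above, not types from any library.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    """Output of stage 1: passage text plus structural metadata."""
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class EmbeddedChunk:
    """Output of stage 2: the record written to the indices in stage 3."""
    chunk_id: str
    dense_vector: list[float]
    sparse_terms: dict[str, float]
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Candidate:
    """Output of stage 4: a retrieved chunk with its reranker score."""
    chunk: Chunk
    score: float


def working_set(candidates: list[Candidate], k_final: int = 5) -> list[Candidate]:
    """Keep the k_final highest-scoring candidates as input to stage 5."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:k_final]
```

Keeping chunk_id on every record is what lets a citation in the final answer be traced back to a source passage.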

Quick illustration: a 50-line RAG pipeline in Python

The snippet below is the shortest end-to-end RAG implementation that exercises all five stages with real tools — recursive chunking, the BAAI/bge-small-en-v1.5 embedding model, FAISS for vector storage, a cross-encoder for reranking, and any OpenAI-compatible chat endpoint for generation. It runs today on a single laptop with pip install and is intended as a learning artefact, not a production deployment. The interesting production decisions (hybrid retrieval, metadata filtering, citation post-processing, evaluation harness) are deferred to the dedicated entries.

# rag_minimal.py — runs with: pip install sentence-transformers faiss-cpu openai
from sentence_transformers import SentenceTransformer, CrossEncoder
import os
import re
import faiss
from openai import OpenAI

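# 1. Ingestion + recursive chunking (paragraph -> sentence fallback).
# Sizes are measured in characters here as a rough stand-in for tokens.
def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    pieces = []
    for p in re.split(r"\n\s*\n", text):
        p = p.strip()
        if not p:
            continue
        # Paragraphs that fit stay whole; oversized ones fall back to sentences.
        pieces.extend([p] if len(p) <= size else re.split(r"(?<=[.!?])\s+", p))
    out, buf, tail = [], "", ""
    for piece in pieces:
        if buf and len(tail) + len(buf) + len(piece) + 4 > size:
            out.append((tail + "\n\n" + buf).strip())
            # Carry the end of the finished chunk forward as overlap.
            tail = out[-1][-overlap:] if overlap > 0 else ""
            buf = ""
        buf = (buf + "\n\n" + piece).strip()
    if buf:
        out.append((tail + "\n\n" + buf).strip())
    return out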

with open("knowledge.txt", encoding="utf-8") as f:
    corpus = f.read()
chunks = chunk(corpus)

# 2. Embedding (bi-encoder, 384-dim).
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

# 3. Indexing (FAISS flat IP — swap for HNSW at scale).
index = faiss.IndexFlatIP(chunk_vecs.shape[1])
index.add(chunk_vecs.astype("float32"))

# 4. Retrieval + cross-encoder reranking.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
def retrieve(query: str, k_first: int = 50, k_final: int = 5) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True).astype("float32")
    # Cap k at the corpus size: FAISS pads missing results with id -1,
    # which would otherwise silently index chunks[-1].
    _, ids = index.search(q, min(k_first, len(chunks)))
    cands = [chunks[i] for i in ids[0] if i >= 0]
    scores = reranker.predict([(query, c) for c in cands])
    ranked = sorted(zip(scores, cands), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked[:k_final]]

# 5. Generation against any OpenAI-compatible endpoint.
# OPENAI_BASE_URL is optional; when unset the SDK falls back to its default endpoint.
client = OpenAI(base_url=os.environ.get("OPENAI_BASE_URL"), api_key=os.environ["OPENAI_API_KEY"])
def answer(question: str) -> str:
    ctx = retrieve(question)
    prompt = (
        "Answer the question using ONLY the context. If the context does not "
        "contain the answer, reply 'I don't know.' Cite chunks as [1], [2], ...\n\n"
        + "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(ctx))
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    out = client.chat.completions.create(
        model=os.environ.get("MODEL", "gpt-4o-mini"),
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return out.choices[0].message.content

print(answer("What does the policy say about overtime?"))

Tip: The 'I don't know' guardrail in the prompt is the single most impactful line of code in a basic RAG system. Without it, the model will paper over retrieval gaps with fluent fiction; with it, the failure mode becomes a visible refusal that the eval harness can detect and the user can route to a human.

Variants and architectural choices

The naive five-stage pipeline plateaus at modest quality on any non-trivial corpus. The literature since 2022 has produced a small catalogue of well-understood extensions, each targeting a specific failure mode of the basic system. Most production RAG stacks in 2026 use three or four of these in combination.

Naive RAG — embed the query, fetch top-5 by cosine similarity, concatenate into a prompt. Useful as a baseline; fails on rare-term queries, paraphrase mismatch, and any corpus with mixed-modality structure. Advanced RAG — adds query rewriting, hybrid retrieval, cross-encoder reranking and contextual compression. The de facto default for production systems where quality matters and latency budgets allow 200-500 ms for retrieval.

Hybrid search (BM25 + dense) — runs a sparse keyword retriever and a dense embedding retriever in parallel and fuses the ranked lists with Reciprocal Rank Fusion. Recovers the recall that pure dense retrieval loses on rare terms, product codes, acronyms and exact-string lookups. Now a first-class query type in Qdrant, Weaviate, Milvus, Elastic, OpenSearch and pgvector-via-ParadeDB. See hybrid-search.
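Reciprocal Rank Fusion itself is only a few lines. The sketch below assumes each retriever returns a best-first list of chunk ids; the function name rrf_fuse and the sample ids are illustrative, and k = 60 is the constant commonly used with RRF rather than a tuned value.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse best-first ranked lists; each id scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c3", "c1", "c7"]   # hypothetical dense top-k
bm25_hits = ["c1", "c9", "c3"]    # hypothetical BM25 top-k
fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Because RRF uses only ranks, it needs no score calibration between the sparse and dense retrievers, which is why it works as a default before any weighted-score tuning.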

HyDE (Hypothetical Document Embeddings, Gao et al., 2022) — instead of embedding the user's question, ask the LLM to write a hypothetical answer first, then embed that. The reasoning is that an answer's embedding is much closer to the relevant passage's embedding than a question's is. Particularly effective on zero-shot retrieval over domain-specific corpora where the embedding model has never seen the question phrasing.
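The HyDE step can be sketched as a thin wrapper around two callables. Both generate (an LLM call) and embed (the bi-encoder) are injected here as assumptions so the example stays self-contained; the prompt wording is illustrative and not taken from the paper.

```python
from typing import Callable, Sequence


def hyde_query_vector(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Embed an LLM-written hypothetical answer instead of the raw question."""
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {question}\nPassage:"
    )
    hypothetical = generate(prompt)
    # The hypothetical passage may be factually wrong; it is used only to
    # land near real passages in embedding space, never shown to the user.
    return embed(hypothetical)
```

The returned vector is then used for the ANN search in place of the question embedding; the rest of the pipeline is unchanged.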

ColBERT late-interaction (Khattab and Zaharia, 2020; v2 in 2022) — keeps a vector per token rather than per passage, and scores passages by the sum of max-similarity per query token. Slower per query and larger on disk (typically 4-10x), but consistently more accurate than single-vector dense retrieval on out-of-domain corpora.
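The sum-of-max-similarity scoring rule can be written out directly. This is a pure-Python sketch over toy two-dimensional token vectors, intended only to show the scoring arithmetic; a real ColBERT index stores compressed per-token embeddings and does not brute-force every passage.

```python
def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late-interaction score: for each query token take its best-matching
    document token (dot product), then sum those maxima."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy example: two query tokens, two candidate passages.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match for each query token
doc_b = [[0.6, 0.8]]                          # one token, partially matching both

assert maxsim_score(query, doc_a) > maxsim_score(query, doc_b)
```

The per-token maximum is what lets a passage score well when different parts of it match different parts of the query, something a single pooled vector cannot express.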

Cross-encoder reranking — covered in cross-encoder-reranking. A reranker is typically the single highest-leverage quality lift in any first-generation RAG system; adding bge-reranker-v2-m3 to a naive cosine-similarity baseline routinely lifts answer faithfulness 10-20 percentage points.

Multi-query / RAG-Fusion (LangChain) — generate several paraphrases of the user query with the LLM, retrieve top-k for each, and fuse with RRF. Pulls in semantically adjacent passages that a single query missed.

Agentic RAG (LangGraph, LlamaIndex, AutoGen, CrewAI) — treats retrieval as a tool the LLM decides to call rather than a fixed step. The agent can rewrite queries, fetch from multiple stores (vector index, SQL database, web search), inspect intermediate results, and recurse if the answer is incomplete. Higher latency and cost; necessary for multi-hop questions where one retrieval round cannot find the answer.
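The control flow can be sketched without any framework. In the example below, retrieve, is_sufficient and rewrite are hypothetical callables that would be LLM- or index-backed in a real system; the function only shows the retrieve, inspect, re-query loop and its round limit.

```python
from typing import Callable


def agentic_retrieve(
    question: str,
    retrieve: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    rewrite: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> list[str]:
    """Retrieve, inspect, and re-query until the evidence suffices
    or the round budget is exhausted."""
    evidence: list[str] = []
    query = question
    for _ in range(max_rounds):
        for passage in retrieve(query):
            if passage not in evidence:
                evidence.append(passage)
        if is_sufficient(question, evidence):
            break
        # Ask for a follow-up query informed by what has been found so far.
        query = rewrite(question, evidence)
    return evidence
```

The max_rounds cap is the main cost control: each extra round adds at least one retrieval and usually one or two LLM calls.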

GraphRAG (Microsoft Research, 2024) — builds a knowledge graph from the corpus at ingest time (entity extraction, relationship extraction, community detection), then traverses the graph at query time to answer questions spanning multiple documents. Strong on 'global' questions (summarise the themes across the corpus) where dense retrieval returns thematically related but locally focused chunks.

Cache-Augmented Generation (CAG) — recent variant for million-token-context models that pre-loads the entire corpus into the model's KV cache once, then answers many questions against the cached state. Eliminates per-query retrieval cost at the price of cache memory and recomputation on corpus updates. Practical only for small, slow-changing corpora that fit in context.

Variant	Targets failure mode	Cost added	When to adopt
Hybrid search	Rare-term recall, exact strings	+1 index, ~10 ms fusion	Any corpus with acronyms, codes, identifiers
Cross-encoder reranking	Top-k contains relevant but mis-ranked passages	+20-100 ms per query	Default — single biggest quality lift
HyDE	Question/answer phrasing mismatch	+1 LLM call per query	Zero-shot retrieval on novel domains
ColBERT v2	Out-of-domain single-vector failure	+4-10x index size	Domain shift kills your dense model
Multi-query / RAG-Fusion	Single query misses adjacent answers	+3-5 LLM calls per query	Question-answering, not lookup
Agentic RAG	Multi-hop, multi-store reasoning	+2-10x latency and cost	Questions that need 2+ retrieval rounds
GraphRAG	Global / multi-document questions	Heavy ingest cost, graph store	Corpus-summarisation use cases
CAG (cached context)	Per-query retrieval latency	Whole corpus in KV cache	Small, slow-changing corpus, long-context model

Where RAG is used today

Practically every production LLM application that touches private data uses RAG. The pattern is mature enough that the interesting decisions are no longer whether to retrieve, but which variant set to combine and where to spend the engineering budget.

Customer-support copilots are the most common deployment. The corpus is the help centre, the product documentation, and a frozen export of resolved support tickets. The retriever runs on every user message; the LLM answers with citations back to a help article so the support agent can verify. Klarna, Shopify, Intercom, Zendesk and Salesforce Einstein all ship variants of this. Typical stack: chunked Markdown docs, BGE or OpenAI embeddings, hybrid retrieval with BM25 catching product codes, cross-encoder reranking, GPT-4o or Claude generating with citation enforcement.

Enterprise internal Q&A — the chatbot that answers 'how do I file expenses', 'what is the parental leave policy', 'who owns the customer-data-warehouse'. Corpus is Confluence, Notion, SharePoint, Google Drive and an HR knowledge base. Glean, Unstructured, Hebbia, Sana, Mendable and Microsoft Copilot occupy this market. Distinctive challenges: per-document access control (the retrieval layer must respect the same ACLs as the source system), staleness (knowledge base churn is high), and grounded refusal (hallucination is unacceptable in HR contexts).

Clinical decision support and biomedical research — RAG over PubMed, internal clinical guidelines, and EHR notes. OpenEvidence, Glass Health, Hippocratic AI and Yobitel's MediQuery are in this space. Compliance overlay is heavy (HIPAA, NHS DSPT, UK NCSC), retrieval evaluation is critical because a missed contraindication is a patient-safety incident, and explicit citation back to a peer-reviewed source is non-negotiable.

Legal research and due diligence — RAG over case law, contracts, regulatory filings. Harvey, CoCounsel, Lexis+ AI, Spellbook. The retrievable units are paragraphs or clauses, the corpus is large and slow-changing, and citation discipline is the entire product proposition.

Code assistants reading internal monorepos — Sourcegraph Cody, Cursor, GitHub Copilot Workspace, JetBrains AI Assistant. Specialised chunking on AST boundaries (Tree-sitter), embedding models tuned on code (CodeBGE, Voyage code-2), and aggressive caching because the corpus changes every commit.

E-commerce semantic search — RAG-style semantic retrieval as a replacement for or supplement to keyword search. Algolia, Vespa, Typesense, Pinecone customers. Less LLM generation, more pure retrieval; the LLM often appears only in query expansion and reranking.

Trade-offs and the long-context-vs-RAG debate

RAG is cheaper than fine-tuning when knowledge changes, transparent because citations are first-class, and naturally compatible with per-document access control. Those are the three reasons it dominates. It is not, however, free of trade-offs. Since Gemini 1.5 Pro shipped a 1M-token context window in early 2024, and Claude, GPT-4.1 and Llama 4 models followed with similar or larger windows, a long-running debate has asked whether retrieval is still necessary at all.

The long-context argument: if the entire corpus fits in the model's context window, you can skip retrieval and let the model attend over everything. No chunk-boundary errors, no recall failures, no index to maintain. Empirical results (Liu et al., 'Lost in the Middle', 2023, and subsequent needle-in-a-haystack work) showed that performance degrades sharply on facts buried in the middle of long contexts, but newer architectures (Gemini 1.5 Pro, Claude 3.5 Sonnet, GPT-4o, Llama 3.1) recover most of that lost performance.

The RAG counter-argument: long context costs scale linearly with input length while RAG retrieval cost is roughly constant. At Gemini 1.5 Pro's pricing in 2024, answering one question over a 500-page document with full-context cost roughly 100x what the equivalent RAG query cost. Long context also does not solve access control (the model sees everything the prompt-builder passed in), does not solve corpus updates (the cache invalidates on every change), and does not solve citation (the model attends globally but does not reliably attribute).
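The order of magnitude is easy to check with back-of-envelope arithmetic. Every figure in the sketch below is an illustrative assumption (price, tokens per page, chunk count and size), not a vendor quote, and it ignores output tokens, embedding and reranking cost. Because both paths pay the same per-token price, the ratio depends only on the token counts.

```python
# All figures below are illustrative assumptions, not vendor quotes.
PRICE_PER_MILLION_INPUT_TOKENS = 1.25  # USD, assumed
TOKENS_PER_PAGE = 500                  # assumed average for prose
PAGES = 500
QUESTION_TOKENS = 50
TOP_K_CHUNKS = 5
CHUNK_TOKENS = 512


def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of sending `tokens` input tokens."""
    return tokens / 1_000_000 * price_per_million


full_context_tokens = PAGES * TOKENS_PER_PAGE + QUESTION_TOKENS
rag_tokens = TOP_K_CHUNKS * CHUNK_TOKENS + QUESTION_TOKENS

full_cost = input_cost(full_context_tokens, PRICE_PER_MILLION_INPUT_TOKENS)
rag_cost = input_cost(rag_tokens, PRICE_PER_MILLION_INPUT_TOKENS)
ratio = full_cost / rag_cost

print(f"full context: {full_context_tokens} tokens, ${full_cost:.4f}")
print(f"RAG:          {rag_tokens} tokens, ${rag_cost:.4f}")
print(f"ratio:        {ratio:.0f}x")
```

Under these assumptions the ratio comes out a little under 100x; passing 50 chunks instead of 5 shrinks it to roughly 10x, which is the arithmetic behind retrieving more aggressively as context windows grow.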

The settled position by late 2025: RAG and long context are complementary, not competitive. Long context absorbs more chunks per query (k = 50-200 instead of k = 3-10), which raises retrieval recall ceilings and reduces the need for aggressive reranking. RAG keeps cost manageable and citation tractable. Most production systems in 2026 retrieve aggressively, pass dozens of chunks rather than five, and rely on the LLM's long-context attention to pick out the salient passages — a hybrid posture rather than an either/or.

Other operational trade-offs worth naming. Chunk-boundary loss: a fact that spans two chunks may be retrievable in neither. Sentence-window retrieval and structural chunking are the standard mitigations. Retrieval recall ceiling: if the relevant passage is not in the top-k after fusion, no amount of reranking or prompt engineering recovers it; recall is the single most important metric to monitor. Stale embeddings: swap the embedding model and you must re-encode the entire corpus, because vectors from different models do not live in the same space. Citation drift: the model can paraphrase enough that the cited chunk no longer literally supports the claim; post-hoc faithfulness checks (Ragas, TruLens) catch this.

Warning: If you cannot measure retrieval Recall@k separately from end-to-end answer quality, you cannot tell whether a bad answer is the retriever's fault or the generator's fault. Build the labelled (query, relevant chunk) eval set before tuning anything else — every later decision depends on it.

Practical implementation notes

The following choices appear in every production RAG decision document. Defaults below are the conservative starting points that work well for prose-heavy corpora at moderate scale; specialised corpora benefit from deliberate deviation.

Chunking strategy — start with recursive character splitting at 512 tokens with 64-token overlap. Move to sentence-window retrieval for dense documents where the answer is one sentence and surrounding context matters (legal contracts, technical specifications). Use structural chunking for Markdown, HTML, source code or any corpus with reliable headings (split at structural boundaries; recursively split any too-large chunk). Semantic chunking is overhyped — benchmark it against recursive splitting before adopting. See chunking-strategies.
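Sentence-window retrieval can be sketched in a few lines: each sentence is embedded on its own, but the text handed to the LLM is the sentence plus its neighbours. The regex sentence splitter and the function name sentence_windows are simplifying assumptions; production systems typically use a proper sentence tokenizer.

```python
import re


def sentence_windows(text: str, window: int = 2) -> list[tuple[str, str]]:
    """Return (sentence, context) pairs: embed the sentence, but pass the
    LLM the context, i.e. the sentence plus `window` neighbours per side."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    pairs = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        pairs.append((sentence, " ".join(sentences[lo:hi])))
    return pairs


pairs = sentence_windows("A one. B two. C three. D four.", window=1)
print(pairs[2])  # ('C three.', 'B two. C three. D four.')
```

The embedding stays precise because it covers one sentence, while the wider context reduces chunk-boundary loss at generation time.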

Embedding model — BAAI/bge-small-en-v1.5 (384 dim) is the strongest open-weight starting point; BAAI/bge-large-en-v1.5 (1024 dim) when quality matters more than throughput. OpenAI text-embedding-3-large with Matryoshka truncation to 1024 or 1536 if you are already using OpenAI. Cohere Embed v4 for multilingual production. Voyage 3 for domain-tuned variants (code, finance, law). Multilingual workloads should start with intfloat/multilingual-e5-large. Avoid swapping embedding models mid-life-cycle — re-encoding the entire corpus is the cost.

Reranker — bge-reranker-v2-m3 (568M params, multilingual, long-context) is the open-weight default. Cohere Rerank 3 if hosted. Qwen3-Reranker (Alibaba, 2025) for the strongest open-weight option at 4B/8B sizes. Score 50-200 candidates from first-stage retrieval and keep 3-10. See cross-encoder-reranking.

Vector index — pgvector if you already run Postgres (transactional ACID, SQL joins, single operational story). Qdrant or Weaviate if you want a dedicated vector database with hybrid search as a first-class feature. Milvus for billion-scale workloads. FAISS for embedded retrieval inside a single process. Pinecone for fully managed with serverless billing. Cloud-native managed Postgres (RDS, Aurora, Cloud SQL, Supabase, Neon) all ship pgvector — start there unless you have a specific metric where it cannot meet your bar.

Metadata filtering — every chunk should carry source document ID, chunk offset, parent section heading, ingestion timestamp, tenant ID, and any ACL or classification labels. Pre-filter on tenant and ACL columns; post-filter on freshness. Selective filters often perform better in Postgres than in dedicated vector databases because the b-tree on the filter column does the heavy lifting before the vector search.
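The pre-filter pattern can be illustrated with a brute-force stand-in for the vector index. The record type IndexedChunk, its fields and the function search_with_acl are hypothetical names for this sketch; in a real deployment the tenant and ACL predicates would be pushed into the database or vector-store query rather than evaluated in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    tenant_id: str
    allowed_groups: frozenset[str]
    vector: tuple[float, ...]


def search_with_acl(
    chunks: list[IndexedChunk],
    query_vec: tuple[float, ...],
    tenant_id: str,
    user_groups: set[str],
    k: int = 5,
) -> list[IndexedChunk]:
    """Pre-filter on tenant and ACL, then rank only the permitted chunks."""
    permitted = [
        c for c in chunks
        if c.tenant_id == tenant_id and c.allowed_groups & user_groups
    ]

    def dot(a: tuple[float, ...], b: tuple[float, ...]) -> float:
        return sum(x * y for x, y in zip(a, b))

    # Similarity is computed only over chunks the caller may see, so a
    # forbidden chunk can never reach the reranker or the prompt.
    permitted.sort(key=lambda c: dot(c.vector, query_vec), reverse=True)
    return permitted[:k]
```

Filtering before ranking, rather than after, also avoids the case where every one of the top-k hits is forbidden and the user gets an empty result despite permitted matches existing.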

Evaluation — Ragas (with faithfulness, answer_relevancy, context_precision, context_recall metrics) and TruLens for end-to-end RAG evaluation; DeepEval for richer pytest-style assertions; classical IR metrics (Recall@k, MRR, nDCG@10) for the retrieval stage in isolation. Build a labelled (query, relevant chunk, gold answer) set of at least 100 examples before tuning. The eval set is the most valuable artefact in the project — invest in it.
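The classical retrieval metrics are small enough to implement directly while the eval set is being built. The sketch below assumes binary relevance labels (a chunk id is either in the relevant set or not); the function names are illustrative, and MRR over an eval set is simply the mean of reciprocal_rank across queries.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


retrieved = ["c4", "c2", "c9", "c1"]   # hypothetical ranked output for one query
relevant = {"c2", "c1"}                # hypothetical gold labels
print(recall_at_k(retrieved, relevant, 3))   # 0.5
print(reciprocal_rank(retrieved, relevant))  # 0.5
```

Running these over the retriever output alone, before any generation, is what separates retrieval failures from generation failures.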

Prompt template — explicit citation slots (number the retrieved chunks, instruct the model to reference them as [1], [2], ...), an unambiguous refusal instruction ('If the context does not contain the answer, reply I don't know'), and a hard constraint that all factual claims must trace to a chunk. Post-process the output to extract citations and verify that every cited chunk number was actually in the retrieved set.
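The citation post-processing step can be as simple as a regular expression and a range check. This sketch assumes the [1], [2], ... numbering convention described above and a hypothetical helper name check_citations; it verifies only that each cited number refers to a chunk that was actually in the prompt, not that the chunk supports the claim.

```python
import re


def check_citations(answer: str, n_chunks: int) -> tuple[set[int], set[int]]:
    """Return (valid, invalid) citation numbers found in the answer.

    A citation is valid when it refers to one of the n_chunks numbered
    passages that were placed in the prompt."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    valid = {n for n in cited if 1 <= n <= n_chunks}
    return valid, cited - valid


valid, invalid = check_citations(
    "Overtime is paid at 1.5x [2]. Approval is required [7].", n_chunks=5
)
print(valid, invalid)  # {2} {7}
```

An answer with a non-empty invalid set, or with factual sentences and no citations at all, is a reasonable candidate for rejection or for a faithfulness check before it reaches the user.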

Frameworks — LangChain and LlamaIndex are the dominant orchestration libraries. LangChain's strength is the breadth of integrations; LlamaIndex's strength is retrieval-specific abstractions (node parsers, retrievers, response synthesisers, sentence-window retrieval, GraphRAG implementation). Haystack (deepset) is the strongest enterprise alternative. Most production teams build their own thin orchestration layer over direct SDK calls once requirements crystallise — frameworks are scaffolding, not architecture.

Default starting point: 512-token recursive chunks with 64-token overlap, BGE small embeddings, pgvector HNSW index, BM25 hybrid via ParadeDB, bge-reranker-v2-m3, top-50 retrieve and top-5 rerank, GPT-4o or Claude generation with citation enforcement.
Measure retrieval Recall@10, Recall@50, nDCG@10 separately from answer faithfulness — different teams own the two fixes.
Re-encoding the corpus when the embedding model changes is unavoidable; size your ingest pipeline for it.
Per-document ACLs must be enforced at the retrieval layer, not at the LLM layer. The LLM cannot un-see a chunk you passed it.
Cache aggressively: a per-user query cache, a per-chunk embedding cache, and a prefix cache on the LLM serving engine all stack.

When RAG is the wrong tool

RAG is not universal. It is the right pattern when knowledge changes faster than you can fine-tune, when citations matter, when the corpus is too large to fit in context, or when per-document access control must be enforced. It is the wrong pattern in four well-understood situations.

First, when the task requires reasoning over the whole corpus at once — summarising an entire codebase, identifying themes across a corpus, comparing two large documents end-to-end. Retrieval surfaces a few passages; these tasks need all of them. Long-context models or GraphRAG fit better.

Second, when latency budgets are below ~200 ms. The retrieval round-trip alone (embedding + ANN search + rerank) typically costs 100-300 ms before the LLM has produced a single token. Real-time voice agents, autocomplete suggestions and trading systems often cannot afford it; fine-tuned smaller models or cached responses fit better.

Third, when the answer depends on style, behaviour or skill rather than facts. 'Answer in the voice of our brand', 'follow this customer-service script', 'always end with a follow-up question' — these belong to fine-tuning or prompt engineering, not retrieval. Retrieval gives the model facts; it does not change how the model behaves.

Fourth, when the corpus is small enough to fit in context and slow-changing enough to cache. CAG-style pre-loading often wins both on latency (no retrieval round-trip) and quality (the model attends globally). The crossover is corpus-size-dependent and shifts with each generation of long-context models — re-evaluate annually.

Where RAG fits in the Yobitel stack

RAG is the architecture underneath Yobitel's first-party vertical AI applications — MediQuery (clinical decision support over PubMed, NICE guidelines, internal hospital protocols) is the most visible example. Customers do not build the RAG pipeline themselves; the application handles ingestion, chunking, embedding, hybrid retrieval, reranking and grounded generation behind a clinician-facing UI, with HIPAA, NHS DSPT and UK NCSC compliance preserved end-to-end.

For customers building their own RAG applications on Yobibyte, every component the pattern needs is exposed as a first-class managed primitive — embedding models, cross-encoder rerankers, vector store options (pgvector inside the platform's managed Postgres, or dedicated Qdrant / Weaviate / Milvus services), and inference endpoints (OpenAI-compatible APIs for GPT-4o-class, Claude-class, Llama 3.1, Mixtral, DeepSeek-V3). The orchestration is the customer's; the building blocks are managed and observable.

InferenceBench publishes RAG-relevant benchmarks alongside raw inference figures — embedding-encode throughput, rerank-pair throughput, end-to-end p95 latency for representative RAG query shapes. The data lets customers size a RAG deployment against realistic hardware rather than vendor-datasheet numbers, and reproduces the long-context-vs-RAG cost-per-query comparison on current-generation GPUs.

References

TL;DR

RAG was introduced by Patrick Lewis and colleagues at Meta FAIR, UCL and NYU in 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks' (arXiv:2005.11401, May 2020). The original paper trained a Dense Passage Retriever and a BART generator end-to-end; modern usage of the term has loosened to mean any architecture that conditions an LLM on documents fetched at inference time.
The premise: parametric weights are a frozen snapshot of the training corpus and cannot answer questions about events after the cutoff, cannot cite sources, and cannot enforce per-document access control. RAG moves the knowledge out of the weights into a queryable store that is updated, audited and authorised independently.
Standard production stack in 2026 — chunk and embed the corpus, store dense vectors in an ANN index (HNSW or IVF) alongside a BM25 inverted index, retrieve hybrid top-k at query time, rerank with a cross-encoder, then condition the LLM on the surviving passages via a templated prompt with explicit citation slots.
Variants matter: HyDE rewrites the query, hybrid search recovers rare-term recall, ColBERT keeps token-level interactions, GraphRAG (Microsoft, 2024) answers multi-document questions vector retrieval cannot, agentic RAG (LangGraph, LlamaIndex) treats retrieval as a tool the model calls. None replace the five-stage backbone — they sit on top of it.
Retrieval is almost always the bottleneck rather than generation. Build retrieval evaluation (Recall@k, nDCG, MRR) and answer evaluation (faithfulness, answer relevance) separately with frameworks like Ragas, TruLens or DeepEval before tuning prompts or upgrading the LLM.

Overview

How it works: the five-stage pipeline

Ingestion + chunking — parses heterogeneous source documents, splits into embedding-sized passages, attaches structural metadata.
Embedding — bi-encoder maps each chunk and each query into a shared vector space; sparse terms generated in parallel for hybrid retrieval.
Indexing — dense vectors into an ANN index (HNSW, IVF, ScaNN), sparse terms into BM25 inverted index, metadata into a filterable store.
Retrieval + reranking — hybrid top-k fetch fused by RRF, then re-scored by a cross-encoder to produce the working set.
Generation — prompt template with citation slots; LLM conditioned on the surviving chunks, output classified and returned.

Quick illustration: a 50-line RAG pipeline in Python

# rag_minimal.py — runs with: pip install sentence-transformers faiss-cpu openai
from sentence_transformers import SentenceTransformer, CrossEncoder
import faiss, numpy as np, re, os
from openai import OpenAI

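# 1. Ingestion + recursive chunking (paragraph -> sentence fallback).
# Note: size and overlap are measured in characters here, not tokens.
def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    out, buf = [], ""
    for p in re.split(r"\n\s*\n", text):
        # Paragraphs longer than `size` fall back to sentence-level pieces.
        big = len(p) > size
        pieces = re.split(r"(?<=[.!?])\s+", p) if big else [p]
        sep = " " if big else "\n\n"
        for piece in pieces:
            if buf and len(buf) + len(piece) + len(sep) > size:
                out.append(buf)
                # Carry the tail of the finished chunk forward as overlap.
                buf = buf[-overlap:] if overlap else ""
            buf = (buf + sep + piece).strip()
    if buf:
        out.append(buf)
    return out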

with open("knowledge.txt", encoding="utf-8") as f:
    corpus = f.read()
chunks = chunk(corpus)

# 2. Embedding (bi-encoder, 384-dim).
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

# 3. Indexing (FAISS flat IP — swap for HNSW at scale).
index = faiss.IndexFlatIP(chunk_vecs.shape[1])
index.add(chunk_vecs.astype("float32"))

# 4. Retrieval + cross-encoder reranking.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
def retrieve(query: str, k_first: int = 50, k_final: int = 5) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True).astype("float32")
    _, ids = index.search(q, k_first)
    # FAISS pads with -1 when the index holds fewer than k_first vectors;
    # skip those so chunks[-1] is not returned by accident.
    cands = [chunks[i] for i in ids[0] if i >= 0]
    scores = reranker.predict([(query, c) for c in cands])
    ranked = sorted(zip(scores, cands), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked[:k_final]]

# 5. Generation against any OpenAI-compatible endpoint.
# OpenAI() reads OPENAI_API_KEY and, if set, OPENAI_BASE_URL from the environment.
client = OpenAI()
def answer(question: str) -> str:
    ctx = retrieve(question)
    prompt = (
        "Answer the question using ONLY the context. If the context does not "
        "contain the answer, reply 'I don't know.' Cite chunks as [1], [2], ...\n\n"
        + "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(ctx))
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    out = client.chat.completions.create(
        model=os.environ.get("MODEL", "gpt-4o-mini"),
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return out.choices[0].message.content

print(answer("What does the policy say about overtime?"))

Tip: The 'I don't know' guardrail in the prompt is the single most impactful line of code in a basic RAG system. Without it, the model will paper over retrieval gaps with fluent fiction; with it, the failure mode becomes a visible refusal that the eval harness can detect and the user can route to a human.
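One way an eval harness might detect that refusal is a simple pattern match over the generated answers. The sketch below assumes the model echoes the phrase from the prompt more or less verbatim; the names REFUSAL, is_refusal and refusal_rate are hypothetical, and a production harness would probably want something more robust than a regular expression.

```python
import re

# Matches "I don't know", "I dont know" and the curly-apostrophe variant.
REFUSAL = re.compile(r"\bi\s+don['’]?t\s+know\b", re.IGNORECASE)


def is_refusal(answer: str) -> bool:
    """Return True if the answer contains the guardrail phrase."""
    return bool(REFUSAL.search(answer))


def refusal_rate(answers: list[str]) -> float:
    """Fraction of answers that are refusals (0.0 for an empty list)."""
    if not answers:
        return 0.0
    return sum(is_refusal(a) for a in answers) / len(answers)


sample = ["I don't know.", "Overtime is approved by the line manager [1]."]
print(refusal_rate(sample))  # → 0.5
```

Tracked over a fixed query set, a rising refusal rate points at a retrieval gap rather than a generation problem, which is the signal the guardrail is meant to expose.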

Variants and architectural choices

Variant	Targets failure mode	Cost added	When to adopt
Hybrid search	Rare-term recall, exact strings	+1 index, ~10 ms fusion	Any corpus with acronyms, codes, identifiers
Cross-encoder reranking	Top-k contains relevant but mis-ranked passages	+20-100 ms per query	Default — single biggest quality lift
HyDE	Question/answer phrasing mismatch	+1 LLM call per query	Zero-shot retrieval on novel domains
ColBERT v2	Out-of-domain single-vector failure	+4-10x index size	Domain shift kills your dense model
Multi-query / RAG-Fusion	Single query misses adjacent answers	+3-5 LLM calls per query	Question-answering, not lookup
Agentic RAG	Multi-hop, multi-store reasoning	+2-10x latency and cost	Questions that need 2+ retrieval rounds
GraphRAG	Global / multi-document questions	Heavy ingest cost, graph store	Corpus-summarisation use cases
CAG (cached context)	Per-query retrieval latency	Whole corpus in KV cache	Small, slow-changing corpus, long-context model
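To make one row of the table concrete, here is a minimal sketch of HyDE: the question is first turned into a hypothetical answer passage by an LLM, and that passage, not the question, is embedded and used for the vector search. The callables generate, embed and search are hypothetical stand-ins for an LLM call, the bi-encoder and the ANN index; the prompt wording and the stub implementations are assumptions made only so the example runs on its own.

```python
from typing import Callable, Sequence

Vector = Sequence[float]

HYDE_PROMPT = (
    "Write a short passage that plausibly answers the question.\n\n"
    "Question: {question}\nPassage:"
)


def hyde_retrieve(
    question: str,
    generate: Callable[[str], str],              # hypothetical: wraps one LLM call
    embed: Callable[[str], Vector],              # hypothetical: wraps the bi-encoder
    search: Callable[[Vector, int], list[str]],  # hypothetical: wraps the ANN index
    k: int = 5,
) -> list[str]:
    """Retrieve with the embedding of a generated passage instead of the raw question."""
    hypothetical_doc = generate(HYDE_PROMPT.format(question=question))  # the extra LLM call
    return search(embed(hypothetical_doc), k)


# Stand-in components so the sketch runs without a model or an index.
def fake_generate(prompt: str) -> str:
    return "Overtime must be approved in advance by a manager."  # made-up passage


def fake_embed(text: str) -> Vector:
    return [float(len(text))]  # toy one-dimensional "embedding"


def fake_search(vector: Vector, k: int) -> list[str]:
    return [f"chunk-near-{int(vector[0])}"][:k]


print(hyde_retrieve("What about overtime?", fake_generate, fake_embed, fake_search, k=1))
```

The generated passage may be factually wrong; it only has to be phrased like the documents in the corpus, because it is used for retrieval and never shown to the user.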

Where RAG is used today

Trade-offs and the long-context-vs-RAG debate

Warning: If you cannot measure retrieval Recall@k separately from end-to-end answer quality, you cannot tell whether a bad answer is the retriever's fault or the generator's fault. Build the labelled (query, relevant chunk) eval set before tuning anything else — every later decision depends on it.
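The retrieval-side metrics can be computed from that labelled set with very little code. The sketch below assumes binary relevance labels and unique chunk ids per result list; the function names and the sample ids are hypothetical.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """nDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


retrieved = ["c3", "c1", "c8", "c5"]  # hypothetical retriever output, best first
relevant = {"c1", "c5"}               # hypothetical labels for this query
print(recall_at_k(retrieved, relevant, k=2))   # → 0.5
print(reciprocal_rank(retrieved, relevant))    # → 0.5
print(round(ndcg_at_k(retrieved, relevant, k=4), 3))
```

Averaging reciprocal_rank over all queries gives MRR; averaging the other two gives Recall@k and nDCG@k for the eval set.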

Practical implementation notes

Default starting point: 512-token recursive chunks with 64-token overlap, BGE small embeddings, pgvector HNSW index, BM25 hybrid via ParadeDB, bge-reranker-v2-m3, top-50 retrieve and top-5 rerank, GPT-4o or Claude generation with citation enforcement.
Measure retrieval Recall@10, Recall@50, nDCG@10 separately from answer faithfulness — different teams own the two fixes.
Re-encoding the corpus when the embedding model changes is unavoidable; size your ingest pipeline for it.
Per-document ACLs must be enforced at the retrieval layer, not at the LLM layer. The LLM cannot un-see a chunk you passed it.
Cache aggressively: a per-user query cache, a per-chunk embedding cache, and a prefix cache on the LLM serving engine all stack.
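The ACL note above can be illustrated with a small filter that runs before any chunk reaches the reranker or the prompt. This is a sketch under assumptions: the Chunk dataclass, the group-based permission model and the function name are all hypothetical, and at scale the same predicate would normally be pushed into the index query as a metadata filter so that the top-k is filled with authorised chunks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset[str]  # groups permitted to read the source document


def authorised_candidates(
    candidates: list[Chunk], user_groups: set[str], k: int
) -> list[Chunk]:
    """Drop chunks the user may not read, then keep the first k survivors."""
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return visible[:k]


candidates = [  # hypothetical retrieval results, best first
    Chunk("c1", "Board minutes", frozenset({"exec"})),
    Chunk("c2", "Overtime policy", frozenset({"staff", "exec"})),
    Chunk("c3", "Holiday calendar", frozenset({"staff"})),
]
print([c.chunk_id for c in authorised_candidates(candidates, {"staff"}, k=5)])
# → ['c2', 'c3']
```

Filtering before truncation matters: cutting to k first and filtering afterwards can leave the prompt with fewer passages than intended, or with none at all.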
