Ragas

TL;DR

Open-source Python framework for evaluating RAG pipelines, originating with Es et al. (2023, arXiv:2309.15217) at Exploding Gradients and now maintained as a broader RAG evaluation library.
Defines a small, principled metric set: faithfulness, answer relevancy, context precision, context recall, and context entities recall — each implemented as an LLM-as-judge metric with a documented prompting strategy.
Designed to be reference-free where possible — most Ragas metrics do not require a hand-annotated ground-truth answer, only the question, the retrieved context, and the generated answer.
Now ships test-set generation, multi-turn evaluation, and integrations with Langfuse, LangSmith, and Phoenix for trace-driven evaluation.

Background

Ragas was introduced by Shahul Es and colleagues at Exploding Gradients in their 2023 paper "Ragas: Automated Evaluation of Retrieval Augmented Generation" (arXiv:2309.15217). The paper made two arguments: that RAG evaluation needed metrics specific to the retrieval-plus-generation pattern (not just generic LLM metrics), and that those metrics could be implemented as LLM-as-judge graders without requiring per-question reference answers.

The framework was adopted rapidly because it filled an obvious gap: RAG became the dominant LLM application pattern in 2023-2024, and existing metrics (BLEU, ROUGE, BERTScore) score overlap with a reference text rather than whether an answer is grounded in the retrieved context.

The Core Metric Set

Faithfulness and Answer Relevancy are the reference-free workhorses — runnable on any RAG output without hand annotation. Context Precision and Recall require ground-truth answers (which Ragas's test-set generator can synthesise) and are useful for tuning the retriever specifically.

Metric	What it measures	Requires reference?
Faithfulness	Are answer claims supported by retrieved context	No
Answer Relevancy	Does the answer address the question	No
Context Precision	Are relevant chunks ranked highly	Yes (ground-truth)
Context Recall	Did retrieval find the necessary context	Yes (ground-truth)
Context Entities Recall	Are key entities present in context	Yes (ground-truth)
Noise Sensitivity	Does the model misuse irrelevant context	Yes (ground-truth)
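
The "ranked highly" idea behind Context Precision can be illustrated with a rank-weighted average. This is a minimal sketch, assuming a per-chunk relevance verdict is already available (in practice a judge LLM would produce it by comparing each chunk against the reference). The function name `context_precision` and the averaging rule are illustrative assumptions in the style of average precision, not a copy of the library's implementation.

```python
from typing import List


def context_precision(relevance: List[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, in retrieval order.

    relevance[i] is True when the chunk at rank i + 1 was judged relevant.
    Hypothetical helper for illustration only.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # precision@rank, accumulated only at ranks holding a relevant chunk
            weighted_sum += hits / rank
    # With no relevant chunk retrieved, the score is defined here as 0.0.
    return weighted_sum / hits if hits else 0.0


# Same two relevant chunks, ranked first versus ranked last.
good = context_precision([True, True, False, False])
bad = context_precision([False, False, True, True])
print(good, bad)
```

Both retrievals contain the same relevant chunks, but the one that ranks them first scores higher, which is the behaviour that makes the metric useful for tuning a retriever or reranker.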

How Faithfulness Works

Faithfulness is computed in two LLM-judge stages. First, the answer is decomposed into atomic claims, that is, short factual statements. Second, each claim is checked against the retrieved context and labelled as supported, contradicted, or unsupported. Faithfulness is the proportion of claims supported by the context.

This per-claim approach is more diagnostic than a single overall judgement because it tells you which specific statements are hallucinated. Combined with the retrieved context for each example, it points at whether the failure is a retriever problem (context missing the answer) or a generator problem (model ignoring or contradicting context).
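
The two stages described above can be sketched as follows. This is an illustration under stated assumptions, not the Ragas implementation: `faithfulness`, `FaithfulnessResult`, `toy_extract`, and `toy_judge` are hypothetical names, and the two toy functions stand in for what would be judge-LLM calls in a real pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical verdict labels mirroring the outcomes described in the text.
SUPPORTED = "supported"
UNSUPPORTED = "unsupported"


@dataclass
class FaithfulnessResult:
    score: float        # proportion of claims supported by the context
    flagged: List[str]  # claims that were not supported


def faithfulness(
    answer: str,
    context: str,
    extract_claims: Callable[[str], List[str]],
    judge_claim: Callable[[str, str], str],
) -> FaithfulnessResult:
    """Two-stage scoring: decompose the answer, then verify each claim."""
    claims = extract_claims(answer)  # stage 1 (an LLM call in practice)
    if not claims:
        # Assumption: an answer with no claims scores 0.0 in this sketch.
        return FaithfulnessResult(score=0.0, flagged=[])
    verdicts = [judge_claim(claim, context) for claim in claims]  # stage 2
    flagged = [c for c, v in zip(claims, verdicts) if v != SUPPORTED]
    supported = len(claims) - len(flagged)
    return FaithfulnessResult(score=supported / len(claims), flagged=flagged)


# Toy stand-ins for the two LLM calls, for demonstration only.
def toy_extract(answer: str) -> List[str]:
    return [s.strip() for s in answer.split(".") if s.strip()]


def toy_judge(claim: str, context: str) -> str:
    return SUPPORTED if claim.lower() in context.lower() else UNSUPPORTED


result = faithfulness(
    answer="Paris is the capital of France. Paris has 90 million residents.",
    context="Paris is the capital of France and its largest city.",
    extract_claims=toy_extract,
    judge_claim=toy_judge,
)
print(result.score, result.flagged)
```

Returning the flagged claims alongside the score is what makes the metric diagnostic: the list shows which statements to compare against the retrieved context when deciding whether the retriever or the generator is at fault.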

Test-Set Generation

Ragas's test-set generator (`TestsetGenerator`) takes a corpus of documents and produces a synthetic evaluation dataset with question types of varying difficulty — simple, reasoning, multi-context, and conditional. The synthesiser uses a graph over the corpus to plan questions that span multiple chunks and require non-trivial retrieval.

Like all synthesised datasets, these are useful for coverage and for bootstrapping; they do not replace human-curated questions for assessing real-world quality. Use them as the bulk of an early evaluation set and gradually replace them with human-curated questions as production traffic accumulates.
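
One way to manage that gradual replacement is to build each evaluation set by taking human-curated questions first and filling the remaining slots with synthetic ones. The sketch below is a plain-Python illustration; `build_eval_set` and its parameters are hypothetical and are not part of Ragas.

```python
import random
from typing import List


def build_eval_set(
    human: List[str], synthetic: List[str], size: int, seed: int = 0
) -> List[str]:
    """Prefer human-curated questions; fill remaining slots with synthetic ones."""
    rng = random.Random(seed)  # fixed seed keeps the evaluation set reproducible
    picked_human = rng.sample(human, min(len(human), size))
    remaining = size - len(picked_human)
    picked_synthetic = rng.sample(synthetic, min(len(synthetic), remaining))
    return picked_human + picked_synthetic


synthetic_qs = [f"synthetic-{i}" for i in range(10)]

# Early on: few human questions, so synthetic ones make up the bulk.
early = build_eval_set(["human-0", "human-1"], synthetic_qs, size=5)

# Later: enough human questions have accumulated to fill the whole set.
later = build_eval_set([f"human-{i}" for i in range(8)], synthetic_qs, size=5)
print(early, later)
```

As the pool of human-curated questions grows, the synthetic share shrinks to zero without any change to the evaluation code.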

Ragas metrics are scored by a judge LLM that you configure, and the choice of judge matters. Use a strong frontier model (Claude Opus or GPT-4 class) for production scoring, not the same small model you serve in your application. A weak judge produces inconsistent verdicts, so judge quality sets the noise floor for every score.

When to Pick Ragas

Pick Ragas when your application is primarily RAG and you want the most principled RAG-specific metric set available. Use it alongside, not instead of, a general framework like DeepEval if your application has non-RAG components (tool use, multi-turn agents). Ragas integrates with Langfuse, LangSmith, and Phoenix for trace-driven evaluation, so it slots into existing observability cleanly.

References

Ragas: Automated Evaluation of Retrieval Augmented Generation · arXiv (Es et al., 2023)
Ragas Documentation · Ragas
Ragas on GitHub · GitHub
