TL;DR
- Open-source Python framework for evaluating RAG pipelines, originating with Es et al. (2023, arXiv:2309.15217) at Exploding Gradients and now maintained as a broader RAG evaluation library.
- Defines a small, principled metric set: faithfulness, answer relevancy, context precision, context recall, and context entities recall — each implemented as an LLM-as-judge metric with a documented prompting strategy.
- Designed to be reference-free where possible: the workhorse metrics (faithfulness and answer relevancy) do not require a hand-annotated ground-truth answer, only the question, the retrieved context, and the generated answer.
- Now ships test-set generation, multi-turn evaluation, and integrations with Langfuse, LangSmith, and Phoenix for trace-driven evaluation.
Background
Ragas was introduced by Shahul Es and colleagues at Exploding Gradients in their 2023 paper "Ragas: Automated Evaluation of Retrieval Augmented Generation" (arXiv:2309.15217). The paper made two arguments: that RAG evaluation needed metrics specific to the retrieval-plus-generation pattern (not just generic LLM metrics), and that those metrics could be implemented as LLM-as-judge graders without requiring per-question reference answers.
The framework's adoption was rapid because it filled an obvious gap: RAG was suddenly the dominant LLM application pattern in 2023-2024, and existing metrics (BLEU, ROUGE, BERTScore) measure surface or embedding overlap with a reference text rather than whether an answer is grounded in the retrieved context.
The Core Metric Set
Faithfulness and Answer Relevancy are the reference-free workhorses — runnable on any RAG output without hand annotation. Context Precision and Recall require ground-truth answers (which Ragas's test-set generator can synthesise) and are useful for tuning the retriever specifically.
| Metric | What it measures | Requires reference? |
|---|---|---|
| Faithfulness | Are answer claims supported by retrieved context | No |
| Answer Relevancy | Does the answer address the question | No |
| Context Precision | Are relevant chunks ranked highly | Yes (ground-truth) |
| Context Recall | Did retrieval find the necessary context | Yes (ground-truth) |
| Context Entities Recall | Are key entities present in context | Yes (ground-truth) |
| Noise Sensitivity | Does the model misuse irrelevant context | Yes (ground-truth) |
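To make the two retriever-side rows concrete, here is a minimal Python sketch of how a rank-aware context precision and a context recall could be scored once verdicts exist. It assumes the per-chunk and per-claim verdicts have already been produced (in Ragas they would come from the judge LLM), it uses an average-precision-style formula as an illustration rather than as the library's exact implementation, and the function names are hypothetical.

```python
from typing import Sequence


def context_precision(chunk_is_relevant: Sequence[bool]) -> float:
    """Average-precision-style score over retrieved chunks, in rank order.

    Hypothetical helper, not a Ragas function. Verdicts are assumed given.
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(chunk_is_relevant, start=1):
        if is_relevant:
            hits += 1
            # precision@rank, accumulated only at ranks holding a relevant chunk
            weighted += hits / rank
    return weighted / hits if hits else 0.0


def context_recall(reference_claim_found: Sequence[bool]) -> float:
    """Fraction of ground-truth claims that the retrieved context covers."""
    if not reference_claim_found:
        return 0.0  # assumption: an empty reference scores zero
    return sum(reference_claim_found) / len(reference_claim_found)


# Same two relevant chunks, ranked at the top versus at the bottom.
ranked_first = context_precision([True, True, False])
ranked_last = context_precision([False, True, True])
print(ranked_first, ranked_last)
```

The two calls retrieve the same relevant chunks, but the first scores higher because those chunks sit at the top of the ranking. That ordering sensitivity is what makes the metric useful for tuning the retriever.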
How Faithfulness Works
Faithfulness is computed in two LLM-judge stages. First, the answer is decomposed into atomic claims (short factual statements). Second, each claim is checked against the retrieved context: is it directly supported, contradicted, or unsupported? Faithfulness is the proportion of claims supported by the context.
This per-claim approach is more diagnostic than a single overall judgement because it tells you which specific statements are hallucinated. Combined with the retrieved context for each example, it points at whether the failure is a retriever problem (context missing the answer) or a generator problem (model ignoring or contradicting context).
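The two-stage computation can be sketched in a few lines of Python. This is an illustrative outline, not the Ragas implementation: `decompose` and `verify` stand in for the two judge-LLM calls, and the toy versions below (sentence splitting and substring matching) are placeholder assumptions so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ClaimVerdict:
    claim: str
    verdict: str  # "supported", "contradicted", or "unsupported"


def faithfulness_score(
    answer: str,
    context: str,
    decompose: Callable[[str], List[str]],
    verify: Callable[[str, str], str],
) -> Tuple[float, List[ClaimVerdict]]:
    """Stage 1: split the answer into claims. Stage 2: judge each claim."""
    verdicts = [ClaimVerdict(c, verify(c, context)) for c in decompose(answer)]
    if not verdicts:
        return 0.0, verdicts  # assumption: no claims scores zero
    supported = sum(v.verdict == "supported" for v in verdicts)
    return supported / len(verdicts), verdicts


# Toy stand-ins for the judge LLM (hypothetical, for demonstration only).
def toy_decompose(answer: str) -> List[str]:
    return [s.strip() for s in answer.split(".") if s.strip()]


def toy_verify(claim: str, context: str) -> str:
    return "supported" if claim.lower() in context.lower() else "unsupported"


answer = "Paris is the capital of France. It has 10 million residents."
context = "Paris is the capital of France and its largest city."
score, verdicts = faithfulness_score(answer, context, toy_decompose, toy_verify)
print(score)  # 0.5: one of the two claims is supported
for v in verdicts:
    print(v.verdict, "-", v.claim)
```

Returning the per-claim verdicts alongside the score is what enables the diagnosis described above: the unsupported claims can be read next to the retrieved context to see whether the retriever or the generator is at fault.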
Test-Set Generation
Ragas's test-set generator (`TestsetGenerator`) takes a corpus of documents and produces a synthetic evaluation dataset with question types of varying difficulty — simple, reasoning, multi-context, and conditional. The synthesiser uses a graph over the corpus to plan questions that span multiple chunks and require non-trivial retrieval.
Like all synthesised datasets, these are useful for coverage and for bootstrapping; they do not replace human-curated questions for assessing real-world quality. Use them as the bulk of an early evaluation set and gradually replace them with human-curated questions as production traffic accumulates.
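The idea of planning multi-chunk questions over a graph can be illustrated with a toy Python sketch. It is not how `TestsetGenerator` is implemented: it assumes each chunk already comes with a set of extracted entities, links chunks that share an entity, and treats each link as a seed for a multi-context question. All names here are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def build_chunk_graph(
    chunk_entities: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], Set[str]]:
    """Connect every pair of chunks that share at least one entity."""
    edges: Dict[Tuple[str, str], Set[str]] = {}
    for a, b in combinations(sorted(chunk_entities), 2):
        shared = chunk_entities[a] & chunk_entities[b]
        if shared:
            edges[(a, b)] = shared
    return edges


def multi_context_seeds(
    edges: Dict[Tuple[str, str], Set[str]],
) -> List[Tuple[str, str, List[str]]]:
    """Each edge is a candidate pair of chunks for one multi-context question."""
    return [(a, b, sorted(shared)) for (a, b), shared in sorted(edges.items())]


# Hypothetical entity sets, as if produced by an earlier extraction step.
chunks = {
    "c1": {"Ragas", "faithfulness"},
    "c2": {"faithfulness", "claims"},
    "c3": {"Langfuse"},
}
seeds = multi_context_seeds(build_chunk_graph(chunks))
print(seeds)  # [('c1', 'c2', ['faithfulness'])]
```

A question generated from the `c1`/`c2` seed can only be answered if the retriever returns both chunks, which is the property that makes multi-context questions a harder retrieval test than single-chunk ones.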
Ragas metrics use a judge LLM that you configure, and the choice of judge matters. Use a strong frontier model (Claude Opus, GPT-4 class) for production scoring, not the same small model you serve in your application: an unreliable judge adds variance to every score, so judge quality largely sets the noise floor of the evaluation.
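One rough way to see how much noise a judge contributes is to score the same example several times and look at the spread. The sketch below assumes a hypothetical `score_once` callable that wraps a single judge-backed metric call; it is not part of Ragas, and the scripted scores stand in for real judge output.

```python
import statistics
from typing import Callable, Tuple


def judge_noise(score_once: Callable[[], float], repeats: int = 5) -> Tuple[float, float]:
    """Score one fixed example `repeats` times; return (mean, sample std dev)."""
    if repeats < 2:
        raise ValueError("need at least two runs to estimate spread")
    scores = [score_once() for _ in range(repeats)]
    return statistics.mean(scores), statistics.stdev(scores)


# Stand-in for a judge whose verdicts vary between runs (hypothetical values).
scripted = iter([0.8, 0.6, 1.0, 0.8])
mean, spread = judge_noise(lambda: next(scripted), repeats=4)
print(round(mean, 3), round(spread, 3))
```

If the spread on a fixed example is comparable to the score differences between two pipeline variants, the comparison is not yet trustworthy; a stronger judge or more repeats would be needed.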
When to Pick Ragas
Pick Ragas when your application is primarily RAG and you want the most principled RAG-specific metric set available. Use it alongside, not instead of, a general framework like DeepEval if your application has non-RAG components (tool use, multi-turn agents). Ragas integrates with Langfuse, LangSmith, and Phoenix for trace-driven evaluation, so it slots into existing observability cleanly.
References
- Ragas: Automated Evaluation of Retrieval Augmented Generation · arXiv (Es et al., 2023)
- Ragas Documentation · Ragas
- Ragas on GitHub · GitHub