Ragas
Ragas provides a focused set of metrics and test-set tools specifically for evaluating retrieval-augmented and agentic LLM apps. Core metrics include faithfulness, answer relevancy, context precision, context recall, and a family of agent metrics (tool-call accuracy, topic adherence). It integrates with LangChain, LlamaIndex, Haystack, and LangSmith, and supports both LLM-as-judge and embedding-based scoring.
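As an intuition for what two of these metrics measure (not Ragas's actual implementation, which uses an LLM judge to extract and verify claims), here is a toy sketch: faithfulness as the fraction of answer claims grounded in the retrieved context, and context precision as mean precision@k over the ranks where a retrieved chunk is relevant. The boolean inputs stand in for the judgments the LLM judge would produce.

```python
def faithfulness_score(claims_supported: list[bool]) -> float:
    """Toy faithfulness: fraction of answer claims supported by retrieved context."""
    return sum(claims_supported) / len(claims_supported)

def context_precision_score(relevant_at_rank: list[bool]) -> float:
    """Toy context precision: mean of precision@k at each rank holding a relevant chunk."""
    hits, precisions = 0, []
    for k, relevant in enumerate(relevant_at_rank, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(len(precisions), 1)

# Answer decomposed into 3 claims, 2 grounded in context:
f = faithfulness_score([True, True, False])
# Relevant chunks retrieved at ranks 1 and 3:
p = context_precision_score([True, False, True])
```

Both scores land in [0, 1], which is why Ragas results are easy to track as dashboard metrics over time.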
Framework facts
- Category: rag
- Language: Python
- License: Apache 2.0
- Repository: https://github.com/explodinggradients/ragas
Install
pip install ragas
Quickstart
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
ds = Dataset.from_dict({
'question': ['What is RAG?'],
'answer': ['Retrieval-augmented generation.'],
'contexts': [['RAG retrieves documents before generation.']],
'ground_truth': ['RAG retrieves documents before generation.']
})
# evaluate() scores with an LLM judge; configure credentials (e.g. OPENAI_API_KEY) first
result = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_precision])
Alternatives
- DeepEval — broader LLM test framework
- TruLens — tracing + evals
- LangSmith Evals — managed alternative
- Phoenix by Arize — OTel-based evals
Frequently asked questions
Do I need ground-truth answers to use Ragas?
Some metrics (context recall, answer correctness) need ground truth. Others (faithfulness, answer relevancy, context precision via LLM judge) don't — they reason over the generated answer and retrieved context directly.
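To make the ground-truth dependency concrete, context recall can be sketched as the fraction of ground-truth statements attributable to the retrieved context. Ragas delegates the attribution judgment to an LLM; this toy version takes those judgments as pre-computed booleans.

```python
def context_recall_score(gt_claims_attributable: list[bool]) -> float:
    """Toy context recall: fraction of ground-truth claims covered by retrieved context."""
    if not gt_claims_attributable:
        return 0.0
    return sum(gt_claims_attributable) / len(gt_claims_attributable)

# Ground truth decomposed into 4 claims; retrieval covered 3 of them.
score = context_recall_score([True, True, True, False])
```

Without a ground-truth answer there is nothing to attribute, which is why this metric (unlike faithfulness) cannot run on generated output alone.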
Ragas or DeepEval?
Ragas is RAG-focused and has battle-tested metrics that many teams standardise on. DeepEval is broader (pytest-like harness, more metric families, bias/toxicity). Many teams use both — Ragas for retrieval quality, DeepEval for integration tests.
Sources
- Ragas — docs — accessed 2026-04-20
- Ragas — GitHub — accessed 2026-04-20