Ragas
Ragas provides a focused set of metrics and test-set tools specifically for evaluating retrieval-augmented and agentic LLM apps. Core metrics include faithfulness, answer relevancy, context precision, context recall, and a family of agent metrics (tool-call accuracy, topic adherence). It integrates with LangChain, LlamaIndex, Haystack, and LangSmith, and supports both LLM-as-judge and embedding-based scoring.
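As an intuition for what two of these metrics measure (not Ragas's actual implementation, which uses an LLM judge to extract and verify claims), here is a toy sketch: faithfulness as the fraction of answer claims grounded in the retrieved context, and context precision as mean precision@k over the ranks where a retrieved chunk is relevant. The boolean inputs stand in for the judgments the LLM judge would produce.

```python
def faithfulness_score(claims_supported: list[bool]) -> float:
    """Toy faithfulness: fraction of answer claims supported by retrieved context."""
    return sum(claims_supported) / len(claims_supported)

def context_precision_score(relevant_at_rank: list[bool]) -> float:
    """Toy context precision: mean of precision@k at each rank holding a relevant chunk."""
    hits, precisions = 0, []
    for k, relevant in enumerate(relevant_at_rank, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(len(precisions), 1)

# Answer decomposed into 3 claims, 2 grounded in context:
f = faithfulness_score([True, True, False])
# Relevant chunks retrieved at ranks 1 and 3:
p = context_precision_score([True, False, True])
```

Both scores land in [0, 1], which is why Ragas results are easy to track as dashboard metrics over time.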
Framework facts
- Category: rag
- Language: Python
- License: Apache 2.0
- Repository: https://github.com/explodinggradients/ragas
Install
pip install ragas
Quickstart
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
ds = Dataset.from_dict({
'question': ['What is RAG?'],
'answer': ['Retrieval-augmented generation.'],
'contexts': [['RAG retrieves documents before generation.']],
'ground_truth': ['RAG retrieves documents before generation.']
})
# evaluate() scores with an LLM judge; configure credentials (e.g. OPENAI_API_KEY) first
result = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_precision])
Alternatives
- DeepEval — broader LLM test framework
- TruLens — tracing + evals
- LangSmith Evals — managed alternative
- Phoenix by Arize — OTel-based evals
Frequently asked questions
Do I need ground-truth answers to use Ragas?
Some metrics (context recall, answer correctness) need ground truth. Others (faithfulness, answer relevancy, context precision via LLM judge) don't — they reason over the generated answer and retrieved context directly.
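To make the ground-truth dependency concrete, context recall can be sketched as the fraction of ground-truth statements attributable to the retrieved context. Ragas delegates the attribution judgment to an LLM; this toy version takes those judgments as pre-computed booleans.

```python
def context_recall_score(gt_claims_attributable: list[bool]) -> float:
    """Toy context recall: fraction of ground-truth claims covered by retrieved context."""
    if not gt_claims_attributable:
        return 0.0
    return sum(gt_claims_attributable) / len(gt_claims_attributable)

# Ground truth decomposed into 4 claims; retrieval covered 3 of them.
score = context_recall_score([True, True, True, False])
```

Without a ground-truth answer there is nothing to attribute, which is why this metric (unlike faithfulness) cannot run on generated output alone.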
Ragas or DeepEval?
Ragas is RAG-focused and has battle-tested metrics that many teams standardise on. DeepEval is broader (pytest-like harness, more metric families, bias/toxicity). Many teams use both — Ragas for retrieval quality, DeepEval for integration tests.
Sources
- Ragas — docs — accessed 2026-04-20
- Ragas — GitHub — accessed 2026-04-20