DeepEval
DeepEval takes a pytest-native approach to LLM evaluation. You write test cases like unit tests, pick metrics from a library of 40+ (including faithfulness, answer-relevancy, context-precision, hallucination, bias, toxicity), and run them in CI. It also ships a red-teaming module for automated adversarial testing. For teams who already live in pytest, DeepEval is the shortest path to production-grade evals.
Framework facts
- Category
- evals
- Language
- Python
- License
- Apache 2.0
- Repository
- https://github.com/confident-ai/deepeval
Install
pip install -U deepeval
Quickstart
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase
tc = LLMTestCase(
input='What is MCP?',
actual_output='A wire protocol for LLM tools.',
retrieval_context=['MCP is a JSON-RPC protocol...']
)
evaluate(test_cases=[tc], metrics=[FaithfulnessMetric(threshold=0.7)])
Alternatives
- Ragas — RAG-specialised evals
- Inspect AI — UK AISI's framework
- Braintrust — commercial platform
- LangSmith evals — LangChain-native
Frequently asked questions
DeepEval or Ragas?
Ragas is more RAG-focused with battle-tested RAG metrics. DeepEval is broader — safety, red-teaming, general LLM output checks — and has a cleaner pytest integration. Teams often use both: Ragas for RAG quality, DeepEval for the wider app.
Do I need Confident AI?
No — DeepEval is fully usable offline with local logs. Confident AI (their SaaS) adds dashboards, hosted datasets, and team collaboration. Start with the OSS library and upgrade only when you need shared eval infrastructure.
Sources
- DeepEval — docs — accessed 2026-04-20
- DeepEval on GitHub — accessed 2026-04-20