Capability · Framework — evals

DeepEval

DeepEval takes a pytest-native approach to LLM evaluation. You write test cases like unit tests, pick from a library of 40+ metrics (including faithfulness, answer relevancy, contextual precision, hallucination, bias, and toxicity), and run them in CI. It also ships a red-teaming module for automated adversarial testing. For teams that already live in pytest, DeepEval is the shortest path to production-grade evals.
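The pytest-native workflow can be sketched without the library itself. DeepEval's real metrics call an LLM judge; `ScoreResult` and `keyword_recall` below are hypothetical, deterministic stand-ins that only illustrate the shape of a metric-with-threshold test:

```python
# Sketch of the pytest-style eval workflow. The names here (ScoreResult,
# keyword_recall) are illustrative stand-ins, not DeepEval APIs.
from dataclasses import dataclass


@dataclass
class ScoreResult:
    score: float      # 0.0-1.0, higher is better
    threshold: float  # the metric passes when score >= threshold

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def keyword_recall(actual_output: str, expected_keywords: list[str],
                   threshold: float = 0.7) -> ScoreResult:
    """Fraction of expected keywords that appear in the output."""
    out = actual_output.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in out)
    return ScoreResult(score=hits / len(expected_keywords), threshold=threshold)


def test_mcp_answer():
    # Shaped like a DeepEval test: build a case, score it, assert the result
    result = keyword_recall(
        actual_output="MCP is a JSON-RPC protocol for LLM tools.",
        expected_keywords=["JSON-RPC", "protocol", "tools"],
    )
    assert result.passed, f"score {result.score:.2f} below {result.threshold}"
```

Because the test is an ordinary `test_*` function with a plain `assert`, pytest collects and runs it with no extra harness, which is the core of the pytest-native pitch.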

Framework facts

Category
evals
Language
Python
License
Apache 2.0
Repository
https://github.com/confident-ai/deepeval

Install

pip install -U deepeval

Quickstart

from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# A test case pairs the model's answer with the retrieval context it saw
tc = LLMTestCase(
    input='What is MCP?',
    actual_output='A wire protocol for LLM tools.',
    retrieval_context=['MCP is a JSON-RPC protocol...']
)
# Faithfulness passes when the output is grounded in the context
# (score >= 0.7); the metric uses an LLM judge, so an API key is needed
evaluate([tc], [FaithfulnessMetric(threshold=0.7)])

Alternatives

  • Ragas — RAG-specialised evals
  • Inspect AI — UK AISI's framework
  • Braintrust — commercial platform
  • LangSmith evals — LangChain-native

Frequently asked questions

DeepEval or Ragas?

Ragas is more RAG-focused with battle-tested RAG metrics. DeepEval is broader — safety, red-teaming, general LLM output checks — and has a cleaner pytest integration. Teams often use both: Ragas for RAG quality, DeepEval for the wider app.

Do I need Confident AI?

No — DeepEval is fully usable offline with local logs. Confident AI (their SaaS) adds dashboards, hosted datasets, and team collaboration. Start with the OSS library and upgrade only when you need shared eval infrastructure.

Sources

  1. DeepEval — docs — accessed 2026-04-20
  2. DeepEval on GitHub — accessed 2026-04-20