Capability · Framework — evals

DeepEval

DeepEval takes a pytest-native approach to LLM evaluation. You write test cases like unit tests, pick from a library of 40+ metrics (including faithfulness, answer relevancy, contextual precision, hallucination, bias, and toxicity), and run them in CI. It also ships a red-teaming module for automated adversarial testing. For teams that already live in pytest, DeepEval is the shortest path to production-grade evals.
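The pytest-native workflow can be sketched without the library itself. DeepEval's real metrics call an LLM judge; `ScoreResult` and `keyword_recall` below are hypothetical, deterministic stand-ins that only illustrate the shape of a metric-with-threshold test:

```python
# Sketch of the pytest-style eval workflow. The names here (ScoreResult,
# keyword_recall) are illustrative stand-ins, not DeepEval APIs.
from dataclasses import dataclass


@dataclass
class ScoreResult:
    score: float      # 0.0-1.0, higher is better
    threshold: float  # the metric passes when score >= threshold

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def keyword_recall(actual_output: str, expected_keywords: list[str],
                   threshold: float = 0.7) -> ScoreResult:
    """Fraction of expected keywords that appear in the output."""
    out = actual_output.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in out)
    return ScoreResult(score=hits / len(expected_keywords), threshold=threshold)


def test_mcp_answer():
    # Shaped like a DeepEval test: build a case, score it, assert the result
    result = keyword_recall(
        actual_output="MCP is a JSON-RPC protocol for LLM tools.",
        expected_keywords=["JSON-RPC", "protocol", "tools"],
    )
    assert result.passed, f"score {result.score:.2f} below {result.threshold}"
```

Because the test is an ordinary `test_*` function with a plain `assert`, pytest collects and runs it with no extra harness, which is the core of the pytest-native pitch.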

Framework facts

Category
evals
Language
Python
License
Apache 2.0
Repository
https://github.com/confident-ai/deepeval

Install

pip install -U deepeval

Quickstart

from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# A test case pairs the model's answer with the retrieval context it saw
tc = LLMTestCase(
    input='What is MCP?',
    actual_output='A wire protocol for LLM tools.',
    retrieval_context=['MCP is a JSON-RPC protocol...']
)
# Faithfulness passes when the output is grounded in the context
# (score >= 0.7); the metric uses an LLM judge, so an API key is needed
evaluate([tc], [FaithfulnessMetric(threshold=0.7)])

Alternatives

  • Ragas — RAG-specialised evals
  • Inspect AI — UK AISI's framework
  • Braintrust — commercial platform
  • LangSmith evals — LangChain-native

Frequently asked questions

DeepEval or Ragas?

Ragas is more RAG-focused with battle-tested RAG metrics. DeepEval is broader — safety, red-teaming, general LLM output checks — and has a cleaner pytest integration. Teams often use both: Ragas for RAG quality, DeepEval for the wider app.

Do I need Confident AI?

No — DeepEval is fully usable offline with local logs. Confident AI (their SaaS) adds dashboards, hosted datasets, and team collaboration. Start with the OSS library and upgrade only when you need shared eval infrastructure.

Sources

  1. DeepEval — docs — accessed 2026-04-20
  2. DeepEval on GitHub — accessed 2026-04-20