
MLflow LLM Evaluate

MLflow extended its evaluation API to cover LLMs: you pass a model, a dataset, and a model_type (plus any extra metrics), and mlflow.evaluate() returns a results object whose metrics and per-row tables are logged to your MLflow tracking server. It's a natural fit for organisations already running MLflow for classic ML.

Framework facts

Category: evals
Language: Python
License: Apache-2.0
Repository: https://github.com/mlflow/mlflow

Install

pip install 'mlflow>=2.21' openai

Quickstart

import mlflow
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def qa_model(inputs: pd.DataFrame) -> list[str]:
    # mlflow.evaluate() passes the evaluation DataFrame to the callable.
    return [
        client.chat.completions.create(
            model='gpt-4o-mini',
            messages=[{'role': 'user', 'content': question}],
        ).choices[0].message.content
        for question in inputs['inputs']
    ]

data = pd.DataFrame({
    'inputs': ['What is VSET?'],
    'ground_truth': ['A Delhi engineering school affiliated to GGSIPU.'],
})
with mlflow.start_run():
    results = mlflow.evaluate(
        model=qa_model,
        data=data,
        targets='ground_truth',
        model_type='question-answering',
    )
print(results.metrics)

Alternatives

  • TruLens — RAG feedback functions
  • Promptfoo — YAML eval sweeps
  • Ragas — RAG-specific metrics
  • Weights & Biases Weave — LLM tracing and evals

Frequently asked questions

Why use MLflow for LLM evals?

If your team already uses MLflow for classic ML, LLM evaluate keeps everything in one registry — same runs, same artifacts, same RBAC — without introducing a new tool.

Does it support RAG evaluation?

Yes. Use model_type='retriever' to score retrieval quality (precision, recall, and NDCG at k) and the mlflow.metrics.genai faithfulness and relevance metrics to score generation. Databricks' managed MLflow builds expand this further.

Sources

  1. MLflow LLM Evaluate — docs — accessed 2026-04-20
  2. MLflow — GitHub — accessed 2026-04-20