
MLflow LLM Evaluate

MLflow extended its evaluation API to cover LLMs: you pass a model, a dataset, and a model_type (plus any extra metrics), and mlflow.evaluate() returns a results object whose metrics and per-row tables are logged to your MLflow tracking server. It's a natural fit for organisations already running MLflow for classic ML.

Framework facts

Category: evals
Language: Python
License: Apache-2.0
Repository: https://github.com/mlflow/mlflow

Install

pip install 'mlflow>=2.21' openai

Quickstart

import mlflow
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def qa_model(inputs: pd.DataFrame) -> list[str]:
    # mlflow.evaluate() passes the evaluation DataFrame to the callable.
    return [
        client.chat.completions.create(
            model='gpt-4o-mini',
            messages=[{'role': 'user', 'content': question}],
        ).choices[0].message.content
        for question in inputs['inputs']
    ]

data = pd.DataFrame({
    'inputs': ['What is VSET?'],
    'ground_truth': ['A Delhi engineering school affiliated to GGSIPU.'],
})
with mlflow.start_run():
    results = mlflow.evaluate(
        model=qa_model,
        data=data,
        targets='ground_truth',
        model_type='question-answering',
    )
print(results.metrics)

Alternatives

  • TruLens — RAG feedback functions
  • Promptfoo — YAML eval sweeps
  • Ragas — RAG-specific metrics
  • Weights & Biases Weave — LLM tracing and evals

Frequently asked questions

Why use MLflow for LLM evals?

If your team already uses MLflow for classic ML, LLM evaluate keeps everything in one registry — same runs, same artifacts, same RBAC — without introducing a new tool.

Does it support RAG evaluation?

Yes. Use model_type='retriever' to score retrieval quality (precision, recall, and NDCG at k) and the mlflow.metrics.genai faithfulness and relevance metrics to score generation. Databricks' managed MLflow builds expand this further.

Sources

  1. MLflow LLM Evaluate — docs — accessed 2026-04-20
  2. MLflow — GitHub — accessed 2026-04-20