MLflow LLM Evaluate
MLflow extended its evaluation API to cover LLMs: you pass a model, a dataset, and a list of metrics, and mlflow.evaluate() returns a results object logged to your MLflow tracking server. It's a natural fit for organisations already running MLflow for classic ML.
Framework facts
- Category
- evals
- Language
- Python
- License
- Apache-2.0
- Repository
- https://github.com/mlflow/mlflow
Install
pip install 'mlflow>=2.21' openai
Quickstart
import mlflow
import pandas as pd
data = pd.DataFrame({
    'inputs': ['What is VSET?'],
    'ground_truth': ['A Delhi engineering school affiliated to GGSIPU.'],
})
with mlflow.start_run():
    results = mlflow.evaluate(
        model='openai:/gpt-4o-mini',
        data=data,
        targets='ground_truth',
        model_type='question-answering',
    )
print(results.metrics)
Alternatives
- TruLens — RAG feedback functions
- Promptfoo — YAML eval sweeps
- Ragas — RAG-specific metrics
- Weights & Biases Weave — hosted LLM tracing and evals
Frequently asked questions
Why use MLflow for LLM evals?
If your team already uses MLflow for classic ML, LLM evaluate keeps everything in one registry — same runs, same artifacts, same RBAC — without introducing a new tool.
Does it support RAG evaluation?
Yes. Run mlflow.evaluate with model_type='retriever' to score retrieval quality (precision, recall, and NDCG at k), and use the mlflow.metrics.genai.faithfulness metric to check generated answers against retrieved context. Databricks' managed MLflow builds expand this further.
Sources
- MLflow LLM Evaluate — docs — accessed 2026-04-20
- MLflow — GitHub — accessed 2026-04-20