Capability · Framework — evals

Giskard

Giskard started as a testing tool for classical ML (tabular and NLP models) and later extended to LLMs with a 'scan' that runs automated probes for hallucination, prompt injection, discriminatory behaviour, and off-topic responses. The scan produces a test suite you can plug into CI. The LLM-specific Hub (SaaS) adds continuous monitoring, human annotation, and red-team simulations, but the open-source library covers most evaluation needs.

Framework facts

Category: evals
Language: Python
License: Apache 2.0 + commercial
Repository: https://github.com/Giskard-AI/giskard

Install

pip install 'giskard[llm]'

Quickstart

import giskard

# Batched prediction function: receives a pandas DataFrame and returns
# one answer per row. `my_agent` stands in for your model or chain.
def predict(df):
    return [my_agent(q) for q in df['question']]

model = giskard.Model(
    model=predict,
    model_type='text_generation',
    name='RAG bot',
    description='Answers questions from docs',  # used to steer the scan's probes
    feature_names=['question'],
)
report = giskard.scan(model)
report.to_html('report.html')
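Since the scan's findings can gate a CI pipeline, the pattern is usually: run the scan, then fail the build if any blocking issue is found. A minimal sketch of that gating step, where the issue dicts and severity labels are illustrative stand-ins rather than Giskard's actual report schema:

```python
# Hypothetical scan output: a list of issue dicts. The field names here
# mirror the idea of a scan report, not giskard's real data model.
issues = [
    {"detector": "prompt_injection", "severity": "major"},
    {"detector": "hallucination", "severity": "minor"},
]

def ci_gate(found_issues, fail_on=frozenset({"major"})):
    """Return a shell-style exit code: 1 if any blocking issue remains."""
    blocking = [i for i in found_issues if i["severity"] in fail_on]
    return 1 if blocking else 0

print(ci_gate(issues))  # 1: the major prompt-injection issue blocks the build
```

In a real pipeline you would feed this from the generated test suite and pass the result to `sys.exit`.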

Alternatives

  • DeepEval — pytest-first alternative
  • Inspect AI — the UK AI Safety Institute's eval framework
  • Promptfoo — lightweight YAML eval runner
  • PyRIT — Microsoft's red-teaming toolkit

Frequently asked questions

Is it only for LLMs?

No — Giskard's original scope is classical ML, including tabular models, and that still works. The LLM scan is a later addition built on the same test-suite framework, which is an advantage if you're evaluating both classical and generative AI.

How does the automated scan work?

It generates adversarial and probing test cases using an LLM judge and its own heuristics, then scores responses. It's a good starting point but should be paired with domain-specific tests — no automated scan covers every edge case.

Sources

  1. Giskard — docs — accessed 2026-04-20
  2. Giskard on GitHub — accessed 2026-04-20