Capability · Framework — evals

Giskard

Giskard started as a testing tool for classical ML (tabular and NLP models) and later extended to LLMs with a 'scan' that runs automated probes for hallucination, prompt injection, discriminatory behaviour, and off-topic responses. The scan produces a test suite you can plug into CI. The LLM-specific Hub (SaaS) adds continuous monitoring, human annotation, and red-team simulations, but the open-source library covers most evaluation needs.

Framework facts

Category: evals
Language: Python
License: Apache 2.0 + commercial
Repository: https://github.com/Giskard-AI/giskard

Install

pip install 'giskard[llm]'

Quickstart

import giskard

# Batched prediction function: receives a pandas DataFrame and returns
# one answer per row. `my_agent` stands in for your model or chain.
def predict(df):
    return [my_agent(q) for q in df['question']]

model = giskard.Model(
    model=predict,
    model_type='text_generation',
    name='RAG bot',
    description='Answers questions from docs',  # used to steer the scan's probes
    feature_names=['question'],
)
report = giskard.scan(model)
report.to_html('report.html')
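Since the scan's findings can gate a CI pipeline, the pattern is usually: run the scan, then fail the build if any blocking issue is found. A minimal sketch of that gating step, where the issue dicts and severity labels are illustrative stand-ins rather than Giskard's actual report schema:

```python
# Hypothetical scan output: a list of issue dicts. The field names here
# mirror the idea of a scan report, not giskard's real data model.
issues = [
    {"detector": "prompt_injection", "severity": "major"},
    {"detector": "hallucination", "severity": "minor"},
]

def ci_gate(found_issues, fail_on=frozenset({"major"})):
    """Return a shell-style exit code: 1 if any blocking issue remains."""
    blocking = [i for i in found_issues if i["severity"] in fail_on]
    return 1 if blocking else 0

print(ci_gate(issues))  # 1: the major prompt-injection issue blocks the build
```

In a real pipeline you would feed this from the generated test suite and pass the result to `sys.exit`.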

Alternatives

  • DeepEval — pytest-first alternative
  • Inspect AI — the UK AI Safety Institute's eval framework
  • Promptfoo — lightweight YAML eval runner
  • PyRIT — Microsoft's red-teaming toolkit

Frequently asked questions

Is it only for LLMs?

No — Giskard's original scope is classical ML, including tabular models, and that still works. The LLM scan is a later addition built on the same test-suite framework, which is an advantage if you're evaluating both classical and generative AI.

How does the automated scan work?

It generates adversarial and probing test cases using an LLM judge and its own heuristics, then scores responses. It's a good starting point but should be paired with domain-specific tests — no automated scan covers every edge case.

Sources

  1. Giskard — docs — accessed 2026-04-20
  2. Giskard on GitHub — accessed 2026-04-20