Giskard
Giskard started as a classical ML testing tool (tabular, NLP) and extended to LLMs with a 'scan' that runs automated probes for hallucinations, prompt injection, discriminatory behaviour, and off-topic responses. It produces a test suite you can plug into CI. The LLM-specific Hub (SaaS) adds continuous monitoring, human annotation, and red-team simulations, but the OSS library covers most evaluation needs.
Framework facts
- Category
- evals
- Language
- Python
- License
- Apache 2.0 + commercial
- Repository
- https://github.com/Giskard-AI/giskard
Install
pip install 'giskard[llm]'
Quickstart
import giskard

def predict(df):
    # batch prediction: one answer per row of the 'question' column
    return [my_agent(q) for q in df['question']]

model = giskard.Model(
    model=predict,
    model_type='text_generation',
    name='RAG bot',
    description='Answers questions from docs',
    feature_names=['question'],
)
report = giskard.scan(model)
report.to_html('report.html')
Alternatives
- DeepEval — pytest-first alternative
- Inspect AI — eval-focused with safety tests
- Promptfoo — lightweight YAML eval runner
- PyRIT — Microsoft's red-teaming toolkit
Frequently asked questions
Is it only for LLMs?
No — Giskard's original scope is classical ML, including tabular models, and that still works. The LLM scan is a later addition built on the same test-suite framework, which is an advantage if you're evaluating both classical and generative AI.
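What the two modes share is the batch-prediction contract: a callable that takes a batch of input rows and returns one output per row, which Giskard then wraps as a model. A minimal sketch of that contract, using plain lists of dicts in place of a pandas DataFrame; `tabular_predict`, `llm_predict`, and the stub logic are illustrative, not Giskard APIs:

```python
# Sketch: one batch-prediction contract serves both classical and LLM models.
# In real use, each callable would be wrapped with giskard's model wrapper.

def tabular_predict(rows):
    # stub fraud score from two features of a tabular record
    return [0.9 if r["amount"] > 1000 and r["new_account"] else 0.1 for r in rows]

def llm_predict(rows):
    # stub RAG answer keyed off the 'question' feature
    return [f"Answer to: {r['question']}" for r in rows]

print(tabular_predict([{"amount": 5000, "new_account": True}]))  # [0.9]
print(llm_predict([{"question": "What is Giskard?"}]))
```

Because both callables have the same shape, the same test-suite machinery can run over either kind of model.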
How does the automated scan work?
It generates adversarial and probing test cases using an LLM judge and its own heuristics, then scores responses. It's a good starting point but should be paired with domain-specific tests — no automated scan covers every edge case.
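A domain-specific test can be as simple as a deterministic check run alongside the scan. A hedged, stdlib-only sketch: `my_agent` and the two policy rules are illustrative stand-ins, not part of Giskard:

```python
import re

def my_agent(question):
    # stand-in for the real RAG bot
    if "refund" in question.lower():
        return "Refunds are issued within 14 days. [source: policy.md]"
    return "I can only answer questions about our documentation."

def check_cites_source(answer):
    # domain rule: factual answers must cite a source document
    return bool(re.search(r"\[source: [\w.\-]+\]", answer))

def check_stays_on_topic(answer, banned=("investment advice",)):
    # domain rule: the bot must never stray into banned territory
    return not any(b in answer.lower() for b in banned)

answer = my_agent("How do refunds work?")
assert check_cites_source(answer)
assert check_stays_on_topic(answer)
```

Checks like these encode knowledge the automated scan cannot infer, and they slot into the same CI run as the generated test suite.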
Sources
- Giskard — docs — accessed 2026-04-20
- Giskard on GitHub — accessed 2026-04-20