PromptBench
PromptBench from Microsoft Research packages together benchmark tasks (from MMLU to GSM8K), adversarial prompt attacks, prompt-engineering variants (CoT, few-shot), and dynamic evaluation into a single Python API. It's popular for robustness studies — quantifying how much a model's accuracy drops when prompts are perturbed, paraphrased, or attacked.
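The robustness gap described above can be illustrated without the library itself: score the same model on clean prompts and on perturbed variants, then compare accuracy. This is a minimal self-contained sketch; the toy model and helper names are stand-ins, not PromptBench API.

```python
from typing import Callable

def accuracy(model: Callable[[str], str], prompts: list[str], labels: list[str]) -> float:
    """Fraction of prompts the model answers correctly."""
    correct = sum(model(p) == y for p, y in zip(prompts, labels))
    return correct / len(prompts)

def robustness_gap(model, clean, perturbed, labels):
    """Accuracy drop when every prompt is replaced by its perturbed variant."""
    return accuracy(model, clean, labels) - accuracy(model, perturbed, labels)

# Toy stand-in model: answers correctly only when the keyword survives intact.
toy_model = lambda p: "positive" if "great" in p else "negative"

clean = ["This movie was great", "This movie was great fun"]
perturbed = ["This movie was gr3at", "This movie was great fun"]  # character-level attack
labels = ["positive", "positive"]

print(robustness_gap(toy_model, clean, perturbed, labels))  # 0.5
```

The gap (here 0.5) is exactly the "best-case vs worst-case" spread a robustness study reports; PromptBench automates the perturbation and scoring steps across real benchmarks.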
Framework facts
- Category: evals
- Language: Python
- License: MIT
- Repository: https://github.com/microsoft/promptbench
Install

```shell
pip install promptbench
```

Quickstart
```python
import promptbench as pb

# Load a benchmark dataset and wrap a model endpoint
dataset = pb.DatasetLoader.load_dataset('mmlu')
model = pb.LLMModel(model='gpt-4o-mini', api_key='sk-...')

# Prompt template(s) to evaluate; {content} is filled per example
prompt = pb.Prompt(['You are a careful examiner.', 'Question: {content}'])

result = pb.InferencePipeline(model, prompt, dataset).inference()
print(pb.metrics.Eval.compute_cls_accuracy(result, dataset))
```

Alternatives
- HELM — Stanford's holistic evaluation
- OpenAI evals — OpenAI's library
- OpenCompass — comprehensive Chinese-led eval suite
- lm-eval-harness — EleutherAI's standard harness for benchmarks
Frequently asked questions
What's unique about PromptBench?
Robustness. It specifically includes adversarial prompt attacks (character-, word-, sentence-, semantic-level) so you can measure the gap between best-case and worst-case accuracy.
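To make the attack levels concrete, here is a self-contained sketch of one character-level and one word-level perturbation. PromptBench's real attacks (e.g. DeepWordBug, TextFooler) are search-based and far stronger; these fixed substitutions are only illustrative, and every name below is hypothetical.

```python
import random

# Visually similar substitutions for a crude character-level attack.
LEET = {"a": "@", "e": "3", "o": "0", "i": "1"}

def char_attack(prompt: str, rng: random.Random) -> str:
    """Character-level: swap one eligible letter for a look-alike symbol."""
    idxs = [i for i, c in enumerate(prompt) if c in LEET]
    if not idxs:
        return prompt
    i = rng.choice(idxs)
    return prompt[:i] + LEET[prompt[i]] + prompt[i + 1:]

def word_attack(prompt: str, rng: random.Random) -> str:
    """Word-level: duplicate one word, a crude stand-in for synonym swaps."""
    words = prompt.split()
    i = rng.randrange(len(words))
    words.insert(i, words[i])
    return " ".join(words)

rng = random.Random(0)
print(char_attack("Choose the correct option", rng))
print(word_attack("Choose the correct option", rng))
```

Measuring accuracy before and after such perturbations, across many random seeds, is what yields the best-case/worst-case gap described above.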
Does it run local models?
Yes — use the `pb.LLMModel` wrapper for OpenAI, Anthropic, local Hugging Face, or vLLM endpoints.
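The unified-wrapper idea behind such a class can be sketched generically. This is not PromptBench's actual `LLMModel` implementation; it only illustrates dispatching one call interface over multiple backends, with hypothetical stub backends in place of real API clients.

```python
from typing import Callable, Dict

class UnifiedModel:
    """Route a single __call__ interface to whichever backend hosts the model."""

    def __init__(self, model: str, backends: Dict[str, Callable[[str, str], str]]):
        # Pick a backend by model-name prefix, e.g. "openai/..." or "hf/...".
        provider, _, name = model.partition("/")
        if provider not in backends:
            raise ValueError(f"no backend registered for {provider!r}")
        self.name = name
        self.backend = backends[provider]

    def __call__(self, prompt: str) -> str:
        return self.backend(self.name, prompt)

# Hypothetical stub backends standing in for real API or local clients.
backends = {
    "openai": lambda name, prompt: f"[openai:{name}] {prompt}",
    "hf": lambda name, prompt: f"[hf:{name}] {prompt}",
}

model = UnifiedModel("hf/flan-t5-large", backends)
print(model("Question: 2 + 2 = ?"))  # routed to the hf backend
```

Keeping the call signature identical across providers is what lets the same evaluation pipeline run unchanged against hosted and local models.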
Sources
- PromptBench GitHub — accessed 2026-04-20
- PromptBench paper — accessed 2026-04-20