PromptBench

PromptBench, from Microsoft Research, packages benchmark tasks (MMLU, GSM8K, SST-2, and more), adversarial prompt attacks, prompt-engineering variants (chain-of-thought, few-shot), and dynamic evaluation (DyVal) into a single Python API. It's popular for robustness studies — quantifying how much a model's accuracy drops when prompts are perturbed, paraphrased, or attacked; the sketch after the Quickstart below shows that comparison in code.

Framework facts

Category: evals
Language: Python
License: MIT
Repository: https://github.com/microsoft/promptbench

Install

pip install promptbench

Quickstart

import promptbench as pb

dataset = pb.DatasetLoader.load_dataset('sst2')  # other tasks ('mmlu', 'gsm8k', ...) load the same way
model = pb.LLMModel(model='gpt-4o-mini', api_key='sk-...', max_new_tokens=10)
prompt = 'Classify the sentence as positive or negative: {content}'

preds, labels = [], []
for data in dataset:
    input_text = pb.InputProcess.basic_format(prompt, data)  # fills the {content} placeholder
    raw_pred = model(input_text)
    preds.append(pb.OutputProcess.cls(raw_pred, lambda p: {'negative': 0, 'positive': 1}.get(p, -1)))
    labels.append(data['label'])

print(pb.Eval.compute_cls_accuracy(preds, labels))
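
The robustness angle comes from comparing such scores across prompt variants. The sketch below reuses dataset and model from the quickstart; pb.Prompt wraps a list of alternative wordings, and the two prompt strings here are illustrative, not shipped with the library:

prompts = pb.Prompt([
    'Classify the sentence as positive or negative: {content}',
    'Decide whether the following sentence is positive or negative: {content}',
])

for prompt in prompts:
    preds, labels = [], []
    for data in dataset:
        raw_pred = model(pb.InputProcess.basic_format(prompt, data))
        preds.append(pb.OutputProcess.cls(raw_pred, lambda p: {'negative': 0, 'positive': 1}.get(p, -1)))
        labels.append(data['label'])
    # the spread between the best and worst score is the robustness gap
    print(f'{pb.Eval.compute_cls_accuracy(preds, labels):.3f}  {prompt}')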

Alternatives

  • HELM — Stanford's Holistic Evaluation of Language Models
  • OpenAI Evals — OpenAI's framework and registry of evals
  • OpenCompass — comprehensive Chinese-led eval suite
  • lm-evaluation-harness — EleutherAI's standard benchmark harness

Frequently asked questions

What's unique about PromptBench?

Robustness. It specifically includes adversarial prompt attacks (character-, word-, sentence-, semantic-level) so you can measure the gap between best-case and worst-case accuracy.
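
To make that concrete, here is a hedged sketch of an attack run, modeled on the prompt_attack example in the repository; the 'stresstest' attack name, the Attack import path, and the eval_func contract (prompt, dataset, model, returning accuracy) follow that example, so check the repo for the current signatures:

import promptbench as pb
from promptbench.prompt_attack import Attack

model = pb.LLMModel(model='google/flan-t5-large', max_new_tokens=10)
dataset = pb.DatasetLoader.load_dataset('sst2')[:10]  # small slice to keep the attack cheap
prompt = "Classify the sentence as 'positive' or 'negative': {content}"

def eval_func(prompt, dataset, model):
    # scores one candidate prompt; the attack searches for the wording that minimizes this
    preds, labels = [], []
    for d in dataset:
        out = model(pb.InputProcess.basic_format(prompt, d))
        preds.append(pb.OutputProcess.cls(out, lambda p: {'negative': 0, 'positive': 1}.get(p, -1)))
        labels.append(d['label'])
    return pb.Eval.compute_cls_accuracy(preds, labels)

# label words stay untouchable so only the instruction text gets perturbed
attack = Attack(model, 'stresstest', dataset, prompt, eval_func,
                unmodifiable_words=["positive'", "negative'", 'content'])
print(attack.attack())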

Does it run local models?

Yes — use the `pb.LLMModel` wrapper for OpenAI, Anthropic, local Hugging Face, or vLLM endpoints.
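
For the local case, a minimal sketch: pass a Hugging Face model name instead of an API key, as in the repository's own examples (flan-t5-large is just a conveniently small model):

import promptbench as pb

# downloads and runs the model locally via Hugging Face; no api_key needed
model = pb.LLMModel(model='google/flan-t5-large',
                    max_new_tokens=10, temperature=0.0001)
print(model('Classify the sentence as positive or negative: great movie!'))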

Sources

  1. PromptBench GitHub — accessed 2026-04-20
  2. PromptBench paper — accessed 2026-04-20