Capability · Framework — evals

BIG-Bench Hard (BBH)

BBH is the canonical hard-reasoning subset of BIG-Bench, introduced by Suzgun et al. (2022). It isolates 23 tasks (boolean expressions, causal judgement, logical deduction, tracking shuffled objects, etc.) on which earlier LLMs scored below the average human rater. After chain-of-thought (CoT) prompting was shown to close much of that gap, BBH became the standard way to demonstrate reasoning gains from scaling or new prompting techniques.

Framework facts

Category: evals
Language: JSON datasets (language-agnostic)
License: Apache-2.0
Repository: https://github.com/suzgunmirac/BIG-Bench-Hard

Install

git clone https://github.com/suzgunmirac/BIG-Bench-Hard.git
# or via Hugging Face Datasets:
pip install datasets
python -c "from datasets import load_dataset; print(load_dataset('lukaemon/bbh','boolean_expressions'))"

Quickstart

from datasets import load_dataset

ds = load_dataset('lukaemon/bbh', 'tracking_shuffled_objects_seven_objects')
print(ds['test'][0])
# evaluate your model by comparing its answer to 'target'
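Scoring is plain exact match against the `target` string. A minimal sketch, assuming you have loaded BBH rows as dicts with `input` and `target` fields (the `always_true` stub is a placeholder for your own inference call):

```python
def exact_match_accuracy(examples, predict):
    """Fraction of examples where the prediction exactly matches 'target'."""
    correct = 0
    for ex in examples:
        if predict(ex["input"]).strip() == ex["target"].strip():
            correct += 1
    return correct / len(examples)

# Stub model for illustration only -- replace with a real model call.
def always_true(prompt):
    return "True"

# Two hand-written rows in the shape of BBH boolean_expressions examples.
sample = [
    {"input": "not ( True ) and ( True ) is", "target": "False"},
    {"input": "True and not not ( not False ) is", "target": "True"},
]
print(exact_match_accuracy(sample, always_true))  # 0.5
```

Stripping whitespace before comparing avoids penalizing trailing newlines in model output; anything beyond that (e.g. case folding) is a per-task judgment call.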

Alternatives

  • MMLU — broad multi-subject knowledge
  • GSM8K — grade-school math reasoning
  • MATH / AIME — harder math reasoning
  • LogiQA — formal logic reasoning

Frequently asked questions

Is BBH still a useful benchmark?

Frontier models now score 85-95% on BBH, so it no longer separates the best systems. But it remains a useful sanity check for smaller models and is reported in most LLM papers.

How do I evaluate CoT vs non-CoT on BBH?

Run the benchmark twice: once with a direct-answer prompt and once with a "Let's think step by step" instruction added to the prompt. Compare accuracy per sub-task; most papers report both numbers.
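The two runs differ only in the prompt template, plus an answer-extraction step for the CoT run, since the final answer is buried in the reasoning text. A sketch under those assumptions (the templates and the "the answer is" convention are common choices, not part of BBH itself):

```python
DIRECT_TEMPLATE = "Q: {question}\nA:"
COT_TEMPLATE = "Q: {question}\nA: Let's think step by step."

def build_prompts(question):
    """Produce the direct-answer and CoT variants of one BBH question."""
    return {
        "direct": DIRECT_TEMPLATE.format(question=question),
        "cot": COT_TEMPLATE.format(question=question),
    }

def extract_answer(completion):
    """Pull the final answer out of a CoT completion.

    Convention assumed here: take whatever follows the last occurrence
    of 'the answer is'; fall back to the whole completion otherwise.
    """
    marker = "the answer is"
    lowered = completion.lower()
    if marker in lowered:
        start = lowered.rindex(marker) + len(marker)
        return completion[start:].strip(" .")
    return completion.strip()

prompts = build_prompts("not ( True ) and ( True ) is")
print(extract_answer("Negating True gives False, so the answer is False."))  # False
```

With both runs scored by the same exact-match function, the per-sub-task accuracy gap directly measures how much CoT helps on each task family.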

Sources

  1. BBH paper — accessed 2026-04-20
  2. BBH GitHub — accessed 2026-04-20