BIG-Bench Hard (BBH)
BBH is the canonical hard-reasoning subset of BIG-Bench, introduced by Suzgun et al. (2022). It isolates 23 tasks (boolean expressions, causal judgement, logical deduction, tracking shuffled objects, etc.) on which prior language models failed to outperform the average human rater. After chain-of-thought (CoT) prompting was shown to close much of that gap, BBH became the standard benchmark for demonstrating reasoning gains from scaling or new prompting techniques.
Framework facts
- Category: evals
- Language: JSON datasets (language-agnostic)
- License: Apache-2.0
- Repository: https://github.com/suzgunmirac/BIG-Bench-Hard
Install
git clone https://github.com/suzgunmirac/BIG-Bench-Hard.git
# or via Hugging Face Datasets:
pip install datasets
python -c "from datasets import load_dataset; print(load_dataset('lukaemon/bbh', 'boolean_expressions'))"
Quickstart
from datasets import load_dataset
ds = load_dataset('lukaemon/bbh', 'tracking_shuffled_objects_seven_objects')
print(ds['test'][0])
# evaluate your model by comparing its answer to 'target'
Alternatives
- MMLU — broad multi-subject knowledge
- GSM8K — grade-school math reasoning
- MATH / AIME — harder math reasoning
- LogiQA — formal logic reasoning
Frequently asked questions
Is BBH still a useful benchmark?
Frontier models now score 85-95% on BBH, so it no longer separates the best systems. But it remains a useful sanity check for smaller models and is reported in most LLM papers.
How do I evaluate CoT vs non-CoT on BBH?
Run the dataset twice: once with a direct-answer prompt and once with a "Let's think step by step" prefix appended to the input. Compare accuracy per sub-task; most papers report both numbers.
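A minimal sketch of this two-pass comparison, using exact match on the model's final output line against the dataset's `target` field. The `query_model` callable is a hypothetical stand-in for your own model call (API client, local pipeline, etc.), not part of BBH:

```python
def bbh_accuracy(examples, query_model, cot=False):
    """Exact-match accuracy over BBH examples.

    examples    : iterable of dicts with 'input' and 'target' keys
                  (the schema of the 'lukaemon/bbh' splits)
    query_model : hypothetical callable, prompt str -> completion str
    cot         : if True, append a chain-of-thought trigger and
                  score only the final line of the completion
    """
    correct = 0
    for ex in examples:
        prompt = ex['input']
        if cot:
            prompt += "\nLet's think step by step."
        completion = query_model(prompt)
        # take the last non-empty line as the final answer
        pred = completion.strip().splitlines()[-1].strip()
        correct += pred == ex['target'].strip()
    return correct / len(examples)

# Usage sketch (requires network access):
# from datasets import load_dataset
# ds = load_dataset('lukaemon/bbh', 'boolean_expressions')['test']
# direct = bbh_accuracy(ds, query_model, cot=False)
# cot    = bbh_accuracy(ds, query_model, cot=True)
```

Scoring only the final line is a common convention for CoT outputs, since the reasoning trace precedes the answer; stricter setups parse an explicit "the answer is X" pattern instead.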
Sources
- BBH paper — accessed 2026-04-20
- BBH GitHub — accessed 2026-04-20