OpenCompass
OpenCompass is one of the most comprehensive open-source eval suites: 100+ datasets (MMLU, CMMLU, C-Eval, HumanEval, GSM8K, math, code, safety) with a flexible configuration system, distributed inference, and a public CompassRank leaderboard. It's particularly strong on Chinese-language benchmarks.
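The configuration system composes runs from small Python modules: you import pre-built model and dataset configs and combine them. A minimal sketch of what such a config might look like; the import paths below are illustrative assumptions, not verbatim repo paths (real configs live under `configs/` in the repository):

```python
# Sketch of an OpenCompass-style run config. The module paths here are
# assumptions for illustration; check the configs/ tree for actual names.
from mmengine.config import read_base  # OpenCompass configs build on mmengine

with read_base():
    # Pull in pre-defined dataset and model configs (paths assumed):
    from .datasets.ceval.ceval_gen import ceval_datasets
    from .models.hf_internlm.hf_internlm2_7b import models

datasets = ceval_datasets  # run.py evaluates every (model, dataset) pair
```

Because configs are ordinary Python, overriding a prompt template or decoding parameter is just reassigning a field before the run starts.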
Framework facts
- Category
- evals
- Language
- Python
- License
- Apache-2.0
- Repository
- https://github.com/open-compass/opencompass
Install
git clone https://github.com/open-compass/opencompass.git
cd opencompass && pip install -e .
Quickstart
# configs/eval_demo.py already exists; override models/datasets:
python run.py --models hf_internlm2_7b --datasets ceval_gen -w ./outputs
# see results at ./outputs/*/summary/*.csv
Alternatives
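After a run, per-dataset scores land in the summary CSV. A small sketch for pulling them into Python; the column layout assumed here (`dataset,version,metric,mode,<model>`, one column per model) is a guess at a typical summary file, not a guaranteed schema:

```python
# Parse an OpenCompass-style summary CSV into {model: {dataset: score}}.
# Column names are assumptions; adapt the `meta` set to your actual file.
import csv
from io import StringIO

def read_summary(csv_text: str) -> dict[str, dict[str, float]]:
    """Return {model_name: {dataset: score}} from a summary CSV."""
    reader = csv.DictReader(StringIO(csv_text))
    meta = {"dataset", "version", "metric", "mode"}  # non-score columns
    scores: dict[str, dict[str, float]] = {}
    for row in reader:
        for col, val in row.items():
            if col in meta or val in ("", "-"):  # skip metadata / missing cells
                continue
            scores.setdefault(col, {})[row["dataset"]] = float(val)
    return scores

sample = """dataset,version,metric,mode,hf_internlm2_7b
ceval,abc123,accuracy,gen,63.25
gsm8k,def456,accuracy,gen,70.10
"""
print(read_summary(sample))
```

If your summary file carries extra metadata columns, extend `meta` rather than hard-coding model names.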
- HELM — Stanford's holistic suite
- lm-eval-harness — EleutherAI standard
- PromptBench — robustness-focused
- BIG-Bench / BBH — specific task batteries
Frequently asked questions
Is OpenCompass good for non-Chinese models?
Yes — while its Chinese-language datasets are a strength, it covers the standard English benchmarks (MMLU, GSM8K, HumanEval, BBH, ARC, HellaSwag) too.
How does CompassRank choose its scores?
CompassRank aggregates OpenCompass runs over a fixed dataset bundle, using identical prompts and decoding settings for every model. You submit model weights or an API endpoint, and the team reruns the full pipeline itself, which keeps scores comparable and transparent.
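One way to picture the aggregation step is an unweighted macro-average over the fixed bundle. This is a hypothetical illustration of the idea, not CompassRank's actual weighting scheme:

```python
# Hypothetical leaderboard aggregation: unweighted macro-average over a
# fixed dataset bundle. CompassRank's real weighting may differ; this is
# only the simplest scheme consistent with "fixed bundle, same pipeline".

def macro_average(scores: dict[str, float], bundle: list[str]) -> float:
    """Average a model's scores over the bundle; missing datasets fail loudly."""
    missing = [d for d in bundle if d not in scores]
    if missing:
        raise ValueError(f"model was not evaluated on: {missing}")
    return sum(scores[d] for d in bundle) / len(bundle)

bundle = ["mmlu", "ceval", "gsm8k", "humaneval"]
model_scores = {"mmlu": 68.0, "ceval": 63.2, "gsm8k": 70.1, "humaneval": 55.5}
print(round(macro_average(model_scores, bundle), 2))  # → 64.2
```

Failing loudly on missing datasets matters for a leaderboard: silently averaging over fewer benchmarks would make incomplete runs look artificially strong or weak.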
Sources
- OpenCompass — docs — accessed 2026-04-20
- OpenCompass GitHub — accessed 2026-04-20