Capability · Framework — evals

OpenCompass

OpenCompass is one of the most comprehensive open-source evaluation suites: 100+ datasets (MMLU, CMMLU, C-Eval, HumanEval, GSM8K, and further math, code, and safety batteries) with a flexible Python-based configuration system, distributed inference, and the public CompassRank leaderboard. It is particularly strong on Chinese-language benchmarks.

Framework facts

Category
evals
Language
Python
License
Apache-2.0
Repository
https://github.com/open-compass/opencompass

Install

git clone https://github.com/open-compass/opencompass.git
cd opencompass && pip install -e .

Quickstart

# --models/--datasets take config names that map to files under configs/
python run.py --models hf_internlm2_7b --datasets ceval_gen -w ./outputs
# results are summarized in ./outputs/*/summary/*.csv
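Beyond CLI flags, evaluations are usually described in Python config files that compose stock model and dataset configs. A minimal sketch, assuming the `ceval_gen` and `hf_internlm2_7b` config modules shipped in the repo (the filename is hypothetical and exact module paths vary across OpenCompass versions):

```python
# configs/eval_internlm2_ceval.py (hypothetical filename)
# OpenCompass configs use mmengine's lazy base-config imports; the
# module paths below are assumptions based on the repo's configs/ layout.
from mmengine.config import read_base

with read_base():
    # pulls in `ceval_datasets` and a `models` list from stock configs
    from .datasets.ceval.ceval_gen import ceval_datasets
    from .models.hf_internlm.hf_internlm2_7b import models

datasets = ceval_datasets  # run.py evaluates every (model, dataset) pair
```

A file like this is passed directly to the runner, e.g. `python run.py configs/eval_internlm2_ceval.py -w ./outputs`.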

Alternatives

  • HELM — Stanford's holistic evaluation suite
  • lm-eval-harness — EleutherAI's widely used standard harness
  • PromptBench — focused on robustness to prompt perturbations
  • BIG-Bench / BBH — specific task batteries

Frequently asked questions

Is OpenCompass good for non-Chinese models?

Yes — while its Chinese-language datasets are a strength, it covers the standard English benchmarks (MMLU, GSM8K, HumanEval, BBH, ARC, HellaSwag) too.

How does CompassRank produce its scores?

CompassRank aggregates OpenCompass runs over a fixed bundle of datasets, using identical prompts and decoding settings for every model. You submit model weights or an API endpoint, and the CompassRank team reruns the full pipeline itself so that scores stay comparable and reproducible.
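The aggregation step can be sketched as an unweighted mean over per-dataset accuracies (a simplification: CompassRank's actual weighting and category grouping are not specified here):

```python
def leaderboard_score(per_dataset: dict[str, float]) -> float:
    """Collapse per-dataset accuracies (0-100) into one summary number.

    Unweighted mean only; any weighting or category grouping the real
    leaderboard applies is left out of this sketch.
    """
    if not per_dataset:
        raise ValueError("no dataset scores to aggregate")
    return sum(per_dataset.values()) / len(per_dataset)

# illustrative numbers, not real leaderboard entries
scores = {"mmlu": 68.2, "gsm8k": 74.5, "ceval": 71.0}
print(round(leaderboard_score(scores), 2))
```

Running all models over the same frozen dataset bundle is what makes such a single number comparable across submissions.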

Sources

  1. OpenCompass — docs — accessed 2026-04-20
  2. OpenCompass GitHub — accessed 2026-04-20