
LLM Benchmarks and Evaluation

Major Benchmark Suites

Knowledge and Reasoning

| Benchmark | Type | Description |
|---|---|---|
| MMLU | Multiple choice | 57 subjects, high school to expert |
| ARC | Multiple choice | Science questions |
| HellaSwag | Completion | Common sense reasoning |
| Winogrande | Coreference | Pronoun resolution |
| TruthfulQA | Open-ended | Truthfulness vs misinformation |
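
Multiple-choice benchmarks such as MMLU and ARC are often scored by asking the model for a log-likelihood of each answer option and counting how often the highest-scoring option is the correct one. The sketch below illustrates that idea only: the function names and the scores are made up for illustration, and real harnesses differ in details such as length normalization.

```python
from typing import Sequence


def pick_answer(option_logprobs: Sequence[float]) -> int:
    """Return the index of the option with the highest log-likelihood."""
    return max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])


def accuracy(all_logprobs: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of questions where the top-scoring option is the gold answer."""
    if len(all_logprobs) != len(gold):
        raise ValueError("need one gold label per question")
    correct = sum(pick_answer(lp) == g for lp, g in zip(all_logprobs, gold))
    return correct / len(gold)


# Hypothetical per-option log-likelihoods for three four-option questions.
scores = [
    [-1.2, -0.3, -2.5, -4.0],  # model prefers option 1
    [-0.9, -1.1, -0.2, -3.3],  # model prefers option 2
    [-0.5, -0.6, -0.7, -0.8],  # model prefers option 0
]
gold = [1, 2, 3]

print(f"accuracy = {accuracy(scores, gold):.3f}")  # two of three predictions match
```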

Coding

| Benchmark | Type | Languages |
|---|---|---|
| HumanEval | Code generation | Python |
| MBPP | Code generation | Python |
| MultiPL-E | Multi-language | 18 languages |
| SWE-bench | Real repos | Python |
| CodeContests | Competition | Multi |

Math

| Benchmark | Type | Level |
|---|---|---|
| GSM8K | Word problems | Grade school |
| MATH | Competition | High school |
| Minerva | STEM | College |
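
Math benchmarks are usually scored by extracting a final answer from the model's free-text output and comparing it with the reference. GSM8K reference solutions end with a line of the form `#### <answer>`. The sketch below assumes a simple "last number in the completion" heuristic for the model side; that heuristic and the helper names are illustrative assumptions, not the method of any particular harness.

```python
import re
from typing import Optional

# Matches integers and decimals, optionally with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Return the text after the final '####' marker of a GSM8K-style solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(completion: str) -> Optional[str]:
    """Heuristic: treat the last number in the completion as the answer."""
    matches = _NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None


def is_correct(completion: str, solution: str) -> bool:
    """Compare predicted and reference answers numerically."""
    pred = predicted_answer(completion)
    if pred is None:
        return False
    try:
        return float(pred) == float(reference_answer(solution))
    except ValueError:
        return False


solution = "She sells 9 eggs at $2 each: 9 * 2 = 18\n#### 18"
print(is_correct("9 eggs times $2 is 18 dollars.", solution))
print(is_correct("I am not sure.", solution))
```

Extraction heuristics like this one are a common source of score differences between evaluation setups, so it is worth recording which one was used alongside any reported number.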

Running Benchmarks

Using lm-evaluation-harness

pip install lm-eval

lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu,hellaswag,arc_challenge \
    --batch_size 8
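
The same run can also be started from Python through the harness's `simple_evaluate` entry point. The sketch below mirrors the command-line arguments above; `build_eval_kwargs` and `run_eval` are hypothetical helper names introduced here, and the exact metric keys in the returned results depend on the installed lm-eval version, so the code does not assume any.

```python
from typing import Any, Dict, Sequence


def build_eval_kwargs(model_name: str, tasks: Sequence[str], batch_size: int = 8) -> Dict[str, Any]:
    """Translate the CLI flags into keyword arguments for lm_eval.simple_evaluate."""
    return {
        "model": "hf",                               # --model hf
        "model_args": f"pretrained={model_name}",    # --model_args pretrained=...
        "tasks": list(tasks),                        # --tasks a,b,c
        "batch_size": batch_size,                    # --batch_size 8
    }


def run_eval(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Run the evaluation and return the per-task results dictionary."""
    import lm_eval  # imported lazily so the helpers above work without the package

    return lm_eval.simple_evaluate(**kwargs)["results"]


kwargs = build_eval_kwargs(
    "meta-llama/Llama-2-7b-hf",
    ["mmlu", "hellaswag", "arc_challenge"],
)
print(kwargs)
```

Calling `run_eval(kwargs)` downloads the model and runs all three tasks, so it needs the same hardware and access permissions as the command-line run.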

Using BigCode Eval

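# For code benchmarks (run from a clone of the bigcode-evaluation-harness repository)
# --allow_code_execution lets the harness run model-generated code; use a sandboxed environment
accelerate launch main.py \
    --model meta-llama/Llama-2-7b-hf \
    --tasks humaneval \
    --n_samples 20 \
    --temperature 0.2 \
    --allow_code_execution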
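
Sampling several completions per problem (20 in the command above) is what makes a pass@k metric computable. A minimal sketch of the widely used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k) for n samples of which c pass the tests, follows. The function names and the per-problem counts are illustrative assumptions and are not taken from the harness itself.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the problem, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of passing samples (out of 20) for four problems.
counts = [0, 5, 20, 1]
print(f"pass@1  = {mean_pass_at_k(counts, n=20, k=1):.3f}")
print(f"pass@10 = {mean_pass_at_k(counts, n=20, k=10):.3f}")
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n.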

Typical Scores

| Model | MMLU (%) | HumanEval (%) | GSM8K (%) |
|---|---|---|---|
| GPT-4 | 86.4 | 67.0 | 92.0 |
| Claude 3 Opus | 86.8 | 84.9 | 95.0 |
| Llama 3 70B | 82.0 | 81.7 | 93.0 |
| Gemini Ultra | 83.7 | 74.4 | 94.4 |

Limitations of Benchmarks

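| Issue | Description |
|---|---|
| Data contamination | Models may have seen test data during training |
| Narrow coverage | Benchmarks don't test all capabilities |
| Gaming | Models can be optimized for benchmark scores rather than general capability |
| Real-world gap | Benchmark scores do not guarantee production performance |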
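
Data contamination can be screened for, at least roughly, by checking how many word n-grams of a test item also appear in a sample of training text. The sketch below is one simple heuristic under stated assumptions: whitespace tokenization, lowercasing, and an n-gram length of 8 are arbitrary illustrative choices, and a low overlap does not prove that an item is clean.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus sample."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical test question and corpus snippets.
question = "Natalia sold clips to 48 of her friends in April and then sold half as many in May"
corpus = [
    "forum post: natalia sold clips to 48 of her friends in april and then sold half as many in may",
    "an unrelated document about semiconductor lithography and process nodes in modern fabs",
]

print(f"overlap = {overlap_fraction(question, corpus):.2f}")
```

Items with a high overlap are candidates for manual review or exclusion before scores are reported.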

Best Practices

