LLM Benchmarks and Evaluation
Major Benchmark Suites
Knowledge and Reasoning
| Benchmark | Type | Description |
|---|---|---|
| MMLU | Multiple choice | 57 subjects, high school to expert |
| ARC | Multiple choice | Grade-school science questions |
| HellaSwag | Completion | Commonsense reasoning via sentence completion |
| Winogrande | Coreference | Pronoun resolution requiring commonsense |
| TruthfulQA | Open-ended | Whether answers avoid common misconceptions |
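Multiple-choice benchmarks such as MMLU and ARC are commonly scored by having the model assign a log-likelihood to each answer option and counting the question as correct when the gold option scores highest. A minimal sketch of that scoring step, assuming the per-option log-likelihoods have already been computed; the function name `multiple_choice_accuracy` and the sample numbers are hypothetical, for illustration only.

```python
from typing import Sequence


def multiple_choice_accuracy(
    option_logprobs: Sequence[Sequence[float]],
    gold_indices: Sequence[int],
) -> float:
    """Fraction of questions where the highest-scoring option is the gold one."""
    if len(option_logprobs) != len(gold_indices):
        raise ValueError("need one gold index per question")
    if not gold_indices:
        raise ValueError("no questions to score")
    correct = 0
    for logprobs, gold in zip(option_logprobs, gold_indices):
        # Predicted answer = option with the largest log-likelihood.
        predicted = max(range(len(logprobs)), key=lambda i: logprobs[i])
        correct += int(predicted == gold)
    return correct / len(gold_indices)


# Made-up log-likelihoods for three four-option questions (not real model output).
scores = [
    [-4.1, -1.2, -3.3, -5.0],  # model prefers option 1
    [-0.7, -2.9, -2.2, -3.1],  # model prefers option 0
    [-2.5, -2.4, -0.9, -3.8],  # model prefers option 2
]
gold = [1, 3, 2]
accuracy = multiple_choice_accuracy(scores, gold)
```

Evaluation harnesses often report a length-normalized variant as well, since longer options tend to receive lower raw log-likelihoods.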
Coding
| Benchmark | Type | Languages |
|---|---|---|
| HumanEval | Code generation | Python |
| MBPP | Code generation | Python |
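| MultiPL-E | Code generation (translated HumanEval and MBPP) | 18 languages |
| SWE-bench | Issue resolution in real repositories | Python |
| CodeContests | Competitive programming | Multiple (e.g., C++, Python, Java) |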
Math
| Benchmark | Type | Level |
|---|---|---|
| GSM8K | Word problems | Grade school |
| MATH | Competition | High school |
| OCWCourses (introduced with the Minerva model) | STEM problems | College |
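Math word-problem benchmarks are usually scored by exact match on the final answer. GSM8K reference solutions end with a line of the form `#### <answer>`, so a scorer only needs to pull that value out and compare it with the number the model settled on. A minimal sketch, assuming the model's answer is the last number in its output (a common heuristic, not the only one); the helper names are hypothetical.

```python
import re
from typing import Optional

# Integers or decimals, optionally negative, with optional thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_reference(solution: str) -> str:
    """Return the value after the final '####' marker of a GSM8K solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def extract_prediction(output: str) -> Optional[str]:
    """Heuristic (an assumption): the last number in the output is the answer."""
    matches = NUMBER.findall(output)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(output: str, solution: str) -> bool:
    """Exact match on the numeric value of prediction and reference."""
    prediction = extract_prediction(output)
    if prediction is None:
        return False
    return float(prediction) == float(extract_reference(solution))


# Illustrative example in the GSM8K style.
solution = "She has 16 - 3 - 4 = 9 eggs left. She earns 9 * 2 = 18 dollars.\n#### 18"
output = "She has 9 eggs left, so she earns 9 * 2 = $18."
result = is_correct(output, solution)
```

The extraction heuristic matters in practice: a correct answer followed by an extra number in the explanation would be scored as wrong, which is one reason reported GSM8K numbers vary between harnesses.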
Running Benchmarks
Using lm-evaluation-harness
pip install lm-eval
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu,hellaswag,arc_challenge \
    --batch_size 8
Using BigCode Eval
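# For code benchmarks (run from a clone of bigcode-evaluation-harness)
# --allow_code_execution lets the harness run generated code; use a sandbox
accelerate launch main.py \
    --model meta-llama/Llama-2-7b-hf \
    --tasks humaneval \
    --n_samples 20 \
    --temperature 0.2 \
    --allow_code_execution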
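Sampling several completions per problem (`--n_samples 20` above) is what makes pass@k computable: for each problem, n completions are generated, c of them pass the unit tests, and pass@k estimates the chance that at least one of k randomly chosen completions is correct. A minimal sketch of the unbiased estimator from the HumanEval paper, written with `math.comb` for readability rather than the numerically stable product form that reference implementations tend to use; the function name `pass_at_k` is a choice made here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: completions sampled for the problem
    c: completions that passed the tests
    k: budget of attempts being estimated
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 20 samples for one problem, 5 of them passing.
single_attempt = pass_at_k(n=20, c=5, k=1)
ten_attempts = pass_at_k(n=20, c=5, k=10)
```

The benchmark score is the mean of this per-problem value over all problems. With k=1 the estimate reduces to c/n, the fraction of passing samples.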
Typical Scores
| Model | MMLU (%) | HumanEval (pass@1, %) | GSM8K (%) |
|---|---|---|---|
| GPT-4 | 86.4 | 67.0 | 92.0 |
| Claude 3 Opus | 86.8 | 84.9 | 95.0 |
| Llama 3 70B | 82.0 | 81.7 | 93.0 |
| Gemini Ultra | 83.7 | 74.4 | 94.4 |
Limitations of Benchmarks
| Issue | Description |
|---|---|
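| Data contamination | Models may have seen test data during training |
| Narrow coverage | Benchmarks don't test all capabilities |
| Gaming | Models can be tuned for benchmark scores rather than general ability |
| Real-world gap | Benchmark scores may not predict production performance |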
Best Practices
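- Use multiple benchmarks rather than relying on a single score
- Consider domain-specific evals that match your use case
- Track scores over time across model versions
- Supplement with human evaluation
- Watch for contamination of test data in training sets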
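One simple way to watch for contamination is to look for long word n-grams shared between benchmark items and text the model may have been trained on. A rough sketch, assuming whitespace tokenization and a sample of corpus documents held in memory; the function names, the default window size, and the example strings are all assumptions made for illustration, and a real check would normalize punctuation and run against far larger data.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in the text (empty if the text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    test_items: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,
) -> float:
    """Fraction of test items sharing at least one n-gram with the corpus."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    items = list(test_items)
    if not items:
        return 0.0
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_ngrams)
    return flagged / len(items)


# Toy data; a short window (n=4) is used only because the strings are short.
corpus = ["forum post: what is the capital of france and why is it paris"]
tests = [
    "What is the capital of France?",
    "Name the largest planet in the solar system.",
]
rate = contamination_rate(tests, corpus, n=4)
```

An overlap flag is a signal to inspect the item, not proof that the model memorized it; paraphrased or translated test data would pass this check unnoticed.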