Benchmarking LLM performance is the systematic measurement of inference speed, throughput, and quality — using standardized tests to measure time-to-first-token (TTFT), tokens-per-second, concurrent capacity, and response quality, enabling informed decisions about model selection, infrastructure sizing, and optimization priorities.
What Is LLM Benchmarking?
- Definition: Measuring LLM system performance under controlled conditions.
- Metrics: Latency, throughput, quality, cost.
- Purpose: Compare options, identify bottlenecks, validate optimizations.
- Types: Synthetic load tests and real-world workload simulations.
Why Benchmarking Matters
- Model Selection: Choose between GPT-4o, Claude, Llama based on data.
- Capacity Planning: Know how many GPUs needed for target load.
- Optimization: Measure impact of changes.
- SLA Validation: Ensure system meets latency requirements.
- Cost Analysis: Understand cost-per-query at different scales.
Key Performance Metrics
Latency Metrics:
```
TTFT (Time to First Token):
- Measures prefill latency
- Target: <500ms for interactive
- Critical for perceived responsiveness
TPOT (Time Per Output Token):
- Decode latency per token
- Target: <50ms for smooth streaming
- Lower = faster generation
E2E (End-to-End):
- Total response time
- E2E = TTFT + (TPOT × output_tokens)
```
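The E2E formula above is enough for back-of-the-envelope latency estimates. A minimal sketch (the TTFT, TPOT, and output-length numbers are illustrative assumptions, not recommendations):
```python
def e2e_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """End-to-end latency: prefill time (TTFT) plus decode time for every output token."""
    return ttft_s + tpot_s * output_tokens

# Example: 400 ms TTFT, 40 ms per token, 300-token answer -> ~12.4 s total
print(f"{e2e_latency(0.4, 0.04, 300):.1f}s")
```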
Throughput Metrics:
```
Tokens/Second:
- Total generation throughput
- Maximized for batch workloads
Requests/Second:
- Completed requests per second
- Depends on response length
Concurrent Users:
- Simultaneous active requests
- Limited by memory (KV cache)
```
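Throughput numbers fall out of the same per-request measurements once they are aggregated over a fixed test window. A hedged sketch of that aggregation (the `tokens` field name matches the benchmark script later in this section; the example numbers are assumptions):
```python
def throughput_summary(results: list[dict], window_s: float) -> dict:
    """Aggregate per-request results (each with a 'tokens' count) over a test window."""
    total_tokens = sum(r["tokens"] for r in results)
    return {
        "tokens_per_second": total_tokens / window_s,
        "requests_per_second": len(results) / window_s,
    }

# Example: 120 requests totalling 36,000 tokens completed in a 60 s window
print(throughput_summary([{"tokens": 300}] * 120, 60.0))
# -> {'tokens_per_second': 600.0, 'requests_per_second': 2.0}
```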
Percentile Latencies:
```
P50: Median latency (typical experience)
P95: 95th percentile (95% of requests finish at least this fast)
P99: 99th percentile (tail latency felt by the slowest 1% of requests)
Max: Absolute worst case
Target: P99 < 2× P50 for consistent experience
```
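A percentile report can be computed from any list of latency samples. A minimal sketch using only the standard library (it needs at least a handful of samples; the choice of quantile method is a judgment call):
```python
import statistics

def latency_percentiles(samples_s: list[float]) -> dict:
    """P50/P95/P99/max from raw latency samples in seconds."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    cuts = statistics.quantiles(samples_s, n=100)
    return {
        "p50": statistics.median(samples_s),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples_s),
    }
```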
Benchmarking Tools
```
Tool        | Type           | Features
------------|----------------|--------------------------
LLMPerf     | LLM-specific   | TTFT, TPOT, concurrency
k6          | Load testing   | Flexible scripting
Locust      | Load testing   | Python-based, distributed
hey         | HTTP benchmark | Simple, quick tests
wrk         | HTTP benchmark | High performance
Custom      | Any            | Precise control
```
Simple Benchmark Script:
```python
import time
import statistics

from openai import OpenAI

client = OpenAI()

def benchmark_request(prompt):
    """Send one streaming request and record TTFT, total time, and TPOT."""
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    first_token_time = None
    token_count = 0
    for chunk in response:
        if chunk.choices and chunk.choices[0].delta.content:
            # TTFT: time until the first content token arrives
            if first_token_time is None:
                first_token_time = time.time()
            token_count += 1
    end = time.time()

    return {
        "ttft": first_token_time - start,
        "total_time": end - start,
        "tokens": token_count,
        # Decode time averaged over the tokens after the first one
        "tpot": (end - first_token_time) / max(token_count - 1, 1),
    }

# Run multiple iterations
results = [benchmark_request("Explain quantum computing") for _ in range(10)]

# Calculate statistics
ttfts = [r["ttft"] for r in results]
print(f"TTFT P50: {statistics.median(ttfts):.3f}s")
print(f"TTFT P95: {sorted(ttfts)[min(int(len(ttfts) * 0.95), len(ttfts) - 1)]:.3f}s")
```
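Note that `stream=True` is what makes TTFT observable from the client side: without streaming, the API returns only after the full completion, so prefill and decode time cannot be separated.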
Load Testing with Locust:
```python
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def generate_response(self):
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "gpt-4o",
                "messages": [{"role": "user", "content": "Hello!"}],
            },
            headers={"Authorization": "Bearer ..."},
        )
```
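The test can then be driven from Locust's web UI or headlessly, for example `locust -f llm_load.py --headless --users 50 --spawn-rate 5 --run-time 5m --host https://your-gateway` (the file name, user count, and host are placeholders), which reports request rates and latency percentiles for the run.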
Benchmark Methodology
```
┌─────────────────────────────────────────────────────┐
│ 1. Define Test Scenarios │
│ - Realistic prompts (varied lengths) │
│ - Expected output lengths │
│ - Concurrency patterns │
├─────────────────────────────────────────────────────┤
│ 2. Establish Baseline │
│ - Warm up system │
│ - Run baseline at low load │
│ - Record all metrics │
├─────────────────────────────────────────────────────┤
│ 3. Stress Test │
│ - Gradually increase load │
│ - Find breaking point │
│ - Identify bottleneck │
├─────────────────────────────────────────────────────┤
│ 4. Analyze Results │
│ - Plot latency vs. load │
│ - Calculate cost per request │
│ - Compare to requirements │
└─────────────────────────────────────────────────────┘
```
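Step 3 can be approximated without a dedicated tool by ramping concurrency and recording tail latency at each step. A rough sketch reusing the `benchmark_request` function from above (the concurrency levels and request counts are arbitrary assumptions):
```python
from concurrent.futures import ThreadPoolExecutor

def stress_step(concurrency: int, requests: int = 20) -> float:
    """Fire `requests` prompts with `concurrency` parallel workers; return P95 total time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(benchmark_request, ["Explain quantum computing"] * requests))
    times = sorted(r["total_time"] for r in results)
    return times[min(int(len(times) * 0.95), len(times) - 1)]

# Ramp the load and watch where tail latency starts to climb
for level in (1, 2, 4, 8, 16):
    print(f"concurrency={level:>2}  P95={stress_step(level):.2f}s")
```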
Best Practices
- Warm Up: Run requests before measuring to warm caches.
- Realistic Load: Use production-like prompt distributions.
- Sufficient Duration: Run long enough for stable results.
- Monitor System: Watch GPU utilization, memory during test.
- Multiple Runs: Account for variance in results.
- Document Everything: Record versions, configurations, conditions.
Benchmarking LLM performance is essential for production planning — without rigorous measurement, teams make infrastructure decisions based on hope rather than data, leading to either overspending or underprovisioning that impacts user experience.