Benchmarking LLM performance is the systematic measurement of inference speed, throughput, and quality — using standardized tests to measure time-to-first-token (TTFT), tokens-per-second, concurrent capacity, and response quality, enabling informed decisions about model selection, infrastructure sizing, and optimization priorities.
What Is LLM Benchmarking?
- Definition: Measuring LLM system performance under controlled conditions.
- Metrics: Latency, throughput, quality, cost.
- Purpose: Compare options, identify bottlenecks, validate optimizations.
- Types: Synthetic load tests and real-world workload simulations.
Why Benchmarking Matters
- Model Selection: Choose between GPT-4o, Claude, Llama based on data.
- Capacity Planning: Know how many GPUs needed for target load.
- Optimization: Measure impact of changes.
- SLA Validation: Ensure system meets latency requirements.
- Cost Analysis: Understand cost-per-query at different scales.
Key Performance Metrics
Latency Metrics:
```
TTFT (Time to First Token):
- Measures prefill latency
- Target: <500ms for interactive
- Critical for perceived responsiveness
TPOT (Time Per Output Token):
- Decode latency per token
- Target: <50ms for smooth streaming
- Lower = faster generation
E2E (End-to-End):
- Total response time
- E2E = TTFT + (TPOT × output_tokens)
```
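The E2E formula above is enough for back-of-the-envelope latency estimates. A minimal sketch (the TTFT, TPOT, and output-length numbers are illustrative assumptions, not recommendations):
```python
def e2e_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """End-to-end latency: prefill time (TTFT) plus decode time for every output token."""
    return ttft_s + tpot_s * output_tokens

# Example: 400 ms TTFT, 40 ms per token, 300-token answer -> ~12.4 s total
print(f"{e2e_latency(0.4, 0.04, 300):.1f}s")
```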
Throughput Metrics:
```
Tokens/Second:
- Total generation throughput
- Maximized for batch workloads
Requests/Second:
- Completed requests per second
- Depends on response length
Concurrent Users:
- Simultaneous active requests
- Limited by memory (KV cache)
```
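Throughput numbers fall out of the same per-request measurements once they are aggregated over a fixed test window. A hedged sketch of that aggregation (the `tokens` field name matches the benchmark script later in this section; the example numbers are assumptions):
```python
def throughput_summary(results: list[dict], window_s: float) -> dict:
    """Aggregate per-request results (each with a 'tokens' count) over a test window."""
    total_tokens = sum(r["tokens"] for r in results)
    return {
        "tokens_per_second": total_tokens / window_s,
        "requests_per_second": len(results) / window_s,
    }

# Example: 120 requests totalling 36,000 tokens completed in a 60 s window
print(throughput_summary([{"tokens": 300}] * 120, 60.0))
# -> {'tokens_per_second': 600.0, 'requests_per_second': 2.0}
```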
Percentile Latencies:
```
P50: Median latency (typical experience)
P95: 95th percentile (95% of requests finish at least this fast)
P99: 99th percentile (tail latency felt by the slowest 1% of requests)
Max: Absolute worst case
Target: P99 < 2× P50 for consistent experience
```
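A percentile report can be computed from any list of latency samples. A minimal sketch using only the standard library (it needs at least a handful of samples; the choice of quantile method is a judgment call):
```python
import statistics

def latency_percentiles(samples_s: list[float]) -> dict:
    """P50/P95/P99/max from raw latency samples in seconds."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    cuts = statistics.quantiles(samples_s, n=100)
    return {
        "p50": statistics.median(samples_s),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples_s),
    }
```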
Benchmarking Tools
```
Tool        | Type           | Features
------------|----------------|--------------------------
LLMPerf     | LLM-specific   | TTFT, TPOT, concurrency
k6          | Load testing   | Flexible scripting
Locust      | Load testing   | Python-based, distributed
hey         | HTTP benchmark | Simple, quick tests
wrk         | HTTP benchmark | High performance
Custom      | Any            | Precise control
```
Simple Benchmark Script:
```python
import time
import statistics

from openai import OpenAI

client = OpenAI()

def benchmark_request(prompt):
    """Send one streaming request and record TTFT, total time, and TPOT."""
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    first_token_time = None
    token_count = 0
    for chunk in response:
        if chunk.choices and chunk.choices[0].delta.content:
            # TTFT: time until the first content token arrives
            if first_token_time is None:
                first_token_time = time.time()
            token_count += 1
    end = time.time()

    return {
        "ttft": first_token_time - start,
        "total_time": end - start,
        "tokens": token_count,
        # Decode time averaged over the tokens after the first one
        "tpot": (end - first_token_time) / max(token_count - 1, 1),
    }

# Run multiple iterations
results = [benchmark_request("Explain quantum computing") for _ in range(10)]

# Calculate statistics
ttfts = [r["ttft"] for r in results]
print(f"TTFT P50: {statistics.median(ttfts):.3f}s")
print(f"TTFT P95: {sorted(ttfts)[min(int(len(ttfts) * 0.95), len(ttfts) - 1)]:.3f}s")
```
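Note that `stream=True` is what makes TTFT observable from the client side: without streaming, the API returns only after the full completion, so prefill and decode time cannot be separated.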
Load Testing with Locust:
```python
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def generate_response(self):
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "gpt-4o",
                "messages": [{"role": "user", "content": "Hello!"}],
            },
            headers={"Authorization": "Bearer ..."},
        )
```
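The test can then be driven from Locust's web UI or headlessly, for example `locust -f llm_load.py --headless --users 50 --spawn-rate 5 --run-time 5m --host https://your-gateway` (the file name, user count, and host are placeholders), which reports request rates and latency percentiles for the run.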
Benchmark Methodology
```
┌─────────────────────────────────────────────────────┐
│ 1. Define Test Scenarios │
│ - Realistic prompts (varied lengths) │
│ - Expected output lengths │
│ - Concurrency patterns │
├─────────────────────────────────────────────────────┤
│ 2. Establish Baseline │
│ - Warm up system │
│ - Run baseline at low load │
│ - Record all metrics │
├─────────────────────────────────────────────────────┤
│ 3. Stress Test │
│ - Gradually increase load │
│ - Find breaking point │
│ - Identify bottleneck │
├─────────────────────────────────────────────────────┤
│ 4. Analyze Results │
│ - Plot latency vs. load │
│ - Calculate cost per request │
│ - Compare to requirements │
└─────────────────────────────────────────────────────┘
```
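Step 3 can be approximated without a dedicated tool by ramping concurrency and recording tail latency at each step. A rough sketch reusing the `benchmark_request` function from above (the concurrency levels and request counts are arbitrary assumptions):
```python
from concurrent.futures import ThreadPoolExecutor

def stress_step(concurrency: int, requests: int = 20) -> float:
    """Fire `requests` prompts with `concurrency` parallel workers; return P95 total time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(benchmark_request, ["Explain quantum computing"] * requests))
    times = sorted(r["total_time"] for r in results)
    return times[min(int(len(times) * 0.95), len(times) - 1)]

# Ramp the load and watch where tail latency starts to climb
for level in (1, 2, 4, 8, 16):
    print(f"concurrency={level:>2}  P95={stress_step(level):.2f}s")
```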
Best Practices
- Warm Up: Run requests before measuring to warm caches.
- Realistic Load: Use production-like prompt distributions.
- Sufficient Duration: Run long enough for stable results.
- Monitor System: Watch GPU utilization, memory during test.
- Multiple Runs: Account for variance in results.
- Document Everything: Record versions, configurations, conditions.
Benchmarking LLM performance is essential for production planning — without rigorous measurement, teams make infrastructure decisions based on hope rather than data, leading to either overspending or underprovisioning that impacts user experience.