Benchmarking LLM Performance

Benchmarking LLM performance is the systematic measurement of inference speed, throughput, and quality. Standardized tests measure time-to-first-token (TTFT), tokens per second, concurrent capacity, and response quality, so that decisions about model selection, infrastructure sizing, and optimization priorities rest on data.

What Is LLM Benchmarking?

Why Benchmarking Matters

Key Performance Metrics

Latency Metrics:

TTFT (Time to First Token):
- Measures prefill latency
- Target: <500ms for interactive use
- Critical for perceived responsiveness

TPOT (Time Per Output Token):
- Decode latency per token
- Target: <50ms for smooth streaming
- Lower = faster generation

E2E (End-to-End):
- Total response time
- E2E = TTFT + (TPOT × output_tokens)
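
As a worked example of the E2E formula above, the following minimal sketch plugs in the interactive targets from this section (500ms TTFT, 50ms TPOT). The function name and the 200-token response length are assumptions chosen for illustration only.

```python
def estimate_e2e_seconds(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Estimate end-to-end latency as TTFT + (TPOT x output_tokens)."""
    return ttft_s + tpot_s * output_tokens

# Hypothetical request: 500 ms TTFT, 50 ms per token, 200 output tokens
e2e = estimate_e2e_seconds(ttft_s=0.5, tpot_s=0.05, output_tokens=200)
print(f"Estimated E2E: {e2e:.1f}s")  # prints: Estimated E2E: 10.5s
```

Even when both targets are met, decode time dominates long responses: here 10 of the 10.5 seconds are spent generating tokens after the first one.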

Throughput Metrics:

Tokens/Second:
- Total generation throughput
- Maximized for batch workloads

Requests/Second:
- Completed requests per second
- Depends on response length

Concurrent Users:
- Simultaneous active requests
- Limited by memory (KV cache)
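
The three throughput metrics above can all be derived from per-request timestamps. The sketch below is one possible way to do that; the `RequestRecord` type, its field names, and the sample values are hypothetical and not taken from any specific tool.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    start_s: float       # when the request was sent
    end_s: float         # when the last token arrived
    output_tokens: int   # tokens generated for this request

def throughput_summary(records: list[RequestRecord]) -> dict:
    """Aggregate throughput over the wall-clock window of a run (non-empty input)."""
    window_start = min(r.start_s for r in records)
    window_end = max(r.end_s for r in records)
    duration = window_end - window_start

    # Peak concurrency: sweep start (+1) and end (-1) events in time order.
    events = [(r.start_s, 1) for r in records] + [(r.end_s, -1) for r in records]
    events.sort()  # at equal timestamps, ends (-1) sort before starts (+1)
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)

    return {
        "tokens_per_second": sum(r.output_tokens for r in records) / duration,
        "requests_per_second": len(records) / duration,
        "peak_concurrency": peak,
    }

# Illustrative records, not real measurements
records = [
    RequestRecord(0.0, 4.0, 200),
    RequestRecord(1.0, 5.0, 200),
    RequestRecord(2.0, 10.0, 400),
]
summary = throughput_summary(records)
print(summary)
```

Dividing by the wall-clock window rather than summing per-request durations matters here: overlapping requests share the same seconds, so per-request sums would understate aggregate throughput.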

Percentile Latencies:

P50: Median latency (typical experience)
P95: 95th percentile (95% of requests are at least this fast)
P99: 99th percentile (worst common case)
Max: Absolute worst case

Target: P99 < 2× P50 for consistent experience
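
A minimal sketch of how these percentiles and the P99 < 2× P50 check might be computed with Python's standard library. The helper names and the synthetic latency samples are assumptions for illustration.

```python
import statistics

def latency_percentiles(latencies_s):
    """Return P50/P95/P99/max for a list of latency samples in seconds."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    # method="inclusive" interpolates within the observed range.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(latencies_s),
    }

def is_consistent(stats, max_ratio=2.0):
    """Check the P99 < 2x P50 target described above."""
    return stats["p99"] < max_ratio * stats["p50"]

# Synthetic samples: 0.80s to 1.78s in 10 ms steps, plus one 5s outlier
samples = [0.8 + 0.01 * i for i in range(99)] + [5.0]
stats = latency_percentiles(samples)
print({k: round(v, 3) for k, v in stats.items()})
print("consistent:", is_consistent(stats))
```

With only 100 samples, P99 rests on one or two data points, so a tail-latency target like this is best evaluated over much larger runs.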

Benchmarking Tools

Tool        | Type           | Features
------------|----------------|-------------------------
LLMPerf     | LLM-specific   | TTFT, TPOT, concurrency
k6          | Load testing   | Flexible scripting
Locust      | Load testing   | Python-based, distributed
hey         | HTTP benchmark | Simple, quick tests
wrk         | HTTP benchmark | High performance
Custom      | Any            | Precise control

Simple Benchmark Script:

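import time
import statistics
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def benchmark_request(prompt):
    """Stream one completion and record TTFT, total time, and TPOT."""
    # perf_counter() is monotonic, so intervals are not skewed by clock changes
    start = time.perf_counter()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    first_token_time = None
    token_count = 0

    for chunk in response:
        # Skip chunks without text (e.g. the initial role-only chunk)
        if not chunk.choices or not chunk.choices[0].delta.content:
            continue
        if first_token_time is None:
            first_token_time = time.perf_counter()
        # Each content chunk is counted as one token (an approximation)
        token_count += 1

    end = time.perf_counter()

    if first_token_time is None:
        raise RuntimeError("No content received from the model")

    return {
        "ttft": first_token_time - start,
        "total_time": end - start,
        "tokens": token_count,
        "tpot": (end - first_token_time) / token_count
    }

# Run multiple iterations
results = [benchmark_request("Explain quantum computing") for _ in range(10)]

# Calculate statistics
ttfts = [r["ttft"] for r in results]
print(f"TTFT P50: {statistics.median(ttfts):.3f}s")
# quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
# With only 10 samples this is a coarse estimate.
print(f"TTFT P95: {statistics.quantiles(ttfts, n=20, method='inclusive')[-1]:.3f}s")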
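
The script above issues requests one at a time, so it says nothing about concurrent capacity. One way to extend it, sketched below under assumptions, is to run the request function from a thread pool. `run_concurrent` and `fake_request` are hypothetical names; `fake_request` stands in for `benchmark_request` so the sketch runs without an API key.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(request_fn, prompts, concurrency):
    """Run request_fn over prompts using a fixed number of worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(request_fn, prompts))
    wall_time = time.perf_counter() - start
    return results, wall_time

def fake_request(prompt):
    """Stand-in for benchmark_request: sleeps instead of calling an API."""
    time.sleep(0.05)
    return {"ttft": 0.01, "total_time": 0.05, "tokens": 10}

results, wall_time = run_concurrent(fake_request, ["hello"] * 8, concurrency=4)
print(f"{len(results)} requests in {wall_time:.2f}s "
      f"({len(results) / wall_time:.1f} req/s)")
```

Threads are adequate here because each worker spends its time waiting on network I/O. Swapping `fake_request` for `benchmark_request` and repeating the run at several concurrency levels gives a first picture of how latency changes under load.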

Load Testing with Locust:

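from locust import HttpUser, task, between

class LLMUser(HttpUser):
    # The target base URL is supplied via --host or a host attribute
    # Each simulated user pauses 1-3 seconds between requests
    wait_time = between(1, 3)

    @task
    def generate_response(self):
        # Non-streaming request: Locust records total response time, not TTFT
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "gpt-4o",
                "messages": [{"role": "user", "content": "Hello!"}]
            },
            headers={"Authorization": "Bearer ..."}
        )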

Benchmark Methodology

┌─────────────────────────────────────────────────────┐
│ 1. Define Test Scenarios                            │
│    - Realistic prompts (varied lengths)             │
│    - Expected output lengths                        │
│    - Concurrency patterns                           │
├─────────────────────────────────────────────────────┤
│ 2. Establish Baseline                               │
│    - Warm up system                                 │
│    - Run baseline at low load                       │
│    - Record all metrics                             │
├─────────────────────────────────────────────────────┤
│ 3. Stress Test                                      │
│    - Gradually increase load                        │
│    - Find breaking point                            │
│    - Identify bottleneck                            │
├─────────────────────────────────────────────────────┤
│ 4. Analyze Results                                  │
│    - Plot latency vs. load                          │
│    - Calculate cost per request                     │
│    - Compare to requirements                        │
└─────────────────────────────────────────────────────┘
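
Steps 3 and 4 of the methodology can be sketched as a small analysis over stepped-load results. The function names, the 2-second P95 threshold, the hourly cost, and the measurement values below are all illustrative assumptions, not real benchmark data.

```python
def find_breaking_point(measurements, p95_slo_s):
    """Return the highest load level whose P95 latency still meets the SLO.

    measurements: list of (concurrency, p95_latency_s), sorted by concurrency.
    Returns None if even the lowest load level misses the SLO.
    """
    last_ok = None
    for concurrency, p95 in measurements:
        if p95 > p95_slo_s:
            break
        last_ok = concurrency
    return last_ok

def cost_per_request(hourly_cost_usd, requests_per_second):
    """Infrastructure cost per request at a sustained request rate."""
    return hourly_cost_usd / (requests_per_second * 3600)

# Made-up (concurrency, P95 seconds) pairs from a hypothetical stepped load test
measurements = [(1, 0.9), (2, 0.95), (4, 1.1), (8, 1.6), (16, 3.4), (32, 9.8)]

max_load = find_breaking_point(measurements, p95_slo_s=2.0)
print(f"Highest load within SLO: {max_load} concurrent requests")

cost = cost_per_request(hourly_cost_usd=4.0, requests_per_second=2.0)
print(f"Cost per request: ${cost:.6f}")
```

Comparing `max_load` and the per-request cost against the stated requirements completes step 4. Identifying the bottleneck itself still requires system-level metrics such as GPU memory and utilization.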

Best Practices

Benchmarking LLM performance is essential for production planning. Without rigorous measurement, teams make infrastructure decisions based on hope rather than data, and end up either overspending or underprovisioning in ways that degrade user experience.
