Batching and throughput optimization

Batching and throughput optimization is the practice of combining multiple inference requests into a single GPU operation. Processing prompts together rather than one at a time raises GPU utilization and tokens-per-second throughput, which is essential for cost-effective LLM serving at scale.

What Is Batching?

Why Batching Matters

Batching Strategies

Static Batching:

Dynamic Batching:

Continuous Batching (State-of-the-art):
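
A toy simulation can make the difference from static batching concrete. The sketch below is not taken from any serving framework: it assumes every decode step costs the same regardless of batch occupancy, ignores prefill, and uses hypothetical output lengths. The function names are illustrative only.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token per active request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

lengths = [2, 8, 3, 8, 2, 2, 3, 4]  # hypothetical output lengths in tokens
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With these example lengths the static schedule needs 12 decode steps and the continuous schedule needs 9, because short requests no longer hold their slots idle while the longest request in the batch finishes.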

In-Flight Batching:

Batch Size Trade-offs

Larger Batch Size:
┌────────────────────────────────────────────┐
│ ✅ Higher throughput (tokens/sec)          │
│ ✅ Better GPU utilization                  │
│ ✅ Lower cost per token                    │
│ ❌ Higher per-request latency              │
│ ❌ More memory for KV cache                │
│ ❌ Longer queue wait times                 │
└────────────────────────────────────────────┘

Smaller Batch Size:
┌────────────────────────────────────────────┐
│ ✅ Lower latency per request               │
│ ✅ Faster TTFT                             │
│ ❌ Underutilized GPU                       │
│ ❌ Higher cost per token                   │
└────────────────────────────────────────────┘

Memory Constraints

KV Cache Scaling:

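KV Cache Memory (bytes) = 2 × layers × hidden_size × seq_len × batch_size × bytes_per_element

The leading 2 counts the key and value tensors. The formula assumes standard multi-head attention, where keys and values span the full hidden size; models that use grouped-query attention store fewer KV heads and need proportionally less.

Example (Llama 70B, 4K context, FP16):
= 2 × 80 × 8192 × 4096 × 1 × 2 bytes
≈ 10.7 GB per sequence

Batch of 16 ≈ 172 GB just for KV cache!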
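
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: `kv_cache_bytes` and its `kv_dim` parameter are names introduced here for illustration. The last call assumes "Llama 70B" means Llama 2 70B, which uses grouped-query attention with 8 KV heads of dimension 128, so its per-token KV width is 1024 rather than 8192.

```python
def kv_cache_bytes(layers, kv_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size: one key and one value vector of width kv_dim
    per layer per token, for every sequence in the batch."""
    return 2 * layers * kv_dim * seq_len * batch_size * bytes_per_elem

# Full multi-head attention, as in the formula above (kv_dim = hidden_size).
per_seq = kv_cache_bytes(layers=80, kv_dim=8192, seq_len=4096, batch_size=1)
batch16 = kv_cache_bytes(layers=80, kv_dim=8192, seq_len=4096, batch_size=16)

# Assumption: grouped-query attention with 8 KV heads x 128 dims each.
gqa_seq = kv_cache_bytes(layers=80, kv_dim=8 * 128, seq_len=4096, batch_size=1)

print(f"MHA, 1 sequence:   {per_seq / 1e9:.1f} GB")
print(f"MHA, batch of 16:  {batch16 / 1e9:.1f} GB")
print(f"GQA, 1 sequence:   {gqa_seq / 1e9:.2f} GB")
```

Under the grouped-query assumption the per-sequence cache is 8× smaller than the multi-head figure, which is why the attention variant matters as much as batch size when budgeting GPU memory.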

PagedAttention Solution:
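
The core idea of PagedAttention is to store the KV cache in fixed-size blocks that are allocated on demand and tracked through a per-sequence block table, instead of reserving contiguous memory for the maximum sequence length. The sketch below illustrates only that bookkeeping. `BlockAllocator`, its methods, and the block size of 16 tokens are hypothetical choices for this example, not the API of any real library.

```python
class BlockAllocator:
    """Toy block table: maps each sequence to the physical KV blocks it owns."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when needed."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):                      # a 20-token sequence
    alloc.append_token("a")
blocks_used = len(alloc.tables["a"])     # 2 blocks cover 20 tokens
alloc.release("a")
print(blocks_used, len(alloc.free))
```

Because a sequence holds only the blocks it has actually filled, waste is bounded by one partially filled block per sequence, which leaves room for larger batches in the same memory.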

Throughput Optimization Techniques

Prefill Chunking:
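
One common way to describe prefill chunking is with a per-step token budget: each step first serves the requests that are already decoding (one token each) and spends the remaining budget on a slice of the new prompt. The sketch below follows that formulation as an assumption. `chunked_prefill_plan` and its parameters are illustrative names, and the numbers are made up.

```python
def chunked_prefill_plan(prompt_len, num_decoding, token_budget):
    """Return per-step (decode_tokens, prefill_tokens) pairs until the
    whole prompt has been prefilled."""
    if token_budget <= num_decoding:
        raise ValueError("token budget must leave room for prefill tokens")
    plan = []
    remaining = prompt_len
    while remaining > 0:
        # Decoding requests take one token each; prefill gets the rest.
        chunk = min(remaining, token_budget - num_decoding)
        plan.append((num_decoding, chunk))
        remaining -= chunk
    return plan

plan = chunked_prefill_plan(prompt_len=1200, num_decoding=12, token_budget=512)
for step, (decode_tokens, prefill_tokens) in enumerate(plan):
    print(step, decode_tokens, prefill_tokens)
```

Here the 1200-token prompt is prefilled over three steps (500, 500, and 200 tokens), and the 12 running requests keep producing a token at every step instead of stalling behind one long prefill.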

Request Scheduling:

Multi-GPU Strategies:

Throughput Benchmarks

Configuration                | Throughput (tokens/sec) | Latency per token
-----------------------------|-------------------------|------------------
Single request               | 50-80                   | 20 ms
Batch 8, static              | 300-400                 | 35 ms
Batch 32, continuous         | 800-1200                | 50 ms
Batch 64, PagedAttention     | 1500-2500               | 70 ms

Monitoring Metrics
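
Per-request serving metrics such as time to first token (TTFT), inter-token latency, and output throughput can be derived from token timestamps. The following is a minimal sketch under that assumption. `request_metrics` and the dictionary keys are hypothetical names, and the timestamps are invented for the example.

```python
def request_metrics(arrival, token_times):
    """Compute basic latency and throughput metrics for one request.

    arrival:     time the request was received, in seconds
    token_times: time each output token was emitted, in seconds, ascending
    """
    ttft = token_times[0] - arrival
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - arrival
    return {
        "ttft_s": ttft,                                 # time to first token
        "mean_itl_s": mean_itl,                         # mean gap between tokens
        "tokens_per_s": len(token_times) / total,       # end-to-end output rate
    }

m = request_metrics(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
print(m)
```

Aggregating these values across requests, alongside batch occupancy and KV cache usage, shows whether a change in batch size is buying throughput at an acceptable latency cost.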

Batching and throughput optimization is the key to LLM serving economics. Without efficient batching, GPU utilization can stay below 20% and costs become prohibitive; with modern continuous batching and PagedAttention, the same hardware can serve roughly 10× more users at a fraction of the cost.
