Batching and throughput optimization is the technique of combining multiple inference requests into single GPU operations: batches of prompts are processed together rather than individually, which raises GPU utilization and tokens-per-second throughput. It is essential for cost-effective LLM serving at scale.
What Is Batching?
- Definition: Processing multiple requests in a single forward pass.
- Goal: Maximize GPU utilization and throughput.
- Trade-off: Higher throughput vs. increased per-request latency.
- Context: Critical for production LLM serving economics.
Why Batching Matters
- GPU Utilization: Single requests underutilize GPU compute.
- Cost Efficiency: More tokens per GPU-hour = lower cost per token.
- Scale: Handle more users with same hardware.
- Memory Amortization: Model weights are read from GPU memory once per forward pass, so that fixed cost is spread across every request in the batch.
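The cost-efficiency point can be made concrete with a little arithmetic. The sketch below uses a hypothetical GPU price of $2.00 per hour and hypothetical throughput figures; `cost_per_million_tokens` is an illustrative helper, not part of any library.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

# Hypothetical numbers chosen only for illustration.
unbatched = cost_per_million_tokens(2.00, 50)    # one request at a time
batched = cost_per_million_tokens(2.00, 1000)    # many requests per step
```

Because the hourly price is fixed, cost per token falls in direct proportion to the throughput gained from batching.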
Batching Strategies
Static Batching:
- Fixed batch size, wait until batch is full.
- All requests start and end together.
- Simple but wasteful: shorter sequences are padded to the longest one, and finished requests wait for the slowest.
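The padding cost of static batching can be estimated directly from the sequence lengths in a batch. This is a minimal sketch; `padding_waste` is a hypothetical helper and the lengths are made up for illustration.

```python
def padding_waste(sequence_lengths: list[int]) -> float:
    """Fraction of token slots in a padded batch that hold padding."""
    if not sequence_lengths:
        return 0.0
    # Every sequence is padded up to the longest one in the batch.
    padded_slots = max(sequence_lengths) * len(sequence_lengths)
    real_tokens = sum(sequence_lengths)
    return 1 - real_tokens / padded_slots

# One long request forces three short ones to be padded heavily.
waste = padding_waste([12, 40, 512, 75])
```

With these illustrative lengths, roughly two thirds of the batch's token slots hold padding rather than real work.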
Dynamic Batching:
- Accumulate requests within a time window, up to a maximum batch size.
- Variable batch size based on arrivals.
- Better utilization than static.
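A time-window batcher can be sketched as a function over arrival timestamps. The version below is a simplified offline model under assumed parameters (`window_ms`, `max_batch_size`); a real server would use timers and a queue instead of a pre-sorted list.

```python
def dynamic_batches(arrivals_ms, window_ms, max_batch_size):
    """Group sorted arrival times into batches.

    A batch is dispatched when it reaches max_batch_size or when the next
    request arrives more than window_ms after the batch's first request.
    """
    batches, current = [], []
    for t in arrivals_ms:
        window_expired = current and t - current[0] > window_ms
        batch_full = len(current) == max_batch_size
        if window_expired or batch_full:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

batches = dynamic_batches([0, 2, 3, 15, 16, 17, 18, 40], window_ms=10, max_batch_size=3)
```

Batch sizes vary with the arrival pattern: bursts fill a batch quickly, while isolated requests are dispatched alone once the window expires.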
Continuous Batching (State-of-the-art):
- Requests join and leave the batch at each decoding iteration.
- A new request can start while others are still in progress.
- No waiting for the whole batch to complete.
- Implemented in vLLM, TGI, TensorRT-LLM.
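The benefit of refilling slots as soon as requests finish can be shown with a toy step counter. This sketch counts decode steps only, ignores prefill, and assumes every request generates at least one token; the function names and workload are illustrative.

```python
from collections import deque

def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each step generates one token per running request; drop finished ones.
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
```

In this workload the static scheduler runs two full 100-step batches, while the continuous scheduler overlaps the two long requests and finishes in about half the steps.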
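In-Flight Batching:
- NVIDIA's name in TensorRT-LLM for continuous batching; the term emphasizes mixing phases within one batch.
- Mix prefill and decode phases in the same batch.
- Use both compute (prefill is compute-bound) and memory bandwidth (decode is memory-bound) more fully.
- Most efficient for heterogeneous request lengths.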
Batch Size Trade-offs
Larger Batch Size:
- ✅ Higher throughput (tokens/sec)
- ✅ Better GPU utilization
- ✅ Lower cost per token
- ❌ Higher per-request latency
- ❌ More memory for KV cache
- ❌ Longer queue wait times
Smaller Batch Size:
- ✅ Lower latency per request
- ✅ Faster time to first token (TTFT)
- ❌ Underutilized GPU
- ❌ Higher cost per token
Memory Constraints
KV Cache Scaling:
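KV Cache Memory = 2 × layers × hidden_size × seq_len × batch_size × bytes_per_element
(The leading 2 counts keys and values; the formula assumes standard multi-head attention.)
Example (Llama 70B dimensions, 4K context, FP16):
= 2 × 80 × 8192 × 4096 × batch_size × 2 bytes
≈ 10.7 GB per sequence (batch_size = 1)
Batch of 16 ≈ 172 GB just for KV cache!
Note: Grouped-query attention (GQA) stores fewer KV heads. Llama 2 70B uses 8 KV heads instead of 64, which cuts the example to about 1.3 GB per sequence.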
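The KV cache formula can be checked with a few lines of arithmetic. In this sketch the helper name and the split of hidden_size into `kv_heads × head_dim` are illustrative assumptions; the last call assumes a grouped-query configuration with 8 KV heads to show how fewer KV heads shrink the cache.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_element=2):
    """Bytes of KV cache: keys and values for every layer, position, and sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_element

# Full multi-head attention: kv_heads * head_dim equals hidden_size (64 * 128 = 8192).
per_seq = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096, batch_size=1)
batch_16 = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096, batch_size=16)

# Grouped-query attention with 8 KV heads (an assumption about the model config).
per_seq_gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096, batch_size=1)

print(f"MHA: {per_seq / 1e9:.1f} GB per sequence, {batch_16 / 1e9:.1f} GB for batch 16")
print(f"GQA: {per_seq_gqa / 1e9:.2f} GB per sequence")
```

Because the cache grows linearly in both sequence length and batch size, KV memory rather than compute is often what caps the usable batch size.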
PagedAttention Solution:
- Allocate KV cache in pages, not contiguous blocks.
- Share common prefixes across requests.
- Dynamic allocation reduces fragmentation.
- Reported to enable 2-4× higher throughput at similar latency (vLLM paper, compared with earlier serving systems).
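The page-based allocation idea can be illustrated with a toy block allocator. This is a sketch under simplifying assumptions, not vLLM's implementation: the class and method names are hypothetical, blocks are plain integers, and prefix sharing is left out.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, taking a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")   # 20 tokens occupy 2 blocks
for _ in range(5):
    alloc.append_token("b")   # 5 tokens occupy 1 block
blocks_in_use = 8 - len(alloc.free_blocks)
alloc.free("a")
```

Memory is claimed one block at a time as a sequence grows, so at most one partially filled block per sequence is wasted, instead of reserving space for the maximum context length up front.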
Throughput Optimization Techniques
Prefill Chunking:
- Split long prompts into smaller chunks.
- Process the chunks interleaved with decode steps of other requests.
- Reduces TTFT variance and avoids stalling in-flight decodes behind a long prompt.
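Prefill chunking is often driven by a per-step token budget. The sketch below plans how one long prompt would be split when some requests are already decoding; `plan_steps`, `token_budget`, and the numbers are hypothetical, and the number of decoding requests is assumed to stay constant.

```python
def plan_steps(prompt_len, num_decoding, token_budget):
    """Return per-step (prefill_tokens, decode_tokens) for one long prompt.

    Each step first reserves one token for every request already decoding,
    then spends the remaining budget on the next chunk of the prompt.
    """
    if token_budget <= num_decoding:
        raise ValueError("token_budget must exceed the number of decoding requests")
    steps, remaining = [], prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decoding)
        steps.append((chunk, num_decoding))
        remaining -= chunk
    return steps

steps = plan_steps(prompt_len=1000, num_decoding=12, token_budget=256)
```

Here the 1000-token prompt is spread over five steps, and the twelve decoding requests keep producing a token in every one of them.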
Request Scheduling:
- Priority queues for latency-sensitive requests.
- Separate queues for long vs. short requests.
- Preemption for high-priority requests.
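A priority queue for admission can be sketched with the standard library's `heapq`. The class name and the convention that a lower number means more urgent are assumptions for illustration; preemption of running requests is not shown.

```python
import heapq
import itertools

class PriorityScheduler:
    """Pop the most urgent request first; ties keep arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def submit(self, request_id: str, priority: int) -> None:
        # Lower number = more urgent (hypothetical convention).
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))

    def next_batch(self, max_batch_size: int) -> list[str]:
        """Remove and return up to max_batch_size of the most urgent requests."""
        batch = []
        while self._heap and len(batch) < max_batch_size:
            batch.append(heapq.heappop(self._heap)[2])
        return batch

sched = PriorityScheduler()
sched.submit("batch-job-1", priority=5)
sched.submit("chat-1", priority=0)
sched.submit("batch-job-2", priority=5)
sched.submit("chat-2", priority=0)
first = sched.next_batch(max_batch_size=3)
```

The interactive requests are admitted ahead of the earlier batch job, and the counter keeps ordering stable among requests of equal priority.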
Multi-GPU Strategies:
- Tensor Parallel: Split each layer's weight matrices across GPUs.
- Pipeline Parallel: Assign consecutive groups of layers to different GPUs.
- Data Parallel: Replicate the model and split incoming requests across replicas.
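Tensor parallelism can be illustrated without any GPU by splitting a weight matrix column-wise and checking that the concatenated partial outputs match the unsplit result. This pure-Python sketch simulates two devices with plain lists; the helper names are hypothetical and the column count is assumed to divide evenly.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards, one per simulated GPU."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Every shard sees the full input and produces its slice of the output.
shards = split_columns(w, parts=2)
tp_output = [y for shard in shards for y in matmul(x, shard)]
```

Each simulated device holds half of the weights and does half of the arithmetic; on real hardware the concatenation step becomes a collective communication between GPUs.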
Throughput Benchmarks
Configuration             | Throughput (tokens/sec) | Latency (ms/token)
--------------------------|-------------------------|-------------------
Single request            | 50-80                   | 20
Batch 8, static           | 300-400                 | 35
Batch 32, continuous      | 800-1200                | 50
Batch 64, PagedAttention  | 1500-2500               | 70
Monitoring Metrics
- Queue Depth: Pending requests waiting for processing.
- Batch Utilization: Actual vs. maximum batch size.
- GPU Memory: KV cache utilization percentage.
- Time-in-Queue: Wait time before processing starts.
- Tokens/Second: Overall throughput metric.
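Several of these metrics can be derived from per-step samples collected by the scheduler. The sketch below assumes a hypothetical `StepSample` record; the field names and sample values are illustrative, not taken from any serving framework.

```python
from dataclasses import dataclass

@dataclass
class StepSample:
    """One scheduler step's observations (field names are hypothetical)."""
    batch_size: int
    queue_depth: int
    tokens_generated: int
    step_seconds: float

def summarize(samples, max_batch_size):
    """Aggregate throughput, batch utilization, and peak queue depth."""
    total_tokens = sum(s.tokens_generated for s in samples)
    total_seconds = sum(s.step_seconds for s in samples)
    used_slots = sum(s.batch_size for s in samples)
    return {
        "tokens_per_second": total_tokens / total_seconds,
        "batch_utilization": used_slots / (len(samples) * max_batch_size),
        "max_queue_depth": max(s.queue_depth for s in samples),
    }

samples = [
    StepSample(batch_size=32, queue_depth=4, tokens_generated=32, step_seconds=0.05),
    StepSample(batch_size=16, queue_depth=0, tokens_generated=16, step_seconds=0.03),
]
summary = summarize(samples, max_batch_size=32)
```

Low batch utilization together with an empty queue suggests spare capacity, while full batches with a growing queue suggest the deployment is saturated.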
Batching and throughput optimization is the key to LLM serving economics. Without efficient batching, GPU utilization can stay below 20% and costs become prohibitive; with modern continuous batching and PagedAttention, the same hardware can serve roughly 10× more users at a fraction of the cost.