Dynamic Batching

Keywords: dynamic batching inference, adaptive batching strategies, continuous batching llm, batching optimization serving, request batching systems

Dynamic Batching is an inference serving technique that adaptively groups incoming requests into variable-size batches based on arrival patterns and timing constraints. The server waits up to a maximum timeout for requests to accumulate before processing, which lets it balance latency and throughput automatically, without manual tuning, while maximizing GPU utilization across varying load conditions.

Dynamic Batching Fundamentals:
- Timeout-Based Accumulation: waits up to max_timeout (typically 1-10ms) for requests to arrive; processes the batch when the timeout expires or max_batch_size is reached; shorter timeouts favor lower latency, longer timeouts favor higher throughput (see the sketch after this list)
- Adaptive Batch Size: batch size varies from 1 (single request within timeout) to max_batch_size (many concurrent requests); automatically adapts to load — small batches during low traffic, large batches during high traffic
- Latency Guarantee: timeout provides upper bound on batching delay; total latency = batching_delay + inference_time + postprocessing; enables SLA compliance (e.g., p99 latency < 100ms)
- Throughput Maximization: during high load, batches fill quickly (minimal timeout waiting); GPU utilization approaches maximum; cost per request minimized through batch efficiency
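
The timeout-based accumulation above can be captured in a few lines. Below is a minimal sketch assuming a blocking inference callable; the class name DynamicBatcher, the handle_batch callback, and the default values are illustrative, not taken from any particular framework.

```python
import queue
import threading
import time

class DynamicBatcher:
    """Groups requests into batches of up to max_batch_size, waiting at most max_timeout_s."""
    def __init__(self, handle_batch, max_batch_size=32, max_timeout_s=0.005):
        self.handle_batch = handle_batch      # callable that runs inference on a list of requests
        self.max_batch_size = max_batch_size
        self.max_timeout_s = max_timeout_s    # upper bound on batching delay
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, request):
        self.requests.put(request)

    def _loop(self):
        while True:
            # Block until the first request arrives, then open the timeout window.
            batch = [self.requests.get()]
            deadline = time.monotonic() + self.max_timeout_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break                     # timeout expired: process whatever accumulated
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break                     # no more arrivals within the window
            # Batch size adapts to load: 1 under low traffic, up to max_batch_size under high traffic.
            self.handle_batch(batch)
```

Under low load the inner loop times out with a batch of one; under high load it exits early because max_batch_size is reached, so batching delay shrinks exactly when traffic is heaviest.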

Implementation Strategies:
- Queue-Based Batching: requests enter queue; batcher thread monitors queue and forms batches; simple but requires careful synchronization; TorchServe, TensorFlow Serving use this approach
- Event-Driven Batching: requests trigger batch formation events; uses async/await or callbacks; more complex but lower overhead; suitable for high-throughput systems (see the asyncio sketch after this list)
- Multi-Queue Batching: separate queues for different priorities or request types; high-priority queue has shorter timeout; enables differentiated service levels
- Hierarchical Batching: first-level batching at request router, second-level at model server; enables batching across multiple clients; reduces per-server load variance
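
As a contrast to the thread-and-queue pattern, here is a hedged asyncio sketch of event-driven batching in which each request awaits a per-request future; run_model and the default values are placeholder assumptions rather than any framework's API.

```python
import asyncio

class AsyncBatcher:
    """Event-driven batcher: callers await futures that are resolved when their batch runs."""
    def __init__(self, run_model, max_batch_size=16, max_delay_s=0.002):
        self.run_model = run_model            # async callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_delay_s = max_delay_s
        self.pending = []                     # list of (input, future) pairs
        self.flush_task = None                # timer task for the timeout-triggered flush

    async def infer(self, x):
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((x, fut))
        if len(self.pending) >= self.max_batch_size:
            await self._flush()               # batch full: flush immediately
        elif self.flush_task is None:
            self.flush_task = asyncio.create_task(self._delayed_flush())
        return await fut

    async def _delayed_flush(self):
        await asyncio.sleep(self.max_delay_s)
        self.flush_task = None
        await self._flush()                   # timeout expired: flush a partial batch

    async def _flush(self):
        if self.flush_task is not None:
            self.flush_task.cancel()          # a size-triggered flush cancels the pending timer
            self.flush_task = None
        batch, self.pending = self.pending, []
        if not batch:
            return
        outputs = await self.run_model([x for x, _ in batch])
        for (_, fut), y in zip(batch, outputs):
            fut.set_result(y)
```

Because everything runs on a single event loop there is no lock contention, which is the lower-overhead property the bullet above refers to.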

Continuous Batching (Iteration-Level):
- Autoregressive Generation Challenge: traditional batching processes entire sequences together; sequences finish at different times (variable length outputs); GPU underutilized as batch shrinks
- Iteration-Level Batching: adds new requests to in-flight batches between generation steps; maintains constant batch size; dramatically improves throughput (10-20×) for LLM serving
- Orca Algorithm: tracks per-sequence generation state; adds new sequences when others finish; requires careful memory management (KV cache grows/shrinks dynamically)
- PagedAttention Integration: combines continuous batching with paged KV-cache management; largely eliminates memory fragmentation; vLLM reports up to 24× higher throughput than naive request-level batching (a schematic decode loop follows this list)
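
A schematic decode loop makes the iteration-level idea concrete. The sequence objects, model_step, and waiting_queue below are placeholders, and KV-cache allocation and eviction are omitted, so treat this as a sketch of the scheduling pattern rather than the Orca or vLLM implementation.

```python
def continuous_batching_loop(model_step, waiting_queue, max_batch_size):
    running = []                                   # sequences currently being decoded
    while True:
        # Admit new requests between decode iterations instead of waiting
        # for the whole batch to finish, keeping the batch close to full.
        while len(running) < max_batch_size and not waiting_queue.empty():
            running.append(waiting_queue.get())
        if not running:
            running.append(waiting_queue.get())    # block for work when completely idle
        # One decode step for every in-flight sequence.
        next_tokens = model_step(running)
        finished = []
        for seq, tok in zip(running, next_tokens):
            seq.append_token(tok)
            if seq.is_finished():                  # EOS emitted or max length reached
                finished.append(seq)
        # Retire finished sequences; their slots are reused on the next iteration.
        running = [s for s in running if s not in finished]
```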

Padding and Memory Management:
- Dynamic Padding: pads each batch only to the longest sequence in that batch (not a global maximum); reduces wasted computation; padding overhead varies with batch composition (illustrated after this list)
- Bucketing with Dynamic Batching: pre-defined length buckets (0-64, 64-128, ...); dynamic batching within each bucket; combines benefits of bucketing (reduced padding) and dynamic batching (adaptive throughput)
- Memory Reservation: pre-allocates memory for max_batch_size; avoids allocation overhead during serving; trades memory for latency predictability
- Attention Mask Optimization: computes attention only on non-padded tokens; Flash Attention with variable-length support; eliminates padding computation overhead
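
The first two ideas above can be sketched with a batch-local collate function and a simple bucket lookup; the helpers below assume PyTorch and illustrative bucket boundaries.

```python
import torch

def collate_dynamic(sequences, pad_id=0):
    """Pad each batch only to its own longest sequence and return an attention mask."""
    max_len = max(len(s) for s in sequences)       # batch-local maximum, not a global one
    input_ids = torch.full((len(sequences), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros((len(sequences), max_len), dtype=torch.long)
    for i, seq in enumerate(sequences):
        input_ids[i, :len(seq)] = torch.tensor(seq, dtype=torch.long)
        attention_mask[i, :len(seq)] = 1           # 1 = real token, 0 = padding to be masked out
    return input_ids, attention_mask

# Length bucketing: route a request to a pre-defined bucket so dynamic batching
# happens among sequences of similar length, shrinking padding overhead further.
BUCKETS = [64, 128, 256, 512]

def bucket_for(seq_len):
    for upper in BUCKETS:
        if seq_len <= upper:
            return upper
    return BUCKETS[-1]                             # longer sequences fall into the last bucket
```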

Timeout and Batch Size Tuning:
- Latency-Throughput Curve: profile the system at several timeout values (e.g., 0.1ms, 1ms, 5ms, 10ms); plot latency versus throughput; select the timeout based on application requirements (a sweep sketch follows this list)
- Adaptive Timeout: adjusts timeout based on current load; shorter timeout during low load (minimize latency), longer during high load (maximize throughput); requires careful tuning to avoid oscillation
- Batch Size Limits: max_batch_size limited by GPU memory; larger models require smaller batches; profile to find maximum feasible batch size; consider memory for activations, KV cache, and intermediate tensors
- Multi-Objective Optimization: balance latency, throughput, and cost; Pareto frontier analysis; different applications have different priorities (real-time vs batch processing)
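
One way to trace the latency-throughput curve is a simple timeout sweep. In the sketch below, serve_with_timeout is an assumed benchmark hook that replays a fixed workload at a given timeout and returns per-request latencies; the timeout values mirror those suggested above.

```python
import statistics
import time

def sweep_timeouts(serve_with_timeout, timeouts_s=(0.0001, 0.001, 0.005, 0.010)):
    results = []
    for t in timeouts_s:
        start = time.monotonic()
        latencies = serve_with_timeout(t)                  # run the fixed benchmark workload
        elapsed = time.monotonic() - start
        p99 = statistics.quantiles(latencies, n=100)[98]   # 99th-percentile latency
        results.append({
            "timeout_s": t,
            "p99_latency_s": p99,
            "throughput_rps": len(latencies) / elapsed,
        })
    # Pick the largest timeout whose p99 still meets the SLA: that point sits
    # on the latency-throughput Pareto frontier for this workload.
    return results
```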

Priority and Fairness:
- Priority Queues: high-priority requests processed first; may preempt low-priority batches; ensures SLA compliance for critical requests
- Fair Batching: ensures no request starves; the oldest request in the queue is included in the next batch; prevents indefinite postponement of low-priority requests under heavy high-priority load
- Weighted Fair Queuing: allocates batch slots proportionally to request weights; enables differentiated service levels; enterprise customers get more slots than free tier
- Deadline-Aware Batching: considers request deadlines when forming batches; processes requests with the nearest deadlines first; minimizes SLA violations (sketched after this list)
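
Deadline-aware batch formation reduces to keeping the queue ordered by deadline. The sketch below uses a heap with an arrival counter as a tie-breaker, so equal deadlines are served oldest-first, which also gives the fairness property above; the class name and fields are illustrative.

```python
import heapq

class DeadlineQueue:
    """Forms batches earliest-deadline-first so requests closest to their SLA go first."""
    def __init__(self):
        self.heap = []                      # entries: (deadline, arrival_no, request)
        self.counter = 0                    # arrival order breaks deadline ties (oldest first)

    def put(self, request, deadline):
        heapq.heappush(self.heap, (deadline, self.counter, request))
        self.counter += 1

    def form_batch(self, max_batch_size):
        batch = []
        while self.heap and len(batch) < max_batch_size:
            _, _, request = heapq.heappop(self.heap)
            batch.append(request)
        return batch
```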

Framework Support:
- NVIDIA Triton: dynamic batching with configurable timeout and max_batch_size; supports multiple models and backends; production-grade with monitoring and metrics
- TorchServe: dynamic batching via batch_size and max_batch_delay parameters; integrates with PyTorch models; supports custom batching logic
- TensorFlow Serving: batching via --enable_batching flag; configurable batch_timeout_micros and max_batch_size; high-performance C++ implementation
- vLLM: continuous batching for LLMs; PagedAttention for memory efficiency; 10-20× higher throughput than static batching; supports popular open LLM families such as Llama, Mistral, and GPT-style models (usage sketch after this list)
- Text Generation Inference (TGI): Hugging Face's LLM serving with continuous batching; optimized for Transformers; supports quantization and tensor parallelism
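
As one concrete example, vLLM applies continuous batching automatically when a list of prompts is submitted; the snippet below assumes vLLM is installed and uses a placeholder model name and sampling values.

```python
from vllm import LLM, SamplingParams

# Placeholder model; any model supported by vLLM works the same way.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = ["Explain dynamic batching in one sentence."] * 64

# vLLM schedules all 64 prompts with continuous batching: sequences that finish
# early free their slots for waiting ones, so no manual batching code is needed.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```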

Monitoring and Observability:
- Batch Size Distribution: histogram of realized batch sizes; identifies underutilization (many small batches) or saturation (always at max_batch_size); see the metrics sketch after this list
- Timeout Utilization: fraction of batches triggered by timeout vs max_batch_size; high timeout utilization indicates low load; low indicates high load
- Queue Depth: number of requests waiting for batching; high queue depth indicates insufficient capacity; triggers autoscaling
- Latency Breakdown: separate batching delay, inference time, and postprocessing; identifies bottlenecks; guides optimization efforts
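
A minimal sketch of the first two signals: record why each batch closed and how large it was, then derive the size distribution and the timeout fraction. Names are illustrative, not from a specific metrics library.

```python
from collections import Counter

class BatchMetrics:
    def __init__(self):
        self.batch_sizes = Counter()        # histogram of realized batch sizes
        self.closed_by_timeout = 0
        self.closed_by_size = 0

    def record(self, batch_size, closed_by):
        self.batch_sizes[batch_size] += 1
        if closed_by == "timeout":
            self.closed_by_timeout += 1     # batch shipped because the timeout expired
        else:
            self.closed_by_size += 1        # batch shipped because it reached max_batch_size

    def timeout_fraction(self):
        total = self.closed_by_timeout + self.closed_by_size
        # High fraction -> low load (batches rarely fill); low fraction -> high load.
        return self.closed_by_timeout / total if total else 0.0
```

In practice these counters would be exported to a metrics backend and alerted on, for example triggering autoscaling when queue depth and the size-triggered fraction both climb.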

Advanced Techniques:
- Speculative Batching: batches draft model generation separately from verification; different batch sizes for different stages; optimizes for different computational characteristics
- Multi-Model Batching: batches requests for different models together; requires model multiplexing or multi-model serving; increases overall GPU utilization
- Prefill-Decode Separation: separates prompt processing (prefill) from token generation (decode); different batching strategies for each phase; prefill uses large batches, decode uses continuous batching
- Batch Splitting: splits large batches into smaller sub-batches for better load balancing; useful when batch processing time varies significantly

Challenges and Solutions:
- Cold Start: first request after idle period has no batching benefit; warm-up requests or keep-alive pings maintain readiness
- Bursty Traffic: sudden traffic spikes cause queue buildup; mitigate with autoscaling, either predictive (anticipate spikes) or reactive (respond to queue depth, as in the sketch after this list)
- Variable Sequence Length: long sequences dominate batch processing time; separate queues or buckets for different length ranges; prevents head-of-line blocking
- Memory Fragmentation: variable batch sizes cause memory fragmentation; memory pooling and paged attention mitigate; pre-allocation for common batch sizes
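
For the reactive path, a queue-depth-based scaling rule can be as simple as the sketch below; the target depth and replica bounds are illustrative assumptions, not values from any particular autoscaler.

```python
import math

def desired_replicas(queue_depth, target_depth_per_replica=32,
                     min_replicas=1, max_replicas=16):
    """Size the fleet so each replica's share of the waiting queue stays near a target depth."""
    needed = math.ceil(queue_depth / target_depth_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```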

Dynamic batching is an essential technique for production AI serving. By automatically adapting to traffic patterns, it maximizes GPU utilization and throughput while maintaining latency guarantees, enabling cost-effective serving that scales from a handful of requests per second to thousands without manual intervention or performance degradation.
