Efficient Inference (KV Cache, Speculative Decoding, Continuous Batching) is the set of systems-level optimizations that reduce the latency and cost, and raise the throughput, of serving large language model predictions in production, transforming LLM deployment from a prohibitively expensive endeavor into a scalable service capable of handling millions of concurrent requests.
The Inference Bottleneck
LLM inference is fundamentally memory-bandwidth-bound during autoregressive decoding: each generated token requires reading the entire set of model weights from GPU memory, yet performs very little computation per byte loaded. For a 70B-parameter model in FP16, generating one token reads ~140 GB of weights but performs only ~140 GFLOPs of compute, far below the GPU's capacity. The arithmetic intensity is therefore roughly 1 FLOP per byte, while modern GPUs can sustain on the order of a few hundred FLOPs per byte of memory bandwidth. This makes low-batch serving cost proportional to memory bandwidth rather than compute throughput, as the back-of-the-envelope sketch below illustrates.
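A minimal roofline-style estimate makes the gap concrete. The model size, bandwidth, and peak-FLOPS figures passed in below are illustrative assumptions, not measurements of any particular GPU.

```python
# Back-of-the-envelope roofline check for autoregressive decode.
# All hardware numbers below are assumed, H100-like figures for illustration.

def decode_token_estimate(n_params: float, bytes_per_param: float,
                          mem_bw_gbs: float, peak_tflops: float) -> None:
    weight_bytes = n_params * bytes_per_param        # bytes streamed per token
    flops = 2 * n_params                             # ~2 FLOPs per parameter (multiply-add)
    intensity = flops / weight_bytes                 # arithmetic intensity, FLOPs per byte
    t_memory = weight_bytes / (mem_bw_gbs * 1e9)     # latency if memory-bound
    t_compute = flops / (peak_tflops * 1e12)         # latency if compute-bound
    print(f"arithmetic intensity ~ {intensity:.2f} FLOPs/byte")
    print(f"memory-bound latency  ~ {t_memory * 1e3:.1f} ms/token")
    print(f"compute-bound latency ~ {t_compute * 1e3:.3f} ms/token")

# 70B model in FP16 on an assumed GPU with ~3.3 TB/s bandwidth and ~1000 TFLOPS FP16
decode_token_estimate(n_params=70e9, bytes_per_param=2,
                      mem_bw_gbs=3300, peak_tflops=1000)
```

At batch size 1 the memory-bound estimate dominates by orders of magnitude, which is exactly why the techniques in this section focus on amortizing weight reads across many tokens and requests.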
KV Cache Mechanism and Optimization
- Cache purpose: During autoregressive generation, each new token's attention computation requires key and value vectors from all previous tokens; the KV cache stores these to avoid redundant recomputation
- Memory consumption: KV cache size = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × dtype_bytes; for a LLaMA-2-70B-style configuration (80 layers, 8 KV heads via GQA, head_dim 128) at 4K context in FP16, this is roughly 1.3 GB per request (see the sizing sketch after this list)
- PagedAttention (vLLM): Manages KV cache as virtual memory pages, eliminating fragmentation and enabling 2-4x more concurrent requests; pages allocated on-demand and freed when sequences complete
- KV cache compression: Quantizing the KV cache to INT8 or INT4 cuts its memory by 2x or 4x relative to FP16 with minimal quality impact; KIVI and GEAR push KV quantization down to around 2 bits
- Multi-Query/Grouped-Query Attention: Reduces KV cache size by sharing key-value heads across query heads; the reduction equals the ratio of query heads to KV heads (e.g., 8x for GQA with 8 KV-head groups on a 64-head model, and up to the full query-head count for MQA's single shared KV head)
- Sliding window eviction: Discard oldest KV entries beyond a window size; StreamingLLM maintains initial attention sink tokens plus recent window for infinite-length generation
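A small sketch of the two ideas above, assuming PyTorch and illustrative LLaMA-2-70B-style shape parameters: kv_cache_bytes implements the sizing formula from the list (using the number of KV heads, so it covers MHA, GQA, and MQA alike), and cached_attention_step shows why the cache exists, appending only the new token's K/V and attending over the stored history instead of recomputing it.

```python
import torch

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    # 2x accounts for keys and values
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# LLaMA-2-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16, 4K context
print(kv_cache_bytes(80, 8, 128, 4096, 1) / 1e9, "GB per request")

def cached_attention_step(q, k_new, v_new, cache):
    """Append this step's K/V to the cache and attend over the full history."""
    cache["k"] = torch.cat([cache["k"], k_new], dim=-2)   # [B, H, T, D]
    cache["v"] = torch.cat([cache["v"], v_new], dim=-2)
    scores = q @ cache["k"].transpose(-1, -2) / cache["k"].shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ cache["v"]

# Toy usage: one decode step with an initially empty cache
B, H, D = 1, 8, 128
cache = {"k": torch.zeros(B, H, 0, D), "v": torch.zeros(B, H, 0, D)}
out = cached_attention_step(torch.randn(B, H, 1, D), torch.randn(B, H, 1, D),
                            torch.randn(B, H, 1, D), cache)
```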
Speculative Decoding
- Core idea: Use a small draft model to generate k candidate tokens quickly, then verify all k tokens in parallel with the large target model in a single forward pass
- Acceptance criterion: A draft token x is accepted with probability min(1, p_target(x) / p_draft(x)); on rejection, a replacement token is sampled from the normalized residual distribution max(0, p_target - p_draft) (see the verification sketch after this list)
- Speedup: 2-3x faster decoding with zero quality degradation; under this acceptance rule the output distribution is mathematically identical to sampling from the target model alone
- Draft model selection: The draft model must be significantly faster (e.g., a 7B model drafting for a 70B target) while sharing the target's vocabulary and producing reasonable approximations of its distribution
- Self-speculative decoding: Uses early exit from the target model's own layers as the draft, avoiding the need for a separate draft model
- Medusa: Adds multiple prediction heads to the target model that predict future tokens in parallel, achieving speculative decoding without a separate draft model
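A minimal sketch of the verification step for a single draft token, following the standard speculative sampling accept/resample rule; the toy distributions, vocabulary size, and the verify_draft helper name are illustrative assumptions.

```python
import numpy as np

def verify_draft(p: np.ndarray, q: np.ndarray, x: int, rng) -> int:
    """Return the accepted token given target probs p, draft probs q, and draft token x."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                                  # accept the draft token
    residual = np.maximum(p - q, 0.0)             # corrected (residual) distribution
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)         # resample from the residual

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])    # target model distribution over a toy 3-token vocab
q = np.array([0.2, 0.6, 0.2])    # draft model distribution
x = 1                            # token proposed by the draft model
print(verify_draft(p, q, x, rng))
```

In a full implementation the target model scores all k draft tokens in one forward pass, and this rule is applied position by position until the first rejection.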
Continuous Batching
- Problem with static batching: Naive batching waits until all sequences in a batch finish before starting new requests, wasting GPU cycles on padding for shorter sequences
- Iteration-level scheduling: Continuous batching (Orca, vLLM) inserts new requests into the batch as soon as existing sequences complete, maximizing GPU utilization (a toy scheduler simulation follows this list)
- Preemption: Lower-priority or longer requests can be preempted (KV cache swapped to CPU) to serve higher-priority incoming requests
- Throughput gains: Continuous batching achieves 10-20x higher throughput than static batching for variable-length workloads
- Prefill-decode disaggregation: Separate GPU pools for compute-intensive prefill (processing the prompt) and memory-bound decode (generating tokens), optimizing each phase independently
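A toy simulation of iteration-level scheduling, assuming made-up request lengths and a hypothetical max_batch slot count; it only models when requests enter and leave the batch, not the GPU work itself. Finished sequences free their slot at the end of every decode step, and waiting requests are admitted on the very next iteration rather than after the whole batch drains.

```python
from collections import deque

def continuous_batching(request_lengths, max_batch=4):
    waiting = deque(request_lengths)   # tokens remaining to generate, per queued request
    running, steps = [], 0
    while waiting or running:
        while waiting and len(running) < max_batch:    # admit new work every iteration
            running.append(waiting.popleft())
        running = [r - 1 for r in running if r > 1]    # one decode step per active sequence
        steps += 1
    return steps

print(continuous_batching([8, 2, 2, 2, 6, 3]))
```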
Model Parallelism for Serving
- Tensor parallelism: Split weight matrices across GPUs within a node; per-layer all-reduce synchronization adds latency but enables serving models larger than single-GPU memory (a simulated example follows this list)
- Pipeline parallelism: Distribute layers across GPUs; micro-batching hides pipeline bubbles; suitable for multi-node serving
- Expert parallelism for MoE: Route tokens to experts on different GPUs; all-to-all communication overhead managed by high-bandwidth interconnects
- Quantization: GPTQ and AWQ quantize weights to 4 bits with minimal accuracy loss (GGUF is the file format llama.cpp-style runtimes use to package such low-bit weights), cutting weight memory roughly 4x versus FP16 and directly speeding up memory-bound decoding
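A sketch of the row-parallel linear layer used in tensor parallelism, simulated on CPU with NumPy: the input dimension is split across two hypothetical devices, each computes a partial matmul, and summing the partials stands in for the all-reduce. All shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # activations [batch, d_in]
w = rng.standard_normal((8, 16))       # full weight  [d_in, d_out]

# Row-parallel split: each simulated device holds half of the input dimension.
x_shards = np.split(x, 2, axis=1)
w_shards = np.split(w, 2, axis=0)
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]   # per-device partial matmuls
y = sum(partials)                                            # stands in for the all-reduce

assert np.allclose(y, x @ w)           # matches the single-device result
```

The assert shows why the all-reduce is a pure sum: each shard contributes an exact partial product, so correctness is preserved while per-device weight memory is halved.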
Serving Frameworks and Infrastructure
- vLLM: PagedAttention-based serving engine with continuous batching, tensor parallelism, and prefix caching; standard for open-source LLM serving
- TensorRT-LLM (NVIDIA): Optimized inference engine with INT4/INT8 quantization, in-flight batching, and custom CUDA kernels for maximum GPU utilization
- SGLang: Compiler-based approach with RadixAttention for automatic KV cache sharing across requests with common prefixes
- Prefix caching: Reuse the KV cache for shared prompt prefixes across requests (system prompts, few-shot examples), reducing first-token latency by 5-10x for repeated prefixes (a toy prefix-reuse sketch follows this list)
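A toy sketch of prefix reuse via a trie keyed by blocks of token ids, in the spirit of RadixAttention and vLLM-style prefix caching; the BLOCK size, the PrefixNode class, and the lookup/insert helpers are illustrative assumptions, not any framework's API.

```python
BLOCK = 4  # tokens per cached KV block (illustrative)

class PrefixNode:
    def __init__(self):
        self.children = {}       # maps a token block (tuple) to a child node
        self.kv_block = None     # handle to the cached KV pages for this block

def lookup(root, tokens):
    """Return (cached_prefix_len, last_matched_node) for a new request."""
    node, matched = root, 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        key = tuple(tokens[i:i + BLOCK])
        if key not in node.children:
            break
        node = node.children[key]
        matched += BLOCK
    return matched, node

def insert(root, tokens):
    """Register a request's token blocks so later requests can reuse the prefix."""
    node = root
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        key = tuple(tokens[i:i + BLOCK])
        node = node.children.setdefault(key, PrefixNode())
    return node

root = PrefixNode()
insert(root, [1, 2, 3, 4, 5, 6, 7, 8])            # first request populates the trie
print(lookup(root, [1, 2, 3, 4, 9, 9, 9, 9])[0])  # second request reuses 4 cached tokens
```

A real serving engine would attach actual KV pages to each node and only prefill the unmatched suffix, which is where the first-token latency savings come from.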
Efficient inference optimization has reduced LLM serving costs by 10-100x compared to naive implementations, with innovations in memory management, speculative execution, and batching strategies making it economically viable to serve frontier models to billions of users at interactive latencies.