Efficient Inference (KV Cache, Speculative Decoding, Continuous Batching) is the set of systems-level optimizations that reduce the latency and cost, and raise the throughput, of serving large language model predictions in production, transforming LLM deployment from a prohibitively expensive endeavor into a scalable service capable of handling millions of concurrent requests.
The Inference Bottleneck
LLM inference is fundamentally memory-bandwidth-bound during autoregressive decoding: each generated token requires reading the entire set of model weights from GPU memory, yet performs very little computation per byte loaded. For a 70B-parameter model in FP16, generating one token reads ~140 GB of weights but performs only ~140 GFLOPs of compute, far below the GPU's capacity. The arithmetic intensity is therefore roughly 1 FLOP per byte, while modern GPUs can sustain on the order of a few hundred FLOPs per byte of memory bandwidth. This makes low-batch serving cost proportional to memory bandwidth rather than compute throughput, as the back-of-the-envelope sketch below illustrates.
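A minimal roofline-style estimate makes the gap concrete. The model size, bandwidth, and peak-FLOPS figures passed in below are illustrative assumptions, not measurements of any particular GPU.

```python
# Back-of-the-envelope roofline check for autoregressive decode.
# All hardware numbers below are assumed, H100-like figures for illustration.

def decode_token_estimate(n_params: float, bytes_per_param: float,
                          mem_bw_gbs: float, peak_tflops: float) -> None:
    weight_bytes = n_params * bytes_per_param        # bytes streamed per token
    flops = 2 * n_params                             # ~2 FLOPs per parameter (multiply-add)
    intensity = flops / weight_bytes                 # arithmetic intensity, FLOPs per byte
    t_memory = weight_bytes / (mem_bw_gbs * 1e9)     # latency if memory-bound
    t_compute = flops / (peak_tflops * 1e12)         # latency if compute-bound
    print(f"arithmetic intensity ~ {intensity:.2f} FLOPs/byte")
    print(f"memory-bound latency  ~ {t_memory * 1e3:.1f} ms/token")
    print(f"compute-bound latency ~ {t_compute * 1e3:.3f} ms/token")

# 70B model in FP16 on an assumed GPU with ~3.3 TB/s bandwidth and ~1000 TFLOPS FP16
decode_token_estimate(n_params=70e9, bytes_per_param=2,
                      mem_bw_gbs=3300, peak_tflops=1000)
```

At batch size 1 the memory-bound estimate dominates by orders of magnitude, which is exactly why the techniques in this section focus on amortizing weight reads across many tokens and requests.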
KV Cache Mechanism and Optimization
- Cache purpose: During autoregressive generation, each new token's attention computation requires key and value vectors from all previous tokens; the KV cache stores these to avoid redundant recomputation
- Memory consumption: KV cache size = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × dtype_bytes; for a LLaMA-2-70B-style configuration (80 layers, 8 KV heads via GQA, head_dim 128) at 4K context in FP16, this is roughly 1.3 GB per request (see the sizing sketch after this list)
- PagedAttention (vLLM): Manages KV cache as virtual memory pages, eliminating fragmentation and enabling 2-4x more concurrent requests; pages allocated on-demand and freed when sequences complete
- KV cache compression: Quantizing the KV cache to INT8 or INT4 cuts its memory by 2x or 4x relative to FP16 with minimal quality impact; KIVI and GEAR push KV quantization down to around 2 bits
- Multi-Query/Grouped-Query Attention: Reduces KV cache size by sharing key-value heads across query heads; the reduction equals the ratio of query heads to KV heads (e.g., 8x for GQA with 8 KV-head groups on a 64-head model, and up to the full query-head count for MQA's single shared KV head)
- Sliding window eviction: Discard oldest KV entries beyond a window size; StreamingLLM maintains initial attention sink tokens plus recent window for infinite-length generation
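A small sketch of the two ideas above, assuming PyTorch and illustrative LLaMA-2-70B-style shape parameters: kv_cache_bytes implements the sizing formula from the list (using the number of KV heads, so it covers MHA, GQA, and MQA alike), and cached_attention_step shows why the cache exists, appending only the new token's K/V and attending over the stored history instead of recomputing it.

```python
import torch

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    # 2x accounts for keys and values
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# LLaMA-2-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16, 4K context
print(kv_cache_bytes(80, 8, 128, 4096, 1) / 1e9, "GB per request")

def cached_attention_step(q, k_new, v_new, cache):
    """Append this step's K/V to the cache and attend over the full history."""
    cache["k"] = torch.cat([cache["k"], k_new], dim=-2)   # [B, H, T, D]
    cache["v"] = torch.cat([cache["v"], v_new], dim=-2)
    scores = q @ cache["k"].transpose(-1, -2) / cache["k"].shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ cache["v"]

# Toy usage: one decode step with an initially empty cache
B, H, D = 1, 8, 128
cache = {"k": torch.zeros(B, H, 0, D), "v": torch.zeros(B, H, 0, D)}
out = cached_attention_step(torch.randn(B, H, 1, D), torch.randn(B, H, 1, D),
                            torch.randn(B, H, 1, D), cache)
```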
Speculative Decoding
- Core idea: Use a small draft model to generate k candidate tokens quickly, then verify all k tokens in parallel with the large target model in a single forward pass
- Acceptance criterion: A draft token x is accepted with probability min(1, p_target(x) / p_draft(x)); on rejection, a replacement token is sampled from the normalized residual distribution max(0, p_target - p_draft) (see the verification sketch after this list)
- Speedup: 2-3x faster decoding with zero quality degradation; under this acceptance rule the output distribution is mathematically identical to sampling from the target model alone
- Draft model selection: The draft model must be significantly faster (e.g., a 7B model drafting for a 70B target) while sharing the target's vocabulary and producing reasonable approximations of its distribution
- Self-speculative decoding: Uses early exit from the target model's own layers as the draft, avoiding the need for a separate draft model
- Medusa: Adds multiple prediction heads to the target model that predict future tokens in parallel, achieving speculative decoding without a separate draft model
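A minimal sketch of the verification step for a single draft token, following the standard speculative sampling accept/resample rule; the toy distributions, vocabulary size, and the verify_draft helper name are illustrative assumptions.

```python
import numpy as np

def verify_draft(p: np.ndarray, q: np.ndarray, x: int, rng) -> int:
    """Return the accepted token given target probs p, draft probs q, and draft token x."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                                  # accept the draft token
    residual = np.maximum(p - q, 0.0)             # corrected (residual) distribution
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)         # resample from the residual

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])    # target model distribution over a toy 3-token vocab
q = np.array([0.2, 0.6, 0.2])    # draft model distribution
x = 1                            # token proposed by the draft model
print(verify_draft(p, q, x, rng))
```

In a full implementation the target model scores all k draft tokens in one forward pass, and this rule is applied position by position until the first rejection.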
Continuous Batching
- Problem with static batching: Naive batching waits until all sequences in a batch finish before starting new requests, wasting GPU cycles on padding for shorter sequences
- Iteration-level scheduling: Continuous batching (Orca, vLLM) inserts new requests into the batch as soon as existing sequences complete, maximizing GPU utilization (a toy scheduler simulation follows this list)
- Preemption: Lower-priority or longer requests can be preempted (KV cache swapped to CPU) to serve higher-priority incoming requests
- Throughput gains: Continuous batching achieves 10-20x higher throughput than static batching for variable-length workloads
- Prefill-decode disaggregation: Separate GPU pools for compute-intensive prefill (processing the prompt) and memory-bound decode (generating tokens), optimizing each phase independently
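A toy simulation of iteration-level scheduling, assuming made-up request lengths and a hypothetical max_batch slot count; it only models when requests enter and leave the batch, not the GPU work itself. Finished sequences free their slot at the end of every decode step, and waiting requests are admitted on the very next iteration rather than after the whole batch drains.

```python
from collections import deque

def continuous_batching(request_lengths, max_batch=4):
    waiting = deque(request_lengths)   # tokens remaining to generate, per queued request
    running, steps = [], 0
    while waiting or running:
        while waiting and len(running) < max_batch:    # admit new work every iteration
            running.append(waiting.popleft())
        running = [r - 1 for r in running if r > 1]    # one decode step per active sequence
        steps += 1
    return steps

print(continuous_batching([8, 2, 2, 2, 6, 3]))
```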
Model Parallelism for Serving
- Tensor parallelism: Split weight matrices across GPUs within a node; per-layer all-reduce synchronization adds latency but enables serving models larger than single-GPU memory (a simulated example follows this list)
- Pipeline parallelism: Distribute layers across GPUs; micro-batching hides pipeline bubbles; suitable for multi-node serving
- Expert parallelism for MoE: Route tokens to experts on different GPUs; all-to-all communication overhead managed by high-bandwidth interconnects
- Quantization: GPTQ and AWQ quantize weights to 4 bits with minimal accuracy loss (GGUF is the file format llama.cpp-style runtimes use to package such low-bit weights), cutting weight memory roughly 4x versus FP16 and directly speeding up memory-bound decoding
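A sketch of the row-parallel linear layer used in tensor parallelism, simulated on CPU with NumPy: the input dimension is split across two hypothetical devices, each computes a partial matmul, and summing the partials stands in for the all-reduce. All shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # activations [batch, d_in]
w = rng.standard_normal((8, 16))       # full weight  [d_in, d_out]

# Row-parallel split: each simulated device holds half of the input dimension.
x_shards = np.split(x, 2, axis=1)
w_shards = np.split(w, 2, axis=0)
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]   # per-device partial matmuls
y = sum(partials)                                            # stands in for the all-reduce

assert np.allclose(y, x @ w)           # matches the single-device result
```

The assert shows why the all-reduce is a pure sum: each shard contributes an exact partial product, so correctness is preserved while per-device weight memory is halved.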
Serving Frameworks and Infrastructure
- vLLM: PagedAttention-based serving engine with continuous batching, tensor parallelism, and prefix caching; standard for open-source LLM serving
- TensorRT-LLM (NVIDIA): Optimized inference engine with INT4/INT8 quantization, in-flight batching, and custom CUDA kernels for maximum GPU utilization
- SGLang: Compiler-based approach with RadixAttention for automatic KV cache sharing across requests with common prefixes
- Prefix caching: Reuse the KV cache for shared prompt prefixes across requests (system prompts, few-shot examples), reducing first-token latency by 5-10x for repeated prefixes (a toy prefix-reuse sketch follows this list)
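A toy sketch of prefix reuse via a trie keyed by blocks of token ids, in the spirit of RadixAttention and vLLM-style prefix caching; the BLOCK size, the PrefixNode class, and the lookup/insert helpers are illustrative assumptions, not any framework's API.

```python
BLOCK = 4  # tokens per cached KV block (illustrative)

class PrefixNode:
    def __init__(self):
        self.children = {}       # maps a token block (tuple) to a child node
        self.kv_block = None     # handle to the cached KV pages for this block

def lookup(root, tokens):
    """Return (cached_prefix_len, last_matched_node) for a new request."""
    node, matched = root, 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        key = tuple(tokens[i:i + BLOCK])
        if key not in node.children:
            break
        node = node.children[key]
        matched += BLOCK
    return matched, node

def insert(root, tokens):
    """Register a request's token blocks so later requests can reuse the prefix."""
    node = root
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        key = tuple(tokens[i:i + BLOCK])
        node = node.children.setdefault(key, PrefixNode())
    return node

root = PrefixNode()
insert(root, [1, 2, 3, 4, 5, 6, 7, 8])            # first request populates the trie
print(lookup(root, [1, 2, 3, 4, 9, 9, 9, 9])[0])  # second request reuses 4 cached tokens
```

A real serving engine would attach actual KV pages to each node and only prefill the unmatched suffix, which is where the first-token latency savings come from.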
Efficient inference optimization has reduced LLM serving costs by 10-100x compared to naive implementations, with innovations in memory management, speculative execution, and batching strategies making it economically viable to serve frontier models to billions of users at interactive latencies.