The KV cache is the memory buffer that stores previously computed key and value tensors during autoregressive LLM inference. Caching these intermediate results avoids redundant computation, but the buffer consumes GPU memory that grows with sequence length and batch size, so cache management is critical for efficient serving.
What Is KV Cache?
- Definition: Cached key-value pairs from attention computation.
- Purpose: Avoid recomputing previous token representations each step.
- Growth: Linear with sequence length × layers × batch size.
- Challenge: Major memory bottleneck for long contexts and batching.
Why KV Cache Matters
- Efficiency: Without caching, every step reprocesses the whole prefix, so attention costs O(n²) per generated token; with caching it drops to O(n).
- Memory: Can exceed model weights for long sequences.
- Throughput: KV cache size limits batch size.
- Long Context: 100K+ contexts need cache optimization.
- Cost: Memory management directly impacts inference cost.
How KV Cache Works
Autoregressive Generation:
Without KV Cache (naive):
Step 1: Compute K,V for [token1]
Step 2: Recompute K,V for [token1, token2]
Step 3: Recompute K,V for [token1, token2, token3]
...each step recomputes everything!
With KV Cache:
Step 1: Compute K,V for [token1], cache it
Step 2: Compute K,V for [token2] only, append to cache
Step 3: Compute K,V for [token3] only, append to cache
...only compute new token each step
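The two loops above can be sketched in plain Python. This is a minimal single-head illustration, not a real inference kernel: the helper names (`project`, `attend`, `decode_naive`, `decode_cached`) and the random weights are assumptions made for the example. It counts K/V projections to show the saved work and produces the same outputs either way.

```python
import math
import random

def project(x, W):
    # x: [d_in], W: [d_in][d_out] -> [d_out]
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, K, V):
    # The newest token's query attends over every available key/value row.
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(len(V[0]))]

def decode_naive(X, Wq, Wk, Wv):
    outs, projections = [], 0
    for t in range(1, len(X) + 1):
        K = [project(x, Wk) for x in X[:t]]  # recomputed for the whole prefix
        V = [project(x, Wv) for x in X[:t]]
        projections += 2 * t
        outs.append(attend(project(X[t - 1], Wq), K, V))
    return outs, projections

def decode_cached(X, Wq, Wk, Wv):
    K_cache, V_cache, outs, projections = [], [], [], 0
    for x in X:
        K_cache.append(project(x, Wk))  # only the new token is projected
        V_cache.append(project(x, Wv))
        projections += 2
        outs.append(attend(project(x, Wq), K_cache, V_cache))
    return outs, projections

random.seed(0)
d, n = 8, 6  # toy embedding size and number of tokens
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
Wq, Wk, Wv = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(d)] for _ in range(3))

naive_out, naive_proj = decode_naive(X, Wq, Wk, Wv)
cached_out, cached_proj = decode_cached(X, Wq, Wk, Wv)
print(naive_proj, cached_proj)
```

For six tokens the naive loop performs 2 × (1 + 2 + ... + 6) K/V projections while the cached loop performs 2 × 6, and the gap widens quadratically with sequence length.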
Memory Layout:
┌─────────────────────────────────────────┐
│ KV Cache │
├─────────────────────────────────────────┤
│ Layer 1: K [batch, heads, seq, head_dim]│
│ V [batch, heads, seq, head_dim]│
├─────────────────────────────────────────┤
│ Layer 2: K [...], V [...] │
├─────────────────────────────────────────┤
│ ... │
├─────────────────────────────────────────┤
│ Layer L: K [...], V [...] │
└─────────────────────────────────────────┘
Memory Calculation
KV Cache Size = 2 × L × H × S × B × dtype_size
Where:
- 2 = keys and values
- L = number of layers
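- H = hidden dimension (for standard multi-head attention; with GQA or MQA, use kv_heads × head_dim instead)
- S = sequence length
- B = batch size
- dtype_size = bytes per element: FP16 (2 bytes) or FP8 (1 byte)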
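Example (Llama-70B dimensions, 4K context, batch=1, FP16, assuming full multi-head attention):
= 2 × 80 layers × 8192 hidden × 4096 seq × 1 × 2 bytes
= 10.7 GB per sequence!
Batch of 8 = 86 GB just for KV cache
Note: Llama 2 70B actually uses GQA with 8 KV heads (KV width 8 × 128 = 1024), which cuts this to about 1.3 GB per sequence.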
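The arithmetic above is easy to get wrong by hand, so here is a small calculator as a sketch. The function name `kv_cache_bytes` is hypothetical, and it takes `kv_heads × head_dim` in place of H so the same formula covers MHA, GQA, and MQA. The head counts are assumptions based on the published Llama 2 70B configuration (64 query heads of dimension 128, 8 KV heads).

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # Leading 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Full multi-head attention: 64 KV heads x 128 = 8192, matching the example above.
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096, batch=1)

# Grouped-query attention with 8 KV heads (assumed Llama 2 70B setting).
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096, batch=1)

print(f"MHA: {mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.2f} GB")
```

Multiplying either figure by the batch size gives the total, since every sequence in the batch holds its own cache.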
KV Cache Optimizations
PagedAttention (vLLM):
Traditional: Contiguous memory per sequence (fragmentation)
PagedAttention: Memory in fixed-size pages (like OS virtual memory)
Benefits:
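- Almost no fragmentation (waste is limited to the last partially filled page of each sequence)
- Share pages across requests (prefix caching)
- Dynamic allocation
- 2-4× higher throughput, as reported by the vLLM authors against earlier serving systems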
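The paging idea can be illustrated with a toy block-table allocator. This is a sketch of the bookkeeping only, not vLLM's implementation: the class name `PagedKVAllocator` and its methods are hypothetical, and no tensors are stored.

```python
class PagedKVAllocator:
    """Toy page table: maps each sequence to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # current blocks are full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        # Return where this token's K/V would live: (physical block, slot in block).
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id):
        # A finished sequence hands its blocks back for immediate reuse.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("a")
blocks_used = len(alloc.tables["a"])
print(blocks_used, len(alloc.free))
```

A 17-token sequence occupies two 16-slot blocks, so at most one partially filled block is wasted per sequence, and blocks need not be contiguous in memory.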
Quantized KV Cache:
Store cache in INT8 or INT4 instead of FP16
Memory reduction: 2-4×
Quality impact: Usually small for INT8; INT4 can degrade quality more and typically needs finer-grained scaling
FP16: 16 bits/value
INT8: 8 bits/value (2× reduction)
INT4: 4 bits/value (4× reduction)
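As a rough illustration of what storing the cache in INT8 involves, the sketch below applies symmetric quantization with a single scale per vector. The function names are hypothetical, and real frameworks generally use per-channel or per-token scales and packed storage, so treat this as the idea rather than any library's scheme.

```python
def quantize_int8(values):
    # One scale per vector; fall back to 1.0 when the vector is all zeros.
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

values = [0.5, -1.27, 0.03, 1.0]  # stand-in for one row of a key tensor
q, scale = quantize_int8(values)
restored = dequantize_int8(q, scale)
max_error = max(abs(v - r) for v, r in zip(values, restored))
print(q, max_error)
```

The rounding error per element is bounded by half the scale, which is why outliers (which inflate the scale) are the main source of quality loss.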
Grouped Query Attention (GQA):
Standard MHA: heads_k = heads_q = 32
GQA: heads_k = 8, heads_q = 32
KV cache 4× smaller with GQA
Many recent models use GQA (e.g., Llama 2 70B, Mistral 7B)
Multi-Query Attention (MQA):
MQA: heads_k = 1, heads_q = 32
Even smaller cache, some quality trade-off
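The head sharing behind GQA and MQA can be sketched as a simple index mapping. The helper names are hypothetical, and the grouping of consecutive query heads onto one KV head is an assumption (a common convention, but not the only possible one).

```python
def kv_head_for(q_head, n_q_heads, n_kv_heads):
    # Consecutive query heads share one KV head (assumed grouping convention).
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def cache_fraction(n_q_heads, n_kv_heads):
    # KV cache size relative to full multi-head attention.
    return n_kv_heads / n_q_heads

gqa_map = [kv_head_for(h, 32, 8) for h in range(32)]  # GQA: 32 query heads, 8 KV heads
mqa_map = [kv_head_for(h, 32, 1) for h in range(32)]  # MQA: every query head uses KV head 0
print(gqa_map[:8], cache_fraction(32, 8), cache_fraction(32, 1))
```

Only the KV heads are cached, so the cache shrinks by the ratio of KV heads to query heads while the number of query heads stays unchanged.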
Prefix Caching:
System prompt: "You are a helpful assistant..."
This is the same across requests → compute once, share KV
First request: Compute full KV for system prompt
Later requests: Reuse cached system prompt KV
Savings: Skip prefill for common prompts
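A minimal sketch of the reuse logic follows, keyed on the exact token ids of the shared prefix. `PrefixCache` and `fake_prefill` are hypothetical names; `fake_prefill` stands in for the real prefill pass and returns a placeholder rather than actual K/V tensors. Production systems typically match prefixes block by block rather than as a whole.

```python
class PrefixCache:
    def __init__(self):
        self.store = {}  # tuple of token ids -> cached KV for that prefix
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prefix_ids, compute_kv):
        key = tuple(prefix_ids)
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = compute_kv(prefix_ids)  # prefill runs only on a miss
        return self.store[key]

prefill_calls = []

def fake_prefill(token_ids):
    # Placeholder for the expensive prefill pass over the prefix.
    prefill_calls.append(len(token_ids))
    return {"num_tokens": len(token_ids)}

cache = PrefixCache()
system_prompt = [101, 7, 42, 9]  # made-up token ids for a system prompt
first = cache.get_or_compute(system_prompt, fake_prefill)
second = cache.get_or_compute(system_prompt, fake_prefill)
print(len(prefill_calls), cache.hits, cache.misses)
```

The second request returns the same cached object without running prefill again, which is where the latency saving comes from.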
Memory Comparison
Optimization           | Memory vs. baseline | Implementation
-----------------------|---------------------|------------------
Baseline FP16 (MHA)    | 100%                | Standard
INT8 KV                | 50%                 | Most frameworks
INT4 KV                | 25%                 | Some frameworks
GQA (8 of 32 KV heads) | 25%                 | Model architecture
GQA + INT8             | 12.5%               | Combined
PagedAttention         | Same per token      | vLLM (removes most fragmentation waste)
Sliding Window Attention
Instead of attending to full history:
- Only attend to last W tokens
- KV cache capped at W entries
- Used in Mistral 7B (W=4096)
Trade-off: Bounded memory vs. long-range attention
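The bounded cache can be sketched with a fixed-length deque that evicts the oldest entry once W tokens are stored. `SlidingWindowKVCache` is a hypothetical name, and integers stand in for per-token K/V tensors; real implementations usually use a preallocated rolling buffer indexed by position modulo W instead.

```python
from collections import deque

class SlidingWindowKVCache:
    def __init__(self, window):
        # deque(maxlen=...) drops the oldest item automatically when full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

window_cache = SlidingWindowKVCache(window=4)
for position in range(10):
    window_cache.append(position, position * 10)  # integers stand in for K/V tensors
print(len(window_cache), list(window_cache.keys))
```

After ten tokens only the last four remain, so memory stays constant no matter how long generation runs, at the cost of direct attention to anything older than W tokens.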
KV cache management is a critical bottleneck in LLM inference. As context windows grow to 100K+ tokens and users expect real-time responses, efficient cache strategies determine whether serving is practical and affordable.