The KV cache is the memory buffer that stores previously computed key and value tensors during autoregressive LLM inference. Caching these intermediate results avoids redundant computation, but the buffer consumes GPU memory that grows linearly with sequence length and batch size, so cache management is critical for efficient serving.

What Is KV Cache?

Why KV Cache Matters

How KV Cache Works

Autoregressive Generation:

Without KV Cache (naive):
Step 1: Compute K,V for [token1]
Step 2: Recompute K,V for [token1, token2]
Step 3: Recompute K,V for [token1, token2, token3]
        ...each step recomputes everything!

With KV Cache:
Step 1: Compute K,V for [token1], cache it
Step 2: Compute K,V for [token2] only, append to cache
Step 3: Compute K,V for [token3] only, append to cache
        ...only compute new token each step
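
The two step lists above can be checked with a toy single-head attention layer in pure Python. This is a minimal sketch, not a real model: the weight matrices `Wq`, `Wk`, `Wv`, the helper names, and the random "token" vectors are all assumptions made for illustration. It counts how many K,V projections each strategy performs and lets you confirm that both produce the same attention outputs.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over all available keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def decode_naive(tokens, Wq, Wk, Wv):
    """Recompute K,V for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [matvec(Wk, x) for x in prefix]
        values = [matvec(Wv, x) for x in prefix]
        projections += len(prefix)
        outputs.append(attend(matvec(Wq, prefix[-1]), keys, values))
    return outputs, projections

def decode_cached(tokens, Wq, Wk, Wv):
    """Compute K,V for the new token only and append to the cache."""
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        projections += 1
        outputs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outputs, projections

# Toy setup: 6 "tokens" of dimension 4 with random weights (illustrative only).
random.seed(0)
dim = 4
rand_matrix = lambda: [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]

_, naive_count = decode_naive(tokens, Wq, Wk, Wv)
_, cached_count = decode_cached(tokens, Wq, Wk, Wv)
print(naive_count, cached_count)  # 21 6
```

For n tokens the naive loop performs n(n+1)/2 projections while the cached loop performs n, which is where the savings come from.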

Memory Layout:

┌─────────────────────────────────────────┐
│              KV Cache                   │
├─────────────────────────────────────────┤
│ Layer 1: K [batch, heads, seq, head_dim]│
│          V [batch, heads, seq, head_dim]│
├─────────────────────────────────────────┤
│ Layer 2: K [...], V [...]               │
├─────────────────────────────────────────┤
│ ...                                     │
├─────────────────────────────────────────┤
│ Layer L: K [...], V [...]               │
└─────────────────────────────────────────┘

Memory Calculation

KV Cache Size = 2 × L × H × S × B × dtype_size

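Where:
- 2 = keys and values
- L = number of layers
- H = hidden dimension (valid for standard multi-head attention; with GQA or MQA, use kv_heads × head_dim instead)
- S = sequence length
- B = batch size
- dtype_size = bytes per element: 2 for FP16, 1 for FP8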

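Example (Llama-70B dimensions, 4K context, batch=1, FP16, assuming standard multi-head attention):
= 2 × 80 layers × 8192 hidden × 4096 seq × 1 × 2 bytes
= 10.7 GB per sequence!

Batch of 8 = 86 GB just for KV cache

Note: the released Llama 2 and Llama 3 70B models use GQA with 8 KV heads of dimension 128, so their actual cache is 8× smaller: about 1.3 GB per 4K sequence.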
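
The formula above is easy to turn into a small calculator. The sketch below is illustrative: the function name `kv_cache_bytes` is hypothetical, and it takes the KV width as `n_kv_heads * head_dim` so that the same function covers both the multi-head case (where that product equals the hidden dimension) and reduced-KV-head variants. The 64-head and 8-head configurations are assumed example values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, dtype_bytes):
    """Bytes needed for keys and values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Multi-head attention assumption: 64 KV heads x 128 dims = 8192 hidden.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128,
                     seq_len=4096, batch_size=1, dtype_bytes=2)

# Same dimensions, but with an assumed 8 KV heads (grouped-query attention).
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                     seq_len=4096, batch_size=1, dtype_bytes=2)

print(f"{mha / 1e9:.1f} GB")   # 10.7 GB
print(f"{gqa / 1e9:.2f} GB")   # 1.34 GB
```

Because every factor enters multiplicatively, doubling the context length or the batch size doubles the cache, while halving the element size halves it.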

KV Cache Optimizations

PagedAttention (vLLM):

Traditional: Contiguous memory per sequence (fragmentation)
PagedAttention: Memory in fixed-size pages (like OS virtual memory)

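Benefits:
- Near-zero fragmentation (waste is limited to the last, partially filled page of each sequence)
- Share pages across requests (prefix caching)
- Dynamic allocation as sequences grow
- 2-4× higher throughput reported by the vLLM authors over earlier serving systems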
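
The paging idea can be sketched as a block table that maps each sequence to a list of fixed-size physical blocks. This is a simplified illustration and not vLLM's actual API: the class `PagedKVAllocator` and its methods are hypothetical, and page sharing and copy-on-write are left out.

```python
class PagedKVAllocator:
    """Toy block-table allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> [block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            # All owned blocks are full (or none exist yet): grab a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):                      # 17 tokens need two 16-token blocks
    allocator.append_token("request-a")
print(len(allocator.block_tables["request-a"]))  # 2
```

Blocks are allocated only when a sequence actually grows into them, so memory is never reserved up front for a maximum length that the request may not reach.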

Quantized KV Cache:

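Store cache in INT8 or INT4 instead of FP16
Memory reduction: 2-4× (slightly less in practice, since scale factors must be stored too)
Quality impact: Usually small for INT8; INT4 tends to need finer-grained scaling and should be validated per model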

FP16: 16 bits/value
INT8: 8 bits/value (2× reduction)
INT4: 4 bits/value (4× reduction)
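
To make the bit-width trade-off concrete, here is a minimal sketch of symmetric INT8 quantization applied to one key vector. The function names and the one-scale-per-vector scheme are assumptions for illustration; real frameworks store packed integer tensors and differ in how they group scales.

```python
def quantize_int8(values):
    """Symmetric quantization: returns (integer codes in [-127, 127], scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate floating-point values from the integer codes."""
    return [c * scale for c in codes]

key_vector = [0.5, -1.27, 0.03, 1.0]        # illustrative values
codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))
print(codes, max_error <= scale / 2 + 1e-12)
```

Each stored value shrinks from 16 bits to 8, at the cost of one extra scale per vector and a bounded rounding error.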

Grouped Query Attention (GQA):

Standard MHA: heads_k = heads_q = 32
GQA: heads_k = 8, heads_q = 32

KV cache 4× smaller with GQA
Most modern models use GQA

Multi-Query Attention (MQA):

MQA: heads_k = 1, heads_q = 32
Even smaller cache, some quality trade-off
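
The head counts above translate directly into cache size, because only key/value heads are cached. The sketch below assumes the common convention that consecutive query heads share one KV head; both function names are hypothetical.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the KV head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads   # query heads per KV head
    return q_head // group_size

def cache_ratio_vs_mha(n_q_heads, n_kv_heads):
    """KV cache size relative to multi-head attention with the same query heads."""
    return n_kv_heads / n_q_heads

print(cache_ratio_vs_mha(32, 32))  # 1.0      (MHA)
print(cache_ratio_vs_mha(32, 8))   # 0.25     (GQA)
print(cache_ratio_vs_mha(32, 1))   # 0.03125  (MQA)
```

With 32 query heads and 8 KV heads, query heads 0 to 3 read KV head 0, heads 4 to 7 read KV head 1, and so on; with MQA every query head reads the single KV head.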

Prefix Caching:

System prompt: "You are a helpful assistant..."
This is the same across requests → compute once, share KV

First request: Compute full KV for system prompt
Later requests: Reuse cached system prompt KV
Savings: Skip prefill for the shared prefix; only the request-specific suffix still needs prefill
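
The compute-once, reuse-later flow can be sketched as a lookup table keyed by the prefix tokens. Everything here is hypothetical: `PrefixKVCache` is an illustrative class, and `fake_prefill` stands in for the expensive prefill pass that would produce real K,V tensors.

```python
class PrefixKVCache:
    """Toy prefix cache: maps a token prefix to its precomputed KV state."""

    def __init__(self, compute_kv):
        self._compute_kv = compute_kv   # stand-in for the model's prefill pass
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, prefix_tokens):
        key = tuple(prefix_tokens)      # lists are not hashable, tuples are
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._compute_kv(prefix_tokens)
        return self._store[key]


prefill_calls = []

def fake_prefill(tokens):
    """Placeholder for prefill; records the call and returns dummy KV data."""
    prefill_calls.append(len(tokens))
    return {"num_tokens": len(tokens)}

cache = PrefixKVCache(fake_prefill)
system_prompt = [101, 2023, 2003, 1037]   # made-up token ids

first = cache.get(system_prompt)    # miss: runs prefill
second = cache.get(system_prompt)   # hit: reuses the stored KV
print(len(prefill_calls), cache.hits, cache.misses)  # 1 1 1
```

A production system would also need an eviction policy and would typically match at page granularity rather than on whole prompts.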

Memory Comparison

Optimization           | Memory  | Implementation
-----------------------|---------|------------------
Baseline FP16 (MHA)    | 100%    | Standard
INT8 KV                | 50%     | Most frameworks
INT4 KV                | 25%     | Some frameworks
GQA (8 of 32 KV heads) | 25%     | Model architecture
GQA + INT8             | 12.5%   | Combined
PagedAttention         | ~60-80% | vLLM (less fragmentation)

Sliding Window Attention

Instead of attending to full history:
- Only attend to last W tokens
- KV cache capped at W entries
- Used in Mistral 7B (W=4096)

Trade-off: Bounded memory vs. long-range attention
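
A capped cache of this kind can be sketched with a fixed-length queue that drops the oldest entry once W entries are stored. The class name is hypothetical and the entries are plain placeholders; production implementations typically use a preallocated ring buffer of tensors instead.

```python
from collections import deque

class SlidingWindowKVCache:
    """Toy KV cache that keeps only the most recent `window` entries."""

    def __init__(self, window):
        # deque(maxlen=...) discards the oldest item automatically when full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


window_cache = SlidingWindowKVCache(window=4)
for position in range(10):
    # Placeholder entries; a real cache would hold per-token K and V tensors.
    window_cache.append(f"k{position}", f"v{position}")

print(len(window_cache), list(window_cache.keys))  # 4 ['k6', 'k7', 'k8', 'k9']
```

Memory stays constant no matter how long generation runs, but tokens older than the window are no longer directly visible to attention.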

KV cache management is a central bottleneck in LLM inference. As context windows grow to 100K+ tokens and users expect real-time responses, efficient cache strategies determine whether serving is practical and affordable.
