
AI Factory Glossary

103 technical terms and definitions


kv cache,key value,memory

The key-value (KV) cache stores precomputed key and value tensors from previous tokens during autoregressive generation, avoiding redundant computation as sequences extend. Without caching, generating token N requires recomputing attention for all N-1 previous tokens—O(N²) complexity per sequence. With KV cache, only the new token's Q×K and attention×V computations are needed—O(N) per token. However, KV cache grows linearly with sequence length and batch size, often becoming the dominant memory consumer. For LLaMA-2-70B with 80 layers, 64 heads, and 128-dimensional heads, each token requires 2 (K+V) × 80 × 64 × 128 × 2 bytes = 2.6MB in FP16. A batch of 8 sequences at 4K context consumes 85GB—exceeding single GPU memory. Optimization techniques include: grouped-query attention (GQA) sharing K/V across heads (8x reduction), KV cache quantization to INT8 or INT4, sliding window attention limiting cache to recent tokens, and PagedAttention for memory-efficient management. The KV cache fundamentally shapes inference system design, determining maximum batch sizes, context lengths, and overall throughput. Efficient KV cache management is essential for production LLM serving.
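The memory arithmetic above can be sketched as a small helper (the function name is illustrative, not from any library; numbers are the LLaMA-2-70B figures from the text):

```python
# Sketch: KV cache memory math for LLaMA-2-70B (80 layers, 64 heads,
# 128-dim heads, FP16 = 2 bytes), assuming full multi-head attention.
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_param):
    # Factor of 2 accounts for storing both K and V tensors.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_param

per_token = kv_cache_bytes_per_token(80, 64, 128, 2)
print(per_token / 1e6)  # ~2.6 MB per token, as stated above

batch_size, context_len = 8, 4096
total = per_token * batch_size * context_len
print(total / 1e9)      # ~85.9 GB for a batch of 8 at 4K context
```

With GQA's 8x reduction in K/V heads, the same call with `num_kv_heads=8` drops the total to roughly 10.7 GB.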

kv cache,llm architecture

KV cache stores computed key-value pairs to accelerate autoregressive LLM inference. **How it works**: During generation, each token attends to all previous tokens. Rather than recomputing K and V for all past tokens, cache and reuse them; compute K and V only for the new token. **Memory cost**: The cache grows linearly with sequence length and batch size: batch_size × num_layers × 2 × seq_len × hidden_dim × precision_bytes. For a 70B model with 32K context, this can exceed 40GB. **Optimization techniques**: KV cache quantization (FP8, INT8), paged attention (vLLM) for dynamic allocation, sliding window attention for bounded memory, grouped-query attention to reduce the number of K/V heads, and shared KV layers. **Implementation**: Pre-allocate for the maximum sequence length or grow dynamically; store per layer; handle variable batch sizes. **Impact**: Enables 10-100x faster generation versus naive recomputation; critical for production LLM serving. **Memory-speed trade-off**: Larger caches enable faster generation but limit batch size; optimize based on latency vs. throughput requirements.
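The cache-and-reuse pattern described above can be shown in a minimal single-layer, single-head decode loop (an illustrative sketch only; the random `Wq`/`Wk`/`Wv` matrices stand in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """Attend the new token x over all cached positions: O(seq_len) work,
    instead of recomputing K and V for every past token."""
    q = x @ Wq
    k_cache.append(x @ Wk)   # compute K, V for the new position only
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)    # (seq_len, d)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V       # attention output for the new token

for _ in range(4):
    out = decode_step(rng.standard_normal(d))
print(len(k_cache))  # 4 cached positions after 4 decode steps
```

A real implementation stores one such cache per layer and per attention head, pre-allocated as tensors rather than Python lists.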

kv cache,prefix caching,cache

**KV Cache and Prefix Caching**

**What is KV Cache?**

During autoregressive generation, the model computes key (K) and value (V) tensors for attention. Caching these avoids recomputation on each new token.

**How KV Cache Works**

**Without Cache**

Every token generation recomputes attention for the entire sequence:

```
Token 1: Compute K,V for position 0
Token 2: Compute K,V for positions 0,1 (recompute!)
Token 3: Compute K,V for positions 0,1,2 (recompute!)
```

Quadratic complexity.

**With Cache**

Store K,V from previous steps:

```
Token 1: Compute K,V for position 0, cache it
Token 2: Retrieve cached K,V, compute only position 1, append to cache
Token 3: Retrieve cached K,V, compute only position 2, append to cache
```

Linear complexity for generation.

**KV Cache Size**

```
Cache size = 2 × num_layers × seq_len × num_kv_heads × head_dim × bytes_per_param
```

Example for Llama-2 7B (BF16, 4K context):

- 2 × 32 layers × 4096 seq × 32 heads × 128 head_dim × 2 bytes
- ≈ 2 GB per request

**Prefix Caching**

**The Problem**

Different requests often share common prefixes (system prompts):

```
Request 1: [System prompt] + [User query A]
Request 2: [System prompt] + [User query B]
```

Without caching: Recompute the system prompt KV for every request.
**With Prefix Caching**

Compute and cache the system prompt KV once, reuse it for all requests:

```
Prefix cache: System prompt KV (compute once)
Request 1: Reuse prefix + compute query A KV
Request 2: Reuse prefix + compute query B KV
```

**PagedAttention (vLLM)**

Instead of contiguous memory for the KV cache:

- Allocate in blocks (like virtual memory pages)
- Share blocks for common prefixes
- Efficient memory utilization

```
Physical blocks: [Block 0][Block 1][Block 2][Block 3]
Request 1 KV:    [   0   ][   1   ]
Request 2 KV:    [   0   ][   2   ]  (shares prefix block 0)
```

**Using Prefix Caching**

**vLLM**

```bash
# Enable automatic prefix caching
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching
```

**Benefits**

| Metric     | Without Prefix Cache | With Prefix Cache         |
|------------|----------------------|---------------------------|
| TTFT       | ~500ms               | ~50ms (for cached prefix) |
| Throughput | Baseline             | 2-3x higher               |
| Memory     | Per-request          | Shared                    |

Prefix caching is especially valuable for chat applications with consistent system prompts.
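The block-sharing idea can be sketched with toy data structures (hypothetical, not vLLM's actual internals): two requests whose block tables both point at the physical block holding the common system-prompt prefix.

```python
BLOCK_SIZE = 16  # tokens per physical block (toy value)

physical_blocks = {}   # block_id -> list of cached (K, V) entries
next_block_id = 0

def allocate_block():
    """Hand out the next free physical block, like a page allocator."""
    global next_block_id
    physical_blocks[next_block_id] = []
    next_block_id += 1
    return next_block_id - 1

# Compute the system-prompt KV once into block 0...
prefix_block = allocate_block()
physical_blocks[prefix_block] = [("K_sys", "V_sys")] * BLOCK_SIZE

# ...then each request's block table references it plus its own block.
request_1 = [prefix_block, allocate_block()]  # block table [0, 1]
request_2 = [prefix_block, allocate_block()]  # block table [0, 2]

assert request_1[0] == request_2[0]  # the prefix block is shared, not copied
```

Because only the block table entry is duplicated, N requests with the same system prompt pay the prefix's compute and memory cost once.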