LLM inference and serving is the process of deploying trained language models as production services: handling user requests by running model forward passes to generate text while optimizing for throughput, latency, and cost. It underpins scalable AI applications from chatbots to code assistants to enterprise automation.
What Is LLM Inference?
- Definition: Running a trained model to generate predictions/outputs.
- Process: Tokenize input → forward pass → sample next token → detokenize output.
- Mode: Autoregressive generation (one token at a time).
- Challenge: Optimize for speed, memory, and cost at scale.
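The autoregressive loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: `toy_next_token` is a hypothetical stand-in for a forward pass plus greedy sampling, and the integer token IDs and `eos_token` value are assumptions chosen so the behavior is easy to follow.

```python
from typing import Callable, List


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for a model forward pass plus greedy sampling.

    Returns (last token + 1) modulo 10 so the output is easy to predict.
    """
    return (tokens[-1] + 1) % 10


def generate(prompt: List[int],
             next_token: Callable[[List[int]], int],
             max_new_tokens: int,
             eos_token: int) -> List[int]:
    """Autoregressive loop: each new token is appended and fed back in."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one forward pass per generated token
        tokens.append(tok)
        if tok == eos_token:      # stop early at end-of-sequence
            break
    return tokens[len(prompt):]


output = generate([3, 4], toy_next_token, max_new_tokens=8, eos_token=9)
print(output)  # [5, 6, 7, 8, 9]
```

Because every generated token depends on all earlier ones, the loop cannot be parallelized across output positions, which is why decode speed matters so much for serving.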
Why Inference Optimization Matters
- Cost: For widely deployed models, inference usually dominates operational cost; figures of 90% or more are often cited.
- User Experience: Low latency critical for interactive applications.
- Scale: Handle thousands of concurrent users.
- Efficiency: Maximize throughput per GPU dollar.
- Competitive: Faster responses drive user preference.
Key Performance Metrics
Latency Metrics:
- TTFT (Time to First Token): Queueing plus prefill latency; how fast the response starts.
- TPOT (Time Per Output Token): Decode latency per token after the first; the generation speed.
- E2E (End-to-End): Total response time, roughly TTFT + TPOT × (output tokens − 1).
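The relationship between these three latency metrics can be written down directly. The sketch below assumes the common convention that the first token is covered by TTFT and every later token costs one TPOT; the function names and the example numbers are illustrative, not measurements.

```python
def e2e_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Estimate end-to-end latency: first token, then one TPOT per later token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + tpot_s * (output_tokens - 1)


def tpot_from_measurements(e2e_s: float, ttft_s: float, output_tokens: int) -> float:
    """Recover average TPOT from a measured E2E latency and TTFT."""
    if output_tokens < 2:
        raise ValueError("need at least 2 output tokens to compute TPOT")
    return (e2e_s - ttft_s) / (output_tokens - 1)


# Illustrative numbers: 250 ms to first token, 20 ms per token, 101 tokens.
total = e2e_latency(ttft_s=0.25, tpot_s=0.02, output_tokens=101)
print(f"{total:.2f} s")  # 2.25 s
```

For long responses the TPOT term dominates; for short answers to long prompts, TTFT does.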
Throughput Metrics:
- Requests/Second: Number of completed requests per second.
- Tokens/Second: Total token generation throughput.
- Concurrent Users: Active simultaneous conversations.
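As a rough sketch, the throughput metrics above can be computed from a log of completed requests. The record layout `(start_s, end_s, output_tokens)` and the sample values are assumptions made for illustration; a real serving stack would pull these from its own metrics system.

```python
from typing import List, Tuple

Request = Tuple[float, float, int]  # (start_s, end_s, output_tokens)


def throughput(requests: List[Request]) -> Tuple[float, float]:
    """Return (requests/second, output tokens/second) over the observed window."""
    window = max(end for _, end, _ in requests) - min(start for start, _, _ in requests)
    total_tokens = sum(tokens for _, _, tokens in requests)
    return len(requests) / window, total_tokens / window


def peak_concurrency(requests: List[Request]) -> int:
    """Return the largest number of requests that were in flight at once."""
    events = []
    for start, end, _ in requests:
        events.append((start, 1))   # request becomes active
        events.append((end, -1))    # request finishes
    events.sort()  # at equal timestamps, finishes (-1) sort before starts (+1)
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


log = [(0.0, 2.0, 100), (0.5, 2.5, 100), (1.0, 4.0, 200)]
print(throughput(log))        # (0.75, 100.0)
print(peak_concurrency(log))  # 3
```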
Inference Phases
Prefill (Prompt Processing):
- Process all input tokens in parallel.
- Compute-bound: Uses full GPU compute.
- Generate initial KV cache.
- Latency grows with prompt length (attention cost grows quadratically for long prompts).
Decode (Token Generation):
- Generate one token at a time.
- Memory-bound: Reading model weights and the KV cache from GPU memory dominates.
- Each token requires full model forward pass.
- Total latency proportional to output length.
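One way to see why decode is memory-bound is a back-of-the-envelope lower bound: each decode step has to read the model weights and the KV cache from GPU memory at least once. The sketch below assumes batch size 1 and ignores compute time, kernel launch overhead, and communication; the 7-billion-parameter model and the 2 TB/s bandwidth are illustrative values, not the specification of any particular GPU.

```python
def decode_step_lower_bound_s(param_count: float,
                              bytes_per_param: float,
                              kv_cache_bytes: float,
                              mem_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: bytes that must be read / memory bandwidth."""
    bytes_read = param_count * bytes_per_param + kv_cache_bytes
    return bytes_read / mem_bandwidth_bytes_per_s


# Illustrative: 7e9 parameters in FP16 (2 bytes each), empty KV cache,
# and an assumed memory bandwidth of 2e12 bytes/second.
step_s = decode_step_lower_bound_s(7e9, 2, 0, 2e12)
print(f"{step_s * 1000:.1f} ms per token")  # 7.0 ms per token
```

Batching amortizes the weight reads across many requests, which is why larger batches raise throughput without slowing each step by much until compute or KV cache traffic becomes the limit.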
Serving Frameworks
Framework               | Key Features                        | Best For
------------------------|-------------------------------------|------------------
vLLM                    | PagedAttention, continuous batching | General serving
TensorRT-LLM            | NVIDIA-optimized kernels            | NVIDIA GPUs
TGI                     | Hugging Face, production ready      | HF ecosystem
llama.cpp               | CPU/consumer GPU, GGUF format       | Local/edge
Triton Inference Server | Multi-model, enterprise             | Complex pipelines
Optimization Techniques
Memory Optimizations:
- PagedAttention: Dynamic KV cache allocation (vLLM).
- Quantized KV Cache: INT8/INT4 cache reduces memory 2-4× relative to FP16.
- GQA/MQA (grouped-query/multi-query attention): Fewer KV heads reduce cache size.
- Prefix Caching: Reuse KV cache for common prefixes.
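The effect of these memory optimizations is easy to estimate with the standard KV cache size formula: two tensors (keys and values) per layer, each of size KV heads × head dimension × sequence length × batch size. The model shape below (32 layers, head dimension 128, 4096-token context) is a hypothetical configuration chosen for round numbers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_element: int) -> int:
    """Size of the KV cache in bytes for a decoder-only transformer."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


GIB = 1024 ** 3
mha_fp16 = kv_cache_bytes(32, 32, 128, 4096, 1, 2)  # full multi-head, FP16
gqa_fp16 = kv_cache_bytes(32, 8, 128, 4096, 1, 2)   # GQA with 8 KV heads, FP16
gqa_int8 = kv_cache_bytes(32, 8, 128, 4096, 1, 1)   # GQA plus INT8 cache
print(mha_fp16 / GIB, gqa_fp16 / GIB, gqa_int8 / GIB)  # 2.0 0.5 0.25
```

In this example, moving from 32 to 8 KV heads cuts the per-request cache by 4×, and an INT8 cache halves it again, which translates directly into more concurrent sequences per GPU.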
Compute Optimizations:
- Quantization: INT8/INT4 weights reduce memory footprint and bandwidth demand.
- Flash Attention: Fused, memory-efficient attention kernels.
- Tensor Parallelism: Split each layer's weights across GPUs.
- Speculative Decoding: A small draft model proposes several tokens; the main model verifies them in one pass.
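Speculative decoding can be illustrated with a toy greedy version. In this sketch, `target_model` and `draft_model` are hypothetical deterministic functions standing in for real networks, and verification is written as a loop, whereas a real engine would score all proposed positions in a single batched forward pass. With sampling instead of greedy decoding, the acceptance rule is a rejection-sampling test rather than the exact match used here.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]


def target_model(tokens: List[int]) -> int:
    """Hypothetical 'large' model: next token is (last + 1) mod 10."""
    return (tokens[-1] + 1) % 10


def draft_model(tokens: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def speculative_generate(prompt: List[int], draft: Model, target: Model,
                         k: int, max_new_tokens: int) -> List[int]:
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Target model checks each proposed position.
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] != expected:
                accepted.append(expected)  # replace first mismatch, drop the rest
                break
            accepted.append(proposal[i])
        else:
            # All k proposals accepted: the target's pass yields one bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[len(prompt):limit]


result = speculative_generate([0], draft_model, target_model, k=3, max_new_tokens=8)
print(result)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The output is identical to what the target model would produce on its own; the gain comes from needing fewer target passes whenever the draft guesses correctly.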
Batching Strategies:
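- Static Batching: Fixed batch; wait for all requests to complete before admitting more.
- Continuous Batching: Dynamic batch; admit new requests at each step as others finish.
- In-Flight Batching: TensorRT-LLM's term for continuous batching; prefill and decode requests can share a batch.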
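The difference between static and continuous batching shows up clearly in a small simulation that counts decode steps. The sketch assumes all requests are already queued, each needs at least one output token, each decode step produces one token for every active request, and prefill cost is ignored; the output lengths are made up for illustration.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active: List[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [2, 8, 2, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=2))      # 12
print(continuous_batching_steps(lengths, batch_size=2))  # 10
```

With static batching the short request that shares a batch with the 8-token request leaves its slot idle for six steps; continuous batching reuses that slot immediately.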
Serving Architecture
            Client Requests
                   ↓
┌─────────────────────────────────────┐
│            Load Balancer            │
├─────────────────────────────────────┤
│   API Gateway (Auth, Rate Limit)    │
├─────────────────────────────────────┤
│      Request Queue / Scheduler      │
├─────────────────────────────────────┤
│          Inference Engine           │
│   ├─ Model Worker 1 (GPU 0-3)       │
│   ├─ Model Worker 2 (GPU 4-7)       │
│   └─ Model Worker N                 │
├─────────────────────────────────────┤
│   Response Streaming (SSE/WebSocket)│
└─────────────────────────────────────┘
                   ↓
      Client Response (streaming)
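The response streaming layer at the bottom of this stack can be sketched for the SSE case. Server-Sent Events are plain text frames of the form `data: <payload>` followed by a blank line. The JSON payload shape and the `[DONE]` end-of-stream sentinel used here are assumptions borrowed from common API conventions, not something the SSE format itself requires.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each generated token in a Server-Sent Events frame."""
    for tok in tokens:
        # One event per token: a "data:" line terminated by a blank line.
        yield f"data: {json.dumps({'token': tok})}\n\n"
    # Assumed sentinel so the client knows the stream is complete.
    yield "data: [DONE]\n\n"


frames = list(sse_events(["Hi", "!"]))
for frame in frames:
    print(frame, end="")
```

Emitting one frame per token lets the client render text as soon as it is generated, so perceived latency tracks TTFT rather than the full end-to-end time.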
Cloud Deployment Options
- Managed APIs: OpenAI, Anthropic, Google (no infrastructure).
- Serverless GPU: Replicate, Modal, RunPod, Banana.
- Self-Hosted Cloud: AWS, GCP, Azure GPU instances.
- On-Premise: NVIDIA DGX, custom GPU servers.
LLM inference and serving is where model capability meets production reality — optimizing this pipeline determines whether AI applications are fast and cost-effective or slow and expensive, making inference engineering critical for any serious AI deployment.