Model Serving Systems

Keywords: model serving systems, inference serving architecture, production model deployment, serving infrastructure, model serving frameworks

Model Serving Systems are the production infrastructure for deploying trained neural networks as scalable, reliable services. They provide request handling, batching, load balancing, versioning, monitoring, and fault tolerance, bridging the gap between research models and production applications that serve millions of requests per day under strict latency and availability requirements.

Core Serving Components:
- Model Server: loads model weights, handles inference requests, manages GPU memory; examples: TorchServe, TensorFlow Serving, NVIDIA Triton; provides REST/gRPC APIs for client requests; handles model lifecycle (load, unload, update)
- Request Router: distributes incoming requests across model replicas; implements load balancing strategies (round-robin, least-connections, latency-aware); handles request queuing and timeout management
- Batch Scheduler: groups individual requests into batches for efficient GPU utilization; implements dynamic batching (wait up to timeout for batch to fill) or continuous batching (add requests to in-flight batches); critical for throughput optimization
- Model Repository: stores model artifacts (weights, configs, metadata); supports versioning and rollback; examples: S3, GCS, model registries (MLflow, Weights & Biases); enables A/B testing and canary deployments
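
To make these components concrete, the sketch below shows a minimal model server: weights loaded once at startup, a REST endpoint for inference, and GPU placement when available. The model path, input schema, and endpoint name are hypothetical placeholders, not the API of TorchServe, TensorFlow Serving, or Triton.

```python
# Minimal model-server sketch (hypothetical paths and schema, not any framework's API).
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load weights once at startup and keep the model resident on the GPU if available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.jit.load("model_repository/my_model/1/model.pt", map_location=device)  # hypothetical path
model.eval()

class InferenceRequest(BaseModel):
    inputs: list[float]  # assumed flat feature vector, for illustration only

@app.post("/v1/models/my_model:predict")  # endpoint name is illustrative
def predict(req: InferenceRequest):
    with torch.no_grad():
        x = torch.tensor(req.inputs, device=device).unsqueeze(0)
        y = model(x)
    return {"outputs": y.squeeze(0).tolist()}
```

A real model server adds the pieces listed above on top of this core: request queuing, batching, health checks, and lifecycle management for loading and unloading model versions.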

Batching Strategies:
- Static Batching: fixed batch size, waits for batch to fill before inference; maximizes GPU utilization but increases latency; suitable for offline/batch processing
- Dynamic Batching: waits up to a timeout (1-10 ms) for requests to accumulate; balances latency and throughput; the timeout is a critical hyperparameter (lower values favor latency, higher values favor throughput); see the scheduler sketch after this list
- Continuous Batching (Orca): for autoregressive models, adds new requests to the batch between generation steps; dramatically improves throughput (10-20×) by keeping the GPU busy; vLLM and TGI (Text Generation Inference) implement continuous batching
- Selective Batching: groups requests with similar characteristics (length, priority); reduces padding overhead; improves efficiency for heterogeneous workloads
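
The latency/throughput trade-off in dynamic batching can be seen in a small scheduler loop: block for the first request, then keep collecting until the batch is full or a deadline passes. The queue, timeout, batch size, and delivery hooks below are illustrative assumptions, not any framework's defaults.

```python
# Dynamic-batching sketch: wait up to a timeout for requests to accumulate,
# then run the whole batch in one forward pass. All values are illustrative.
import queue
import time

MAX_BATCH_SIZE = 32
BATCH_TIMEOUT_S = 0.005  # 5 ms: lower favors latency, higher favors throughput

request_queue: "queue.Queue" = queue.Queue()

def collect_batch():
    """Block for the first request, then collect more until the batch is full
    or the timeout measured from the first request expires."""
    batch = [request_queue.get()]                 # first request: wait indefinitely
    deadline = time.monotonic() + BATCH_TIMEOUT_S
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

def serving_loop(run_inference, deliver):
    # run_inference: list of inputs -> list of outputs; deliver: returns each result
    # to its caller (e.g. by completing a per-request future). Both are assumed hooks.
    while True:
        batch = collect_batch()
        outputs = run_inference([r["input"] for r in batch])
        for req, out in zip(batch, outputs):
            deliver(req, out)
```

Continuous batching replaces this "collect, run, repeat" loop with per-step scheduling: after every generation step, finished sequences leave the batch and waiting requests join it.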

Scaling and Load Balancing:
- Horizontal Scaling: deploys multiple model replicas across GPUs/servers; load balancer distributes requests; scales throughput linearly with replicas; simplest and most common scaling approach
- Vertical Scaling: uses larger GPUs or more GPUs per replica; enables serving larger models; limited by single-node GPU count (typically 8 GPUs)
- Model Parallelism: splits a single model across multiple GPUs; tensor parallelism (splits individual layers across GPUs) or pipeline parallelism (places different layers on different GPUs); enables serving models larger than a single GPU's memory
- Auto-Scaling: dynamically adjusts replica count based on load; scales up during traffic spikes, down during low traffic; Kubernetes HPA (Horizontal Pod Autoscaler) or custom autoscalers; requires careful tuning to avoid thrashing
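
A toy autoscaling policy makes the replica-count decision explicit; the per-replica capacity, bounds, and cooldown below are assumptions, and real deployments typically delegate this to Kubernetes HPA or an equivalent controller.

```python
# Toy autoscaler: size the replica pool from observed load, with a cooldown to
# avoid thrashing. Thresholds are illustrative, not recommended production values.
import math
import time

TARGET_RPS_PER_REPLICA = 50      # assumed per-replica capacity
MIN_REPLICAS, MAX_REPLICAS = 2, 64
COOLDOWN_S = 120                 # ignore new decisions for 2 minutes after scaling

_last_scale_time = 0.0

def desired_replicas(observed_rps: float, current: int) -> int:
    """Return the replica count to run, respecting bounds and the cooldown."""
    global _last_scale_time
    target = math.ceil(observed_rps / TARGET_RPS_PER_REPLICA)
    target = max(MIN_REPLICAS, min(MAX_REPLICAS, target))
    now = time.monotonic()
    if target != current and now - _last_scale_time >= COOLDOWN_S:
        _last_scale_time = now
        return target
    return current
```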

Model Versioning and Deployment:
- Blue-Green Deployment: maintains two environments (blue=current, green=new); switches traffic to green after validation; enables instant rollback by switching back to blue
- Canary Deployment: gradually shifts traffic to new version (5% → 25% → 50% → 100%); monitors metrics at each stage; rolls back if metrics degrade; reduces risk of bad deployments
- A/B Testing: serves multiple model versions simultaneously; routes requests based on user ID or random assignment; compares metrics to determine better version; enables data-driven model selection
- Shadow Deployment: new model receives copy of production traffic but responses are discarded; validates new model behavior without affecting users; identifies issues before full deployment
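
Canary rollouts and A/B tests both reduce to deterministic, weighted bucketing of requests so that a given user consistently hits the same version. The hash-based split below is a common pattern, sketched with illustrative version names and percentages.

```python
# Sticky traffic split for canary/A-B rollout: the same user always lands on the
# same version, and the canary share is raised in stages (5% -> 25% -> 50% -> 100%).
import hashlib

CANARY_PERCENT = 5  # raise in stages while monitoring metrics at each step

def route_version(user_id: str) -> str:
    # Hash the user id into a bucket 0..99 so assignment is stable across requests.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "model-v2-canary" if bucket < CANARY_PERCENT else "model-v1-stable"

# Example: route_version("user-1234") returns "model-v1-stable" for ~95% of users.
```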

Monitoring and Observability:
- Latency Metrics: p50, p95, p99 latency; tracks the distribution of response times; p99 latency is critical for user experience (even the slowest 1% of requests should not be extremely slow)
- Throughput Metrics: requests per second, tokens per second (for LLMs); measures system capacity; tracks GPU utilization to identify underutilization or saturation
- Error Rates: tracks 4xx (client errors) and 5xx (server errors); monitors model failures (OOM, timeout, numerical errors); alerts on elevated error rates
- Model Metrics: accuracy, F1, BLEU, or task-specific metrics; monitors for model degradation or distribution shift; requires ground truth labels (delayed or sampled)
- Resource Utilization: GPU memory, GPU utilization, CPU, network bandwidth; identifies bottlenecks; guides capacity planning

Fault Tolerance and Reliability:
- Health Checks: periodic checks to verify model server is responsive; removes unhealthy replicas from load balancer; Kubernetes liveness and readiness probes
- Graceful Degradation: serves cached responses or fallback model when primary model fails; maintains partial functionality during outages; critical for user-facing applications
- Request Retry: automatically retries failed requests with exponential backoff; handles transient failures (network issues, temporary overload); requires idempotency to avoid duplicate processing
- Circuit Breaker: stops sending requests to a failing service once a failure threshold is reached; prevents cascading failures; probes the service again after a cooldown period (half-open state) and closes the circuit once calls succeed; see the sketch after this list

Optimization Techniques:
- Model Compilation: TensorRT, ONNX Runtime, TorchScript optimize models for inference; graph fusion, precision calibration, kernel auto-tuning; 2-10× speedup over native frameworks
- Quantization: INT8 or INT4 quantization reduces memory and increases throughput; post-training quantization (PTQ) or quantization-aware training (QAT); 2-4× speedup with <1% accuracy loss
- KV Cache Management: for LLMs, caches key-value pairs from previous tokens; paged attention (vLLM) eliminates memory fragmentation; enables 2-24× higher throughput
- Prompt Caching: caches the key-value states for common prompt prefixes; subsequent requests with the same prefix reuse the cached states instead of recomputing them; effective for chatbots with long shared system prompts

Multi-Model Serving:
- Model Multiplexing: serves multiple models on same GPU; time-slices GPU between models; increases utilization but adds scheduling overhead
- Adapter-Based Serving: base model shared across tasks, task-specific adapters (LoRA) loaded on-demand; adapters are 2-50MB vs 14-140GB for full model; enables serving thousands of personalized models
- Ensemble Serving: combines predictions from multiple models; improves accuracy through diversity; increases latency and cost; used for high-stakes applications
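
Adapter-based serving hinges on keeping one resident base model while loading small task-specific weights on demand. The bounded LRU cache below sketches that pattern; load_adapter and the memory budget are hypothetical, not the API of any particular library.

```python
# On-demand adapter serving sketch: one shared base model, a bounded LRU cache of
# small LoRA adapters keyed by tenant or task. load_adapter() is a hypothetical hook.
from collections import OrderedDict

MAX_RESIDENT_ADAPTERS = 32   # illustrative GPU-memory budget

class AdapterCache:
    def __init__(self, load_adapter):
        self.load_adapter = load_adapter          # e.g. reads a few-MB LoRA checkpoint
        self.cache: "OrderedDict[str, object]" = OrderedDict()

    def get(self, adapter_id: str):
        if adapter_id in self.cache:
            self.cache.move_to_end(adapter_id)    # mark as recently used
            return self.cache[adapter_id]
        if len(self.cache) >= MAX_RESIDENT_ADAPTERS:
            self.cache.popitem(last=False)        # evict the least recently used adapter
        adapter = self.load_adapter(adapter_id)
        self.cache[adapter_id] = adapter
        return adapter

# Per request: adapter = cache.get(request.tenant_id); run the base model with it applied.
```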

Serving Frameworks:
- TorchServe: PyTorch's official serving framework; supports dynamic batching, multi-model serving, metrics, and logging; integrates with AWS SageMaker
- TensorFlow Serving: TensorFlow's serving system; high-performance C++ implementation; supports versioning, batching, and model warmup; widely used in production
- NVIDIA Triton: multi-framework serving (PyTorch, TensorFlow, ONNX, TensorRT); advanced batching, model ensembles, and backend flexibility; optimized for NVIDIA GPUs
- vLLM: specialized LLM serving with continuous batching and paged attention; 10-20× higher throughput than naive serving; supports popular LLMs (Llama, Mistral, GPT)
- Ray Serve: general-purpose serving built on Ray; supports arbitrary Python code; flexible but less optimized than specialized frameworks
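
As a usage illustration, vLLM exposes an offline-generation API along the lines of the snippet below; the model name is only an example and exact arguments may differ across vLLM versions.

```python
# Offline generation with vLLM; continuous batching and paged attention are handled
# internally. Model name and sampling settings are examples.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain dynamic batching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```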

Edge and Mobile Serving:
- On-Device Inference: runs models directly on phones/IoT devices; TensorFlow Lite, Core ML, ONNX Runtime Mobile; requires model compression (quantization, pruning)
- Federated Serving: distributes inference across edge devices; reduces latency and bandwidth; privacy-preserving (data stays on device)
- Hybrid Serving: simple models on-device, complex models in cloud; balances latency, cost, and capability; fallback to cloud when on-device model is uncertain
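
Hybrid serving typically gates on the on-device model's confidence. The threshold and both predictor interfaces below are illustrative assumptions.

```python
# Hybrid serving sketch: answer on-device when the small model is confident,
# otherwise fall back to the larger cloud model. Predictors and threshold are assumed.
CONFIDENCE_THRESHOLD = 0.85   # illustrative cutoff

def predict(inputs, on_device_model, cloud_client):
    label, confidence = on_device_model(inputs)        # small quantized on-device model
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "on-device"
    # Low confidence: pay the network round-trip for the larger cloud model.
    return cloud_client.predict(inputs), "cloud"
```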

Model serving systems are the production backbone of AI applications. They transform research prototypes into reliable, scalable services that handle millions of requests with millisecond latencies, providing the infrastructure that makes AI useful in the real world rather than just impressive in papers.
