Docker and Kubernetes for ML

Keywords: docker ml, kubernetes, containers, gpu docker, kserve, kubeflow, model serving, deployment

Docker and Kubernetes for ML provide containerization and orchestration infrastructure for deploying machine learning models at scale — packaging models with dependencies into portable containers and managing clusters of GPU-enabled nodes for production serving, training jobs, and auto-scaling inference workloads.

Why Containers for ML?

- Reproducibility: Same environment everywhere (dev, test, prod).
- Dependency Isolation: No conflicts between project requirements.
- Portability: Run anywhere containers run.
- Scaling: Deploy multiple instances easily.
- GPU Support: NVIDIA Container Toolkit enables GPU access.

Docker Basics for ML

Basic Dockerfile:
```dockerfile
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# Install Python
RUN apt-get update && apt-get install -y python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application code
COPY . /app
WORKDIR /app

# Run inference server
CMD ["python3", "serve.py"]
```
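
The CMD above assumes a serve.py entrypoint. A minimal sketch of one using FastAPI (fastapi, uvicorn, pydantic, and torch are assumed to be in requirements.txt; the TorchScript model and the MODEL_PATH default, which mirrors the Compose example below, are illustrative):
```python
# serve.py -- minimal inference server sketch.
# Assumes fastapi, uvicorn, pydantic, and torch are in requirements.txt;
# the TorchScript model and MODEL_PATH default are illustrative.
import os

import torch
from fastapi import FastAPI
from pydantic import BaseModel

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_PATH = os.environ.get("MODEL_PATH", "/app/models/model.pt")

app = FastAPI()
model = torch.jit.load(MODEL_PATH, map_location=DEVICE)
model.eval()

class PredictRequest(BaseModel):
    inputs: list[float]

@app.get("/health")
def health():
    # Answers the Kubernetes readinessProbe shown later in this article.
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictRequest):
    x = torch.tensor([req.inputs], device=DEVICE)
    with torch.no_grad():
        out = model(x)
    return {"outputs": out.tolist()}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```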

Optimized Multi-Stage Build:
```dockerfile
# Build stage
FROM python:3.10-slim AS builder
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Runtime stage -- Ubuntu 22.04 ships Python 3.10, matching the builder,
# so the packages copied into /root/.local resolve correctly
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 && \
    rm -rf /var/lib/apt/lists/*
COPY --from=builder /root/.local /root/.local
COPY . /app
WORKDIR /app
ENV PATH=/root/.local/bin:$PATH
CMD ["python3", "serve.py"]
```

GPU in Docker:
```bash
# Install NVIDIA Container Toolkit (current repository; the older
# nvidia-docker repository is deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Register the runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Run with GPU access
docker run --gpus all -it my-ml-image

# Specific GPUs (quote the device list so the shell passes it intact)
docker run --gpus '"device=0,1"' -it my-ml-image
```
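
A quick way to confirm the container actually sees its GPUs is a small check script run inside the image; this sketch assumes PyTorch is installed there:
```python
# gpu_check.py -- verify GPU visibility from inside the container.
# Assumes PyTorch is installed in the image; run with e.g.:
#   docker run --gpus all my-ml-image python3 gpu_check.py
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```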

Docker Compose for ML:
```yaml
version: "3.8"
services:
  inference:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    environment:
      - MODEL_PATH=/app/models/model.pt
```

Kubernetes for ML

Why Kubernetes?
- Scale inference across many nodes.
- Manage GPU allocation automatically.
- Self-healing: restart failed pods.
- Load balancing across replicas.
- Rolling updates without downtime.

Deployment Example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: inference
          image: my-registry/llm-server:v1
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
```

Service & Load Balancing:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-inference
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
```

Horizontal Pod Autoscaler:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that CPU utilization is only a rough proxy for GPU-bound inference load; custom metrics such as request queue depth or requests per second often track demand better.

ML Platforms on Kubernetes

| Platform | Purpose                 | Use Case             |
|----------|-------------------------|----------------------|
| KServe   | Model serving           | Deploy models easily |
| Kubeflow | Full ML pipeline        | Training + serving   |
| Ray      | Distributed compute     | Large-scale training |
| Seldon   | ML deployment platform  | Enterprise serving   |
| MLflow   | Experiment tracking     | Model versioning     |
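
To make the serving workflow concrete, here is a sketch of querying a model deployed behind KServe's V1 REST protocol; the hostname and model name are hypothetical placeholders:
```python
# Query a KServe-deployed model over the V1 inference protocol.
# Hostname and model name are hypothetical placeholders.
import requests

url = "http://llm-service/v1/models/my-model:predict"
payload = {"instances": [[1.0, 2.0, 3.0]]}  # V1 protocol request body

resp = requests.post(url, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # V1 responses arrive as {"predictions": [...]}
```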

Best Practices

Container Best Practices:
- Use specific version tags, not :latest.
- Multi-stage builds to reduce image size.
- Don't include training data in images.
- Use .dockerignore to exclude unnecessary files.
- Health checks for readiness/liveness.

K8s Best Practices:
- Set resource requests AND limits.
- Use NVIDIA device plugin for GPU scheduling.
- Implement graceful shutdown for model unloading (see the sketch after this list).
- Use PersistentVolumes for model storage.
- Monitor GPU memory usage.
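
Kubernetes stops a pod by sending SIGTERM and waiting terminationGracePeriodSeconds (30s by default) before killing it, so graceful shutdown means catching that signal, draining requests, and unloading the model. A minimal sketch, with unload_model as a hypothetical stand-in for real cleanup:
```python
# Handle the SIGTERM Kubernetes sends before force-killing a pod.
# unload_model() is a hypothetical stand-in for real cleanup work
# (draining requests, freeing GPU memory, closing connections).
import signal
import sys
import time

def unload_model():
    print("Unloading model and freeing GPU memory...")
    time.sleep(1)  # stand-in for actual cleanup

def handle_sigterm(signum, frame):
    unload_model()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

while True:  # stand-in for the real serving loop
    time.sleep(1)
```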

Docker and Kubernetes are the production backbone of ML infrastructure — enabling reproducible deployments, horizontal scaling, and robust operations that transform ML experiments into reliable production systems.
