Cloud platforms for AI/ML provide on-demand GPU compute and managed services for training and deploying machine learning models. They offer instances with A100, H100, and other accelerators alongside managed ML platforms such as SageMaker, Vertex AI, and Azure ML, letting teams scale AI workloads without owning hardware.
Why Cloud for AI/ML?
- No Capital Investment: Pay for GPUs as needed, no $40K H100 purchases.
- Elastic Scale: Scale from 0 to 1000 GPUs for training, back to 0.
- Managed Services: Training, serving, monitoring handled by platform.
- Latest Hardware: Access H100s, H200s as they release.
- Global Availability: Deploy close to users worldwide.
GPU Instance Comparison
High-End Training Instances:
```
Instance            | GPUs    | GPU Memory | $/hr (On-Demand)
--------------------|---------|------------|------------------
AWS p5.48xlarge     | 8× H100 | 640 GB     | ~$98
GCP a3-megagpu-8g   | 8× H100 | 640 GB     | ~$100
Azure ND H100 v5    | 8× H100 | 640 GB     | ~$98
Lambda Cloud 8×H100 | 8× H100 | 640 GB     | ~$85
```
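Multiplying these hourly rates out makes the scale of training budgets concrete. A minimal sketch (the node count, rate, and duration below are illustrative, not from the source):

```python
def training_cost(nodes: int, rate_per_node_hr: float, days: float) -> float:
    """Estimate on-demand cost of a multi-node training run:
    nodes × hourly rate per node × 24 h × days."""
    return nodes * rate_per_node_hr * 24 * days

# e.g. 4× p5.48xlarge (8× H100 each) for a 14-day run at ~$98/node-hr
cost = training_cost(4, 98.0, 14)  # → $131,712
```

At these numbers, a two-week 32-GPU run already exceeds $130K on-demand, which is why the spot and reserved discounts below matter so much.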
Inference Instances:
```
Instance          | GPUs           | GPU Memory | $/hr (On-Demand)
------------------|----------------|------------|------------------
AWS g5.xlarge     | 1× A10G        | 24 GB      | ~$1.00
GCP g2-standard-4 | 1× L4          | 24 GB      | ~$0.70
Azure NC A100 v4  | 1× A100        | 80 GB      | ~$3.67
AWS inf2.xlarge   | 1× Inferentia2 | 32 GB      | ~$0.75
```
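For memory-bound inference (large models, long contexts), a rough way to compare these options is price per GB of accelerator memory per hour. A small sketch using the table's prices (the metric itself is an assumption, not an official benchmark):

```python
# (price $/hr, accelerator memory GB) from the table above
instances = {
    "AWS g5.xlarge":     (1.00, 24),
    "GCP g2-standard-4": (0.70, 24),
    "Azure NC A100 v4":  (3.67, 80),
    "AWS inf2.xlarge":   (0.75, 32),
}

def cost_per_gb_hr(price: float, mem_gb: int) -> float:
    """Hourly cost per GB of accelerator memory."""
    return price / mem_gb

# Instance with the cheapest memory-hour
best = min(instances, key=lambda k: cost_per_gb_hr(*instances[k]))
```

Raw $/hr favors the L4, but per memory-GB the Inferentia2 instance comes out ahead; which metric matters depends on whether your workload is compute- or memory-bound.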
Cost Optimization
Spot/Preemptible Instances:
```
Type          | Discount | Risk             | Use For
--------------|----------|------------------|------------------------
Spot (AWS)    | 60-90%   | Interruption     | Training w/ checkpoints
Preemptible   | 60-80%   | 24hr max         | Batch jobs
Spot Block    | 30-50%   | 1-6hr guaranteed | Short jobs
```
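Spot discounts are only safe if training can survive an interruption, which means checkpointing regularly and resuming from the last checkpoint. A minimal simulation of that pattern (the in-memory `store` dict stands in for S3/GCS; step counts are illustrative):

```python
def train_with_checkpoints(total_steps: int, ckpt_every: int, interruptions: set):
    """Simulate spot training: persist progress every ckpt_every steps,
    and on each interruption resume from the last saved checkpoint."""
    store = {}                              # stands in for object storage
    interruptions = set(interruptions)
    step, restarts = 0, 0
    while step < total_steps:
        step += 1
        if step % ckpt_every == 0:
            store["latest"] = step          # checkpoint: persist progress
        if step in interruptions:
            interruptions.discard(step)     # instance reclaimed once
            step = store.get("latest", 0)   # resume from last checkpoint
            restarts += 1
    return step, restarts

# One interruption at step 25; last checkpoint was at step 20
final_step, restarts = train_with_checkpoints(100, ckpt_every=10, interruptions={25})
```

The wasted work per interruption is bounded by `ckpt_every`, so the checkpoint interval is a direct trade-off between storage/write overhead and recomputation risk.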
Reserved/Committed:
```
Commitment    | Discount | Best For
--------------|----------|---------------------------
1-year        | 30-40%   | Steady inference workloads
3-year        | 50-60%   | Long-term production
PAYG fallback | 0%       | Burst capacity
```
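Whether a reservation pays off depends on utilization: reserved capacity bills for every hour whether used or not, while on-demand bills only for hours used. A quick break-even sketch (rates and discounts are the table's ranges, plugged in as examples):

```python
HOURS_PER_YEAR = 8760

def annual_cost_on_demand(rate_hr: float, utilization: float) -> float:
    """On-demand: pay only for the fraction of hours actually used."""
    return rate_hr * HOURS_PER_YEAR * utilization

def annual_cost_reserved(rate_hr: float, discount: float) -> float:
    """Reserved: discounted rate, but billed for all hours."""
    return rate_hr * HOURS_PER_YEAR * (1 - discount)

# With a 35% discount, break-even utilization is 65%:
# below that, on-demand is cheaper; above it, reserved wins.
```

This is why the table pairs reservations with "steady inference workloads": bursty training jobs rarely clear the break-even utilization.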
Managed ML Services
AWS SageMaker:
```
Component     | Purpose
--------------|---------------------------
Studio        | IDE for ML development
Training      | Managed training jobs
Endpoints     | Model serving
Pipelines     | ML workflow orchestration
Ground Truth  | Data labeling
```
GCP Vertex AI:
```
Component     | Purpose
--------------|---------------------------
Workbench     | Managed notebooks
Training      | Distributed training
Prediction    | Serving endpoints
Pipelines     | Kubeflow-based workflows
Feature Store | ML feature management
```
Azure Machine Learning:
```
Component     | Purpose
--------------|---------------------------
Designer      | Drag-and-drop ML
AutoML        | Automated model selection
Compute       | Managed clusters
Endpoints     | Deployment targets
MLflow        | Experiment tracking
```
Decision Framework
```
Use Case                  | Provider Strength
--------------------------|-------------------------
Existing AWS shop         | SageMaker
Google ecosystem          | Vertex AI
Microsoft shop            | Azure ML
Cost-sensitive            | Lambda, RunPod, Vast.ai
Simplest experience       | Replicate, Modal
Maximum control           | Raw GPU instances
```
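The table above is essentially a lookup keyed on your primary constraint. A toy encoding of it (the priority labels are hypothetical names invented here, not from any provider):

```python
# Hypothetical priority labels mapping to the table above
PLATFORM_BY_PRIORITY = {
    "aws_shop":         "SageMaker",
    "google_ecosystem": "Vertex AI",
    "microsoft_shop":   "Azure ML",
    "cost_sensitive":   "Lambda / RunPod / Vast.ai",
    "simplest":         "Replicate / Modal",
}

def pick_platform(priority: str) -> str:
    """Fall back to raw GPU instances when maximum control is
    needed or no managed platform matches."""
    return PLATFORM_BY_PRIORITY.get(priority, "Raw GPU instances")
```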
Storage Options
```
Service        | Provider | Use Case            | Cost
---------------|----------|---------------------|-------------
S3             | AWS      | Datasets, artifacts | $0.023/GB/mo
GCS            | GCP      | Same                | $0.020/GB/mo
Azure Blob     | Azure    | Same                | $0.018/GB/mo
EFS/Filestore  | Various  | Shared model access | Higher
FSx for Lustre | AWS      | High-perf training  | $0.14/GB/mo
```
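Storage is usually a small line item next to GPU hours, but checkpoints accumulate fast during long runs. A quick estimate using the S3 rate from the table (the dataset and checkpoint sizes are illustrative):

```python
def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Flat object-storage cost: size × per-GB monthly rate."""
    return size_gb * price_per_gb_month

# 5 TB of tokenized data + 2 TB of retained checkpoints on S3 standard
s3_cost = monthly_storage_cost(5000 + 2000, 0.023)  # → $161.00/mo
```

Pruning old checkpoints (or moving them to a colder tier) keeps this from growing linearly with run length.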
Cloud Architecture for LLM Training
```
┌─────────────────────────────────────────────────────┐
│ Object Storage (S3/GCS)                             │
│   ├── /datasets      (tokenized training data)      │
│   ├── /checkpoints   (model snapshots)              │
│   └── /final-models  (trained models)               │
├─────────────────────────────────────────────────────┤
│ Training Cluster                                    │
│   └── 8×H100 nodes with fast interconnect           │
│       (NVLink, InfiniBand)                          │
├─────────────────────────────────────────────────────┤
│ Serving Fleet                                       │
│   ├── Autoscaling GPU instances                     │
│   ├── Load balancer                                 │
│   └── CDN for static assets                         │
└─────────────────────────────────────────────────────┘
```
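A practical detail of the object-storage layout: consistent, zero-padded key naming keeps checkpoints lexically sorted and easy to list. A minimal sketch (the bucket, run ID, and `.pt` suffix are assumed conventions, not prescribed by any provider):

```python
def checkpoint_uri(bucket: str, run_id: str, step: int) -> str:
    """Build an object-storage key following the /checkpoints layout
    above; zero-padded step numbers sort lexically in step order."""
    return f"s3://{bucket}/checkpoints/{run_id}/step-{step:08d}.pt"

uri = checkpoint_uri("ml-artifacts", "run-42", 1500)
# s3://ml-artifacts/checkpoints/run-42/step-00001500.pt
```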
Quick Starts
AWS (Launch GPU instance):
```bash
aws ec2 run-instances \
  --image-id ami-xxx \
  --instance-type p4d.24xlarge \
  --key-name my-key
```
GCP (Create GPU instance):
```bash
# a2-highgpu-1g already bundles 1× A100, so no separate
# --accelerator flag is needed; GPU VMs must be set to
# terminate (not live-migrate) on host maintenance.
gcloud compute instances create gpu-instance \
  --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --maintenance-policy=TERMINATE
```
Cloud platforms are the infrastructure foundation for AI at scale — providing the elastic GPU compute and managed services that enable teams to train frontier models and deploy production AI systems without massive capital investment.