Cloud platforms for AI/ML provide on-demand GPU compute and managed services for training and deploying machine learning models. They offer instances with A100, H100, and other accelerators alongside managed ML platforms such as SageMaker, Vertex AI, and Azure ML, letting teams scale AI workloads without owning hardware.
Why Cloud for AI/ML?
- No Capital Investment: Pay for GPUs as needed, no $40K H100 purchases.
- Elastic Scale: Scale from 0 to 1000 GPUs for training, back to 0.
- Managed Services: Training, serving, monitoring handled by platform.
- Latest Hardware: Access H100s, H200s as they release.
- Global Availability: Deploy close to users worldwide.
GPU Instance Comparison
High-End Training Instances:
```
Instance            | GPUs    | GPU Memory | $/hr (On-Demand)
--------------------|---------|------------|------------------
AWS p5.48xlarge     | 8× H100 | 640 GB     | ~$98
GCP a3-megagpu-8g   | 8× H100 | 640 GB     | ~$100
Azure ND H100 v5    | 8× H100 | 640 GB     | ~$98
Lambda Cloud 8×H100 | 8× H100 | 640 GB     | ~$85
```
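Multiplying these hourly rates out makes the scale of training budgets concrete. A minimal sketch (the node count, rate, and duration below are illustrative, not from the source):

```python
def training_cost(nodes: int, rate_per_node_hr: float, days: float) -> float:
    """Estimate on-demand cost of a multi-node training run:
    nodes × hourly rate per node × 24 h × days."""
    return nodes * rate_per_node_hr * 24 * days

# e.g. 4× p5.48xlarge (8× H100 each) for a 14-day run at ~$98/node-hr
cost = training_cost(4, 98.0, 14)  # → $131,712
```

At these numbers, a two-week 32-GPU run already exceeds $130K on-demand, which is why the spot and reserved discounts below matter so much.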
Inference Instances:
```
Instance          | GPUs           | GPU Memory | $/hr (On-Demand)
------------------|----------------|------------|------------------
AWS g5.xlarge     | 1× A10G        | 24 GB      | ~$1.00
GCP g2-standard-4 | 1× L4          | 24 GB      | ~$0.70
Azure NC A100 v4  | 1× A100        | 80 GB      | ~$3.67
AWS inf2.xlarge   | 1× Inferentia2 | 32 GB      | ~$0.75
```
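For memory-bound inference (large models, long contexts), a rough way to compare these options is price per GB of accelerator memory per hour. A small sketch using the table's prices (the metric itself is an assumption, not an official benchmark):

```python
# (price $/hr, accelerator memory GB) from the table above
instances = {
    "AWS g5.xlarge":     (1.00, 24),
    "GCP g2-standard-4": (0.70, 24),
    "Azure NC A100 v4":  (3.67, 80),
    "AWS inf2.xlarge":   (0.75, 32),
}

def cost_per_gb_hr(price: float, mem_gb: int) -> float:
    """Hourly cost per GB of accelerator memory."""
    return price / mem_gb

# Instance with the cheapest memory-hour
best = min(instances, key=lambda k: cost_per_gb_hr(*instances[k]))
```

Raw $/hr favors the L4, but per memory-GB the Inferentia2 instance comes out ahead; which metric matters depends on whether your workload is compute- or memory-bound.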
Cost Optimization
Spot/Preemptible Instances:
```
Type          | Discount | Risk             | Use For
--------------|----------|------------------|------------------------
Spot (AWS)    | 60-90%   | Interruption     | Training w/ checkpoints
Preemptible   | 60-80%   | 24hr max         | Batch jobs
Spot Block    | 30-50%   | 1-6hr guaranteed | Short jobs
```
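Spot discounts are only safe if training can survive an interruption, which means checkpointing regularly and resuming from the last checkpoint. A minimal simulation of that pattern (the in-memory `store` dict stands in for S3/GCS; step counts are illustrative):

```python
def train_with_checkpoints(total_steps: int, ckpt_every: int, interruptions: set):
    """Simulate spot training: persist progress every ckpt_every steps,
    and on each interruption resume from the last saved checkpoint."""
    store = {}                              # stands in for object storage
    interruptions = set(interruptions)
    step, restarts = 0, 0
    while step < total_steps:
        step += 1
        if step % ckpt_every == 0:
            store["latest"] = step          # checkpoint: persist progress
        if step in interruptions:
            interruptions.discard(step)     # instance reclaimed once
            step = store.get("latest", 0)   # resume from last checkpoint
            restarts += 1
    return step, restarts

# One interruption at step 25; last checkpoint was at step 20
final_step, restarts = train_with_checkpoints(100, ckpt_every=10, interruptions={25})
```

The wasted work per interruption is bounded by `ckpt_every`, so the checkpoint interval is a direct trade-off between storage/write overhead and recomputation risk.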
Reserved/Committed:
```
Commitment    | Discount | Best For
--------------|----------|---------------------------
1-year        | 30-40%   | Steady inference workloads
3-year        | 50-60%   | Long-term production
PAYG fallback | 0%       | Burst capacity
```
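Whether a reservation pays off depends on utilization: reserved capacity bills for every hour whether used or not, while on-demand bills only for hours used. A quick break-even sketch (rates and discounts are the table's ranges, plugged in as examples):

```python
HOURS_PER_YEAR = 8760

def annual_cost_on_demand(rate_hr: float, utilization: float) -> float:
    """On-demand: pay only for the fraction of hours actually used."""
    return rate_hr * HOURS_PER_YEAR * utilization

def annual_cost_reserved(rate_hr: float, discount: float) -> float:
    """Reserved: discounted rate, but billed for all hours."""
    return rate_hr * HOURS_PER_YEAR * (1 - discount)

# With a 35% discount, break-even utilization is 65%:
# below that, on-demand is cheaper; above it, reserved wins.
```

This is why the table pairs reservations with "steady inference workloads": bursty training jobs rarely clear the break-even utilization.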
Managed ML Services
AWS SageMaker:
```
Component     | Purpose
--------------|---------------------------
Studio        | IDE for ML development
Training      | Managed training jobs
Endpoints     | Model serving
Pipelines     | ML workflow orchestration
Ground Truth  | Data labeling
```
GCP Vertex AI:
```
Component     | Purpose
--------------|---------------------------
Workbench     | Managed notebooks
Training      | Distributed training
Prediction    | Serving endpoints
Pipelines     | Kubeflow-based workflows
Feature Store | ML feature management
```
Azure Machine Learning:
```
Component     | Purpose
--------------|---------------------------
Designer      | Drag-and-drop ML
AutoML        | Automated model selection
Compute       | Managed clusters
Endpoints     | Deployment targets
MLflow        | Experiment tracking
```
Decision Framework
```
Use Case                  | Provider Strength
--------------------------|-------------------------
Existing AWS shop         | SageMaker
Google ecosystem          | Vertex AI
Microsoft shop            | Azure ML
Cost-sensitive            | Lambda, RunPod, Vast.ai
Simplest experience       | Replicate, Modal
Maximum control           | Raw GPU instances
```
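The table above is essentially a lookup keyed on your primary constraint. A toy encoding of it (the priority labels are hypothetical names invented here, not from any provider):

```python
# Hypothetical priority labels mapping to the table above
PLATFORM_BY_PRIORITY = {
    "aws_shop":         "SageMaker",
    "google_ecosystem": "Vertex AI",
    "microsoft_shop":   "Azure ML",
    "cost_sensitive":   "Lambda / RunPod / Vast.ai",
    "simplest":         "Replicate / Modal",
}

def pick_platform(priority: str) -> str:
    """Fall back to raw GPU instances when maximum control is
    needed or no managed platform matches."""
    return PLATFORM_BY_PRIORITY.get(priority, "Raw GPU instances")
```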
Storage Options
```
Service        | Provider | Use Case            | Cost
---------------|----------|---------------------|-------------
S3             | AWS      | Datasets, artifacts | $0.023/GB/mo
GCS            | GCP      | Same                | $0.020/GB/mo
Azure Blob     | Azure    | Same                | $0.018/GB/mo
EFS/Filestore  | Various  | Shared model access | Higher
FSx for Lustre | AWS      | High-perf training  | $0.14/GB/mo
```
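Storage is usually a small line item next to GPU hours, but checkpoints accumulate fast during long runs. A quick estimate using the S3 rate from the table (the dataset and checkpoint sizes are illustrative):

```python
def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Flat object-storage cost: size × per-GB monthly rate."""
    return size_gb * price_per_gb_month

# 5 TB of tokenized data + 2 TB of retained checkpoints on S3 standard
s3_cost = monthly_storage_cost(5000 + 2000, 0.023)  # → $161.00/mo
```

Pruning old checkpoints (or moving them to a colder tier) keeps this from growing linearly with run length.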
Cloud Architecture for LLM Training
```
┌─────────────────────────────────────────────────────┐
│ Object Storage (S3/GCS)                             │
│   ├── /datasets      (tokenized training data)      │
│   ├── /checkpoints   (model snapshots)              │
│   └── /final-models  (trained models)               │
├─────────────────────────────────────────────────────┤
│ Training Cluster                                    │
│   └── 8×H100 nodes with fast interconnect           │
│       (NVLink, InfiniBand)                          │
├─────────────────────────────────────────────────────┤
│ Serving Fleet                                       │
│   ├── Autoscaling GPU instances                     │
│   ├── Load balancer                                 │
│   └── CDN for static assets                         │
└─────────────────────────────────────────────────────┘
```
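A practical detail of the object-storage layout: consistent, zero-padded key naming keeps checkpoints lexically sorted and easy to list. A minimal sketch (the bucket, run ID, and `.pt` suffix are assumed conventions, not prescribed by any provider):

```python
def checkpoint_uri(bucket: str, run_id: str, step: int) -> str:
    """Build an object-storage key following the /checkpoints layout
    above; zero-padded step numbers sort lexically in step order."""
    return f"s3://{bucket}/checkpoints/{run_id}/step-{step:08d}.pt"

uri = checkpoint_uri("ml-artifacts", "run-42", 1500)
# s3://ml-artifacts/checkpoints/run-42/step-00001500.pt
```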
Quick Starts
AWS (Launch GPU instance):
```bash
aws ec2 run-instances \
  --image-id ami-xxx \
  --instance-type p4d.24xlarge \
  --key-name my-key
```
GCP (Create GPU instance):
```bash
# a2-highgpu-1g already bundles 1× A100, so no separate
# --accelerator flag is needed; GPU VMs must be set to
# terminate (not live-migrate) on host maintenance.
gcloud compute instances create gpu-instance \
  --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --maintenance-policy=TERMINATE
```
Cloud platforms are the infrastructure foundation for AI at scale — providing the elastic GPU compute and managed services that enable teams to train frontier models and deploy production AI systems without massive capital investment.