Modal

Modal is a serverless cloud platform for Python that runs GPU-accelerated AI workloads from infrastructure requirements defined directly in Python code. It removes Dockerfile complexity, manual environment management, and idle GPU costs by starting containers on demand and billing only for actual compute time.

What Is Modal?

- Definition: A serverless cloud platform where Python functions decorated with @app.function() are automatically containerized, deployed to the cloud, and executed on the specified hardware (CPU, GPU, accelerator) — with the cloud environment defined as Python code rather than YAML or Dockerfiles.
- Key Innovation: Infrastructure-as-Python-code — instead of writing Dockerfiles, Kubernetes manifests, or cloud console configurations, Modal users define their environment using Python APIs and run local scripts that transparently execute in the cloud.
- Serverless Model: No idle charges — Modal spins up containers when a function is called and tears them down when it completes. A fine-tuning job that takes 2 hours costs 2 hours of GPU time, not 24 hours because a server was provisioned overnight.
- Founded: 2021 by Erik Bernhardsson (formerly Spotify, Better.com), designed specifically for the needs of ML engineers.
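
The billing difference described in the Serverless Model bullet comes down to simple arithmetic. A minimal sketch, assuming a hypothetical rate of $3.00 per GPU-hour (an illustrative number, not Modal's actual pricing):

```python
# Hypothetical rate for illustration only; not Modal's actual pricing.
HOURLY_GPU_RATE_USD = 3.00


def gpu_cost(billed_hours: float, rate_usd: float = HOURLY_GPU_RATE_USD) -> float:
    """Return the cost of occupying one GPU for `billed_hours` hours."""
    return billed_hours * rate_usd


# Serverless: billed only while the 2-hour job is running.
serverless_cost = gpu_cost(2)

# Always-on server: billed for the full 24 hours it stays provisioned.
always_on_cost = gpu_cost(24)

print(f"serverless: ${serverless_cost:.2f}, always-on: ${always_on_cost:.2f}")
```

Whatever the real rate is, the ratio between the two figures is set by how many hours the GPU sits provisioned, not by how long the job runs.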

Why Modal Matters for AI Workloads

- GPU Access Without DevOps: ML researchers can access A100s, H100s, and L4s without managing Kubernetes, writing Dockerfiles, or configuring cloud IAM policies — define the environment in Python and run.
- Cold Start for ML: Modal caches container images and can keep containers warm, so cold starts for GPU containers typically take seconds rather than minutes, making serverless viable for latency-sensitive inference.
- Fine-Tuning Workflows: Run a LoRA fine-tuning job that needs 4 × A100s for 3 hours. Modal provisions exactly that, runs the job, persists checkpoints to Modal Volumes, and charges only for the 12 GPU-hours used (4 GPUs × 3 hours).
- Batch Inference: Process 100,000 documents for embedding. Calling .map() on a Modal function fans the inputs out across many containers automatically, completing in minutes rather than hours.
- Scheduled Jobs: Run embedding pipeline updates, evaluation runs, or dataset processing on a schedule without managing cron infrastructure.
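
The "minutes rather than hours" claim for batch inference can be checked with a back-of-the-envelope estimate. The sketch below assumes a hypothetical latency of 0.05 s per document and an even split of work across containers; it ignores cold starts, scheduling overhead, and data transfer, so treat the result as a lower bound rather than a measurement.

```python
import math


def batch_wall_clock_seconds(num_items: int, seconds_per_item: float, containers: int) -> float:
    """Estimate wall-clock time when items are split evenly across containers.

    Ignores cold starts, scheduling overhead, and network transfer.
    """
    items_per_container = math.ceil(num_items / containers)
    return items_per_container * seconds_per_item


# Assumed numbers: 100,000 documents at 0.05 s each (hypothetical latency).
serial = batch_wall_clock_seconds(100_000, 0.05, containers=1)
parallel = batch_wall_clock_seconds(100_000, 0.05, containers=20)

print(f"1 container: {serial / 60:.1f} min, 20 containers: {parallel / 60:.1f} min")
```

Under these assumptions a single container needs well over an hour, while 20 containers finish in a few minutes.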

Core Modal Concepts

Defining Environments:
import modal

app = modal.App("my-llm-app")

# Define container image as Python code
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("torch", "transformers", "vllm", "accelerate")
    .env({"HF_HOME": "/cache"})
)

GPU Functions:
@app.function(
    image=image,
    gpu="A100",  # Request an A100 GPU
    memory=65536,  # 64 GiB RAM (value is in MiB)
    timeout=7200,  # 2-hour timeout (seconds)
    volumes={"/cache": modal.Volume.from_name("model-cache")},  # Persistent storage
)
def fine_tune(dataset_path: str, output_path: str):
    # This code runs on an A100 in the cloud
    from transformers import AutoModelForCausalLM
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
    # ... fine-tuning code ...
    model.save_pretrained(output_path)

# Run from a local terminal (python my_app.py); the call executes on an A100
if __name__ == "__main__":
    with app.run():
        fine_tune.remote("s3://bucket/dataset.jsonl", "/cache/model-v2")

Parallel Batch Processing:
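@app.function(image=image, gpu="L4", concurrency_limit=20)  # newer Modal releases name this max_containers
def embed_document(text: str) -> list[float]:
    # embedding_model is assumed to be loaded elsewhere (not shown)
    return embedding_model.encode(text)

if __name__ == "__main__":
    with app.run():
        # documents is assumed to be a list of strings defined by the caller
        # Automatically parallelizes across up to 20 containers
        embeddings = list(embed_document.map(documents, order_outputs=True))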
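
The fan-out pattern above can be tried locally with the standard library before spending any GPU time. This is only an analogy, not Modal's implementation: it uses threads on one machine instead of containers, and `embed_document_local` is a hypothetical stand-in that returns toy features rather than real embeddings.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_document_local(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: returns two toy features."""
    return [float(len(text)), float(text.count(" ") + 1)]


documents = ["hello world", "serverless GPUs", "a"]

# Executor.map yields results in input order, even if workers finish out of order.
with ThreadPoolExecutor(max_workers=20) as pool:
    embeddings = list(pool.map(embed_document_local, documents))

print(embeddings)
```

Keeping the per-item function free of shared mutable state, as here, is what makes the same code shape safe to spread across many workers.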

Web Endpoints:
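@app.function(image=image, gpu="A10G")
@modal.web_endpoint(method="POST")  # newer Modal releases name this modal.fastapi_endpoint
async def generate(request: dict) -> dict:
    # model is assumed to be loaded elsewhere (not shown)
    return {"response": model.generate(request["prompt"])}

# Deploy: modal deploy my_app.py
# The command prints the endpoint URL; the endpoint autoscales from 0 to N containers based on traffic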
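
A client for such an endpoint only needs to POST a JSON body with a `prompt` field. The sketch below builds that request with the standard library; the URL is a placeholder (the real one is whatever `modal deploy` reports), and the `response` field name simply mirrors the handler above.

```python
import json
import urllib.request

# Hypothetical URL; use the one reported by `modal deploy`.
ENDPOINT_URL = "https://example.invalid/generate"


def build_generate_request(prompt: str, url: str = ENDPOINT_URL) -> urllib.request.Request:
    """Build a JSON POST request carrying the prompt in the body."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("Explain serverless GPUs in one sentence.")

# Sending is left commented out because the URL above is a placeholder:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["response"])
```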

Modal Storage

Modal Volumes: Persistent filesystem shared across function invocations — store model weights, datasets, checkpoints.
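
Because a Volume appears inside the container as an ordinary directory, checkpointing can be written with plain file operations. The sketch below is one possible layout, with hypothetical helper names and file naming; it does not cover when writes become visible to other containers, which depends on Modal's Volume commit semantics.

```python
from pathlib import Path
from typing import Optional


def save_checkpoint(root: Path, step: int, payload: bytes) -> Path:
    """Write a checkpoint file named by training step under `root`."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"step-{step:08d}.ckpt"
    path.write_bytes(payload)
    return path


def latest_checkpoint(root: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step, or None if there is none."""
    if not root.is_dir():
        return None
    # Zero-padded step numbers make lexicographic order match numeric order.
    checkpoints = sorted(root.glob("step-*.ckpt"))
    return checkpoints[-1] if checkpoints else None
```

Inside a Modal function, `root` would point at a path under the Volume's mount point (for example a subdirectory of `/cache`), so a restarted job can call `latest_checkpoint` and resume instead of starting over.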

Modal Secrets: Encrypted key-value store for API keys, HuggingFace tokens, database credentials — referenced in function definitions without hardcoding.

# Attach with @app.function(secrets=[...]); the secret's keys are injected as environment variables
modal.Secret.from_name("openai-api-key")
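
Since secrets arrive as environment variables, function code reads them like any other variable. A small sketch with a hypothetical helper that fails loudly when the variable is missing; the key name `OPENAI_API_KEY` is an assumption, since the actual name is whatever key was stored in the secret.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, failing loudly if unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable {name!r}; is the secret attached?")
    return value


# Hypothetical key name; the real one is whatever key the secret stores.
# api_key = require_env("OPENAI_API_KEY")
```

Failing at startup with a clear message is usually easier to debug than an authentication error deep inside a request handler.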

Modal vs Alternatives

| Platform | Strength | Weakness |
|----------|---------|---------|
| Modal | Python-first, serverless, fast iteration | Newer, smaller community |
| RunPod | Cheaper for long jobs, flexible | Less developer-friendly API |
| Lambda Labs | Low-cost H100s, simple | No serverless; always-on billing |
| AWS SageMaker | Enterprise features, ecosystem | Complex, expensive, heavy |
| Google Colab | Free tier, Jupyter | Limited compute time, not production |

Modal makes cloud GPU computing feel like local development: by collapsing the gap between writing code on a laptop and executing it on an 8×H100 cluster to a single Python decorator, it speeds up iteration in both AI research and production deployment workflows.
