CoreWeave

Keywords: coreweave,cloud,specialized,compute

CoreWeave is a specialized AI cloud infrastructure provider delivering large-scale NVIDIA GPU compute on InfiniBand-connected clusters purpose-built for distributed training of frontier models. It serves as the compute backbone for companies such as OpenAI, Cohere, and Mistral, which require enterprise-grade reliability and networking performance that general-purpose cloud providers struggle to match.

What Is CoreWeave?

- Definition: A specialized cloud provider founded in 2017 (originally as a crypto mining company) that pivoted to become the primary GPU cloud for enterprise AI training — operating data centers purpose-built for GPU-intensive workloads with InfiniBand fabric connecting thousands of GPUs.
- Scale: One of the largest NVIDIA H100 and H200 operators outside of hyperscalers — with data center agreements covering tens of thousands of H100s and emerging H200/Blackwell capacity.
- Positioning: The "AI hyperscaler" — positioned between enterprise cloud providers (AWS/GCP/Azure) and consumer GPU marketplaces (RunPod/Vast.ai), with data center-grade hardware, enterprise SLAs, and purpose-built AI networking.
- Key Differentiator: InfiniBand rather than Ethernet between GPU nodes, providing the low-latency, high-bandwidth GPU-to-GPU communication that all-reduce operations need during multi-node training of 70B+ parameter models.
- Customers: OpenAI, Cohere, Character.ai, Mistral, and major AI labs — CoreWeave is the compute backbone for many frontier AI development efforts.

Why CoreWeave Matters for AI

- H100 Availability During Shortage: When AWS and Azure had 6-12 month waitlists for H100 capacity in 2023-2024, CoreWeave maintained availability — critical for AI companies racing to train models on schedule.
- InfiniBand Fabric: 400Gbps NDR InfiniBand connects GPUs across nodes in CoreWeave clusters, versus the 10-25Gbps Ethernet typical of commodity GPU clouds, keeping all-reduce collectives from bottlenecking multi-node training.
- Enterprise Reliability: 99.9%+ SLA, redundant power, enterprise networking — suitable for production workloads unlike consumer GPU marketplaces that depend on hobbyist hardware.
- NVIDIA Partnership: CoreWeave is an NVIDIA-preferred cloud partner with early access to new hardware (H200, Blackwell B100/B200) — customers get next-generation GPUs before hyperscalers deploy them at scale.
- Kubernetes-Native: CoreWeave runs on standard Kubernetes — teams deploy standard K8s manifests and Helm charts for training jobs, inference servers, and workflow orchestration without proprietary abstractions.
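The networking gap above can be made concrete with a back-of-envelope estimate. The sketch below (illustrative numbers only; real throughput depends on topology, NCCL tuning, and overlap with compute) compares the ideal time to ring all-reduce one full gradient sync of a 70B-parameter model in bf16 over 400Gbps InfiniBand versus 25Gbps Ethernet:

```python
# Back-of-envelope: ring all-reduce time for one gradient sync.
# In a ring all-reduce, each GPU sends/receives 2*(N-1)/N * S bytes
# for a payload of S bytes.

def allreduce_seconds(payload_bytes: float, n_gpus: int, link_gbps: float) -> float:
    """Ideal ring all-reduce time, ignoring latency and compute overlap."""
    traffic = 2 * (n_gpus - 1) / n_gpus * payload_bytes  # bytes per GPU
    link_bytes_per_s = link_gbps * 1e9 / 8               # Gbps -> bytes/s
    return traffic / link_bytes_per_s

grads = 70e9 * 2          # 70B params, bf16 gradients = 140 GB
n = 256                   # GPUs in the ring

ib = allreduce_seconds(grads, n, 400)   # NDR InfiniBand
eth = allreduce_seconds(grads, n, 25)   # commodity Ethernet

print(f"InfiniBand: {ib:.1f}s  Ethernet: {eth:.1f}s  ratio: {eth/ib:.0f}x")
```

Because the traffic term is identical, the gap is exactly the bandwidth ratio (16x here); in practice frameworks overlap communication with backward-pass compute, which slow links make impossible to hide.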

CoreWeave Infrastructure

GPU Portfolio:
- NVIDIA H100 SXM5 (80GB HBM3): Flagship training GPU, NVLink within node, InfiniBand between nodes
- NVIDIA H200 (141GB HBM3e): ~76% more memory and ~1.4x the memory bandwidth of the H100 (4.8TB/s vs 3.35TB/s)
- NVIDIA A100 (40GB/80GB): Previous generation, cost-effective for smaller-scale training
- NVIDIA RTX A6000 (48GB): Inference and visualization workloads
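As a rough guide to how these memory capacities map onto model sizes, the sketch below estimates how many GPUs are needed just to hold a model's weights (weights only; activations, KV cache, and optimizer state add substantially more, so treat the numbers as lower bounds):

```python
import math

# Per-GPU memory from the portfolio above, in GB.
GPU_MEM_GB = {"H100": 80, "H200": 141, "A100-80G": 80, "A6000": 48}

def gpus_to_hold(params_b: float, bytes_per_param: int, gpu: str) -> int:
    """Minimum GPUs whose combined memory holds the weights alone."""
    weights_gb = params_b * bytes_per_param  # billions of params -> GB
    return math.ceil(weights_gb / GPU_MEM_GB[gpu])

print(gpus_to_hold(70, 2, "H100"))   # 70B in bf16 = 140 GB -> 2
print(gpus_to_hold(70, 2, "H200"))   # fits one 141 GB H200 -> 1
print(gpus_to_hold(405, 2, "H100"))  # 405B in bf16 = 810 GB -> 11
```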

Networking:
- HDR InfiniBand (200Gbps) or NDR InfiniBand (400Gbps) between nodes
- GPUDirect RDMA: GPU-to-GPU data transfer bypassing CPU for maximum bandwidth
- Rail-optimized topology: Minimize network hops for all-reduce in FSDP and Megatron training
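Multi-node training jobs typically surface these fabric features to NCCL through environment variables set before process-group initialization. A minimal sketch (the variable names are standard NCCL settings, but the values shown are illustrative assumptions, not CoreWeave-specific recommendations; consult your cluster's documentation):

```python
import os

# Illustrative NCCL tuning for an InfiniBand cluster.
nccl_env = {
    "NCCL_DEBUG": "INFO",         # log which fabric NCCL selects at startup
    "NCCL_IB_HCA": "mlx5",        # prefer Mellanox InfiniBand adapters
    "NCCL_NET_GDR_LEVEL": "PHB",  # allow GPUDirect RDMA across a PCIe host bridge
}
os.environ.update(nccl_env)

# torchrun / torch.distributed workers launched after this point
# inherit the settings before init_process_group("nccl").
print(os.environ["NCCL_IB_HCA"])
```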

Storage:
- WekaFS: High-performance parallel file system for streaming training data to GPUs
- S3-compatible object storage for model artifacts and datasets
- NFS persistent volumes for model checkpoints and experiment outputs
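One reason checkpoint storage bandwidth matters: a full training checkpoint includes optimizer state, not just weights. The sketch below (the ~16 bytes/param figure assumes fp32 Adam state plus master weights and bf16 weights, an illustrative assumption) estimates flush time at different write bandwidths:

```python
def checkpoint_seconds(params_b: float, bytes_per_param: float,
                       write_gb_s: float) -> float:
    """Time to flush a full training checkpoint at a given write bandwidth."""
    size_gb = params_b * bytes_per_param  # billions of params -> GB
    return size_gb / write_gb_s

# 70B params at ~16 bytes/param is ~1.12 TB per checkpoint.
print(checkpoint_seconds(70, 16, 10))   # 10 GB/s parallel FS -> 112.0
print(checkpoint_seconds(70, 16, 0.5))  # 0.5 GB/s NFS-class -> 2240.0
```

At slow write speeds, checkpointing stalls the whole cluster for tens of minutes, which is why parallel file systems sit in the training path.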

Use Cases

Large-Scale Pre-Training:
- Multi-node training of 7B to 405B+ parameter models
- Megatron-LM / DeepSpeed ZeRO-3 on 64-512+ GPU clusters
- InfiniBand enables near-linear scaling efficiency across nodes
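The near-linear-scaling claim is usually quantified as scaling efficiency: measured throughput at N GPUs divided by N times the single-GPU baseline. A sketch with hypothetical throughput numbers (not CoreWeave benchmarks):

```python
def scaling_efficiency(tokens_per_s: float, n_gpus: int,
                       baseline_tokens_per_s: float) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return tokens_per_s / (n_gpus * baseline_tokens_per_s)

# Hypothetical: 1 GPU does 10k tok/s; 256 GPUs measure 2.39M tok/s.
eff = scaling_efficiency(2_390_000, 256, 10_000)
print(f"{eff:.1%}")   # ~93% of linear
```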

Production Inference:
- Deploy vLLM, TensorRT-LLM on dedicated H100 nodes with autoscaling
- Kubernetes-based scaling for variable traffic patterns
- Low-latency inference with dedicated GPU allocation (no shared tenancy)
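Capacity planning for autoscaled inference often reduces to a ceiling calculation. The sketch below (per-replica throughput is a hypothetical benchmark number, not a vLLM or CoreWeave figure) sizes replicas for a traffic peak with utilization headroom:

```python
import math

def replicas_needed(peak_rps: float, rps_per_replica: float,
                    headroom: float = 0.8) -> int:
    """Replicas so each runs at <= `headroom` of its benchmarked throughput."""
    return math.ceil(peak_rps / (rps_per_replica * headroom))

print(replicas_needed(120, 15))        # -> 10 (20% headroom)
print(replicas_needed(120, 15, 1.0))   # -> 8  (no headroom)
```

The headroom factor is what a Kubernetes autoscaler's target-utilization setting effectively encodes.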

Fine-Tuning at Scale:
- LoRA / QLoRA fine-tuning on single or multi-node clusters
- Axolotl, LLaMA-Factory, PEFT on CoreWeave with persistent checkpoint storage
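LoRA's appeal is that it trains a tiny fraction of the weights: each adapted d x k matrix gains two low-rank factors totaling r(d + k) parameters. A sketch of the arithmetic (layer shapes and target modules are illustrative, loosely modeled on a 7B-class transformer):

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable params LoRA adds to one d x k weight: B (d x r) + A (r x k)."""
    return r * (d + k)

# Hypothetical config: adapt q_proj and v_proj (each 4096 x 4096)
# across 32 layers at rank r=16.
per_matrix = lora_params(4096, 4096, 16)  # 131072 per adapted matrix
total = per_matrix * 2 * 32               # 2 matrices/layer, 32 layers
print(total, f"{total / 7e9:.2%} of 7B")  # 8388608, ~0.12% of the model
```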

CoreWeave vs Alternatives

| Provider | Scale | Networking | SLA | Price | Best For |
|----------|-------|-----------|-----|-------|---------|
| CoreWeave | Very High | InfiniBand | Enterprise | Medium | Large-scale training |
| AWS | Hyperscale | EFA (up to 3.2Tbps on P5) | Enterprise | High | Compliance, ecosystem |
| GCP | Hyperscale | ICI (TPU pods) | Enterprise | High | Google/Vertex ecosystem |
| Lambda Labs | Medium | Ethernet | High | Low | Research, smaller runs |
| RunPod | Low-Medium | Ethernet | Medium | Low | Budget training |

CoreWeave is a purpose-built AI hyperscaler providing InfiniBand-connected GPU infrastructure for training frontier models. By building data centers optimized for GPU-to-GPU communication rather than general-purpose workloads, it enables the distributed training runs that define the frontier of AI capability.
