CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and API for GPU programming. It lets developers use GPU hardware for general-purpose computing and underpins modern AI/ML frameworks, with ecosystem support through libraries such as cuDNN and cuBLAS and integration with PyTorch and TensorFlow.
What Is CUDA?
- Definition: Programming model and API for NVIDIA GPU computing.
- Purpose: General-purpose GPU (GPGPU) programming.
- Language: C/C++ extensions with CUDA-specific syntax.
- Ecosystem: Libraries, tools, frameworks built on CUDA.
Why CUDA Dominates AI
- First Mover: Introduced in 2006, giving it a head start of more than a decade.
- Ecosystem: Massive library and framework support.
- Optimization: Highly tuned for NVIDIA hardware.
- Community: Large developer base and resources.
- Lock-in: Most AI code assumes CUDA.
CUDA Architecture Concepts
Execution Model:
CPU (Host)                   GPU (Device)
     │                             │
     ▼                             │
┌─────────┐                        │
│ Program │                        │
│ (Host)  │──── kernel ────▶       │
└─────────┘     launch             │
                                   ▼
                          ┌─────────────────┐
                          │ Grid of Blocks  │
                          │  ┌───┬───┬───┐  │
                          │  │Blk│Blk│Blk│  │
                          │  ├───┼───┼───┤  │
                          │  │Blk│Blk│Blk│  │
                          │  └───┴───┴───┘  │
                          │                 │
                          │ Each block has  │
                          │ threads (32×)   │
                          └─────────────────┘
Hierarchy:
Level        | Unit          | Maps To
-------------|---------------|-----------------------------------
Grid         | Full workload | Kernel launch
Block        | Thread group  | One Streaming Multiprocessor (SM)
Warp         | 32 threads    | Scheduling unit within an SM
Thread       | Single worker | CUDA core
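To make the hierarchy concrete, here is a minimal Python sketch (plain CPU code, not CUDA) that splits a 1-D global thread index into block, thread, warp, and lane coordinates. The names `locate_thread` and `ThreadLocation` are hypothetical, introduced only for illustration, and the warp size of 32 is taken from the table above.

```python
from typing import NamedTuple

WARP_SIZE = 32  # warp width, as listed in the hierarchy table


class ThreadLocation(NamedTuple):
    block: int   # corresponds to blockIdx.x
    thread: int  # corresponds to threadIdx.x within the block
    warp: int    # warp index within the block
    lane: int    # position of the thread within its warp


def locate_thread(global_id: int, block_size: int) -> ThreadLocation:
    """Map a 1-D global thread index to (block, thread, warp, lane)."""
    if global_id < 0 or block_size <= 0:
        raise ValueError("global_id must be >= 0 and block_size must be > 0")
    block, thread = divmod(global_id, block_size)
    warp, lane = divmod(thread, WARP_SIZE)
    return ThreadLocation(block, thread, warp, lane)


if __name__ == "__main__":
    print(locate_thread(global_id=1000, block_size=256))
```

Choosing a block size that is a multiple of 32 keeps every warp in a block fully populated, which is one reason values such as 128 or 256 are common.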
Simple CUDA Example
Vector Addition:
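#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel definition: each thread adds one pair of elements
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    // Calculate global thread ID
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {  // guard: the last block may contain threads past the end
        c[i] = a[i] + b[i];
    }
}

// Host code
int main() {
    int n = 1000000;
    size_t bytes = n * sizeof(float);

    // Allocate and initialize host memory
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate GPU memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy data to GPU
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel with enough blocks to cover all n elements
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);

    // Copy result back (this cudaMemcpy waits for the kernel to finish)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    // Free GPU and host memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}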
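The launch configuration in the vector addition example uses ceiling division, so the grid usually contains a few more threads than elements. The Python sketch below (CPU-only, for illustration) reproduces that arithmetic and emulates the kernel's index calculation and bounds guard. The helper names `launch_config` and `simulate_vector_add` are hypothetical and not part of any CUDA API.

```python
from typing import List, Tuple


def launch_config(n: int, block_size: int = 256) -> Tuple[int, int]:
    """Return (num_blocks, idle_threads) for a 1-D launch covering n elements."""
    if n < 0 or block_size <= 0:
        raise ValueError("n must be >= 0 and block_size must be > 0")
    num_blocks = (n + block_size - 1) // block_size  # ceiling division
    idle_threads = num_blocks * block_size - n       # threads that fail `i < n`
    return num_blocks, idle_threads


def simulate_vector_add(a: List[float], b: List[float],
                        block_size: int = 256) -> List[float]:
    """Emulate the kernel on the CPU: one iteration per (block, thread) pair."""
    n = len(a)
    num_blocks, _ = launch_config(n, block_size)
    c = [0.0] * n
    for block_idx in range(num_blocks):
        for thread_idx in range(block_size):
            i = block_size * block_idx + thread_idx  # global thread ID
            if i < n:  # same guard as the kernel
                c[i] = a[i] + b[i]
    return c


if __name__ == "__main__":
    print(launch_config(1_000_000, 256))
    print(simulate_vector_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], block_size=2))
```

Without the `i < n` guard, the surplus threads in the last block would read and write past the end of the arrays.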
CUDA Libraries
Key Libraries:
Library      | Purpose
-------------|----------------------------------
cuDNN        | Deep learning primitives
cuBLAS       | Linear algebra (BLAS)
cuFFT        | Fast Fourier transforms
cuSPARSE     | Sparse matrix operations
cuRAND       | Random number generation
NCCL         | Multi-GPU communication
TensorRT     | Inference optimization
Framework Integration:
Framework    | CUDA Usage
-------------|----------------------------------
PyTorch      | torch.cuda, automatic dispatch
TensorFlow   | GPU ops, XLA compilation
JAX          | XLA with CUDA backend
RAPIDS       | GPU data science
PyTorch CUDA Usage
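import torch

# Check CUDA availability
print(torch.cuda.is_available())
print(torch.cuda.device_count())
if torch.cuda.is_available():
    print(torch.cuda.current_device())  # raises if no CUDA device is present

# Specify device, falling back to CPU so the script runs anywhere
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Move tensor to GPU
x = torch.randn(1000, 1000)
x_gpu = x.to(device)  # x.cuda() also works when a GPU is present

# Operations on GPU
y_gpu = x_gpu @ x_gpu.T  # Matrix multiply on the selected device

# Move back to CPU
y_cpu = y_gpu.cpu()

# Move a model to the device (nn.Linear stands in for your own model class)
model = torch.nn.Linear(1000, 10).to(device)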
CUDA Versions
CUDA Version | Features                    | Min. Driver (Linux)
-------------|-----------------------------|--------------------
12.x         | Hopper support, async       | 525+
11.x         | Ampere, BF16, TF32          | 450+
10.x         | Turing, mixed precision     | 410+
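As a rough illustration of how the table can be applied, the Python sketch below checks whether a driver version meets the minimum branch listed for a CUDA major version. The function `driver_supports` and the dictionary `MIN_DRIVER_BRANCH` are hypothetical names, and the comparison looks only at the leading branch number from the table. Exact minimums vary by minor release, so treat NVIDIA's release notes as the authority.

```python
# Minimum driver branch per CUDA major version, copied from the table above.
MIN_DRIVER_BRANCH = {12: 525, 11: 450, 10: 410}


def driver_supports(cuda_version: str, driver_version: str) -> bool:
    """Return True if the driver branch meets the table's minimum.

    Only the major CUDA version and the leading driver number are compared.
    """
    cuda_major = int(cuda_version.split(".")[0])
    driver_branch = int(driver_version.split(".")[0])
    if cuda_major not in MIN_DRIVER_BRANCH:
        raise KeyError(f"CUDA {cuda_major}.x is not covered by the table")
    return driver_branch >= MIN_DRIVER_BRANCH[cuda_major]


if __name__ == "__main__":
    print(driver_supports("12.1", "535.104.05"))
    print(driver_supports("12.0", "470.82.01"))
```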
Version Checking:
# CUDA toolkit version
nvcc --version
# Driver version (the header also shows the highest CUDA version the driver supports)
nvidia-smi
# PyTorch CUDA version
python -c "import torch; print(torch.version.cuda)"
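The three checks above can also be gathered from one script. The following Python sketch is one possible way to do that, assuming `nvcc` and `nvidia-smi` are on the `PATH` when installed. The helper names `run_if_available` and `collect_versions` are hypothetical, and any tool that is missing is reported as `None` rather than raising an error.

```python
import shutil
import subprocess
from typing import Dict, List, Optional


def run_if_available(cmd: List[str]) -> Optional[str]:
    """Run cmd and return its stdout, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def collect_versions() -> Dict[str, Optional[str]]:
    """Gather toolkit, driver, and PyTorch CUDA versions where available."""
    versions: Dict[str, Optional[str]] = {
        "nvcc": run_if_available(["nvcc", "--version"]),
        "driver": run_if_available(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        ),
    }
    try:
        import torch  # optional dependency

        versions["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    except (ImportError, OSError):
        versions["torch_cuda"] = None
    return versions


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Comparing the three values side by side helps spot a common mismatch: a PyTorch build compiled against a newer CUDA version than the installed driver supports.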
CUDA is the essential infrastructure of AI computing. Alternatives exist, but CUDA's maturity, optimization, and ecosystem integration make it the de facto standard for AI development, and most frameworks, models, and workflows assume CUDA-enabled NVIDIA GPUs.