CUDA (Compute Unified Device Architecture)

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and API for GPU programming. It lets developers use GPU hardware for general-purpose computing and underpins most modern AI/ML frameworks, with ecosystem support through libraries such as cuDNN and cuBLAS and integration with PyTorch and TensorFlow.

What Is CUDA?

Why CUDA Dominates AI

CUDA Architecture Concepts

Execution Model:

CPU (Host)              GPU (Device)
    │                        │
    ▼                        │
┌─────────┐                  │
│ Program │                  │
│ (Host)  │──── kernel ────▶ │
└─────────┘      launch      │
                             ▼
                    ┌─────────────────┐
                    │ Grid of Blocks  │
                    │  ┌───┬───┬───┐  │
                    │  │Blk│Blk│Blk│  │
                    │  ├───┼───┼───┤  │
                    │  │Blk│Blk│Blk│  │
                    │  └───┴───┴───┘  │
                    │                 │
                    │ Each block has  │
                    │ threads, run in │
                    │ warps of 32     │
                    └─────────────────┘

Hierarchy:

Level        | Unit          | Maps To
-------------|---------------|-------------------------------------
Grid         | Full workload | Kernel launch
Block        | Thread group  | Runs on one Streaming Multiprocessor
Warp         | 32 threads    | Scheduled and executed together
Thread       | Single worker | CUDA core
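
To make the hierarchy concrete, the sketch below emulates in plain Python how a one-dimensional launch gives each thread a global index and a place within a warp. It is an illustration only, not GPU code: the helper names `locate_thread` and `warps_per_block` and the returned dictionary are hypothetical, and the warp size of 32 is taken from the table above.

```python
WARP_SIZE = 32  # threads per warp, as listed in the hierarchy table


def locate_thread(block_idx, thread_idx, block_dim):
    """Emulate 1-D CUDA indexing for a single thread (illustrative helper)."""
    if not 0 <= thread_idx < block_dim:
        raise ValueError("thread_idx must lie in [0, block_dim)")
    return {
        # Same formula a kernel uses: blockDim.x * blockIdx.x + threadIdx.x
        "global_id": block_dim * block_idx + thread_idx,
        # Consecutive thread indices within a block are grouped into warps
        "warp_in_block": thread_idx // WARP_SIZE,
        "lane_in_warp": thread_idx % WARP_SIZE,
    }


def warps_per_block(block_dim):
    """Warps needed for one block; a partially filled warp still counts as one."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE


print(locate_thread(block_idx=2, thread_idx=70, block_dim=256))
print(warps_per_block(256))
```

With a block size of 256, thread 70 of block 2 has global index 582 and sits in lane 6 of warp 2 within its block. Block sizes that are multiples of 32 avoid partially filled warps.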

Simple CUDA Example

Vector Addition:

// Kernel definition
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    // Calculate global thread ID
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Host code
int main() {
    int n = 1000000;
    size_t bytes = n * sizeof(float);

    // Allocate and initialize host (CPU) arrays
    float *h_a = new float[n];
    float *h_b = new float[n];
    float *h_c = new float[n];
    for (int i = 0; i < n; i++) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }

    // Allocate GPU memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy input data to GPU
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel with enough blocks to cover all n elements
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);

    // Copy result back (this call waits for the kernel to finish)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Free GPU and host memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
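
The launch configuration in the host code uses ceiling division so that every element gets a thread even when n is not a multiple of the block size. The arithmetic can be checked in plain Python; the function name `launch_config` is hypothetical and exists only for this illustration.

```python
def launch_config(n, block_size=256):
    """Return (num_blocks, idle_threads) for a 1-D launch covering n elements."""
    # Ceiling division, matching (n + blockSize - 1) / blockSize in the host code
    num_blocks = (n + block_size - 1) // block_size
    # Threads in the last block whose index fails the kernel's `i < n` check
    idle_threads = num_blocks * block_size - n
    return num_blocks, idle_threads


num_blocks, idle = launch_config(1_000_000, 256)
print(num_blocks, idle)  # 3907 192
```

For n = 1,000,000 and 256 threads per block, the launch uses 3907 blocks, and the last 192 threads do no work. This is why the kernel needs the `if (i < n)` bounds check.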

CUDA Libraries

Key Libraries:

Library      | Purpose
-------------|----------------------------------
cuDNN        | Deep learning primitives
cuBLAS       | Linear algebra (BLAS)
cuFFT        | Fast Fourier transforms
cuSPARSE     | Sparse matrix operations
cuRAND       | Random number generation
NCCL         | Multi-GPU communication
TensorRT     | Inference optimization

Framework Integration:

Framework    | CUDA Usage
-------------|----------------------------------
PyTorch      | torch.cuda, automatic dispatch
TensorFlow   | GPU ops, XLA compilation
JAX          | XLA with CUDA backend
RAPIDS       | GPU data science

PyTorch CUDA Usage

import torch

# Check CUDA availability
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())

# Move tensor to GPU
x = torch.randn(1000, 1000)
x_gpu = x.cuda()  # or x.to("cuda")

# Operations on GPU
y_gpu = x_gpu @ x_gpu.T  # Matrix multiply on GPU

# Move back to CPU
y_cpu = y_gpu.cpu()

# Specify device
device = torch.device("cuda:0")
model = torch.nn.Linear(1000, 10).to(device)  # placeholder model; substitute your own nn.Module

CUDA Versions

CUDA Version | Features                    | Min. Linux Driver
-------------|-----------------------------|------------------
12.x         | Hopper support, async       | 525+
11.x         | Ampere, BF16, TF32          | 450+
10.x         | Turing, mixed precision     | 410+

Version Checking:

# CUDA toolkit version
nvcc --version

# Driver version
nvidia-smi

# PyTorch CUDA version
python -c "import torch; print(torch.version.cuda)"
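
Once the driver and toolkit versions are known, they can be compared against the version table in this section. The sketch below is a rough check only: it compares major driver numbers, whereas the exact minimums in NVIDIA's release notes include minor components and differ by platform. The names `MIN_DRIVER_MAJOR` and `driver_supports` are hypothetical, and the version strings in the usage lines are made-up sample inputs.

```python
# Minimum driver major versions, copied from the CUDA Versions table
MIN_DRIVER_MAJOR = {12: 525, 11: 450, 10: 410}


def driver_supports(driver_version, cuda_version):
    """Return True if the driver's major version meets the table's minimum."""
    driver_major = int(driver_version.split(".")[0])
    cuda_major = int(cuda_version.split(".")[0])
    if cuda_major not in MIN_DRIVER_MAJOR:
        raise ValueError(f"no table entry for CUDA {cuda_major}.x")
    return driver_major >= MIN_DRIVER_MAJOR[cuda_major]


# Sample inputs, such as might be read from nvidia-smi and torch.version.cuda
print(driver_supports("535.104.05", "12.1"))  # True
print(driver_supports("470.82.01", "12.1"))   # False
```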

CUDA is core infrastructure for AI computing. Alternatives exist, but CUDA's maturity, optimization, and ecosystem integration make it the de facto standard for AI development, and most frameworks, models, and workflows assume CUDA-enabled NVIDIA GPUs.
