GPU (Graphics Processing Unit) is a specialized processor designed for parallel processing tasks
- GPUs: Plural form of GPU
- Graphics Card: Physical hardware component containing a GPU, VRAM, and cooling system
- Accelerator: Specialized hardware that offloads computation from the CPU
Architecture Fundamentals
Core Components
- Streaming Multiprocessors (SMs): Contain multiple CUDA cores for parallel execution
- VRAM (Video RAM): High-bandwidth memory dedicated to the GPU
- Memory Bus: Data pathway between GPU and VRAM
- PCIe Interface: Connection to the motherboard/CPU
Parallelism Model
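GPUs excel at SIMD (Single Instruction, Multiple Data) operations; NVIDIA's variant is called SIMT (Single Instruction, Multiple Threads). The speedup from adding processors is bounded by the serial fraction of the work: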
$$ \text{Speedup} = \frac{T_{\text{sequential}}}{T_{\text{parallel}}} \leq \frac{1}{(1-P) + \frac{P}{N}} $$
Where:
- $P$ = Parallelizable fraction of code
- $N$ = Number of parallel processors
- This is Amdahl's Law
Performance Metrics
FLOPS (Floating Point Operations Per Second)
$$ \text{FLOPS} = \text{Cores} \times \text{Clock Speed (Hz)} \times \text{FLOPs per cycle} $$
Example calculation for a GPU with 10,000 cores at 2 GHz, each completing 2 FLOPs per cycle (one fused multiply-add):
$$ \text{FLOPS} = 10{,}000 \times 2 \times 10^9 \times 2 = 40 \text{ TFLOPS} $$
Memory Bandwidth
$$ \text{Bandwidth (GB/s)} = \frac{\text{Memory Clock (Hz)} \times \text{Bus Width (bits)} \times \text{Data Rate}}{8 \times 10^9} $$
Arithmetic Intensity
$$ \text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Accessed}} $$
The Roofline Model bounds performance:
$$ \text{Attainable FLOPS} = \min\left(\text{Peak FLOPS}, \text{Bandwidth} \times \text{Arithmetic Intensity}\right) $$
GPU Computing Concepts
Thread Hierarchy (CUDA Model)
- Thread: Smallest unit of execution
- Each thread has indices that are unique within its block: threadIdx.x, threadIdx.y, threadIdx.z
- Block: Group of threads that can cooperate
- Shared memory accessible within block
- Maximum threads per block: typically 1024
- Grid: Collection of blocks
- Total threads: $\text{Grid Size} \times \text{Block Size}$
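The index arithmetic behind the hierarchy can be emulated on the host. This is a sketch in plain C++ rather than device code, and the helper names `blocks_for` and `global_index` are hypothetical.

```cpp
#include <cstddef>

// Number of blocks needed so that blocks * threads_per_block >= n (ceiling division).
std::size_t blocks_for(std::size_t n, std::size_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Global 1D index a thread derives from its block index, block size, and thread index.
// Mirrors blockIdx.x * blockDim.x + threadIdx.x in a CUDA kernel.
std::size_t global_index(std::size_t block_idx, std::size_t block_dim, std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}
```

For 1000 elements and 256 threads per block, `blocks_for` returns 4, so 1024 threads are launched and the last 24 must be masked off by a bounds check inside the kernel.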
Memory Hierarchy
| Memory Type | Scope | Latency | Size |
|---|---|---|---|
| Registers | Thread | ~1 cycle | ~256 KB per SM |
| Shared Memory | Block | ~5 cycles | 48-164 KB |
| L1 Cache | SM | ~30 cycles | 128 KB |
| L2 Cache | Device | ~200 cycles | 4-50 MB |
| Global Memory (VRAM) | Device | ~400 cycles | 8-80 GB |
Matrix Operations (Key for AI/ML)
Matrix Multiplication Complexity
Standard matrix multiplication for $A_{m \times k} \cdot B_{k \times n}$:
$$ C_{ij} = \sum_{l=1}^{k} A_{il} \cdot B_{lj} $$
- Time Complexity: $O(m \times n \times k)$
- Naive: $O(n^3)$ for square matrices
- Strassen's Algorithm: $O(n^{2.807})$
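A CPU reference for the standard algorithm, as a sketch assuming row-major storage in flat vectors. The function name `matmul` is illustrative. The three nested loops make the $O(m \times n \times k)$ cost visible; on a GPU, each $(i, j)$ output element is typically assigned to its own thread.

```cpp
#include <cstddef>
#include <vector>

// C (m x n) = A (m x k) * B (k x n); all matrices are row-major flat vectors.
std::vector<float> matmul(const std::vector<float>& A, const std::vector<float>& B,
                          std::size_t m, std::size_t k, std::size_t n) {
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            // Dot product of row i of A with column j of B.
            for (std::size_t l = 0; l < k; ++l) {
                acc += A[i * k + l] * B[l * n + j];
            }
            C[i * n + j] = acc;
        }
    }
    return C;
}
```

The inner statement runs $m \times n \times k$ times, each iteration doing one multiply and one add, for $2mnk$ FLOPs in total.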
Tensor Core Operations
Mixed-precision matrix multiply-accumulate:
$$ D = A \times B + C $$
Where:
- $A, B$ are FP16 (16-bit floating point)
- $C, D$ are FP32 (32-bit floating point)
Throughput comparison (approximate; exact figures vary by GPU generation):
- FP32 CUDA Cores: ~40 TFLOPS
- FP16 Tensor Cores: ~300+ TFLOPS
- INT8 Tensor Cores: ~600+ TFLOPS
Power and Thermal Equations
Thermal Design Power (TDP)
$$ P_{\text{dynamic}} = \alpha \cdot C \cdot V^2 \cdot f $$
Where:
- $\alpha$ = Activity factor
- $C$ = Capacitance
- $V$ = Voltage
- $f$ = Frequency
Temperature Relationship
$$ T_{\text{junction}} = T_{\text{ambient}} + (P \times R_{\theta}) $$
Where $R_{\theta}$ is thermal resistance in °C/W.
Deep Learning Operations
Convolution (CNN)
For a 2D convolution with input $I$, kernel $K$, output $O$:
$$ O(i,j) = \sum_{m}\sum_{n} I(i+m, j+n) \cdot K(m,n) $$
Output dimensions:
$$ O_{\text{size}} = \left\lfloor \frac{I_{\text{size}} - K_{\text{size}} + 2P}{S} \right\rfloor + 1 $$
Where:
- $P$ = Padding
- $S$ = Stride
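Both formulas above can be sketched on the CPU. The names `conv_out_size` and `conv2d` are illustrative; the convolution assumes a square row-major input, a square kernel, no padding, and stride 1 to keep the example short.

```cpp
#include <cstddef>
#include <vector>

// Output length along one dimension: floor((in - kernel + 2*pad) / stride) + 1.
// Assumes in + 2*pad >= kernel and stride >= 1.
std::size_t conv_out_size(std::size_t in, std::size_t kernel, std::size_t pad, std::size_t stride) {
    return (in + 2 * pad - kernel) / stride + 1;  // integer division floors
}

// Direct 2D convolution in the cross-correlation form O(i,j) = sum I(i+m, j+q) * K(m,q).
// I is n x n, K is k x k, both row-major; no padding, stride 1.
std::vector<float> conv2d(const std::vector<float>& I, std::size_t n,
                          const std::vector<float>& K, std::size_t k) {
    const std::size_t out = conv_out_size(n, k, 0, 1);
    std::vector<float> O(out * out, 0.0f);
    for (std::size_t i = 0; i < out; ++i)
        for (std::size_t j = 0; j < out; ++j)
            for (std::size_t m = 0; m < k; ++m)
                for (std::size_t q = 0; q < k; ++q)
                    O[i * out + j] += I[(i + m) * n + (j + q)] * K[m * k + q];
    return O;
}
```

As a check on the size formula, a 224-wide input with a 7-wide kernel, padding 3, and stride 2 gives `conv_out_size(224, 7, 3, 2) == 112`.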
Attention Mechanism (Transformers)
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Time complexity: $O(n^2 \cdot d)$; memory for the attention score matrix: $O(n^2)$, where $n$ is sequence length and $d$ is the head dimension.
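A single-head CPU sketch of the attention formula follows, assuming row-major flat vectors and no masking or batching. The function name `attention` and its signature are illustrative only. Each query row is scored against all $n$ keys, which is where the quadratic cost in sequence length comes from.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Scaled dot-product attention for one head.
// Q, K: n x d_k; V: n x d_v; all row-major. Returns an n x d_v matrix.
std::vector<float> attention(const std::vector<float>& Q, const std::vector<float>& K,
                             const std::vector<float>& V,
                             std::size_t n, std::size_t d_k, std::size_t d_v) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(d_k));
    std::vector<float> out(n * d_v, 0.0f);
    std::vector<float> scores(n);  // one row of the n x n score matrix
    for (std::size_t i = 0; i < n; ++i) {
        // scores[j] = (Q_i . K_j) / sqrt(d_k)
        float max_score = -std::numeric_limits<float>::infinity();
        for (std::size_t j = 0; j < n; ++j) {
            float dot = 0.0f;
            for (std::size_t l = 0; l < d_k; ++l) dot += Q[i * d_k + l] * K[j * d_k + l];
            scores[j] = dot * scale;
            max_score = std::max(max_score, scores[j]);
        }
        // Softmax over the row, subtracting the max for numerical stability.
        float sum = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            scores[j] = std::exp(scores[j] - max_score);
            sum += scores[j];
        }
        // Output row i is the softmax-weighted sum of the rows of V.
        for (std::size_t j = 0; j < n; ++j) {
            const float w = scores[j] / sum;
            for (std::size_t c = 0; c < d_v; ++c) out[i * d_v + c] += w * V[j * d_v + c];
        }
    }
    return out;
}
```

This sketch keeps only one row of scores at a time; an implementation that materializes the full $QK^T$ matrix stores $n \times n$ values.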
Major GPU Vendors
NVIDIA
- Gaming: GeForce RTX series
- Professional: Quadro / RTX A-series
- Data Center: A100, H100, H200, B100, B200
- CUDA Ecosystem: Dominant in AI/ML
AMD
- Gaming: Radeon RX series
- Data Center: Instinct MI series (MI300X)
- ROCm: Open-source GPU computing platform
Intel
- Consumer: Arc A-series
- Data Center: Gaudi accelerators, Max series
Code Example: CUDA Kernel
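    // Vector addition kernel: each thread computes one element of C.
    __global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
        // Global index of this thread across the whole grid.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // Guard against the extra threads in the last block.
        if (idx < N) {
            C[idx] = A[idx] + B[idx];
        }
    }

    // Launch configuration (host code, inside a function; d_A, d_B, d_C are
    // device pointers allocated beforehand, e.g. with cudaMalloc).
    int threadsPerBlock = 256;
    // Round up so that every element is covered.
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);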
Quick Reference Formulas
| Metric | Formula |
|---|---|
| Thread Index (1D) | $\text{idx} = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x}$ |
| Memory Bandwidth (GB/s) | $BW = \frac{\text{Memory Clock (Hz)} \times \text{Bus Width (bits)} \times \text{Data Rate}}{8 \times 10^9}$ |
| FLOPS | $\text{Cores} \times \text{Clock Speed (Hz)} \times \text{FLOPs per cycle}$ (2 for FMA) |
| Power Efficiency | $\frac{\text{TFLOPS}}{\text{Watts}}$ |
| Occupancy | $\frac{\text{Active Warps}}{\text{Max Warps}} \times 100\%$ |