GPU (Graphics Processing Unit)

A GPU (Graphics Processing Unit) is a specialized processor designed for parallel processing tasks.


Architecture Fundamentals

Core Components

Parallelism Model

GPUs excel at SIMD (Single Instruction, Multiple Data) operations. The speedup from parallel execution is bounded by Amdahl's law:

$$ \text{Speedup} = \frac{T_{\text{sequential}}}{T_{\text{parallel}}} \leq \frac{1}{(1-P) + \frac{P}{N}} $$

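Where $P$ is the fraction of the work that can be parallelized and $N$ is the number of parallel processors.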
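
As a minimal sketch of the bound above, the helper below evaluates its right-hand side for a given parallel fraction and processor count. The function name `amdahlSpeedup` and the sample values in the note are illustrative assumptions, not part of any library.

```cpp
#include <cassert>

// Upper bound on speedup from Amdahl's law: 1 / ((1 - P) + P / N).
// p: fraction of the work that can run in parallel (0..1)
// n: number of parallel processors (>= 1)
double amdahlSpeedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}
```

With $P = 0.95$, even $N = 10{,}000$ processors give a bound just under 20x, because the 5% serial portion dominates the runtime.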


Performance Metrics

FLOPS (Floating Point Operations Per Second)

$$ \text{FLOPS} = \text{Cores} \times \text{Clock Speed (Hz)} \times \text{FLOPs per cycle} $$

Example calculation for a GPU with 10,000 cores at 2 GHz, each completing one fused multiply-add (2 FLOPs) per cycle:

$$ \text{FLOPS} = 10{,}000 \times 2 \times 10^9 \times 2 = 40 \text{ TFLOPS} $$
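
The same calculation can be sketched in code. The helper name `peakFlops` is a hypothetical choice for illustration, and the result is a theoretical peak rather than a measured figure.

```cpp
#include <cassert>

// Theoretical peak throughput in FLOP/s:
// cores x clock speed (Hz) x FLOPs per core per cycle.
double peakFlops(double cores, double clockHz, double flopsPerCycle) {
    assert(cores > 0.0 && clockHz > 0.0 && flopsPerCycle > 0.0);
    return cores * clockHz * flopsPerCycle;
}
```

Calling `peakFlops(10000, 2e9, 2)` reproduces the 40 TFLOPS figure above (4 x 10^13 FLOP/s).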

Memory Bandwidth

$$ \text{Bandwidth (GB/s)} = \frac{\text{Memory Clock (Hz)} \times \text{Bus Width (bits)} \times \text{Data Rate}}{8 \times 10^9} $$
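
A minimal sketch of the bandwidth formula follows. The function name `memoryBandwidthGBs` and the sample configuration in the note are assumptions chosen for round numbers, not the specification of a real device.

```cpp
#include <cassert>

// Theoretical memory bandwidth in GB/s.
// memClockHz:   memory clock in Hz
// busWidthBits: memory bus width in bits
// dataRate:     transfers per clock (for example 2 for double data rate)
double memoryBandwidthGBs(double memClockHz, double busWidthBits, double dataRate) {
    assert(memClockHz > 0.0 && busWidthBits > 0.0 && dataRate > 0.0);
    return memClockHz * busWidthBits * dataRate / (8.0 * 1e9);
}
```

For a hypothetical 1 GHz memory clock, 256-bit bus, and data rate of 2, this gives 64 GB/s.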

Arithmetic Intensity

$$ \text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Accessed}} $$

The Roofline Model bounds performance:

$$ \text{Attainable FLOPS} = \min\left(\text{Peak FLOPS}, \text{Bandwidth} \times \text{Arithmetic Intensity}\right) $$
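
The roofline bound can be sketched as two small helpers. The names `arithmeticIntensity` and `attainableFlops` are hypothetical, and the device figures in the note are assumed values for illustration.

```cpp
#include <algorithm>
#include <cassert>

// FLOPs performed per byte moved to or from memory.
double arithmeticIntensity(double flops, double bytesAccessed) {
    assert(bytesAccessed > 0.0);
    return flops / bytesAccessed;
}

// Roofline bound: the lower of the compute roof and the memory roof.
// peak:      peak compute throughput in FLOP/s
// bandwidth: memory bandwidth in bytes/s
// intensity: arithmetic intensity in FLOPs/byte
double attainableFlops(double peak, double bandwidth, double intensity) {
    return std::min(peak, bandwidth * intensity);
}
```

Single-precision vector addition performs 1 FLOP per 12 bytes (two 4-byte loads and one 4-byte store). On an assumed 40 TFLOPS, 1 TB/s device the bound is about 83 GFLOP/s, so the kernel is limited by memory bandwidth, not by compute.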


GPU Computing Concepts

Thread Hierarchy (CUDA Model)

Memory Hierarchy

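| Memory Type | Scope | Latency | Size |
| --- | --- | --- | --- |
| Registers | Thread | ~1 cycle | ~256 KB total |
| Shared Memory | Block | ~5 cycles | 48-164 KB |
| L1 Cache | SM | ~30 cycles | 128 KB |
| L2 Cache | Device | ~200 cycles | 4-50 MB |
| Global Memory (VRAM) | Device | ~400 cycles | 8-80 GB |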

Matrix Operations (Key for AI/ML)

Matrix Multiplication Complexity

Standard matrix multiplication for $A_{m \times k} \cdot B_{k \times n}$:

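$$ C_{ij} = \sum_{l=1}^{k} A_{il} \cdot B_{lj} $$

Each of the $m \times n$ output elements needs $k$ multiply-add operations, so the total cost is $2mnk$ FLOPs, or $O(n^3)$ for square matrices.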
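
The sum above maps directly onto three nested loops. The sketch below is a naive CPU reference, assuming row-major storage. The names `matmul` and `matmulFlops` are hypothetical, and a GPU implementation would typically assign output elements to threads instead of looping over them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive row-major matrix multiply: C (m x n) = A (m x k) * B (k x n).
std::vector<float> matmul(const std::vector<float>& A, const std::vector<float>& B,
                          std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n);
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t l = 0; l < k; ++l) {
                acc += A[i * k + l] * B[l * n + j];  // one multiply and one add
            }
            C[i * n + j] = acc;
        }
    }
    return C;
}

// FLOP count of the loops above: one multiply and one add per inner iteration.
std::size_t matmulFlops(std::size_t m, std::size_t k, std::size_t n) {
    return 2 * m * k * n;
}
```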

Tensor Core Operations

Mixed-precision matrix multiply-accumulate:

$$ D = A \times B + C $$

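Where $A$ and $B$ are low-precision inputs (for example FP16) and $C$ and $D$ are accumulators held in FP16 or FP32.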

Throughput comparison:


Power and Thermal Equations

Thermal Design Power (TDP)

$$ P_{\text{dynamic}} = \alpha \cdot C \cdot V^2 \cdot f $$

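Where $\alpha$ is the activity (switching) factor, $C$ is the switched capacitance, $V$ is the supply voltage, and $f$ is the clock frequency.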
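
A short sketch of the dynamic power equation shows why voltage matters more than frequency. The function name `dynamicPower` and the sample values in the note are illustrative assumptions.

```cpp
#include <cassert>

// Dynamic (switching) power in watts: alpha * C * V^2 * f.
// alpha:        activity factor (0..1)
// capacitanceF: switched capacitance in farads
// voltageV:     supply voltage in volts
// freqHz:       clock frequency in Hz
double dynamicPower(double alpha, double capacitanceF, double voltageV, double freqHz) {
    assert(alpha >= 0.0 && alpha <= 1.0);
    return alpha * capacitanceF * voltageV * voltageV * freqHz;
}
```

Because power scales with $V^2$, lowering the voltage by 10% at a fixed frequency cuts dynamic power by about 19% ($0.9^2 = 0.81$).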

Temperature Relationship

$$ T_{\text{junction}} = T_{\text{ambient}} + (P \times R_{\theta}) $$

Where $R_{\theta}$ is thermal resistance in °C/W.
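
The temperature relationship can be sketched in both directions: junction temperature from power, and the power budget for a given temperature limit. The helper names and the numbers in the note are assumptions for illustration, not vendor figures.

```cpp
#include <cassert>

// Junction temperature in degrees C: T_ambient + P * R_theta.
double junctionTempC(double ambientC, double powerW, double thetaCPerW) {
    assert(powerW >= 0.0 && thetaCPerW > 0.0);
    return ambientC + powerW * thetaCPerW;
}

// Same equation solved for P: the largest power that keeps the
// junction at or below limitC.
double maxPowerW(double limitC, double ambientC, double thetaCPerW) {
    assert(thetaCPerW > 0.0 && limitC >= ambientC);
    return (limitC - ambientC) / thetaCPerW;
}
```

With an assumed 25 °C ambient, 300 W of dissipation, and $R_{\theta} = 0.2$ °C/W, the junction sits at 85 °C.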


Deep Learning Operations

Convolution (CNN)

For a 2D convolution with input $I$, kernel $K$, output $O$:

$$ O(i,j) = \sum_{m}\sum_{n} I(i+m, j+n) \cdot K(m,n) $$
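
The double sum can be sketched as a direct CPU loop. This version assumes stride 1, no padding, and row-major storage, matching the formula as written. The name `conv2d` is hypothetical, and production libraries typically use faster algorithms than this direct form.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct 2D convolution in the cross-correlation form used above.
// input:  h x w, row-major;  kernel: kh x kw, row-major.
// Returns an (h - kh + 1) x (w - kw + 1) output, row-major.
std::vector<float> conv2d(const std::vector<float>& input, int h, int w,
                          const std::vector<float>& kernel, int kh, int kw) {
    assert(h >= kh && w >= kw && kh > 0 && kw > 0);
    assert(input.size() == static_cast<std::size_t>(h) * static_cast<std::size_t>(w));
    assert(kernel.size() == static_cast<std::size_t>(kh) * static_cast<std::size_t>(kw));
    const int oh = h - kh + 1;
    const int ow = w - kw + 1;
    std::vector<float> out(static_cast<std::size_t>(oh) * static_cast<std::size_t>(ow), 0.0f);
    for (int i = 0; i < oh; ++i) {
        for (int j = 0; j < ow; ++j) {
            float acc = 0.0f;
            for (int m = 0; m < kh; ++m) {
                for (int n = 0; n < kw; ++n) {
                    acc += input[(i + m) * w + (j + n)] * kernel[m * kw + n];
                }
            }
            out[i * ow + j] = acc;
        }
    }
    return out;
}
```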

Output dimensions:

$$ O_{\text{size}} = \left\lfloor \frac{I_{\text{size}} - K_{\text{size}} + 2P}{S} \right\rfloor + 1 $$

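Where $I_{\text{size}}$ is the input size, $K_{\text{size}}$ is the kernel size, $P$ is the padding added to each side, and $S$ is the stride.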
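
A minimal sketch of the output-size formula, applied to one spatial dimension at a time. The function name `convOutputSize` is a hypothetical choice; the parameters are the input size, kernel size, per-side padding, and stride.

```cpp
#include <cassert>

// Output length along one dimension: floor((in - k + 2p) / s) + 1.
int convOutputSize(int inSize, int kernelSize, int padding, int stride) {
    assert(stride > 0 && padding >= 0);
    assert(inSize + 2 * padding >= kernelSize);
    // The numerator is non-negative here, so integer division acts as floor.
    return (inSize - kernelSize + 2 * padding) / stride + 1;
}
```

A 3-wide kernel with padding 1 and stride 1 preserves the input size (32 stays 32), while a 7-wide kernel with padding 3 and stride 2 reduces 224 to 112.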

Attention Mechanism (Transformers)

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Computing the $n \times n$ score matrix $QK^T$ takes $O(n^2 \cdot d)$ time and $O(n^2)$ memory, where $n$ is the sequence length and $d$ is the head dimension.
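
The attention formula can be sketched as a single-head CPU reference. This is an illustration under stated assumptions (no masking, no batching, one head). The `Matrix` alias and the `attention` function are hypothetical names, and the softmax subtracts the row maximum for numerical stability, which does not change the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Single-head scaled dot-product attention.
// Q: n x d_k, K: n_k x d_k, V: n_k x d_v. Returns n x d_v.
Matrix attention(const Matrix& Q, const Matrix& K, const Matrix& V) {
    assert(!Q.empty() && !K.empty() && K.size() == V.size());
    const std::size_t dk = Q[0].size();
    const std::size_t dv = V[0].size();
    const float scale = 1.0f / std::sqrt(static_cast<float>(dk));
    Matrix out(Q.size(), std::vector<float>(dv, 0.0f));
    std::vector<float> scores(K.size());  // one row of the score matrix
    for (std::size_t i = 0; i < Q.size(); ++i) {
        // scores[j] = (Q_i . K_j) / sqrt(d_k)
        for (std::size_t j = 0; j < K.size(); ++j) {
            float dot = 0.0f;
            for (std::size_t c = 0; c < dk; ++c) dot += Q[i][c] * K[j][c];
            scores[j] = dot * scale;
        }
        // Softmax over the row, shifted by the row maximum.
        const float maxScore = *std::max_element(scores.begin(), scores.end());
        float sum = 0.0f;
        for (float& s : scores) {
            s = std::exp(s - maxScore);
            sum += s;
        }
        // Output row i is the softmax-weighted sum of the rows of V.
        for (std::size_t j = 0; j < K.size(); ++j) {
            const float weight = scores[j] / sum;
            for (std::size_t c = 0; c < dv; ++c) out[i][c] += weight * V[j][c];
        }
    }
    return out;
}
```

Each output row is a weighted average of the rows of $V$, so if every row of $V$ is identical the output equals that row regardless of $Q$ and $K$.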


Major GPU Vendors

NVIDIA

AMD

Intel


Code Example: CUDA Kernel

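// Vector addition kernel: each thread computes one element of C = A + B
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < N) {  // guard: the last block may extend past N
        C[idx] = A[idx] + B[idx];
    }
}

// Launch configuration (d_A, d_B, d_C are device pointers; N is the element count)
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  // ceiling of N / threadsPerBlock
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);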
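
To see how the launch configuration covers every element, the sketch below emulates the grid on the CPU: the two loops stand in for `blockIdx.x` and `threadIdx.x`. It illustrates the indexing only and says nothing about GPU execution order or performance. The names `blocksFor` and `vectorAddEmulated` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed to cover n elements (ceiling division).
int blocksFor(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// CPU emulation of the 1D launch: visits every (block, thread) pair
// and applies the same index computation and bounds guard as the kernel.
std::vector<float> vectorAddEmulated(const std::vector<float>& A,
                                     const std::vector<float>& B,
                                     int threadsPerBlock) {
    assert(A.size() == B.size());
    const int n = static_cast<int>(A.size());
    const int blocksPerGrid = blocksFor(n, threadsPerBlock);
    std::vector<float> C(A.size(), 0.0f);
    for (int block = 0; block < blocksPerGrid; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            const int idx = block * threadsPerBlock + thread;
            if (idx < n) {  // threads past the end do nothing
                C[idx] = A[idx] + B[idx];
            }
        }
    }
    return C;
}
```

For 1,000 elements and 256 threads per block, 4 blocks launch 1,024 threads and the guard idles the last 24.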

Quick Reference Formulas

| Metric | Formula |
| --- | --- |
| Thread Index (1D) | $\text{idx} = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x}$ |
| Memory Bandwidth | $BW = \frac{\text{Clock (Hz)} \times \text{Width (bits)} \times \text{Data Rate}}{8 \times 10^9}$ GB/s |
| FLOPS | $\text{Cores} \times \text{Freq (Hz)} \times \text{FLOPs per cycle}$ |
| Power Efficiency | $\frac{\text{TFLOPS}}{\text{Watts}}$ |
| Utilization (occupancy) | $\frac{\text{Active Warps}}{\text{Max Warps}} \times 100\%$ |
