god class detection, code ai
Find classes doing too much.
355 technical terms and definitions
High-quality reference annotations.
Use gold wire.
Best-performing chamber used as reference.
Reference wafer with known good properties for calibration.
Goodness-of-fit tests compare observed distributions to theoretical expectations.
DeepMind's large language model.
LLM trained to use APIs and tools effectively.
Gorilla is an LLM trained for API calling. It reduces hallucinated API calls.
Gowning procedures specify sequence for donning cleanroom apparel.
Protocols for cleanroom garments.
Area where workers change into cleanroom suits before entering fab.
Autoregressive model family for text generation.
GPT Engineer generates codebases from specs. Agentic coding.
GPT-J is EleutherAI's 6B-parameter model, a popular open model.
GPT-NeoX is EleutherAI's 20B-parameter model, released as open source with open weights.
OpenAI's multimodal large language model.
OpenAI's multimodal GPT-4.
GPT4All, from Nomic AI, runs models locally on desktop hardware with a focus on privacy.
GPTQ is a post-training quantization method that uses calibration data. It is popular for 4-bit models.
Parallel processor widely used for deep learning.
Interconnected GPUs for parallel training.
Remote direct memory access for GPUs.
NVIDIA technology for direct data transfer.
Levels of memory on GPU.
How much GPU memory is used.
NVIDIA GPU Operator manages GPU drivers and device plugins in Kubernetes. It is a common way to enable GPU workloads on k8s.
Percentage of GPU compute actually being used.
# GPU (Graphics Processing Unit)
## Graphics Processing Unit
- **GPU (Graphics Processing Unit)**: A specialized processor designed for parallel processing tasks
- **GPUs**: Plural form of GPU
- **Graphics Card**: Physical hardware component containing a GPU, VRAM, and cooling system
- **Accelerator**: Specialized hardware that offloads computation from the CPU
## Architecture Fundamentals
### Core Components
- **Streaming Multiprocessors (SMs)**: Contain multiple CUDA cores for parallel execution
- **VRAM (Video RAM)**: High-bandwidth memory dedicated to the GPU
- **Memory Bus**: Data pathway between GPU and VRAM
- **PCIe Interface**: Connection to the motherboard/CPU
### Parallelism Model
GPUs excel at **SIMD** (Single Instruction, Multiple Data) operations, in which many processors apply the same instruction to different data elements. The speedup gained from adding processors is bounded by **Amdahl's Law**:
$$
\text{Speedup} = \frac{T_{\text{sequential}}}{T_{\text{parallel}}} \leq \frac{1}{(1-P) + \frac{P}{N}}
$$
Where:
- $P$ = Parallelizable fraction of code
- $N$ = Number of parallel processors
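As a rough illustration of Amdahl's Law, the sketch below evaluates the bound for a fixed parallel fraction and a growing processor count. The function name `amdahl_speedup` and the 95% figure are assumptions chosen for the example.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Upper bound on speedup for parallel fraction p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# With 95% parallel code, the serial 5% caps the speedup below 1 / 0.05 = 20,
# no matter how many processors are added.
for n in (10, 100, 1000, 10_000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

The bound flattens quickly, which is why reducing the serial fraction often matters more than adding cores.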
## Performance Metrics
### FLOPS (Floating Point Operations Per Second)
$$
\text{FLOPS} = \text{Cores} \times \text{Clock Speed (Hz)} \times \text{FLOPs per cycle}
$$
Example calculation for a GPU with 10,000 cores at 2 GHz, each performing 2 FLOPs per cycle (one fused multiply-add):
$$
\text{FLOPS} = 10{,}000 \times 2 \times 10^9 \times 2 = 40 \text{ TFLOPS}
$$
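The peak-FLOPS formula can be checked with a few lines of Python. This is a minimal sketch; the function name `peak_flops` is an assumption, and the inputs are the ones from the worked example.

```python
def peak_flops(cores: int, clock_hz: float, flops_per_cycle: int) -> float:
    """Theoretical peak: cores x clock x FLOPs issued per core per cycle."""
    return cores * clock_hz * flops_per_cycle


# 10,000 cores at 2 GHz, 2 FLOPs per cycle (one fused multiply-add)
example = peak_flops(cores=10_000, clock_hz=2e9, flops_per_cycle=2)
print(f"{example / 1e12:.0f} TFLOPS")  # prints "40 TFLOPS"
```

This is a theoretical ceiling; sustained throughput is usually lower because of memory stalls and imperfect occupancy.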
### Memory Bandwidth
$$
\text{Bandwidth (GB/s)} = \frac{\text{Memory Clock (Hz)} \times \text{Bus Width (bits)} \times \text{Data Rate}}{8 \times 10^9}
$$
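A short sketch of the bandwidth formula follows. The memory configuration (1.25 GHz memory clock, 256-bit bus, 8 transfers per clock) is a hypothetical example, not the specification of any particular product.

```python
def memory_bandwidth_gb_s(memory_clock_hz: float,
                          bus_width_bits: int,
                          data_rate: int) -> float:
    """Bandwidth in GB/s: bits per second divided by 8 bits/byte and 1e9."""
    return memory_clock_hz * bus_width_bits * data_rate / (8 * 1e9)


# Hypothetical configuration chosen only to exercise the formula
bandwidth = memory_bandwidth_gb_s(1.25e9, 256, 8)
print(f"{bandwidth:.0f} GB/s")
```

Doubling either the bus width or the data rate doubles the result, which is why wide HBM interfaces reach much higher bandwidth at modest clocks.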
### Arithmetic Intensity
$$
\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Accessed}}
$$
The **Roofline Model** bounds performance:
$$
\text{Attainable FLOPS} = \min\left(\text{Peak FLOPS}, \text{Bandwidth} \times \text{Arithmetic Intensity}\right)
$$
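The Roofline bound can be sketched as a single `min`. The peak value reuses the 40 TFLOPS example; the 1 TB/s bandwidth and the helper name `attainable_flops` are assumptions for illustration.

```python
def attainable_flops(peak_flops: float,
                     bandwidth_bytes_per_s: float,
                     arithmetic_intensity: float) -> float:
    """Roofline: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, bandwidth_bytes_per_s * arithmetic_intensity)


PEAK = 40e12       # 40 TFLOPS, as in the FLOPS example
BANDWIDTH = 1e12   # assumed 1 TB/s

# FP32 vector addition: 1 FLOP per element, 12 bytes moved (two loads, one store)
vector_add_ai = 1 / 12
ridge_point = PEAK / BANDWIDTH  # intensity above which a kernel is compute-bound

print(attainable_flops(PEAK, BANDWIDTH, vector_add_ai) / 1e12, "TFLOPS (memory-bound)")
print(attainable_flops(PEAK, BANDWIDTH, 100.0) / 1e12, "TFLOPS (compute-bound)")
```

Under these assumed numbers, a kernel needs about 40 FLOPs per byte before the compute peak becomes the limit.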
## GPU Computing Concepts
### Thread Hierarchy (CUDA Model)
- **Thread**: Smallest unit of execution
  - Each thread has indices that are unique within its block: `threadIdx.x`, `threadIdx.y`, `threadIdx.z`
- **Block**: Group of threads that can cooperate
  - Shared memory accessible within block
  - Maximum threads per block: typically 1024
- **Grid**: Collection of blocks
  - Total threads: $\text{Grid Size} \times \text{Block Size}$
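To make the hierarchy concrete, the following Python sketch emulates how a one-dimensional launch maps block and thread indices to a global element index. The helper names `launch_config` and `global_index` are hypothetical; on a real GPU the indices come from the built-in variables `blockIdx`, `blockDim`, and `threadIdx`.

```python
def launch_config(n: int, threads_per_block: int = 256) -> tuple[int, int]:
    """Return (blocks, total_threads) needed to cover n elements."""
    blocks = (n + threads_per_block - 1) // threads_per_block  # ceiling division
    return blocks, blocks * threads_per_block


def global_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


n = 1000
blocks, total_threads = launch_config(n)
# Threads whose global index falls past n must be masked out by a bounds check.
active = sum(
    1
    for b in range(blocks)
    for t in range(256)
    if global_index(b, 256, t) < n
)
print(blocks, total_threads, active)
```

The grid is rounded up to whole blocks, so some threads in the last block do no work.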
### Memory Hierarchy
| Memory Type | Scope | Latency (approx.) | Typical Size |
|-------------|-------|---------|------|
| Registers | Thread | ~1 cycle | ~256 KB per SM |
| Shared Memory | Block | ~20-30 cycles | 48-164 KB per SM |
| L1 Cache | SM | ~30 cycles | 128 KB |
| L2 Cache | Device | ~200 cycles | 4-50 MB |
| Global Memory (VRAM) | Device | ~400 cycles | 8-80 GB |
## Matrix Operations Key for AI/ML
### Matrix Multiplication Complexity
Standard matrix multiplication for $A_{m \times k} \cdot B_{k \times n}$:
$$
C_{ij} = \sum_{l=1}^{k} A_{il} \cdot B_{lj}
$$
- **Time Complexity**: $O(m \times n \times k)$
- **Naive**: $O(n^3)$ for square matrices
- **Strassen's Algorithm**: $O(n^{2.807})$
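The $O(m \times n \times k)$ cost of the standard algorithm can be seen by counting multiplications in a direct implementation. This is a plain-Python sketch for illustration, not an optimized routine; the multiplication counter is added only to expose the cost.

```python
def matmul(a: list[list[float]], b: list[list[float]]):
    """Naive product of an m x k and a k x n matrix; returns (C, multiply count)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")
    c = [[0.0] * n for _ in range(m)]
    multiplies = 0
    for i in range(m):
        for j in range(n):
            for l in range(k):
                c[i][j] += a[i][l] * b[l][j]  # C_ij = sum over l of A_il * B_lj
                multiplies += 1
    return c, multiplies


c, count = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(c, count)  # count equals m * n * k = 8
```

Each output element is independent of the others, which is what makes the operation map so well onto thousands of GPU threads.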
### Tensor Core Operations
Mixed-precision matrix multiply-accumulate:
$$
D = A \times B + C
$$
Where:
- $A, B$ are FP16 (16-bit floating point)
- $C, D$ are FP32 (32-bit floating point)
Throughput comparison (illustrative order-of-magnitude peak figures):
- **FP32 CUDA Cores**: ~40 TFLOPS
- **FP16 Tensor Cores**: ~300+ TFLOPS
- **INT8 Tensor Cores**: ~600+ TOPS (integer operations, so not FLOPS)
## Power and Thermal Equations
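### Thermal Design Power (TDP)
TDP is the sustained power that the cooling solution is designed to dissipate. A major contributor is dynamic (switching) power: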
$$
P_{\text{dynamic}} = \alpha \cdot C \cdot V^2 \cdot f
$$
Where:
- $\alpha$ = Activity factor
- $C$ = Capacitance
- $V$ = Voltage
- $f$ = Frequency
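The quadratic dependence on voltage is easiest to see numerically. In this sketch the activity factor and capacitance are made-up values; only the ratio between the two operating points is meaningful.

```python
def dynamic_power(alpha: float, capacitance: float,
                  voltage: float, frequency: float) -> float:
    """Dynamic switching power: alpha * C * V^2 * f."""
    return alpha * capacitance * voltage ** 2 * frequency


# Hypothetical operating points: lower both voltage and frequency by 10%
baseline = dynamic_power(alpha=0.2, capacitance=1e-7, voltage=1.0, frequency=2.0e9)
scaled = dynamic_power(alpha=0.2, capacitance=1e-7, voltage=0.9, frequency=1.8e9)

ratio = scaled / baseline  # close to 0.9 ** 3 = 0.729
print(f"power drops to {ratio:.1%} of baseline")
```

A 10% reduction in clock and voltage therefore saves roughly 27% of dynamic power, which is the reasoning behind voltage and frequency scaling.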
### Temperature Relationship
$$
T_{\text{junction}} = T_{\text{ambient}} + (P \times R_{\theta})
$$
Where $R_{\theta}$ is thermal resistance in °C/W.
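The temperature relationship can be applied in both directions: to estimate junction temperature, or to find the power budget for a temperature limit. The numeric values below are assumptions for illustration.

```python
def junction_temperature(t_ambient_c: float, power_w: float,
                         r_theta_c_per_w: float) -> float:
    """T_junction = T_ambient + P * R_theta."""
    return t_ambient_c + power_w * r_theta_c_per_w


def max_power(t_junction_max_c: float, t_ambient_c: float,
              r_theta_c_per_w: float) -> float:
    """Rearranged form: the largest P that keeps the junction under its limit."""
    return (t_junction_max_c - t_ambient_c) / r_theta_c_per_w


# Assumed: 25 °C ambient, 300 W load, 0.2 °C/W thermal resistance
print(junction_temperature(25.0, 300.0, 0.2))
print(max_power(85.0, 25.0, 0.2))
```

Lowering the thermal resistance (better cooling) raises the power budget for the same temperature limit.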
## Deep Learning Operations
### Convolution (CNN)
For a 2D convolution with input $I$, kernel $K$, output $O$:
$$
O(i,j) = \sum_{m}\sum_{n} I(i+m, j+n) \cdot K(m,n)
$$
Output dimensions:
$$
O_{\text{size}} = \left\lfloor \frac{I_{\text{size}} - K_{\text{size}} + 2P}{S} \right\rfloor + 1
$$
Where:
- $P$ = Padding
- $S$ = Stride
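Both convolution formulas can be exercised with a small direct implementation. This is an unoptimized sketch that assumes zero padding and a square stride; the function names are chosen for the example.

```python
def conv_output_size(i_size: int, k_size: int, padding: int, stride: int) -> int:
    """floor((I - K + 2P) / S) + 1"""
    return (i_size - k_size + 2 * padding) // stride + 1


def conv2d(image, kernel, padding: int = 0, stride: int = 1):
    """Direct 2D convolution as defined above (no kernel flip), zero-padded."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out_h = conv_output_size(h, kh, padding, stride)
    out_w = conv_output_size(w, kw, padding, stride)
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for m in range(kh):
                for n in range(kw):
                    acc += padded[i * stride + m][j * stride + n] * kernel[m][n]
            row.append(acc)
        out.append(row)
    return out


print(conv_output_size(224, 7, 3, 2))  # 112
print(conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]]))
```

Every output position is computed independently, so GPU implementations typically assign one thread (or a tile of threads) per output element.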
### Attention Mechanism (Transformers)
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
Time complexity: $O(n^2 \cdot d)$. Memory for the attention score matrix: $O(n^2)$, where $n$ is sequence length and $d$ is the head dimension.
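The attention formula can be written out step by step for a single head. This is a minimal pure-Python sketch without masking or batching; real implementations use fused GPU kernels.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(k[0])
    output = []
    for query in q:
        # One row of the n x n score matrix
        scores = [
            sum(qi * ki for qi, ki in zip(query, key)) / math.sqrt(d_k)
            for key in k
        ]
        weights = softmax(scores)
        # Weighted sum of value rows
        output.append([
            sum(wj * value[c] for wj, value in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return output


# Identical keys give uniform weights, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

The `scores` list has one entry per key for every query, which is where the quadratic growth in sequence length comes from.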
## Major GPU Vendors
### NVIDIA
- **Gaming**: GeForce RTX series
- **Professional**: Quadro / RTX A-series
- **Data Center**: A100, H100, H200, B100, B200
- **CUDA Ecosystem**: Dominant in AI/ML
### AMD
- **Gaming**: Radeon RX series
- **Data Center**: Instinct MI series (MI300X)
- **ROCm**: Open-source GPU computing platform
### Intel
- **Consumer**: Arc A-series
- **Data Center**: Gaudi accelerators, Max series
## Code Example: CUDA Kernel
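```cuda
// Vector addition kernel: each thread adds one pair of elements
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    // Global index of this thread across the whole grid
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {  // Guard: the last block may extend past N
        C[idx] = A[idx] + B[idx];
    }
}

// Launch configuration (host code, inside a host function)
int threadsPerBlock = 256;
// Round up so that every element is covered
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
// The source is truncated after "vectorAdd<<"; the launch arguments below are
// assumed, with d_A, d_B, d_C as hypothetical device pointers.
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
```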
Reasoning about visual scenes.
Graceful degradation maintains partial functionality when components fail.
System maintains functionality when components fail.
Graclus pooling uses deterministic graph coarsening algorithm for hierarchical graph classification.
Visualize important regions for predictions.
Gradient-weighted Class Activation Mapping highlights image regions important for CNN predictions.
Improved version of GradCAM.
Simulate larger batches.
Virtual batch size multiplication.
Gradient accumulation sums gradients over mini-batches before each weight update. Simulates a larger batch size when GPU memory is limited.
Gradient accumulation sums gradients over multiple microbatches before each weight update. Simulates a larger batch size.
Accumulate gradients over multiple mini-batches before updating weights.
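The gradient accumulation entries above can be illustrated with a tiny one-parameter model. This sketch assumes a mean-squared-error loss and equally sized microbatches, in which case the averaged accumulated gradient matches the full-batch gradient; the function names are hypothetical.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list[float], ys: list[float],
                     micro_batch_size: int) -> float:
    """Average of microbatch gradients; assumes equally sized microbatches."""
    total, steps = 0.0, 0
    for start in range(0, len(xs), micro_batch_size):
        stop = start + micro_batch_size
        total += grad_mse(w, xs[start:stop], ys[start:stop])  # accumulate, no update yet
        steps += 1
    return total / steps  # a single weight update would use this value


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
full = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(abs(full - accumulated) < 1e-9)
```

Only one microbatch needs to be resident in memory at a time, which is the point of the technique.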
Boosted trees for identifying defects.
Gradient boosting builds trees sequentially, each one fitted to the errors of the ensemble so far. Popular implementations include XGBoost, LightGBM, and CatBoost.
Group gradients for efficient communication.
Center gradients to improve training.
Gradient clipping bounds gradient norms, which limits privacy leakage and training instability.
Gradient clipping limits gradient magnitude. Prevents exploding gradients. Typical max norm 1.0.
Cap gradient magnitude to prevent exploding gradients.
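Clipping by global norm, as described in the entries above, can be sketched in a few lines. This version works on a flat list of gradient values and uses the typical max norm of 1.0 as its default; the function name is an assumption, not a specific library's API.

```python
import math


def clip_by_global_norm(grads: list[float], max_norm: float = 1.0):
    """Rescale grads so their L2 norm is at most max_norm; returns (clipped, original norm)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm  # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grads], norm


clipped, original_norm = clip_by_global_norm([3.0, 4.0])
print(original_norm, clipped)  # norm 5.0, rescaled to roughly [0.6, 0.8]
```

Rescaling preserves the gradient direction and only shortens the step, unlike clipping each component independently.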
Compress gradients while preserving privacy.