AI Factory Glossary

355 technical terms and definitions

god class detection, code ai

Static analysis that flags classes taking on too many responsibilities.

gold standard,data quality

High-quality reference annotations.

gold wire bonding, packaging

Wire bonding process that uses fine gold wire to connect die bond pads to package leads.

golden chamber,production

Best-performing chamber used as reference.

golden wafer,metrology

Reference wafer with known good properties for calibration.

goodness-of-fit, quality & reliability

Goodness-of-fit tests compare observed distributions to theoretical expectations.
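
As an illustration, a chi-square goodness-of-fit test in Python with scipy; the die-roll counts below are made-up example values, not data from this glossary:

```python
from scipy.stats import chisquare

# Hypothetical counts from 120 rolls of a die (illustrative data)
observed = [18, 22, 16, 25, 19, 20]
expected = [20] * 6  # a fair die predicts 120 / 6 = 20 per face

# A small p-value indicates the observed distribution deviates
# from the theoretical expectation
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
```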

gopher,foundation model

DeepMind's large language model.

gorilla,ai agent

LLM trained to use APIs and tools effectively.

gorilla,api,calling

Gorilla is an LLM fine-tuned for API calling; it reduces hallucination when generating API invocations.

gowning procedure, manufacturing operations

Gowning procedures specify the sequence for donning cleanroom apparel.

gowning procedures, facility

Protocols for cleanroom garments.

gowning room,facility

Area where workers change into cleanroom suits before entering fab.

gpt (generative pre-trained transformer),gpt,generative pre-trained transformer,foundation model

Autoregressive model family for text generation.

gpt engineer,code,generate

GPT Engineer generates codebases from natural-language specifications; an example of agentic coding.

gpt j,eleuther,6b

GPT-J is EleutherAI's 6-billion-parameter model and a popular open model.

gpt neox,eleuther,20b

GPT-NeoX is EleutherAI's 20-billion-parameter model, released as open source with open weights.

gpt-4,foundation model

OpenAI's multimodal large language model.

gpt-4v (gpt-4 vision),gpt-4v,gpt-4 vision,foundation model

Vision-enabled variant of GPT-4 that accepts image inputs alongside text.

gpt4all,local,desktop

GPT4All runs local models on desktop hardware; developed by Nomic AI with a privacy-focused design.

gptq,quantization,method

GPTQ is a post-training quantization method that uses calibration data to minimize quantization error; popular for 4-bit models.
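
For intuition, a minimal round-to-nearest 4-bit quantize/dequantize sketch in Python. This is a toy illustration only: GPTQ itself goes further, using calibration data and Hessian-based error compensation to round weights column by column.

```python
import numpy as np

def quantize_4bit(w: np.ndarray):
    """Toy per-tensor 4-bit round-to-nearest quantization (not GPTQ itself)."""
    qmin, qmax = 0, 15                                  # 4-bit unsigned range
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = round(-w.min() / scale)
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s, z = quantize_4bit(w)
print("max abs error:", float(np.abs(w - dequantize(q, s, z)).max()))
```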

gpu (graphics processing unit),gpu,graphics processing unit,hardware

Parallel processor widely used for deep learning.

gpu clusters for training, gpu, infrastructure

Interconnected GPUs for parallel training.

gpu direct rdma, gpu, infrastructure

Remote direct memory access for GPUs.

gpu direct, gpu, infrastructure

NVIDIA technology for transferring data directly to and from GPU memory without staging through host memory.

gpu memory hierarchy, gpu, hardware

Levels of GPU memory (registers, shared memory, caches, global memory) trading capacity against latency.

gpu memory utilization, gpu, optimization

Fraction of available GPU memory occupied by a workload.

gpu operator,device plugin,nvidia

NVIDIA GPU Operator manages GPU drivers and device plugins in Kubernetes; typically required for running GPU workloads on k8s.

gpu utilization,optimization

Percentage of GPU compute actually being used.

gpu, gpus, graphics card, accelerator, parallel processing, cuda, opencl, graphics processing unit, compute

# GPU (Graphics Processing Unit)

## Graphics Processing Unit

- **GPU (Graphics Processing Unit)**: A specialized processor designed for parallel processing tasks
- **GPUs**: Plural form of GPU
- **Graphics Card**: Physical hardware component containing a GPU, VRAM, and cooling system
- **Accelerator**: Specialized hardware that offloads computation from the CPU

## Architecture Fundamentals

### Core Components

- **Streaming Multiprocessors (SMs)**: Contain multiple CUDA cores for parallel execution
- **VRAM (Video RAM)**: High-bandwidth memory dedicated to the GPU
- **Memory Bus**: Data pathway between GPU and VRAM
- **PCIe Interface**: Connection to the motherboard/CPU

### Parallelism Model

GPUs excel at **SIMD** (Single Instruction, Multiple Data) operations:

$$
\text{Speedup} = \frac{T_{\text{sequential}}}{T_{\text{parallel}}} \leq \frac{1}{(1-P) + \frac{P}{N}}
$$

Where:

- $P$ = Parallelizable fraction of code
- $N$ = Number of parallel processors
- This is **Amdahl's Law**

## Performance Metrics

### FLOPS (Floating Point Operations Per Second)

$$
\text{FLOPS} = \text{Cores} \times \text{Clock Speed (Hz)} \times \text{FLOPs per cycle}
$$

Example calculation for a GPU with 10,000 cores at 2 GHz and 2 FLOPs per cycle:

$$
\text{FLOPS} = 10{,}000 \times 2 \times 10^9 \times 2 = 40 \text{ TFLOPS}
$$

### Memory Bandwidth

$$
\text{Bandwidth (GB/s)} = \frac{\text{Memory Clock (Hz)} \times \text{Bus Width (bits)} \times \text{Data Rate}}{8 \times 10^9}
$$

### Arithmetic Intensity

$$
\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Accessed}}
$$

The **Roofline Model** bounds performance:

$$
\text{Attainable FLOPS} = \min\left(\text{Peak FLOPS}, \text{Bandwidth} \times \text{Arithmetic Intensity}\right)
$$

## GPU Computing Concepts

### Thread Hierarchy (CUDA Model)

- **Thread**: Smallest unit of execution
  - Each thread has unique indices: `threadIdx.x`, `threadIdx.y`, `threadIdx.z`
- **Block**: Group of threads that can cooperate
  - Shared memory accessible within block
  - Maximum threads per block: typically 1024
- **Grid**: Collection of blocks
  - Total threads: $\text{Grid Size} \times \text{Block Size}$

### Memory Hierarchy

| Memory Type | Scope | Latency | Size |
|-------------|-------|---------|------|
| Registers | Thread | ~1 cycle | ~256 KB total |
| Shared Memory | Block | ~5 cycles | 48-164 KB |
| L1 Cache | SM | ~30 cycles | 128 KB |
| L2 Cache | Device | ~200 cycles | 4-50 MB |
| Global Memory (VRAM) | Device | ~400 cycles | 8-80 GB |

## Matrix Operations Key for AI/ML

### Matrix Multiplication Complexity

Standard matrix multiplication for $A_{m \times k} \cdot B_{k \times n}$:

$$
C_{ij} = \sum_{l=1}^{k} A_{il} \cdot B_{lj}
$$

- **Time Complexity**: $O(m \times n \times k)$
- **Naive**: $O(n^3)$ for square matrices
- **Strassen's Algorithm**: $O(n^{2.807})$

### Tensor Core Operations

Mixed-precision matrix multiply-accumulate:

$$
D = A \times B + C
$$

Where:

- $A, B$ are FP16 (16-bit floating point)
- $C, D$ are FP32 (32-bit floating point)

Throughput comparison:

- **FP32 CUDA Cores**: ~40 TFLOPS
- **FP16 Tensor Cores**: ~300+ TFLOPS
- **INT8 Tensor Cores**: ~600+ TFLOPS

## Power and Thermal Equations

### Thermal Design Power (TDP)

$$
P_{\text{dynamic}} = \alpha \cdot C \cdot V^2 \cdot f
$$

Where:

- $\alpha$ = Activity factor
- $C$ = Capacitance
- $V$ = Voltage
- $f$ = Frequency

### Temperature Relationship

$$
T_{\text{junction}} = T_{\text{ambient}} + (P \times R_{\theta})
$$

Where $R_{\theta}$ is thermal resistance in °C/W.

## Deep Learning Operations

### Convolution (CNN)

For a 2D convolution with input $I$, kernel $K$, output $O$:

$$
O(i,j) = \sum_{m}\sum_{n} I(i+m, j+n) \cdot K(m,n)
$$

Output dimensions:

$$
O_{\text{size}} = \left\lfloor \frac{I_{\text{size}} - K_{\text{size}} + 2P}{S} \right\rfloor + 1
$$

Where:

- $P$ = Padding
- $S$ = Stride

### Attention Mechanism (Transformers)

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Memory complexity: $O(n^2 \cdot d)$ where $n$ is sequence length.

## Major GPU Vendors

### NVIDIA

- **Gaming**: GeForce RTX series
- **Professional**: Quadro / RTX A-series
- **Data Center**: A100, H100, H200, B100, B200
- **CUDA Ecosystem**: Dominant in AI/ML

### AMD

- **Gaming**: Radeon RX series
- **Data Center**: Instinct MI series (MI300X)
- **ROCm**: Open-source GPU computing platform

### Intel

- **Consumer**: Arc A-series
- **Data Center**: Gaudi accelerators, Max series

## Code Example: CUDA Kernel

```cuda
// Vector addition kernel: each thread computes one element of C
__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

// Launch configuration (d_A, d_B, d_C are device pointers of length N)
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
```

## Formulas

| Metric | Formula |
|--------|---------|
| Thread Index (1D) | $\text{idx} = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x}$ |
| Memory Bandwidth | $BW = \frac{\text{Clock} \times \text{Width} \times 2}{8}$ GB/s |
| FLOPS | $\text{Cores} \times \text{Freq} \times \text{FMA}$ |
| Power Efficiency | $\frac{\text{TFLOPS}}{\text{Watts}}$ |
| Utilization | $\frac{\text{Active Warps}}{\text{Max Warps}} \times 100\%$ |

## Architecture

- NVIDIA CUDA
- AMD ROCm

gqa (general question answering),gqa,general question answering,evaluation

Visual question answering benchmark requiring compositional reasoning about real-world scenes.

graceful degradation, llm optimization

Graceful degradation maintains partial functionality when components fail.

graceful degradation,reliability

System maintains functionality when components fail.

graclus pooling, graph neural networks

Graclus pooling uses a deterministic graph coarsening algorithm for hierarchical graph classification.

gradcam, explainable ai

Visualizes the image regions most important to a model's predictions.

gradcam, interpretability

Gradient-weighted Class Activation Mapping highlights image regions important for CNN predictions.
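
The standard Grad-CAM formulation: channel weights are the global-average-pooled gradients of the class score with respect to the last convolutional feature maps, and the heatmap is the ReLU of their weighted sum:

$$
\alpha_k^c = \frac{1}{Z} \sum_{i}\sum_{j} \frac{\partial y^c}{\partial A_{ij}^k}, \qquad L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)
$$

where $A^k$ is the $k$-th feature map, $y^c$ the score for class $c$, and $Z$ the number of spatial positions.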

gradcam++, explainable ai

Improved version of GradCAM with better localization, including when multiple instances of a class appear.

gradient accumulation in vit, computer vision

Accumulate gradients to simulate larger batches when training Vision Transformers within memory limits.

gradient accumulation steps, optimization

Number of mini-batches whose gradients are accumulated before each optimizer step, multiplying the effective batch size.

gradient accumulation,effective batch

Gradient accumulation sums gradients over mini-batches before update. Simulates larger batch size when GPU memory is limited.

gradient accumulation,microbatch

Gradient accumulation sums gradients over multiple microbatches before update. Simulates larger batch size.

gradient accumulation,model training

Accumulate gradients over multiple mini-batches before updating weights.
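
A minimal PyTorch-style sketch of the accumulation pattern; the toy model, data, and the name `accum_steps` are illustrative, not from this glossary:

```python
import torch
from torch import nn

# Toy setup for illustration
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
micro_batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

accum_steps = 4  # effective batch size = accum_steps * micro-batch size

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y)
    # Divide so the summed gradient averages over the effective batch
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()        # one weight update per accum_steps micro-batches
        optimizer.zero_grad()
```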

gradient boosting for defect detection, data analysis

Boosted trees for identifying defects.

gradient boosting,xgboost,lgbm

Gradient boosting builds trees sequentially, each correcting the errors of the ensemble so far; popular implementations include XGBoost, LightGBM, and CatBoost.
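
A minimal example with scikit-learn's GradientBoostingClassifier on synthetic data (XGBoost, LightGBM, and CatBoost expose similar fit/predict APIs; the hyperparameters below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each shallow tree is fit to the residual errors of the ensemble so far
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```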

gradient bucketing, distributed training

Groups gradients into buckets so they can be synchronized in fewer, larger communication operations.

gradient centralization, optimization

Center gradients to improve training.

gradient clipping, training techniques

Gradient clipping bounds gradient norms, preventing training instability and limiting each example's influence in privacy-preserving training.

gradient clipping,max norm,stability

Gradient clipping limits gradient magnitude to prevent exploding gradients; a typical max norm is 1.0.

gradient clipping,model training

Cap gradient magnitude to prevent exploding gradients.
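
In PyTorch, clipping is a single call between backward() and the optimizer step; a minimal sketch with a toy model (names and values are illustrative):

```python
import torch
from torch import nn

model = nn.Linear(10, 1)                     # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 10), torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale gradients in place so their global L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```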

gradient compression for privacy, privacy

Compress gradients while preserving privacy.