Performance optimization for ML systems covers systematic approaches to improving speed, efficiency, and resource utilization: profiling to identify bottlenecks, then applying targeted optimizations such as vectorization, batching, caching, and GPU tuning. The payoff is faster training, lower inference latency, and reduced costs.
Why Performance Matters
- User Experience: Faster responses improve satisfaction.
- Cost: Efficient code uses fewer resources.
- Scale: Optimization enables handling more load.
- Iteration Speed: Faster training means more experiments.
- Competitive: Speed is often a differentiator.
Golden Rule: Profile First
Never Optimize Without Data:
# Python profiling
import cProfile
cProfile.run("main()", sort="cumtime")
# Line-by-line profiling
# pip install line_profiler
@profile  # injected by kernprof at run time; no import needed
def my_function():
    # code here
    pass
# Run: kernprof -l -v script.py
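When a whole-program run is too noisy, a single call can be profiled and the report captured as text. The following is a minimal sketch using only the standard library; `slow_sum` and `profile_call` are hypothetical names chosen for illustration.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive workload so the profiler has something to report."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top=5):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(slow_sum, 100_000)
print(report)
```

Capturing the report as a string makes it easy to log or compare between runs.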
Memory Profiling:
# pip install memory_profiler
from memory_profiler import profile
@profile
def my_function():
    large_list = [x for x in range(1000000)]
    return sum(large_list)
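The standard library's `tracemalloc` can give a similar picture without installing anything. This sketch compares peak allocation for a list-based sum and a generator-based sum; the helper name `peak_memory` and the workload size are assumptions made for the example.

```python
import tracemalloc


def sum_via_list(n):
    # Materializes every element before summing.
    return sum([x for x in range(n)])


def sum_via_generator(n):
    # Produces one element at a time; nothing is stored.
    return sum(x for x in range(n))


def peak_memory(func, *args):
    """Return (result, peak bytes allocated by Python while func ran)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


list_total, list_peak = peak_memory(sum_via_list, 200_000)
gen_total, gen_peak = peak_memory(sum_via_generator, 200_000)
print(f"list peak: {list_peak / 1e6:.2f} MB, generator peak: {gen_peak / 1e6:.2f} MB")
```

Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries may not appear.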
GPU Profiling:
# NVIDIA tools
nvidia-smi dmon -s u # Utilization over time
nsys profile python train.py # Detailed trace
Common Bottlenecks & Solutions
Slow Loops:
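# ❌ Slow: Python loop
result = []
for x in data:
    result.append(x * 2)
# ✅ Fast: Vectorized with NumPy
# (on a plain list, data * 2 would repeat the list instead of doubling values)
import numpy as np
result = np.asarray(data) * 2
# ✅ Faster than the loop: List comprehension (for non-numeric data)
result = [x * 2 for x in data]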
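Whether a rewrite actually helps is best checked by timing it. A minimal sketch with `timeit`, assuming a list of 10,000 integers as the workload; the function names are illustrative, and absolute timings will differ by machine.

```python
import timeit

data = list(range(10_000))


def with_loop(values):
    result = []
    for x in values:
        result.append(x * 2)
    return result


def with_comprehension(values):
    return [x * 2 for x in values]


# Confirm both versions agree before comparing their speed.
assert with_loop(data) == with_comprehension(data)

# repeat() returns one total per run; the minimum is the least noisy estimate.
loop_time = min(timeit.repeat(lambda: with_loop(data), number=50, repeat=3))
comp_time = min(timeit.repeat(lambda: with_comprehension(data), number=50, repeat=3))
print(f"loop: {loop_time:.4f}s  comprehension: {comp_time:.4f}s")
```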
Memory Issues:
# ❌ Bad: Load entire file
with open("huge_file.csv") as f:
    data = f.readlines()  # All in memory
# ✅ Good: Generator/streaming
import itertools

def read_chunks(file_path, chunk_size=1000):
    with open(file_path) as f:
        while True:
            # islice pulls at most chunk_size lines from the open file
            chunk = list(itertools.islice(f, chunk_size))
            if not chunk:
                break
            yield chunk
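A short usage sketch for the chunked reader, assuming a one-column file of integers with a header row. The temporary file stands in for a real large CSV, and the aggregation step is only an example of processing each chunk without keeping it.

```python
import itertools
import os
import tempfile


def read_chunks(file_path, chunk_size=1000):
    """Yield lists of at most chunk_size lines, never the whole file."""
    with open(file_path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_size))
            if not chunk:
                break
            yield chunk


# Build a small sample file (a stand-in for a real large CSV).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("value\n")
    tmp.writelines(f"{i}\n" for i in range(2500))
    sample_path = tmp.name

total = 0
chunk_sizes = []
for chunk in read_chunks(sample_path, chunk_size=1000):
    chunk_sizes.append(len(chunk))
    # Skip the header line, then aggregate without keeping rows around.
    total += sum(int(line) for line in chunk if line.strip().isdigit())

os.remove(sample_path)
print(chunk_sizes, total)
```

Only one chunk is held in memory at a time, so peak usage is bounded by `chunk_size` rather than by file size.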
I/O Bottlenecks:
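# ❌ Sequential requests
import requests
results = []
for url in urls:
    results.append(requests.get(url))
# ✅ Concurrent requests
import asyncio
import aiohttp

async def fetch(session, url):
    # Read the body while the session is still open
    async with session.get(url) as response:
        return await response.text()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)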
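Unbounded concurrency can overwhelm a server or hit rate limits, so it is common to cap the number of requests in flight. The sketch below shows one way to do that with `asyncio.Semaphore`; `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network call, and the limit of 5 is an arbitrary assumption.

```python
import asyncio


async def fake_fetch(url, delay=0.05):
    """Hypothetical stand-in for an HTTP call: waits, then returns a body."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def fetch_all(urls, max_concurrency=5):
    """Fetch all URLs concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather() preserves input order in its result list.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/{i}" for i in range(10)]
results = asyncio.run(fetch_all(urls))
print(results[0], len(results))
```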
LLM-Specific Optimizations
Quantization:
# Load in 4-bit to cut memory use; speed gains depend on hardware and kernels
# (requires the bitsandbytes package)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=bnb_config
)
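The memory effect of quantization can be estimated with simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. This is a rough sketch for weights only; the 7-billion-parameter figure is a hypothetical example, and real usage is higher because of quantization metadata, activations, and the KV cache.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10**9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9


params = 7_000_000_000  # hypothetical 7B-parameter model
for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(params, bits):.1f} GB")
```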
Batching:
# ❌ Process one at a time
for prompt in prompts:
    response = llm.generate(prompt)
# ✅ Batch process
responses = llm.generate(prompts, batch_size=16)
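If the client library does not accept a `batch_size` argument, the prompts can be split manually. A minimal sketch, where `batched` is an illustrative helper and `fake_generate` is a hypothetical stand-in for one batched model call:

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_generate(batch):
    """Hypothetical stand-in for one batched model call."""
    return [f"response to: {prompt}" for prompt in batch]


prompts = [f"prompt {i}" for i in range(40)]
responses = []
calls = 0
for batch in batched(prompts, batch_size=16):
    responses.extend(fake_generate(batch))
    calls += 1

print(calls, len(responses))
```

Forty prompts at a batch size of 16 need three calls instead of forty; the last batch is simply smaller.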
Response Caching:
from functools import lru_cache

# lru_cache keys on the prompt string itself, so no manual hashing is needed
# (a SHA-256 digest cannot be reversed to recover the prompt)
@lru_cache(maxsize=10000)
def cached_llm_call(prompt):
    return llm.generate(prompt)

def call_with_cache(prompt):
    return cached_llm_call(prompt)
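Cache effectiveness is worth measuring, since a cache with a low hit rate only adds memory overhead. The sketch below uses `cache_info()` from `functools.lru_cache`, keyed directly on the prompt string; `fake_generate` is a hypothetical stand-in for an expensive model call. Caching like this only makes sense when repeated prompts should return identical responses.

```python
from functools import lru_cache

model_calls = {"count": 0}


def fake_generate(prompt):
    """Hypothetical stand-in for an expensive model call."""
    model_calls["count"] += 1
    return prompt.upper()


@lru_cache(maxsize=1024)
def cached_generate(prompt):
    return fake_generate(prompt)


for prompt in ["hello", "world", "hello", "hello"]:
    cached_generate(prompt)

info = cached_generate.cache_info()
print(f"model calls: {model_calls['count']}, hits: {info.hits}, misses: {info.misses}")
```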
Streaming:
# Stream for perceived speed
for chunk in llm.generate(prompt, stream=True):
    print(chunk, end="", flush=True)
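Streaming does not shorten total generation time; it shortens the wait before the first output appears. A small sketch that measures both, using a hypothetical `fake_stream` generator that sleeps between words in place of a real model:

```python
import time


def fake_stream(text, delay=0.01):
    """Hypothetical stand-in for a streaming model: yields one word at a time."""
    for word in text.split():
        time.sleep(delay)
        yield word + " "


start = time.perf_counter()
first_chunk_at = None
chunks = []
for chunk in fake_stream("streaming shows output before generation finishes"):
    if first_chunk_at is None:
        # Time to first chunk is what the user perceives as responsiveness.
        first_chunk_at = time.perf_counter() - start
    chunks.append(chunk)
total = time.perf_counter() - start

print(f"first chunk after {first_chunk_at:.3f}s, full response after {total:.3f}s")
```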
GPU Optimization
Maximize Utilization:
# Check current utilization
nvidia-smi --query-gpu=utilization.gpu --format=csv
# Increase batch size until GPU is ~80-90% utilized
# Too low utilization = wasted GPU capacity
# Use mixed precision
with torch.autocast("cuda"):
    output = model(input)
Memory Management:
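# Delete unused tensors first so their memory can be released
del large_tensor
# Then clear the cache when needed (returns unused cached blocks to the driver)
torch.cuda.empty_cache()
# Use gradient checkpointing (trades extra compute for lower memory)
model.gradient_checkpointing_enable()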
Data Loading:
# Use multiple workers for data loading
from torch.utils.data import DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,  # Parallel loading
    pin_memory=True,  # Faster GPU transfer
    prefetch_factor=2  # Batches preloaded per worker
)
Optimization Checklist
□ Profile before optimizing
□ Identify actual bottleneck (CPU, GPU, I/O, memory)
□ Apply targeted fix
□ Measure improvement
□ Check for regressions
□ Document changes
□ Repeat until goals met
Tools Summary
Purpose | Tool
----------------|---------------------------
Python profile | cProfile, line_profiler
Memory profile | memory_profiler, tracemalloc
GPU profile | nvidia-smi, nsys, PyTorch profiler
Web/API | locust, k6
Benchmarking | pytest-benchmark, timeit
Performance optimization is a systematic discipline, not guesswork. Measuring before optimizing focuses effort on actual bottlenecks and avoids premature optimization that adds complexity without benefit.