Performance optimization

Performance optimization for ML systems is the systematic practice of improving speed, efficiency, and resource utilization. It starts with profiling to identify bottlenecks, then applies targeted fixes such as vectorization, batching, caching, and GPU tuning. The payoff is faster training, lower inference latency, and reduced cost.

Why Performance Matters

Golden Rule: Profile First

Never Optimize Without Data:

# Python profiling
import cProfile
cProfile.run("main()", sort="cumtime")

# Line-by-line profiling
# pip install line_profiler
# kernprof injects the @profile decorator at runtime, so no import is needed
@profile
def my_function():
    # code here
    pass

# Run: kernprof -l -v script.py
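When a one-line cProfile.run call is not enough, the profiler can also be driven from code and its report trimmed to the most expensive entries. The sketch below is a minimal illustration under assumed names: slow_sum, main, and profile_top are hypothetical helpers, not part of any library.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Deliberately naive loop so it stands out in the profile
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    return [slow_sum(50_000) for _ in range(5)]


def profile_top(func, limit=5):
    """Run func under cProfile and return a text report of the top entries."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time, the same ordering as sort="cumtime"
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


report = profile_top(main)
print(report)
```

Capturing the report as a string makes it easy to log or diff between runs.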

Memory Profiling:

# pip install memory_profiler
from memory_profiler import profile

@profile
def my_function():
    large_list = [x for x in range(1000000)]
    return sum(large_list)
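The standard library's tracemalloc gives a similar view without an extra install. The following sketch compares peak allocation for a list-based sum and a generator-based sum; peak_memory and the two sum functions are hypothetical helpers written for this illustration.

```python
import tracemalloc


def sum_with_list(n):
    # Materializes every element before summing
    return sum([x for x in range(n)])


def sum_with_generator(n):
    # Produces one element at a time
    return sum(x for x in range(n))


def peak_memory(func, *args):
    """Return (result, peak bytes traced) while running func(*args)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


list_result, list_peak = peak_memory(sum_with_list, 200_000)
gen_result, gen_peak = peak_memory(sum_with_generator, 200_000)
print(f"list peak: {list_peak / 1e6:.2f} MB, generator peak: {gen_peak / 1e6:.2f} MB")
```

Both functions return the same value, so the difference in peak memory is attributable to the intermediate list alone.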

GPU Profiling:

# NVIDIA tools
nvidia-smi dmon -s u  # Utilization over time
nsys profile python train.py  # Detailed trace

Common Bottlenecks & Solutions

Slow Loops:

# ❌ Slow: Python loop
result = []
for x in data:
    result.append(x * 2)

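# ✅ Fast: Vectorized with NumPy
# (data must be an ndarray; multiplying a plain list by 2 repeats the list)
import numpy as np
result = np.asarray(data) * 2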

# ✅ Fast: List comprehension (for non-numeric)
result = [x * 2 for x in data]

Memory Issues:

# ❌ Bad: Load entire file
with open("huge_file.csv") as f:
    data = f.readlines()  # All in memory

# ✅ Good: Generator/streaming
import itertools

def read_chunks(file_path, chunk_size=1000):
    with open(file_path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_size))
            if not chunk:
                break
            yield chunk
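As a usage sketch, the chunked reader can feed an aggregation that never holds more than one chunk in memory. The temporary demo file and the count_rows helper are assumptions made for this example.

```python
import itertools
import os
import tempfile


def read_chunks(file_path, chunk_size=1000):
    """Yield lists of at most chunk_size lines without loading the whole file."""
    with open(file_path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_size))
            if not chunk:
                break
            yield chunk


def count_rows(file_path, chunk_size=1000):
    # Only one chunk is alive at a time
    return sum(len(chunk) for chunk in read_chunks(file_path, chunk_size))


# Hypothetical demo file: 2,500 rows written to a temporary location
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    for i in range(2500):
        tmp.write(f"{i},{i * 2}\n")
    demo_path = tmp.name

chunk_sizes = [len(c) for c in read_chunks(demo_path, chunk_size=1000)]
total_rows = count_rows(demo_path)
os.remove(demo_path)
print(chunk_sizes, total_rows)  # [1000, 1000, 500] 2500
```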

I/O Bottlenecks:

# ❌ Sequential requests
results = []
for url in urls:
    results.append(requests.get(url))

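# ✅ Concurrent requests
import asyncio
import aiohttp

async def fetch(session, url):
    # Read the body while the session is still open
    async with session.get(url) as response:
        return await response.text()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)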
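Firing every request at once can overwhelm a server or hit rate limits, so concurrency is often capped. The sketch below uses only the standard library and simulates network latency with asyncio.sleep; fake_fetch and fetch_all_bounded are hypothetical names, and a real client call would replace the sleep.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.05):
    # Stand-in for a network call; asyncio.sleep simulates the I/O wait
    await asyncio.sleep(delay)
    return f"response from {url}"


async def fetch_all_bounded(urls, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        # At most max_concurrency fetches run at the same time
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(u) for u in urls))


urls = [f"https://example.com/item/{i}" for i in range(10)]
start = time.perf_counter()
results = asyncio.run(fetch_all_bounded(urls))
elapsed = time.perf_counter() - start
print(f"{len(results)} responses in {elapsed:.2f}s")
```

With ten simulated requests of 50 ms each and a cap of five, the run should take roughly two rounds of waiting rather than ten.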

LLM-Specific Optimizations

Quantization:

# Load in 4-bit to cut memory use (any speedup depends on hardware and kernels)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=bnb_config
)

Batching:

# ❌ Process one at a time
for prompt in prompts:
    response = llm.generate(prompt)

# ✅ Batch process
responses = llm.generate(prompts, batch_size=16)
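Not every client accepts a batch_size argument, in which case the prompts can be sliced into batches by hand. This is a minimal sketch: batched and generate_in_batches are hypothetical helpers, and fake_generate stands in for whatever batched model call is actually available.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_generate(batch):
    # Hypothetical stand-in for a batched model call
    return [f"echo: {prompt}" for prompt in batch]


def generate_in_batches(prompts, batch_size=16):
    responses = []
    for batch in batched(prompts, batch_size):
        # One model call per batch instead of one per prompt
        responses.extend(fake_generate(batch))
    return responses


prompts = [f"prompt {i}" for i in range(40)]
batch_lengths = [len(b) for b in batched(prompts, 16)]
responses = generate_in_batches(prompts)
print(batch_lengths, len(responses))  # [16, 16, 8] 40
```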

Response Caching:

from functools import lru_cache

@lru_cache(maxsize=10000)
def cached_llm_call(prompt):
    # Key on the prompt itself: a SHA-256 digest cannot be reversed
    return llm.generate(prompt)

def call_with_cache(prompt):
    # Only safe when generation is deterministic (e.g. temperature 0)
    return cached_llm_call(prompt)
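If long prompts should not be stored as cache keys, a digest can serve as the key as long as the original prompt is still what gets sent to the model. The sketch below assumes deterministic generation; PromptCache and fake_generate are hypothetical names used only for illustration.

```python
import hashlib


class PromptCache:
    """In-memory response cache keyed by the SHA-256 digest of the prompt."""

    def __init__(self, generate_fn):
        self._generate = generate_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, prompt):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        # The prompt goes to the model; the digest is only the lookup key
        response = self._generate(prompt)
        self._store[key] = response
        return response


def fake_generate(prompt):
    # Hypothetical stand-in for an expensive model call
    return prompt.upper()


cached = PromptCache(fake_generate)
first = cached("hello")
second = cached("hello")  # served from the cache
third = cached("world")
print(cached.hits, cached.misses)  # 1 2
```

Tracking hits and misses makes it possible to check whether the cache is earning its memory.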

Streaming:

# Stream for perceived speed
for chunk in llm.generate(prompt, stream=True):
    print(chunk, end="", flush=True)

GPU Optimization

Maximize Utilization:

# Check current utilization
nvidia-smi --query-gpu=utilization.gpu --format=csv

# Increase batch size until GPU is ~80-90% utilized
# Too low utilization = wasted GPU capacity

# Use mixed precision
with torch.autocast("cuda"):
    output = model(input)

Memory Management:

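# Delete unused tensors first so their memory becomes reclaimable
del large_tensor

# Then release cached, unreferenced blocks when needed
torch.cuda.empty_cache()

# Use gradient checkpointing (trades extra compute for lower activation memory)
model.gradient_checkpointing_enable()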

Data Loading:

# Use multiple workers for data loading
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,  # Parallel loading
    pin_memory=True,  # Faster GPU transfer
    prefetch_factor=2  # Batches preloaded per worker
)

Optimization Checklist

□ Profile before optimizing
□ Identify actual bottleneck (CPU, GPU, I/O, memory)
□ Apply targeted fix
□ Measure improvement
□ Check for regressions
□ Document changes
□ Repeat until goals met
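The "measure improvement" and "check for regressions" steps can be sketched with timeit from the standard library. The two functions and the best_time helper below are hypothetical examples; actual timings vary by machine, so no specific speedup is implied.

```python
import timeit


def double_with_loop(data):
    result = []
    for x in data:
        result.append(x * 2)
    return result


def double_with_comprehension(data):
    return [x * 2 for x in data]


def best_time(func, data, repeat=5, number=20):
    """Best-of-repeat wall time in seconds for `number` calls of func(data)."""
    timer = timeit.Timer(lambda: func(data))
    return min(timer.repeat(repeat=repeat, number=number))


data = list(range(10_000))

# Regression check first: the optimized version must produce the same output
assert double_with_loop(data) == double_with_comprehension(data)

baseline = best_time(double_with_loop, data)
candidate = best_time(double_with_comprehension, data)
print(f"loop: {baseline:.4f}s  comprehension: {candidate:.4f}s  "
      f"speedup: {baseline / candidate:.2f}x")
```

Taking the minimum over several repeats reduces noise from other processes on the machine.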

Tools Summary

Purpose         | Tool
----------------|---------------------------
Python profile  | cProfile, line_profiler
Memory profile  | memory_profiler, tracemalloc
GPU profile     | nvidia-smi, nsys, PyTorch profiler
Web/API         | locust, k6
Benchmarking    | pytest-benchmark, timeit

Performance optimization is a systematic discipline, not guesswork. Measuring before optimizing keeps effort focused on actual bottlenecks, which yields real improvements instead of premature optimization that adds complexity without benefit.

