Flame graphs are a hierarchical visualization of software profiling data in which bar width represents time spent (including children) and bar height represents call stack depth. Created by Brendan Gregg in 2011 (while at Joyent), they make CPU profiling data immediately interpretable, revealing which function calls consume the most time in a program.
What Is a Flame Graph?
- Definition: A visualization technique for stack trace profiling data where each horizontal bar represents a function in the call stack, bar width encodes the proportion of total sampling time spent in that function and all its callees, and vertical stacking shows the call hierarchy (caller below, callee above).
- Created By: Brendan Gregg (2011, while at Joyent; later popularized during his time at Netflix) — now a de facto standard for CPU profiling visualization in production systems.
- Sampling-Based: Flame graphs are built from statistical sampling — the profiler captures the current call stack thousands of times per second, then aggregates which functions appear most frequently.
- Key Insight: The widest frames along the top edge of the flame are the actual performance bottlenecks — they are where on-CPU time is being consumed directly, regardless of how deep in the call chain they sit.
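The sampling-and-aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular profiler's implementation: it takes captured call stacks (root first) and collapses them into the semicolon-delimited "folded" format consumed by flamegraph.pl, where each unique stack's count becomes its bar width. The sample data is hypothetical.

```python
from collections import Counter

def collapse_stacks(samples):
    """Aggregate raw stack samples (root-first lists of frames)
    into folded 'a;b;c count' lines, as used by flamegraph.pl."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Hypothetical samples captured while polling a training loop.
samples = [
    ["main", "train", "forward"],
    ["main", "train", "forward"],
    ["main", "train", "backward"],
    ["main", "load_batch"],
]
for line in collapse_stacks(samples):
    print(line)
# main;load_batch 1
# main;train;backward 1
# main;train;forward 2
```

Because identical stacks merge, a function sampled often — through any combination of its callers — ends up wide in the rendered graph.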
Why Flame Graphs Matter for AI Systems
- Python Overhead Discovery: A flame graph of a training loop often reveals that a large fraction of CPU time (sometimes 40% or more) is spent in Python interpreter overhead (object creation, reference counting) rather than actual computation — motivating torch.compile() or moving operations to CUDA.
- Data Pipeline Bottlenecks: Flame graphs of DataLoader workers reveal time spent in image decoding, augmentation transforms, and Python-to-tensor conversion — guiding optimizations like ffcv or NVIDIA DALI.
- Inference Service Profiling: CPU flame graphs of FastAPI/uvicorn inference servers reveal tokenization, request parsing, and JSON serialization overhead — often 20-30% of total latency for short-response models.
- Memory Allocation Hot Paths: Off-CPU flame graphs (time waiting for memory allocation) reveal excessive tensor creation in hot paths — suggesting pre-allocation or buffer reuse.
Reading a Flame Graph
X-Axis (Width): Represents time share — specifically, the fraction of total profiling samples in which that function appeared anywhere in the call stack. A bar spanning 60% of the graph width means that 60% of all CPU samples included that function. Note that the x-axis does not encode the passage of time: stacks are typically sorted alphabetically, so left-to-right order is not chronological.
Y-Axis (Height): Represents call stack depth — each function called the one stacked directly above it (root at the bottom, leaves at the top in the standard orientation). The "flame" shape arises because stacks diverge as they deepen: a frame's callees can together be at most as wide as the frame itself.
Color: Generally meaningless in standard flame graphs — colors are randomly assigned to make adjacent bars distinguishable. Some tools use color to encode: language (Python=blue, C=orange), library, or CPU vs off-CPU time.
The "Wide Plateau" Pattern: A wide bar whose callee bars above it are much narrower indicates that the function consumes significant time in itself — its self-time is the difference between its own width and the combined width of its callees. These plateaus are the actual bottleneck computations.
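The total-time versus self-time distinction can be made concrete with a short sketch over folded-format input (an illustration, not tied to any specific tool): total time counts every sample in which a function appears anywhere in the stack, while self-time counts only samples where it is the leaf.

```python
from collections import Counter

def frame_times(folded_lines):
    """Compute total and self sample counts per function from
    folded 'a;b;c count' lines."""
    total, self_time = Counter(), Counter()
    for line in folded_lines:
        stack_str, count = line.rsplit(" ", 1)
        frames = stack_str.split(";")
        n = int(count)
        for frame in set(frames):      # total: appears anywhere in the stack
            total[frame] += n
        self_time[frames[-1]] += n     # self: only the leaf frame
    return total, self_time

folded = ["main;train;forward 60", "main;train;backward 30", "main;load_batch 10"]
total, self_time = frame_times(folded)
print(total["train"], self_time["train"])      # 90 0 — wide bar, but no self-time
print(total["forward"], self_time["forward"])  # 60 60 — a wide plateau: the real bottleneck
```

Here `train` is wide but entirely covered by its callees, while `forward` is the plateau where the samples actually land.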
Flame Graph Types
| Type | What It Shows | Use Case |
|------|--------------|----------|
| CPU Flame Graph | On-CPU execution time | Find compute bottlenecks |
| Off-CPU Flame Graph | Time blocked (I/O, sleep, locks) | Find I/O and concurrency issues |
| Memory Flame Graph | Allocation call stacks | Find memory leak sources |
| CUDA Flame Graph | GPU kernel execution (Nsight) | Find GPU bottlenecks |
| Differential Flame Graph | Red=slower, blue=faster between two profiles | Verify optimization impact |
Generating Flame Graphs
For Python (py-spy):
py-spy record -o profile.svg --pid $(pgrep -f training_script.py)
Generates SVG flame graph of running Python process — zero code instrumentation required.
For PyTorch (built-in):
Use PyTorch Profiler with Chrome trace export — TensorBoard renders flame graph view automatically.
For Linux (perf):
perf record -F 99 -g -- python train.py
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > profile.svg
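stackcollapse-perf.pl folds perf's multi-line stack dumps into one folded line per unique stack before flamegraph.pl renders them. A simplified Python equivalent — assuming perf script's default layout of an unindented header line, indented leaf-first frames, and a blank line between samples — looks like:

```python
from collections import Counter

def stackcollapse(perf_script_output):
    """Fold simplified `perf script` output into 'a;b;c count' lines.
    perf prints frames leaf-first, so each stack is reversed."""
    counts, frames = Counter(), []
    for line in perf_script_output.splitlines():
        if not line.strip():                 # blank line ends one sample
            if frames:
                counts[";".join(reversed(frames))] += 1
            frames = []
        elif line.startswith(("\t", " ")):   # indented line: one stack frame
            frames.append(line.split()[1])   # keep just the symbol name
        # unindented lines are sample headers; ignored here
    if frames:                               # flush the final sample
        counts[";".join(reversed(frames))] += 1
    return [f"{s} {n}" for s, n in sorted(counts.items())]

sample = (
    "python 1234 cycles:\n"
    "\tdeadbeef forward (model.py)\n"
    "\tcafebabe train (train.py)\n"
    "\t12345678 main (train.py)\n"
    "\n"
    "python 1234 cycles:\n"
    "\tdeadbeef backward (model.py)\n"
    "\tcafebabe train (train.py)\n"
    "\t12345678 main (train.py)\n"
)
for line in stackcollapse(sample):
    print(line)
# main;train;backward 1
# main;train;forward 1
```

The real script handles many more perf output variants (inlined frames, kernel annotations, missing symbols); this sketch only shows the core fold-and-count idea.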
For CUDA (Nsight Systems):
nsys profile --trace=cuda,osrt python inference.py
The resulting report opens in the Nsight Systems GUI with a CUDA kernel timeline (timeline-based rather than a true flame graph).
Differential Flame Graphs
The most powerful optimization workflow:
1. Profile baseline → generate flame graph A.
2. Apply optimization.
3. Profile optimized → generate flame graph B.
4. Generate differential: functions that got slower appear red, faster appear blue.
5. Verify the optimization actually improved the intended bottleneck without creating new regressions.
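Step 4 can be sketched as a per-stack diff of two folded profiles (a minimal illustration; real tooling such as the difffolded.pl script in Brendan Gregg's FlameGraph repository works on the same principle, with normalization for differing total sample counts):

```python
def diff_folded(before, after):
    """Compare two folded profiles and report per-stack sample deltas.
    Positive delta = more samples after the change (slower, drawn red);
    negative delta = fewer samples (faster, drawn blue)."""
    def parse(lines):
        out = {}
        for line in lines:
            stack, count = line.rsplit(" ", 1)
            out[stack] = int(count)
        return out
    a, b = parse(before), parse(after)
    return {stack: b.get(stack, 0) - a.get(stack, 0) for stack in a.keys() | b.keys()}

# Hypothetical before/after profiles of a DataLoader optimization.
before = ["main;train;forward 60", "main;load_batch 40"]
after  = ["main;train;forward 58", "main;load_batch 10"]
deltas = diff_folded(before, after)
print(deltas["main;load_batch"])  # -30: the optimized path got faster (blue)
```

Scanning the deltas for unexpected positive values is exactly the "no new regressions" check in step 5.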
Flame graphs are a common language of performance profiling: their intuitive visual encoding of time-weighted call stacks makes bottleneck identification accessible to any engineer, turning raw profiling data from overwhelming tables of numbers into actionable visual insight that guides AI system optimization.