Flame Graphs

Flame graphs are a hierarchical visualization of software profiling data in which the width of each bar represents time spent in a function (including its children) and the vertical position of the bar represents call stack depth. Created by Brendan Gregg to make CPU profiling data quick to interpret, they show which call paths consume the most time in a program.

What Is a Flame Graph?

Why Flame Graphs Matter for AI Systems

Reading a Flame Graph

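X-Axis (Width): Represents the share of profiling samples, not the passage of time. A bar's width is the fraction of all samples in which that frame appeared in the call stack, so a bar spanning 60% of the graph width means 60% of all CPU samples included that function. Left-to-right order is typically alphabetical rather than chronological, and a function reached through different call paths appears as several separate bars.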
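The sample-fraction reading of width can be sketched in Python. This is a minimal illustration, assuming the profile is already in the "folded" text format used by common flame graph tooling (one line per unique stack, frames separated by semicolons, followed by a sample count). The function name `inclusive_fractions` and the sample data are hypothetical.

```python
from collections import Counter


def inclusive_fractions(folded_lines):
    """Return {function: fraction of all samples whose stack contains it}."""
    total = 0
    inclusive = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # Folded format: "frame1;frame2;frame3 <count>"
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        # set() so that a recursive function is counted once per sample
        for func in set(stack.split(";")):
            inclusive[func] += n
    if total == 0:
        return {}
    return {func: c / total for func, c in inclusive.items()}


# Hypothetical profile: 100 samples across three unique stacks
samples = [
    "main;train_step;forward 60",
    "main;train_step;backward 30",
    "main;load_batch 10",
]
fractions = inclusive_fractions(samples)
for func, frac in sorted(fractions.items(), key=lambda kv: -kv[1]):
    print(f"{func:12s} {frac:.0%}")
```

In this made-up profile `forward` is on the stack in 60 of 100 samples, so its bar would span 60% of the graph width.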

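Y-Axis (Height): Represents call stack depth. Each bar was called by the bar directly below it, so the bottom bar is the entry point and the topmost bar in each column is the function that was executing when the sample was taken. The "flame" shape arises because a callee can never be wider than its caller, so stacks narrow as they get deeper.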

Color: Usually carries no meaning in standard flame graphs; hues are assigned at random so that adjacent bars are distinguishable. Some tools do encode information in color, for example language (such as Python in blue and C in orange), library, or on-CPU versus off-CPU time.

The "Wide Tower" Pattern: A wide bar with much narrower bars (or none) stacked above it spends significant time in its own code. The gap between its own width and the combined width of its callee bars is its self-time, which is where the actual bottleneck computation happens.
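The self-time arithmetic (a bar's width minus the widths of its callee bars) can be worked through in a short Python sketch. It assumes folded-stack input (semicolon-separated frames followed by a sample count); `frame_widths`, `self_time`, and the profile data are hypothetical names and values chosen for illustration.

```python
from collections import defaultdict


def frame_widths(folded_lines):
    """Map each bar (identified by its full stack path) to its sample count."""
    widths = defaultdict(int)
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        frames = tuple(stack.split(";"))
        # Every prefix of the stack is a bar that includes these samples
        for depth in range(1, len(frames) + 1):
            widths[frames[:depth]] += int(count)
    return dict(widths)


def self_time(widths):
    """Self samples of a bar = its width minus the widths of its direct callees."""
    self_samples = dict(widths)
    for path, width in widths.items():
        if len(path) > 1:
            self_samples[path[:-1]] -= width
    return self_samples


# Hypothetical profile: train_step does 25 samples of work in its own body
profile = [
    "main;train_step 25",
    "main;train_step;matmul 50",
    "main;train_step;relu 15",
    "main;load_batch 10",
]
widths = frame_widths(profile)
self_samples = self_time(widths)
for path in sorted(widths):
    print(f"{';'.join(path):28s} width={widths[path]:3d} self={self_samples[path]:3d}")
```

Here the `train_step` bar is 90 samples wide while its callees cover only 65, so 25 samples are self-time. `main` is the widest bar but has no self-time at all, which is why width alone does not identify the bottleneck.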

Flame Graph Types

| Type | What It Shows | Use Case |
|------|---------------|----------|
| CPU Flame Graph | On-CPU execution time | Find compute bottlenecks |
| Off-CPU Flame Graph | Time blocked (I/O, sleep, locks) | Find I/O and concurrency issues |
| Memory Flame Graph | Allocation call stacks | Find memory leak sources |
| CUDA Flame Graph | GPU kernel execution (Nsight) | Find GPU bottlenecks |
| Differential Flame Graph | Red = slower, blue = faster between two profiles | Verify optimization impact |

Generating Flame Graphs

For Python (py-spy):

    py-spy record -o profile.svg --pid $(pgrep -f training_script.py)

Generates an SVG flame graph of a running Python process, with no code instrumentation required. The pgrep pattern must match exactly one process, because --pid takes a single PID.

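For PyTorch (built-in): Use PyTorch Profiler (torch.profiler) and export a Chrome trace with export_chrome_trace(). The trace can be viewed in TensorBoard through the PyTorch Profiler plugin or in any Chrome-trace-compatible viewer. Note that this trace view is a timeline rather than a sample-merged flame graph.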

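For Linux (perf):

    perf record -F 99 -g -- python train.py
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > profile.svg

The stackcollapse-perf.pl and flamegraph.pl scripts come from Brendan Gregg's FlameGraph repository. The first folds the perf output into one line per unique stack, which is the input that flamegraph.pl expects.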

For CUDA (Nsight Systems):

    nsys profile --trace=cuda,osrt python inference.py

Writes a report file that opens in the Nsight Systems GUI, which shows a CUDA kernel timeline (similar to a flame graph but timeline-based).

Differential Flame Graphs

The most powerful optimization workflow:

1. Profile the baseline and generate flame graph A.
2. Apply the optimization.
3. Profile the optimized version and generate flame graph B.
4. Generate the differential: functions that got slower appear red, faster appear blue.
5. Verify that the optimization actually improved the intended bottleneck without creating new regressions.
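The comparison in step 4 can be sketched in Python as a per-stack difference between two profiles. This is an illustrative sketch, not how any particular tool implements it. It assumes both profiles are in folded-stack format and were captured on the same workload at the same sampling rate, so raw sample counts are comparable. The names `parse_folded` and `diff_profiles` and both profiles are hypothetical.

```python
from collections import Counter


def parse_folded(lines):
    """Parse folded-stack lines into {stack: sample count}."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        counts[stack] += int(count)
    return counts


def diff_profiles(before_lines, after_lines):
    """Return {stack: after - before}; positive means more samples (slower)."""
    before = parse_folded(before_lines)
    after = parse_folded(after_lines)
    # Counter returns 0 for stacks that appear in only one profile
    return {stack: after[stack] - before[stack]
            for stack in before.keys() | after.keys()}


# Hypothetical profiles before and after optimizing matmul
baseline = [
    "main;train_step;matmul 60",
    "main;load_batch 40",
]
optimized = [
    "main;train_step;matmul 35",
    "main;train_step;cast 5",
    "main;load_batch 42",
]
deltas = diff_profiles(baseline, optimized)
for stack, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    if delta < 0:
        label = "faster (blue)"
    elif delta > 0:
        label = "slower (red)"
    else:
        label = "unchanged"
    print(f"{delta:+4d}  {label:13s}  {stack}")
```

In this made-up example the targeted stack drops by 25 samples, while the new `cast` stack shows up as a small regression introduced by the change, which is the kind of side effect step 5 is meant to catch.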

Flame graphs have become a common language for performance profiling. By encoding time-weighted call stacks visually, they turn large tables of raw profiling numbers into a picture in which bottlenecks stand out, which makes them a practical guide for optimizing AI systems.
