GPU FFT and Signal Processing

Keywords: gpu fft signal processing, cuda fft optimization, cufft performance tuning, fast fourier transform gpu, frequency domain gpu

GPU FFT and Signal Processing is the parallel implementation of the Fast Fourier Transform and related signal processing operations on GPUs. NVIDIA's cuFFT library delivers 500-2000 GB/s of throughput for 1D/2D/3D transforms (60-90% of theoretical peak bandwidth) through optimized radix-2/4/8 algorithms, batched processing that amortizes overhead across many transforms at 90-95% efficiency, and specialized kernels for power-of-2 sizes. This makes GPU FFT 10-50Ɨ faster than CPU implementations and essential for audio processing, image filtering, scientific computing, and deep learning, where FFT operations can consume 20-80% of total compute time. Proper optimization of batch sizing, memory layout (interleaved vs planar), precision selection (FP32 vs FP16), and workspace tuning determines whether an application achieves 200 GB/s or 2000 GB/s of throughput.

cuFFT Fundamentals:
- 1D FFT: cufftExecC2C() for complex-to-complex; 500-1500 GB/s; most common; power-of-2 sizes optimal
- 2D FFT: cufftExecC2C() with 2D plan; 800-2000 GB/s; image processing; row-column decomposition
- 3D FFT: cufftExecC2C() with 3D plan; 1000-2500 GB/s; volumetric data; scientific computing
- Real FFT: cufftExecR2C(), cufftExecC2R(); 2Ɨ memory savings; exploits Hermitian symmetry; 400-1200 GB/s

FFT Algorithms:
- Cooley-Tukey: radix-2/4/8 algorithms; power-of-2 sizes optimal; log2(N) stages; most common
- Bluestein: arbitrary sizes; slower than Cooley-Tukey; 50-70% performance; use for non-power-of-2
- Mixed Radix: combines radix-2/3/5/7; good for composite sizes; 70-90% of radix-2 performance
- Stockham: auto-sort algorithm; no bit-reversal; slightly slower but simpler; 80-95% of Cooley-Tukey

Batched FFT:
- Concept: process multiple independent FFTs; amortizes overhead; 90-95% efficiency vs single FFT
- API: cufftPlanMany() specifies batch count; cufftExecC2C() processes all; single kernel launch
- Performance: 800-2000 GB/s for large batches (>100); 90-95% efficiency; critical for throughput
- Use Cases: audio processing (multiple channels), image processing (multiple images), deep learning (batch processing)

Memory Layout:
- Interleaved: real and imaginary parts interleaved; [r0, i0, r1, i1, ...]; default; easier to use
- Planar: real and imaginary parts separate; [r0, r1, ...], [i0, i1, ...]; 10-30% faster for some sizes; note cuFFT's native cufftComplex type is interleaved, so planar data needs conversion or load/store callbacks
- In-Place: input and output same buffer; saves memory; slightly slower (5-10%); useful for large transforms
- Out-of-Place: separate input and output; faster; requires 2Ɨ memory; preferred for performance

Size Optimization:
- Power-of-2: optimal performance; 500-2000 GB/s; radix-2 algorithm; always use when possible
- Composite: product of small primes (2, 3, 5, 7); 70-90% of power-of-2; mixed radix algorithm
- Prime: worst performance; 30-60% of power-of-2; Bluestein algorithm; pad to composite if possible
- Padding: pad to next power-of-2 or composite; 2-5Ɨ speedup; acceptable overhead for small padding

Precision:
- FP32: standard precision; 500-1500 GB/s; sufficient for most applications; default choice
- FP64: double precision; 250-750 GB/s; 2Ɨ slower; required for high-accuracy scientific computing
- FP16: half precision; 1000-3000 GB/s; 2Ɨ faster; acceptable for some applications; limited accuracy
- Mixed Precision: FP16 compute, FP32 accumulation; 800-2000 GB/s; good balance; emerging approach

Workspace Tuning:
- Auto Allocation: cuFFT allocates workspace automatically at plan creation; convenient but may not be optimal
- Manual Allocation: disable auto-allocation with cufftSetAutoAllocation(), then attach a buffer via cufftSetWorkArea(); 10-30% speedup with a larger workspace; typically 10-100 MB
- Size Query: cufftGetSize() reports the required workspace for a plan; allocate once and reuse across executions; eliminates per-plan allocation overhead
- Trade-off: larger workspace enables faster algorithms; diminishing returns beyond 100MB

2D FFT Optimization:
- Row-Column: decompose into 1D FFTs; process rows then columns; 800-2000 GB/s; standard approach
- Transpose: transpose between row and column FFTs; coalesced access; 10-30% speedup
- Batching: batch row FFTs, batch column FFTs; 90-95% efficiency; critical for performance
- Memory Layout: row-major vs column-major; affects coalescing; 10-30% performance difference

3D FFT Optimization:
- Three-Pass: X-direction, Y-direction, Z-direction; 1000-2500 GB/s; standard approach
- Transpose: transpose between passes; coalesced access; 10-30% speedup
- Batching: batch each direction; 90-95% efficiency; critical for large volumes
- Memory: 3D FFT memory-intensive; 6Ɨ data movement; bandwidth-limited; optimize layout

Convolution:
- FFT-Based: FFT(A) * FFT(B), then IFFT; O(N log N) vs O(N²) for direct; 10-100Ɨ faster for large N
- Overlap-Add: for long signals; split into blocks; overlap and add; 800-1500 GB/s
- Overlap-Save: alternative to overlap-add; discard invalid samples; 800-1500 GB/s
- Threshold: FFT faster than direct for N > 1000-10000; depends on kernel size; profile to determine

Filtering:
- Frequency Domain: FFT, multiply by filter, IFFT; 500-1500 GB/s; efficient for large filters
- Time Domain: direct convolution; 200-800 GB/s; efficient for small filters (<100 taps)
- Hybrid: time domain for small, frequency domain for large; 500-1500 GB/s; optimal approach
- Real-Time: streaming FFT with overlap-add; 800-1500 GB/s; low latency; audio processing

Spectral Analysis:
- Power Spectrum: |FFT(x)|²; 500-1500 GB/s; frequency content; audio, vibration analysis
- Spectrogram: short-time FFT; 800-2000 GB/s; time-frequency representation; speech, audio
- Cross-Correlation: FFT-based; 500-1500 GB/s; signal alignment; radar, sonar
- Autocorrelation: FFT-based; 500-1500 GB/s; periodicity detection; signal processing

Performance Profiling:
- Nsight Compute: profiles cuFFT kernels; shows memory bandwidth, compute throughput, occupancy
- Metrics: achieved bandwidth / peak bandwidth; target 60-90% for FFT; memory-bound operation
- Bottlenecks: non-power-of-2 sizes, small batches, suboptimal layout; optimize based on profiling
- Tuning: adjust batch size, padding, layout, workspace; profile to find optimal

Multi-GPU FFT:
- Data Parallelism: distribute data across GPUs; each GPU processes subset; 70-85% scaling efficiency
- Transpose: all-to-all communication for transpose; InfiniBand or NVLink; 50-70% efficiency
- cuFFTMp: multi-GPU cuFFT library; automatic distribution; 70-85% scaling efficiency
- Use Cases: very large FFTs (>1GB); scientific computing; limited by communication

Best Practices:
- Power-of-2 Sizes: pad to power-of-2 when possible; 2-5Ɨ speedup; acceptable overhead
- Batch Processing: batch multiple FFTs; 90-95% efficiency; amortizes overhead
- Out-of-Place: use out-of-place for performance; in-place for memory; 5-10% speedup
- Workspace: provide workspace buffer; 10-30% speedup; allocate once, reuse
- Profile: measure actual bandwidth; compare with peak; optimize only if bottleneck

Performance Targets:
- 1D FFT: 500-1500 GB/s; 60-90% of peak (1.5-3 TB/s); power-of-2 sizes optimal
- 2D FFT: 800-2000 GB/s; 70-95% of peak; batched processing critical
- 3D FFT: 1000-2500 GB/s; 80-95% of peak; large volumes achieve best efficiency
- Batched: 90-95% efficiency vs single; amortizes overhead; critical for throughput

Real-World Applications:
- Audio Processing: real-time FFT for effects, analysis; 800-1500 GB/s; 10-50Ɨ faster than CPU
- Image Processing: 2D FFT for filtering, compression; 1000-2000 GB/s; 20-100Ɨ faster than CPU
- Scientific Computing: 3D FFT for simulations; 1500-2500 GB/s; enables large-scale problems
- Deep Learning: FFT-based convolution; 800-1500 GB/s; alternative to direct convolution

GPU FFT and Signal Processing represent the acceleration of frequency-domain operations. By leveraging the cuFFT library, which delivers 500-2000 GB/s of throughput (60-90% of peak bandwidth) through optimized radix algorithms, batched processing (90-95% efficiency), and specialized kernels, developers achieve 10-50Ɨ speedups over CPU implementations and enable real-time audio processing, large-scale image filtering, and scientific computing where FFT operations consume 20-80% of compute time. Proper optimization of batch sizing, memory layout, and workspace tuning determines whether an application achieves 200 GB/s or 2000 GB/s of throughput.
