Processing-in-Memory (PIM) and Near-Data Processing

Keywords: processing in memory, pim, near data processing, in memory computing, compute near memory

Processing-in-Memory (PIM) and Near-Data Processing form the computer architecture paradigm that moves computation to where data resides, rather than moving data to the processor. By embedding compute units directly in or near memory (DRAM, HBM, storage), PIM addresses the memory bandwidth wall: data-intensive operations such as search, aggregation, and simple arithmetic execute at internal memory bandwidth, typically 10-100× higher than external bus bandwidth, and avoid the energy cost of data movement, which accounts for an estimated 60-90% of total energy in conventional architectures.

The Data Movement Problem

```
Conventional:
[CPU/GPU] ←── external bus ──→ [DRAM]
64-128 GB/s
~10 pJ/bit transfer energy

Processing-in-Memory:
[DRAM + embedded compute]
Internal bandwidth: 1-10 TB/s
~0.1 pJ/bit (no bus transfer)
```

- Modern CPUs: roughly 50% of power is spent on data movement rather than computation.
- GPU HBM: 3.35 TB/s of bandwidth (H100), yet still not enough for many bandwidth-bound workloads.
- PIM: exploits the massive aggregate internal bandwidth of DRAM banks (each bank ~10-50 GB/s; 32 banks give 320-1600 GB/s), as the sketch below illustrates.
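
A back-of-envelope model makes the gap concrete. The sketch below (plain Python; the bandwidth and energy constants are the rough figures quoted above, not measurements of any specific device) compares external-bus versus aggregate internal-bank bandwidth and the per-bit transfer energy:

```python
# Back-of-envelope data-movement model. All constants are the rough
# figures quoted above, not measurements of any specific device.

EXTERNAL_BUS_GBPS = 100   # conventional CPU<->DRAM bus (~64-128 GB/s)
PER_BANK_GBPS = 30        # internal DRAM bank bandwidth (~10-50 GB/s)
NUM_BANKS = 32

E_BUS_PJ_PER_BIT = 10.0   # ~10 pJ/bit to cross the external bus
E_PIM_PJ_PER_BIT = 0.1    # ~0.1 pJ/bit when compute stays in memory

internal_gbps = PER_BANK_GBPS * NUM_BANKS
print(f"aggregate internal bandwidth: {internal_gbps} GB/s "
      f"({internal_gbps / EXTERNAL_BUS_GBPS:.0f}x the external bus)")

# Transfer energy alone (DRAM-read energy excluded) for scanning 1 GB:
bits = 1e9 * 8
print(f"1 GB over the bus:   {bits * E_BUS_PJ_PER_BIT / 1e9:.1f} mJ")
print(f"1 GB scanned in-PIM: {bits * E_PIM_PJ_PER_BIT / 1e9:.1f} mJ")
```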

PIM Approaches

| Approach | Where Compute Lives | Compute Capability | Example |
|----------|-------------------|-------------------|--------|
| In-DRAM | Inside DRAM die | Very simple (AND, OR, copy) | Ambit, DRISA |
| Near-Bank | Logic die in HBM stack | ALU, simple SIMD | Samsung HBM-PIM |
| Near-Memory | Buffer chip or interposer | Full processor core | UPMEM, AIM |
| Smart SSD | Inside SSD controller | ARM cores + FPGA | Samsung SmartSSD |

Samsung HBM-PIM

```
HBM Stack:
┌─────────────────┐
│ DRAM Die 3 │
│ DRAM Die 2 │ Each DRAM bank has a small FP16 ALU
│ DRAM Die 1 │ → Process data without sending to GPU
│ DRAM Die 0 │
│ Base Logic Die │ ← PIM controller + ALUs
└─────────────────┘

Optimized for: Element-wise ops, GEMV, embedding lookups
Bandwidth: ~1 TB/s internal per stack (vs. ~0.3 TB/s external per HBM2 stack)
```
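
To see why GEMV and embedding-style operations map well onto per-bank ALUs, the NumPy sketch below partitions a matrix-vector product row-wise across banks, so each bank produces its own slice of the output and the matrix never crosses the external bus. The bank count and partitioning scheme are illustrative assumptions, not Samsung's actual command interface:

```python
import numpy as np

# Illustrative model of bank-parallel GEMV (y = A @ x). Row blocks of A
# live in different banks; each bank's FP16 ALU computes its slice of y,
# so A never crosses the external bus. The bank count is an assumption.
NUM_BANKS = 16
rows, cols = 1024, 1024
A = np.random.rand(rows, cols).astype(np.float16)  # resident in DRAM banks
x = np.random.rand(cols).astype(np.float16)        # broadcast to all banks

row_blocks = np.array_split(np.arange(rows), NUM_BANKS)
y = np.empty(rows, dtype=np.float16)
for bank, block in enumerate(row_blocks):
    # Each bank performs an independent partial GEMV on its own rows.
    y[block] = A[block] @ x

assert np.allclose(y, A @ x, rtol=1e-2)  # FP16 tolerance
```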

UPMEM: Commercial PIM

- DIMM-compatible PIM: standard DDR4 DIMMs are replaced with PIM DIMMs.
- Each DIMM: 128 processing elements (DPUs, 8 per DRAM chip × 16 chips); a fully populated server holds up to 2,560. Each DPU has:
  - A 32-bit RISC core, 24 KB of instruction memory, and 64 KB of working memory (WRAM).
  - Direct access to its own 64 MB DRAM bank (called MRAM in UPMEM terminology).
- Applications: genomics (sequence matching), databases (scan/filter, sketched below), analytics.
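
The host-side offload pattern looks roughly like the sketch below: a plain-Python simulation of partition, scatter, per-DPU kernel, and gather. The real UPMEM SDK is C-based, so the names here are illustrative, not its actual API:

```python
# Simulation of the UPMEM offload pattern: partition a table across DPUs,
# run the same filter kernel on each DPU's private MRAM slice, and
# transfer back only the matches. Not the real (C-based) UPMEM SDK.
NUM_DPUS = 128  # one fully populated UPMEM DIMM

def dpu_kernel(rows, threshold):
    """Per-DPU scan/filter: runs against local MRAM, no shared memory."""
    return [r for r in rows if r > threshold]

def pim_filter(table, threshold):
    # Host scatters row ranges; each DPU sees only its own slice.
    chunk = (len(table) + NUM_DPUS - 1) // NUM_DPUS
    slices = [table[i:i + chunk] for i in range(0, len(table), chunk)]
    # "Launch" all DPUs (sequential here; independent in hardware).
    results = [dpu_kernel(s, threshold) for s in slices]
    # Gather: only qualifying rows cross back to the host.
    return [r for part in results for r in part]

table = list(range(1_000_000))
hot = pim_filter(table, threshold=900_000)
print(f"{len(hot)} of {len(table)} rows returned to host")
```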

PIM-Suitable Workloads

| Workload | Why PIM Helps | Typical Speedup |
|----------|---------------|-----------------|
| Database scan/filter | Eliminate 90% of rows before transfer | 5-20× |
| Embedding lookup | Random access + simple reduce | 3-10× |
| Graph traversal | Random access, low arithmetic | 5-15× |
| Genome search | String matching, embarrassingly parallel | 10-50× |
| Recommendation inference | Sparse embedding + simple MLP | 3-8× |
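
The scan/filter numbers in the table follow from how much data the filter eliminates before transfer. The sketch below uses a simple transfer-bound model (compute time ignored; bandwidth constants are the rough figures from earlier) to estimate speedup as a function of filter selectivity:

```python
# Transfer-bound speedup model for PIM filter pushdown.
# Assumes the workload is dominated by moving bytes, not computing on them.
EXTERNAL_GBPS = 100    # conventional bus (rough figure from above)
INTERNAL_GBPS = 1000   # aggregate in-memory bank bandwidth (rough)

def pim_speedup(selectivity, data_gb=1.0):
    """selectivity = fraction of rows surviving the filter."""
    conventional = data_gb / EXTERNAL_GBPS                 # ship everything
    pim = data_gb / INTERNAL_GBPS + selectivity * data_gb / EXTERNAL_GBPS
    return conventional / pim

for sel in (0.5, 0.1, 0.01):
    print(f"selectivity {sel:>4}: ~{pim_speedup(sel):.1f}x")
```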

PIM-Unsuitable Workloads

| Workload | Why PIM Doesn't Help |
|----------|---------------------|
| Dense matrix multiply | High arithmetic intensity → GPU wins |
| Complex neural networks | Need large shared caches, tensor cores |
| Workloads needing data reuse | PIM has minimal cache |
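
The dividing line between the two tables is arithmetic intensity (FLOPs per byte moved): kernels below the machine's compute-to-bandwidth ratio are bandwidth-bound and can benefit from PIM, while kernels above it are compute-bound and favor a GPU's dense ALUs. A minimal roofline-style check, with illustrative round numbers for the hardware:

```python
# Roofline-style check: is a kernel bandwidth-bound (PIM candidate) or
# compute-bound (GPU territory)? Hardware numbers are illustrative.
GPU_TFLOPS = 100                 # peak compute, round number
GPU_TBPS = 3                     # HBM bandwidth, round number
ridge = GPU_TFLOPS / GPU_TBPS    # FLOPs/byte where the GPU stops starving

kernels = {
    "dense GEMM (large)": 100.0,  # FLOPs per byte: high data reuse
    "GEMV": 0.25,                 # 2 FLOPs per 8-byte matrix element
    "embedding lookup": 0.01,     # almost pure data movement
}
for name, intensity in kernels.items():
    if intensity > ridge:
        verdict = "compute-bound (GPU)"
    else:
        verdict = "bandwidth-bound (PIM candidate)"
    print(f"{name:<20} {intensity:>6} FLOP/B -> {verdict}")
```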

Energy Efficiency

| Operation | Conventional | PIM | Energy Saving |
|-----------|-------------|-----|---------------|
| 64-bit DRAM read + add | 20 nJ | 2 nJ | 10× |
| 1 GB data scan | 200 mJ | 20 mJ | 10× |
| Embedding lookup (1M table) | 50 mJ | 8 mJ | 6× |
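
Because these per-operation costs compose roughly linearly, a workload's overall saving can be estimated from its operation mix. The estimator below uses the table's approximate values as constants; the example workload mix is hypothetical:

```python
# Workload energy estimator built from the (approximate) per-op costs
# above. Costs in nanojoules per operation.
COST_NJ = {
    "read_add_64b": {"conventional": 20, "pim": 2},
    # 1 GB scan and embedding lookup from the table, converted to nJ:
    "scan_1gb":     {"conventional": 200e6, "pim": 20e6},
    "embed_lookup": {"conventional": 50e6,  "pim": 8e6},
}

def energy_mj(op_counts, mode):
    nj = sum(COST_NJ[op][mode] * n for op, n in op_counts.items())
    return nj / 1e6  # nJ -> mJ

# Hypothetical recommendation-inference step: 100 lookups + 1M read+adds.
mix = {"embed_lookup": 100, "read_add_64b": 1_000_000}
conv, pim = energy_mj(mix, "conventional"), energy_mj(mix, "pim")
print(f"conventional: {conv:.0f} mJ, PIM: {pim:.0f} mJ "
      f"({conv / pim:.1f}x saving)")
```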

Processing-in-memory is the architectural response to the data-movement crisis that dominates modern computing energy budgets. By embedding computation within the memory hierarchy itself, PIM sidesteps the bottleneck of moving data across bandwidth-limited buses, offering order-of-magnitude improvements in energy efficiency and throughput for data-intensive workloads. As memory bandwidth demands continue to outpace interconnect scaling, it represents a potential paradigm shift.
