Processing-in-Memory (PIM) and Near-Data Processing

Keywords: processing in memory, pim, near data processing, in memory computing, compute near memory

Processing-in-Memory (PIM) and Near-Data Processing form the computer architecture paradigm that moves computation to where data resides, rather than moving data to the processor. By embedding compute units directly in or near memory (DRAM, HBM, storage), PIM addresses the memory bandwidth wall: data-intensive operations such as search, aggregation, and simple arithmetic execute at internal memory bandwidth, typically 10-100× higher than external bus bandwidth, and avoid the energy cost of data movement, which accounts for an estimated 60-90% of total energy in conventional architectures.

The Data Movement Problem

```
Conventional:
[CPU/GPU] ←── external bus ──→ [DRAM]
64-128 GB/s
~10 pJ/bit transfer energy

Processing-in-Memory:
[DRAM + embedded compute]
Internal bandwidth: 1-10 TB/s
~0.1 pJ/bit (no bus transfer)
```

- Modern CPUs: roughly 50% of power is spent on data movement rather than computation.
- GPU HBM: 3.35 TB/s of bandwidth (H100), yet still not enough for many bandwidth-bound workloads.
- PIM: exploits the massive aggregate internal bandwidth of DRAM banks (each bank ~10-50 GB/s; 32 banks give 320-1600 GB/s), as the sketch below illustrates.
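
A back-of-envelope model makes the gap concrete. The sketch below (plain Python; the bandwidth and energy constants are the rough figures quoted above, not measurements of any specific device) compares external-bus versus aggregate internal-bank bandwidth and the per-bit transfer energy:

```python
# Back-of-envelope data-movement model. All constants are the rough
# figures quoted above, not measurements of any specific device.

EXTERNAL_BUS_GBPS = 100   # conventional CPU<->DRAM bus (~64-128 GB/s)
PER_BANK_GBPS = 30        # internal DRAM bank bandwidth (~10-50 GB/s)
NUM_BANKS = 32

E_BUS_PJ_PER_BIT = 10.0   # ~10 pJ/bit to cross the external bus
E_PIM_PJ_PER_BIT = 0.1    # ~0.1 pJ/bit when compute stays in memory

internal_gbps = PER_BANK_GBPS * NUM_BANKS
print(f"aggregate internal bandwidth: {internal_gbps} GB/s "
      f"({internal_gbps / EXTERNAL_BUS_GBPS:.0f}x the external bus)")

# Transfer energy alone (DRAM-read energy excluded) for scanning 1 GB:
bits = 1e9 * 8
print(f"1 GB over the bus:   {bits * E_BUS_PJ_PER_BIT / 1e9:.1f} mJ")
print(f"1 GB scanned in-PIM: {bits * E_PIM_PJ_PER_BIT / 1e9:.1f} mJ")
```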

PIM Approaches

| Approach | Where Compute Lives | Compute Capability | Example |
|----------|-------------------|-------------------|--------|
| In-DRAM | Inside DRAM die | Very simple (AND, OR, copy) | Ambit, DRISA |
| Near-Bank | Logic die in HBM stack | ALU, simple SIMD | Samsung HBM-PIM |
| Near-Memory | Buffer chip or interposer | Full processor core | UPMEM, AIM |
| Smart SSD | Inside SSD controller | ARM cores + FPGA | Samsung SmartSSD |

Samsung HBM-PIM

```
HBM Stack:
┌─────────────────┐
│ DRAM Die 3 │
│ DRAM Die 2 │ Each DRAM bank has a small FP16 ALU
│ DRAM Die 1 │ → Process data without sending to GPU
│ DRAM Die 0 │
│ Base Logic Die │ ← PIM controller + ALUs
└─────────────────┘

Optimized for: Element-wise ops, GEMV, embedding lookups
Bandwidth: ~1 TB/s internal per stack (vs. ~0.3 TB/s external per HBM2 stack)
```
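
To see why GEMV and embedding-style operations map well onto per-bank ALUs, the NumPy sketch below partitions a matrix-vector product row-wise across banks, so each bank produces its own slice of the output and the matrix never crosses the external bus. The bank count and partitioning scheme are illustrative assumptions, not Samsung's actual command interface:

```python
import numpy as np

# Illustrative model of bank-parallel GEMV (y = A @ x). Row blocks of A
# live in different banks; each bank's FP16 ALU computes its slice of y,
# so A never crosses the external bus. The bank count is an assumption.
NUM_BANKS = 16
rows, cols = 1024, 1024
A = np.random.rand(rows, cols).astype(np.float16)  # resident in DRAM banks
x = np.random.rand(cols).astype(np.float16)        # broadcast to all banks

row_blocks = np.array_split(np.arange(rows), NUM_BANKS)
y = np.empty(rows, dtype=np.float16)
for bank, block in enumerate(row_blocks):
    # Each bank performs an independent partial GEMV on its own rows.
    y[block] = A[block] @ x

assert np.allclose(y, A @ x, rtol=1e-2)  # FP16 tolerance
```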

UPMEM: Commercial PIM

- DIMM-compatible PIM: standard DDR4 DIMMs are replaced with PIM DIMMs.
- Each DIMM: 128 processing elements (DPUs, 8 per DRAM chip × 16 chips); a fully populated server holds up to 2,560. Each DPU has:
  - A 32-bit RISC core, 24 KB of instruction memory, and 64 KB of working memory (WRAM).
  - Direct access to its own 64 MB DRAM bank (called MRAM in UPMEM terminology).
- Applications: genomics (sequence matching), databases (scan/filter, sketched below), analytics.
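
The host-side offload pattern looks roughly like the sketch below: a plain-Python simulation of partition, scatter, per-DPU kernel, and gather. The real UPMEM SDK is C-based, so the names here are illustrative, not its actual API:

```python
# Simulation of the UPMEM offload pattern: partition a table across DPUs,
# run the same filter kernel on each DPU's private MRAM slice, and
# transfer back only the matches. Not the real (C-based) UPMEM SDK.
NUM_DPUS = 128  # one fully populated UPMEM DIMM

def dpu_kernel(rows, threshold):
    """Per-DPU scan/filter: runs against local MRAM, no shared memory."""
    return [r for r in rows if r > threshold]

def pim_filter(table, threshold):
    # Host scatters row ranges; each DPU sees only its own slice.
    chunk = (len(table) + NUM_DPUS - 1) // NUM_DPUS
    slices = [table[i:i + chunk] for i in range(0, len(table), chunk)]
    # "Launch" all DPUs (sequential here; independent in hardware).
    results = [dpu_kernel(s, threshold) for s in slices]
    # Gather: only qualifying rows cross back to the host.
    return [r for part in results for r in part]

table = list(range(1_000_000))
hot = pim_filter(table, threshold=900_000)
print(f"{len(hot)} of {len(table)} rows returned to host")
```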

PIM-Suitable Workloads

| Workload | Why PIM Helps | Typical Speedup |
|----------|---------------|-----------------|
| Database scan/filter | Eliminate 90% of rows before transfer | 5-20× |
| Embedding lookup | Random access + simple reduce | 3-10× |
| Graph traversal | Random access, low arithmetic | 5-15× |
| Genome search | String matching, embarrassingly parallel | 10-50× |
| Recommendation inference | Sparse embedding + simple MLP | 3-8× |
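
The scan/filter numbers in the table follow from how much data the filter eliminates before transfer. The sketch below uses a simple transfer-bound model (compute time ignored; bandwidth constants are the rough figures from earlier) to estimate speedup as a function of filter selectivity:

```python
# Transfer-bound speedup model for PIM filter pushdown.
# Assumes the workload is dominated by moving bytes, not computing on them.
EXTERNAL_GBPS = 100    # conventional bus (rough figure from above)
INTERNAL_GBPS = 1000   # aggregate in-memory bank bandwidth (rough)

def pim_speedup(selectivity, data_gb=1.0):
    """selectivity = fraction of rows surviving the filter."""
    conventional = data_gb / EXTERNAL_GBPS                 # ship everything
    pim = data_gb / INTERNAL_GBPS + selectivity * data_gb / EXTERNAL_GBPS
    return conventional / pim

for sel in (0.5, 0.1, 0.01):
    print(f"selectivity {sel:>4}: ~{pim_speedup(sel):.1f}x")
```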

PIM-Unsuitable Workloads

| Workload | Why PIM Doesn't Help |
|----------|---------------------|
| Dense matrix multiply | High arithmetic intensity → GPU wins |
| Complex neural networks | Need large shared caches, tensor cores |
| Workloads needing data reuse | PIM has minimal cache |
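
The dividing line between the two tables is arithmetic intensity (FLOPs per byte moved): kernels below the machine's compute-to-bandwidth ratio are bandwidth-bound and can benefit from PIM, while kernels above it are compute-bound and favor a GPU's dense ALUs. A minimal roofline-style check, with illustrative round numbers for the hardware:

```python
# Roofline-style check: is a kernel bandwidth-bound (PIM candidate) or
# compute-bound (GPU territory)? Hardware numbers are illustrative.
GPU_TFLOPS = 100                 # peak compute, round number
GPU_TBPS = 3                     # HBM bandwidth, round number
ridge = GPU_TFLOPS / GPU_TBPS    # FLOPs/byte where the GPU stops starving

kernels = {
    "dense GEMM (large)": 100.0,  # FLOPs per byte: high data reuse
    "GEMV": 0.25,                 # 2 FLOPs per 8-byte matrix element
    "embedding lookup": 0.01,     # almost pure data movement
}
for name, intensity in kernels.items():
    if intensity > ridge:
        verdict = "compute-bound (GPU)"
    else:
        verdict = "bandwidth-bound (PIM candidate)"
    print(f"{name:<20} {intensity:>6} FLOP/B -> {verdict}")
```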

Energy Efficiency

| Operation | Conventional | PIM | Energy Saving |
|-----------|-------------|-----|---------------|
| 64-bit DRAM read + add | 20 nJ | 2 nJ | 10× |
| 1 GB data scan | 200 mJ | 20 mJ | 10× |
| Embedding lookup (1M table) | 50 mJ | 8 mJ | 6× |
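
Because these per-operation costs compose roughly linearly, a workload's overall saving can be estimated from its operation mix. The estimator below uses the table's approximate values as constants; the example workload mix is hypothetical:

```python
# Workload energy estimator built from the (approximate) per-op costs
# above. Costs in nanojoules per operation.
COST_NJ = {
    "read_add_64b": {"conventional": 20, "pim": 2},
    # 1 GB scan and embedding lookup from the table, converted to nJ:
    "scan_1gb":     {"conventional": 200e6, "pim": 20e6},
    "embed_lookup": {"conventional": 50e6,  "pim": 8e6},
}

def energy_mj(op_counts, mode):
    nj = sum(COST_NJ[op][mode] * n for op, n in op_counts.items())
    return nj / 1e6  # nJ -> mJ

# Hypothetical recommendation-inference step: 100 lookups + 1M read+adds.
mix = {"embed_lookup": 100, "read_add_64b": 1_000_000}
conv, pim = energy_mj(mix, "conventional"), energy_mj(mix, "pim")
print(f"conventional: {conv:.0f} mJ, PIM: {pim:.0f} mJ "
      f"({conv / pim:.1f}x saving)")
```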

Processing-in-memory is the architectural response to the data-movement crisis that dominates modern computing energy budgets. By embedding computation within the memory hierarchy itself, PIM sidesteps the bottleneck of moving data across bandwidth-limited buses, offering order-of-magnitude improvements in energy efficiency and throughput for data-intensive workloads. As memory bandwidth demands continue to outpace interconnect scaling, it represents a potential paradigm shift.
