Processing-in-Memory (PIM) and Near-Data Processing

Processing-in-Memory (PIM), or near-data processing, is a computer architecture paradigm that moves computation to where the data resides instead of moving data to the processor. It addresses the memory bandwidth wall by embedding compute units directly in or near memory (DRAM, HBM, storage). Data-intensive operations such as search, aggregation, and simple arithmetic can then run at internal memory bandwidth, which can be 10-100× higher than external bus bandwidth. They also avoid most of the energy cost of data movement, which can account for 60-90% of total energy in conventional architectures.

The Data Movement Problem

 Conventional:
 [CPU/GPU] ←── external bus ──→ [DRAM]
              64-128 GB/s
              ~10 pJ/bit transfer energy

 Processing-in-Memory:
 [DRAM + embedded compute]
              Internal bandwidth: 1-10 TB/s
              ~0.1 pJ/bit (no bus transfer)
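
The figures in the diagram translate directly into time and energy for a bulk scan. The sketch below is a back-of-the-envelope model, not a measurement. It assumes a 1 GB scan, takes 100 GB/s and 1 TB/s as representative points within the bandwidth ranges above, and counts only the per-bit transfer energy, so whole-system figures will be higher. The function and variable names are illustrative.

```python
def scan_cost(num_bytes, bandwidth_bytes_per_s, energy_pj_per_bit):
    """Return (seconds, joules) needed to stream num_bytes once."""
    seconds = num_bytes / bandwidth_bytes_per_s
    joules = num_bytes * 8 * energy_pj_per_bit * 1e-12  # pJ -> J
    return seconds, joules


GB = 10**9

# Assumed operating points taken from the ranges in the diagram.
conv_time, conv_energy = scan_cost(1 * GB, 100 * GB, 10.0)
pim_time, pim_energy = scan_cost(1 * GB, 1000 * GB, 0.1)

print(f"conventional: {conv_time * 1e3:.1f} ms, {conv_energy * 1e3:.1f} mJ")
print(f"PIM:          {pim_time * 1e3:.1f} ms, {pim_energy * 1e3:.1f} mJ")
```

Under these assumptions the scan takes about 10 ms and 80 mJ over the external bus, versus about 1 ms and 0.8 mJ inside memory.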

PIM Approaches

| Approach | Where Compute Lives | Compute Capability | Example |
|---|---|---|---|
| In-DRAM | Inside DRAM die | Very simple (AND, OR, copy) | Ambit, DRISA |
| Near-Bank | Next to the banks in the DRAM dies of an HBM stack | ALU, simple SIMD | Samsung HBM-PIM |
| Near-Memory | Buffer chip or interposer | Full processor core | UPMEM, AIM |
| Smart SSD | Inside SSD controller | ARM cores + FPGA | Samsung SmartSSD |

Samsung HBM-PIM

 HBM Stack:
 ┌─────────────────┐
 │   DRAM Die 3    │
 │   DRAM Die 2    │  Small FP16 SIMD ALUs sit next to the DRAM banks
 │   DRAM Die 1    │  → Process data without sending it to the GPU
 │   DRAM Die 0    │
 │  Base Logic Die │  ← Host interface (buffer die)
 └─────────────────┘

 Optimized for: element-wise ops, GEMV, embedding lookups
 Bandwidth: ~1 TB/s internal per stack, about 4× the ~0.3 TB/s
            external bandwidth of the HBM2 stack it is built on
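
GEMV suits this design because each matrix element is used only once, so the work can be split by rows across banks and only short vectors have to cross the external interface. The following is a host-side simulation in plain Python, not Samsung's programming interface. The number of banks, the row-to-bank assignment, and all function names are assumptions made for illustration.

```python
def gemv_near_bank(matrix, vector, num_banks=4):
    """Simulate y = A @ x with the rows of A striped across banks.

    Returns the result vector and the number of values that would cross
    the external interface: the broadcast input vector plus the result.
    """
    # Stripe rows across banks: row i is stored in bank i % num_banks.
    banks = [[] for _ in range(num_banks)]
    for i, row in enumerate(matrix):
        banks[i % num_banks].append((i, row))

    result = [0.0] * len(matrix)
    for bank_rows in banks:
        # Each bank's ALU computes dot products on rows it already holds.
        for i, row in bank_rows:
            result[i] = sum(a * x for a, x in zip(row, vector))

    values_moved = len(vector) + len(result)
    return result, values_moved


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -1.0]
y, moved = gemv_near_bank(A, x)
print(y, moved)
```

For an m×n matrix, a conventional GEMV reads all m·n matrix elements across the bus, whereas this scheme moves only n + m values.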

UPMEM: Commercial PIM

PIM-Suitable Workloads

| Workload | Why PIM Helps | Speedup |
|---|---|---|
| Database scan/filter | Eliminate 90% of rows before transfer | 5-20× |
| Embedding lookup | Random access + simple reduce | 3-10× |
| Graph traversal | Random access, low arithmetic | 5-15× |
| Genome search | String matching, embarrassingly parallel | 10-50× |
| Recommendation inference | Sparse embedding + simple MLP | 3-8× |
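
The scan/filter row can be made concrete by counting the bytes that cross the memory bus. This is a toy model with hypothetical names. It assumes fixed-size rows and a predicate evaluated inside memory, and it ignores the cost of sending the predicate itself.

```python
def bytes_transferred(rows, predicate, row_bytes, filter_in_memory):
    """Bytes crossing the memory bus for a filtered scan."""
    if filter_in_memory:
        # PIM: only rows that satisfy the predicate leave memory.
        return sum(row_bytes for row in rows if predicate(row))
    # Conventional: every row is shipped to the CPU and filtered there.
    return len(rows) * row_bytes


def keep_ten_percent(value):
    """Example predicate that matches 10% of the rows."""
    return value % 10 == 0


rows = list(range(1000))
conventional = bytes_transferred(rows, keep_ten_percent, 64, False)
pim = bytes_transferred(rows, keep_ten_percent, 64, True)
print(conventional, pim, conventional / pim)
```

With a predicate that keeps 10% of the rows, bus traffic falls by 10×. The end-to-end speedup also depends on how quickly the in-memory units can evaluate the predicate.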

PIM-Unsuitable Workloads

| Workload | Why PIM Doesn't Help |
|---|---|
| Dense matrix multiply | High arithmetic intensity → GPU wins |
| Complex neural networks | Need large shared caches, tensor cores |
| Workloads needing data reuse | PIM has minimal cache |
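
Arithmetic intensity, the number of operations per byte fetched from memory, is the usual way to decide which side of this divide a kernel falls on. The sketch below uses the standard roofline bound. The peak throughput and bandwidth values are placeholders rather than figures for any real device, and FP16 (2-byte) operands are assumed.

```python
def gemv_intensity(m, n, bytes_per_elem=2):
    """FLOPs per byte for y = A @ x with A of shape (m, n)."""
    flops = 2 * m * n                             # one multiply and one add per element
    traffic = bytes_per_elem * (m * n + n + m)    # read A and x, write y
    return flops / traffic


def gemm_intensity(n, bytes_per_elem=2):
    """FLOPs per byte for C = A @ B with square n x n matrices and ideal reuse."""
    flops = 2 * n ** 3
    traffic = bytes_per_elem * 3 * n * n          # read A and B, write C, once each
    return flops / traffic


def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)


PEAK = 100e12   # placeholder: 100 TFLOP/s
BW = 1e12       # placeholder: 1 TB/s

for name, ai in [("GEMV 4096x4096", gemv_intensity(4096, 4096)),
                 ("GEMM 4096", gemm_intensity(4096))]:
    tflops = attainable_flops(ai, PEAK, BW) / 1e12
    print(f"{name}: {ai:.1f} FLOP/byte, {tflops:.1f} TFLOP/s attainable")
```

GEMV comes out near 1 FLOP/byte and is limited by bandwidth, so the extra bandwidth of PIM helps directly. GEMM with good reuse is limited by peak compute, where a GPU's tensor cores have the advantage.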

Energy Efficiency

| Operation | Conventional | PIM | Energy Saving |
|---|---|---|---|
| 64-bit DRAM read + add | 20 nJ | 2 nJ | 10× |
| 1 GB data scan | 200 mJ | 20 mJ | 10× |
| Embedding lookup (1M table) | 50 mJ | 8 mJ | ~6× |

Processing-in-memory is an architectural response to the data movement costs that dominate modern computing energy budgets. By embedding computation within the memory hierarchy itself, PIM avoids much of the traffic across bandwidth-limited buses and offers order-of-magnitude improvements in energy efficiency and throughput for data-intensive workloads. It could become a significant shift in system design as memory bandwidth demands continue to outpace interconnect scaling.
