Rethinking Control, Access, and Communication Mechanisms for Data-Intensive Applications

Lenjani, Marzieh, Computer Science - School of Engineering and Applied Science, University of Virginia
Skadron, Kevin, EN-Comp Science Dept, University of Virginia

In current computing systems, transferring data between off-chip memory and the processor takes two orders of magnitude more time and consumes three orders of magnitude more energy than the computation itself. This dissertation presents two approaches for alleviating this data movement overhead: (i) enabling flexible-bitwidth value representation in the memory hierarchy of general-purpose processors and (ii) processing data inside the memory.
In the first approach, we propose a narrow-bitwidth, overflow-free value representation technique for the cache and main memory of general-purpose processors, which provides a 1.23× speedup over state-of-the-art cache compression techniques.
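As a rough illustration of what "narrow-bitwidth and overflow-free" can mean, the following minimal sketch stores a block of integers at the narrowest width that holds every value and widens the block, rather than wrapping around, when an update no longer fits. The function names, the set of allowed widths, and the per-block granularity are assumptions made for this example only; the abstract does not describe the dissertation's actual mechanism.

```python
def min_signed_bits(value: int) -> int:
    """Smallest two's-complement width (in bits) that holds `value` exactly."""
    if value >= 0:
        return value.bit_length() + 1  # one extra bit for the sign
    return (-value - 1).bit_length() + 1


def pick_block_width(values, allowed_widths=(8, 16, 32, 64)):
    """Choose the narrowest allowed width that fits every value in the block.

    The allowed widths are an assumption for illustration.
    """
    needed = max(min_signed_bits(v) for v in values)
    for width in allowed_widths:
        if needed <= width:
            return width
    raise OverflowError("value does not fit in the widest allowed width")


def add_overflow_free(block, width, delta, allowed_widths=(8, 16, 32, 64)):
    """Add `delta` to every element, widening the block instead of wrapping."""
    updated = [v + delta for v in block]
    new_width = max(width, pick_block_width(updated, allowed_widths))
    return updated, new_width
```

For example, the block `[3, -5, 100]` fits in 8 bits, but after adding 50 to each element the value 150 needs 9 bits, so the sketch promotes the block to 16 bits instead of letting the value overflow.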
In the second approach, we explore deviating from conventional von Neumann architectures to minimize data movement by adding processing units as close as possible to the memory cells, i.e., processing in memory (PIM). We show that realizing the true potential of PIM requires rethinking and simplifying the control, access, communication, and load-balancing mechanisms of PIM units.
Accordingly, we first propose a PIM-based accelerator, Fulcrum, that introduces a new architecture for in-memory processing units. The architecture offers a trade-off between (i) the full support for control and access divergence found in SISD approaches and (ii) the absent or costly support for control and divergence in SIMD/SIMT approaches. Fulcrum outperforms a server-class GPU with three stacks of HBM2 and, on average, provides a 70× speedup per memory stack and reduces energy consumption by 96%.
Next, we observe that rethinking communication and load-balancing mechanisms can unlock the benefits of PIM for a wider range of applications. Accordingly, in the second version of Fulcrum, dubbed Gearbox, we add hardware support for (i) offloading accumulation operations from one memory segment to another and (ii) balancing the load among processing elements. Gearbox can outperform Gunrock, a GPU-based graph-processing framework, by 15.73×. In the third version of Fulcrum, dubbed Pulley, we add hardware support that enables every group of processing elements to cooperatively generate an intermediate array. Pulley, on average, delivers a 20× speedup compared to Bonsai, an FPGA-based sorting accelerator.
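The idea of offloading accumulations from one memory segment to another can be pictured in software with the sketch below. It assumes, purely for illustration, that an output array is split into equal contiguous segments, that each segment is owned by one processing element, and that an update targeting a remote segment is queued for the owner and applied there. The names `owner_of` and `accumulate_with_offload` are hypothetical, and the sketch models neither Gearbox's hardware nor its load balancing.

```python
from collections import defaultdict


def owner_of(index, segment_size):
    """Segment that owns a global output index (assumed contiguous partitioning)."""
    return index // segment_size


def accumulate_with_offload(updates_per_segment, num_segments, segment_size):
    """Apply (index, value) updates, forwarding remote ones to the owning segment.

    updates_per_segment[s] lists the updates produced by the processing
    element attached to segment s.
    """
    segments = [[0] * segment_size for _ in range(num_segments)]
    outboxes = defaultdict(list)  # destination segment -> pending updates

    # Phase 1: accumulate local updates in place, queue remote ones.
    for src, updates in enumerate(updates_per_segment):
        for index, value in updates:
            dst = owner_of(index, segment_size)
            if dst == src:
                segments[src][index % segment_size] += value
            else:
                outboxes[dst].append((index, value))

    # Phase 2: each owner applies the accumulations offloaded to it.
    for dst, pending in outboxes.items():
        for index, value in pending:
            segments[dst][index % segment_size] += value

    return segments
```

With two segments of four elements each, the updates `[[(0, 1), (5, 2)], [(4, 3), (1, 4)]]` leave the array as `[[1, 4, 0, 0], [3, 2, 0, 0]]`: two updates are applied locally and two are forwarded to the other segment.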

PHD (Doctor of Philosophy)
Data-Intensive, Memory-Intensive, Data movement, PIM, processing in memory, Quantization
Issued Date: