Understanding and Optimizing Memory Access Behaviors via Hardware/Software Co-Design

Author ORCID: orcid.org/0000-0002-4203-5012
Ahmed, Alif, Computer Science - School of Engineering and Applied Science, University of Virginia
Skadron, Kevin, EN-Comp Science Dept, University of Virginia

Due to the ever-increasing gap between the speed of processing elements and the rate at which memory systems can feed them with data, current computing systems are often bottlenecked by limited memory bandwidth. To alleviate this bottleneck, algorithms and data structures for data-intensive applications must be designed to exploit hardware features in the memory hierarchy, and in some cases a hardware/software co-design approach is beneficial.

To that end, we first propose Hopscotch, a micro-benchmark suite for memory performance characterization. It fills a gap left by existing memory benchmarks by providing kernels that cover a wide range of the spatio-temporal locality spectrum. The suite also contains tools for visualizing memory-access patterns and for empirically deriving the Roofline model, helping users isolate performance bottlenecks and reverse-engineer memory characteristics.

Next, we optimize two data-intensive applications by leveraging memory-centric hardware features in traditional general-purpose architectures. BigMap optimizes the memory access behavior of the popular fuzzer AFL to enable large bitmaps for accurate coverage tracking: its two-level hashing scheme consolidates scattered random accesses, vastly improving the spatial locality of bitmap operations. GraphTango minimizes cache line accesses on streaming graphs through a hybrid storage format and a cache-friendly hashing scheme; unlike prior approaches, it provides excellent update and analytics throughput regardless of a graph's degree distribution.

Our last two works focus on hardware/software co-optimization to reduce data movement. Pulley is an in-memory sorting accelerator that leverages subarray-level parallelism for distributed LSB-first radix sorting, avoiding the single-point merging bottleneck of prior near-data and in-memory sorting accelerators. Finally, we propose a vault-level processing-in-memory (PIM) architecture for accelerating inference on temporal graph neural networks. A feature-based partitioning scheme minimizes inter-vault communication and improves workload balance, yielding significant throughput gains and latency reductions over other approaches.
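The two-level hashing idea behind BigMap can be illustrated with a minimal software sketch: rather than updating a huge, sparsely populated bitmap at scattered random offsets, a first-level lookup assigns each key a compact slot, and all updates land in a small dense array with good spatial locality. The class and method names here (`TwoLevelMap`, `hit`) and the dictionary-based first level are illustrative assumptions, not BigMap's actual implementation.

```python
class TwoLevelMap:
    """Sketch of two-level hashing: sparse keys -> compact dense slots."""

    def __init__(self, capacity):
        self.slot_of = {}               # level 1: key -> compact slot index
        self.counts = [0] * capacity    # level 2: small dense array (cache-friendly)
        self.next_slot = 0

    def hit(self, key):
        slot = self.slot_of.get(key)
        if slot is None:                # first time this key is observed
            slot = self.next_slot
            self.slot_of[key] = slot
            self.next_slot += 1
        self.counts[slot] += 1          # dense update instead of a scattered one
```

Even if the keys span a huge range (e.g., hashed edge IDs), updates touch only the first `next_slot` entries of `counts`, so the hot working set stays within a few cache lines.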
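The hybrid storage idea used for streaming graphs can be sketched as a degree-adaptive adjacency structure: low-degree vertices keep their neighbors in a small dense list that scans within a few cache lines, while high-degree vertices switch to a hash-based container for constant-time updates. The switch-over threshold and all names below are illustrative assumptions, not GraphTango's actual design.

```python
THRESHOLD = 8  # assumed switch-over degree; a real design would tune this

class HybridAdjacency:
    """Sketch of a hybrid per-vertex neighbor store: dense list vs. hash set."""

    def __init__(self):
        self.small = []    # dense neighbor list while degree is low
        self.large = None  # hash set once degree exceeds THRESHOLD

    def add_edge(self, v):
        if self.large is not None:
            self.large.add(v)
        elif v not in self.small:          # linear scan is cheap at low degree
            self.small.append(v)
            if len(self.small) > THRESHOLD:
                self.large = set(self.small)  # promote to hash-based storage
                self.small = []

    def has_edge(self, v):
        return v in self.large if self.large is not None else v in self.small
```

The point of the hybrid split is that neither representation alone wins: dense lists waste lookups on hub vertices, and hash containers waste cache lines on the many low-degree vertices, so the format adapts per vertex.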
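LSB-first (least-significant-digit-first) radix sorting, the algorithm Pulley distributes across DRAM subarrays, can be sketched in software as follows. Because each digit pass is a stable partition, the order established by earlier (lower) digits survives later passes, which is exactly why no final merge step is needed. The digit width is an arbitrary choice here, and the real accelerator operates in DRAM rather than on Python lists.

```python
def lsd_radix_sort(vals, bits=8, width=32):
    """LSB-first radix sort of non-negative ints: stable pass per digit."""
    mask = (1 << bits) - 1
    for shift in range(0, width, bits):
        buckets = [[] for _ in range(1 << bits)]
        for v in vals:
            buckets[(v >> shift) & mask].append(v)  # stable partition by digit
        vals = [v for b in buckets for v in b]       # concatenate in digit order
    return vals
```

In a subarray-parallel setting, each pass's histogram and scatter can be computed per subarray and combined with a prefix sum over bucket counts, so every element moves directly to its final position for that pass instead of funneling through a single merger.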

PhD (Doctor of Philosophy)
Keywords: Graph processing, Temporal graph neural network, Processing in memory
Issued Date: