Abstract
The continued slowing of Moore’s Law and the end of Dennard scaling have significantly limited traditional approaches to improving processor performance through device-level scaling and higher clock frequencies. As a result, performance gains have increasingly shifted toward architectural innovation, most notably through the integration of domain-specific accelerators that deliver substantial improvements in performance and energy efficiency over general-purpose processors. However, as emerging applications continue to exhibit rapidly growing computational and memory demands, the capabilities of individual accelerators are becoming insufficient to sustain performance scaling. To address this challenge, modern systems have embraced scalability as a primary design paradigm, aggregating compute and memory resources by interconnecting large numbers of accelerators through high-bandwidth networks. This trend has led to the deployment of systems comprising tens to thousands of accelerators organized across diverse and often hierarchical network topologies. Although such scaling enables significant gains in aggregate computing and memory capacity, it introduces fundamental challenges. In particular, the disparity between fast local accesses and comparatively slower remote accesses across accelerators creates non-uniform bandwidth and latency characteristics. As systems scale, this non-uniformity increasingly shifts the performance bottleneck from computation to communication, making interconnect efficiency a dominant factor in overall system performance.
This dissertation addresses this challenge by identifying, characterizing, and mitigating communication overheads in large-scale multi-accelerator systems, with a focus on multi-GPU and multi-PIM architectures. First, it presents a detailed characterization of multi-PIM systems, demonstrating that page table walks across PIM stacks introduce significant communication overheads. To address this, it proposes vPIM, a distributed, network-aware page table design based on contention-aware hashing, along with a pre-translation mechanism that localizes remote page table accesses. These techniques significantly reduce inter-stack communication and improve overall system efficiency. Next, this dissertation studies single-node multi-GPU systems and shows that network bandwidth across GPU groups is a critical determinant of performance. The analysis reveals that network flits are often underutilized, that a significant fraction carry redundant data, and that some are highly latency-sensitive. Based on these insights, this work proposes NetCrafter, a set of three complementary techniques (stitching, trimming, and sequencing) that improve effective bandwidth utilization by restructuring flit contents. NetCrafter significantly reduces network traffic and improves performance across a wide range of workloads. Finally, this dissertation extends the analysis to large-scale, multi-node multi-GPU clusters (up to 64 GPUs) that use emerging high-bandwidth interconnects such as UALink. The results demonstrate that address translation overheads at the network interface, particularly cold misses in translation structures, can significantly impact performance, especially for latency-sensitive workloads. For small-scale workloads such as inference, translation latency at the destination node emerges as a dominant component of end-to-end communication time. This dissertation identifies this translation latency as a key scalability bottleneck and proposes mitigation strategies, including fused pre-translation kernels and software-directed prefetching.
Altogether, this dissertation provides a comprehensive framework for understanding and eliminating communication bottlenecks in scalable architectures. By optimizing interconnect utilization and address translation, it lays the architectural foundations needed for efficient performance scaling in the next generation of high-performance multi-accelerator systems.