Porting CUDA Kernels to ROCm; Reliability and Responsibility in the context of Software Engineering

Kiani, Shahryar

Author

Kiani, Shahryar, School of Engineering and Applied Science, University of Virginia

Advisors

Elliott, Travis, AT-Academic Affairs, University of Virginia
Skadron, Kevin, EN-Comp Science Dept, University of Virginia

Abstract

The STS Research Paper and the Capstone Project, although both software related,
examined two different areas of Computer Science and Software Engineering. The STS Research
Paper examined the societal and technological implications of businesses shifting the
responsibility for managing their compute resources to cloud providers such as AWS or Microsoft
Azure. The Capstone Project extended an existing benchmark suite (developed at
UVA) that is used for evaluating parallel computing accelerators. The work updated the
benchmark suite to add direct support for running on AMD GPUs using APIs developed by
AMD, replacing the previous OpenCL-based approach. This opens the door to better
performance and provides an opportunity to evaluate the tools AMD has developed for
writing code for its hardware.
The STS Research Paper looks specifically at how the growth of cloud providers has made
software systems more brittle, so that failures now affect many services and companies
rather than being isolated to a single system. That isolation is one of the main advantages of
the distributed nature of the internet. The main analytical technique was
Actor-Network Theory, which makes it possible to describe the layers between end users, the
companies that sell software services, and cloud providers, and how those layers of interaction
create different expectations and incentives for different actors in the network. The paper
examines cases such as major cloud provider outages and software failures caused by manufacturing
defects in hardware to analyze how responsibility for fixing issues was allocated and how
the public and media reacted to the failures. It uses these case studies and
Actor-Network Theory to argue that the current system essentially incentivizes software
companies to put all their eggs in a few baskets (the cloud providers).
The Capstone Project focused on the specific task of extending the Rodinia benchmark
suite to better support AMD hardware. Rodinia’s benchmarks were implemented in CUDA,
meant for Nvidia hardware, and in OpenCL for other hardware. Because OpenCL is designed to be
generic, OpenCL code tends to be slower than specialized code. AMD has developed a set of
tools to help developers port CUDA code to ROCm, its equivalent platform, since it wants to
position itself as an alternative to Nvidia. This has become especially important
now that Nvidia hardware is the default choice for machine learning workloads and
demand for its GPUs outstrips supply. The main outcome of the project was therefore an
evaluation of how effective AMD’s tools are. During the project, 10 benchmarks were successfully
ported to run on AMD hardware. For the majority of the benchmarks, AMD tooling was effective for
porting, requiring only minimal additional manual work. However, some workloads required a
moderate amount of additional effort to run on AMD hardware, since the CUDA
implementations were more specialized to run on Nvidia GPUs. The performance of the AMD
versions also lagged behind the original Nvidia implementations. Overall, AMD’s tooling
works for getting code designed for Nvidia hardware to run on AMD, but getting the
code to run at full performance on AMD requires additional investment, as well as the
ongoing cost of maintaining two separate implementations.
Although the STS Research Paper and Capstone Project topics were quite different, they both look
at issues relating to vendor lock-in versus building systems that are meant to run anywhere. The
research paper focuses on the costs of software companies depending on a single cloud
provider for their compute infrastructure. The capstone project, on the other hand, added
nuance to the question of depending on a single vendor by showing, through a practical example,
that maintaining a system that isn’t tied to one vendor adds significant maintenance cost to a
software project. Overall, working on these two projects gave me insight into building software
and the tradeoffs that engineers need to make.

Degree

BS (Bachelor of Science)

Keywords

CUDA; ROCm; Cloud Computing; Nvidia; AMD; GPGPU; Software Reliability

Notes

School of Engineering and Applied Science

Bachelor of Science in Computer Science

Technical Advisor: Kevin Skadron

STS Advisor: Travis Elliott

Rights

Attribution 4.0 International (CC BY)

Issued Date

2026-05-08

Persistent Link

https://doi.org/10.18130/tkty-ee17

Suggested Citation

Kiani, Shahryar. Porting CUDA Kernels to ROCm; Reliability and Responsibility in the context of Software Engineering. University of Virginia, School of Engineering and Applied Science, BS (Bachelor of Science), 2026-05-08, https://doi.org/10.18130/tkty-ee17.

Files

This item is restricted to UVA until 2027-05-08.