Abstract
My technical project investigates how Reinforcement Learning from Human Feedback (RLHF) and activation steering can be used to fine-tune large language models (LLMs). The dominant industry approach to improving models, retraining from scratch on ever-larger datasets, is computationally and environmentally costly. Training GPT-4, for instance, reportedly required approximately 25,000 GPUs running for weeks, consuming vast amounts of electricity and of water for cooling. My work proposes a more efficient alternative: applying post-training alignment methods to existing open-source models rather than retraining from scratch.
My project began with OLMo2-SFT, an open-source supervised fine-tuned model, as the intended base for experimentation. However, practical constraints, namely authentication barriers with gated model checkpoints and GPU memory incompatibilities on UVA's Rivanna computing cluster, led me to transition to Llama-2-7B, which offered more reliable behavior out of the box.
RLHF, as implemented in this project, uses the PKU-SafeRLHF preference dataset to train a reward model that scores responses based on human-labeled pairwise preferences, distinguishing chosen (preferable) responses from rejected ones. It additionally uses a cost model that penalizes safety violations. Both models are then used in a Proximal Policy Optimization (PPO) training loop to update the model's behavior in a resource-efficient way. To run this pipeline on Rivanna's V100 GPU architecture, several adaptations were necessary, including switching from bf16 to fp16 precision, enabling ZeRO Stage 2 memory optimization, offloading optimizer states to CPU, and reducing sequence lengths and gradient accumulation steps. Direct Preference Optimization (DPO) was also explored during earlier stages of the project with OLMo2 as a simpler alternative that needs no separate reward model. Its results hovered around 50% accuracy, which prompted further investigation into evaluation methodology, including refusal rate tracking and log-probability margin analysis.
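The pairwise preference objective described above can be illustrated with a minimal sketch. It assumes a Bradley-Terry style loss, which is a common choice for reward models but is not confirmed here as the exact loss used in this project. The function names `pairwise_loss` and `pairwise_accuracy` are hypothetical, and the scores are plain floats standing in for model outputs.

```python
import math


def pairwise_loss(score_chosen, score_rejected):
    """Assumed Bradley-Terry loss: -log(sigmoid(score_chosen - score_rejected)).

    Written as a numerically stable softplus of the negative margin.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(score_pairs):
    """Fraction of (chosen, rejected) pairs where the chosen score is higher."""
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)
```

Under this sketch, a model that ranks the chosen response higher in only half of the pairs scores 0.5, which is what random guessing would give on a two-way comparison. This is one reason a result near 50% calls for closer inspection of the margins between chosen and rejected scores.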
Activation steering complements this approach. Rather than adjusting model weights, it extracts steering vectors, which are directions in the model's hidden-state space that correspond to specific behaviors such as refusal or censorship. These vectors are obtained by contrasting the mean activations produced by harmful versus benign prompts at each layer. Harmful and benign prompt categories were derived from the model's own refusal behavior, making the approach self-contained and avoiding the need for additional human intervention. Layerwise analysis of these vectors serves as a tool for auditing how safety alignment is actually encoded inside the model, and how that encoding shifts across checkpoints as tuning progresses.
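The mean-contrast step can be sketched as follows. This is an illustrative sketch only: it assumes activations have already been collected as one vector per prompt per layer, and the names `steering_vectors`, `mean_vector`, and `l2_norm` are hypothetical rather than taken from the project code.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_vectors(harmful_acts, benign_acts):
    """Per-layer difference of mean activations (harmful minus benign).

    Both arguments map a layer index to a list of activation vectors,
    one vector per prompt (an assumed data layout).
    """
    vectors = {}
    for layer in harmful_acts:
        mean_harmful = mean_vector(harmful_acts[layer])
        mean_benign = mean_vector(benign_acts[layer])
        vectors[layer] = [h - b for h, b in zip(mean_harmful, mean_benign)]
    return vectors


def l2_norm(vector):
    """Euclidean length, usable for comparing vector magnitude across layers."""
    return math.sqrt(sum(x * x for x in vector))
```

Comparing `l2_norm` of each layer's vector is one simple way to see where in the network the harmful and benign activations diverge most, though other layerwise measures are equally possible.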
My STS research paper approaches the seemingly simple question “Why don't more people register as organ donors?” through a deeper one: “How do cultural and societal norms and practices impact individual sentiment regarding organ donation?” Through a qualitative literature review, I found that formal religious doctrine broadly supports organ donation, yet cultural interpretations of those beliefs, mistrust of medical institutions, and widespread misunderstanding of brain death collectively suppress donor registration. I frame this analysis with the Theory of Planned Behavior, which holds that individual decision-making is driven by a person's attitude toward a behavior, their perceived behavioral control, and the subjective norms surrounding them. I argue that positive individual attitudes toward donation are frequently overridden by subjective norms, particularly family expectations and community pressure fueled by institutional mistrust, and that organ donation is therefore better understood as a socially negotiated decision than an individual one.
In examining how religious outlooks on organ donation interact with cultural practices, I found a notable discrepancy. Though most major world religions regard organ donation as an altruistic, life-saving act, most communities do not follow religion exactly according to doctrine and instead rely on community-level interpretations for daily practice. It is these culturally adapted practices that often promote the idea of preserving the integrity of the body after death and thus dissuade individuals from allowing the post-mortem recovery of their organs.
Both projects challenge the assumption that scaling or broadcasting is the right solution to a resource problem. My technical work argues that retraining ever-larger models is neither the only nor the best path to AI improvement. My STS work argues that broad awareness campaigns are insufficient to move the needle on donor registration without culturally tailored engagement. In each case, I argue for a more targeted, sustainable approach over general, resource-intensive solutions. Together, the projects show that responsible and effective solutions require literacy in both the technical and the social dimensions of a problem.