Optimizing Large Language Models for Safety Using RLHF and Activation Steering; Exploring Religious, Cultural, and Societal Influences on Organ Donor Sentiment

Bose, Shinjini

Author

Bose, Shinjini, School of Engineering and Applied Science, University of Virginia

Advisors

Ripley, Karina, EN-Engineering and Society, University of Virginia
Evans, David, EN-Comp Science Dept, University of Virginia

Abstract

My technical project investigates how Reinforcement Learning from Human Feedback (RLHF) and activation steering can be used to fine-tune large language models (LLMs) for safety. The dominant industry approach to improving AI, retraining models from scratch on exponentially larger datasets, is computationally and environmentally costly. Training GPT-4, for instance, reportedly required approximately 25,000 GPUs running for weeks, consuming vast amounts of electricity and of water for cooling. My work proposes a more efficient alternative: applying post-training alignment methods to existing open-source models rather than retraining from scratch.

My project began with OLMo2-SFT, an open-source safety-fine-tuned model, as the intended base for experimentation. However, practical constraints, including authentication barriers with gated model checkpoints and GPU memory incompatibilities on UVA's Rivanna computing cluster, led me to switch to Llama-2-7B, which behaved more reliably out of the box.

RLHF, as implemented in this project, uses the PKU-SafeRLHF preference dataset to train a reward model that scores responses based on human-labeled pairwise preferences, distinguishing chosen (preferable) responses from rejected ones. It additionally uses a cost model that penalizes safety violations. Both models then feed a Proximal Policy Optimization (PPO) training loop that updates the model's behavior in a resource-efficient way. Running this pipeline on Rivanna's V100 GPUs required several adaptations: switching from bf16 to fp16 precision (V100s have no native bf16 support), enabling ZeRO Stage 2 memory optimization, offloading optimizer states to CPU, and reducing sequence lengths and gradient accumulation steps. Direct Preference Optimization (DPO), a simpler alternative that needs no separate reward model, was also explored during earlier stages of the project with OLMo2. Its results hovered around 50% accuracy, which prompted further investigation into evaluation methodology, including refusal rate tracking and log-probability margin analysis.
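The pairwise scoring described above can be illustrated with a small sketch. It assumes the commonly used Bradley-Terry style objective, in which the loss for a pair is -log(sigmoid(chosen score - rejected score)); the abstract does not state which loss the project used. The function names and toy scores are hypothetical and do not come from the project. The sketch also computes a pairwise accuracy and a mean margin, the kind of quantities that margin analysis inspects. If the roughly 50% figure is a pairwise preference accuracy (an assumption here), it would correspond to chance level.

```python
import math


def pairwise_preference_loss(chosen_score, rejected_score):
    """Bradley-Terry style loss for one pair: -log(sigmoid(chosen - rejected)).

    Assumed formulation for illustration; evaluated as softplus(-margin)
    in a numerically stable way.
    """
    margin = chosen_score - rejected_score
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def preference_accuracy(pairs):
    """Fraction of pairs in which the chosen response scores strictly higher."""
    if not pairs:
        raise ValueError("need at least one (chosen, rejected) pair")
    return sum(1 for chosen, rejected in pairs if chosen > rejected) / len(pairs)


def mean_margin(pairs):
    """Average of (chosen score - rejected score) over all pairs."""
    if not pairs:
        raise ValueError("need at least one (chosen, rejected) pair")
    return sum(chosen - rejected for chosen, rejected in pairs) / len(pairs)


# Toy (chosen, rejected) scores, invented purely for illustration.
toy_pairs = [(2.0, 0.5), (0.1, 0.4), (1.0, 1.0), (3.0, -1.0)]

toy_losses = [pairwise_preference_loss(c, r) for c, r in toy_pairs]
toy_accuracy = preference_accuracy(toy_pairs)
toy_margin = mean_margin(toy_pairs)
```

A tied pair contributes a loss of log 2 and counts as incorrect under the strict comparison, so accuracy and mean margin can disagree: a model can sit near 50% accuracy while its margins still lean positive or negative. This is one reason to track both.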

Activation steering complements this approach. Rather than adjusting model weights, it extracts steering vectors: directions in the model's hidden-state space that correspond to specific behaviors such as refusal or censorship. These vectors are obtained by contrasting the mean activations produced by harmful versus benign prompts at each layer. The harmful and benign prompt categories were derived from the model's own refusal behavior, which makes the approach self-contained and avoids the need for additional human intervention. Layerwise analysis of these vectors serves as a tool for auditing how safety alignment is actually encoded inside the model, and how that encoding shifts across checkpoints as tuning progresses.
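The mean-contrast step can be sketched as follows. This is a minimal illustration, not the project's code. It assumes one hidden-state vector per prompt has already been collected at each layer (for example, from a forward pass of the model), and it uses plain Python lists in place of tensors. All function names and the toy activations are hypothetical.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one activation vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all activation vectors must share one dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(harmful_acts, benign_acts):
    """Difference of means: mean(harmful activations) - mean(benign activations)."""
    harmful_mean = mean_vector(harmful_acts)
    benign_mean = mean_vector(benign_acts)
    return [h - b for h, b in zip(harmful_mean, benign_mean)]


def layerwise_steering(harmful_by_layer, benign_by_layer):
    """Compute one steering vector per layer index."""
    return {
        layer: steering_vector(harmful_by_layer[layer], benign_by_layer[layer])
        for layer in sorted(harmful_by_layer)
    }


def l2_norm(vector):
    """Euclidean length, a simple per-layer summary of the vector's size."""
    return sqrt(sum(x * x for x in vector))


# Toy 2-dimensional activations for two layers, invented for illustration.
harmful = {0: [[1.0, 0.0], [3.0, 0.0]], 1: [[0.0, 4.0], [0.0, 6.0]]}
benign = {0: [[1.0, 0.0], [1.0, 0.0]], 1: [[0.0, 1.0], [0.0, 1.0]]}

vectors = layerwise_steering(harmful, benign)
norms = {layer: l2_norm(v) for layer, v in vectors.items()}
```

Comparing per-layer norms, or the vectors themselves, across checkpoints is one simple way to carry out the layerwise audit described above. The choice of norm as the summary statistic is an assumption of this sketch.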

My STS research paper investigates the seemingly simple question "Why don't more people register as organ donors?" by posing a deeper one: "How do cultural and societal norms and practices impact individual sentiment regarding organ donation?" Through a qualitative literature review, I found that formal religious doctrine broadly supports organ donation, yet cultural interpretations of those beliefs, mistrust of medical institutions, and widespread misunderstanding of brain death collectively suppress donor registration. Using the Theory of Planned Behavior, I explain that individual decision-making is driven by one's attitude toward a behavior, perceived behavioral control, and the subjective norms surrounding the person. I argue that positive individual attitudes toward donation are frequently overridden by subjective norms, particularly family expectations and community pressure fueled by institutional mistrust, and that organ donation is therefore better understood as a socially negotiated decision than as an individual one.

In examining how religious outlooks on organ donation interact with cultural practices, I found an interesting discrepancy. Though most major world religions regard organ donation as an altruistic, life-saving act, most communities do not follow religion exactly according to doctrine and instead rely on community-level interpretations of the religion in daily practice. It is these culturally adapted practices that often promote the idea of preserving the integrity of the body after death and thus dissuade individuals from allowing the post-mortem harvesting of their organs.

Both projects challenge the assumption that scaling or broadcasting is the right solution to a resource problem. My technical work argues that retraining ever-larger models is neither the only nor the best path to AI improvement. My STS work argues that broad awareness campaigns are insufficient to move the needle on donor registration without culturally tailored engagement. In each case, I argue for a more sustainable and effective approach in place of general, overarching, resource-intensive solutions. Together, the projects show that responsible and effective solutions require literacy in both the technical and the social dimensions of a problem.

Degree

BS (Bachelor of Science)

Keywords

organ donation; organ donor sentiment; activation steering; RLHF; organ donor attitudes

Notes

School of Engineering and Applied Science

Bachelor of Science in Computer Science

Technical Advisor: David Evans

STS Advisor: Karina Ripley

Rights

Attribution 4.0 International (CC BY)

Issued Date

2026-05-05

Persistent Link

https://doi.org/10.18130/5axf-6933

Suggested Citation

Bose, Shinjini. Optimizing Large Language Models for Safety Using RLHF and Activation Steering; Exploring Religious, Cultural, and Societal Influences on Organ Donor Sentiment. University of Virginia, School of Engineering and Applied Science, BS (Bachelor of Science), 2026-05-05, https://doi.org/10.18130/5axf-6933.

Files

This item is restricted to UVA until 2026-11-05.