Group Fairness in Reinforcement Learning and Large Language Models

Author:
Song, Kefan, Computer Science - School of Engineering and Applied Science, University of Virginia
Advisors:
Zhang, Shangtong, EN-Comp Science Dept, University of Virginia
Wei, Chen-Yu, EN-Comp Science Dept, University of Virginia
Meng, Yu, EN-Comp Science Dept, University of Virginia
Abstract:

This thesis addresses an important societal consideration in the application of Reinforcement Learning (RL): the equitable distribution of its benefits across different demographic groups. Specifically, we investigate how to incorporate group fairness into reinforcement learning algorithms to ensure that their societal impact is just and fair. The thesis is organized around two key contributions to group fairness in RL.

The first contribution focuses on multi-task group fairness in reinforcement learning. In many practical applications, such as recommender systems or fine-tuning large language models, a single policy is required to perform multiple tasks in real-world environments. In this thesis, we introduce a multi-task fairness constraint and propose a novel algorithm to solve this problem based on constrained optimization. Through experiments in MuJoCo, we demonstrate that our method better ensures group fairness compared to a previous approach that lacks this multi-task fairness constraint.

The second contribution studies group fairness in the context of fine-tuning large language models (LLMs) through Reinforcement Learning from Human Feedback (RLHF). Current approaches to addressing bias in LLMs largely concentrate on mitigating harmful language and often overlook group fairness considerations. In this work, we emphasize demographic parity, a key group fairness definition that aligns with the broader fair machine learning literature. We identify reward models as a potential source of bias in the RLHF process and propose a novel evaluation method for group fairness in reward models based on arXiv metadata. Our experiments on fine-tuning the Phi-1.5 model further demonstrate that biases in reward models can propagate into the fine-tuned LLMs during RLHF training.

Degree:
MS (Master of Science)
Keywords:
Group Fairness, Reinforcement Learning, Large Language Models, Demographic Parity, RLHF
Language:
English
Issued Date:
2025/03/27