Efficient and Robust Policy Evaluation for Reinforcement Learning

Author:
Liu, Shuze Daniel, Computer Science - School of Engineering and Applied Science, University of Virginia (ORCID: orcid.org/0009-0002-6557-979X)
Advisor:
Zhang, Shangtong, EN-Comp Science Dept, University of Virginia
Abstract:

Evaluating the quality of a policy (i.e., a decision-making rule that an agent adopts to interact with the environment) is central to reinforcement learning (RL). The conventional approach requires repeatedly executing the policy and averaging its outcomes. However, due to high evaluation variance, this method demands massive amounts of active interaction with the environment to obtain data, which is both expensive and slow. At the same time, the stability of many RL algorithms remains an open question, as existing analyses often rely on assumptions that fail to hold in practice. This thesis addresses both challenges by proposing algorithms that enhance the efficiency and robustness of policy evaluation, and by providing a stability analysis of RL algorithms under realistic conditions.
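To make the conventional approach concrete, the following is a minimal sketch of Monte Carlo (on-policy) evaluation: run the policy many times and average the observed returns. The environment interface and the policy function are hypothetical placeholders, not artifacts of the thesis; the point is only that the estimate's standard error shrinks slowly with the number of episodes.

```python
import numpy as np

def monte_carlo_evaluate(env, policy, num_episodes, gamma=0.99):
    """Estimate a policy's value by executing it repeatedly and averaging returns.

    `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done);
    `policy(state)` is assumed to return an action. Both are illustrative placeholders.
    """
    returns = []
    for _ in range(num_episodes):
        state, done = env.reset(), False
        g, discount = 0.0, 1.0
        while not done:
            action = policy(state)
            state, reward, done = env.step(action)
            g += discount * reward
            discount *= gamma
        returns.append(g)
    returns = np.asarray(returns)
    # The estimate is a sample mean, so its standard error decays only as
    # O(1/sqrt(num_episodes)); high return variance therefore translates
    # directly into many costly environment interactions.
    return returns.mean(), returns.std(ddof=1) / np.sqrt(len(returns))
```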

The first part of the thesis focuses on algorithmic innovations that reduce the amount of data needed for accurate evaluation. We begin by introducing an optimal data-collecting policy that significantly lowers evaluation variance compared to traditional approaches. In settings where multiple policies must be evaluated at once, we propose a shared data-collecting policy that reduces evaluation variance across all policies of interest simultaneously. Taking a step further, we develop a doubly optimal policy evaluation method that optimizes both the data collection and data processing stages, achieving state-of-the-art variance reduction. To enhance robustness and safety, we also design methods that account for uncertainty in the environment and enforce safety constraints during data collection. Together, these contributions offer a framework for reliable and sample-efficient policy evaluation.
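The general mechanism behind using a separate data-collecting (behavior) policy is off-policy evaluation with importance-sampling corrections; a minimal sketch is given below. The policy interface is an illustrative assumption, and the sketch uses an arbitrary behavior policy rather than the optimal or shared data-collecting policies developed in the thesis, which are designed precisely to minimize the variance of this kind of estimator.

```python
import numpy as np

def off_policy_evaluate(env, behavior_policy, target_policy, num_episodes, gamma=0.99):
    """Evaluate `target_policy` using trajectories collected by `behavior_policy`.

    Both policies are assumed to expose sample(state) and prob(state, action);
    this interface and the environment are illustrative placeholders.
    """
    estimates = []
    for _ in range(num_episodes):
        state, done = env.reset(), False
        g, discount, rho = 0.0, 1.0, 1.0
        while not done:
            action = behavior_policy.sample(state)
            # Importance-sampling ratio corrects for collecting data under a
            # policy different from the one being evaluated.
            rho *= target_policy.prob(state, action) / behavior_policy.prob(state, action)
            state, reward, done = env.step(action)
            g += discount * reward
            discount *= gamma
        estimates.append(rho * g)
    # Unbiased for the target policy's value, but its variance depends heavily
    # on the choice of behavior policy -- the quantity the thesis optimizes.
    return float(np.mean(estimates))
```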

In the second part of the thesis, we focus on the theoretical analysis of RL algorithms. We study the stability of stochastic approximation methods (i.e., iterative methods that update estimates using noisy observations of a target quantity) under Markovian noise (i.e., when the randomness in updates comes from data generated by a Markov process, so that samples are correlated rather than independent). Our framework provides almost sure convergence guarantees under practical conditions and applies to a broader class of RL algorithms than previous results.
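As one concrete instance of the object studied in this part, linear TD(0) below is a stochastic approximation driven by Markovian noise: each update uses the next transition of a single ongoing trajectory, so consecutive updates are correlated through the underlying Markov chain rather than being independent draws. The feature map and environment interface are illustrative assumptions, not the thesis's framework itself.

```python
import numpy as np

def td0_stochastic_approximation(env, policy, features, num_features, num_steps,
                                 alpha=0.05, gamma=0.99):
    """Linear TD(0) as stochastic approximation under Markovian noise.

    Each iterate applies w <- w + alpha * delta * phi(s), where delta is a noisy
    observation of how far w is from its fixed point and the noise comes from
    the Markov chain induced by following `policy`. `features(state)` and the
    environment interface are illustrative placeholders.
    """
    w = np.zeros(num_features)
    state = env.reset()
    for _ in range(num_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        phi, phi_next = features(state), features(next_state)
        # Temporal-difference error computed from a single correlated transition.
        delta = reward + (0.0 if done else gamma * phi_next @ w) - phi @ w
        w += alpha * delta * phi
        state = env.reset() if done else next_state
    return w
```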

Collectively, this thesis advances both the algorithmic and theoretical foundations of policy evaluation, offering tools that are efficient, reliable, and applicable to real-world RL systems.

Degree:
PHD (Doctor of Philosophy)
Keywords:
reinforcement learning, off-policy evaluation, machine learning
Language:
English
Issued Date:
2025/04/22