Abstract
Reinforcement learning (RL) provides a general framework for sequential decision-making, in which an agent learns a policy to maximize long-term reward through interaction with an unknown environment. RL has achieved striking empirical success in domains such as game playing, robotics, and recommendation, and a rich theory now explains when efficient algorithms can attain near-optimal regret or sample complexity in idealized settings. However,
practical deployments must also contend with additional requirements: agents may be required to respect safety or performance constraints during learning, to exploit large offline datasets
collected under prior behavior policies, and to optimize with respect to multiple, potentially conflicting metrics rather than a single reward. For these constrained, hybrid, and multiobjective settings, the theoretical foundations for designing algorithms that are both safe (or otherwise well-behaved) and provably efficient are still incomplete.
This dissertation advances a unified agenda of provably safe and efficient reinforcement learning and its applications along three complementary directions. First, it develops conservative exploration algorithms for tabular episodic Markov decision processes that start from a given safe baseline policy and adaptively mix between the baseline and optimistic policies, guaranteeing episode-wise performance constraints with high probability while retaining
regret rates comparable to constraint-free RL, and further extends these guarantees to the setting where the baseline itself is learned from an offline dataset under suitable coverage conditions. Second, it proposes a hybrid RL framework that augments confidence-based online algorithms with offline data through concentrability coefficients that quantify the mismatch between the behavior and target policies, yielding tight upper and lower bounds for both
sub-optimality gap and regret and revealing how offline samples can act as effective additional episodes when their coverage is appropriate. Third, it brings these ideas to a concrete application by formulating multi-objective prompt selection for large language models as a pure-exploration bandit problem, adapting and simplifying multi-objective bandit algorithms to recover Pareto-optimal prompt sets and to identify best feasible prompts under constraints,
and providing finite-sample guarantees in structured (linear) settings alongside empirical improvements over heuristic baselines.
Together, these contributions provide theory-driven yet practically motivated tools for deploying RL and bandit methods in settings with safety constraints, offline data, and multi-objective performance criteria. The results show that stringent episode-wise safety can coexist with near-optimal regret, that offline logs can be systematically converted into sample-efficiency gains when their coverage is properly quantified, and that even relatively simple bandit models can support principled, multi-metric prompt optimization for modern language models.