Online Archive of University of Virginia Scholarship
Reinforcement Learning in Combinatorial Action Spaces3 views
Author
Landers, Matthew, Computer Science - School of Engineering and Applied Science, University of Virginia
Advisors
Doryab, Afsaneh, EN-SIE, University of Virginia
Abstract
Reinforcement learning (RL) is a general framework for sequential decision-making in which an agent learns, through trial and error, to select actions that maximize long-term cumulative return. Existing methods have been highly successful in domains with small discrete action sets and in continuous control, but a broad class of real-world problems falls outside both regimes. In these settings, the agent must simultaneously select multiple discrete sub-actions that together form a single joint action, giving rise to action spaces that grow exponentially in the number of sub-action dimensions. Standard approaches either model the full joint action space with flat categorical policies, which become intractable as the space grows, or impose structural priors that restore tractability but limit the agent's capacity to capture sub-action interactions.
This dissertation develops three complementary methods for scalable and expressive control in combinatorial action spaces, each exploiting action structure through a distinct mechanism. Branch Value Estimation (BraVE) is a value-based offline RL method that imposes a tree structure over the joint action space, reducing the number of candidates evaluated at each time step from exponential to linear while retaining a value function over complete joint actions. The Sub-Action Interaction Network using Transformers (SAINT) is a policy architecture that uses state-conditioned self-attention to learn contextualized representations of sub-actions, which are then decoded in parallel to enable linear-time sampling while still capturing dependencies among sub-actions. Structured Policy INitialization (SPIN) decouples representation learning from control by pre-training a Transformer encoder with a masked action modeling objective to capture the manifold of coherent actions from offline data, then freezing this encoder and training lightweight policy heads on the induced action space. Collectively, these methods establish that the tractability--expressivity tradeoff limiting existing RL approaches in combinatorial action spaces can be resolved by learning and exploiting action structure.
Landers, Matthew. Reinforcement Learning in Combinatorial Action Spaces. University of Virginia, Computer Science - School of Engineering and Applied Science, PHD (Doctor of Philosophy), 2026-04-03, https://doi.org/10.18130/8239-rd60.