Sobhan Mir Yoosefi will present his FPO "Provable Reinforcement Learning with Constraints and Function Approximation" on Monday, 5/2/2022 at 3pm via Zoom.

Zoom Link: https://princeton.zoom.us/j/5269962765

Committee Members: 
Examiners: Chi Jin (adviser), Matthew Weinberg, Karthik Narasimhan
Readers: Elad Hazan, Jason Lee  

All are welcome to attend.

Abstract:
Reinforcement learning (RL) has gained a surge of interest over the past years, fueled mainly by practical success and new applications in various domains. However, there is still a gap between our theoretical understanding of these RL techniques and their empirical success. In this thesis, we advance our understanding by studying reinforcement learning from a primarily theoretical point of view and designing provably efficient algorithms for two challenging settings of 1) reinforcement learning with constraints and 2) reinforcement learning with function approximation.

1) In standard reinforcement learning (RL), a learning agent seeks to optimize the overall reward. However, many key aspects of the desired behavior are more naturally expressed as constraints. For instance, the designer may want to limit the use of unsafe actions, increase the diversity of trajectories to enable exploration, or approximate expert trajectories when rewards are sparse. First, we propose an algorithmic scheme that can handle RL tasks with general convex constraints improving upon prior works that are either limited to linear constraints or lack theoretical guarantees. Second, focusing on sample-efficient exploration, we develop the first provably efficient algorithm for tabular episodic constrained RL with the ability to handle convex constraints as well as the knapsack setting. Finally, motivated by recent advances in reward-free RL, we propose a simple meta-algorithm such that given any reward-free RL oracle, the constrained RL problems can be directly solved with negligible overheads in sample complexity.

2) Finding the minimal structural assumptions that empower sample-efficient learning is one of RL's most important research directions. This thesis advances our understanding of this fundamental question by introducing a new complexity measure---Bellman Eluder (BE) dimension. We show that the family of RL problems of low BE dimension is remarkably rich, which subsumes a vast majority of existing tractable RL problems, including but not limited to tabular MDPs, linear MDPs, reactive POMDPs, low Bellman rank problems as well as low Eluder dimension problems. We further design a new optimization-based algorithm—GOLF, and reanalyze a hypothesis elimination-based algorithm—OLIVE (proposed in Jiang et al., 2017) . We prove that both algorithms learn the near-optimal policies of low BE dimension problems in a number of samples that is polynomial in all relevant parameters but independent of the size of state-action space. Our regret and sample complexity results match or improve the best existing results for several well-known subclasses of low BE dimension problems. Furthermore, moving towards a more challenging setting of partially observable RL, we study a new subclass of Partially Observable Markov Decision Processes (POMDPs). More specifically, motivated by the problem structure in several physical applications and a commonly used technique known as "frame stacking'', this work proposes to study a new subclass of POMDPs, whose latent states can be decoded by the most recent history of a short length m. We establish a set of upper and lower bounds on the sample complexity for learning near-optimal policies for this class of problems in both tabular and rich-observation settings (where the number of observations is enormous). Our results show that a short-term memory suffices for reinforcement learning in these environments.