Abstract:
The combination of deep neural networks with the algorithms and
formalisms of reinforcement learning (RL) shows great promise for
solving otherwise intractable learning tasks. However, practical
applications of deep reinforcement learning remain scarce. The
outstanding challenges of deep RL can be grouped into two broad
categories: "What to learn from experiences?" and "What experiences
to learn from?" In this thesis, I will describe my work addressing
the second category, which is historically the less-studied of the
two. Specifically, I address problems of sampling: how to find
actions, states, or trajectories that contain sufficient
information for learning a
task. I examine this challenge at three levels of algorithm design
and task complexity, from algorithmic submodules, to combinations
of algorithms, to hybrid algorithms that violate common RL
conventions.
First, I will present my work on stable and efficient sampling of
actions that optimize a Q-function over continuous-valued actions.
By combining a sample-based optimizer with neural network
approximation, it is possible to obtain stability in training as
well as computational efficiency and precision at inference.
Second, I will present my work on reward-aware exploration: the
discovery of highly rewarding states in tasks where commonly used
sampling methods are insufficient. A teacher "exploration" agent
can discover states and trajectories that maximize how much a
student "exploitation" agent learns from those experiences,
enabling the student agent to solve hard tasks that it could not
otherwise solve.
Third, I will present my work combining reinforcement learning
with heuristic search, for use in task domains where the
transition model is known, but where the combinatorics of the
state space are intractable for traditional search algorithms. I
show that by combining deep Q-learning with a best-first tree
search algorithm, it is possible to find solutions to simple
program synthesis problems with dramatically fewer samples than
common search algorithms require.
Lastly, I will summarize the major takeaways of this work and
discuss potential extensions and future directions for the
efficient sampling of useful experiences in RL.