Yuanzhi Li will present his FPO "On the ability of gradient descent to learn neural networks" on Friday, 8/3/2018 at 11am in CS 301.
The members of his committee are as follows:Sanjeev Arora (Adviser); Readers: Elad Hazan and Yingyu Liang (University of Wisconsin- Madison); Examiners: Yoram Singer, Matthew Weinberg, Sanjeev Arora
All are welcome to attend. A copy of his thesis will be available in CS 310.
Abstract is below.
Neural networks stand in the center of the recent breakthrough in Machine Leaming. In
the past few years, neural networks have found enormous of applications in various areas.
However, due to the intrinsic non-convexity in the networks, theoretical understandings of
them remain rather limited. The following major questions are largely unsolved:
1. Why can we train a neural network by minimizing a non-convex objective using local
search algorithms?
2. Why does the training result generalizes instead of memorizing?
On the theory side, the answer to these questions were rather pessimistic: It is known
that even training a 2 layer neural network is NP-hard, and modem neural networks have so
large capacity that allow them to fit arbitrary labels, so overfitting seems to be unavoidable.
However, on the practice side, the answers to these questions are completely different:
In practice, gradient descent (and its variants SGD, Momentum, Adagrad, Adam, RMSProp
and AdamDelta) are quite effective in minimizing the training loss, moreover, the solutions
found by these algorithms also often generalize well, even on neural networks with trillions
of parameters.
This thesis aims to build new theorems to address these questions in a way that matches
the practical observations.
We first consider neural networks with "identity mapping' (Resnet). We show that
by adding this mapping to the network, the loss function becomes "locally convex" in a
neighborhood of the optimal.
We then consider neural networks with "quadratic activations". We show that even if we
over-parametrize the network arbitrarily, the set of solutions given by gradient descent will
remain on a manifold with small intrinsic dimensions, thus they will generalize.
Next we consider neural networks with multiple outputs. We show that as long as the inputs
and the labels are actually "structured" and the network is trained using gradient descent, the result
will still generalize.
In the end we extend our result to non-negative matrix factorization. We show that
if we initialize the weights using "pure samples", and train it using an analog of gradient
descent, then we are able to recover the hidden structures in the data almost optimally.