Yuanzhi Li will present his FPO "On the ability of gradient descent to learn neural networks" on Friday, 8/3/2018 at 11am in CS 301
The members of his committee are as follows:
Adviser: Sanjeev Arora
Readers: Elad Hazan and Yingyu Liang (University of Wisconsin-Madison)
Examiners: Yoram Singer, Matthew Weinberg, Sanjeev Arora

All are welcome to attend. A copy of his thesis will be available in CS 310. The abstract is below.

Abstract:

Neural networks stand at the center of the recent breakthroughs in machine learning. In the past few years, neural networks have found an enormous number of applications in various areas. However, due to the intrinsic non-convexity of these networks, our theoretical understanding of them remains rather limited. The following major questions are largely unsolved:

1. Why can we train a neural network by minimizing a non-convex objective using local search algorithms?
2. Why does the training result generalize instead of memorize?

On the theory side, the answers to these questions have been rather pessimistic: it is known that even training a 2-layer neural network is NP-hard, and modern neural networks have such large capacity that they can fit arbitrary labels, so overfitting seems unavoidable. In practice, however, the answers are completely different: gradient descent (and its variants SGD, Momentum, Adagrad, Adam, RMSProp, and AdaDelta) is quite effective at minimizing the training loss, and moreover the solutions found by these algorithms often generalize well, even on neural networks with trillions of parameters.

This thesis aims to build new theorems that address these questions in a way that matches the practical observations. We first consider neural networks with "identity mappings" (ResNet). We show that adding this mapping to the network makes the loss function "locally convex" in a neighborhood of the optimum. We then consider neural networks with "quadratic activations". We show that even if we over-parametrize the network arbitrarily, the set of solutions given by gradient descent remains on a manifold with small intrinsic dimension, so they generalize. Next we consider neural networks with multiple outputs. We show that as long as the inputs and the labels are "structured" and the network is trained using gradient descent, the result still generalizes. Finally, we extend our results to non-negative matrix factorization. We show that if we initialize the weights using "pure samples" and train with an analog of gradient descent, we can recover the hidden structures in the data almost optimally.
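For readers unfamiliar with the objects named in the abstract, the following is a minimal illustrative sketch (not the thesis's construction): plain gradient descent on a tiny one-hidden-layer network with an identity ("residual") connection, the kind of ResNet-style mapping and non-convex training objective the abstract refers to. All data, dimensions, and hyperparameters here are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: the target map is close to the identity.
n, d = 200, 10
X = rng.normal(size=(n, d))
Y = X + 0.1 * (X @ rng.normal(size=(d, d)))

# Residual model: f(x) = x + W2 * relu(W1 * x), i.e. an identity skip
# connection plus a learned correction (ResNet-style block).
h = 32
W1 = rng.normal(size=(d, h)) * 0.1
W2 = rng.normal(size=(h, d)) * 0.1

def forward(X, W1, W2):
    Z = X @ W1
    A = np.maximum(Z, 0.0)      # ReLU activation
    return X + A @ W2, Z, A     # identity mapping added to the output

lr = 1e-2
for step in range(500):
    pred, Z, A = forward(X, W1, W2)
    R = pred - Y                              # residual of the squared loss
    loss = 0.5 * np.mean(np.sum(R**2, axis=1))

    # Backpropagation through the two weight matrices.
    G_A = R @ W2.T          # gradient w.r.t. hidden activations
    G_Z = G_A * (Z > 0)     # ReLU derivative
    grad_W2 = A.T @ R / n
    grad_W1 = X.T @ G_Z / n

    # Plain gradient descent step; SGD, Momentum, Adam, etc. would
    # replace this update rule.
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print(f"final training loss: {loss:.4f}")

Running this converges to a small training loss on the synthetic data; the thesis studies when and why such local-search training provably succeeds and generalizes, which this toy example does not demonstrate.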