Yuanzhi Li will present his FPO "On the ability of gradient descent to learn neural networks" on Friday, 8/3/2018 at 11am in CS 301. The members of his committee are as follows:Sanjeev Arora (Adviser); Readers: Elad Hazan and Yingyu Liang (University of Wisconsin- Madison); Examiners: Yoram Singer, Matthew Weinberg, Sanjeev Arora All are welcome to attend. A copy of his thesis will be available in CS 310. Abstract is below. Neural networks stand in the center of the recent breakthrough in Machine Leaming. In the past few years, neural networks have found enormous of applications in various areas. However, due to the intrinsic non-convexity in the networks, theoretical understandings of them remain rather limited. The following major questions are largely unsolved: 1. Why can we train a neural network by minimizing a non-convex objective using local search algorithms? 2. Why does the training result generalizes instead of memorizing? On the theory side, the answer to these questions were rather pessimistic: It is known that even training a 2 layer neural network is NP-hard, and modem neural networks have so large capacity that allow them to fit arbitrary labels, so overfitting seems to be unavoidable. However, on the practice side, the answers to these questions are completely different: In practice, gradient descent (and its variants SGD, Momentum, Adagrad, Adam, RMSProp and AdamDelta) are quite effective in minimizing the training loss, moreover, the solutions found by these algorithms also often generalize well, even on neural networks with trillions of parameters. This thesis aims to build new theorems to address these questions in a way that matches the practical observations. We first consider neural networks with "identity mapping' (Resnet). We show that by adding this mapping to the network, the loss function becomes "locally convex" in a neighborhood of the optimal. We then consider neural networks with "quadratic activations". We show that even if we over-parametrize the network arbitrarily, the set of solutions given by gradient descent will remain on a manifold with small intrinsic dimensions, thus they will generalize. Next we consider neural networks with multiple outputs. We show that as long as the inputs and the labels are actually "structured" and the network is trained using gradient descent, the result will still generalize. In the end we extend our result to non-negative matrix factorization. We show that if we initialize the weights using "pure samples", and train it using an analog of gradient descent, then we are able to recover the hidden structures in the data almost optimally.