Zhiyuan Li will present his General Exam "A trajectory analysis for gradient descent on over-paramterized neural nets" on May 16, 2019 at 2pm in CS 301.
The members of his committee are as follows: Sanjeev Arora (adviser), Elad Hazan, and Ryan Adams
Everyone is invited to attend his talk, and those faculty wishing to remain for the oral exam following are welcome to do so. His abstract and reading list follow below.
Abstract: Rademacher complexity measures the ability of a funciton class to fit random labels, and thus upper bounds the gap between training loss and testing loss. Traditional wisdom thinks that optimization over a function class with high rademacher complexity, such as a over-parametrized neurel net, will induce overfitting, and thus harm the generalization. However, recent works showed empirically that over-parametrization will not degenerate the test accuracy, and that the trajectory of the parameters by (stochastic) gradient descent seem to have much better properties than the whole parameter space, which motivates people to study the trajectory determined by the algorithm.
In this talk, we will use trajectory analysis to show gradient descent over an over-parametrized two-layer net with relu activation has some kernel-like limiting behaviour. Informally, when the net is sufficiently wide, gd always converge to 0 training loss while the l2 distance moved by parameters from initialization remains bounded by some RKHS norm of the training labels, indepedent of width, and the trained network make the almost same predictions on unseen data to the kernel regression solution with the so-called Neural Tangent Kernel(NTK). We also prove a generalization bound for this over-parametrized 2 layer neural nets with relu activation which only relies on the l2 distance moved by the parameter. Thus we conclude that the generalization gap is bounded by the data-dependent RKHS norm, and this explains why adding noise to labels will suffer larger generalization gap while using the same network architecture. Our recent work also shows that this kernel-type limiting behavior holds for multilayer overparametrized neural net.
This talk is based on a joint work with Sanjeev Arora(Princeton University & IAS), Simon S. Du(CMU), Wei Hu(Princeton University) and Ruosong Wang(CMU), “Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks”, and another joint work with Sanjeev Arora(Princeton University & IAS), Simon S. Du(CMU), Wei Hu(Princeton University), Ruslan Salakhutdinov(CMU) and Ruosong Wang(CMU), “On Exact Computation with an Infinitely Wide Neural Net”.
References:
1. Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
2. Jacot, Arthur, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems. 2018.
3. Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in neural information processing systems, pages 342–350, 2009.
4. Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.
5. Christopher KI Williams. Computing with infinite networks. In Advances in neural information processing systems, pages 295–301, 1997.
6. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations (ICLR), 2017, 2017.
7. Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, Nov 2002.
8. Neyshabur, Behnam, Srinadh Bhojanapalli, and Nathan Srebro. "A pac-bayesian approach to spectrally-normalized margin bounds for neural networks." arXiv preprint arXiv:1707.09564 (2017).
9. Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image
Recognition. arXiv Prepr. arXiv1409.1556, Sep 2014.
10. Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over- parameterization. arXiv preprint arXiv:1811.03962, 2018b.
11. Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.