CS Talk - Yuanzhi Li, CMU

Event time: 
Friday, November 6, 2020 - 11:00am
Zoom Presentation See map
Event description: 

CS Talk
Yuanzhi Li, CMU

Title: On the power of over-parameterization in neural networks – Going beyond NTKs

Zoom link: https://yale.zoom.us/j/97035602733

Host: Nisheeth Vishnoi


Over-parameterization is one of the most widely used tools in deep learning: It has been observed that larger neural networks (e.g. networks with much more parameters than the total number of training data) are often much easier to train and also generalizes better comparing to smaller ones. Prior works try to explain this phenomenon in the language of neural tangent kernels (NTKs), which shows the power of the over-parameterized neural network to learn functions that are learnable by kernel methods. In this work, we go beyond the theory of NTKs and consider the power of over-parameterization in ReLU neural networks when the target function is (provably) not learnable by kernel methods.

In this work, we look at the most simple example in this regime, where the concept class consists of (proper-parameterized) two-layer ReLU neural networks and the input follows from standard Gaussian distribution. In this case, to learn functions from this concept class, we prove (1). Any kernel methods (including those NTKs given by any neural networks of any structure) can not achieve small generalization error unless super polynomially many training examples are used. (2). (Poly-size) Over-parameterized two-layer ReLU neural networks can learn the target efficiently (using poly-sample and in poly-time), simply using (stochastic) gradient descent starting from random initialization.

Our work relies on a brand new framework for analyzing over-parameterization in deep learning: We first show that in the ideal case where there are infinitely many neurons and infinitely many training examples, the training objective reduces to a mixture of PCA + tensor decomposition objectives. Although being highly non-convex, we show that starting from random initialization, gradient descent can still optimize this objective efficiently. We then introduce an efficient coupling between the idealized process and the poly-sample, poly-neuron case, showing the convergence of the latter.

Paper: https://arxiv.org/abs/2007.04596


Yuanzhi Li is an assistant professor at CMU, Machine Learning Department (2019-present). He did his Ph.D. at Princeton, under the advice of Sanjeev Arora (2014-2018) followed by a one-year postdoc at Stanford.