CVPR 2017 Tutorial on the Mathematics of Deep Learning
Friday July 21st, 2017, 9:00 AM - 12:00 PM
Honolulu, HI, USA
The past five years have seen a dramatic increase in the performance of recognition systems due to the introduction of deep architectures for feature learning and classification. However, the mathematical reasons for this success remain elusive. This tutorial will review recent work that aims to provide a mathematical justification for properties of special classes of deep networks, such as global optimality, invariance, and stability of the learned representations.
Organizers and SPEAKERS
, Assistant Professor, Department of Computer Science, NYU
, Assistant Professor, Department of Electrical Engineering, Tel Aviv University
, Professor, Department of Biomedical Engineering, Johns Hopkins University
09:00-09:15: Introduction (René Vidal)
09:15-10:00: Global Optimality in Deep Learning (René Vidal)
10:00-10:30: Coffee Break
10:30-11:15: Data Structure Based Theory for Deep Learning (Raja Giryes)
11:15-12:00: Signal Recovery from Scattering Convolutional Networks (Joan Bruna)
Introduction (Joan Bruna - 15 minutes)
This introductory lecture will briefly review the recent success of deep architectures in computer vision and use recent results to motivate the following theoretical questions:
- How to deal with the challenge that the learning problem is non-convex?
- Do learning methods get trapped in local minima?
- Why many local solutions seem to give about equally good results?
- Why using rectified linear rectified units instead of other nonlinearities?
- What is the importance of "deep" and "convolutional" in CNN architectures?
- What statistical properties of images are being captured/exploited by deep networks?
- Can we view deep learning as a metric learning problem?
- How can we add robustness to the learning of the network?
- Is there a smart way to select the training data?
Global Optimality in Deep Learning (René Vidal - 45 minutes)
One of the challenges in training deep networks is that the associated optimization problem is non-convex and hence finding a good initialization would appear to be essential. Researchers have tackled this issue by using different ad-hoc or brute force initialization strategies, which often lead to very different local solutions for the network weights. Nonetheless, it would appear that these local solutions give roughly the same (outstanding) results. This lecture will present a mathematical analysis that establishes conditions under which local solutions are globally optimal. In particular, we will show that for a very general class of learning problems for which both the loss function and the regularizer are sums of positively homogeneous functions of the same degree, a local optimum such that many of its entries are zero is also a global optimum. These results will also provide a possible explanation for the success of rectified linear units (RELU), which are positively homogeneous functions. Particular cases of this framework include, in addition to deep learning, matrix factorization and tensor factorization.
Structure Based Theory for Deep Learning (Raja Giryes - 45 minutes)
In this talk, we will briefly survey some existing theory of deep learning. In particular, we will focus on data structure based theory for deep learning and discuss two recent developments.
We first study the generalization error of deep neural network. We will show how the generalization error of deep networks can be bounded via their classification margin. We will also discuss the implications of our results for the regularization of the networks. For example, the popular weight decay regularization guarantees the margin preservation, but it leads to a loose bound to the classification margin. We show that a better regularization strategy can be obtained by directly controlling the properties of the network's Jacobian matrix.
Then we focus on solving minimization problems with neural networks. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence of certain iterative algorithms with an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA (iterative shrinkage and thresholding algorithm) solution by neural networks with layers representing iterations.
Signal Recovery from Scattering Convolutional Networks (Joan Bruna - 45 minutes)
One of the arguments given for the success of deep networks is that deeper architectures are able to better capture invariant properties of objects and scenes in images. While a mathematical analysis of why this is the case remains elusive, recent progress has started to shed some light on this issue for certain sub-classes of deep networks. In particular, scattering networks are a class of Convolutional Networks whose convolutional filter banks are given by complex, multiresolution wavelet families. As a result of this extra structure, they are provably stable and locally invariant signal representations, and yield state-of-the-art classification results on several pattern and texture recognition problems where training examples may be limited. The reasons for such success lie on the ability to preserve discriminative information while generating stability with respect to high-dimensional deformations. In this lecture, we will explore the discriminative aspect of the representation, giving conditions under which signals can be recovered from their scattering coefficients, as well as introducing a family of Gibbs scattering processes, from which one can sample image and auditory textures. Although the scattering recovery is non-convex and corresponds to a generalized phase recovery problem, gradient descent algorithms show good empirical performance and enjoy weak convergence properties. We will discuss connections with non-linear compressed sensing and applications to texture synthesis and inverse problems such as super-resolution.
- Questions and Discussion (15 minutes)