ICCV 2017 Tutorial on the Mathematics of Deep Learning
Sunday October 22nd, 2017, 8:30 AM - 12:30 PM
Venice, Italy

Description
The past five years have seen a dramatic increase in the performance of recognition systems due to the introduction of deep architectures for feature learning and classification. However, the mathematical reasons for this success remain elusive. This tutorial will review recent work that aims to provide a mathematical justification for properties of special classes of deep networks, such as global optimality, invariance, and stability of the learned representations.
Organizers
Joan Bruna, Assistant Professor, Department of Computer Science, NYU
Raja Giryes, Assistant Professor, Department of Electrical Engineering, Tel Aviv University
René Vidal, Professor, Department of Biomedical Engineering, Johns Hopkins University
Schedule
08:30-08:45: Introduction (René Vidal)
08:45-09:30: Global Optimality in Deep Learning (René Vidal)
09:30-10:15: Structure Based Theory for Deep Learning (Raja Giryes)
10:15-11:00: Coffee Break
11:00-11:45: Signal Recovery from Scattering Convolutional Networks (Joan Bruna)
11:45-12:30: Signal Modeling: From Convolutional Sparse Coding to Convolutional Neural Networks (Vardan Papyan)
12:30-13:00: Questions
Syllabus
  1. Introduction (Rene Vidal - 15 minutes) This introductory lecture will briefly review the recent success of deep architectures in computer vision and use recent results to motivate the following theoretical questions:
    1. How to deal with the challenge that the learning problem is non-convex?
    2. Do learning methods get trapped in local minima?
    3. Why many local solutions seem to give about equally good results?
    4. Why using rectified linear rectified units instead of other nonlinearities?
    5. What is the importance of "deep" and "convolutional" in CNN architectures?
    6. What statistical properties of images are being captured/exploited by deep networks?
    7. Can we view deep learning as a metric learning problem?
    8. How can we add robustness to the learning of the network?
    9. Is there a smart way to select the training data?

  2. Global Optimality in Deep Learning (René Vidal - 45 minutes) One of the challenges in training deep networks is that the associated optimization problem is non-convex and hence finding a good initialization would appear to be essential. Researchers have tackled this issue by using different ad-hoc or brute force initialization strategies, which often lead to very different local solutions for the network weights. Nonetheless, it would appear that these local solutions give roughly the same (outstanding) results. This lecture will present a mathematical analysis that establishes conditions under which local solutions are globally optimal. In particular, we will show that for a very general class of learning problems for which both the loss function and the regularizer are sums of positively homogeneous functions of the same degree, a local optimum such that many of its entries are zero is also a global optimum. These results will also provide a possible explanation for the success of rectified linear units (RELU), which are positively homogeneous functions. Particular cases of this framework include, in addition to deep learning, matrix factorization and tensor factorization.

  3. Structure Based Theory for Deep Learning (Raja Giryes - 45 minutes) In this talk, we will briefly survey some existing theory of deep learning. In particular, we will focus on data structure based theory for deep learning and discuss two recent developments. We first study the generalization error of deep neural network. We will show how the generalization error of deep networks can be bounded via their classification margin. We will also discuss the implications of our results for the regularization of the networks. For example, the popular weight decay regularization guarantees the margin preservation, but it leads to a loose bound to the classification margin. We show that a better regularization strategy can be obtained by directly controlling the properties of the network's Jacobian matrix. Then we focus on solving minimization problems with neural networks. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence of certain iterative algorithms with an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA (iterative shrinkage and thresholding algorithm) solution by neural networks with layers representing iterations.

  4. Signal Recovery from Scattering Convolutional Networks (Joan Bruna - 45 minutes) One of the arguments given for the success of deep networks is that deeper architectures are able to better capture invariant properties of objects and scenes in images. While a mathematical analysis of why this is the case remains elusive, recent progress has started to shed some light on this issue for certain sub-classes of deep networks. In particular, scattering networks are a class of Convolutional Networks whose convolutional filter banks are given by complex, multiresolution wavelet families. As a result of this extra structure, they are provably stable and locally invariant signal representations, and yield state-of-the-art classification results on several pattern and texture recognition problems where training examples may be limited. The reasons for such success lie on the ability to preserve discriminative information while generating stability with respect to high-dimensional deformations. In this lecture, we will explore the discriminative aspect of the representation, giving conditions under which signals can be recovered from their scattering coefficients, as well as introducing a family of Gibbs scattering processes, from which one can sample image and auditory textures. Although the scattering recovery is non-convex and corresponds to a generalized phase recovery problem, gradient descent algorithms show good empirical performance and enjoy weak convergence properties. We will discuss connections with non-linear compressed sensing and applications to texture synthesis and inverse problems such as super-resolution.

  5. Signal Modeling: From Convolutional Sparse Coding to Convolutional Neural Networks (Vardan Papyan and Michael Elad - 45 minutes) Convolutional sparse coding (CSC) has recently gained increasing attention in the signal and image processing communities. Although several works have presented algorithmic solutions to the global pursuit problem under this new model, very few truly effective guarantees are known for the success of such methods. In this talk, we present a thorough analysis of the CSC model and its associated pursuit, showing new and much stronger guarantees when compared to those of the classical sparse theory. Specifically, we prove that uniqueness of representation, stability with respect to noise, and successful greedy and convex recovery are all guaranteed assuming that the underlying representation is locally sparse. We then present a multi-layer extension of the CSC model, named ML-CSC, which is tightly connected to Convolutional Neural Networks (CNN). More concretely, we show that the forward pass of the CNN is identical to a pursuit associated with our proposed model. Leveraging this connection, we bring a fresh view to CNN with a deeper theoretical understanding. In particular, we show that the forward pass is guaranteed to recover an estimate of the underlying representations of an input signal, assuming these are sparse in a local sense. Moreover, we show that a mild corruption in the input signal does not change this property, indicating the stability of the CNN. Lastly, we exploit the answers to the above questions in order to propose an alternative to the forward pass algorithm, which is tightly connected to deconvolutional and recurrent networks.

  6. Questions and Discussion (15 minutes)