CDC 2017 Tutorial on the Mathematics of Deep Learning
Friday December 15th, 2017, 10:00 AM - 12:00 PM
Melbourne, Australia

Description
The past five years have seen a dramatic increase in the performance of recognition systems due to the introduction of deep architectures for feature learning and classification. However, the mathematical reasons for this success remain elusive. This tutorial will review recent work that aims to provide a mathematical justification for properties of special classes of deep networks, such as global optimality, invariance, and stability of the learned representations.
Organizers
Joan Bruna, Assistant Professor, Department of Computer Science, NYU
Raja Giryes, Assistant Professor, Department of Electrical Engineering, Tel Aviv University
Stefano Soatto, Professor, Department of Computer Science, UCLA
René Vidal, Professor, Department of Biomedical Engineering, Johns Hopkins University
Schedule
10:00-10:20: Introduction (René Vidal)
10:20-10:40: Global Optimality in Matrix and Tensor Factorization (René Vidal)
10:40-11:00: Global Optimality in Deep Learning (René Vidal)
11:00-11:20: Data Structure Based Theory for Deep Learning (Raja Giryes)
11:20-11:40: Generalization Error and Solving Optimization Problems with Deep Learning (Raja Giryes)
11:40-12:00: A Picture of the Energy Landscape of Deep Neural Networks (Pratik Chaudhari)
Syllabus
  1. Introduction (Rene Vidal and Raja Giryes- 20 minutes) This introductory lecture will briefly review the recent success of deep architectures in computer vision and use recent results to motivate the following theoretical questions:
    1. How to deal with the challenge that the learning problem is non-convex?
    2. Do learning methods get trapped in local minima?
    3. Why many local solutions seem to give about equally good results?
    4. Why using rectified linear rectified units instead of other nonlinearities?
    5. What is the importance of "deep" and "convolutional" in CNN architectures?
    6. What statistical properties of images are being captured/exploited by deep networks?
    7. Can we view deep learning as a metric learning problem?
    8. How can we add robustness to the learning of the network?
    9. Is there a smart way to select the training data?

  2. Global Optimality in Deep Learning (René Vidal - 20 minutes) One of the challenges in training deep networks is that the associated optimization problem is non-convex and hence finding a good initialization would appear to be essential. Researchers have tackled this issue by using different ad-hoc or brute force initialization strategies, which often lead to very different local solutions for the network weights. Nonetheless, it would appear that these local solutions give roughly the same (outstanding) results. This lecture will present a mathematical analysis that establishes conditions under which local solutions are globally optimal. In particular, we will show that for a very general class of learning problems for which both the loss function and the regularizer are sums of positively homogeneous functions of the same degree, a local optimum such that many of its entries are zero is also a global optimum. These results will also provide a possible explanation for the success of rectified linear units (RELU), which are positively homogeneous functions. Particular cases of this framework include, in addition to deep learning, matrix factorization and tensor factorization.

  3. Data Structure Based Theory for Deep Learning (Raja Giryes - 20 minutes) This lecture will address two fundamental questions: What are deep neural networks doing to metrics in the data and how can we add metric constraints to make the network more robust. Regarding the first question, we know that three important properties of a classification machinery are: (i) the system preserves the important information of the input data; (ii) the training examples convey information for unseen data; and (iii) the system is able to treat differently points from different classes. We show that these fundamental properties are inherited by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have the same The theoretical analysis of deep networks presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure; and provide bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks. With respect to the second problem, which is critical since DNNs are known to be very non-robust, we study the generalization error of deep networks and describe how it is possible to reduce this error via regularization. The works described are the result of intensive theoretical and computational analysis by Raja Giryes, Jure Sokolic, Miguel Rodrigues, Qiang Qiu, Jiaji Huang, Alex Bronstein, Robert Calderbank, and Guillermo Sapiro; and extend important fundamental contributions by others that will be described during the tutorial, including results on random sensing, regularization, generalization error and robustness.

  4. Generalization Error and Solving Optimization Problems with Deep Learning (Raja Giryes - 20 minutes) In this talk, we will briefly survey some existing theory of deep learning. In particular, we will focus on data structure based theory for deep learning and discuss two recent developments. We first study the generalization error of deep neural network. We will show how the generalization error of deep networks can be bounded via their classification margin. We will also discuss the implications of our results for the regularization of the networks. For example, the popular weight decay regularization guarantees the margin preservation, but it leads to a loose bound to the classification margin. We show that a better regularization strategy can be obtained by directly controlling the properties of the network's Jacobian matrix. Then we focus on solving minimization problems with neural networks. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence of certain iterative algorithms with an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA (iterative shrinkage and thresholding algorithm) solution by neural networks with layers representing iterations.

  5. A picture of the energy landscape of deep neural networks (Pratik Chaudhari - 20 minutes) I will discuss some peculiarities of the residual surface discovered in the statistical physics literature, and a recently developed algorithm named Entropy-SGD that exploits them. Entropy-SGD can be shown to compute the solution of a viscous Hamilton-Jacobi PDE, which leads to a non-greedy, stochastic optimal control counterpart of SGD that is provably faster. This analysis establishes a previously unknown link between tools of statistical physics, non-convex optimization and the theory of PDEs. In the limit as the viscosity goes to zero, the non-viscous Hamilton-Jacobi PDE leads to the well-known proximal point iteration via the Hopf-Lax formula, thereby providing another link to classical techniques in convex optimization. Moreover, Entropy-SGD includes as special cases popular algorithms in the deep learning literature, e.g., distributed algorithms like Elastic-SGD as well as algorithms in Bayesian machine learning. It enjoys exceptional empirical performance and obtains state-of-the-art generalization errors with optimal convergence rates, all without any extra hyper-parameters to tune.