ICCV 2015 Tutorial on the Mathematics of Deep Learning
Saturday December 12th, 2015, 14:00-18:00
The past five years have seen a dramatic increase in the performance of recognition systems due to the introduction of deep architectures for feature learning and classification. However, the mathematical reasons for this success remain elusive. This tutorial will review recent work that aims to provide a mathematical justification for properties of special classes of deep networks, such as global optimality, invariance, and stability of the learned representations.
Joan Bruna, Assistant Professor of Statistics, UC Berkeley
Guillermo Sapiro, Professor of Electrical Engineering, Duke University
René Vidal, Professor of Biomedical Engineering, Johns Hopkins University
Introduction (Joan Bruna and René Vidal - 30 minutes)
This introductory lecture will briefly review the recent success of deep architectures in computer vision and use recent results to motivate the following theoretical questions:
- How to deal with the challenge that the learning problem is non-convex?
- Do learning methods get trapped in local minima?
- Why do many local solutions seem to give about equally good results?
- Why use rectified linear units instead of other nonlinearities?
- What is the importance of "deep" and "convolutional" in CNN architectures?
- What statistical properties of images are being captured/exploited by deep networks?
- Can we view deep learning as a metric learning problem?
- How can we add robustness to the learning of the network?
- Is there a smart way to select the training data?
Global Optimality in Deep Learning (René Vidal - 45 minutes)
One of the challenges in training deep networks is that the associated optimization problem is non-convex and hence finding a good initialization would appear to be essential. Researchers have tackled this issue by using different ad-hoc or brute-force initialization strategies, which often lead to very different local solutions for the network weights. Nonetheless, it would appear that these local solutions give roughly the same (outstanding) results. This lecture will present a mathematical analysis that establishes conditions under which local solutions are globally optimal. In particular, we will show that for a very general class of learning problems for which both the loss function and the regularizer are sums of positively homogeneous functions of the same degree, a local optimum such that many of its entries are zero is also a global optimum. These results will also provide a possible explanation for the success of rectified linear units (ReLUs), which are positively homogeneous functions. Particular cases of this framework include, in addition to deep learning, matrix factorization and tensor factorization.
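As a small illustration of the positive homogeneity mentioned above, the following sketch checks numerically that a ReLU satisfies relu(a·x) = a·relu(x) for a > 0, and that a bias-free two-layer ReLU network is positively homogeneous of degree 2 in its weights. The layer sizes, the random weights, and the helper names (`two_layer_net`, `scale`) are assumptions made for this example only; it is not the analysis presented in the lecture.

```python
import random

def relu(x):
    # Elementwise rectified linear unit.
    return [max(0.0, v) for v in x]

def matvec(W, x):
    # Plain matrix-vector product for a list-of-rows matrix.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def two_layer_net(W1, W2, x):
    # Hypothetical bias-free network: x -> W2 * relu(W1 * x).
    return matvec(W2, relu(matvec(W1, x)))

def scale(W, alpha):
    # Multiply every weight by alpha.
    return [[alpha * w for w in row] for row in W]

rng = random.Random(0)
W1 = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
W2 = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(3)]
x = [rng.gauss(0, 1) for _ in range(4)]
alpha = 2.5

# Degree 1 for the ReLU itself: relu(alpha * x) == alpha * relu(x) when alpha > 0.
relu_gap = max(
    abs(a - alpha * b)
    for a, b in zip(relu([alpha * v for v in x]), relu(x))
)

# Degree 2 for the network: scaling both layers by alpha scales the output by alpha**2.
base = two_layer_net(W1, W2, x)
scaled = two_layer_net(scale(W1, alpha), scale(W2, alpha), x)
net_gap = max(abs(s - alpha ** 2 * b) for s, b in zip(scaled, base))

print("ReLU homogeneity gap:", relu_gap)
print("network homogeneity gap:", net_gap)
```

Both gaps should be zero up to floating-point rounding. With K bias-free ReLU layers the same argument gives degree K, which is the kind of structure the "same degree" condition on the loss and the regularizer refers to.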
Signal Recovery from Scattering Convolutional Networks (Joan Bruna - 45 minutes)
One of the arguments given for the success of deep networks is that deeper architectures are able to better capture invariant properties of objects and scenes in images. While a mathematical analysis of why this is the case remains elusive, recent progress has started to shed some light on this issue for certain sub-classes of deep networks. In particular, scattering networks are a class of Convolutional Networks whose convolutional filter banks are given by complex, multiresolution wavelet families. As a result of this extra structure, they are provably stable and locally invariant signal representations, and yield state-of-the-art classification results on several pattern and texture recognition problems where training examples may be limited. The reasons for such success lie in the ability to preserve discriminative information while generating stability with respect to high-dimensional deformations. In this lecture, we will explore the discriminative aspect of the representation, giving conditions under which signals can be recovered from their scattering coefficients, as well as introducing a family of Gibbs scattering processes, from which one can sample image and auditory textures. Although the scattering recovery is non-convex and corresponds to a generalized phase recovery problem, gradient descent algorithms show good empirical performance and enjoy weak convergence properties. We will discuss connections with non-linear compressed sensing and applications to texture synthesis and inverse problems such as super-resolution.
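To make the idea of an invariant scattering coefficient concrete, here is a toy one-dimensional sketch: the signal is convolved with complex band-pass filters, the complex modulus is taken, and the result is averaged. Several simplifications are assumptions of this example and not part of the lecture: Gabor atoms stand in for a proper wavelet family, convolution is circular, only first-order coefficients are computed, and a global average replaces the local low-pass filter (so the invariance is global rather than local).

```python
import cmath
import math
import random

def gabor_filter(n, xi, sigma):
    # Complex Gabor atom of length n, centered at index 0 with wrap-around.
    # Hypothetical stand-in for one wavelet of a multiresolution family.
    taps = []
    for t in range(n):
        u = t if t <= n // 2 else t - n
        taps.append(math.exp(-u * u / (2.0 * sigma * sigma)) * cmath.exp(1j * xi * u))
    return taps

def circular_conv(x, h):
    # Circular convolution, O(n^2), adequate for a short toy signal.
    n = len(x)
    return [sum(x[k] * h[(t - k) % n] for k in range(n)) for t in range(n)]

def first_order_scattering(x, filters):
    # One coefficient per filter: the average of |x * psi|.
    return [sum(abs(v) for v in circular_conv(x, h)) / len(x) for h in filters]

n = 64
# Three dyadic scales: the center frequency halves and the width doubles at each scale.
filters = [gabor_filter(n, 0.75 * math.pi / 2 ** j, 2.0 * 2 ** j) for j in range(3)]

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(n)]
shift = 7
x_shifted = x[-shift:] + x[:-shift]  # circular translation of the signal

s_x = first_order_scattering(x, filters)
s_shifted = first_order_scattering(x_shifted, filters)
max_diff = max(abs(a - b) for a, b in zip(s_x, s_shifted))
print("coefficients:", s_x)
print("max change under translation:", max_diff)
```

The coefficients are unchanged by the translation up to rounding, because the modulus of the filtered signal is merely shifted and the average ignores position. The modulus discards phase, which is why recovering a signal from such coefficients resembles a phase recovery problem.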
On the Stability of Deep Networks and its Relationship with Compressed Sensing and Metric Learning (Guillermo Sapiro and Raja Giryes - 45 minutes)
This lecture will address two fundamental questions: what are deep neural networks doing to metrics in the data, and how can we add metric constraints to make the network more robust? Regarding the first question, we know that three important properties of a classification machinery are: (i) the system preserves the important information of the input data; (ii) the training examples convey information for unseen data; and (iii) the system is able to treat differently points from different classes. We show that these fundamental properties are inherited by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have the same output. The theoretical analysis of deep networks presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, and provide bounds on the required size of the training set such that the training examples would faithfully represent the unseen data. The results are validated with state-of-the-art trained networks. With respect to the second question, which is critical since DNNs are known to be very non-robust, we describe a novel deep learning objective formulation that unifies both the classification and metric learning criteria. We then introduce a geometry-aware deep transform to enable a non-linear discriminative and robust feature transform, which shows competitive performance on small training sets for both synthetic and real-world data. We further support the proposed framework with a formal (K, epsilon)-robustness analysis.
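The simplest ingredient behind the distance-preservation claim can be checked numerically. The sketch below draws one layer of i.i.d. Gaussian weights with variance 1/m and compares the distance between two inputs before and after the layer, with and without a ReLU. The dimensions, the variance scaling, and the helper names are assumptions of this example; it illustrates a Johnson-Lindenstrauss-type concentration effect for a single layer, not the theorem presented in the lecture.

```python
import math
import random

def gaussian_layer(m, d, rng):
    # m x d matrix with i.i.d. N(0, 1/m) entries (scaling chosen for this sketch).
    std = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(m)]

def apply_linear(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(z):
    return [max(0.0, v) for v in z]

def dist(a, b):
    # Euclidean distance between two vectors of equal length.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

rng = random.Random(0)
d, m = 16, 1000
W = gaussian_layer(m, d, rng)
x = [rng.gauss(0, 1) for _ in range(d)]
y = [rng.gauss(0, 1) for _ in range(d)]

input_dist = dist(x, y)
linear_dist = dist(apply_linear(W, x), apply_linear(W, y))
relu_dist = dist(relu(apply_linear(W, x)), relu(apply_linear(W, y)))

linear_ratio = linear_dist / input_dist
relu_ratio = relu_dist / input_dist
print("linear layer distance ratio:", linear_ratio)
print("ReLU layer distance ratio:", relu_ratio)
```

The linear ratio concentrates around 1 as m grows. The ReLU can only shrink the distance, since it is 1-Lipschitz in each coordinate, and how much it shrinks depends on the particular pair of points.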
The works described are the result of intensive theoretical and computational analysis by Raja Giryes, Qiang Qiu, Jiaji Huang, Alex Bronstein, Robert Calderbank, and Guillermo Sapiro; and extend important fundamental contributions by others that will be described during the tutorial, including results on random sensing and robustness.
Questions and Discussion (15 minutes)