CVPR 2016 Tutorial on the Mathematics of Deep Learning
Sunday June 26th, 2016, 14:00-18:00
Las Vegas, NV, USA

The past five years have seen a dramatic increase in the performance of recognition systems due to the introduction of deep architectures for feature learning and classification. However, the mathematical reasons for this success remain elusive. This tutorial will review recent work that aims to provide a mathematical justification for properties of special classes of deep networks, such as global optimality, invariance, and stability of the learned representations.
Organizers and Speakers
Joan Bruna, Assistant Professor, Department of Statistics, UC Berkeley
Benjamin Haeffele, Research Scientist, Center for Imaging Science, Johns Hopkins University
Raja Giryes, Assistant Professor, Department of Electrical Engineering, Tel Aviv University
Guillermo Sapiro, Professor, Department of Electrical Engineering, Duke University
Amnon Shashua, Professor, Department of Computer Science, Hebrew University of Jerusalem
René Vidal, Professor, Department of Biomedical Engineering, Johns Hopkins University
Schedule and Slides
14:00-14:15: Introduction (René Vidal)
14:15-15:00: Inductive Bias and Depth Efficiency of Convolutional Networks (Amnon Shashua)
15:00-15:45: Global Optimality in Deep Learning (Ben Haeffele and René Vidal)
15:45-16:15: Coffee Break
16:15-17:00: Data Structure Based Theory for Deep Learning (Raja Giryes and Guillermo Sapiro)
17:00-17:45: Signal Recovery from Scattering Convolutional Networks (Joan Bruna)
  1. Introduction (Joan Bruna and René Vidal - 30 minutes) This introductory lecture will briefly review the recent success of deep architectures in computer vision and use recent results to motivate the following theoretical questions:
    1. How do we deal with the fact that the learning problem is non-convex?
    2. Do learning methods get trapped in local minima?
    3. Why do many local solutions seem to give roughly equally good results?
    4. Why use rectified linear units (ReLUs) instead of other nonlinearities?
    5. What is the importance of "deep" and "convolutional" in CNN architectures?
    6. What statistical properties of images are being captured/exploited by deep networks?
    7. Can we view deep learning as a metric learning problem?
    8. How can we add robustness to the learning of the network?
    9. Is there a smart way to select the training data?

  2. Inductive Bias and Depth Efficiency of Convolutional Networks (Amnon Shashua - 45 minutes) Our formal understanding of the inductive bias that drives the success of deep convolutional networks on computer vision tasks is limited. In particular, it is unclear what makes hypothesis spaces born from convolution and pooling operations so suitable for natural images. I will present recent work showing how exponential depth efficiency is achieved in a family of deep networks called Convolutional Arithmetic Circuits (CACs), show that CACs are equivalent to SimNets, show that their depth efficiency is superior to that of conventional ConvNets, and show how inductive bias is tied to correlations between regions of the input image. In particular, correlations are formalized through the notion of separation rank, which, for a given input partition, measures how far a function is from being separable. I will show that a polynomially sized deep network supports exponentially high separation ranks for certain input partitions, while being limited to polynomial separation ranks for others. The network's pooling geometry effectively determines which input partitions are favored, and thus serves as a means for controlling the inductive bias. Contiguous pooling windows, as commonly employed in practice, favor interleaved partitions over coarse ones, orienting the inductive bias towards the statistics of natural images. In addition to analyzing deep networks, I will show that shallow ones support only linear separation ranks, thereby gaining insight into the benefit brought forth by depth: deep networks are able to efficiently model strong correlation under favored partitions of the input. This work covers material recently presented at COLT, ICML and CVPR, including recent arXiv submissions, and was done jointly with doctoral students Nadav Cohen and Or Sharir.
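The separability notion underlying separation rank can be illustrated numerically. The sketch below (illustrative only; the functions, grid, and tolerance are assumptions, and the tutorial's formal definition is over input partitions of a network's function, not a grid sample) matricizes a bivariate function sampled on a grid: a separable function f(x, y) = g(x)h(y) yields a rank-1 matrix, while a function that entangles its inputs yields higher rank.

```python
import numpy as np

# Sample two bivariate functions on a grid and compare the rank of their
# matricizations. Separable f(x, y) = g(x) * h(y) gives rank 1; a function
# coupling x and y gives a substantially larger rank.
x = np.linspace(0.0, 1.0, 64)
X, Y = np.meshgrid(x, x, indexing="ij")

separable = np.sin(X) * np.exp(Y)   # g(x) * h(y): separation rank 1
entangled = np.sin(5.0 * X * Y)     # couples x and y: higher rank

def numerical_rank(m, tol=1e-10):
    # Count singular values above a relative tolerance.
    s = np.linalg.svd(m, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

print(numerical_rank(separable))  # 1
print(numerical_rank(entangled))  # > 1
```

The analogous question for a network is how large this rank can grow, for a given input partition, as a function of network size; depth efficiency means deep networks reach exponentially high ranks that shallow ones cannot match polynomially.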

  3. Global Optimality in Deep Learning (Ben Haeffele and René Vidal - 45 minutes) One of the challenges in training deep networks is that the associated optimization problem is non-convex, and hence finding a good initialization would appear to be essential. Researchers have tackled this issue with various ad hoc or brute-force initialization strategies, which often lead to very different local solutions for the network weights. Nonetheless, these local solutions appear to give roughly the same (outstanding) results. This lecture will present a mathematical analysis that establishes conditions under which local solutions are globally optimal. In particular, we will show that for a very general class of learning problems in which both the loss function and the regularizer are sums of positively homogeneous functions of the same degree, a local optimum such that many of its entries are zero is also a global optimum. These results also provide a possible explanation for the success of rectified linear units (ReLUs), which are positively homogeneous functions. Particular cases of this framework include, in addition to deep learning, matrix factorization and tensor factorization.
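The positive homogeneity property at the heart of this analysis is easy to check directly: f is positively homogeneous of degree p when f(ax) = a^p f(x) for all a > 0. A minimal check (illustrative code, not part of the lecture's formal analysis) confirms that ReLU satisfies this with degree 1 while, for example, the sigmoid does not:

```python
import numpy as np

# Verify degree-1 positive homogeneity of ReLU: relu(a*x) = a * relu(x)
# for any a > 0. The sigmoid fails this property.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for a in (0.5, 2.0, 10.0):
    assert np.allclose(relu(a * x), a * relu(x))  # holds for all a > 0

print(np.allclose(sigmoid(2.0 * x), 2.0 * sigmoid(x)))  # False: not homogeneous
```

Note that homogeneity is only required for positive scalings; for a < 0, relu(a*x) differs from a*relu(x), which is why the property is "positive" homogeneity.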

  4. Data Structure Based Theory for Deep Learning (Raja Giryes and Guillermo Sapiro - 45 minutes) This lecture will address two fundamental questions: what are deep neural networks doing to the metric structure of the data, and how can we add metric constraints to make the network more robust? Regarding the first question, three important properties of a classification machinery are: (i) the system preserves the important information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat points from different classes differently. We show that these fundamental properties are inherited by the architecture of deep neural networks. We formally prove that such networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data, so that points that are similar at the input of the network are likely to remain close at its output. The theoretical analysis of deep networks presented here exploits tools from the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow us to draw conclusions about the metric learning properties of the network and their relation to its structure, and provide bounds on the size of the training set required for the training examples to faithfully represent the unseen data. The results are validated with state-of-the-art trained networks. With respect to the second question, which is critical since DNNs are known to be very non-robust, we study the generalization error of deep networks and describe how this error can be reduced via regularization.
The works described are the result of intensive theoretical and computational analysis by Raja Giryes, Jure Sokolic, Miguel Rodrigues, Qiang Qiu, Jiaji Huang, Alex Bronstein, Robert Calderbank, and Guillermo Sapiro; and extend important fundamental contributions by others that will be described during the tutorial, including results on random sensing, regularization, generalization error and robustness.
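The distance-preservation claim can be probed with a small experiment. The sketch below (an illustrative check under assumed settings: a single wide layer, i.i.d. Gaussian weights scaled by 1/sqrt(width), random input pairs; it is not the lecture's formal result) shows that a random-Gaussian-weight ReLU layer shrinks pairwise distances by a stable, tightly concentrated factor rather than distorting them arbitrarily:

```python
import numpy as np

# One wide layer with i.i.d. Gaussian weights followed by ReLU.
# For random input pairs, the ratio of output distance to input distance
# concentrates around a fixed (angle-dependent) contraction factor.
rng = np.random.default_rng(1)
n, m = 32, 4096                                # input dim, layer width (m >> n)
W = rng.standard_normal((m, n)) / np.sqrt(m)   # variance-1/m Gaussian weights

def layer(x):
    return np.maximum(W @ x, 0.0)

ratios = []
for _ in range(20):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    ratios.append(np.linalg.norm(layer(x) - layer(y)) / np.linalg.norm(x - y))
ratios = np.array(ratios)

# Small spread across pairs: the embedding contracts distances uniformly.
print(ratios.mean(), ratios.std())
```

The concentration of the ratio is the empirical face of the distance-preserving embedding; the in-class vs. out-of-class refinement in the lecture comes from how this contraction depends on the angle between inputs.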

  5. Signal Recovery from Scattering Convolutional Networks (Joan Bruna - 45 minutes) One of the arguments given for the success of deep networks is that deeper architectures are able to better capture invariant properties of objects and scenes in images. While a mathematical analysis of why this is the case remains elusive, recent progress has started to shed some light on this issue for certain sub-classes of deep networks. In particular, scattering networks are a class of convolutional networks whose convolutional filter banks are given by complex, multiresolution wavelet families. As a result of this extra structure, they are provably stable and locally invariant signal representations, and yield state-of-the-art classification results on several pattern and texture recognition problems where training examples may be limited. The reasons for this success lie in the ability to preserve discriminative information while providing stability with respect to high-dimensional deformations. In this lecture, we will explore the discriminative aspect of the representation, giving conditions under which signals can be recovered from their scattering coefficients, and introducing a family of Gibbs scattering processes from which one can sample image and auditory textures. Although scattering recovery is non-convex and corresponds to a generalized phase recovery problem, gradient descent algorithms show good empirical performance and enjoy weak convergence properties. We will discuss connections with non-linear compressed sensing and applications to texture synthesis and inverse problems such as super-resolution.
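The wavelet-modulus-averaging mechanism behind scattering invariance can be demonstrated in one dimension. The toy sketch below (an assumed minimal setup: a single complex Morlet-like filter, circular convolution, and global averaging; real scattering networks use full multiresolution filter banks over several layers) shows that the averaged modulus of wavelet coefficients is invariant to translation, while the raw signal is not:

```python
import numpy as np

# Toy first-order scattering coefficient: S(x) = mean(|x * psi|), with psi
# a complex modulated Gaussian. Modulus + averaging yields translation
# invariance, which the raw signal lacks.
N = 256
t = np.arange(N)

# Test signal: a localized oscillatory bump, and a translated copy.
x = np.exp(-((t - 80) / 10.0) ** 2) * np.cos(2 * np.pi * t / 8.0)
x_shift = np.roll(x, 37)

# Complex Morlet-like filter (Gaussian envelope times complex exponential).
psi = np.exp(-((t - N // 2) / 6.0) ** 2) * np.exp(2j * np.pi * (t - N // 2) / 8.0)

def scatter1(sig):
    u = np.abs(np.fft.ifft(np.fft.fft(sig) * np.fft.fft(psi)))  # |x * psi|
    return u.mean()  # global averaging: fully translation invariant here

print(abs(scatter1(x) - scatter1(x_shift)))  # ~0: representation is invariant
print(np.linalg.norm(x - x_shift))           # large: raw signal is not
```

The recovery problem discussed in the lecture runs this map in reverse: because the modulus discards phase, inverting the scattering coefficients is a generalized phase recovery problem.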

  6. Questions and Discussion (15 minutes)