Mathematics of Deep Learning

Time: F 1:00-3:00 pm (10-04-19 to 12-06-19)
Place: Shaffer 300
Instructor: René Vidal (OH: F 3:00-4:00 pm, Clark 302B)
TA: Connor Lane (OH: Tu 4:00-5:00 pm, Clark 311A or B)

Course Description

The past few years have seen a dramatic increase in the performance of recognition systems thanks to the introduction of deep networks for representation learning. However, the mathematical reasons for this success remain elusive. For example, a key issue is that the training problem is nonconvex, hence optimization algorithms are not guaranteed to return a global minimum. Another key issue is that while the number of parameters in deep networks is very large relative to the number of training examples, deep networks appear to generalize very well to unseen examples and new tasks. This course will survey recent work on the theory of deep learning that aims to understand the interplay between architecture design, regularization, generalization, and optimality properties of deep networks.

Class Schedule

**10/04:** Introduction.
- Reading: (Vidal et al., 2017).
- Lecture slides: here.

**10/11:** Optimization landscape for shallow networks.
- Reading: (Baldi and Hornik, 1989).

**Homework 1 released, due Thursday 10/24 11:59 PM ET.**

- 10/18: No class (Fall break).
- 10/22: (Optional) MINDS/CIS Seminar: Dr. Jason D. Lee (12:00 PM, Hodson 310).

**10/25:** Optimization landscape for deep networks.

**11/01:** Analysis of SGD and entropy SGD (guest lecture by Dr. Chaudhari).
- Reading: (Bottou et al., 2018), Ch. 3-4.2; (Chaudhari et al., 2017).
- Lecture notes.

**11/08:** Analysis of the inductive bias of dropout.
- Reading: (Cavazza et al., 2018), (Mianjy et al., 2018).

**Homework 2 released, due Thursday 11/21 11:59 PM ET.**
- For problem 1, let \(\mathcal{W} = \{(\mathbf{U}^*, \mathbf{V}^*) \:|\: f(\mathbf{U}^*, \mathbf{V}^*) = \min f(\mathbf{U}, \mathbf{V})\}\). Then we say \(\mathcal{W}\) is bounded if there exists some \(M > 0\) such that for all \((\mathbf{U}^*, \mathbf{V}^*) \in \mathcal{W}\), \( (\|\mathbf{U}^*\|_F^2 + \|\mathbf{V}^*\|_F^2)^{\frac{1}{2}} \leq M\).
- Homework 2 solutions here.
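For intuition on the boundedness condition in the Homework 2 hint, the quantity being bounded is simply the joint Frobenius norm of the pair \((\mathbf{U}^*, \mathbf{V}^*)\). A minimal NumPy sketch (the function name and the example matrices are illustrative, not part of the assignment):

```python
import numpy as np

def joint_frobenius_norm(U, V):
    """Compute (||U||_F^2 + ||V||_F^2)^(1/2), the quantity that the
    boundedness condition requires to stay below some M > 0 for every
    minimizer pair (U*, V*) in the solution set W."""
    return np.sqrt(np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(V, "fro") ** 2)

# Hypothetical minimizer pair and candidate bound, for illustration only.
U_star = np.eye(2)        # ||U*||_F^2 = 2
V_star = 2.0 * np.eye(2)  # ||V*||_F^2 = 8
M = 4.0

# sqrt(2 + 8) = sqrt(10) ~ 3.16, so this pair satisfies the bound.
print(joint_frobenius_norm(U_star, V_star) <= M)
```

Showing that such an M exists for *all* minimizers (not one example pair) is what the problem asks for.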

**11/15:** Generalization theory for deep and shallow networks.

- 11/20: (Optional) MINDS Symposium on the Foundations of Data Science (all day in Shriver Hall).
**11/22:** Approximation theory: sparsity (guest lecture by Dr. Jeremias Sulam).
- Reading: (Papyan et al., 2018), (Sulam et al., 2018).

- 11/29: No class (Thanksgiving).

**12/06:** Approximation theory for deep and shallow networks.

**Homework 3 released, due Tuesday 12/17 11:59 PM ET.**

Reading

- R. Vidal, J. Bruna, R. Giryes, S. Soatto. Mathematics of Deep Learning. arXiv:1712.04741, 2017.
- P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural networks 2 (1), 53-58, 1989.
- B. Haeffele and R. Vidal. Structured Low-Rank Matrix Factorization: Global Optimality, Algorithms, and Applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
- B. Haeffele and R. Vidal. Global Optimality in Neural Network Training. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- P. Chaudhari et al. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. arXiv:1611.01838, 2016.
- N. Srivastava et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 2014.
- J. Cavazza, B. D. Haeffele, C. Lane, P. Morerio, V. Murino, R. Vidal. Dropout as a Low-Rank Regularizer for Matrix Factorization. International Conference on Artificial Intelligence and Statistics, 2018.
- P. Mianjy, R. Arora and R. Vidal. On the Implicit Bias of Dropout. International Conference on Machine Learning, 2018.
- G. Cybenko. Approximations by superpositions of sigmoidal functions, Mathematics of Control, Signals, and Systems, 2 (4), 303-314, 1989.
- K. Hornik, M. Stinchcombe and H. White. Multilayer feedforward networks are universal approximators, Neural Networks, 2(3), 359-366, 1989.
- K. Hornik. Approximation Capabilities of Multilayer Feedforward Networks, Neural Networks, 4(2), 251-257, 1991.
- G. Montúfar, R. Pascanu, K. Cho, Y. Bengio. On the number of linear regions of deep neural networks. NIPS, 2014.
- H. Mhaskar and T. Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 2016.
- M. Telgarsky. Benefits of depth in neural networks. COLT 2016.
- H. Bölcskei, P. Grohs, G. Kutyniok, P. Petersen. Memory-optimal neural network approximation. Wavelets and Sparsity, 2017.
- E. Sontag. VC Dimension of Neural Networks. Neural Networks and Machine Learning, 1998.
- P. Bartlett, W. Maass. Vapnik-Chervonenkis dimension of neural nets. The handbook of brain theory and neural networks, 2003.
- B. Neyshabur, R. Salakhutdinov, N Srebro. Path-SGD: Path-Normalized Optimization in Deep Neural Networks. NIPS, 2015.
- R. Shwartz-Ziv and N. Tishby. Opening the black box of deep neural networks via information. arXiv:1703.00810, 2017.
- A. Achille and S. Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- T. Liang, T. Poggio, A. Rakhlin, J. Stokes. Fisher-Rao Metric, Geometry and Complexity of Neural Networks. arXiv:1711.01530, 2017.
- V. Papyan, Y. Romano, J. Sulam, M. Elad. Theoretical Foundations of Deep Learning via Sparse Representations. IEEE Signal Processing Magazine, 2018.
- J. Sulam, A. Aberdam, A. Beck, M. Elad. On multi-layer basis pursuit, efficient algorithms and convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Honor Policy

The strength of the university depends on academic and personal integrity. In this course, you must be honest and truthful. Ethical violations include cheating on exams, plagiarism, reuse of assignments, improper use of the Internet and electronic devices, unauthorized collaboration, alteration of graded assignments, forgery and falsification, lying, facilitating academic dishonesty, and unfair competition.