An Optimization Framework for Understanding Deep Networks

Project Summary

This project seeks to develop a mathematical framework for the analysis of a broad class of non-convex optimization problems, including matrix factorization, tensor factorization, and deep learning. In particular, this project will study the problem of minimizing the sum of a loss function and a regularization function, both of which can be non-convex. Specific goals include characterizing conditions on the loss function, the product map, the network architecture and the regularization strategy that give guarantees of global optimality, provide theoretical insights for the success of existing optimization algorithms and architectures, and offer guidance in the design of novel network architectures.

Global Optimality in Structured Low-rank Matrix Factorization Problems

This work studies various classes of matrix factorization problems. In particular, we study sufficient conditions under which all local minima are guaranteed to be globally optimal as well as conditions under which one can find a global minimum from any initialization using a local descent strategy. Our conditions require the regularization function to control the size of the factorization via a positively homogeneous functions of degree two. These results were submitted for publication to the IEEE Transactions for Pattern Analysis and Machine Intelligence [1].

Global Optimality in Neural Network Training

This work studies the training problem for certain classes of neural networks. In particular, we study sufficient conditions under which all local minima are guaranteed to be globally optimal as well as conditions under which one can find a global minimum from any initialization using a local descent strategy. Our conditions require both the network output and the regularization to be positively homogeneous functions of the network parameters, with the regularization being designed to control the network size. Our results apply to networks with one hidden layer, where size is measured by the number of neurons in the hidden layer, and multiple deep subnetworks connected in parallel, where size is measured by the number of subnetworks. These results were submitted for publication at CVPR 2017 [2], and our paper was accepted for oral presentation (top 2.65% of submissions).

Global Optimality and Implicit Regularization of Dropout

This work studies the implicit regularization induced by Dropout and variants of Dropout in linear matrix factorization models and as applied to the final layer of deep neural networks. Specifically, in this work during we showed that stochastic Dropout regularization is equivalent to penalizing the sum of the product of the squared Euclidean norms of each unit's incoming and outgoing weights. From this we established a convex lower bound for the original non-convex optimization problem and showed that the lower bound is in fact tight at all global optima for the non-convex objective. As a consequence, we were able to give a complete characterization of the set of global minima for the non-convex Dropout regularized problem, in the form of explicit necessary and sufficient conditions. Furthermore, we showed that for small Dropout rates, all local minima are globally optimal, and all saddle points are non-degenerate. As a result, stochastic gradient descent is guaranteed to converge to a global minimum.

In addition, we significantly generalized these results to more general models. In particular, we developed a general framework which allows us to characterize the regularization that is induced by Dropout-style regularization for more general stochastic perturbation schemes. For example, in classical Dropout the outputs of hidden units are set to 0 stochastically, which can be viewed as multiplying the output of a hidden unit by a Bernoulli random variable. We showed that our analysis can be generalized to a variety of other random multiplicative noise models, such as DropBlock, where groups of neurons are dropped simultaneously and showed that this is equivalent to a k-support norm regularization. This analytical framework was further developed and generalized to show the regularization that is induced by an arbitrary multiplicative noise regularization. We also extended our analysis to non-linear models and showed that our analytical framework can be applied to understand the effects of Dropout-style regularization applied to the final layer of a non-linear network.

In addition, we significantly generalized these results to more general models. In particular, we developed a general framework which allows us to characterize the regularization that is induced by Dropout-style regularization for more general stochastic perturbation schemes. For example, in classical Dropout the outputs of hidden units are set to 0 stochastically, which can be viewed as multiplying the output of a hidden unit by a Bernoulli random variable. We showed that our analysis can be generalized to a variety of other random multiplicative noise models, such as DropBlock, where groups of neurons are dropped simultaneously and showed that this is equivalent to a k-support norm regularization. This analytical framework was further developed and generalized to show the regularization that is induced by an arbitrary multiplicative noise regularization. We also extended our analysis to non-linear models and showed that our analytical framework can be applied to understand the effects of Dropout-style regularization applied to the final layer of a non-linear network.

Non-convex Optimization Algorithms for Subspace Clustering

In this work we developed a regularized matrix factorization approach for the problem of clustering data belonging to a union of subspaces. Our problem formulation extends the classic k-subspaces method in two ways. First, we include low-rank regularization on the subspace factors. Importantly, we split the low-rank regularizer into two terms and allowed the penalty parameter of one term to scale with cluster size. This split low-rank regularization promotes simultaneous adaptation to the number of subspaces and their dimensions. Second, we included additional regularization terms designed to model various data corruptions, for example missing entries and outliers. We compared this model with a wide variety of other proposed approaches for subspace clustering in the missing-entry regime.

Additionally, to solve the optimization problem associated with the above approach, we developed a highly-scalable online algorithm based on stochastic gradient descent. Importantly, the optimization problem is highly non-convex and suffers from many poor local minima. To overcome this, we developed a novel online re-initialization method which simultaneously optimizes an ensemble of identical k-subspaces models (different random instances of the same model). We then leveraged their aggregate knowledge by cooperatively exchanging the best individual subspace models among the replicas throughout optimization. In experiments, we showed how this procedure significantly improves upon classic k-means++ initialization schemes - particularly when the number of subspaces is large.

Additionally, to solve the optimization problem associated with the above approach, we developed a highly-scalable online algorithm based on stochastic gradient descent. Importantly, the optimization problem is highly non-convex and suffers from many poor local minima. To overcome this, we developed a novel online re-initialization method which simultaneously optimizes an ensemble of identical k-subspaces models (different random instances of the same model). We then leveraged their aggregate knowledge by cooperatively exchanging the best individual subspace models among the replicas throughout optimization. In experiments, we showed how this procedure significantly improves upon classic k-means++ initialization schemes - particularly when the number of subspaces is large.

Global Optimality in Separable Dictionary Learning and Tensor Factorization

Sparse dictionary learning is a popular method for representing signals as linear combinations of a few elements from a dictionary that is learned from the data. In the classical setting, signals are represented as vectors and the dictionary learning problem is posed as a matrix factorization problem where the data matrix is approximately factorized into a dictionary matrix and a sparse matrix of coefficients. However, in many applications in computer vision and medical imaging, signals are better represented as matrices or tensors (e.g. images or videos), where it may be beneficial to exploit the multi-dimensional structure of the data to learn a more compact representation. One such approach is separable dictionary learning, where one learns separate dictionaries for different dimensions of the data. However, typical formulations involve solving a non-convex optimization problem; thus guaranteeing global optimality remains a challenge.

In this work, we propose a framework that builds upon our recent developments in matrix factorization to provide theoretical and numerical guarantees of global optimality for separable dictionary learning. Specifically, we propose an algorithm to find such a globally optimal solution, which alternates between following local descent steps and checking a certificate for global optimality. We illustrate our approach on diffusion magnetic resonance imaging (dMRI) data, a medical imaging modality that measures water diffusion along multiple angular directions in every voxel of an MRI volume. State-of-the-art methods in dMRI either learn dictionaries only for the angular domain of the signals or in some cases learn spatial and angular dictionaries independently. In this work, we also apply the proposed separable dictionary learning framework to learn spatial and angular dMRI dictionaries jointly and provide preliminary validation on denoising phantom and real dMRI brain data.

In this work, we propose a framework that builds upon our recent developments in matrix factorization to provide theoretical and numerical guarantees of global optimality for separable dictionary learning. Specifically, we propose an algorithm to find such a globally optimal solution, which alternates between following local descent steps and checking a certificate for global optimality. We illustrate our approach on diffusion magnetic resonance imaging (dMRI) data, a medical imaging modality that measures water diffusion along multiple angular directions in every voxel of an MRI volume. State-of-the-art methods in dMRI either learn dictionaries only for the angular domain of the signals or in some cases learn spatial and angular dictionaries independently. In this work, we also apply the proposed separable dictionary learning framework to learn spatial and angular dMRI dictionaries jointly and provide preliminary validation on denoising phantom and real dMRI brain data.

Convergence of Gradient Flow in Overparametrized Matrix Factorization

As part of this wrok we studied a variety of properties of the specific trajectories that are followed by gradient descent when optimizing linear single hidden layer over-parameterized neural networks. This has important implications for a variety of questions. First, it has recently been appreciated that in over-parameterized models there are a very large number of potential globally optimal solutions (often infinitely many), so the particular solution that is found by a particular optimization algorithm (e.g., gradient descent) can create a significant bias for solutions with particular properties, or induce so-called "implicit regularization". Second, other recent work has suggested that over-parameterizing a model also induces an inherent acceleration to the problem, where the over-parameterized model converges faster than a model which has equivalent expressiveness (i.e., can represent the same class of functions) but has fewer parameters. Our work made several key contributions to the literature attempting to explain these phenomena by providing a novel set of theoretical insights which allowed us to remove several overly restrictive assumptions about the problem that were present in prior work and show that our problem of interest is closely connected to solving a Riccati differential equation.

Acknowledgement

Work supported by NSF grant 1618485 (link).

Publications

[1]

Structured Low-Rank Matrix Factorization: Global Optimality, Algorithms, and Applications

IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 2019.

[2]

Global Optimality in Neural Network Training

IEEE Conference on Computer Vision and Pattern Recognition (CVPR - oral, top 2.65% of submissions) 2017.

[3]

Curriculum Dropout

IEEE International Conference on Computer Vision (ICCV) 2017.

[4]

Dropout as a Low-Rank Regularizer for Matrix Factorization

Artificial Intelligence and Statistics Conference (AISTATS) 2018.

[5]

Separable Dictionary Learning with Global Optimality and Applications to Diffusion MRI

SIAM Journal on Imaging Sciences 2019.

[6]

On the Regularization Properties of Structured Dropout

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020.

[7]

On the Implicit Bias of Dropout

International Conference on Machine Learning (ICML) 2018.

[8]

Classifying and Comparing Approaches to Subspace Clustering with Missing Data

IEEE Interational Conference on Computer Vision (ICCV) Workshops 2019.

[9]

Adaptive Online k-Subspaces with Cooperative Re-Initialization

IEEE Interational Conference on Computer Vision (ICCV) Workshops 2019.

[10]

Implicit Acceleration in Two-Layer Linear Models.

(under review)

[11]

Optimization Landscape of Neural Networks

chapter in Mathematics of Deep Learning (in press). Cambridge University Press.

[12]

An Analysis of Dropout for Matrix Factorization

International Conference on Artificial Intelligence and Statistics 2018

[13]

Classifying and comparing approaches to subspace clustering with missing data

IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

[14]

Adaptive online k-subspaces with cooperative re-initialization.

IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)