Non-Convex Methods for Discovering High-Dimensional Structures in Big and Corrupted Data
Project Summary
Discovering structure in high-dimensional data, such as images, videos, and 3D point clouds, has become an essential part of scientific discovery in many disciplines, including machine learning, computer vision, pattern recognition, and signal processing. This need has motivated extraordinary advances in the past decade, including various sparse and low-rank modeling methods based on convex optimization with provable theoretical guarantees of correct recovery. However, existing theory and algorithms rely on the assumption that high-dimensional data can be well approximated by low-dimensional structures. While this assumption is adequate for some datasets, e.g., images of faces under varying illumination, it may be violated in many emerging datasets, e.g., 3D point clouds. The goal of this project is to develop a mathematical modeling framework and associated non-convex optimization tools for discovering high-dimensional structures in big and corrupted data. Specifically, this project will develop provably correct and scalable optimization algorithms for learning a union of high-dimensional subspaces from big and corrupted data. The proposed algorithms will be based on a novel framework called Dual Principal Component Pursuit that, instead of learning a basis for each subspace, seeks to learn a basis for each subspace's orthogonal complement. In sharp contrast to existing sparse and low-rank methods, which require both the dimensions of the subspaces and the percentage of outliers to be sufficiently small, the proposed framework will lead to results where even subspaces of the highest possible dimension (i.e., hyperplanes) can be correctly recovered from highly corrupted data. This will be achieved by solving a family of non-convex sparse representation problems, whose analysis will require the development of novel theoretical results guaranteeing correct recovery of the subspaces from corrupted data. The project will also develop scalable algorithms for solving these non-convex optimization problems and study conditions for their convergence to the global optimum. These algorithms will be evaluated in two major applications in computer vision: segmentation of point clouds and clustering of image categorization datasets.
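At the core of the proposed framework is the following non-convex problem: given a data matrix X̃ whose columns are the (possibly corrupted) unit-normalized data points, DPCP seeks a unit vector that is orthogonal to as many of the points as possible by solving

    \min_{b \in \mathbb{R}^{D}} \; \| \tilde{X}^{\top} b \|_{1} \quad \text{subject to} \quad \| b \|_{2} = 1.

The L1 norm promotes a vector b whose inner products with the data points are sparse, i.e., a vector orthogonal to as many points as possible; when the inliers span a hyperplane, this is exactly its normal vector.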
Detecting Outliers in a High-dimensional Subspace
In [1,2], we studied and developed an L1-based non-convex optimization technique, called Dual Principal Component Pursuit (DPCP), which applies to learning a single subspace as well as multiple subspaces. The method is motivated by the fact that state-of-the-art subspace methods for outlier rejection, such as Outlier Pursuit (Xu et al. 2010), self-expressiveness-based approaches (Soltanolkotabi et al. 2011), and RANSAC (Fischler and Bolles 1981), as well as state-of-the-art subspace clustering methods, such as Sparse Subspace Clustering (Elhamifar and Vidal 2009, 2013) and Low-Rank Subspace Clustering (Liu et al. 2010), require the dimension of the subspace to be sufficiently small, and thus fail for high-dimensional subspaces such as hyperplanes (subspaces of codimension 1). DPCP fills this gap in the literature by solving a non-convex L1 problem on the sphere, whose global solution is a vector orthogonal to the subspace (e.g., the normal vector to the subspace, if the subspace is a hyperplane). We proved that the non-convex problem can be solved through a sequence of linear programs, which minimize the same L1 objective locally on a sequence of tangent spaces to the sphere, and we showed that the solutions of these linear programs converge to the global minimum of the non-convex problem in a finite number of steps. By repeating this process as many times as the codimension of the subspace, a basis for the orthogonal complement of the subspace is obtained. Experiments with synthetic and real data demonstrated that DPCP advances the state of the art in outlier rejection for principal component analysis.
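For concreteness, the following is a minimal sketch, in NumPy/SciPy, of the linear-programming recursion described above. The function name dpcp_lp, the SVD-based initialization, the slack-variable LP reformulation, and the stopping rule are choices made for this sketch and are not necessarily identical to the implementation in [1,2].

    import numpy as np
    from scipy.optimize import linprog

    def dpcp_lp(X, max_iter=10, tol=1e-6):
        """Sketch of the DPCP linear-programming recursion.

        X : (D, N) array whose columns are unit-normalized data points.
        Returns a unit vector b intended to be orthogonal to the inliers.
        """
        D, N = X.shape
        # Initialize with the direction of least variance (least left
        # singular vector); an initialization choice of this sketch.
        U, _, _ = np.linalg.svd(X, full_matrices=True)
        b = U[:, -1]
        # LP variables z = [b; t]: the slacks t_j >= |x_j^T b| are enforced
        # by two inequalities, so minimizing sum_j t_j minimizes ||X^T b||_1.
        c = np.concatenate([np.zeros(D), np.ones(N)])
        I = np.eye(N)
        A_ub = np.block([[X.T, -I], [-X.T, -I]])
        b_ub = np.zeros(2 * N)
        bounds = [(None, None)] * D + [(0, None)] * N
        for _ in range(max_iter):
            # Minimize the L1 objective over the affine plane b_k^T b = 1,
            # the linearization of the sphere constraint at the current b_k.
            A_eq = np.concatenate([b, np.zeros(N)]).reshape(1, -1)
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                          bounds=bounds, method="highs")
            if not res.success:
                break
            b_new = res.x[:D] / np.linalg.norm(res.x[:D])
            if np.linalg.norm(b_new - b) < tol:
                return b_new
            b = b_new
        return b

    # Example: 200 inliers on a random hyperplane in R^5 plus 50 outliers.
    rng = np.random.default_rng(0)
    n = rng.standard_normal(5); n /= np.linalg.norm(n)
    basis = np.linalg.svd(n.reshape(1, -1))[2][1:].T   # (5, 4) hyperplane basis
    X = np.concatenate([basis @ rng.standard_normal((4, 200)),
                        rng.standard_normal((5, 50))], axis=1)
    X = X / np.linalg.norm(X, axis=0)
    print(abs(dpcp_lp(X) @ n))   # close to 1 if the normal is recovered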
Clustering in a Union of Hyperplanes
In [3,4] we extended the theoretical analysis and algorithms of DPCP to the case where the data are drawn from a union of hyperplanes. We proposed a sequential hyperplane learning algorithm that first learns the dominant hyperplane H1 by applying DPCP to the entire dataset, then computes the normal vector to the second dominant hyperplane H2, and so on. The extensive theoretical analysis in [3,4] reveals that, as long as the dataset contains enough points from the dominant hyperplane H1 (relative to the points from the other hyperplanes), the normal vector to H1 is the unique (up to sign) solution of the non-convex L1 problem. Moreover, this solution can be provably computed by the efficient algorithms developed for DPCP. To compute the normal vector to H2, instead of removing the points that lie close to H1, a procedure known to be sensitive to the choice of thresholding parameter, we weight each point according to its distance to the estimated hyperplane H1 (i.e., the absolute inner product with its estimated normal vector) and then solve the non-convex problem on the weighted data, as sketched below. Experiments on synthetic data show that our algorithm dramatically outperforms similar sequential algorithms based on traditional approaches such as RANSAC and on modern methods such as REAPER.
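A minimal sketch of this sequential weighted scheme is shown below, reusing the dpcp_lp sketch above. The particular weighting, scaling each point by its distance to the most recently estimated hyperplane, is one plausible instantiation of the idea; the exact weighting used in [3,4] may differ.

    import numpy as np

    def sequential_hyperplanes(X, n_hyperplanes):
        """Sequentially estimate hyperplane normals via weighted DPCP.

        X : (D, N) data matrix. Returns the (D, K) matrix of estimated
        normals and a hyperplane label for each point.
        """
        normals, Xw = [], X.copy()
        for _ in range(n_hyperplanes):
            b = dpcp_lp(Xw)           # DPCP sketch from the previous section
            normals.append(b)
            # Soft down-weighting: points on the recovered hyperplane satisfy
            # |b^T x| ~ 0, so they stop influencing the next DPCP problem
            # without any hard, threshold-dependent removal step.
            w = np.abs(b @ Xw)
            Xw = Xw * w
        B = np.column_stack(normals)
        labels = np.argmin(np.abs(B.T @ X), axis=0)   # nearest hyperplane
        return B, labels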
Acknowledgement
This work was supported by NSF grant 1704458.
Publications
[1]
M. C. Tsakiris and R. Vidal.
Dual Principal Component Pursuit.
ICCV Workshop on Robust Subspace Learning and Computer Vision, 2015.
[2]
M. C. Tsakiris and R. Vidal.
Dual Principal Component Pursuit.
Journal of Machine Learning Research (JMLR), 2018.
[3]
M. C. Tsakiris and R. Vidal.
Hyperplane Clustering via Dual Principal Component Pursuit.
International Conference on Machine Learning (ICML), 2017.
[4]
M. C. Tsakiris and R. Vidal.
Hyperplane Clustering via Dual Principal Component Pursuit.
arXiv:1706.01604v2 [cs.CV], 2017.