Sparse and Low Rank Methods for Imbalanced and Heterogeneous Data

Project Summary

In recent years, sparse and low-rank modeling techniques have emerged as powerful tools for efficiently processing visual data in non-traditional ways. However, existing methods generally succeed only in the presence of sufficient amounts of homogeneous and balanced training data that are well matched to the actual test conditions. In practice, when the data are heterogeneous and imbalanced, the performance of existing methods can be much worse than expected. This project will develop a new recognition framework based on novel sparse and low-rank modeling techniques, which will be able to deal with imbalanced, heterogeneous and multi-modal data. Imbalanced data will be handled using convex optimization techniques that automatically divide a dataset into common and rare patterns, and select a small set of representatives for the common patterns that are then combined with the rare patterns to form a balanced dataset. Heterogeneous and multi-modal data will be handled using non-convex optimization techniques that learn a latent representation from multiple domains or modalities. Classification and clustering algorithms can be applied to the latent representation. Applications of these methods include image and video-based object recognition, activity recognition, video summarization, and surveillance.

Clustering Imbalanced Data in a Union of Hyperplanes

In [1,2] we developed a clustering algorithm that is able to provably cluster imbalanced data drawn from a union of hyperplanes. By imbalanced data in a union of hyperplanes we mean that the number of points in hyperplane H1 is much larger than the number of points in hyperplane H2 and so forth. The algorithm is sequential in the sense that it first computes the normal vector to the dominant hyperplane H1, then the normal vector to the second dominant hyperplane H2 and so on. The normal vector to H1 is computed by solving an L1 non-convex minimization problem on the sphere. The extensive theoretical analysis of our ICML 2017 paper reveals that as long as there are enough points in the dataset from the dominant hyperplane H1 (as compared to points from other hyperplanes), then the normal vector to H1 is the unique up to sign solution to the L1 non-convex problem. Moreover, this solution can be provably computed by iterating a finite number of times a recursion of linear programs. Alternatively, an Iteratively-Reweighted-Least-Squares (IRLS) approach can be employed to furnish such a normal vector in an efficient fashion. To compute the normal vector to H2, instead of removing points that are close to H1, a procedure that is known to be sensitive to the choice of the thresholding parameter, we weight all points according to their distance to the estimated normal to H1, and then solve the non-convex problem on the weighted data. Experiments on synthetic data show that our algorithm dramatically improves the performance of similar sequential algorithms based on traditional approaches such as RANSAC or modern methods such as REAPER.

Subset Selection for Data in a Union of Subspaces

In [3] we developed a novel subset selection method that is based on the self-expressiveness property of a dataset drawn from a union of subspaces. According to this property, every data point in the dataset can be written as a sparse linear combination of other points in the dataset. Such a sparse combination can be computed by solving a convex optimization problem such as basis pursuit or Lasso. Our method returns a subset of the dataset of fixed cardinality specified by the user, with the property that the selected dataset is optimal in the sense that its symmetrized convex hull contains a maximal portion of the rays generated by all other points in the dataset. In contrast to existing methods, which break down either computationally or in terms of performance for very large or very unbalanced datasets, our method is linear in the size of the dataset. Moreover, experiments on synthetic and real datasets (Caltech-256, Coil-100 and MNIST), show that our method has better or comparable performance to state-of-the-art alternatives, while its running time is orders of magnitude faster.

Acknowledgement

Work supported by NSF grant 1618637.

Publications

[1]

Hyperplane Clustering via Dual Principal Component Pursuit

arXiv:1706.01604v2[cs.CV].

[2]

Hyperplane Clustering via Dual Principal Component Pursuit

International Conference on Machine Learning (ICML) 2017.

[3]

Greedy Self-Representation Based Subset Selection

submitted to Neural Information Processing and Systems (NIPS) 2017.