Rethinking PCA for Modern Datasets: Theory, Algorithms, and Applications

Guest Editors

  • Namrata Vaswani, Iowa State University, USA
  • Yuejie Chi, Ohio State University, USA
  • Thierry Bouwmans, Université de La Rochelle, France

Publication Date: 2018

In today's age of big and messy data, enormous amounts of data are generated all around us. Examples include text, tweets, network traffic, changing Facebook connections, and video surveillance feeds coming in from one or multiple cameras. Before processing any big dataset, the first step is to perform dimension reduction and noise/outlier removal, and the most popular way to do this is via principal component analysis (PCA). A common approach that also performs noise/outlier removal simultaneously is to solve the robust PCA problem. PCA is often the first step in various types of exploratory data analysis, predictive modeling, classification, and clustering problems. It finds applications in biomedical imaging, computer vision, process fault detection, recommendation system design, and many other domains. Classical PCA, without constraints and for clean data, is a solved problem: by the Eckart-Young-Mirsky theorem, computing the top k left singular vectors of the data matrix yields the PCA solution. On the other hand, robust PCA, which refers to PCA in the presence of outliers, is a much harder problem, and one for which provably correct solutions have started appearing only recently. The same is true for PCA with missing data (and the related low-rank matrix completion problem), as well as for sparse PCA, which refers to the PCA problem when the principal components are assumed to be sparse. In fact, even the classical PCA problem with speed or memory constraints is not a solved problem.
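As a minimal sketch of the classical solution mentioned above, the snippet below computes PCA via a singular value decomposition of the centered data matrix, following the convention in which each column is a data point, so the top-k left singular vectors span the principal subspace. The variable names and the small random example are purely illustrative and are not taken from the call itself.

```python
import numpy as np

def pca_via_svd(M, k):
    """Classical PCA of a data matrix M whose columns are data points.

    Returns the top-k principal directions (as columns) and the k-dimensional
    coordinates of each data point in the principal subspace.
    """
    # Center the data by subtracting the mean data point from every column.
    M_centered = M - M.mean(axis=1, keepdims=True)
    # Economy-size SVD: M_centered = U @ diag(S) @ Vt.
    U, S, Vt = np.linalg.svd(M_centered, full_matrices=False)
    principal_directions = U[:, :k]              # top-k left singular vectors
    coords = principal_directions.T @ M_centered # reduced representation of each column
    return principal_directions, coords

# Illustrative example: 500 data points in 50 dimensions, reduced to k = 5.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 500))
P, coords = pca_via_svd(M, k=5)
print(P.shape, coords.shape)                     # (50, 5) (5, 500)
```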

These issues have become practically important for modern datasets because (i) the data matrix is often so large that it cannot be stored directly in the computer's memory, (ii) much of today's data contains missing entries and/or outlier-corrupted entries, (iii) much of today's data is streaming or nearly streaming and decisions are often needed in near real time, and (iv) many types of data are better represented as a tensor rather than as a matrix of vectorized data points. While PCA has been studied for over a century, these issues have become important only in the last decade or less.