Theoretical Foundations of Big Data Analysis

About

We live in an era of "big data": science, engineering, and technology are producing ever-larger data streams, with petabyte and exabyte scales becoming common. In scientific fields, such data arise in part because tests of standard theories increasingly focus on extreme physical conditions (e.g., particle physics) and in part because science has become more exploratory (e.g., astronomy and genomics). In commerce, massive data arise because so much of human activity is now online and because business models aim to provide ever more personalized services.

The big data phenomenon presents opportunities and perils. On the optimistic side of the coin, massive data may amplify the inferential power of algorithms that have been shown to be successful on modest-size data sets. The challenge is to develop the theoretical principles needed to scale inference and learning algorithms to massive, even arbitrary, scale. On the pessimistic side of the coin, massive data may amplify the error rates that are part and parcel of any inferential algorithm. The challenge is to control such errors even in the face of the heterogeneity and uncontrolled sampling processes underlying many massive data sets. Another major issue is that big data problems often come with time constraints, where a high-quality answer obtained slowly can be less useful than a medium-quality answer obtained quickly. Overall, we have a problem in which the classical resources of the theory of computation (e.g., time, space, and energy) trade off in complex ways with the data resource.

Various aspects of this general problem are being faced in the theory of computation, statistics, and related disciplines, where topics such as dimension reduction, distributed optimization, Monte Carlo sampling, compressed sensing, low-rank matrix factorization, streaming, and hardness of approximation are of clear relevance, but the general problem remains untackled. This program brought together experts from these areas with the aim of laying the theoretical foundations of the emerging field of big data.