Whose job is it: the hardware’s, software’s, or programmer’s?

Perhaps the biggest question in parallel computing for Big Data is, “Who’s responsible for the logical work to harness parallel architectures: the hardware, the compiler, or the programmer?”

Today I found an interesting lecture on MIT’s OpenCourseWare titled “L3: Introduction to Parallel Architectures,” given by Saman Amarasinghe as part of the “Multicore Programming Primer” IAP course. The lecture surveys high-level parallel computing architectures from the past 50 years and lends insight into this question.

The question comes down to the distinction between implicit and explicit parallelism.
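To make the distinction concrete, here is a minimal Python sketch of my own (it is not taken from the lecture). In the implicit style the programmer writes ordinary sequential code and leaves any parallel execution to the layers below; in the explicit style the programmer partitions the work and schedules it. The names `work`, `implicit_total`, `explicit_total`, and the worker count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def work(x):
    """Stand-in for an expensive per-item computation."""
    return x * x


def implicit_total(items):
    # The programmer states only *what* to compute. Whether any of it runs
    # in parallel is left to the layers below (hardware, compiler, runtime).
    return sum(work(x) for x in items)


def explicit_total(items, n_workers=4):
    # The programmer decides *how* to parallelize: split the data into
    # chunks, hand each chunk to a worker, then combine the partial sums.
    items = list(items)
    chunks = [items[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = pool.map(implicit_total, chunks)
        return sum(partial_sums)
```

Both functions return the same total. In CPython, threads will not actually speed up CPU-bound work like this; the point of the sketch is only who is responsible for the partitioning.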


Budget-Constrained Model Selection: Trading off Statistical and Computational Complexities

Big Data requires special attention to the computational aspects of modeling. With lots of data, a researcher has many options to explore; however, naively exploring each model can prove computationally intractable. Considerations for model selection are the topic of today’s post.

Alekh Agarwal, UC Berkeley, presented his work on “Computation meets Statistics: Trade-offs and fundamental limits for large data sets” at Stanford’s Statistics seminar this afternoon.

There were several interesting ideas that I came away with from the talk:

  • The computational cost of an M-estimator can be thought of in terms of the order of the number of search iterations required to achieve a particular level of precision.
  • It may be possible to construct an algorithm whose minimization error is larger than that of the theoretical best, but of the same order. In other words, a slight compromise between computational cost and bias/variance in estimation can be fruitful (i.e., O(B - B̂) = O(B - B*), where B̂ is the theoretical best and B* is some computationally simpler estimator of B).
  • Model selection can be computationally very difficult in high-dimensional models, even though high dimensionality is one of the strengths of Big Data. Tradeoffs should be made regarding the number of samples, computational complexity, and communication costs (esp. for distributed computing).
  • A regularized objective function or otherwise constrained estimation framework can be applied to each of these tradeoffs to obtain a solution to the “budget-constrained” model selection problem. This constrained problem can potentially have more favorable computational complexity than a brute-force selection method.
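The first bullet can be illustrated with a toy case. The sketch below is my own example, not something from the talk: it treats the sample mean as an M-estimator (the minimizer of the average squared loss), runs plain gradient descent, and counts the iterations needed to get within a target precision `eps` of the minimizer. The function name, step size, and data are assumptions chosen for illustration.

```python
def iterations_to_precision(data, eps, step=0.5, start=0.0, max_iter=10_000):
    """Count gradient-descent steps until |b - b_hat| <= eps.

    b_hat is the sample mean, i.e. the minimizer of the average
    squared loss (1 / 2n) * sum((x - b) ** 2).
    """
    b_hat = sum(data) / len(data)
    b = start
    for k in range(max_iter + 1):
        if abs(b - b_hat) <= eps:
            return k
        # Gradient of the average squared loss at b is (b - b_hat).
        b -= step * (b - b_hat)
    raise RuntimeError("did not reach the requested precision")


data = [2.0, 4.0, 6.0, 8.0]  # made-up sample; its mean is 5.0
counts = {eps: iterations_to_precision(data, eps) for eps in (1e-1, 1e-2, 1e-4)}
```

In this well-conditioned toy problem the error halves at every step, so each tenfold tightening of `eps` adds only a few iterations: the cost grows like log(1/eps). Harder problems have worse rates, which is where the tradeoffs in the bullets above start to matter.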

In short, Big Data applications need to take these tradeoffs seriously. High-dimensional model selection is powerful, but constructing an algorithm which gives a good enough result today and can keep working on better results for tomorrow (with little-to-no intervention) is my ideal.

While I couldn’t find a copy of a paper (I believe it is still a work in progress), the abstract to the talk can be found below:

Welcome to Econinformatics: Home of Economics & Big Data

You must be saying, “Econinformatics? What’s that?” Econinformatics is “the application of computer science and information technology to the field” of economics (Wikipedia: Bioinformatics), particularly as it applies to the economic analysis of Big Data:

Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. (Wikipedia: Big Data)

Note that there are orders of magnitude of difference between “capture, manage, and process” and “analyze.”

This blog will provide a discussion platform for “Big Data” topics relevant to economists.

A recent article in Forbes titled “Big Data — Big Money Says It Is A Paradigm Buster” highlights the economic trajectory for Big Data in industry, and invariably, for applied economic research.

The fundamental issues associated with Big Data are succinctly described by Anand Rajaraman, senior vice president at Walmart Global e-commerce, co-founder of @WalmartLabs, and a professor at Stanford:

“The tools [for Big Data] are very different. Many of the fundamental algorithms for predictive analytics depend crucially on keeping the data in main memory with a single CPU to access it. Big Data breaks that condition. The data can’t all be in memory at the same time, so it needs to be processed in a distributed fashion. That requires a new programming model.”

This can be hard for traditional data users to understand. He watches students attack Big Data problems by creating a sample, but that defeats the value of Big Data with all its potentially informative outliers.
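One way to picture the alternative to sampling is to compute small, mergeable summaries on each chunk of data and then combine them, so that no chunk ever has to share memory with the others and no outlier is thrown away. The sketch below is an illustrative assumption on my part, not Rajaraman’s method; `summarize`, `merge`, and the toy chunks are hypothetical names and data.

```python
from functools import reduce


def summarize(chunk):
    """Reduce one in-memory chunk to a small summary: (count, total, maximum)."""
    return (len(chunk), sum(chunk), max(chunk))


def merge(a, b):
    """Combine two summaries; the order in which chunks arrive does not matter."""
    return (a[0] + b[0], a[1] + b[1], max(a[2], b[2]))


# Toy stand-in for data spread across machines or read from disk in pieces.
chunks = [[1.0, 2.0, 3.0], [4.0, 500.0], [6.0, 7.0, 8.0, 9.0]]

count, total, maximum = reduce(merge, map(summarize, chunks))
mean = total / count
```

The outlier 500.0 survives in both `maximum` and `mean`, whereas a small random sample could easily miss it. Because `merge` is associative, the per-chunk summaries could be computed on different machines and combined in any order.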

These challenges are not faced only by “students.” The research and analytic challenges posed by Big Data confront industry and academic researchers in many fields.

Researchers must undergo a paradigm shift in how they attack research with Big Data. I don’t pretend to have all of the answers about Big Data, but by sharing our knowledge and experiences together, we can shorten the learning curve and all do better work.