Saturday, February 13, 2010

Big Data

A bunch of things happened that made me think about big data.
  • A dinner conversation yesterday brought up NSF's general emphasis on data-intensive computing and specific examples like the center on Foundations of Data and Visual Analytics at GaTech or supporting infrastructure for Hadoop/MapReduce/Azure in academia. The following are given: there is a lot of data, many researchers want to analyze it, there are real performance bottlenecks in the analyses they need, and we are in a position to throw processors at them. But IMHO, the CS research community has not yet abstracted clean models (for building, for analyzing) that will truly address this state of the world. We need a Valiant-esque insight and effort here, something that threads between the H/M/A approach for special tasks, PRAM/BSP for general parallel computing, and the currently popular multicores. The traditional appreciation of the costs of moving data and computing state between processors, or of synchronicity, doesn't hold, and one has to endogenize the fact that reading from a remote processor's main memory is cheaper than reading local disk given the communication infrastructure in data centers.
  • Then there is the practice of handling large data. There is a new DHS center for Advanced Data Analysis (CCIADA) at Rutgers U. I am the Director of Data Research. Many organizations across the country have various datasets with complex rules for access, use, and at a higher level, what can be inferred from them. IMHO, the research community is far from formulating a model for working with such data constellations, with data mapping, provenance, and trust issues, a model which will support some algebra on top of the data and instinctively automate data handling issues. This is a big bottleneck for research to flourish.
  • Finally, Gmail ads are an interesting beast. Sometimes I am unaware of them, sometimes I find them entertaining as I try to figure out what drives ad systems to map my emails to the specific ads they show, and once in a while, an ad sneaks up on me and I follow the lead. This morning I saw an ad for the Big Data Summit (last year's here). Like other industry meetings on this topic, this meeting too seems to bring together the right players who want to solve the problems, but I am not sure the industry has novel insights into the big problems here. Apologies for speaking without going to the summit.

1 Comment:

Anonymous proaonuiq said...

A quote from the FODAVA site you link to:

"Enormous amounts of data are being generated every day in health care, computational biology, homeland security, commerce, and many other areas. Analyzing these massive and complex data sets is essential to achieve new discoveries, but extremely difficult. An emerging research field known as data and visual analytics is concerned with synthesizing information and deriving insight from massive, dynamic, ambiguous and possibly conflicting digital data for increased understanding and effective decision making."

At a different level, that problem seems the same as that of a set of cells bombarded with huge amounts of physical stimuli of different kinds (chemoreception, photoreception, mechanoreception, thermoreception, etc.), trying to extract useful information from them (i.e., knowledge about where the free energy is around and how to extract it for biological purposes). The best solution to this problem at the cell level, up to now, has been the human mind. So what you are asking for is the design of a kind of social mind. Not so easy, I guess.

4:11 AM  
