Why Didn’t We Already Find What We’re Looking For?

Image: “building the data plotter” by !mz via Flickr

What we primarily look for in data is a way to make sense of it: summaries and statistics that inform analytical decisions, or patterns and stories that yield new insight into the larger world behind the data.

This should all sound familiar if you are a FlowingData blog fan, as I am. As author Nathan Yau puts it in his book Visualize This: “Whatever you decide visualization is… you’re ultimately looking for the truth.” But the truth is hard to come by. Numbers don’t lie; people do, either on purpose or through incompetence.

Most of us have probably read How to Lie with Statistics, but with Big Data the dangers are multiplied by orders of magnitude. Search for the truth and always try to tell it, but beware of people who say they have the big truth.

Big Data Visual Exploration

There are lots of tools for analyzing and visualizing non-Big Data (smaller data?). But when we approach Big Data, our options are limited almost by definition. In fact, most definitions of Big Data are framed in terms of the inability of current “smaller data” tools to handle it effectively. What we do have today is centered on map/reduce processing (see Hadoop), which essentially first produces smaller datasets for analysis (e.g. check out the free Infobright/Pentaho VM).

This map/reduce approach, which requires low-level distributed programming, isn’t well suited to serendipitous discovery by amateur data scientists, although there is ongoing work in this area (see Pig and Hive). There are also emerging companies that specialize in automating the deep “data scientist” geekery to provide a “small data” exploration experience over Big Data sets (Opera Solutions, the still-stealthy Zillabyte?).
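For readers who haven't seen map/reduce up close, here is a minimal single-machine sketch in Python of how it boils many raw records down to a small summary. It is only an illustration of the pattern, not Hadoop's API: the records, the function names, and the region/amount fields are all made up for the example, and a real job would distribute the map and reduce steps across many machines.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical raw records: (region, sale_amount). In a real Big Data job
# there would be far too many of these to load on one machine.
RECORDS = [
    ("east", 120.0),
    ("west", 80.0),
    ("east", 30.0),
    ("north", 50.0),
    ("west", 20.0),
]


def map_phase(record):
    """Map: turn one raw record into a (key, value) pair."""
    region, amount = record
    return region, (amount, 1)  # value carries (running total, count)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: collapse one key's values into a single (total, count)."""
    return reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)


def summarize(records):
    """Run map, shuffle, and reduce to produce the smaller dataset."""
    grouped = shuffle(map(map_phase, records))
    return {key: reduce_phase(values) for key, values in grouped.items()}


SUMMARY = summarize(RECORDS)
for region, (total, count) in sorted(SUMMARY.items()):
    print(region, total, count)
```

The output has one row per region instead of one row per sale. That reduced table is the kind of "smaller data" ordinary analysis and visualization tools can then handle.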

The real challenge is still that we don’t know what we are looking for in Big Data sets until we find it; this is discovery more than answers to questions. And whatever it is, it probably wasn’t in the smaller data we have already put to optimal use (or not: most data goes unexamined even in non-big databases).
