Image by !mz via Flickr
What we primarily want from data is to make sense of it – to find summaries and statistics that inform analytical decision-making, or to discover patterns and stories that create new insights into the larger world behind the data.
This should all sound familiar if you are a Flowing Data blog fan as I am. As author Nathan Yau writes in his book Visualize This, “Whatever you decide visualization is… you’re ultimately looking for the truth.” But the truth is hard to come by. Basically, numbers don’t lie, people do – either on purpose or through incompetence.
Most of us have probably read How to Lie with Statistics, but with Big Data the dangers are multiplied by orders of magnitude. Search for the truth, always try to tell the truth, but beware of people saying they have the big truth.
Big Data Visual Exploration
There are lots of tools to analyze and visualize non-Big Data (smaller data?). But when we approach Big Data our options are almost by definition limited. In fact, most definitions of Big Data are framed in terms of the inability of current “smaller data” tools to handle it effectively. What we do have today is centered on map/reduce processing (see Hadoop), which essentially first boils the data down into smaller datasets for analysis (e.g. check out the free Infobright/Pentaho VM).
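To make the "first makes smaller datasets" idea concrete, here is a minimal single-process sketch of the map/reduce pattern in Python. It is not Hadoop and does nothing distributed; the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the toy records are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def map_phase(record):
    """Emit one (key, value) pair per word in a raw text record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single summary value."""
    return key, sum(values)

def map_reduce(records):
    """Run map, shuffle, and reduce in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Toy input standing in for a much larger raw dataset (made-up records).
records = ["big data is big", "small data is small"]
summary = map_reduce(records)
print(summary)
```

The output is a small summary table (one count per distinct word) that ordinary "smaller data" tools can then analyze or chart. In a real cluster the map and reduce calls would run in parallel across many machines.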
This map/reduce approach requiring low-level distributed programming isn’t well suited to serendipitous discovery by amateur data scientists, although there is ongoing work in this area (see Pig and Hive). There are also emerging companies specializing in automating the deep “data scientist” geekery to provide a “small data” exploration experience over Big Data sets (Opera Solutions, still stealthy Zillabyte?).
The real challenge is still that we don’t really know what we are looking for in Big Data sets before we find it – it is discovery more than answers to questions. And whatever it is, it probably wasn’t in the smaller data we have already made optimal use of (or not; most data goes unexamined even in non-big databases).
Taken together, the V-word characteristics of Big Data both identify and shape the kinds of innovative solutions that can be created from Big Data opportunities. These solutions will tend to provide intelligence more than absolute truth.
Disruption is the Real Opportunity
Image by NASA Goddard Photo and Video via Flickr
It’s worth keeping in mind that adding Big Data Analysis to a current business isn’t the whole enchilada. Having better intelligence than the next guy is a great competitive advantage, but in itself isn’t “disruptive.” The idea that Big Data will enable game-changing new business opportunities, not simply adding insight into current processes or decision-support practices, is why Big Data Analysis is exciting.
Entrepreneurs who create new ways of doing business fueled by Big Data intelligence will dominate. The key to the difference between improving current business and innovative disruption is looking for answers to new and different questions. Sounds easy enough but that is truly difficult creative work.
Big Data Doesn’t Come with an Instruction Manual
Big Data sets don’t start with a schema model that defines the answers “findable” within them. It’s not just a huge BI warehouse. Rather, it takes a cunning mind and a dedicated soul to explore Big Data – for example, trying various map/reduce algorithms to find new patterns, and assembling new visualizations that reveal new ways of looking and seeing.
This skilled data mining and keen perceptive ability must be fused with an entrepreneurial mindset that is always evaluating how any new big data intelligence could be formed into new and ultimately disruptive innovation.
There are lots of definitions of Big Data. Most of them are fuzzy marketing speak along the lines of “Big Data is just bigger than your old data, too big to deal with the same way you dealt with data before.” Amusingly, a lot of the examples being given for “historical” Big Data successes are based on traditional data methods and technologies applied over large amounts of traditional data.
Image by Idaho National Laboratory via Flickr
Clearly there is something new happening with the way we can get value out of very large data sets, but it’s hard to see where the line between Big Data and not-so-Big Data actually falls. Most pundits seem to be saying we can spot Big Data the same way we know what’s obscene – we’d simply recognize it when we see it. The irony of course is that Big Data is just too big to see, or visualize as it is.
Think how big a picture it would take to show a 5 PB Big Data set at one pixel per data point.
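A rough back-of-the-envelope sketch of that picture follows. It rests on assumptions that are not in the original: one data point per byte, a decimal petabyte (10^15 bytes), and a 1920×1080 screen as the yardstick. The helper names are hypothetical.

```python
import math

PETABYTE = 10**15  # decimal petabyte in bytes (assumption: SI units, not 2**50)

def square_side(n_pixels: int) -> int:
    """Smallest side length of a square image that holds n_pixels pixels."""
    side = math.isqrt(n_pixels)
    return side if side * side >= n_pixels else side + 1

def screens_needed(n_pixels: int, width: int = 1920, height: int = 1080) -> int:
    """Number of screens of the given resolution needed, rounding up."""
    per_screen = width * height
    return (n_pixels + per_screen - 1) // per_screen

n_points = 5 * PETABYTE  # assumption: one data point per byte
side = square_side(n_points)
screens = screens_needed(n_points)
print(f"{side:,} pixels per side, about {screens:,} full-HD screens")
```

Under those assumptions the square image is roughly 70.7 million pixels on a side, or about 2.4 billion full-HD screens tiled together. If each data point were a multi-byte record the numbers would shrink proportionally, but not by enough to make the picture viewable.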
Big Data by the V-words
I’ve read more than a few definitions that talk about some clever V-word characteristics that Big Data scientists need to be concerned with:
- Volume – Obviously Big Data is Big.
- Variety – Many identified Big Data sets are internally heterogeneous (e.g. big data documents). The data isn’t collected or authored according to a single master schema.
- Velocity – Big Data sets tend to grow rapidly, even as we use them. This implies some dynamic and possibly real-time behavior as well.
I’d add a fourth V:
- Veracity – Or rather, the lack thereof. Raw Big Data is often not verifiable/verified nor validated (until processed for that goal specifically, e.g. security fraud). Analysis can’t always be duplicated (as data keeps growing/changing). Duplication, omission, and general incompleteness are to be expected.
It may be impossible to repeat the same analysis definitively on a truly big “big data” set. If results can’t be exactly reproduced (or explained back to raw data), they can’t serve as literal truth.
Your life so far has been a big data trail for someone else to mine.
Image by Xpectro via Flickr
Google took what were essentially crumbs of data left by millions (billions?) of people as they navigated around the internet, then compiled and analyzed them into an index of how relevant and popular any place you might want to visit is. As they compile more bits of information about you and your social circles and browsing history (and recommendations and…), your lifetime becomes laid bare to their ultimately commercial interest.
Privacy is being hotly debated in some circles but most are not even aware of what is at stake. For some the world has evolved and we can no longer apply past expectations of privacy to constructs and capabilities emerging today – the new world is a shared one. For others, any data associated with their personal identification is off-limits.
There is a new huge privacy conflict dead ahead.
Despite bigger and bigger data, the world is a small place and it is full of people. Increasingly networked people. I like Clay Shirky’s thinking in Here Comes Everybody about new ways people online can gather and form loose communities whose effectiveness is multiplied by newfound freedoms and capabilities for distributed but coordinated group action. (Twitter doesn’t topple governments, people linked by Twitter do.)
In Cognitive Surplus he writes about the ability to harness huge untapped human potential. For example, the tuned-out TV time of the average Westernized society represents a significant amount of lost “cognition”. If it were possible to recover just a small percentage of that wasted human capital in the pursuit of just about anything, tremendous things could happen. Given the emerging abilities of internet societies to both encourage and allow everyone to contribute, we might be at the start of a tremendous acceleration in human achievement (e.g. see how online gamers solve AIDS protein puzzle).