Data Stream Mining with Cube

Time-series data analysis can be approached in two ways. Traditionally, time-series data is aggregated into partitioned historical databases and then reported on at scheduled intervals; commonly, reports delivered today cover data collected yesterday. A modern approach (and perhaps the most relevant to Big Data) is to recognize that time-series data just “keeps coming”. And since the timeliest analysis could theoretically deliver the most value, visualizations should update as soon as the data streams in.

Square’s evolving Cube library (still at an early version 0) enables web developers to easily deliver real-time charting of streaming time-series data on dynamic web pages:

Cube is an open-source system for visualizing time series data, built on MongoDB, Node and D3. If you send Cube timestamped events (with optional structured data), you can easily build realtime visualizations of aggregate metrics for internal dashboards.
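To make “send Cube timestamped events” concrete, here is a minimal sketch of posting one event to a local Cube collector from Python. The collector port (1080), the /1.0/event/put endpoint, and the event shape are assumptions drawn from Cube’s 0.x documentation, so verify them against the Cube wiki for the version you run:

```python
import json
import urllib.request
from datetime import datetime, timezone

# A minimal sketch of sending one timestamped event to a local Cube
# collector. Assumptions (check the Cube wiki for the version you run):
# the default collector port 1080 and the /1.0/event/put endpoint; the
# type/time/data event shape follows the description quoted above.
COLLECTOR_URL = "http://localhost:1080/1.0/event/put"

events = [{
    "type": "request",                                  # event type
    "time": datetime.now(timezone.utc).isoformat(),     # ISO 8601 timestamp
    "data": {"path": "/checkout", "duration_ms": 234},  # optional structured payload
}]

req = urllib.request.Request(
    COLLECTOR_URL,
    data=json.dumps(events).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)  # urllib sends a POST when data is provided
```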

I’ve spent a large chunk of my professional life working at IT system management vendors, each of whom spent significant resources to build and deliver proprietary event and time-series data analysis and visualization tools. In the last few years, successful open source discrete event monitoring and management tools (thresholds, alerts, etc.) have really disrupted the market for old-school proprietary event solutions. Open source time-series solutions like Cube have similar potential to disrupt proprietary time-series analysis markets.

Time-Series Data Stream Mining

Real-time time-series visualization is fundamentally data stream mining. Maybe not at Big Data scales, but there are certainly hints about the future of Big Data stream mining in the way Cube is architected.
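To illustrate the other half of Cube’s pipeline, here is a hedged sketch of polling Cube’s evaluator for an aggregate metric over fixed time buckets. The evaluator port (1081), the /1.0/metric endpoint, the sum(request) expression syntax, and the millisecond step values are assumptions drawn from Cube’s 0.x documentation, so check them against your version:

```python
import json
import urllib.parse
import urllib.request

# A sketch of polling Cube's evaluator for an aggregate metric.
# Assumptions (verify against the Cube wiki for your version): the
# default evaluator port 1081, the /1.0/metric endpoint, Cube's
# metric-expression syntax such as sum(request), and step sizes
# expressed in milliseconds (here 300000 = 5-minute buckets).
params = urllib.parse.urlencode({
    "expression": "sum(request)",      # aggregate all "request" events
    "start": "2011-09-09T00:00:00Z",   # window start (ISO 8601)
    "stop": "2011-09-10T00:00:00Z",    # window end
    "step": 300000,                    # 5-minute buckets
})

url = "http://localhost:1081/1.0/metric?" + params
with urllib.request.urlopen(url) as resp:
    for point in json.load(resp):      # expected shape: [{time, value}, ...]
        print(point["time"], point.get("value"))
```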

What is the Question?

The answer, I’m sure, is innovation.

Practically speaking, the first thing to do is figure out the questions to ask. Don’t stick to the questions that are already hanging out there waiting to be answered; create new questions that you couldn’t answer before you had your Big Data. And don’t forget that your data isn’t limited to what’s in-house: you can find and mash up “tons” of public, government, and licensed data sets.

Data mining, just like data visualization, is as much art as science…

When You Have a Traditional Question, All Data Looks Traditional

Old mine near Woodburn, Oregon - Image by OSU Special Collections & Archives via Flickr

Is the challenge simply to map and reduce the Big Data into smaller data so we can look at it the same way we always have? So we can support the same business processes, the same decision-making? Answer the same questions, just at a larger scale?

The real challenge is to think differently – to ask different questions that can only be answered by unlocking the Big Information spread across the Big Data. The whole process, from data gathering through mining, analysis, visualization, and presentation, needs to be designed to help create and answer these new and different questions.


Big Data Analytics – Intelligence for Disruption

Taken together, the V-word characteristics of Big Data both identify and shape the kinds of innovative solutions that can be created from Big Data opportunities.  These solutions will tend to provide intelligence more than absolute truth.

Disruption is the Real Opportunity

Hurricane Irene makes landfall in North Carolina - Image by NASA Goddard Photo and Video via Flickr

It’s worth keeping in mind that adding Big Data analysis to a current business isn’t the whole enchilada. Having better intelligence than the next guy is a great competitive advantage, but in itself it isn’t “disruptive.” What makes Big Data analysis exciting is the idea that it will enable game-changing new business opportunities, not simply add insight to current processes or decision-support practices.

Entrepreneurs who create new ways of doing business fueled by Big Data intelligence will dominate. The key difference between improving current business and disruptive innovation is looking for answers to new and different questions. That sounds easy enough, but it is truly difficult creative work.

Big Data Doesn’t Come with an Instruction Manual

Big Data sets don’t start with a schema model that defines the answers “findable” within them; a Big Data set is not just a huge BI warehouse. Rather, it takes a cunning mind and a dedicated soul to explore Big Data – for example, trying various map/reduce algorithms to find new patterns, and assembling new visualizations to discover new ways of looking and seeing.

This skilled data mining and keen perceptive ability must be fused with an entrepreneurial mindset that is always evaluating how any new Big Data intelligence could be formed into new and ultimately disruptive innovation.
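As a toy illustration of the exploratory map/reduce loop described above (a sketch of the general shape, not anyone’s production method), here is a tiny in-memory version. The records and candidate pattern keys are invented for the example; a real Big Data pass would run the same shape on Hadoop, MongoDB’s mapReduce, or similar:

```python
from collections import Counter
from itertools import chain

# A toy, in-memory illustration of exploratory map/reduce: map each
# record to candidate "pattern" keys, then reduce by counting. The
# records and keys are made up for this sketch.
records = [
    {"user": "a", "path": "/checkout", "hour": 14},
    {"user": "b", "path": "/checkout", "hour": 14},
    {"user": "a", "path": "/help", "hour": 3},
]

def map_record(rec):
    # Emit one key per hypothesis we want to test, e.g. traffic by
    # path, by hour of day, and by (path, hour) together.
    yield ("path", rec["path"])
    yield ("hour", rec["hour"])
    yield ("path+hour", (rec["path"], rec["hour"]))

def reduce_counts(mapped):
    # Reduce step: count occurrences of each emitted key.
    return Counter(mapped)

counts = reduce_counts(chain.from_iterable(map_record(r) for r in records))
for key, n in counts.most_common(5):
    print(key, n)
```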

Big Data Defined by the V’s

There are lots of definitions of Big Data. Most of them are fuzzy marketing speak along the lines of “Big Data is just bigger than your old data, too big to deal with the same way you dealt with data before.” Amusingly, a lot of the examples given as “historical” Big Data successes are based on traditional data methods and technologies applied to very large amounts of traditional data.

Data represented in an interactive 3-D form - Image by Idaho National Laboratory via Flickr

Clearly something new is happening with the way we can get value out of very large data sets, but it’s really hard to see where the line between Big Data and not-so-Big Data falls. Most pundits seem to say we can spot Big Data the same way we know what’s obscene: we’d simply recognize it when we see it. The irony, of course, is that Big Data is just too big to see, or to visualize, as it is.

Think how big a picture it would take to show a 5 PB Big Data set at one pixel per data point.
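For scale, a quick back-of-the-envelope calculation. The one-byte-per-data-point figure is my own simplifying assumption, purely for illustration:

```python
# Back-of-the-envelope arithmetic for the picture above. Assumption
# (mine, for illustration): one data point per byte, one pixel each.
data_points = 5 * 10**15            # 5 PB, at one byte per data point
pixels_per_4k_screen = 3840 * 2160  # about 8.3 million pixels

screens = data_points / pixels_per_4k_screen
print(f"{screens:,.0f} 4K screens")  # roughly 600 million screens
```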

Big Data by the V-words

I’ve read more than a few definitions that talk about some clever V-word characteristics that Big Data scientists need to be concerned with:

  1. Volume – Obviously Big Data is Big.
  2. Variety – Many identified Big Data sets are internally heterogeneous (e.g. Big Data document collections); the data isn’t collected or authored according to a single master schema.
  3. Velocity – Big Data sets tend to grow rapidly, even as we use them. This implies dynamic, possibly real-time, behavior as well.

I’d add a fourth V:

  4. Veracity – Or rather, the lack thereof. Raw Big Data is often neither verified nor validated (until it is processed for that goal specifically, e.g. for security fraud). Analysis can’t always be duplicated (as the data keeps growing and changing). Duplication, omission, and general incompleteness are to be expected.

It may be impossible to definitively repeat the same analysis on a truly big Big Data set. If results can’t be exactly reproduced (or explained back to the raw data), they can’t serve as literal truth.
