(Excerpt from original post on the Taneja Group News Blog)
Today I had the opportunity to hear about Amazon’s Big Data solutions at an Amazon Web Services Big Data Summit here in Boston. At this event, crowded with local tech talent from the hot bio, research, financial, and web industries, AWS showcased their EC2 compute cluster instances for HPC and their Elastic MapReduce service, which runs Hadoop in the cloud, and trotted out several interesting real-world users.
John Rauser, an Amazon big data scientist, presented a thought-provoking session on how Amazon uses and views Big Data. I wouldn’t want to shamelessly steal his material, but I just have to relate his definition of Big Data. He said that you are really dealing with Big Data at the point when you have data that needs distributed processing. In other words, it’s Big because it’s more than a single node or a single monolithic application can handle today, or can even be expected to handle “forever” as the dataset grows. Once you have to cross the threshold to “distributed,” you have effectively entered the land of the Big.
In this view, the effective Big Data market isn’t just the “extremely Big Data” of petabyte-sized datasets that others are talking about. Those petabyte apps get a lot of news, but they are out on the long tail of datasets. Rather, the Big Data opportunity is every potential analysis and app that is just out of reach of current single-node IT systems, developers, and operators, up to and including the petabyte monsters. That covers a really broad swath of datasets, one that isn’t limited by a hard threshold on dataset size.
I really like that Big Data definition. It’s a practical and useful way to think about when you might, and should, get into Big Data technologies. And it clearly plays into Amazon’s strategies to enable both cost-effective analytical development and ongoing cluster operations for any indefinitely scalable dataset, even if those datasets are only a few hundred GB today. Since every dataset could conceivably grow large, eventually all datasets will be Big Data. If you are developing any new data analytics apps, the time to develop them as Big Data apps might be now.
…(read the full post)