BlueData: Big Data Analysis Clouds on Enterprise Data Where it Already Lives

(Excerpt from original post on the Taneja Group News Blog)

(Quoting myself from a BlueData press release today:) “Taneja predicted that 2014 would be the year of virtualized Big Data computing, and BlueData is proving that out,” said Mike Matchett, senior analyst and consultant at Taneja Group. “BlueData essentially virtualizes scale-out computing, turning a physical cluster into a Big Data cloud platform with elastic provisioning and policy-driven management. Best of all, BlueData helps companies leverage their Big Data wherever it currently lives, streaming it in with performance-boosting technologies to the self-provisioning Hadoop/NoSQL cloud. With this leap, companies of all sizes can now readily make progress on broader, more aggressive Big Data visions.”

I’ve written before about the opportunities and benefits that virtualizing big data clusters can provide (especially for quick spin-up/spin-down use cases, migrations, and test/dev), and also about the various storage options for Hadoop (see our Taneja Group BrightTalk channel for some past presentations). Existing Hadoop virtual hosting solutions like VMware BDE and the OpenStack Sahara project (formerly Project Savanna) have proven out these use cases, but there is still a fundamental problem: how best to handle corporate data. If we virtualize HDFS nodes, we aren’t going to tackle PB-scale data sets. If we go with native HDFS on commodity servers, we’ll miss critical enterprise features. And if we try to use enterprise SANs, we suffer performance penalties, not to mention possibly dedicating expensive storage only to the cluster. (And copying big data sets to AWS ECS? Yikes!)

We really only want one master copy of our data if we can help it, but that copy must also be secure, protected, shared across workflows (with file and transactional access through other protocols), performant, and highly available. MapR might get us all that for physical Big Data clusters, but we need it for virtual compute clusters too. BlueData bridges this gap by providing virtualized hosting for the compute side of the Hadoop ecosystem (and other scale-out big data compute solutions), while baking in underneath an optimizing IO “service” that channels in existing enterprise storage, fronting it as HDFS to the virtually hosted Hadoop nodes.

You could call this HDFS virtualization, but it’s not the virtual hosting of HDFS nodes as in BDE or Project Serengeti, nor the complete remote indirection of HDFS that EMC Isilon offers. Rather, it’s more like abstraction, akin to what IBM SVC does for regular storage. EMC’s ViPR HDFS used with VMware BDE might in some ways be seen as functionally comparable, but ViPR requires some modification to the Hadoop environment to work and isn’t integrated with BDE to provide any IO performance optimizations.

What are these performance optimizations? First, BlueData provides a native caching layer underneath the virtual compute clusters called IOBoost; second, a related SAN/NAS attachment facility called DataTap. Together these can be used to pull and stream existing data from where it sits into the virtualized clusters for analysis, without dedicating, duplicating, or moving data unnecessarily. What I really like is that all of an organization’s existing data processing systems can simply “share” their data from its existing storage with analytics running in the virtualized big data clusters. Internally, IT can now offer an elastic big data “cloud” on corporate data sets without having to stage, build, or maintain any new storage solutions.
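
To make the access model concrete, here is a minimal sketch of what a job-side read against DataTap-fronted storage might look like, assuming BlueData registers an HDFS-compatible dtap:// filesystem with the Hadoop client; the host name and path below are hypothetical, and exact URIs would depend on the EPIC tenant configuration:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DataTapReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ordinary Hadoop FileSystem API calls; only the URI scheme points at
        // the DataTap-fronted enterprise storage rather than cluster-local HDFS.
        // "corpnas" and the path are illustrative placeholders.
        FileSystem fs = FileSystem.get(URI.create("dtap://corpnas/sales/"), conf);
        Path file = new Path("dtap://corpnas/sales/q1.csv");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // records stream in from where they already live
            }
        }
    }
}
```

The point of the sketch is that the analytics code stays stock Hadoop; nothing gets copied into a dedicated cluster store first.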

Today’s news from BlueData is that they are offering a free (in perpetuity) 5-node license of their full enterprise EPIC platform, not just the free one-node community edition already available. With no restrictions on cores or storage, full cloud-like multi-tenant provisioning, and the ability to analyze existing data where it currently sits, it seems downright hard not to grab this free license and stand up an internal big data cloud of your own.

…(read the full post)

Virtualizing Hadoop Impacts Big Data Storage

An IT industry analyst article published by Enterprise Storage Forum.

by Mike Matchett, Sr. Analyst, Taneja Group
Hadoop is soon coming to enterprise IT in a big way. VMware’s new vSphere Big Data Extensions (BDE) commercializes its open source Project Serengeti to make it dead easy for enterprise admins to spin up and down virtual Hadoop clusters at will.

Now that VMware has made it clear that Hadoop is going to be fully supported as a virtualized workload in enterprise vSphere environments, here at Taneja Group we expect a rapid pickup in Hadoop adoption across organizations of all sizes.

However, Hadoop is all about mapping parallel compute jobs intelligently over massive amounts of distributed data. Cluster deployment and operation are becoming very easy for the virtual admin. But in a virtual environment, where storage can be effectively abstracted from compute clients, there are some important complexities and opportunities to consider when designing the underlying storage architecture. Specific concerns with running Hadoop in a virtual environment include how to configure virtual data nodes, how best to utilize local hypervisor server DAS, and when to leverage external SAN/NAS.
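
As one concrete illustration of the DAS question, the stock HDFS configuration properties below control where a data node writes its blocks. In a virtual data node, each listed directory would typically map to its own virtual disk, which the admin can back with local hypervisor DAS for raw bandwidth or with SAN/NAS when hypervisor HA/FT matters more than throughput. A minimal sketch using standard Hadoop properties (the paths and values are hypothetical, not a recommendation):

```java
import org.apache.hadoop.conf.Configuration;

public class VirtualDataNodeLayout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // One data directory per virtual disk; placing each virtual disk on a
        // separate local (DAS) spindle or datastore lets HDFS I/O fan out
        // instead of funneling through a single device.
        // (Hadoop 1.x used dfs.data.dir for the same purpose.)
        conf.set("dfs.datanode.data.dir", "/mnt/disk1/dfs,/mnt/disk2/dfs,/mnt/disk3/dfs");
        // With several virtual nodes per physical host, replica placement
        // still needs rack/host awareness so copies land on distinct hardware.
        conf.set("dfs.replication", "3");
        System.out.println("data dirs: " + conf.get("dfs.datanode.data.dir"));
    }
}
```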

The main idea behind virtualizing Hadoop is to deploy Hadoop scale-out nodes as virtual machines instead of as racked commodity physical servers. Clusters can be provisioned on demand and elastically expanded or shrunk. Multiple Hadoop virtual nodes can be hosted on each physical hypervisor server, and as virtual machines they can easily be allocated more or fewer resources as a given application requires. Hypervisor-level HA/FT capabilities can be brought to bear on production Hadoop apps. VMware’s BDE even includes QoS algorithms that help prioritize clusters dynamically, shrinking lower-priority clusters as necessary to ensure high-priority cluster service.

…(read the complete as-published article there)