(Excerpt from original post on the Taneja Group News Blog)
(Quoting myself from a BlueData press release today -) “Taneja predicted that 2014 would be the year of virtualized Big Data computing, and BlueData is proving that out,” said Mike Matchett, senior analyst and consultant at Taneja Group. “BlueData essentially virtualizes scale-out computing, turning a physical cluster into a Big Data cloud platform with elastic provisioning and policy-driven management. Best of all, BlueData helps companies leverage their Big Data wherever it currently exists, streaming it in with performance-boosting technologies to the self-provisioning Hadoop/NoSQL cloud. With this leap, companies of all sizes can now readily make progress on broader, more aggressive Big Data visions.”
I’ve written before about the opportunities and benefits that virtualizing big data clusters can provide — especially for quick spin-up/spin-down use cases, migrations, and test/dev — and also about the various storage options for Hadoop (see our Taneja Group BrightTalk channel for some past presentations). Existing Hadoop virtual hosting solutions like VMware BDE and the OpenStack Sahara project (fka Project Savanna) have proven out these use cases, but there is still a fundamental problem: how to best handle corporate data. If we virtualize HDFS nodes, we aren’t going to tackle PB-scale data sets. If we go with native HDFS on commodity servers, we’ll miss critical enterprise features. And if we try to use enterprise SANs, we suffer performance penalties, not to mention possibly dedicating expensive storage solely to the cluster. (And copying big data sets to AWS ECS? Yikes!)
We really only want one master copy of our data if we can help it, but it also must be secure, protected, shared in a workflow manner (file and transactional access through other protocols), performant, and highly available. MapR might get us all that for physical Big Data clusters, but we need it for virtual compute clusters too. BlueData bridges this gap by providing virtualized hosting for the compute side of the Hadoop ecosystem (and other scale-out big data compute solutions), while baking in underneath an optimizing IO “service” that channels in existing enterprise storage, fronting it as HDFS to the virtually hosted Hadoop nodes.
You could call this HDFS virtualization, but it’s not the virtual hosting of HDFS nodes as in BDE or Project Serengeti, nor the complete remote indirection of HDFS like EMC Isilon offers. Rather, it’s more like abstraction — akin to what IBM SVC does for regular storage. EMC’s ViPR HDFS used with VMware BDE might in some ways be seen as functionally comparable, but ViPR requires some modification to the Hadoop environment to work and isn’t integrated with BDE to provide any IO performance optimizations.
What are these performance optimizations? First, BlueData has a native caching solution underneath the virtual compute clusters called IOBoost, and a related SAN/NAS attachment facility called DataTap. Together these can be used to pull and stream any existing data from where it sits into the virtualized clusters for analysis, without dedicating, duplicating, or moving data unnecessarily. What I really like is that all of an organization’s existing data processing systems can simply “share” their data from its existing storage with analytics running in the virtualized big data clusters. Internally, IT can now offer an elastic big data “cloud” over corporate data sets without having to stage, build, or maintain any new storage solutions.
Today’s news from BlueData is that they are offering a free (in perpetuity) 5-node license of their full enterprise EPIC platform, not just the free one-node community edition already available. With no restrictions on cores or storage, full cloud-like multi-tenant provisioning, and the ability to analyze existing data where it currently sits, it seems downright hard not to grab this free license and stand up an internal big data cloud of your own.
…(read the full post)