Virtualizing Hadoop Impacts Big Data Storage
An IT industry analyst article published by Enterprise Storage Forum.
by Mike Matchett, Sr. Analyst, Taneja Group
Hadoop is soon coming to enterprise IT in a big way. VMware’s new vSphere Big Data Extensions (BDE) commercializes its open source Project Serengeti to make it dead easy for enterprise admins to spin virtual Hadoop clusters up and down at will.
Now that VMware has made it clear that Hadoop is going to be fully supported as a virtualized workload in enterprise vSphere environments, here at Taneja Group we expect a rapid pickup in Hadoop adoption across organizations of all sizes.
However, Hadoop is all about mapping parallel compute jobs intelligently over massive amounts of distributed data. Cluster deployment and operation are becoming very easy for the virtual admin, but in a virtual environment where storage can be effectively abstracted from compute clients, there are some important complexities and opportunities to consider when designing the underlying storage architecture. Specific concerns with running Hadoop in a virtual environment include how to configure virtual data nodes, how best to utilize local hypervisor server DAS, and when to think about leveraging external SAN/NAS.
The main idea behind virtualizing Hadoop is to deploy Hadoop scale-out nodes as virtual machines instead of as racked commodity physical servers. Clusters can be provisioned on demand and elastically expanded or shrunk. Multiple Hadoop virtual nodes can be hosted on each physical hypervisor server and, as virtual machines, can easily be allocated more or fewer resources for a given application. Hypervisor-level HA/FT capabilities can be brought to bear on production Hadoop apps. VMware’s BDE even includes QoS algorithms that help prioritize clusters dynamically, shrinking lower-priority cluster sizes as necessary to ensure high-priority cluster service.
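To make the idea of priority-driven shrinking concrete, here is a minimal Python sketch. It is not VMware's BDE algorithm, which this article does not describe in detail; the Cluster class, the reclaim_nodes function, the per-cluster node floor, and the "higher number means higher priority" convention are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical model of a virtual Hadoop cluster (not a BDE type)."""
    name: str
    priority: int   # assumption: a higher number means a more important cluster
    nodes: int      # current number of virtual nodes
    min_nodes: int  # assumed floor below which the cluster is never shrunk


def reclaim_nodes(clusters, requester, needed):
    """Shrink lower-priority clusters to free up to `needed` nodes for `requester`.

    Clusters are visited from lowest to highest priority, and only clusters
    with a strictly lower priority than the requester are touched. Returns the
    number of nodes actually moved, which may be less than `needed`.
    """
    reclaimed = 0
    for cluster in sorted(clusters, key=lambda c: c.priority):
        if reclaimed >= needed:
            break
        if cluster is requester or cluster.priority >= requester.priority:
            continue
        spare = cluster.nodes - cluster.min_nodes
        take = min(spare, needed - reclaimed)
        if take > 0:
            cluster.nodes -= take
            reclaimed += take
    requester.nodes += reclaimed
    return reclaimed


# Example: a production cluster asks for 5 more nodes.
prod = Cluster("prod", priority=3, nodes=4, min_nodes=4)
test = Cluster("test", priority=2, nodes=5, min_nodes=2)
dev = Cluster("dev", priority=1, nodes=6, min_nodes=2)
moved = reclaim_nodes([prod, test, dev], prod, needed=5)
```

In this example the dev cluster gives up its four spare nodes first and the test cluster supplies the remaining one, so the production cluster grows without any cluster dropping below its floor. A real scheduler would also have to account for where the data lives before removing data nodes, which is exactly the storage question this article raises.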
…(read the complete as-published article there)