External storage might make sense for Hadoop

An IT industry analyst article published by SearchStorage.

Using Hadoop to drive big data analytics doesn’t necessarily mean building clusters of distributed storage; a good old array might be a better choice.

The original architectural design for Hadoop made use of relatively cheap commodity servers and their local storage in a scale-out fashion. Hadoop’s original goal was to enable cost-effective exploitation of data that was previously not viable to process. We’ve all heard about big data volume, variety, velocity and a dozen other “v” words used to describe these previously hard-to-handle data sets. With a target defined that broadly, most businesses can point to some kind of big data they’d like to exploit.

Big data is growing bigger every day, and storage vendors with their relatively expensive SAN and network-attached storage (NAS) systems are starting to work their way into the big data party. They can’t simply leave all that data to server vendors filling boxes with commodity disk drives. Even though Hadoop adoption is still in its early stages, the competition and the confusing marketing noise are ratcheting up.

In a Hadoop scale-out design, each physical node in the cluster hosts both local compute and a share of the data; it’s intended to support applications, such as search, that often need to crawl through massive data sets. Much of Hadoop’s value lies in how effectively it executes parallel algorithms over distributed data chunks across a scale-out cluster.

Hadoop is made up of a compute engine based on MapReduce and a data service called the Hadoop Distributed File System (HDFS). Hadoop takes advantage of high data “locality” by spreading big data sets over many nodes using HDFS, farming out parallelized compute tasks to each data node (the “map” part of MapReduce), followed by various shuffling and sorting consolidation steps to produce a result (the “reduce” part).
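To make the map, shuffle and reduce steps concrete, here is a minimal single-process sketch of the pattern in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-counting task are assumptions chosen for illustration, and each string in `chunks` stands in for the slice of data a single data node would hold locally.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one local chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Shuffle/sort: group emitted values by key across all map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: consolidate each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # In a real cluster each map call would run on the node holding the chunk;
    # here the calls simply run one after another in a single process.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


# Hypothetical data: two chunks, as if stored on two different data nodes.
chunks = ["big data big clusters", "data nodes hold data"]
counts = word_count(chunks)
print(counts)
```

The point of the sketch is that the map step only ever touches one chunk at a time, which is what lets a cluster run it next to the data; only the much smaller intermediate pairs need to move between nodes during the shuffle.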

Commonly, each HDFS data node is assigned direct-attached storage (DAS) disks to work with. HDFS then replicates data across the data nodes, usually keeping two or three copies of each block on different nodes (three is the HDFS default). Replicas are placed on different server nodes, with the second replica placed on a different “rack” of nodes to help avoid rack-level loss. Replication takes up more raw capacity than RAID, but it also has advantages, such as avoiding RAID rebuild windows.
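The capacity trade-off between replication and RAID can be shown with simple arithmetic. The sketch below is a rough comparison under stated assumptions: 300 TB of raw disk, three-way replication, and a hypothetical RAID 6 layout of eight data disks plus two parity disks. It ignores hot spares, file system overhead and temporary space, and the function names are invented for this example.

```python
def usable_with_replication(raw_tb, copies):
    """Usable capacity when every block is stored `copies` times."""
    return raw_tb / copies


def usable_with_parity_raid(raw_tb, data_disks, parity_disks):
    """Usable capacity for a parity RAID group (e.g. RAID 6 as 8 data + 2 parity)."""
    return raw_tb * data_disks / (data_disks + parity_disks)


raw_tb = 300.0  # assumed raw capacity, for illustration only

replicated_tb = usable_with_replication(raw_tb, copies=3)
raid6_tb = usable_with_parity_raid(raw_tb, data_disks=8, parity_disks=2)

print(f"3x replication: {replicated_tb:.0f} TB usable")  # 100 TB
print(f"RAID 6 (8+2):   {raid6_tb:.0f} TB usable")       # 240 TB
```

Under these assumptions replication leaves about a third of the raw capacity usable, while the parity layout leaves 80 percent. What the arithmetic does not capture is the other side of the trade: a lost replica is re-copied from the surviving copies across the cluster, with no RAID rebuild window.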

So if HDFS readily handles the biggest of data sets in a way native to the MapReduce style of processing, uses relatively cheap local disks and provides built-in “architecture-aware” replication, why consider enterprise-class storage? …

…(read the complete as-published article there)