Big data analytics applications impact storage systems

An IT industry analyst article published by SearchStorage.


Whether driven by direct competition or internal business pressure, CIOs, CDOs and even CEOs today are looking to squeeze more value, more insight and more intelligence out of their data. They can no longer afford to archive, ignore or throw away data if it can be turned into a valuable asset. At face value, it might seem like a no-brainer — “we just need to analyze all that data to mine its value.” But, as you know, keeping any data, much less big data, has a definite cost. Processing larger amounts of data at scale is challenging, and hosting all that data on primary storage hasn’t always been feasible.

Historically, unless data had some corporate value — possibly as a history trail for compliance, a source of strategic insight or intelligence that could optimize operational processes — it was tough to justify keeping it. Today, thanks in large part to big data analytics applications, that thinking is changing. Bulky, low-level data may offer little immediate value, but it could hold great future potential, so you want to keep it — once it’s gone, you lose any downstream opportunity.

To extract value from all that data, however, IT must not only store increasingly large volumes of data, but also architect systems that can process and analyze it in multiple ways.

…(read the complete as-published article there)

Navigate data lakes to manage big data

An IT industry analyst article published by SearchStorage.


Big data sure is exciting to business folks, with all sorts of killer applications just waiting to be discovered. And you no doubt have a growing pile of data bursting the seams of your current storage infrastructure, with lots of requests to mine even more voluminous data streams. Haven’t you been collecting microsecond end-user behavior across all your customers and prospects, not to mention collating the petabytes of data exhaust from instrumenting your systems to the nth degree? Imagine the insight management would have if they could look at all that data at once. Forget about data governance, data management, data protection and all those other IT worries — you just need to land all that data in a relatively scale-cheap Hadoop cluster!

Seriously, though, big data lakes can meet growing data challenges and provide valuable new services to your business. Collecting a wide variety of business-relevant data sets in one place and enabling scalable, big data-style analytics over them opens up many new data mining opportunities. The total potential value of a data lake grows with the amount of useful data it holds available for analysis. And one of the key tenets of big data and the data lake concept is that you don’t have to create a master schema ahead of time, so non-linear growth is possible.
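To make the no-master-schema point concrete, here is a minimal schema-on-read sketch. It assumes an analytics engine such as Apache Spark running over the lake (Spark is my choice for illustration, not something named in the article), plus a hypothetical hdfs:///lake/raw/clickstream/ landing directory holding raw JSON events; any schema is inferred only at read time, per analysis.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SchemaOnReadSketch {
    public static void main(String[] args) {
        // Spark session running against the Hadoop-based data lake.
        SparkSession spark = SparkSession.builder()
                .appName("schema-on-read-sketch")
                .getOrCreate();

        // Raw JSON events were landed in the lake untouched; no schema was
        // declared when they were written. A schema is inferred here, at
        // read time, and only for this particular analysis.
        Dataset<Row> events = spark.read()
                .json("hdfs:///lake/raw/clickstream/"); // hypothetical path

        events.printSchema();

        // A quick exploratory query; "userId" is a hypothetical field.
        events.groupBy("userId").count().show(20);

        spark.stop();
    }
}
```

The same raw files can later be read with a different, stricter schema by another team, with no upfront modeling required.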

The enterprise data lake, or data hub, concept was first proposed by big data vendors like Cloudera and Hortonworks, ostensibly built on vanilla scale-out HDFS-based commodity storage. But it just so happens that the more data you keep on hand, the more storage of all kinds you will need. Eventually, all corporate data is likely to be considered big data. However, not all of that corporate data is best hosted on a commodity scale-out HDFS cluster.

So, today, traditional storage vendors are signing on to the big data lake vision. From a storage marketing perspective, it seems like data lakes are the new cloud. “Everyone needs a data lake. How can you compete without one (or two or three)?” And there are a variety of enterprise storage options for big data, including remote storage that acts like HDFS, Hadoop virtualization layers that can translate other storage protocols into HDFS, and scalable software-defined storage.
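As a rough illustration of the “acts like HDFS” idea, the sketch below uses Hadoop’s pluggable FileSystem interface to address a non-HDFS store by URI scheme. It assumes a Hadoop-compatible connector is on the classpath (for example, the s3a connector from hadoop-aws, or a vendor’s own driver under its own scheme); the bucket and path names are made up.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AlternateStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Any storage that implements the Hadoop-compatible FileSystem API
        // can be addressed by URI scheme. Here an S3-style object store
        // stands in for "remote storage that acts like HDFS"; vendor
        // drivers plug in the same way under their own schemes.
        FileSystem lake = FileSystem.get(
                URI.create("s3a://analytics-lake/raw/"), conf); // hypothetical bucket

        for (FileStatus status : lake.listStatus(new Path("s3a://analytics-lake/raw/"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
    }
}
```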

…(read the complete as-published article there)

New choices bring enterprise big data home

An IT industry analyst article published by SearchDataCenter.

Enterprises recognize the tantalizing value of big data analytics, but traditional concerns about data management and security have held back deployments — until now.


When big data practices come to your organization, it’s all about location, location, location.

I’ve heard recently from a bunch of big-data-related vendors that are all vying to gain from your sure-to-grow big data footprint. After all, big data isn’t about minimizing your data set, but about making the best use of as much data as you can possibly manage. That’s not a bad definition of big data if you’re still looking for one. With all this growing data, you will need a growing data center infrastructure to match.

This big data craze really got started with Apache Hadoop and its Hadoop Distributed File System (HDFS), which unlocked the vision of massive data analysis based on cost-effective scale-out clusters of commodity servers using relatively cheap local attached disks. Hadoop and its ecosystem of solutions let you keep and analyze all kinds of data in its natural, raw, low-level form (i.e., not fully structured as in a database), no matter how much you pile up or how fast it grows.
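As a small illustration of keeping data in its raw form, here is a hedged sketch that lands an application log file into HDFS untouched, using the standard Hadoop FileSystem Java API; the file and directory paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawLandingSketch {
    public static void main(String[] args) throws Exception {
        // Default config picks up core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        // Copy a raw application log into the cluster as-is: no parsing,
        // no schema, no transformation. Structure is applied later, only
        // when an analysis actually needs it.
        Path local = new Path("/var/log/app/events-2014-05-01.log"); // hypothetical
        Path target = new Path("/data/raw/app-logs/");               // hypothetical
        hdfs.mkdirs(target);
        hdfs.copyFromLocalFile(local, target);
    }
}
```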

The problem is that once you get beyond, err, data science projects, old familiar enterprise data management issues return to the forefront, including data security, protection, reliability, operational performance and creeping Opex costs.

While Hadoop and HDFS mature with each release, there are still a lot of gaps when it comes to meeting enterprise requirements. It turns out that those commodity scale-out clusters of direct-attached storage (DAS) might not actually offer the lowest total cost of ownership when big data lands in production operations…

…(read the complete as-published article there)

Snakes in the Data Center: EMC ViPR Slithers In

(Excerpt from original post on the Taneja Group News Blog)

Storage experts know that there are two ways to handle crushing data growth – the kind of growth that exceeds our traditional scale-up storage array capabilities (in one way or another). The bad way is to keep plopping down more copies of those arrays, which tends to spiral OPEX out of control – there isn’t as much OPEX efficiency at scale as we might naively think.

…(read the full post)

External storage might make sense for Hadoop

An IT industry analyst article published by SearchStorage.

Using Hadoop to drive big data analytics doesn’t necessarily mean building clusters of distributed storage; a good old array might be a better choice.


Using Hadoop to drive big data analytics doesn’t necessarily mean building clusters of distributed storage — good old external storage might be a better choice.

The original architectural design for Hadoop made use of relatively cheap commodity servers and their local storage in a scale-out fashion. Hadoop’s original goal was to enable cost-effective exploitation of data that was previously not viable. We’ve all heard about big data volume, variety, velocity and a dozen other “v” words used to describe these previously hard-to-handle data sets. Given such a broad target by definition, most businesses can point to some kind of big data they’d like to exploit.

Big data is growing bigger every day, and storage vendors, with their relatively expensive SAN and network-attached storage (NAS) systems, are starting to work their way into the big data party. They can’t simply leave all that data to server vendors filling boxes with commodity disk drives. Even if Hadoop adoption is just in its early stages, the competition and confusing marketing noise are ratcheting up.

In a Hadoop scale-out design, each physical node in the cluster hosts both local compute and a share of data; it’s intended to support applications, such as search, that often need to crawl through massively large data sets. Much of Hadoop’s value lies in how it effectively executes parallel algorithms over distributed data chunks across a scale-out cluster.

Hadoop is made up of a compute engine based on MapReduce and a data service called the Hadoop Distributed File System (HDFS). Hadoop takes advantage of high data “locality” by spreading big data sets over many nodes using HDFS, farming out parallelized compute tasks to each data node (the “map” part of MapReduce), followed by various shuffling and sorting consolidation steps to produce a result (the “reduce” part).
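To make the map, shuffle and reduce steps concrete, here is the classic word-count pattern sketched against the Hadoop MapReduce Java API. It is a minimal illustration rather than anything from the article; input and output paths are supplied as arguments.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

    // The "map" step runs on each data node, close to its local HDFS blocks,
    // emitting (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // After the shuffle/sort consolidates each word's pairs onto one reducer,
    // the "reduce" step sums the counts.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count-sketch");
        job.setJarByClass(WordCountSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input data set in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```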

Commonly, each HDFS data node will be assigned DAS disks to work with. HDFS then replicates data blocks across the cluster, usually keeping two or three copies, each on a different data node. Replicas are placed on different server nodes, with the second replica placed on a different “rack” of nodes to help avoid rack-level loss. Obviously, replication takes up more raw capacity than RAID, but it also has advantages, such as avoiding long RAID rebuild windows.
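For reference, the replica count is just a setting. The sketch below shows one hedged way to control it from the Hadoop Java API (the cluster-wide default normally lives in hdfs-site.xml as dfs.replication); the file path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default replica count (normally set in hdfs-site.xml).
        conf.set("dfs.replication", "3");

        FileSystem hdfs = FileSystem.get(conf);
        Path dataset = new Path("/data/raw/app-logs/events-2014-05-01.log"); // hypothetical

        // Replication can also be tuned per file, e.g., fewer copies for
        // colder, less critical data sets.
        hdfs.setReplication(dataset, (short) 2);

        FileStatus status = hdfs.getFileStatus(dataset);
        System.out.println(dataset + " now targets "
                + status.getReplication() + " replicas");
    }
}
```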

So if HDFS readily handles the biggest of data sets in a way native to the MapReduce style of processing, uses relatively cheap local disks and provides built-in “architecture-aware” replication, why consider enterprise-class storage? …

…(read the complete as-published article there)

Myths surrounding big data technology

An IT industry analyst article published by SearchStorage.

Big data technology is a big deal for storage shops, and a clear understanding of what it means is required to configure storage for big data apps.


I love the idea of changing the world through big data technology. Big data promises we’ll all be IT superheroes just by storing more raw data than ever before and then using parallel processing techniques to yield great new insights that will catapult our company to the top. Good storage is costly and the rate at which interesting new data is produced increases daily, but Apache Hadoop’s approach calls for leveraging scale-out commodity server nodes with cheap local disks.

Of course, there’s more to it. Conceptually, big data products bring new ways to store and analyze the mountains of data that we used to discard. There’s certainly information and insight to be mined, but the definitions are fuzzy, the hype is huge and the mining technologies themselves are still rapidly evolving.

Adding to the confusion, big data technology has been enthusiastically marketed by just about every storage vendor on the planet. But despite the marketing, I believe it’s just a matter of time before every competitive IT shop has a real big-data solution to implement or manage, if only because of staggering data growth. For those just setting out on a big data journey, watch out for these common myths.

Myth No. 1: Just do it

A sure way to waste a lot of money is to aggregate tons of data on endlessly scalable clusters and hope that your star data scientist will someday discover the hidden keys to eternal profit.

To succeed with any IT project, big data included, you need to have a business value proposition in mind and an achievable plan laid out. Research is good and those “aha” moments can be exciting, but by the time big data gets to IT, there needs to be a more practical goal than just a desire to “see what might be in there.”

Myth No. 2: Store everything

…(read the complete as-published article there)

Nothing’s Too Fast for Operational Intelligence – ScaleOut Software’s hServer

(Excerpt from original post on the Taneja Group News Blog)

There are a lot of HPC technologies coming soon to a data center near you! The latest offering from ScaleOut Software, known for its in-memory data grid solutions, is a customized in-memory data grid for Hadoop. This enables blistering-fast, big data-style, real-time analysis of dynamically changing data. Solutions that use it turn live operational data into actionable intelligence – financials, reservation systems, live customer experience and so on.

…(read the full post)