Visualizing (and Optimizing) Cluster Performance

(Excerpt from original post on the Taneja Group News Blog)

Clusters are the scale-out way to go in today’s data center. Why not try to architect an infrastructure that can grow linearly in capacity and/or performance? Well, one problem is that operations can get quite complex, especially when you start mixing workloads and tenants on the same cluster. In vanilla big data solutions everyone can compete, and not always fairly, for the same resources. This is a growing problem in production environments where big data apps are starting to underpin key business-impacting processes.

…(read the full post)

Big data analytics applications impact storage systems

An IT industry analyst article published by SearchStorage.


Whether driven by direct competition or internal business pressure, CIOs, CDOs and even CEOs today are looking to squeeze more value, more insight and more intelligence out of their data. They no longer can afford to archive, ignore or throw away data if it can be turned into a valuable asset. At face value, it might seem like a no-brainer — “we just need to analyze all that data to mine its value.” But, as you know, keeping any data, much less big data, has a definite cost. Processing larger amounts of data at scale is challenging, and hosting all that data on primary storage hasn’t always been feasible.

Historically, unless data had some corporate value — possibly as a history trail for compliance, a source of strategic insight or intelligence that can optimize operational processes — it was tough to justify keeping it. Today, thanks in large part to big data analytics applications, that thinking is changing. All of that bulky, low-level data may have little immediate value, but it could hold great future potential, so you want to keep it — once it’s gone, you lose any downstream opportunity.

To extract value from all that data, however, IT must not only store increasingly large volumes of data, but also architect systems that can process and analyze it in multiple ways.

…(read the complete as-published article there)

What was BIG at Hadoop Summit 2015

(Excerpt from original post on the Taneja Group News Blog)

At this month’s Hadoop Summit 2015 I noted two big trends. One was the continuing focus on Spark as an expansion of the big data analytical ecosystem, with main sponsor Hortonworks (great show by the way!) and most vendors talking about how they support, interact with, or deliver Spark in addition to Hadoop’s MapReduce. The other was a very noticeable shift in focus, away from trotting out ever more gee-whiz big data use cases and toward making it all work in enterprise production environments. If you ask me, this second trend is the bigger deal for IT folks to pay attention to.

…(read the full post)

Navigate data lakes to manage big data

An IT industry analyst article published by SearchStorage.


Big data sure is exciting to business folks, with all sorts of killer applications just waiting to be discovered. And you no doubt have a growing pile of data bursting the seams of your current storage infrastructure, with lots of requests to mine even more voluminous data streams. Haven’t you been collecting microsecond-level end-user behavior across all your customers and prospects, not to mention collating the petabytes of data exhaust from instrumenting your systems to the nth degree? Imagine the insight management would have if they could look at all that data at once. Forget about data governance, data management, data protection and all those other IT worries — you just need to land all that data in a relatively cheap, scale-out Hadoop cluster!

Seriously, though, big data lakes can meet growing data challenges and provide valuable new services to your business. By collecting a wide variety of business-relevant data sets in one place and enabling multifaceted analytics based on big data approaches that easily scale, a data lake creates many new data mining opportunities. The total potential value of a data lake grows with the amount of useful data it holds available for analysis. And one of the key tenets of big data and the data lake concept is that you don’t have to create a master schema ahead of time; structure can be applied when the data is read and analyzed, so non-linear growth is possible.
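To make that schema-on-read idea concrete, here is a minimal sketch assuming PySpark and a hypothetical raw JSON clickstream already landed in the lake (the path and field names are illustrative, not from the article). Raw data is ingested as-is, and structure is only projected onto it when a question is asked.

```python
# Minimal schema-on-read sketch (PySpark assumed; paths and fields are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-lake-schema-on-read").getOrCreate()

# Ingest: no master schema is defined up front; the raw JSON files simply land in the lake.
raw_events = spark.read.json("hdfs:///lake/raw/clickstream/2015/*.json")

# Analysis: the schema is inferred/projected only at query time.
daily_visits = (raw_events
                .filter(F.col("event_type") == "page_view")        # hypothetical field
                .groupBy(F.to_date(F.col("timestamp")).alias("day"))
                .count())

daily_visits.show()
```

The design trade-off is that cleansing and conformance work shifts from ingest time to analysis time, which is exactly why governance still matters once the lake goes into production.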

The enterprise data lake, or data hub, concept was first proposed by big data vendors like Cloudera and Hortonworks, ostensibly using vanilla scale-out HDFS-based commodity storage. But it just so happens that the more data you keep on hand, the more storage of all kinds you will need. Eventually, all corporate data is likely to be considered big data. However, not all of that corporate data is best hosted on a commodity scale-out HDFS cluster.

So, today, traditional storage vendors are signing on to the big data lake vision. From a storage marketing perspective, it seems like data lakes are the new cloud. “Everyone needs a data lake. How can you compete without one (or two or three)?” And there are a variety of enterprise storage options for big data, including enterprise arrays that can provide remote storage that acts like HDFS, Hadoop virtualization layers that can translate other storage protocols into HDFS, and scalable software-defined storage options.

…(read the complete as-published article there)

Project Myriad Will Become Your Next Data Center Platform

(Excerpt from original post on the Taneja Group News Blog)

One of the big things bubbling around at Strata this week is talk about YARN, Mesos, and Project Myriad (initiated/sponsored by MapR). On one hand it seems that this is just about some evolution of the Hadoop scheduling layer, but looking at it with a critical eye, I see the impending culmination of what I predicted years ago – that the Hadoop ecosystem would quickly evolve to bring high-powered HPC technologies right into the heart of the next-gen enterprise data center.

…(read the full post)

New choices bring enterprise big data home

An IT industry analyst article published by SearchDataCenter.

Enterprises recognize the tantalizing value of big data analytics, but traditional concerns about data management and security have held back deployments — until now.


When big data practices come to your organization, it’s all about location, location, location.

I’ve heard recently from a bunch of big-data-related vendors that are all vying to gain from your sure-to-grow big data footprint. After all, big data isn’t about minimizing your data set, but making the best use of as much data as you can possibly manage. That’s not a bad definition of big data if you are still looking for one. With all this growing data, you will need a growing data center infrastructure to match.

This big data craze really got started with the Apache Hadoop Distributed File System (HDFS), which unlocked the vision of massive data analysis based on cost-effective, scale-out clusters of commodity servers using relatively cheap local attached disks. Hadoop and its ecosystem of solutions let you keep and analyze all kinds of data in their natural, raw, low-level form (i.e., not fully database-structured), no matter how much you pile up or how fast it grows.

The problem is that once you get beyond, err, data science projects, old familiar enterprise data management issues return to the forefront, including data security, protection, reliability, operational performance and creeping Opex costs.

While Hadoop and HDFS mature with each release, there are still a lot of gaps when it comes to meeting enterprise requirements. It turns out that those commodity scale-out clusters of direct-attached storage (DAS) might not actually offer the lowest total cost of ownership when big data lands in production operations…

…(read the complete as-published article there)

A Shift To Enterprise Features For Big Data Solutions: Notes from Strata/Hadoop World NY 2014

(Excerpt from original post on the Taneja Group News Blog)

I had a blast last week at Strata/Hadoop World NY 2014. I got a real sense that the mass of big data sponsors/vendors are finally focusing on what it takes to get big data solutions into production operations. In fact, in one of the early keynotes it was noted that the majority of the attendees were implementation-focused software engineers rather than analytical data scientists. Certainly there was no shortage of high-profile use cases bandied about and impressive sessions on advanced data science, but on the show floor much of the talk was about making big data work in real-world data centers.

I’ll certainly be diving into many of these topics more deeply, but here is a not-so-brief roundup of major themes culled from the 20+ sponsors I met with at the show:

…(read the full post)

A Billion Here, A Billion There – Big Data Is Big Money

(Excerpt from original post on the Taneja Group News Blog)

When we talk about big data today we aren’t talking just about the data and its three V’s (or up to 15, depending on whom you consult), but more and more about the promise of big transformation to the data center. In other words, it’s about big money.

First, consider recent news about some key Hadoop distro vendors. Many of them are now billion-dollar players, much of that on speculation and expectation of future data center occupation. When Pivotal spun off from EMC, it got to start with a gaggle of successful, commercially deployed products, giving it a tremendous day-one revenue stream. With GE taking a 10% outside stake at $105M, that made it a billion-dollar startup. Coming back from the Cloudera Analyst Event last month, we found that Cloudera was doing really well with $160M in new funding, but soon thereafter Intel weighed in to top them up to over a billion in funding (a valuation of $4.1B). Not to be left out in the cold, Hortonworks announced a $100M round that valued them at $1B (ostensibly they claim they could take in 20x more, but are raising funds as they need them).

Second, consider the infrastructure that not just billions but trillions (gadzillions?) of pieces of data still have to land on, even if it’s made up of commodity disk/server clusters. Of course most companies are going to want to build out big data solutions, or they risk getting left behind competitively. But many of these are eventually going to turn into massive investments that only grow as the data grows (i.e., growth predicted to be exponential!) and occupy more and more of the data center, not stay constrained as little R&D projects or simple data warehouse offloading platforms.

Clearly big data is now a playing field for competition amongst billionaires. I’m sure the many startups in that space are only encouraged by the ecosystem wealth and opportunity for acquisition, but as the big money grows, keep an eye on how standards and open source foundations increasingly become political battlefields, with results not always in the best interest of the end user.

While there is an open source model underpinning the whole ecosystem, with this much money on the table it will be interesting to see how fair competition plays out. From my perspective it looks like big data isn’t going to be very free, or there wouldn’t be billions of dollars in bets being made. Up until now most of the ecosystem vendors have been making arguments about providing better support than the other guy. In that academic view, there is not much call for outside analysis or third-party validation.

But every big, big data vendor we talk to now has some proprietary angle on how they do things better than the next guy – with lurking implied vendor lock-in – based on how enterprises can effectively manage big data or extract maximal value from it. Which sounds like the current IT vendor ecosystem we know and love. And which requires some analysis and validation to separate the wheat from the chaff, the real from the hype.

As an IT organization faced with big data challenges, how do you feel about suddenly dealing with billion dollar behemoths in a space founded on open source principles? In the end, it doesn’t really impact our recommended approach – you need to have enterprise capabilities for big data, and you always were likely to get the best of those from vendors with highly competitive proprietary technology. We’ve now started working with big data vendors as real IT solution vendors. In our book, Pivotal, Cloudera, Hortonworks and the like have simply graduated into the full-fledged IT vendor category, which can only help the IT organization faced with enterprise-level big data challenges.

…(read the full post)

Application Performance Management (APM) For Big Data

(Excerpt from original post on the Taneja Group News Blog)

Concurrent, the folks behind Cascading, have today announced the beta of “Driven” – an Application Performance Management (APM) solution for Hadoop. APM has been sorely missing from the Hadoop ecosystem, at least at a level where developers, IT ops, and even end users can quickly get to the bottom of any issues.

…(read the full post)

External storage might make sense for Hadoop

An IT industry analyst article published by SearchStorage.

Using Hadoop to drive big data analytics doesn’t necessarily mean building clusters of distributed storage; a good old array might be a better choice.


Using Hadoop to drive big data analytics doesn’t necessarily mean building clusters of distributed storage — good old external storage might be a better choice.

The original architectural design for Hadoop made use of relatively cheap commodity servers and their local storage in a scale-out fashion. Hadoop’s original goal was to enable cost-effective exploitation of data that previously wasn’t viable to store and process. We’ve all heard about big data volume, variety, velocity and a dozen other “v” words used to describe these previously hard-to-handle data sets. Given such a broad target by definition, most businesses can point to some kind of big data they’d like to exploit.

Big data is growing bigger every day, and storage vendors, with their relatively expensive SAN and network-attached storage (NAS) systems, are starting to work their way into the big data party. They can’t simply leave all that data to server vendors filling boxes with commodity disk drives. Even if Hadoop adoption is just in its early stages, the competition and confusing marketing noise are ratcheting up.

In a Hadoop scale-out design, each physical node in the cluster hosts both local compute and a share of the data; it’s intended to support applications, such as search, that often need to crawl through massive data sets. Much of Hadoop’s value lies in how effectively it executes parallel algorithms over data chunks distributed across a scale-out cluster.

Hadoop is made up of a compute engine based on MapReduce and a data service called the Hadoop Distributed File System (HDFS). Hadoop takes advantage of high data “locality” by spreading big data sets over many nodes using HDFS, farming out parallelized compute tasks to each data node (the “map” part of MapReduce), followed by various shuffling and sorting consolidation steps to produce a result (the “reduce” part).
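As a rough illustration of that map/reduce split, here is a minimal word-count sketch in the Hadoop Streaming style (the script name, input/output paths and invocation are placeholders, not from the article). The mapper runs in parallel against data blocks local to each node, and the reducer consolidates the shuffled, sorted output.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming word-count sketch (illustrative only).
# Hypothetical invocation:
#   hadoop jar hadoop-streaming.jar \
#     -input /data/raw -output /data/wordcounts \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
import sys

def mapper():
    # Each mapper reads the data blocks local to its node and emits key/value pairs.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word.lower())

def reducer():
    # Hadoop shuffles and sorts mapper output by key before the reducer sees it,
    # so all counts for a given word arrive together and can simply be summed.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```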

Commonly, each HDFS data node will be assigned DAS disks to work with. HDFS will then replicate data across the data nodes, usually keeping two or three copies, each on a different node. Replicas are placed on different server nodes, with the second replica placed on a different “rack” of nodes to help avoid rack-level loss. Obviously, replication takes up more raw capacity than RAID, but it also has some advantages like avoiding rebuild windows.
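To put a rough number on that capacity trade-off, here is a back-of-the-envelope comparison (the raw capacity and RAID group width are illustrative assumptions, not figures from the article) of usable capacity under HDFS 3x replication versus a typical RAID 6 layout:

```python
# Back-of-the-envelope capacity overhead: HDFS 3x replication vs. RAID 6 (illustrative numbers).
raw_tb = 120.0                          # hypothetical raw disk capacity

hdfs_replicas = 3                       # common HDFS replication factor
hdfs_usable = raw_tb / hdfs_replicas    # every block stored three times -> ~33% usable

raid_group_disks = 12                   # hypothetical RAID 6 group width
raid_usable = raw_tb * (raid_group_disks - 2) / raid_group_disks  # 2 parity disks -> ~83% usable

print("HDFS 3x usable: %.0f TB of %.0f TB raw" % (hdfs_usable, raw_tb))
print("RAID 6 usable:  %.0f TB of %.0f TB raw" % (raid_usable, raw_tb))
```

What replication buys for that extra raw capacity, as noted above, is data locality for parallel compute and fast, distributed recovery instead of long RAID rebuild windows.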

So if HDFS readily handles the biggest of data sets in a way native to the MapReduce style of processing, uses relatively cheap local disks and provides built-in “architecture-aware” replication, why consider enterprise-class storage? …

…(read the complete as-published article there)