Kudu Might Be Invasive: Cloudera Breaks Out Of HDFS

(Excerpt from original post on the Taneja Group News Blog)

For the IT crowd just now getting used to the idea of big data’s HDFS (Hadoop’s Distributed File System) and its peculiarities, there is an alternative open source big data file system coming from Cloudera called Kudu. Like HDFS, Kudu is designed to be hosted across a scale-out cluster of commodity systems, but it is specifically intended to support more low-latency analytics.

…(read the full post)

Project Myriad Will Become Your Next Data Center Platform

(Excerpt from original post on the Taneja Group News Blog)

One of the big things bubbling around at Strata this week is talk about YARN, Mesos, and Project Myriad (initiated/sponsored by MapR). On one hand it seems that this is just about some evolution of the Hadoop scheduling layer, but looking at it with a critical eye, I see the impending culmination of what I predicted years ago: that the Hadoop ecosystem will quickly evolve to bring high-powered HPC technologies right into the heart of the next-gen enterprise data center.

…(read the full post)

HP Vertica Goes Hadoop on MapR’s Read Write Infrastructure

(Excerpt from original post on the Taneja Group News Blog)

Today MapR and HP Vertica are rolling out an exciting joint integration, nicely addressing full SQL-on-Hadoop use cases. Vertica is now runnable, actually “pluggable”, on and into MapR’s enterprise-quality Hadoop distribution. This is an interesting feat that depends highly on MapR’s unique implementation of enterprise-grade storage in place of the open source HDFS guts.

…(read the full post)

Enterprise IT Will Dive Into Big Data Solutions in 2013

(Excerpt from original post on the Taneja Group News Blog)

If you are in IT, 2013 is going to be the year that you will want to dive into the “big data” pool if you haven’t been pushed in already. But don’t worry – it’s no longer sink or swim. For one, we’ll be here to help coach you through it all. And while the concepts, terminology and hype have been all over the place, once you start floating around you’ll find that under the surface much of what fills the big data pool is familiar IT infrastructure, data management, and services re-cast around a few easy-to-grasp innovations.

For example, if you are in IT and asked to pick a Hadoop distro to stand up, you’d probably start with evaluating the three main vendor distributions of Hadoop (rather than getting it straight off Apache) followed by other downstream OEM’d and pre-integrated versions.  The main supported distros are from Cloudera, Hortonworks, and MapR.  I didn’t really appreciate the differences until talking with all three individually (at 2012 NY Strata, see below).  

What really struck me was that Hadoop, from an IT perspective, is really a set of distributed data storage services at heart. Sure, you can talk about Hadoop as an architecture based on parallel functional programming that uses map/reduce tasks to move compute processes to distributed data sets, or as an analytics platform for huge volumes of unstructured data, or as a commoditized/democratized scale-out data mining solution, but from an IT viewpoint, it looks and smells a lot like a way to cost-effectively store and access large volumes of data. It’s an innovative twist on the age-old IT data storage problem when faced with ever-growing amounts of data: how to support your business in cost-effectively deriving value out of growing data stores.

In a nutshell, Hortonworks takes the high road regarding the open source Hadoop and related Apache projects, and focuses on 100% open source support. There is a good argument that by sticking with open source, you are assured future open source benefits. If you start paying for bastardized distros and versions, you will, well, have to continue paying for them.

Cloudera is more commercially focused on helping businesses deploy productive big data solutions. While offering a free Cloudera-packaged distro version (CDH), they have a premium enterprise management offering when you get over 50 nodes. CDH is the most widely used Hadoop distro today. For example, you can buy a Dell Cloudera pre-integrated cluster, ready-to-go.

MapR thought that as long as they were going to focus on helping businesses compete, they shouldn’t be constrained by open source and should instead aggressively optimize the whole thing, including the core parts, for high performance and scalability. MapR also has a free community edition (M3), but the real enterprise solutions are M5/M7, which add sophisticated enterprise storage features directly to Hadoop including mirroring, snapshots, HA, and data placement controls (MapR optimizes other parts too, including now a highly customized HBase). In other words, at the fundamental level MapR enterprise versions look like hard-core “big” data storage services (EMC chose MapR for its Greenplum distro).

We’ll dive into more big data topics soon, but I wanted to say I thoroughly enjoyed Strata (and the combined Hadoop World) in NY a few months ago. It was sold out to the “walls”. In fact, folks walking in just to see the vendor exhibition floor were turned away because the place was already packed to the maximum allowed by fire codes. The Strata/Hadoop co-sponsors O’Reilly and Cloudera should think about stepping up to a much larger venue like the Javits next time. Let vendors have bigger booths and welcome all who are interested to drop in. Maybe even make the vendor show floor “free” to IT folks? I’m sure the vendors would love the traffic and attention.

I am looking forward to future Strata events. One is coming up on the West Coast next month (Feb 26-28) and I’d encourage IT folks in the area to check it out. The program web page indicates they recognize the IT impact big data is having, with “Big Data for Enterprise IT” themed presentations. Hope to see you there!

…(read the full post)