The New Big Thing in Big Data: Results From Our Apache Spark Survey

(Excerpt from original post on the Taneja Group News Blog)

In the last few months I’ve been really bullish on Apache Spark as a big enabler of wider big data solution adoption. Recently we got the great opportunity to conduct some deep Spark market research (with Cloudera’s sponsorship) and were able to survey nearly seven thousand (6900+) highly qualified technical and managerial people working with big data from around the world.
   
Some highlights: first, across the broad range of industries, company sizes, and big data maturities, over one-half (54%) of respondents are already actively using Spark to solve a primary organizational use case. That’s an incredible adoption rate, no doubt due to the many ways Spark makes big data analysis accessible to a much wider audience – not just PhDs but anyone with a modicum of SQL and scripting skills.
   
When it comes to use cases, in addition to the expected Data Processing/Engineering/ETL use case (55%), we found high rates of forward-looking and analytically sophisticated use cases like Real-time Stream Processing (44%), Exploratory Data Science (33%) and Machine Learning (33%). The more traditional customer intelligence (31%) and BI/DW (29%) use cases weren’t far behind. Those numbers add up to well over 100%, so many organizations indicated that Spark was already being applied to more than one important type of use case at the same time – a good sign that Spark supports nuanced applications and offers some great efficiencies (sharing big data, converging analytical approaches).
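To make the "adding those numbers up" point concrete, here is a minimal sketch of the arithmetic. It uses only the percentages quoted above and assumes (this is not stated in the excerpt) that all six figures share the same respondent base; the variable names are purely illustrative.

```python
# Use-case percentages as quoted in the survey highlights above.
use_case_share = {
    "Data Processing/Engineering/ETL": 55,
    "Real-time Stream Processing": 44,
    "Exploratory Data Science": 33,
    "Machine Learning": 33,
    "Customer Intelligence": 31,
    "BI/DW": 29,
}

# If every respondent had picked exactly one use case, the shares would sum to 100.
total_share = sum(use_case_share.values())

# Assumption: same respondent base for every figure. The total divided by 100
# is then the average number of listed use cases selected per respondent.
avg_use_cases = total_share / 100

print(f"Total share: {total_share}%")
print(f"Average listed use cases per respondent: {avg_use_cases:.2f}")
```

Anything above 100% can only come from respondents selecting more than one use case, which is the overlap the paragraph describes.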
 
Is Spark going to replace Hadoop and the Hadoop ecosystem of projects? A lot of folks run Spark on its own cluster, but by our assessment mostly for performance and availability isolation. That is likely just a matter of platform maturity – it’s likely that future schedulers (and/or something like Pepperdata) will solve the multi-tenancy QoS issues of running Spark alongside, and converged with, any and all other kinds of data processing solutions (e.g. NoSQL, Flink, search…).
 
In practice, converged analytics are already the big trend: nearly half of current users (48%) said they used Spark with HBase, and 41% with Kafka. Production big data solutions are actually pipelines of activities that span from data acquisition and ingest through full data processing and disposition. We believe that as Spark grows its organizational footprint out from initial data processing and ad-hoc data science into advanced operational (i.e. data center) production applications, it will truly blossom when fully enabled by other supporting big data ecosystem technologies.

…(read the full post)

Big Data Enterprise Maturity

(Excerpt from original post on the Taneja Group News Blog)

It’s time to look at big data again. Last week I was at Cloudera’s growing and vibrant annual analyst event to hear the latest from the folks who know what’s what. Then this week Strata (the conference for data scientists) brings lots of public big data vendor announcements. A noticeable shift this year is less focus on how to apply big data and more on maturing enterprise features intended to ease wider data center level adoption. A good example is the “mixed big data workload QoS” cluster optimization solution from Pepperdata.

…(read the full post)

Kudu Might Be Invasive: Cloudera Breaks Out Of HDFS

(Excerpt from original post on the Taneja Group News Blog)

For the IT crowd just now getting used to the idea of big data’s HDFS (Hadoop’s Distributed File System) and its peculiarities, there is another open source big data file system alternative coming from Cloudera called Kudu. Like HDFS, Kudu is designed to be hosted across a scale-out cluster of commodity systems, but it is specifically intended to support more low-latency analytics.

…(read the full post)

A Billion Here, A Billion There – Big Data Is Big Money

(Excerpt from original post on the Taneja Group News Blog)

When we talk about big data today we aren’t talking just about the data and its three V’s (or up to 15 depending on who you consult), but more and more about the promise of big transformation to the data center. In other words, it’s about big money.

First, consider recent news about some key Hadoop distro vendors. Many of them are now billion-dollar players, much of that on speculation and expectation of future data center occupation. When Pivotal spun off from EMC it got to start with a gaggle of successful commercially deployed products, giving it a tremendous day one revenue stream. GE’s 10% outside stake at $105M made it a billion-dollar startup. Coming back from the Cloudera Analyst Event last month we found that Cloudera was doing really well with $160M in new funding, but soon thereafter Intel weighed in to top them up over a billion in funding (valuation at $4.1B). Not to be left out in the cold, Hortonworks announced a $100M round that valued them at $1B (ostensibly they claim they could take in 20x more, but are raising funds as they need them).
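The "billion dollar startup" figure follows from simple stake arithmetic. As a rough sketch, assuming the $105M bought exactly 10% of the company (the function name is hypothetical and used only for illustration):

```python
def implied_valuation_musd(stake_musd, stake_percent):
    """Company valuation (in $M) implied by paying stake_musd for stake_percent of it."""
    return stake_musd * 100 / stake_percent

# GE's reported 10% stake in Pivotal at $105M.
pivotal_valuation = implied_valuation_musd(105, 10)
print(f"Implied valuation: ${pivotal_valuation / 1000:.2f}B")
```

This is only the valuation implied by a single investment, not a statement about revenue or market value.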

Second, consider the infrastructure that not just billions but trillions and more (gadzillions?) pieces of data still have to land on, even if it is made up of commodity disk/server clusters. Of course most companies are going to want to build out big data solutions, or they risk getting left behind competitively. But many of these are going to eventually turn into massive investments that only grow as the data grows (which is predicted to be exponential!) and occupy more and more of the data center, not stay constrained as little R&D projects or simple data warehouse offloading platforms.

Clearly big data is now a playing field for competition amongst billionaires. I’m sure the many startups in that space are only encouraged by the ecosystem wealth and opportunity for acquisition, but as the big money grows, keep an eye on how standards and open source foundations increasingly become political battlefields, with results not always in the best interest of the end user.

While there is an open source model underpinning the whole ecosystem, with this much money on the table it will be interesting to see how fair competition plays out. From my perspective it looks like big data isn’t going to be very free, or there wouldn’t be billions of dollars in bets being made. Up till now most of the ecosystem vendors have been making arguments about providing better support than the other guy.  In that academic view, there is not much call for outside analysis or third party validation.

But every big, big data vendor we now talk to has some proprietary angle on how they do things better than the next guy – with lurking implied vendor lock-in – based on how enterprises can effectively manage big data or extract maximal value from it. Which sounds like the current IT vendor ecosystem we know and love. And which requires some analysis and validation to separate the wheat from the chaff, the real from the hype.

As an IT organization faced with big data challenges, how do you feel about suddenly dealing with billion dollar behemoths in a space founded on open source principles? In the end, it doesn’t really impact our recommended approach – you need to have enterprise capabilities for big data, and you always were likely to get the best of those from vendors with highly competitive proprietary technology. We’ve started working with big data vendors now as real IT solution vendors. In our book, Pivotal, Cloudera, Hortonworks and the like have simply graduated into the full-fledged IT vendor category, which can only help the IT organization faced with enterprise-level big data challenges.

…(read the full post)

Enterprise IT Will Dive Into Big Data Solutions in 2013

(Excerpt from original post on the Taneja Group News Blog)

If you are in IT, 2013 is going to be the year that you will want to dive into the “big data” pool if you haven’t been pushed in already. But don’t worry – it’s no longer sink or swim. For one, we’ll be here to help coach you through it all. And while the concepts, terminology and hype have been all over the place, once you start floating around you’ll find that under the surface much of what fills the big data pool is familiar IT infrastructure, data management, and services re-cast around a few easy-to-grasp innovations.

For example, if you are in IT and asked to pick a Hadoop distro to stand up, you’d probably start with evaluating the three main vendor distributions of Hadoop (rather than getting it straight off Apache) followed by other downstream OEM’d and pre-integrated versions.  The main supported distros are from Cloudera, Hortonworks, and MapR.  I didn’t really appreciate the differences until talking with all three individually (at 2012 NY Strata, see below).  

What really struck me was that Hadoop, from an IT perspective, is really a set of distributed data storage services at heart. Sure, you can talk about Hadoop as an architecture based on parallel functional programming that uses map/reduce tasks to move compute processing to distributed data sets, or as an analytics platform for huge volumes of unstructured data, or as a commoditized/democratized scale-out data mining solution, but from an IT viewpoint, it looks and smells a lot like a way to cost-effectively store and access large volumes of data. It’s an innovative twist on the age-old IT data storage problem when faced with ever growing amounts of data – how to support your business in cost-effectively deriving value out of growing data stores.
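For readers who haven’t seen the map/reduce idea up close, here is a toy, single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API. The `partitions` list is hypothetical data standing in for blocks stored on different nodes, and the function names are illustrative only.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each inner list stands in for a block held on a different node.
partitions = [
    ["big data", "big money"],
    ["big data pool"],
]

def map_partition(lines):
    """Map step: count words using only the lines held in this partition."""
    return Counter(word for line in lines for word in line.split())

def merge_counts(left, right):
    """Reduce step: combine two partial word counts into one."""
    return left + right

# The map step runs "where the data lives" (here, simply once per partition).
partial_counts = [map_partition(p) for p in partitions]

# Only the small partial results are brought together and merged.
word_counts = reduce(merge_counts, partial_counts, Counter())
print(word_counts.most_common(2))
```

The point of the pattern is that the bulky raw data never moves; only the compact per-partition summaries travel to be merged, which is why the storage layout matters so much to IT.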

In a nutshell, Hortonworks takes the high road regarding the open source Hadoop and related Apache projects, and focuses on 100% open source support. There is some good argument that by sticking with open source, you are assured future open source benefits. If you start paying for bastardized distros and versions, you will, well, have to continue paying for them.

Cloudera is more commercially focused on helping businesses deploy productive big data solutions. While offering a free Cloudera packaged distro version (CDH), they have a premium enterprise management offering for when you get over 50 nodes. CDH is the most widely used Hadoop distro today. For example, you can buy a Dell Cloudera pre-integrated cluster, ready-to-go.

MapR thought that as long as they were going to focus on helping businesses compete, they shouldn’t be constrained by open source and should instead aggressively optimize the whole thing, including the core parts, for high performance and scalability. MapR also has a free community edition (M3), but the real enterprise solutions are M5/M7, which add sophisticated enterprise storage features directly to Hadoop including mirroring, snapshots, HA, and data placement controls (MapR optimizes other parts too, including now a highly customized HBase). In other words, at the fundamental level MapR enterprise versions look like hard-core “big” data storage services (EMC chose MapR for its Greenplum distro).

We’ll dive into more big data topics soon, but I wanted to say I thoroughly enjoyed Strata (and the combined Hadoop World) in NY a few months ago. It was sold out to the “walls”: folks walking in just to see the vendor exhibition floor were turned away because the place was already packed to the maximum allowed by fire codes. The Strata/Hadoop co-sponsors O’Reilly and Cloudera should think about stepping up to a much larger venue like the Javits Center next time. Let vendors have bigger booths and welcome all who are interested to drop in. Maybe even make the vendor show floor “free” to IT folks? I’m sure the vendors would love the traffic and attention.

I am looking forward to future Strata events – one is coming up on the West Coast next month (Feb 26-28) and I’d encourage IT folks in the area to check it out. The program web page indicates they recognize the IT impact big data is having, with “Big Data for Enterprise IT” themed presentations. Hope to see you there!

…(read the full post)