The New Big Thing in Big Data: Results From Our Apache Spark Survey

(Excerpt from original post on the Taneja Group News Blog)

In the last few months I’ve been really bullish on Apache Spark as a big enabler of wider big data solution adoption. Recently we got the great opportunity to conduct some deep Spark market research (with Cloudera’s sponsorship) and were able to survey nearly seven thousand (6900+) highly qualified technical and managerial people working with big data from around the world.
   
Some highlights: First, across a broad range of industries, company sizes, and big data maturities, over one-half (54%) of respondents are already actively using Spark to solve a primary organizational use case. That’s an incredible adoption rate, and no doubt due to the many ways Spark makes big data analysis accessible to a much wider audience: not just PhDs but anyone with a modicum of SQL and scripting skills.
   
When it comes to use cases, in addition to the expected Data Processing/Engineering/ETL use case (55%), we found high rates of forward-looking and analytically sophisticated use cases like Real-time Stream Processing (44%), Exploratory Data Science (33%) and Machine Learning (33%). And the more traditional customer intelligence (31%) and BI/DW (29%) use cases weren’t far behind. Add those numbers up and the total comes to well over 100%, which means many organizations indicated that Spark was already being applied to more than one important type of use case at the same time. That is a good sign that Spark supports nuanced applications and offers some great efficiencies (sharing big data, converging analytical approaches).
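To make the "adding those numbers up" step concrete, here is a minimal sketch of the arithmetic. It assumes every percentage is measured against the same base of respondents, which the excerpt does not state, and the variable names are purely illustrative.

```python
# Use-case percentages as quoted in the excerpt above.
use_case_share = {
    "Data Processing/Engineering/ETL": 55,
    "Real-time Stream Processing": 44,
    "Exploratory Data Science": 33,
    "Machine Learning": 33,
    "Customer Intelligence": 31,
    "BI/DW": 29,
}

# If each respondent could pick only one use case, the shares could not exceed 100.
total_share = sum(use_case_share.values())

# Assumption: all shares use the same respondent base, so the total divided by 100
# is the average number of listed use cases selected per respondent.
avg_use_cases = total_share / 100

print(f"Total of shares: {total_share}%")
print(f"Average listed use cases per respondent: {avg_use_cases:.2f}")
```

Any total above 100% implies that at least some respondents selected more than one use case. The average is only a rough indicator, since it says nothing about how the overlap is distributed.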
 
Is Spark going to replace Hadoop and the Hadoop ecosystem of projects? A lot of folks run Spark on its own cluster, but by our assessment they do so mostly for performance and availability isolation. And that is likely just a matter of platform maturity: it’s likely that future schedulers (and/or something like Pepperdata) will solve the multi-tenancy QoS issues of running Spark alongside, and converged with, any and all other kinds of data processing solutions (e.g. NoSQL, Flink, search…).
 
In practice, converged analytics is already the big trend: nearly half of current users (48%) said they use Spark with HBase, and 41% use it with Kafka. Production big data solutions are actually pipelines of activities that span from data acquisition and ingest through full data processing and disposition. We believe that as Spark grows its organizational footprint out from initial data processing and ad-hoc data science into advanced operational (i.e. data center) production applications, it will truly blossom when fully enabled by other supporting big data ecosystem technologies.

…(read the full post)

Agile Big Data Clusters: DriveScale Enables Bare Metal Cloud

(Excerpt from original post on the Taneja Group News Blog)

We’ve been writing recently about the hot, potentially inevitable trend toward a dense IT infrastructure in which components like CPU cores and disks are not only commoditized, but deployed in massive stacks or pools (with fast matrixing switches between them). A layered provisioning solution can then dynamically compose any desired “physical” server or cluster out of those components. Conceptually this becomes the foundation for a bare-metal cloud. DriveScale today announces their agile architecture with this approach, aimed first at solving big data multi-cluster operational challenges.

…(read the full post)

Scaling All Flash to New Heights – DDN Flashscale All Flash Array Brings HPC to the Data Center

(Excerpt from original post on the Taneja Group News Blog)

It’s time to start thinking about massive amounts of flash in the enterprise data center. I mean PBs of flash for the biggest, baddest, fastest data-driven applications out there. This amount of flash requires an HPC-capable storage solution brought down and packaged for enterprise IT management, which is where DataDirect Networks (aka DDN) is stepping up. Perhaps too quietly, they have been hard at work pivoting their high-end HPC portfolio into the enterprise space. Today they are rolling out a massively scalable new flash-centric Flashscale 14KXi storage array that will help them offer complete, comprehensive single-vendor big data workflow solutions, from the fastest scratch through the biggest throughput parallel file systems into the largest distributed object storage archives.

…(read the full post)

Hyperconverged Supercomputers For the Enterprise Data Center

(Excerpt from original post on the Taneja Group News Blog)

Last month NVIDIA, our favorite GPU vendor, dived into the converged appliance space. In fact we might call their new NVIDIA DGX-1 a hyperconverged supercomputer in a 3U box. Designed to support the application of GPUs to Deep Learning (i.e. compute-intensive, deeply layered neural networks that need to train and run in operational timeframes over big data), this beast has 8 new Tesla P100 GPUs inside on an embedded NVLink mesh, pre-integrated with flash SSDs, decent memory, and an optimized container-hosting deep learning software stack. The best part? The price is surprisingly affordable, and the box can replace the 250+ server cluster you might otherwise need for effective Deep Learning.

…(read the full post)

Big Data Enterprise Maturity

(Excerpt from original post on the Taneja Group News Blog)

It’s time to look at big data again. Last week I was at Cloudera’s growing and vibrant annual analyst event to hear the latest from the folks who know what’s what. Then this week Strata (the conference for data scientists) brings lots of public big data vendor announcements. A noticeable shift this year is less focus on how to apply big data and more on maturing enterprise features intended to ease wider data center level adoption. A good example is the “mixed big data workload QoS” cluster optimization solution from Pepperdata.

…(read the full post)

Kudu Might Be Invasive: Cloudera Breaks Out Of HDFS

(Excerpt from original post on the Taneja Group News Blog)

For the IT crowd just now getting used to the idea of big data’s HDFS (Hadoop Distributed File System) and its peculiarities, there is another open source big data storage alternative coming from Cloudera called Kudu. Like HDFS, Kudu is designed to be hosted across a scale-out cluster of commodity systems, but it is specifically intended to support lower-latency analytics.

…(read the full post)

Time To Use The Force, IT! – OpsDataStore Unifies Systems Management Data

(Excerpt from original post on the Taneja Group News Blog)

We are only a bit excited by the impending Star Wars release. How old were we when the first one came out? I’m not saying. We are all very excited here to see this new continuation of the story, the characters, and the universe of the Force, especially compared to our day-to-day IT management reality, which often seems stuck in the ’70s. Systems management has been around even longer than the Star Wars franchise, but it seems to have stagnated along the way. Where is the rebellion? Where are the good Jedi warriors to save us all from the dark side?

…(read the full post)