Spark speeds up adoption of big data clusters and clouds

An IT industry analyst article published by SearchITOperations.

Infrastructure that supports big data comes from both the cloud and clusters. Enterprises can mix and match these seven infrastructure choices to meet their needs.

Mike Matchett

If enterprise IT has been slow to support production big data analytics on the decade-old Hadoop platform, the ramp-up has been much faster now that Spark is part of the overall package. After all, applying the same old business intelligence approach to broader, bigger data (via MapReduce) isn’t exciting, but producing operational-speed predictive intelligence that guides and optimizes the business with machine precision is a competitive must-have.

With traditional business intelligence (BI), an analyst studies a lot of data and makes some hypotheses and a conclusion to form a recommendation. Using the many big data machine learning techniques supported by Spark’s MLlib, a company’s big data can dynamically drive operational-speed optimizations. Massive in-memory machine learning algorithms enable businesses to immediately recognize and act on inherent patterns in even big streaming data.

But the commoditization of machine learning itself isn’t the only new driver here. A decade ago, IT needed to either stand up a “baby” high-performance computing cluster for serious machine learning or learn to write low-level distributed parallel algorithms to run on the commodity-based Hadoop MapReduce platform. Either option required both data science expertise and exceptionally talented IT admins who could stand up and support massive physical scale-out clusters in production. Today there are many infrastructure options for big data clusters that can help IT deploy and support big data-driven applications.

Here are seven types of big data infrastructures for IT to consider, each with core strengths and differences:…(read the complete as-published article there)

Accelerate Apache Spark to boost big data platforms

An IT industry analyst article published by SearchITOperations.

Big data platforms like Apache Spark process massive volumes of data faster than other options. As data volumes grow, enterprises seek ways to speed up Spark.

Mike Matchett

So, we have data — lots and lots of data. We have blocks, files and objects in storage. We have tables, key values and graphs in databases. And increasingly, we have media, machine data and event streams flowing in.

It must be a fun time to be an enterprise data architect, figuring out how to best take advantage of all this potential intelligence — without missing or dropping a single byte.

Big data platforms such as Spark help process this data quickly and converge traditional transactional data center applications with advanced analytics. If you haven’t yet seen Spark show up in the production side of your data center, you will soon. Organizations that don’t, or can’t, adopt big data platforms to add intelligence to their daily business processes are soon going to find themselves way behind their competition.

Spark, with its distributed in-memory processing architecture — and native libraries providing both expert machine learning and SQL-like data structures — was expressly designed for performance with large data sets. Even with such a fast start, competition and larger data volumes have made Spark performance acceleration a sizzling hot topic. You can see this trend at big data shows, such as the recent, sold-out Spark Summit in Boston, where it seemed every vendor was touting some way to accelerate Spark.

If Spark already runs in memory and scales out to large clusters of nodes, how can you make it faster, processing more data than ever before? Here are five Spark acceleration angles we’ve noted:

  1. In-memory improvements. Spark can use a distributed pool of memory-heavy nodes. Still, there is always room to improve how memory management works — such as sharding and caching — how much memory can be stuffed into each node and how far clusters can effectively scale out. Recent versions of Spark use native Tungsten off-heap memory management — i.e., compact data encoding — and the optimizing Catalyst query planner to greatly reduce both execution time and memory demand. According to Databricks, the leading Spark sponsor, we’ll continue to see future releases aggressively pursue greater Spark acceleration.
  2. Native streaming data. The hottest topic in big data is how to deal with streaming data.

…(read the complete as-published article there)

Four big data and AI trends to keep an eye on

An IT industry analyst article published by SearchITOperations.

AI is making a comeback — and it’s going to affect your data center soon.

Mike Matchett

Big data and artificial intelligence will affect the world — and already are — in mind-boggling ways. That includes, of course, our data centers.

The term artificial intelligence (AI) is making a comeback. I interpret AI as a larger, encompassing umbrella that includes machine learning (which in turn includes deep learning methods) but also implies thought. Meanwhile, machine learning is somehow safe to talk about. It’s just some applied math under the hood: probability, linear algebra, differential equations. But use the term AI and, suddenly, you get wildly different emotional reactions; for example, the Terminator is coming. However, today’s broader field of AI is working toward providing humanity with enhanced and automated vision, speech and reasoning.

If you’d like to stay on top of what’s happening practically in these areas, here are some emerging big data and AI trends to watch that might affect you and your data center sooner rather than later:

Where there is a Spark…
Apache Spark is replacing basic Hadoop MapReduce for latency-sensitive big data jobs with its in-memory, real-time queries and fast machine learning at scale. And with familiar, analyst-friendly data constructs and languages, Spark brings it all within reach of us middling hacker types.

As far as production bulletproofing goes, Spark is not quite fully baked. But version 2 of Spark was just released in mid-2016, and it’s solidifying fast. Even so, the ecosystem is fast moving, and potential “Next Big Things” such as Apache Flink are already turning heads.

Even I can do it…
A few years ago, all this big data and AI stuff required doctorate-level data scientists. In response, a few creative startups attempted to short-circuit those rare and expensive math geeks out of the standard corporate analytics loop and provide the spreadsheet-oriented business intelligence analyst some direct big data access.

Today, as with Spark, I get a real sense that big data analytics is finally within reach of the average engineer or programming techie. The average IT geek may still need to put in some serious study but can achieve great success creating massive organizational value. In other words, there is now a large and growing middle ground where smart non-data scientists can be very productive with applied machine learning, even on big and real-time data streams…(read the complete as-published article there)

Visualizing (and Optimizing) Cluster Performance

(Excerpt from original post on the Taneja Group News Blog)

Clusters are the scale-out way to go in today’s data center. Why not try to architect an infrastructure that can grow linearly in capacity and/or performance? Well, one problem is that operations can get quite complex, especially when you start mixing workloads and tenants on the same cluster. In vanilla big data solutions, everyone competes, and not always fairly, for the same resources. This is a growing problem in production environments where big data apps are starting to underpin key business-impacting processes.

…(read the full post)