Accelerate Apache Spark to boost big data platforms

An IT industry analyst article published by SearchITOperations.


Big data platforms like Apache Spark process massive volumes of data faster than other options. As data volumes grow, enterprises seek ways to speed up Spark.

Mike Matchett

So, we have data — lots and lots of data. We have blocks, files and objects in storage. We have tables, key values and graphs in databases. And increasingly, we have media, machine data and event streams flowing in.

It must be a fun time to be an enterprise data architect, figuring out how to best take advantage of all this potential intelligence — without missing or dropping a single byte.

Big data platforms such as Spark help process this data quickly and converge traditional transactional data center applications with advanced analytics. If you haven’t yet seen Spark show up in the production side of your data center, you will soon. Organizations that don’t, or can’t, adopt big data platforms to add intelligence to their daily business processes are soon going to find themselves way behind their competition.

Spark, with its distributed in-memory processing architecture — and native libraries providing both expert machine learning and SQL-like data structures — was expressly designed for performance with large data sets. Even with such a fast start, competition and larger data volumes have made Spark performance acceleration a sizzling hot topic. You can see this trend at big data shows, such as the recent, sold-out Spark Summit in Boston, where it seemed every vendor was touting some way to accelerate Spark.

If Spark already runs in memory and scales out to large clusters of nodes, how can you make it faster, processing more data than ever before? Here are five Spark acceleration angles we’ve noted:

  1. In-memory improvements. Spark can use a distributed pool of memory-heavy nodes. Still, there is always room to improve how memory management works (sharding and caching, for example), how much memory can be stuffed into each node and how far clusters can effectively scale out. Recent versions of Spark use native Tungsten off-heap memory management (i.e., compact data encoding) and the optimizing Catalyst query planner to greatly reduce both execution time and memory demand. According to Databricks, the leading Spark sponsor, future releases will continue to aggressively pursue greater Spark acceleration.
  2. Native streaming data. The hottest topic in big data is how to deal with streaming data.
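
As a rough illustration of the off-heap memory management mentioned in item 1, here is a minimal Python sketch that builds the command-line flags for turning it on at job submission. The two property names, spark.memory.offHeap.enabled and spark.memory.offHeap.size, are standard Spark settings. The helper functions and the 4 GB figure are hypothetical, chosen only for illustration and not taken from the article.

```python
# Hypothetical helpers: they only assemble spark-submit arguments and
# do not talk to a cluster. The sizes used here are illustrative.

def offheap_conf(size_gb):
    """Return Spark properties that enable off-heap memory of the given size."""
    if size_gb <= 0:
        raise ValueError("off-heap size must be positive")
    return {
        "spark.memory.offHeap.enabled": "true",
        # Spark reads this value in bytes when no unit suffix is given.
        "spark.memory.offHeap.size": str(size_gb * 1024 ** 3),
    }


def to_submit_flags(conf):
    """Flatten a property dict into spark-submit command-line arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    # Print the flags for an assumed 4 GB off-heap allocation per executor.
    print(" ".join(to_submit_flags(offheap_conf(4))))
```

The printed flags could be appended to a spark-submit command. Whether off-heap storage actually helps depends on the workload and the Spark version, so treat the numbers as a starting point for experiments and not as a recommendation.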

…(read the complete as-published article there)