Accelerate Apache Spark to boost big data platforms

An IT industry analyst article published by SearchITOperations.


Big data platforms like Apache Spark process massive volumes of data faster than earlier batch frameworks such as Hadoop MapReduce. As data volumes keep growing, enterprises seek ways to speed up Spark even further.

Mike Matchett

So, we have data — lots and lots of data. We have blocks, files and objects in storage. We have tables, key values and graphs in databases. And increasingly, we have media, machine data and event streams flowing in.

It must be a fun time to be an enterprise data architect, figuring out how to best take advantage of all this potential intelligence — without missing or dropping a single byte.

Big data platforms such as Spark help process this data quickly and converge traditional transactional data center applications with advanced analytics. If you haven’t yet seen Spark show up in the production side of your data center, you will soon. Organizations that don’t, or can’t, adopt big data platforms to add intelligence to their daily business processes are soon going to find themselves way behind their competition.

Spark, with its distributed in-memory processing architecture — and native libraries providing both expert machine learning and SQL-like data structures — was expressly designed for performance with large data sets. Even with such a fast start, competition and larger data volumes have made Spark performance acceleration a sizzling hot topic. You can see this trend at big data shows, such as the recent, sold-out Spark Summit in Boston, where it seemed every vendor was touting some way to accelerate Spark.

If Spark already runs in memory and scales out to large clusters of nodes, how can you make it faster, processing more data than ever before? Here are five Spark acceleration angles we’ve noted:

  1. In-memory improvements. Spark can use a distributed pool of memory-heavy nodes. Still, there is always room to improve how memory management works (such as sharding and caching), how much memory can be stuffed into each node and how far clusters can effectively scale out. Recent versions of Spark use native Tungsten off-heap memory management (i.e., compact binary data encoding) and the optimizing Catalyst query planner to greatly reduce both execution time and memory demand. According to Databricks, the leading Spark sponsor, future releases will continue to aggressively pursue greater Spark acceleration.
  2. Native streaming data. The hottest topic in big data is how to deal with streaming data.
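Much of Tungsten's memory savings comes from encoding rows as compact binary buffers instead of boxed JVM objects. A rough analogy of that effect can be shown in plain Python with the standard library's `array` module; the numbers below only illustrate the principle, they are not Spark measurements:

```python
import sys
from array import array

N = 100_000

# N integers as boxed Python objects in a list: each element is a
# full object with its own header, plus an 8-byte pointer in the list.
boxed = list(range(N))
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(i) for i in boxed)

# The same values packed as raw 8-byte ints in one contiguous buffer,
# loosely analogous to Tungsten's compact off-heap row encoding.
packed = array("q", range(N))
packed_bytes = sys.getsizeof(packed)

ratio = boxed_bytes / packed_bytes  # packed form is several times smaller
```

The same per-object overhead exists on the JVM, which is why Spark's shift from Java objects to Tungsten's binary row format cuts both memory footprint and garbage collection pressure.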

…(read the complete as-published article there)

Excuse me, but I think your cache is showing…

(Excerpt from original post on the Taneja Group News Blog)

Everybody these days is adding flash-based SSDs to their storage arrays. Some offer all-flash storage for ultra-high performance. And a few are putting flash storage right into the server as a very large, persistent cache. But taking advantage of flash in these ways requires either a hardware refresh or significant service disruption, or both.

GridIron offers a drop-in, non-disruptive way to immediately supercharge existing infrastructure. Its TurboCharger appliances logically plug into the middle of the SAN fabric, where they can be installed (and removed) non-disruptively by taking advantage of I/O multipathing. Once installed, they jump into the data path as a virtual LUN fronting the real LUN on the back end, providing a massive amount of SSD write-through cache that automatically adjusts to multiple workloads. Because it sits in the SAN, TurboCharger can virtually "front" any underlying storage, even storage that is in turn further virtualized.
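The reason a write-through cache can be removed without disruption is that every write lands on the backing store as well as in the cache, so the cache never holds the only copy of any data. A minimal Python sketch of the general write-through pattern (the class and names are illustrative, not GridIron's implementation):

```python
class WriteThroughCache:
    """Write-through cache: every write goes to both the cache and the
    backing store, so the backing store is always current and the cache
    can be removed at any time without data loss."""

    def __init__(self, backing_store):
        self.backing = backing_store  # e.g., the real LUN behind the virtual one
        self.cache = {}               # stands in for the SSD tier

    def write(self, key, value):
        self.cache[key] = value       # warm the cache for later reads
        self.backing[key] = value     # always persisted downstream

    def read(self, key):
        if key in self.cache:         # cache hit: skip backing-store I/O
            return self.cache[key]
        value = self.backing[key]     # miss: fetch and populate the cache
        self.cache[key] = value
        return value


lun = {}                              # toy backing store
cache = WriteThroughCache(lun)
cache.write("block-7", b"data")
```

The trade-off versus write-back caching is that writes still pay backing-store latency; the payoff, as in GridIron's design, is read acceleration with zero risk from pulling the cache out of the data path.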

GridIron customers have generally faced serious data access challenges with large databases and with consolidated, virtualized environments that benefit from read-intensive I/O acceleration. GridIron is now expanding its product line to help accelerate structured and unstructured big data access. The OneAppliance all-flash product line includes the FlashCube, for offloading write-intensive temp, log and scratch space workloads, and the iNode, which combines massive flash and compute for building high-performance compute clusters.

GridIron clearly differentiates itself from other flash solutions through its direct, practical approach to bringing the power of flash to bear on the extreme data access and movement problems of big data.

…(read the full post)