Accelerate Apache Spark to boost big data platforms

An IT industry analyst article published by SearchITOperations.

Big data platforms like Apache Spark process massive volumes of data faster than other options. As data volumes grow, enterprises seek ways to speed up Spark.

Mike Matchett

So, we have data — lots and lots of data. We have blocks, files and objects in storage. We have tables, key values and graphs in databases. And increasingly, we have media, machine data and event streams flowing in.

It must be a fun time to be an enterprise data architect, figuring out how to best take advantage of all this potential intelligence — without missing or dropping a single byte.

Big data platforms such as Spark help process this data quickly and converge traditional transactional data center applications with advanced analytics. If you haven’t yet seen Spark show up in the production side of your data center, you will soon. Organizations that don’t, or can’t, adopt big data platforms to add intelligence to their daily business processes are soon going to find themselves way behind their competition.

Spark, with its distributed in-memory processing architecture — and native libraries providing both expert machine learning and SQL-like data structures — was expressly designed for performance with large data sets. Even with such a fast start, competition and larger data volumes have made Spark performance acceleration a sizzling hot topic. You can see this trend at big data shows, such as the recent, sold-out Spark Summit in Boston, where it seemed every vendor was touting some way to accelerate Spark.

If Spark already runs in memory and scales out to large clusters of nodes, how can you make it faster, processing more data than ever before? Here are five Spark acceleration angles we’ve noted:

  1. In-memory improvements. Spark can use a distributed pool of memory-heavy nodes. Still, there is always room to improve how memory management works (sharding and caching, for example), how much memory can be stuffed into each node and how far clusters can effectively scale out. Recent versions of Spark use native Tungsten off-heap memory management, including compact data encoding, and the optimizing Catalyst query planner to greatly reduce both execution time and memory demand. According to Databricks, the leading Spark sponsor, future releases will continue to aggressively pursue greater Spark acceleration.
  2. Native streaming data. The hottest topic in big data is how to deal with streaming data.
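As a rough illustration of the in-memory angle in item 1, the sketch below assembles a `spark-submit` command line that turns on Spark's off-heap memory settings. The property names `spark.memory.offHeap.enabled` and `spark.memory.offHeap.size` are standard Spark configuration keys; the 2g size, the helper name `build_submit_command` and the script name `my_job.py` are assumptions made for the example, not recommendations from the article.

```python
def build_submit_command(app, conf):
    """Return a spark-submit argument list with one --conf flag per setting."""
    cmd = ["spark-submit"]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)  # the application script goes last
    return cmd


# Assumed values for illustration only; tune them for your own cluster.
OFF_HEAP_CONF = {
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "2g",
}

if __name__ == "__main__":
    # my_job.py is a hypothetical application script.
    print(" ".join(build_submit_command("my_job.py", OFF_HEAP_CONF)))
```

Whether off-heap memory actually helps depends on the workload and the Spark version, so treat this as a starting point for experiments rather than a tuning guide.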

…(read the complete as-published article there)

In Memory Big Data Heats Up With Apache Ignite

(Excerpt from original post on the Taneja Group News Blog)

Recently we posted about GridGain contributing their core in-memory solution to the Apache Ignite project. While this is still incubating, it’s clear that this was a good move for GridGain, and a win for the big data/BI community in general. Today Apache Ignite drops its v1.0 release candidate with new features such as built-in support for JCache and an autoloader to help migrate data and schema in from existing SQL databases (e.g., Oracle, MySQL, Postgres, DB2 and Microsoft SQL Server).

…(read the full post)

IoT Goes Real-Time, Gets Predictive – Glassbeam Launches Spark-based Machine Learning

(Excerpt from original post on the Taneja Group News Blog)

In-memory processing was all the rage at Strata 2014 NY last month, and the hottest word was Spark! Spark is a big data scale-out cluster solution that provides a way to speedily analyze large data sets in memory using a “resilient distributed dataset” (RDD) design for fault tolerance. It can deploy into its own optimized cluster or ride on top of Hadoop 2.0 using YARN (although it is a different processing platform/paradigm from MapReduce; see this post on GridGain for a Hadoop MR in-memory solution).

…(read the full post)

Turn to in-memory processing when performance matters

An IT industry analyst article published by SearchDataCenter.

In-memory processing is faster, and vendors are innovating to make in-memory database technology cheaper and better.

In-memory processing can improve data mining and analysis, and other dynamic data processing uses. When considering in-memory, however, look out for data protection, cost and bottlenecks.

When you need top database speed, in-memory processing provides the ultimate in low latency. But can your organization really make cost-effective use of an in-memory database? It’s hard to know whether that investment will pay off in real business value.

And even if the performance boost is justified, is it possible to adequately protect important data kept live in-memory from corruption or loss? Can an in-memory system scale and keep pace with what’s likely to be exponential data growth?

There’s an ongoing vendor race to address these concerns. Vendors are trying to practically deliver the performance advantages of in-memory processing to a wider IT market as analytics, interactive decision-making and other (near-) real-time use cases become more mainstream.

Memory is the fastest medium

Using memory to accelerate performance of I/O-bound applications is not a new idea; it has always been true that processing data in memory is faster (10 to 1,000 times or more) than waiting on relatively long I/O times to read and write data from slower media — flash included.

Since the early days of computing, performance-intensive products have allocated memory as data cache. Most databases were designed to internally use as much memory as possible. Some might even remember setting up RAM disks for temporary data on their home PCs back in the MS-DOS days to squeeze more speed out of bottlenecked systems.
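The idea of allocating memory as a data cache can be sketched in a few lines. The example below is a minimal read-through cache with least-recently-used eviction, not how any particular database implements its buffer pool; `ReadThroughCache` and `slow_read` are hypothetical names, and `slow_read` merely stands in for a trip to slower media.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Keep recently used records in memory; fall back to a slow loader."""

    def __init__(self, loader, capacity):
        self.loader = loader        # function that reads from slow media
        self.capacity = capacity    # maximum number of records held in memory
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.loader(key)              # the slow I/O path
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return value


def slow_read(key):
    # Hypothetical stand-in for a disk or network read.
    return f"record-{key}"


cache = ReadThroughCache(slow_read, capacity=2)
for key in [1, 2, 1, 3, 2]:
    cache.get(key)
print(cache.hits, cache.misses)
```

Every hit avoids one call to the slow loader, which is where the speedup comes from; the capacity limit is the reason memory density and cost matter so much.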

Today’s in-memory processing takes that concept to the extreme: using active memory (dynamic RAM) to hold the currently running database code and active data structures, and to keep the persistent database itself in memory. These databases forget about making any slow trips off the motherboard to talk to external media and instead optimize their data structures for memory-resident processing.

Historically, both the available memory density per server and the relatively high cost of memory were limiting factors, but today there are technologies expanding the effective application of in-memory processing to larger data sets. These include higher per-server memory architectures, inline/online deduplication and compression that use extra (and relatively cheap) CPU capacity to squeeze more data into memory, and cluster and grid tools that can scale out the total effective in-memory footprint.
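To make the compression point concrete, here is a small sketch that trades CPU time for memory by compressing values before holding them in RAM. It uses Python's standard `zlib` module; `CompressedStore` and the sample row are assumptions made for illustration, and real in-memory databases use their own encoding and deduplication schemes.

```python
import zlib


class CompressedStore:
    """Hold values compressed in memory, spending CPU to save RAM."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, text):
        # Compress on write; this costs CPU cycles up front.
        self._blobs[key] = zlib.compress(text.encode("utf-8"))

    def get(self, key):
        # Decompress on read; this costs CPU cycles again.
        return zlib.decompress(self._blobs[key]).decode("utf-8")

    def stored_bytes(self):
        return sum(len(blob) for blob in self._blobs.values())


store = CompressedStore()
row = "status=OK;region=us-east;" * 200  # deliberately repetitive sample data
store.put("row-1", row)
raw_bytes = len(row.encode("utf-8"))
print(raw_bytes, store.stored_bytes())
```

The savings depend entirely on how repetitive the data is. Highly regular machine data compresses well, while already-compressed media may not shrink at all.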

Memory continues to get cheaper and denser. Laptops now come standard with more addressable memory than entire mainframes once had. Today, anyone with a credit card can cheaply rent high-memory servers from cloud providers…

…(read the complete as-published article there)