5 trends driving the big data evolution

An IT industry analyst article published by SearchDataManagement.


The speedy evolution of big data technologies is connected to five trends, including practical applications of machine learning and cheap, abundantly available compute resources.

Mike Matchett
Small World Big Data

I’ve long said that all data will eventually become big data, and big data platforms will evolve into our next-generation data processing platform. We have reached a point in big data evolution where it is now mainstream, and if your organization is not neck-deep in figuring out how to implement big data technologies, you might be running out of time.

Indeed, the big data world continues to change rapidly, as I observed recently at the Strata Data Conference in New York. While there, I met with over a dozen key vendors in sessions and on the show floor.

Overall, the folks attending conferences like this one are less and less the slightly goofy, idealistic, open source research-focused geeks of years past, and more and more real-world big data and machine learning practitioners looking to solve real business problems in enterprise production environments. Given that basic vibe, here are my top five takeaways from Strata on the trends driving the big data evolution.

1. Structured data

Big data isn’t just about unstructured or semi-structured data anymore. Many of the prominent vendors, led by the key platform providers like Hortonworks, MapR and Cloudera, are now talking about big data implementations as full enterprise data warehouses (EDWs). The passive, often swampy data lake idea seems a bit passé, while there is a lot of energy aimed at providing practical, real-time business intelligence to a wider swath of corporate BI consumers.

I noted that a large number of big data-based acceleration vendors are applying on-demand analytics to tremendous volumes of structured data, both historical and streaming IoT-style.

Clearly, there is a war going on for corporate BI and EDW investment. Given what I’ve seen, my bet is on big data platforms to inevitably outpace and outperform monolithic, proprietary legacy EDWs.

2. Converged system of action

This leads into the observation that big data evolution includes implementations that host more and more of a company’s entire data footprint — structured and unstructured data together.

We’ve previously noted that many advanced analytical approaches can add tremendous value when they combine many formerly disparate corporate data sets of all different types…(read the complete as-published article there)

Big data processing could be the way all data is processed

An IT industry analyst article published by SearchITOperations.


Some organizations take their time with new technologies to let first adopters suffer the growing pains. But there’s no treading water in the big data stream; the current won’t wait.

Mike Matchett
Small World Big Data

Have you noticed yet? Those geeky big data platforms based on clusters of commodity nodes running open source parallel processing algorithms are evolving into some seriously advanced IT functionality.

The popular branded distributions of the Apache projects, including Hortonworks, Cloudera and MapR, are no longer simply made up of relatively basic big data batch query tools, such as Hadoop MapReduce, the way they were 10 years ago. We’ve seen advances in machine learning, SQL-based transaction support, in-memory acceleration, interactive query performance, streaming data handling, enterprise IT data governance, protection and security. And even container services, scheduling and management are on a new level. Big data platforms now present a compelling vision for the future of perhaps all IT data processing.
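To make concrete what those "relatively basic batch query tools" looked like, here is the canonical MapReduce example, a word count, sketched in plain Python. This is an illustration of the map/shuffle/reduce pattern only, not code from any particular distribution:

```python
from collections import defaultdict

# Map phase: emit (word, 1) pairs from each input record.
def map_phase(records):
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

# Shuffle phase: group intermediate pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the grouped counts for each word.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

records = ["big data platforms", "big data streams"]
counts = reduce_phase(shuffle(map_phase(records)))
# counts == {'big': 2, 'data': 2, 'platforms': 1, 'streams': 1}
```

In a real Hadoop cluster, the map and reduce phases run in parallel across commodity nodes and the shuffle moves data over the network, but the logical pattern is exactly this simple, which is why so much has since been layered on top of it.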

Wait — do I really mean all IT data center processing will be big data processing? Most of us are just getting used to the idea of investing in and building out functional data lakes to capture and collect tons of unstructured data for business intelligence tasks, offline machine learning, active archive and other secondary data applications. And many are having a hard time making those data lake initiatives successful. It’s a challenge to develop staff expertise, assure data provenance, manage metadata and master implied schemas, i.e., creating a single version of truth.

…big data isn’t just for backroom data science geeks. The technologies involved are going to define the next-generation IT data center platform…

Many organizations may be waiting for the big data market to settle out. Unfortunately, especially for those more comfortable being late adopters, big data processing technology development is accelerating. We see use cases rapidly proliferating and the general IT manageability of big data streams, which eases adoption and integration, greatly increasing.

The universal big data onslaught is not going to slow down, nor will it wait for slackers to catch up. And those able to harness their big data streams today aren’t just using them to look up old baseball stats. They are able to use data to improve and accelerate operations, gain greater competitiveness and achieve actual ROI. I’m not even going to point out the possibility that savvy big data processing will uncover new revenue opportunities and business models. Oops, just did!

If you think you are falling behind today on big data initiatives, I’d recommend you consider doubling down now. This area is moving way too fast to jump on board later and still expect to catch competitors. Big data is proving to be a huge game changer. There simply won’t be a later with big data.

I’ve written before that all data is eventually going to be big data. I’ll now add that all processing is eventually going to be big data processing. In my view, the focus of big data technology has moved from building out systems of insight over trailing big data sets to now offering ways to build convergent systems of action over all data.

In other words, big data isn’t just for backroom data science geeks. The technologies involved are going to define the next-generation IT data center platform…(read the complete as-published article there)

Learn storage techniques for managing unstructured data use

An IT industry analyst article published by SearchStorage.


Rearchitect storage to maximize unstructured data use at the global scale for larger data sets coming from big data analytics and other applications.

Mike Matchett
Small World Big Data

Back in the good old days, we mostly dealt with two storage tiers. We had online, high-performance primary storage directly used by applications and colder secondary storage used to tier less-valuable data out of primary storage. It wasn’t that most data lost value on a hard expiration date, but primary storage was pricey enough to constrain capacity, and we needed to make room for newer, more immediately valuable data.

We spent a lot of time trying to intelligently summarize and aggregate aging data to keep some kind of historical information trail online. Still, masses of detailed data were sent off to bed, out of sight and relatively offline. That’s all changing as managing unstructured data becomes a bigger concern. New services provide storage for big data analysis of detailed unstructured and machine data, as well as to support web-speed DevOps agility, deliver storage self-service and control IT costs. Fundamentally, these services help storage pros provide and maintain more valuable online access to ever-larger data sets.

Products for managing unstructured data may include copy data management (CDM), global file systems, hybrid cloud architectures, global data protection and big data analytics. These features help keep much, if not all, data available and productive.

Handling the data explosion

The underlying theme of many new storage offerings is to extend enterprise-quality IT management and governance across multiple tiers of global storage.

We’re seeing a lot of high-variety, high-volume and unstructured data. That’s pretty much everything other than highly structured database records. The new data explosion includes growing files and file systems, machine-generated data streams, web-scale application exhaust, endless file versioning, finer-grained backups and rollback snapshots to meet lower tolerances for data integrity and business continuity, and vast image and media repositories.

The public cloud is one way to deal with this data explosion, but it’s not always the best answer by itself. Elastic cloud storage services are easy to use to deploy large amounts of storage capacity. However, unless you want to create a growing and increasingly expensive cloud data dump, advanced storage management is required for managing unstructured data as well. The underlying theme of many new storage offerings is to extend enterprise-quality IT management and governance across multiple tiers of global storage, including hybrid and public cloud configurations.

If you’re architecting a new approach to storage, especially unstructured data storage at a global enterprise scale, here are seven advanced storage capabilities to consider:

Automated storage tiering. Storage tiering isn’t a new concept, but today it works across disparate storage arrays and vendors, often virtualizing in-place storage first. Advanced storage tiering products subsume yesterday’s simpler cloud gateways. They learn workload-specific performance needs and implement key quality of service, security and business cost control policies.

Much of what used to make up individual products, such as storage virtualizers, global distributed file systems, bulk data replicators, migrators and cloud gateways, is converging into single-console unifying storage services. Enmotus and Veritas offer these simple-to-use services. This type of storage tiering enables unified storage infrastructure and provides a core service for many different types of storage management products.
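To illustrate the kind of policy these tiering services automate, here is a minimal age-based rule sketched in Python. The tier names and the 90-day threshold are assumptions made up for the sketch, not any vendor's actual policy; real products learn workload-specific thresholds rather than hard-coding them:

```python
from datetime import datetime, timedelta

# Hypothetical rule: data untouched longer than this threshold is a
# candidate for demotion to a colder, cheaper capacity tier.
COLD_THRESHOLD = timedelta(days=90)

def pick_tier(last_access: datetime, now: datetime) -> str:
    """Return the tier a piece of data belongs on, by access age."""
    if now - last_access > COLD_THRESHOLD:
        return "secondary"  # colder, cheaper capacity tier
    return "primary"        # hot, high-performance tier

now = datetime(2018, 1, 1)
pick_tier(datetime(2017, 1, 1), now)   # idle ~1 year -> "secondary"
pick_tier(datetime(2017, 12, 20), now)  # recently used -> "primary"
```

An automated tiering product effectively evaluates a (much richer) version of this rule continuously across arrays, vendors and clouds, and folds in quality-of-service, security and cost policies on top.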

Metadata at scale. There’s a growing focus on collecting and using storage metadata — data about stored data — when managing unstructured data. By properly aggregating and exploiting metadata at scale, storage vendors can better virtualize storage, optimize services, enforce governance policies and augment end-user analytical efforts.

Metadata concepts are most familiar in an object or file storage context. However, advanced block and virtual machine-level storage services are increasingly using metadata detail to help with tiering for performance. We also see metadata in data protection features. Reduxio’s infinite snapshots and immediate recovery based on timestamping changed blocks take advantage of metadata, as do change data capture techniques and N-way replication. When looking at heavily metadata-driven storage, it’s important to examine metadata protection schemes and potential bottlenecks. Interestingly, metadata-heavy approaches can improve storage performance because they usually allow for high metadata performance and scalability out of band from data delivery.
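As a toy version of the kind of roll-up a metadata-driven storage service performs at vastly larger scale, this sketch walks a directory tree and aggregates capacity consumed per file extension using only filesystem metadata (no file contents are read):

```python
import os
from collections import Counter

def bytes_by_extension(root):
    """Aggregate bytes stored per file extension under a directory tree,
    using only stat() metadata, never the data itself."""
    totals = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1] or "<none>"
            try:
                totals[ext] += os.stat(path).st_size
            except OSError:
                continue  # file vanished or unreadable; skip it
    return totals
```

The point of the out-of-band observation above is visible even here: the aggregation touches only inode metadata, so it can run (and scale) separately from the data path that serves reads and writes.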

Storage analytics. You can use metadata and other introspective analytics about storage use gathered across enterprise storage, both offline and increasingly in dynamic optimizations. Call-home management is one example of how these analytics are used to better manage storage…(read the complete as-published article there)

Is demand for data storage or supply driving increased storage?

An IT industry analyst article published by SearchStorage.


Figuring out whether we’re storing more data than ever because we’re producing more data or because constantly evolving storage technology lets us store more of it isn’t easy.

Mike Matchett
Small World Big Data

Whether you’re growing on-premises storage or your cloud storage footprint this year, it’s likely you’re increasing total storage faster than ever. Where we used to see capacity upgrade requests for proposals in terms of tens of terabytes growth, we now regularly see RFPs for half a petabyte or more. When it comes to storage size, huge is in.

Do we really need that much more data to stay competitive? Yes, probably. Can we afford extremely deep storage repositories? It seems that we can. However, these questions raise a more basic chicken-and-egg question: Are we storing more data because we’re making more data or because constantly evolving storage technology lets us?

Data storage economics
Looked at from a pricing perspective, the question becomes what’s driving price — more demand for data storage or more storage supply? I’ve heard economics professors say they can tell who really understands basic supply and demand price curve lessons when students ask this kind of question and consider a supply-side answer first. People tend to focus on demand-side explanations as the most straightforward way of explaining why prices fluctuate. I guess it’s easier to assume supply is a remote constant while envisioning all the possible changes in demand for data storage.

As we learn to wring more value out of our data, we want to both make and store more data.

But if storage supply is constant, given our massive data growth, then it should be really expensive. The massive squirreling away of data would instead be constrained by that high storage price (low availability). This was how it was years ago. Remember when traditional IT application environments struggled to fit into limited storage infrastructure that was already stretched thin to meet ever-growing demand?

Today, data capacities are growing fast, yet the price of storage per unit of capacity keeps dropping. There's no doubt supply is rising faster than demand for data storage. Supply-side technologies, such as the inherent efficiencies of shared cloud storage, Moore's law and clustered open source file systems like the Hadoop Distributed File System, have made bulk storage capacity so affordable that prices continue to fall despite massive growth in demand.

Endless data storage
When we think of hot new storage technologies, we tend to focus on primary storage advances such as flash and nonvolatile memory express. All so-called secondary storage comes, well, second. It’s true the relative value of a gigabyte of primary storage has greatly increased. Just compare the ROI of buying a whole bunch of dedicated, short-stroked HDDs as we did in the past to investing in a modicum of today’s fully deduped, automatically tiered and workload-shared flash.

It’s also worth thinking about flash storage in terms of impact on capacity, not just performance. If flash storage can serve a workload in one-tenth the time, it can also serve 10 similar workloads in the same time, providing an effective 10-times capacity boost.
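The back-of-the-envelope arithmetic behind that claim is simply that effective capacity (in workloads served per unit time) scales with the service-time speedup. The service times below are illustrative numbers, not benchmarks:

```python
# Illustrative per-operation service times (not measured benchmarks).
hdd_service_time_ms = 10.0   # spinning disk
flash_service_time_ms = 1.0  # flash

# If flash serves each operation in 1/10 the time, the same device-hours
# can serve 10 comparable workloads: the effective capacity boost equals
# the speedup.
speedup = hdd_service_time_ms / flash_service_time_ms
effective_capacity_boost = speedup
# effective_capacity_boost == 10.0
```

This is why it is worth evaluating flash on consolidation economics, not just latency: the same speedup that makes one workload faster lets one device absorb many workloads.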

But don’t discount the major changes that have happened in secondary storage…(read the complete as-published article there)

Scalable Persistent Storage for Kubernetes Using MapR

Lots of storage solutions can claim to provide adequate container storage when there are tens or hundreds of containers, but what are you going to do when you really need to push the “go” button on your next-gen apps and spin up hundreds of thousands of containers across a hybrid cloud architecture?

MapR just introduced a very compelling container solution, of course leveraging the highly scalable and production-proven MapR platform. The big data storage layer in MapR is already able to handle trillions of objects/files/tables/streams (hey, it’s big data AND POSIX-compliant AND…) in a highly scalable (and enterprise-y) manner.

In this short video bit just released on Truth In IT (with transcript), I interview Jack Norris from MapR about the new MapR for Kubernetes solution, announced yesterday.