Learn storage techniques for managing unstructured data use

An IT industry analyst article published by SearchStorage.

Rearchitect storage to maximize unstructured data use at the global scale for larger data sets coming from big data analytics and other applications.

Mike Matchett
Small World Big Data

Back in the good old days, we mostly dealt with two storage tiers. We had online, high-performance primary storage directly used by applications and colder secondary storage used to tier less-valuable data out of primary storage. It wasn’t that most data lost value on a hard expiration date, but primary storage was pricey enough to constrain capacity, and we needed to make room for newer, more immediately valuable data.

We spent a lot of time trying to intelligently summarize and aggregate aging data to keep some kind of historical information trail online. Still, masses of detailed data were sent off to bed, out of sight and relatively offline. That’s all changing as managing unstructured data becomes a bigger concern. New services provide storage for big data analysis of detailed unstructured and machine data, as well as to support web-speed DevOps agility, deliver storage self-service and control IT costs. Fundamentally, these services help storage pros provide and maintain more valuable online access to ever-larger data sets.

Products for managing unstructured data may include copy data management (CDM), global file systems, hybrid cloud architectures, global data protection and big data analytics. These features help keep much, if not all, data available and productive.

Handling the data explosion

The underlying theme of many new storage offerings is to extend enterprise-quality IT management and governance across multiple tiers of global storage.

We’re seeing a lot of high-variety, high-volume and unstructured data. That’s pretty much everything other than highly structured database records. The new data explosion includes growing files and file systems, machine-generated data streams, web-scale application exhaust, endless file versioning, finer-grained backups and rollback snapshots to meet lower tolerances for data integrity and business continuity, and vast image and media repositories.

The public cloud is one way to deal with this data explosion, but it’s not always the best answer by itself. Elastic cloud storage services are easy to use to deploy large amounts of storage capacity. However, unless you want to create a growing and increasingly expensive cloud data dump, advanced storage management is required for managing unstructured data as well. The underlying theme of many new storage offerings is to extend enterprise-quality IT management and governance across multiple tiers of global storage, including hybrid and public cloud configurations.

If you’re architecting a new approach to storage, especially unstructured data storage at a global enterprise scale, here are seven advanced storage capabilities to consider:

Automated storage tiering. Storage tiering isn’t a new concept, but today it works across disparate storage arrays and vendors, often virtualizing in-place storage first. Advanced storage tiering products subsume yesterday’s simpler cloud gateways. They learn workload-specific performance needs and implement key quality of service, security and business cost control policies.

Much of what used to make up individual products, such as storage virtualizers, global distributed file systems, bulk data replicators, and migrators and cloud gateways, are converging into single-console unifying storage services. Enmotus and Veritas offer these simple-to-use services. This type of storage tiering enables unified storage infrastructure and provides a core service for many different types of storage management products.

Metadata at scale. There’s a growing focus on collecting and using storage metadata — data about stored data — when managing unstructured data. By properly aggregating and exploiting metadata at scale, storage vendors can better virtualize storage, optimize services, enforce governance policies and augment end-user analytical efforts.

Metadata concepts are most familiar in an object or file storage context. However, advanced block and virtual machine-level storage services are increasingly using metadata detail to help with tiering for performance. We also see metadata in data protection features. Reduxio’s infinite snapshots and immediate recovery based on timestamping changed blocks take advantage of metadata, as do change data capture techniques and N-way replication. When looking at heavily metadata-driven storage, it’s important to examine metadata protection schemes and potential bottlenecks. Interestingly, metadata-heavy approaches can improve storage performance because they usually allow for high metadata performance and scalability out of band from data delivery.

Storage analytics. You can use metadata and other introspective analytics about storage use gathered across enterprise storage, both offline and increasingly in dynamic optimizations. Call-home management is one example of how these analytics are used to better manage storage…(read the complete as-published article there)

Documents Are Data

A graphical despiction of a very simple xhtml ...

Image via Wikipedia

If you think of an HTML page as a structured “marked up” document, it’s basically a form of data. The structure, in this case represented by HTML tags like <a> and <div>, identifies various document elements which can be interpreted as data fields. In fact, strict HTML is good XML (referred to as XHTML), a well-understood data format.

Think of “documents” as structured data where the structure is included in the document itself. The structure is free-form in the sense that the document author decides what data fields are included, how they are organized, ordered, nested, related, and so on. If you are object-oriented, you can also view each document as an object (technically for web pages this is referred to as the DOM – “Document Object Model“).

Big Data in a Small World

Google perhaps lead the way into this brave new world with their proprietary “Big Table” database architecture backing many of their services.  Apache’s Hadoop project (includes HDFS and HBase) is fundamentally based on Google’s open papers.

Most big data sets in our small world are going to be produced by numerous (countless?) authors and applications. Most big data is going to be in the form of documents rather than standardized (i.e. described by external schema) transactional data. Since most current data handling, storage, and analysis technology is aimed at transactional schema controlled data, I’m thrilled to explore today’s emerging market of new “big data” solutions.

Enhanced by Zemanta