Data lakes swim with golden information for analytics

An IT industry analyst article published by SearchDataCenter.


One of the biggest themes in big data these days is data lakes.

Available data grows by the minute, and useful data comes in many different shapes and levels of structure. Big data (i.e., Hadoop) environments have proven good at batch processing of unstructured data at scale, and useful as an initial landing place to host all kinds of data in low-level or raw form in front of downstream data warehouse and business intelligence (BI) tools. On top of that, Hadoop environments are beginning to develop capabilities for analyzing structured data and for near real-time processing of streaming data.

The data lake concept captures all analytically useful data onto a single infrastructure. From there, we can apply a kind of “schema-on-read” approach using dynamic analytical applications, rather than pre-build static extract, transform and load (ETL) processes that feed only highly structured data warehouse views. With clever data lake strategies, we can combine SQL and NoSQL database approaches, and even meld online analytical processing (OLAP) and online transaction processing (OLTP) capabilities. Keeping data in a single, shared location means administrators can better provide and widely share not only the data, but an optimized infrastructure with (at least theoretically) simpler management overhead.
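
To make the “schema-on-read” idea concrete, here is a minimal Python sketch. It is an illustration only, not how Hadoop or any particular tool implements it: the sample records, the SALES_SCHEMA mapping and the read_with_schema helper are all hypothetical names invented for this example. The point is that raw records land unchanged, and structure is imposed only when an application reads them.

```python
import json

# Raw events landed in the lake as-is; records need not share a layout.
# (Sample data is made up for illustration.)
RAW_EVENTS = [
    '{"user": "ann", "amount": "12.50", "ts": "2015-03-01"}',
    '{"user": "bob", "amount": 7, "channel": "web"}',
    '{"user": "cy"}',
]

# Hypothetical read-time schema: field name -> (converter, default value).
SALES_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and project it onto the schema at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            value = record.get(field)
            # Missing fields fall back to the default instead of failing.
            row[field] = convert(value) if value is not None else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, SALES_SCHEMA)
total = sum(row["amount"] for row in rows)
```

A different application could read the same raw events with a different schema (say, one that keeps "channel" and "ts") without any upstream ETL job being rebuilt.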

The smartest of new big data applications might combine different kinds of analysis over different kinds of data to produce new decision-making information based on operational intelligence. The Hadoop ecosystem isn’t content with just offering super-sized stores of unstructured data, but has evolved quickly to become an all-purpose data platform in the data center.
…(read the complete as-published article there)

Moving to all-flash? Think about your data storage infrastructure

An IT industry analyst article published by SearchStorage.


Everyone is now on board with flash. All the key storage vendors have at least announced entry into the all-flash storage array market, and most have offered hybrids (traditional arrays pumped up with solid-state drives) for years. As silicon storage gets cheaper and denser, it seems inevitable that data centers will migrate from spinning disks to “faster, better and cheaper” options, with non-volatile memory poised to be the long-term winner.

But today's storage skirmish seems to be shifting toward total cost of ownership, where two key questions must be answered:

  • How much performance is needed, and how many workloads in the data center have data with varying quality of service (QoS) requirements or data that ages out?
  • Are hybrid arrays a better choice to handle mixed workloads through advanced QoS and auto-tiering features?

All-flash proponents argue that the cost per unit of capacity will continue to drop for flash compared to hard disk drives (HDDs), and that no workload is left wanting when all-flash can service all I/Os at top performance. Yet we see a new category of hybrids on the market that are designed for flash-level performance and then fold in multiple tiers of colder storage. The argument there is that data isn’t all the same and its value changes over its lifetime. Why store older, unaccessed data on a top tier when there are cheaper, capacity-oriented tiers available?
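
The tiering argument can be sketched as a simple age-based placement rule. This is a toy illustration under stated assumptions, not any vendor's auto-tiering algorithm: the tier names, the 30- and 180-day thresholds and the choose_tier function are all hypothetical, and real arrays weigh access frequency, QoS policy and much more.

```python
from datetime import date

# Hypothetical policy: (maximum days since last access, tier name),
# ordered from hottest to coldest. Thresholds are illustrative only.
TIER_RULES = [
    (30, "flash"),
    (180, "capacity-disk"),
]
COLDEST_TIER = "archive"


def choose_tier(last_access, today):
    """Return the tier for data last accessed on `last_access`."""
    age_days = (today - last_access).days
    for max_age_days, tier in TIER_RULES:
        if age_days <= max_age_days:
            return tier
    # Anything older than every threshold lands on the coldest tier.
    return COLDEST_TIER


today = date(2015, 6, 1)
recent = choose_tier(date(2015, 5, 20), today)
aging = choose_tier(date(2015, 1, 15), today)
stale = choose_tier(date(2014, 6, 1), today)
```

With these made-up thresholds, data touched in the last month stays on flash, while data untouched for a year drops to the archive tier. An all-flash design would effectively collapse TIER_RULES to a single entry.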

It’s misleading to lump together hybrids that are traditional arrays with solid-state drives (SSDs) added and the new hybrids that might be one step evolved past all-flash arrays. And it can get even more confusing when the old arrays get stuffed with nothing but flash and are positioned as all-flash products. To differentiate, some industry wags like to use the term “flash-first” to describe newer-generation products purpose-built for flash speeds. That still could cause some confusion when considering both hybrids and all-flash designs. It may be more accurate to call the flash-first hybrids “flash-converged.” With a flash-converged design, you can expect to buy one of these new hybrids with nothing but flash inside and get all-flash performance.

We aren’t totally convinced that the future data center will have just a two-tier system with flash on top backed by tape (or a remote cold cloud), but a “hot-cold storage” future is entirely possible as intermediate tiers of storage get, well, disintermediated. We’ve all predicted the demise of 15K HDDs for a while; can all the other HDDs be far behind as QoS controls get more sophisticated in handling the automatic mixing of hot and cold to create whatever storage temperature you might need? …

…(read the complete as-published article there)