The New Challenges of Capacity Management In Virtualized Cloudy IT

(Excerpt from original post on the Taneja Group News Blog)

As a long-time performance and capacity planning consultant, I used to help IT organizations in the worst of situations remediate thorny resource allocation issues. In other words, someone had just bought a lot of expensive vendor-specified infrastructure, but the resulting performance was still terrible. Sometimes, after having been burned a few times or still retaining some wisdom from the corporate mainframe era, they would engage expert help to plan optimal infrastructure investments in advance (i.e., before spending the money). Not everybody had the discipline, budget, or maturity for proactive planning, which led to a lot of unnecessary performance pain for end-users and many IT shops living in fire-fighting mode. In fact, I knew several IT admins who thrived on the adrenaline of the daily fire-fight!
Once actually delivering good service to end-users finally became a popular IT goal, one of the big attractions of virtualization technologies was that they enabled highly responsive and even dynamic allocation of resources on demand. Many IT folks assumed this would eliminate the need for up-front capacity planning, because you could now react quickly and easily to performance problems by allocating more resources dynamically from a shared pool. In fact, the higher-end capabilities of hypervisors can automate dynamic resource assignment and leveling through judicious setting of resource prioritization policies. And it does work well, but only up to a point.
Now we have at least three “new” capacity management challenges. The biggest one is sizing the resources needed for the entire resource pool. As we virtualize more and more of our mission-critical applications, it’s ever more important that the entire cluster be able to handle the aggregate demands of many kinds of applications co-hosted together. Despite increasingly popular modular scale-out virtual infrastructure solutions, this still requires capacity planning at the larger scale, or you risk either overspending on infrastructure that quickly becomes obsolete (remember, Moore’s law will get you more for your money the later you spend it) or facing severe performance bottlenecks at the worst possible times, when critical applications peak together. Capacity planning has always been about right-sizing the right infrastructure at the right time. Sure, hybrid cloud bursting is just around the corner for many as yet another reactive panacea for in-house resource constraints, yet it’s still possible to overspend on cloud allocations, or to undersubscribe and suffer poor performance as a result. While AWS is elastic, it’s elastic at the machine level, with the best cost management offered by reserving known volumes of machines in advance.
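To make the pool-sizing point concrete, here is a minimal sketch of why it matters whether applications peak together. The function name and the sample demand numbers are purely illustrative assumptions, not measurements from any real cluster: it compares the sum of each application's individual peak against the peak of the combined demand.

```python
def pool_sizing(demands):
    """Compare two ways of sizing a shared resource pool.

    demands: dict mapping an application name to a list of demand samples
    (e.g., CPU cores needed per hour), all lists covering the same intervals.
    Returns (sum_of_peaks, peak_of_sum).
    """
    # Sizing each application for its own peak, as if it had dedicated hardware.
    sum_of_peaks = sum(max(series) for series in demands.values())
    # Sizing the pool for the worst combined interval actually observed.
    aggregate = [sum(samples) for samples in zip(*demands.values())]
    peak_of_sum = max(aggregate)
    return sum_of_peaks, peak_of_sum


# Hypothetical demand samples: peaks at different times vs. the same time.
staggered = {"app_a": [2, 8, 3], "app_b": [7, 2, 3]}
coincident = {"app_a": [2, 8, 3], "app_b": [1, 7, 2]}

print(pool_sizing(staggered))   # sharing saves capacity
print(pool_sizing(coincident))  # sharing saves nothing at the peak
```

When peaks are staggered, the shared pool can be much smaller than the sum of the individual peaks; when critical applications peak together, the pool needs the full amount, and that is the case an undersized cluster fails on.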
The second issue is that as we virtualize deeper into our mission-critical applications portfolio, we simply can’t continue to guess at what virtual resources might deliver satisfactory application performance and trust that the reactive system dynamics will smooth everything out. Virtualization is essentially sharing, and good sharing schemes require a sound understanding of the resource demands of each application within each VM in order to set the knobs and buttons to do the right thing at run-time. It’s possible and maybe even desirable to oversubscribe the low-hanging fruit of servers in test and dev, but don’t try that with your mission-critical apps in production.
Finally, much of what is happening in IT infrastructure these days is converging. It’s no longer sufficient to examine performance or plan capacity silo by silo (if it ever really was). Today, it’s critical that capacity management take a holistic view across servers, storage, networking, and any other critical resources. And with the advent of clouds, capacity management isn’t limited to the data center anymore either. It’s an enterprise function, visible at the CIO level.
The bottom line is that the performance analysis and capacity planning disciplines aren’t even close to dead, although there are fewer and fewer adherents who learned the formal discipline on big iron. What’s needed for this new generation is a competitive approach to optimizing total IT spend for maximum business value that can be leveraged by the average virtual admin. It’s been hard for classic capacity management vendors to evolve their tooling as fast as virtualization and cloud technologies mature, but there are a few standouts. TeamQuest, for one, has not only been thriving as an employee-owned firm for many years, but is actively investing in and expanding its solutions. Recently it folded in a product called Surveyor, which promises to stitch together whatever systems or infrastructure data, financial management data, and other data you have into a cohesive, ready-to-roll analytical and reporting environment. TeamQuest claims painless deployment, in that Surveyor effectively creates a virtual capacity management database over all your other tools and data sources without having to ETL or create yet another monolithic database repository.
TeamQuest’s core capacity planning for servers is based on non-linear predictive modeling that relates interactive system response time to resource utilization (via expected workload demands). Non-linear modeling can analytically “predict” the right-sized infrastructure proactively to meet end-user performance goals. A non-linear queuing analysis is also baked into NetApp’s Balance solution, enabling it to identify the optimal “balance” between loading and performance in virtual infrastructures, accounting for not only virtual server resources but also attached storage arrays. Key to its value is the way it pierces through layers of virtualization to stitch together an end-to-end, cross-domain performance perspective with analysis from within the VM, from the hypervisor, and from the storage array points of view.
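To illustrate what “non-linear” means here, consider the simplest textbook queuing model, a single M/M/1 queue, where response time R = S / (1 - U) for service time S and utilization U. This is only a sketch under that assumption; it is not the model TeamQuest or NetApp actually uses, and the function names are hypothetical.

```python
def response_time(service_time_s, utilization):
    """Mean response time of an M/M/1 queue: R = S / (1 - U)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


def max_utilization(service_time_s, target_response_s):
    """Highest utilization that still meets a response time goal.

    Inverts R = S / (1 - U) to get U = 1 - S / R.
    """
    if target_response_s <= service_time_s:
        raise ValueError("target must exceed the bare service time")
    return 1.0 - service_time_s / target_response_s


# Response time grows slowly at first, then explodes as utilization nears 1.
for u in (0.5, 0.8, 0.9, 0.95):
    print(f"U={u:.2f}  R={response_time(0.1, u):.2f}s")

# With a 0.1 s service time and a 0.5 s goal, the resource can run to about 80% busy.
print(max_utilization(0.1, 0.5))
```

The curve is why reacting after the fact is risky: going from 50% to 80% busy roughly doubles response time in this model, but going from 90% to 95% doubles it again. Predictive planning works backward from the response time goal to the utilization, and therefore the capacity, that can sustain it.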
Old school capacity planning might be dead, but long live the new virtual infrastructure capacity management!

…(read the full post)