Follow That Transaction! APM Across Clouds

It’s one thing to stand up a new cloud-native application, but it’s quite another to manage its end-to-end performance; the approaches and tooling we’ve long used in the data center simply won’t do the job.

I’m hearing a lot about transitional “new” cloud-native applications that actually combine and span layers of existing persistent storage, traditional data stores, key legacy application functionality hosted in VMs, and containerized new development. Existing back-end “stuff with APIs” can now be readily topped off and extended by thousands (hundreds of thousands?) of microservices running web-like across hybrid and multi-cloud hosting platforms. Even the idea of what makes up any particular application can get pretty fuzzy.

While there are certainly security, data protection, and availability/resilience concerns as well, the problem we are talking about today is that when you pile up that much complexity and scale, assuring production performance becomes quite a challenge.

Transactional History in Performance Management

Performance management includes monitoring targeted service levels, but it should also provide ways to identify both sudden and creeping problems, troubleshoot down to root cause (and then help remediate in situ), optimize bottlenecks to provide better service (an endless effort, because there is always a “next” longest pole in the tent), and plan for and predict possible changes in functionality, usage and resources (capacity planning in the cloud era).

I spent many years working for one of the so-called “big 4” systems management companies, implementing infrastructure performance management (IPM) and capacity planning solutions in Fortune 500 data centers. With a little operational queueing-model work on rather monolithic workloads (mainframe, AS/400, midrange UNIX and so on), we could help steer multi-million-dollar IT buys toward the right resources to solve the problems of the day and assure future performance.

A core concept is the mythical “workload transaction” as the unit of application work. In those days, at least for capacity planning, we could get away with a statistical transaction unit of work. For example, we’d observe a certain amount of active usage on a given system in terms of its CPU utilization, memory, IO and so on, and then divide those metrics by an arbitrary usage metric (e.g., number of known users, IOs written, processes forked, forms processed, function points, or the default generic CPU-second itself). This statistical modeling approach worked remarkably well in helping right-size, right-time and right-host infrastructure investments.
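
To make that concrete, here’s a minimal sketch of the arithmetic with purely hypothetical numbers: observe aggregate resource consumption over an interval, divide by the chosen “transaction” unit, then project demand at a forecast volume.

```python
# A minimal sketch of the statistical "workload transaction" approach described
# above. All numbers are hypothetical, for illustration only.

observed_cpu_seconds = 5_400.0      # CPU consumed by the workload in one hour
observed_io_ops      = 2_000_000    # IOs issued over the same interval
forms_processed      = 45_000       # the chosen "transaction" unit for this app

cpu_per_txn = observed_cpu_seconds / forms_processed    # ~0.12 CPU-sec per form
io_per_txn  = observed_io_ops / forms_processed         # ~44 IOs per form

forecast_txns_per_hour = 60_000     # planned growth in forms processed
print(f"Projected CPU demand: {cpu_per_txn * forecast_txns_per_hour:,.0f} CPU-sec/hr")
print(f"Projected IO demand:  {io_per_txn * forecast_txns_per_hour:,.0f} IOs/hr")
```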

However, this approach only went so far when it came to troubleshooting or optimizing within an application. We could readily look at application behavior in some kind of aggregate way, maybe isolating behavior down to a specific observed process (or set of identified processes). In some cases you could even get developers to add instrumentation (anybody remember ARM?) into key application code to count and report on arbitrary app-specific transactions. Of course, this was rarely achievable in practice; most business-critical code was third party, and the painful performance problems that needed solving “fast” were already in production.

If you needed to go inside the app itself, or track individual transactions across a distributed system (classically a three-tier presentation/business logic/database architecture), you needed application insight from another set of tools that came to be called application performance management (APM). APM solutions aimed to provide performance insight into application-specific transaction “definitions,” with instrumentation for transaction tracking often inserted early in the app development process. Of course, that still required some up-front discipline. Alternatively, a non-intrusive (but in many ways halfway) approach might capture network traffic and parse it with deep packet inspection (DPI) to produce information on transactional workflow, sometimes drilling down to identify individual transactions flowing between systems.
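
For flavor, here’s a rough sketch (with made-up function and transaction names) of the kind of in-app hook that ARM-style instrumentation asked developers to add: mark each business transaction so the monitoring layer can count it and time it.

```python
import time
from collections import defaultdict

# Hypothetical in-app instrumentation in the spirit of ARM-style APM hooks:
# the developer marks each named business transaction so response times and
# counts can be collected and reported.

_txn_stats = defaultdict(lambda: {"count": 0, "total_secs": 0.0})

def record_transaction(name):
    """Decorator that counts and times a named business transaction."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                stats = _txn_stats[name]
                stats["count"] += 1
                stats["total_secs"] += time.perf_counter() - start
        return inner
    return wrap

@record_transaction("submit_order")       # an app-specific transaction definition
def submit_order(order):
    ...  # business logic goes here
```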

Hybrid Containerized PM

It can seem impossible to follow a unique transaction across today’s potentially huge web of containerized microservices. I think of it visually as similar to how our neurons theoretically fire and cascade in the brain: an overlapping mesh of activity. We can see behavior in aggregate easily enough, but tracking what goes into each unique transaction?

First, we need to realize that transaction workflow in this kind of environment is naturally complex. Application devs (and third-party services) can implement message buses and delivery queues, make synchronous calls while at the same time firing asynchronous events and triggers, span arbitrarily long pauses (to account for human interactions like web page clicks), cause large cascades, aggregate behavior (trigger some X every 10 Ys), and so on.
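
As a toy illustration of why that hurts per-transaction tracking, consider this made-up fan-out-and-aggregate setup: each request fires an asynchronous event, an aggregator only acts once every 10 events arrive, and the only thing tying the final action back to its originating requests is a correlation ID carried along every hop by hand.

```python
import asyncio
import uuid

# Hypothetical fan-out-and-aggregate workflow: requests emit async events and an
# aggregator fires an action for every 10 events. Without a hand-carried
# correlation id, the aggregate action can't be tied back to the originating
# transactions at all.

async def handle_request(queue: asyncio.Queue, n: int):
    txn_id = uuid.uuid4().hex[:8]                    # correlation id for this transaction
    await queue.put({"txn": txn_id, "seq": n})       # fire-and-forget event

async def aggregator(queue: asyncio.Queue):
    batch = []
    while True:
        batch.append(await queue.get())
        if len(batch) == 10:                         # "trigger some X every 10 Ys"
            print("aggregate action spans:", [e["txn"] for e in batch])
            batch = []

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    agg = asyncio.create_task(aggregator(queue))
    await asyncio.gather(*(handle_request(queue, i) for i in range(30)))
    await asyncio.sleep(0.1)                         # let the aggregator drain the queue
    agg.cancel()

asyncio.run(main())
```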

The only real approach to tracking unique transactions is still instrumentation. Luckily, there is now a tracing standard (see the OpenTracing project). But tracing gets even more challenging at large scale (and across dynamic, abstracted platform hosting). How much constant instrumentation data, and how quickly, can something like Splunk ingest from hundreds of thousands of microservices, and how much will that cost? This can easily become a case where performance measurement uses as much or more resource than the app itself.
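
For reference, this is roughly what the vendor-neutral OpenTracing side of that instrumentation looks like in Python. The operation name, tag, and header carrier here are invented, and nothing is recorded until a concrete tracer implementation (Jaeger, LightStep and so on) is registered behind global_tracer().

```python
import opentracing
from opentracing.propagation import Format

# Sketch of OpenTracing-style instrumentation in one microservice. By default
# global_tracer() is a no-op; a real tracer implementation must be registered
# for spans to be reported anywhere.

tracer = opentracing.global_tracer()

def handle_checkout(request_headers):
    # Join the caller's trace if its context arrived in the HTTP headers.
    parent_ctx = tracer.extract(Format.HTTP_HEADERS, request_headers)
    with tracer.start_active_span("checkout", child_of=parent_ctx) as scope:
        scope.span.set_tag("cart.items", 3)          # illustrative tag
        downstream_headers = {}
        # Propagate the trace context to the next service in the chain.
        tracer.inject(scope.span.context, Format.HTTP_HEADERS, downstream_headers)
        # ... call the payment service, passing downstream_headers along ...
```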

To tackle that scale problem, some folks are rolling out practical tracing services designed to handle both the distributed complexity and the huge scales involved. Just last week LightStep rolled out of stealth (founder Ben Sigelman was instrumental in OpenTracing 🙂 ). LightStep [x]PM, a managed service offering that incurs minimal performance-analysis overhead on site, provides 100% transaction tracing at scale by doing some sophisticated sampling during aggregation/monitoring while preserving full tracing info for immediate audit and drill-down. LightStep already has some impressively large-scale use cases stacked up.

FaaS Performance Management

This, of course, is not the end of the transactional tracing saga. I’ve written before about Fission, a developing open source Function as a Service (FaaS) layer on top of Kubernetes. That project has recently started on a next layer called Fission Workflow, which uses a YAML-like blueprint file to declare and stitch together functions into larger workflows (compare to AWS Step Functions). I think workflows of functions will naturally correspond to interesting “application” transactions.

And FaaS workflows could very well be the future of application development. Each function runs as a container, but by using something like Fission the developer doesn’t need to know about containers or container management. And when it comes to generating performance insight across webs of functions, the Fission Workflow engine itself can (or will) explicitly track transactions wherever they are defined to flow, tracing state/status, timing and so on.

[Check out the interesting Fission Workflow work-in-progress page for some categorization of the complexity of tracking async “waiting” workflows…]

This immediately makes me want to collect Fission Workflow data into something like Cassandra and play with subsets in Spark (especially graph-structured queries and visualization). There are a lot of new frontiers here to explore.
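
Something like this back-of-the-envelope PySpark sketch is what I have in mind; the keyspace, table, and column names are hypothetical, and the DataStax Spark Cassandra connector would need to be supplied on the Spark classpath.

```python
from pyspark.sql import SparkSession, functions as F

# Rough exploration sketch: pull workflow/trace events out of Cassandra via the
# Spark Cassandra connector and look at timing per workflow step. Keyspace,
# table, and column names are made up; the connector package must be added at
# spark-submit time (e.g., with --packages).

spark = (SparkSession.builder
         .appName("fission-workflow-traces")
         .config("spark.cassandra.connection.host", "cassandra.example.local")
         .getOrCreate())

events = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="tracing", table="workflow_events")
          .load())

# Average duration and event count per workflow step, slowest steps first.
(events
 .groupBy("workflow_id", "step_name")
 .agg(F.avg("duration_ms").alias("avg_ms"), F.count("*").alias("events"))
 .orderBy(F.desc("avg_ms"))
 .show(20, truncate=False))
```

Graph-structured queries and visualization could then layer on top of the same event data, for example by building vertex and edge frames for something like GraphFrames.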

I can’t wait to see what comes next!

 

Reap IT automation benefits in every layer of the stack

An IT industry analyst article published by SearchITOperations.


Automation technologies create an artificial brain for IT operations, but that won’t turn skilled admins and engineers into zombies — far from it.

Mike Matchett
Small World Big Data

As a technology evangelist and professional IT systems optimizer, I see the benefits of IT automation and can only champion trends that increase it. When we automate onerous tasks and complex manual procedures, we naturally free up time to focus our energies higher in the stack. Better and more prevalent automation increases the relative return on our total effort so that we each become more productive and valuable. Simply put, IT automation provides leverage. So it’s all good, right?

Another IT automation benefit is that it captures, encapsulates and applies valuable knowledge to real-world problems. And actually, it’s increasingly hard to find IT automation platforms that don’t promote embedded machine learning and artificially intelligent algorithms. There is a fear that once our hard-earned knowledge is automated, we’ll no longer be necessary.

So, of course, I need to temper my automation enthusiasm. Automation can eliminate low-level jobs, and not everyone can instantly adjust or immediately convert to higher-value work. For example, industrial robots, self-driving cars or a plethora of internet of things (IoT)-enabled devices that cut out interactions with local retailers all tend to remove the bottom layer of the related pyramid of available jobs. In those situations, there will be fewer, more-utilized positions left as one climbs upward in skill sets.

Still, I believe automation, in the long run, can’t help but create even more pyramids to climb. We are a creative species after all. Today, we see niches emerging for skilled folks with a combination of internal IT and, for example, service provider, high-performance computing, data science, IoT and DevOps capabilities.

Automation initiatives aren’t automatic

A service provider has a profit motive, so the benefit of IT automation is creating economies of scale. Those, in turn, drive competitive margins. But even within enterprise IT, where IT is still booked as a cost center, the drive toward intelligent automation is inevitable. Today, enterprise IT shops, following in the footsteps of the big service providers, are edging toward hybrid cloud-scale operations internally and finding that serious automation isn’t a nice-to-have, but a must-have. If one squints a bit, almost every IT initiative aims to increase automation. Most projects can be sorted roughly into these three areas, with different IT automation benefits from cost savings to higher uptime:

  • Assurance. Efforts to automate support and help desk tasks, shorten troubleshooting cycles, shore up security, protect data, reduce outages and recover operations quickly.
  • Operations. Necessary automation to stand up self-service catalogs, provision apps and infrastructure across hybrid and multi-cloud architectures to enable large-scale operations, and orchestrate complex system management tasks.
  • Optimization. Automation that improves or optimizes performance in complex, distributed environments, and minimizes costs through intelligent brokering, resource recovery and dynamic usage balancing.

Automation enablers at large

Successful automation initiatives don’t necessarily start by implementing new technologies like machine learning or big data. Organizational commitment to automation can drive a whole business toward a new, higher level of operational excellence… (read the complete as-published article there)