Follow That Transaction! APM Across Clouds

It’s one thing to stand up a new cloud-native application; it’s another thing entirely to manage its end-to-end performance, because the approaches and tooling we’ve long used in the data center simply don’t stretch that far.

I’m hearing a lot about transitional “new” cloud-native applications that actually combine and span layers of existing persistent storage, traditional data stores, key legacy application functionality hosted in VMs, and containerized new development. Existing back-end “stuff with APIs” can now be readily topped and extended by thousands (hundreds of thousands?) of microservices running web-like across hybrid and multi-cloud hosting platforms. Even the idea of what makes up any particular application can get pretty fuzzy.

There are certainly security, data protection, and availability/resilience concerns too, but the problem we are talking about today is that when you pile up that much complexity and scale, assuring production performance becomes quite a challenge.

Transactional History in Performance Management

Performance management includes monitoring targeted service levels, but it should also provide ways to identify both sudden and creeping problems, troubleshoot down to root cause (and then help remediate in situ), optimize bottlenecks to provide better service (an endless task, since there is always a next “longest pole in the tent”), and plan/predict for possible changes in functionality, usage, and resources (capacity planning in the cloud era).

I spent many years working for one of the so-called “big 4” system management companies, implementing infrastructure capacity planning and performance management (IPM) solutions in Fortune 500 data centers. With a little operational queueing-model work on rather monolithic workloads (mainframe, AS/400, mid-range UNIX…), we could help steer multi-million-dollar IT buys toward the right resources that would solve today’s problems and assure future performance.

A core concept is the idea of the mythical “workload transaction” as the unit of application work. In those days, at least for capacity planning, we could get away with a statistical transaction as the unit of work. For example, we’d observe a certain amount of active usage on a given system in terms of its CPU utilization, memory, I/O, etc., and then divide those metrics by an arbitrary usage metric (e.g. number of known users, number of I/Os written, processes forked, forms processed, function points, or the default generic CPU-second itself). This statistical modeling approach worked remarkably well in helping right-size, right-time, and right-host infrastructure investments.
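As a toy illustration (all the numbers and names below are made up), the arithmetic behind that statistical transaction unit is nothing fancier than dividing observed resource consumption by a chosen unit of work and then projecting forward:

// Made-up numbers: one hour of measurement on a hypothetical system.
const observed = {
  cpuSeconds: 7200,        // busy CPU-seconds consumed during the interval
  intervalSeconds: 3600,   // one-hour measurement window
  formsProcessed: 20000    // the arbitrary "unit of work" we normalize by
};

// The "statistical transaction": resource consumed per unit of work.
const cpuSecondsPerForm = observed.cpuSeconds / observed.formsProcessed;   // 0.36

// Planning question: can a 4-core box absorb a forecast of 50,000 forms/hour?
const forecastFormsPerHour = 50000;
const requiredUtilization =
  (forecastFormsPerHour * cpuSecondsPerForm) / (observed.intervalSeconds * 4);

console.log(cpuSecondsPerForm.toFixed(2) + " CPU-sec per form; forecast needs " +
  Math.round(requiredUtilization * 100) + "% of 4 cores");   // ~125% – time to buy

Crude, yes, but that per-transaction cost number is exactly what let us argue convincingly for (or against) the next big infrastructure purchase.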

However, this approach only went so far when it came to troubleshooting or optimizing within an application. We could readily look at application behavior in some kind of aggregate way, maybe isolating it down to a specific observed process (or set of identified processes). In some cases you could even get developers to add instrumentation (anybody remember ARM?) into key application code to count and report on arbitrary app-specific transactions. Of course this was rarely achievable in practice (most business-critical code was third-party, and the painful performance problems that needed solving “fast” were already in production).

If you needed to go inside the app itself, or track individual transactions across a distributed system (classically a 3-tier presentation/business logic/database architecture), you needed application insight from another set of tools that came to be called Application Performance Management (APM). APM solutions aimed to provide performance insight into application-specific transaction “definitions”. Instrumentation for transaction tracking was often inserted early in the app development process, which of course required some up-front discipline. Alternatively, a non-intrusive (but in many ways halfway) approach might capture network traffic and parse it (with deep packet inspection, or DPI) to produce information on transactional workflow, sometimes drilling down to identify individual transactions flowing between systems.
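To make that concrete, here is a minimal sketch (hypothetical names throughout) of the kind of in-app transaction bookkeeping that ARM-style instrumentation formalized – the developer brackets an app-defined unit of business work, and the agent aggregates counts and response times from there:

// A hypothetical in-app "transaction" wrapper: the developer brackets a unit of
// business work, and the collector aggregates counts and response times per name.
const stats = new Map();

async function measureTransaction(name, work) {
  const startedAt = Date.now();
  try {
    return await work();                       // the app-defined unit of work
  } finally {
    const elapsedMs = Date.now() - startedAt;
    const entry = stats.get(name) || { count: 0, totalMs: 0 };
    entry.count += 1;
    entry.totalMs += elapsedMs;
    stats.set(name, entry);                    // a real agent would ship this to a collector
  }
}

// Usage: wrap an app-specific transaction such as "process-order".
// await measureTransaction('process-order', () => processOrder(order));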

Hybrid Containerized PM

It’s nearly impossible to follow a unique transaction across today’s potentially huge web of containerized microservices with those older approaches. I think of it visually as similar to how our neurons theoretically fire and cascade in the brain – an overlapping mesh of activity. We can see behavior in aggregate easily enough, but how do we track what goes into each unique transaction?

First we need to realize that transaction workflow in this kind of environment is naturally complex. Application devs (and third-party services) can implement messaging buses and delivery queues, make synchronous calls while at the same time firing asynchronous events and triggers, span arbitrarily long pauses (to account for human interactions like web page clicks), cause large cascades, aggregate behavior (trigger some X every 10 Y’s), and so on.

The only real approach to tracking unique transactions is still instrumentation. Luckily there is now a “tracing” standard (see the OpenTracing project). But tracing gets even more challenging at large scale (and across dynamic and abstracted platform hosting). How much data (and how fast) can something like Splunk take in as constant instrumentation streams arrive from hundreds of thousands of microservices (and how much will that cost)? This can easily become a case where performance measurement consumes as much resource as the app itself, or more.
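For a feel of what that instrumentation looks like in code, here is a minimal sketch using the OpenTracing JavaScript API (the service, function, and tag names are hypothetical; a real deployment would register a concrete tracer implementation behind the no-op global tracer):

const opentracing = require('opentracing');

// globalTracer() returns a no-op tracer until a real implementation (LightStep,
// Jaeger, etc.) is registered, so code can ship instrumented before a backend exists.
const tracer = opentracing.globalTracer();

// Hypothetical downstream call – in a real service this would be an HTTP request
// that carries the injected headers along with it.
async function callPaymentService(cart, headers) { /* ... */ }

// Hypothetical request handler in one microservice.
async function handleCheckout(req) {
  const span = tracer.startSpan('checkout');     // one span per unit of work
  span.setTag('user.id', req.userId);

  // Propagate the trace context over HTTP headers so the next service's spans
  // get stitched into the same end-to-end trace.
  const headers = {};
  tracer.inject(span.context(), opentracing.FORMAT_HTTP_HEADERS, headers);
  await callPaymentService(req.cart, headers);

  span.finish();                                 // reported to whatever tracer is registered
}

Multiply that little span by every hop in the mesh and you can see both the power of the approach and where the data volume problem comes from.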

To tackle this, some folks are rolling out practical tracing services designed to handle both the distributed complexity and the huge scales involved. Just last week LightStep rolled out of stealth (founder Ben Sigelman was instrumental in OpenTracing 🙂 ). LightStep [x]PM is a managed service offering that incurs minimal performance-analysis overhead on site; it provides 100% transaction tracing at scale by doing some sophisticated sampling during aggregation/monitoring while preserving full tracing info for immediate audit/drill-down. LightStep already has some impressively large-scale use cases stacked up.

FaaS Performance Management

This, of course, is not the end of the transactional tracing saga. I’ve written before about Fission, a developing open source Function as a Service layer (FaaS on top of Kubernetes). That project has recently started on a next layer called Fission Workflow, which uses a YAML-like blueprint file to declare and stitch together functions into larger workflows (compare to AWS Step Functions). I think workflows of functions will naturally correspond to interesting “application” transactions.

And FaaS workflows could very well be the future of application development. Each function runs as a container, but by using something like Fission the developer doesn’t need to know about containers or container management. And when it comes to generating performance insight across webs of functions, the Fission Workflow engine itself can (or will) explicitly track transactions wherever they are defined to flow (tracing state/status, timing, etc.).

[Check out the interesting Fission Workflow work-in-progress page for some categorization of the complexity of tracking async “waiting” workflows…]

This immediately makes me want to collect Fission Workflow data into something like Cassandra and play with subsets in Spark (especially graph-structured queries and visualization). There are a lot of new frontiers to explore here.

I can’t wait to see what comes next!


I’m Going Fission

I just spent a couple of weeks in Boston at Red Hat Summit and OpenStack Summit. Containers are clearly the big thing this year – Kubernetes, OpenShift, etc. And increasingly, IT is learning how to take advantage of remote Management as a Service (MaaS) offerings that free up folks to focus more on business value and less on running complex stacks. On that front I talked with folks like Platform9, who also happen to sponsor a “server-less” computing solution called Fission (later in this post I’ll show how I got Fission deployed on my Mac).

Because I’m an industry analyst (in my day job), here is a big picture of the evolution happening in application infrastructure: Physically hosted apps (server and O/S) –> Virtual machines (in a hypervisor) –> Cloud platforms (e.g. OpenStack) –> Container “ships” (e.g. OpenShift, Docker, Kubernetes) –> Serverless Computing (e.g. AWS Lambda and Fission).

Applications have always been constructed out of multiple tiers and communicating parts, but generally we are moving towards a world in which functionality is both defined and deployed (distributable, scalable) in small, testable bits (i.e. “units” as in unit testing), while an application “blueprint” defines all the related bits and required service properties in operation.  Some folks are calling the blueprinting part “infrastructure as code”.

(BTW – the next evolutionary step is probably some kind of highly intelligent, dynamic IoT/Big Data/Distributed engine that inherently analyzes and distributes compute functionality out as far as it can go towards the IoT edge while centralizing data only as much as required. Kind of like a database query planner on IoT-size steroids).

So, on to my Mac deployment of Fission. I’ve already got VirtualBox installed for running Hadoop cluster sandboxes and other fun projects, but OpenStack is probably not something I really need or want to run on my own Mac (although apparently I could if I wanted more agility in spinning up and down big data clusters). But – Ah ha! – now a mental lightbulb goes on! (or rather, an LED goes on – gotta save power these days).

This Fission project means I can now run my own lambda services on my little desktop Mac too, and then easily deploy really cool stuff to really big clouds when someday I create that killer app (with lambdas that happily interface with other coolness like Spark, Neo4j, Ruby on Rails…). Ok, this is definitely something I want to play with. And I’m thinking, wait for it – Ruby lambdas! (Ruby is not dead, you fools! You’ll all eventually see why Ruby is the one language that will be used in the darkness to bind them all!)

Well, we’ll come back to Ruby later.  First things first – we’ll start with the default node.js example. Let’s aim for a local nested stack that will run like this:

osx (-> virtualbox (-> minikube (-> fission (-> node.js))))

host server – hypervisor – container cluster – lambda services – execution environment

While the lambda execution will be nested, the CLI commands to interface with minikube/kubernetes (kubectl) and fission (fission) will be available locally at the osx command line (in a terminal window).

Ok, I’ve already got VirtualBox, but it’s out of date for minikube. So I download the latest directly off the web and install it – oops, first issue! Mac OS X now has a fancy System Integrity Protection (SIP) layer that prevents anyone from actually getting anything done as root (I swear, if they keep making my unix-based Mac work like iOS I’m gonna convert to Ubuntu!). So after working around security to get that update in place (and thank you, Oracle, for VirtualBox) we are moving on!

$ virtualbox
Oh, and make sure to also have kubectl installed locally. The local kubectl will automatically be pointed at the kubernetes environment running inside VirtualBox (minikube sets up the kubectl context for you).
# (kubectl download URL per the Kubernetes install docs for macOS)
$ curl -Lo kubectl https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/darwin/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
$ kubectl version

For the minikube install I used brew, which of course I had to update first. And of course, I again had to work around the Mac OS X SIP challenge above (hopefully a one-time fix) by setting /usr/local directory ownership to myself (and then back to root:wheel after the dust settled).

$ brew update
$ brew cask install minikube
$ minikube start                                  # spins up the cluster VM in VirtualBox
# minikube stop                                   # (for later, to shut the cluster down)
# minikube service [-n NAMESPACE] [--url] NAME    # (for later, to find a service URL)
$ minikube ip
$ minikube dashboard                              # opens the kubernetes dashboard in a browser

At this point you can deploy containerized apps with kubectl into the minikube “cluster”. This next bit is an example of a simple “echo” server from the minikube GitHub docs.

$ kubectl run hello-minikube --image=gcr.io/google_containers/echoserver:1.4 --port=8080   # image per the minikube quickstart
$ kubectl expose deployment hello-minikube --type=NodePort
$ kubectl get pod
$ curl $(minikube service hello-minikube --url)

(If you are following along, you might suggest that I should play here with minishift too, but now is not yet the time! Maybe I’ll climb into that PaaS arena in another post.)

Now it’s time for Fission. These next snippets are taken from the Fission GitHub README. The first curl installs the fission command-line client locally. The kubectl lines start up the Fission services. The two shell variables are just for convenience with the provided examples, not part of the required install.

$ curl -Lo fission https://github.com/fission/fission/releases/download/<latest-release>/fission-cli-osx && chmod +x fission && sudo mv fission /usr/local/bin/
# (substitute the current release tag from the Fission releases page)

$ kubectl create -f https://raw.githubusercontent.com/fission/fission/master/fission.yaml
$ kubectl create -f https://raw.githubusercontent.com/fission/fission/master/fission-nodeport.yaml

$ export FISSION_URL=http://$(minikube ip):31313
$ export FISSION_ROUTER=$(minikube ip):31314    # for these examples

Ok, so now we have our own lambda services host running. Next we can start deploying lambda functions. Fission does a number of things for us, like scaling out our services and keeping a few containers warm for fast startup, and probably a bunch of stuff I won’t figure out until some O’Reilly book comes out (or, I could just read the code…).

$ fission env create --name nodejs --image fission/node-env
$ curl https://raw.githubusercontent.com/fission/fission/master/examples/nodejs/hello.js > hello.js

$ fission function create --name hello --env nodejs --code hello.js
$ fission route create --method GET --url /hello --function hello

First, we create a fission environment that associates the container image fission/node-env with the name “nodejs”. Then we create a fission function from our little hello.js lambda “code” and assign it to that environment. Here we are using JavaScript and node.js, but other execution environments are available (and we can make our own!). Finally, we map a web service route to our fission function.

module.exports = async function(context) {
    return {
        status: 200,
        body: "Hello, World!\n"
    };
}
You can see that a Fission lambda function is just a JavaScript function. In this case all it does is return a standard HTTP response.

$ curl http://$FISSION_ROUTER/hello
 ->  Hello, World!

Testing it out – we hit the URL with a GET request and tada!  Hello World!
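For something slightly less trivial, here is a sketch of a function that reads part of the incoming request. I’m assuming the nodejs environment exposes an Express-style request object on context.request (query, headers, body), so double-check that against the node-env docs before relying on it:

// hello2.js – echoes a name from the query string (hypothetical example).
// Assumes context.request is an Express-style request; verify against node-env.
module.exports = async function(context) {
    const name = (context.request.query && context.request.query.name) || "World";
    return {
        status: 200,
        body: "Hello, " + name + "!\n"
    };
}

Deploying it would be the same fission function create / fission route create dance as above, just with a new name and file.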

This is quite an onion we’ve built, but you hopefully can appreciate that each layer is adding to the architecture that would enable easy deployment at large scale and wide distribution down the road. Next up though, I personally want Ruby lambdas!

I could build a full native Ruby fission environment (it should be easy enough to start from an existing Red Hat or Docker Ruby container). There is a Python fission example that wouldn’t be hard to emulate. I’d have to decide on key gems to pre-load, and that leads to a bigger question of what I’d actually like to do and how big and fat that environment might get (which could be slow and bloated). Or we could try to stay very small – there have been small embeddable Rubies like mruby (although that one looks dead since 2015). There is also some interesting advice out there for building minimal Ruby app containers.

While not actually Ruby, transpiling Ruby-like CoffeeScript to JavaScript seems the easiest route at the moment, and it just uses the vanilla fission node.js environment we already have above. I could also see embedding “coffee” in a fission environment easily enough, so that I could send CoffeeScript code directly to fission (although that would require transpiling on every lambda execution – it’s always a trade-off). To get started with coffee, add it to your local node.js environment (install Node first if you don’t already have it).

$ npm install -g coffee-script
$ coffee     # with no arguments this starts the interactive CoffeeScript REPL

Using coffee is easy enough. Learning it might take a bit of study, although if you like Ruby and only suffer when forced to work with native JavaScript, it’s well worth it.

But CoffeeScript is not Ruby. Something like Opal (transpiling full Ruby syntax to JavaScript) is an even more interesting project, and if it were ever solid it could be used here with fission in a number of ways – possibly embedded in its own Opal-Ruby fission environment, applied statically upstream of a node.js fission environment like CoffeeScript, or even used dynamically as a wrapper around Ruby code sent to the node.js environment.

Another idea is to build a small native Ruby fission solution with something like a nested Sinatra design: first create a local “super-fission-sinatra” DSL that would deploy Sinatra-like web service definition code to an embedded Ruby/Sinatra fission environment. Kind of meta-meta, maybe, but it could be an interesting way to build scalable, instrumented APIs.

All right – that’s enough for now. Time to play! Let me know if you create any Ruby Fission examples!