Follow That Transaction! APM Across Clouds

It’s one thing to stand up a new cloud-native application; it’s another thing entirely to manage its end-to-end performance, and it’s impossible to do so using the same approaches and tooling we’ve long used in the data center.

I’m hearing a lot about transitional “new” cloud-native applications that actually combine and span layers: existing persistent storage, traditional data stores, key legacy application functionality hosted in VMs, and containerized new development. Existing back-end “stuff with APIs” can now be readily topped and extended by thousands (hundreds of thousands?) of microservices running web-like across hybrid and multi-cloud hosting platforms. Even the idea of what makes up any particular application can get pretty fuzzy.

While there are certainly security, data protection, and availability/resilience concerns, the problem we are talking about today is that when you pile up that much complexity and scale, assuring production performance becomes quite a challenge.

Transactional History in Performance Management

Performance management includes monitoring targeted service levels, but it also should provide ways to identify both sudden and creeping problems, troubleshoot down to root cause (and then help remediate in situ), optimize bottlenecks to provide better service (an endless job, because there is always a “next” longest pole in the tent), and plan/predict for possible changes in functionality, usage, and resources (capacity planning in the cloud era).

I spent many years working for one of the so-called “big 4” systems management companies, implementing infrastructure capacity planning and performance management (IPM) solutions in Fortune 500 data centers. With a little operational queuing-model work on rather monolithic workloads (mainframe, AS/400, mid-range UNIX…), we could help steer multi-million-dollar IT buys toward the right resources to solve today’s problems and assure future performance.

A core concept is the idea of the mythical “workload transaction” as the unit of application work. In those days, at least for capacity planning, we could get away with a statistical transaction as the unit of work. For example, we’d observe a certain amount of active usage on a given system in terms of its CPU utilization, memory, IO, etc., and then divide those metrics by an arbitrary usage metric (e.g. number of known users, number of IOs written, processes forked, forms processed, function points, or the default generic CPU-second itself). This statistical modeling approach worked remarkably well in helping right-size, right-time, and right-host infrastructure investments.
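To make that concrete, here is a minimal sketch of the statistical approach (the metric names and numbers are invented for illustration): divide observed resource consumption by whatever usage metric you’ve chosen, then scale the per-unit costs to a forecast workload level.

```python
# A minimal sketch (not from any specific product) of the statistical
# "workload transaction" idea: divide observed resource consumption by an
# arbitrary usage metric to get a per-unit cost, then project future demand.

observed = {"cpu_seconds": 5400.0, "io_ops": 2.1e6, "mem_gb_hours": 96.0}
transactions = 150_000  # e.g. forms processed during the same interval

per_txn = {metric: value / transactions for metric, value in observed.items()}

def projected_demand(expected_transactions: int) -> dict:
    """Scale the per-transaction costs to a forecast workload level."""
    return {metric: cost * expected_transactions for metric, cost in per_txn.items()}

print(per_txn)
print(projected_demand(400_000))  # will the planned hardware keep up?
```

Crude as it is, this kind of per-unit arithmetic was enough to right-size most monolithic workloads; it breaks down once transactions fan out across many cooperating services.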

However, this approach only went so far when it came to troubleshooting or optimizing within an application. We could readily look at application behavior in some kind of aggregate way, maybe isolating behavior down to a specific observed process (or set of identified processes). In some cases you could even get developers to add instrumentation (anybody remember ARM?) into key application code to count and report arbitrary app-specific transactions. Of course, this was rarely achievable in practice (most business-critical code was third-party, and the painful performance problems that needed solving “fast” were already in production).

If you needed to go inside the app itself, or track individual transactions across a distributed system (classically a three-tier presentation/business logic/database architecture), you needed application insight from another set of tools that came to be called Application Performance Management (APM). APM solutions aimed to provide performance insight into application-specific transaction “definitions.” Instrumentation for transaction tracking was often “inserted” early in the app development process, which of course requires some up-front discipline. Alternatively, a non-intrusive (but in many ways halfway) approach might capture network traffic and parse it (with deep packet inspection, or DPI) to produce information on transactional workflow, sometimes drilling down to identify individual transactions flowing between systems.

Hybrid Containerized PM

It’s practically impossible to follow a unique transaction by observation alone across today’s potentially huge web of containerized microservices. I think of it visually as similar to how our neurons theoretically fire and cascade in the brain – an overlapping mesh of activity. We can see behavior in aggregate easily enough, but tracking what goes into each unique transaction?

First we need to realize that transaction workflow in this kind of environment is naturally complex. Application devs (and third-party services) can implement message buses and delivery queues, make synchronous calls while at the same time firing asynchronous events and triggers, span arbitrarily long pauses (to account for human interactions like web page interaction), cause large cascades, aggregate behavior (trigger some action X every 10 Y’s), and so on.

The only real approach to tracking unique transactions is still instrumentation. Luckily there is now a tracing standard (see the OpenTracing project). But tracing is even more challenging at large scale (and across dynamic and abstracted platform hosting). How much data (and how fast) can something like Splunk take in as constant instrumentation data streams from hundreds of thousands of microservices (and how much will that cost)? This can easily become a case where performance measurement uses as much or more resource than the app itself.
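As a rough illustration (not any vendor’s product, and with invented service and operation names), OpenTracing-style instrumentation in Python looks something like the sketch below. Note that the opentracing package ships a no-op global tracer, so a real deployment would register a concrete tracer (Jaeger, LightStep, etc.) before serving traffic.

```python
# Rough sketch of OpenTracing-style instrumentation; the operation names and
# tags are invented. By default opentracing exposes a no-op tracer, so a real
# deployment would install a concrete tracer as opentracing.tracer first.
import opentracing

tracer = opentracing.tracer  # global tracer (no-op unless one is registered)

def handle_checkout(order_id: str):
    # Each unit of work becomes a span; nested spans are automatically
    # parented by the active scope, building up the end-to-end trace.
    with tracer.start_active_span("checkout") as scope:
        scope.span.set_tag("order.id", order_id)
        reserve_inventory(order_id)
        charge_payment(order_id)

def reserve_inventory(order_id: str):
    with tracer.start_active_span("reserve_inventory") as scope:
        scope.span.set_tag("order.id", order_id)
        ...  # call the inventory microservice, propagating the span context

def charge_payment(order_id: str):
    with tracer.start_active_span("charge_payment") as scope:
        scope.span.set_tag("order.id", order_id)
        ...  # call the payment microservice
```

Multiply that per-request span traffic by hundreds of thousands of services and the data volume problem described above becomes obvious.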

To address this, some folks are rolling out practical tracing services designed to tackle both the distributed complexity and the huge scale involved. Just this last week LightStep rolled out of stealth (founder Ben Sigelman was instrumental in OpenTracing 🙂 ). LightStep [x]PM is a managed service offering that incurs minimal performance analysis overhead on site; it provides 100% transaction tracing at scale by doing some sophisticated sampling during aggregation/monitoring while preserving full tracing info for immediate audit/drill-down. LightStep already has some impressively large-scale use cases stacked up.

FaaS Performance Management

This, of course, is not the end of the transactional tracing saga. I’ve written before about Fission, a developing open source Function-as-a-Service layer (FaaS on top of Kubernetes). That project has recently started on a next layer called Fission Workflows, which uses a YAML-like blueprint file to declare and stitch together functions into larger workflows (compare to AWS Step Functions). I think workflows of functions will naturally correspond to interesting “application” transactions.

And FaaS workflows could very well be the future of application development. Each function runs as a container, but by using something like Fission the developer doesn’t need to know about containers or container management. And when it comes to generating performance insight across webs of functions, the Fission Workflows engine itself can (or will) explicitly track transactions across wherever they are defined to flow (tracing state/status, timing, etc.).
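Here is a toy sketch of that idea (emphatically not Fission Workflows’ actual engine): because a workflow engine already knows every step it invokes, it can record per-step status and timing as a trace without instrumenting each function. All names below are hypothetical.

```python
# Toy illustration of why a workflow engine is a natural place to track
# transactions: it already orchestrates every step, so it can record timing
# and status as a trace without touching the functions themselves.
import time
import uuid

def run_workflow(name, steps, payload):
    trace = {"workflow": name, "id": str(uuid.uuid4()), "steps": []}
    for step in steps:
        started = time.time()
        try:
            payload = step(payload)
            status = "ok"
        except Exception as exc:
            status, payload = f"error: {exc}", None
        trace["steps"].append(
            {"step": step.__name__, "status": status,
             "duration_ms": round((time.time() - started) * 1000, 2)}
        )
        if status != "ok":
            break
    return payload, trace

# Hypothetical stand-ins for FaaS functions stitched into a workflow.
def fetch_order(order_id): return {"order": order_id}
def price_order(order): return {**order, "total": 42.0}

result, trace = run_workflow("checkout", [fetch_order, price_order], "o-123")
print(trace)
```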

[Check out the interesting Fission Workflows work-in-progress page for some categorization of the complexity involved in tracking async “waiting” workflows…]

This immediately makes me want to collect Fission Workflows data into something like Cassandra and play with subsets in Spark (especially graph-structured queries and visualization). There are a lot of new frontiers here to explore.
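Something like the following sketch is what I have in mind (the keyspace, table, and column names are invented, and it assumes the DataStax spark-cassandra-connector is available on the Spark classpath):

```python
# Hypothetical sketch: pull workflow trace records out of Cassandra via the
# spark-cassandra-connector (assumed on the classpath) and explore them in
# Spark. Keyspace, table, and column names are invented for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("workflow-traces").getOrCreate()

traces = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="apm", table="workflow_steps")
          .load())

# Aggregate per-step latency to find the current "longest pole in the tent".
(traces.groupBy("workflow", "step")
       .agg(F.avg("duration_ms").alias("avg_ms"),
            F.expr("percentile_approx(duration_ms, 0.99)").alias("p99_ms"))
       .orderBy(F.desc("p99_ms"))
       .show())

# For graph-structured queries, caller/callee pairs could be reshaped into
# vertex and edge frames for something like GraphFrames.
edges = traces.select(F.col("caller").alias("src"), F.col("step").alias("dst"))
```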

I can’t wait to see what comes next!

 

What’s a Software Defined Data Center? – Pensa Aims Really High

This week Pensa came out of their stealthy development phase to announce the launch of their company and their Pensa Maestro cloud-based (SaaS) platform, accessible today through an initial service offering called Pensa Lab. The technology here presents a great opportunity, and importantly, the team at Pensa is firming up with some of the best folks in the business (I used to work for Tom Joyce).

I’m not sure we analysts have firmed up all the words to easily describe what they do yet, but basically Pensa provides a way to define a whole data center in code, validate it as a model, and then pull a trigger and aim it at some infrastructure to deploy it automatically. Data centers on demand! Of course, doing all the background transformations to validate and actually deploy this über level of complexity and scale requires big smarts – a large part of the magic here is some cleverly applied ML algorithms that drive the required transformations, ensure policies, and set up SDN configurations.

What is Software Defined?

So let’s back up a bit and explore some of the technologies involved. One of the big benefits of software and software-defined resources is that they can be spun up dynamically (and readily converged within compute hosts alongside applications and other software-defined resources). These software-side “resources” are usually provisioned and configured through editable model/manifest files or templates – so-called “infrastructure as code.” Because they are implemented in software, they are often also dynamically reconfigurable and remotely programmable through APIs.

Application Blueprinting for DevOps

On the other side of the IT fence, applications are increasingly provisioned and deployed dynamically via recipes or catalog-style automation, which in turn rely on internal application “blueprint” or container manifest files that can drive automated configuration and deployment of application code and needed resources, like private network connections, storage volumes, and specific data sets. This idea is most visible in new containerized environments, but we also see application blueprinting coming on strong for legacy hypervisor environments and bare-metal provisioning solutions too.
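A made-up, minimal blueprint might look like the sketch below: the application declares what it needs, and automation walks the declaration, handing each requirement to a provisioner. The field names and the provisioners mapping are hypothetical, not any particular tool’s format.

```python
# A made-up, minimal application "blueprint": the app declares what it needs,
# and automation hands each declared need to a provisioner. Field names and
# the provisioners mapping are hypothetical.
blueprint = {
    "app": "orders-service",
    "image": "registry.example.com/orders:1.4.2",
    "replicas": 3,
    "needs": [
        {"kind": "network", "name": "private-backend"},
        {"kind": "volume", "name": "orders-data", "size_gb": 100},
        {"kind": "dataset", "name": "price-list", "source": "s3://example/prices"},
    ],
}

def deploy(blueprint, provisioners):
    """Walk the blueprint and hand each declared need to its provisioner."""
    for need in blueprint["needs"]:
        provisioners[need["kind"]](need)      # e.g. wire a network, carve a volume
    provisioners["workload"](blueprint)       # finally schedule the app containers

# Trivially fake provisioners, just to make the control flow runnable.
provisioners = {
    "network":  lambda n: print("wiring network", n["name"]),
    "volume":   lambda n: print("creating volume", n["name"], n["size_gb"], "GB"),
    "dataset":  lambda n: print("staging dataset", n["name"], "from", n["source"]),
    "workload": lambda b: print("scheduling", b["replicas"], "replicas of", b["image"]),
}

deploy(blueprint, provisioners)
```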

Truly Software Defined Data Centers

If you put these two ideas together – software-defined resources and application blueprinting – you might envision a truly software defined data center, describable fully in code. With some clever discovery solutions, you can imagine an existing data center being explored and captured/documented into a model file describing a complete blueprint for both infrastructure and applications (and the enterprise services that wrap around them). Versions of that data center “file” could be edited as desired (e.g. to make a test or dev version), with the resulting data center models deployable at will on some other actual infrastructure – like “another” public cloud.

Automating this scenario requires intelligently translating high-level blueprint service and resource requirements into practical provisioning and operational configurations on the specifically targeted infrastructure. But imagine being able to effectively snapshot your current data center top to bottom, and then deploy a full, complete copy on demand for testing, replication, or even live DR (we might call this a “live re-inflation DR,” or LR-DR, scenario).
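As a thought experiment in code (emphatically not Pensa’s actual engine), a captured data center model is just data, so re-inflating it elsewhere reduces to mapping each abstract requirement onto a target cloud’s equivalent. Everything below is invented for illustration.

```python
# Thought experiment: translate a captured data center model into a
# target-specific provisioning plan. All names and classes are invented.
datacenter_model = {
    "networks": [{"name": "backend", "cidr": "10.0.0.0/16"}],
    "storage":  [{"name": "orders-data", "class": "fast-block", "size_gb": 500}],
    "apps":     [{"blueprint": "orders-service", "replicas": 3}],
}

# Per-target translation tables: the same model deploys to different clouds.
TARGET_STORAGE_CLASS = {
    "cloud-a": {"fast-block": "premium-ssd"},
    "cloud-b": {"fast-block": "io-optimized"},
}

def plan_deployment(model, target):
    plan = []
    for net in model["networks"]:
        plan.append(("create_network", target, net["name"], net["cidr"]))
    for vol in model["storage"]:
        plan.append(("create_volume", target,
                     TARGET_STORAGE_CLASS[target][vol["class"]], vol["size_gb"]))
    for app in model["apps"]:
        plan.append(("deploy_blueprint", target, app["blueprint"], app["replicas"]))
    return plan

print(plan_deployment(datacenter_model, "cloud-b"))
```

The hard part, of course, is what this sketch hand-waves away: validating that the translated plan actually satisfies the model’s policies and dependencies, which is where Pensa aims its ML.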

Of course, today’s data center is increasingly hybrid/multi-cloud, consisting of a mix of physical, virtual machine, and containerized apps and corporate data. But through emerging cutting-edge IT capabilities like hybrid-supporting software-defined networking and storage, composable bare-metal provisioning, virtualizing hypervisors and cloud-orchestration stacks, container systems, PaaS, and hybrid cloud storage services (e.g. HPE’s Cloud Volumes), it’s becoming possible to blueprint and dynamically deploy not just applications, but soon the whole data center around them.

There is no way that VMware, whose tagline has been SDDC for some time, will roll over and cede this territory completely to Pensa (or any other startup). But Pensa has a live service out there today – and that could prove disruptive to the whole enterprise IT market.

What’s a Multi-cloud Really?  Some Insider Notes from VMworld 2017

(Excerpt from original post on the Taneja Group News Blog)

As comfortable 65-70 degree weather blankets New England here near the end of summer, flying into Las Vegas for VMworld at 110 degrees seemed like dropping into hell. The last time I was in that kind of heat, I was stepping off a C-130 into the Desert Shield/Desert Storm theater of operations. At least here, as everyone still able to breathe immediately says, “at least it’s a dry heat.”

…(read the full post)

Open Wide and Say Ahh!

(Excerpt from original post on the Taneja Group News Blog)

I’ve been immersed in “Open” for the last two weeks here in Boston, attending Red Hat Summit 2017 and then OpenStack Summit. There are quite a few things worth paying attention to, especially if you are an enterprise IT shop still wondering how your inevitable cloud (and services) transformation is really going to play out, including accelerating application migration to containers and the rise of platform Management as a Service.

…(read the full post)

Actual Hybrid of Enterprise Storage and Public Cloud? Oracle creates a Cloud Converged System

(Excerpt from original post on the Taneja Group News Blog)

What’s a Cloud Converged system? It is really what we naive people thought hybrid storage was all about all along. Yet until now, no high-performance, enterprise-class storage ever actually delivered it. But now Oracle’s latest ZFS Storage Appliance, the ZS5, comes natively integrated with Oracle Cloud storage. What does that mean? On-premises ZS5 storage object pools now extend organically into Oracle Cloud storage (which is also made up of ZS storage) – no gateway or third-party software required.
 
Oracle has essentially brought enterprise hybrid cloud storage to market, no integration required. I’m not really surprised that Oracle has been able to roll this out, but I am a little surprised that they are leading the market in this area.
 
Why hasn’t Dell EMC come up with a straightforward hybrid cloud leveraging their enterprise storage and cloud solutions? Despite having all the parts, they have failed to produce the long-desired converged solution – maybe due to internal competition between infrastructure and cloud divisions? Well, guess what: customers want to buy hybrid storage, not bundles or bunches of parts and disparate services that could be integrated (not to mention wondering who supports the resulting stack of stuff).
 
Some companies are so married to their legacy solutions that they, like NetApp for example, don’t even offer their own cloud services – maybe they were hoping this cloud thing would just blow over? Maybe all those public cloud providers would stick with web 2.0 apps and wouldn’t compete for enterprise GB dollars?
 
(Microsoft does have StorSimple, which may have pioneered on-premises storage integrated with cloud tiering (to Azure). However, StorSimple is not a high-performance, enterprise-class solution (capable of handling PBs+ with massive memory-accelerated performance). And it appears that Microsoft is no longer driving direct sales of StorSimple, apparently positioning it now only as one of many on-ramps to herd SMEs fully into Azure.)
 
We’ve reported on the Oracle ZFS Storage Appliance before, and it has been highly augmented over the years. The Oracle ZFS Storage Appliance is a great filer on its own, competing favorably on price and performance with all the major NAS vendors, and it provides extra value with all the Oracle Database co-engineering poured into it. Now that it’s inherently cloud-enabled, we think that for some folks it’s likely the last NAS they will ever need to invest in (if you want more performance, you will likely move to in-memory solutions, and if you want more capacity – well, that’s what the cloud is for!).
 
Oracle’s Public Cloud is made up of – actually built out of – Oracle ZFS Storage Appliances. That means the same storage is running on the customer’s premises as in the public cloud they are connected with. Not only does this eliminate a whole raft of potential issues, but solving any problems that do arise is going to be much simpler (and problems are less likely to happen in the first place, given that Oracle deploys its own hardware at scale before customers do).
 
Compare this to NetApp’s offering to run a virtual image of NetApp storage in a public cloud, which only layers up complexity and potential failure points. We don’t see many taking the risk of running or migrating production data onto that kind of storage. Their NPS co-located private cloud storage is perhaps a better offering, but the customer still owns and operates all the storage – there is really no public cloud storage benefit like elasticity or utility pricing.
 
Other public clouds and on-premises storage can certainly be linked with products like Attunity CloudBeam, or with additional cloud gateways or replication solutions. But those complications are exactly what Oracle’s new offering does away with.
 
There is certainly a core vendor alignment of on-premises Oracle storage with an Oracle Cloud subscription, and no room for cross-cloud brokering at this point. But a ZFS Storage Appliance presents no more technical lock-in than any other NAS (aside from the claim that it is more performant at less cost, especially for key workloads running Oracle Database), nor does Oracle Cloud restrict the client to just Oracle on-premises storage.
 
And if you are buying into the Oracle ZFS family, you will probably find that the co-engineering with Oracle Database (and Oracle Cloud) makes the whole set that much more attractive, technically and financially. I haven’t done recent pricing in this area, but I suspect that while there may be cheaper cloud storage prices per vanilla GB out there, looking at the full TCO for an enterprise GB, hybrid features and agility could bring Oracle Cloud Converged Storage to the top of the list.

…(read the full post)

The New Big Thing in Big Data: Results From Our Apache Spark Survey

(Excerpt from original post on the Taneja Group News Blog)

In the last few months I’ve been really bullish on Apache Spark as a big enabler of wider big data solution adoption. Recently we got a great opportunity to conduct some deep Spark market research (with Cloudera’s sponsorship) and were able to survey nearly seven thousand (6,900+) highly qualified technical and managerial people working with big data around the world.
   
Some highlights: first, across a broad range of industries, company sizes, and big data maturities, over half (54%) of respondents are already actively using Spark to solve a primary organizational use case. That’s an incredible adoption rate, no doubt due to the many ways Spark makes big data analysis accessible to a much wider audience – not just PhDs but anyone with a modicum of SQL and scripting skills.
   
When it comes to use cases, in addition to the expected Data Processing/Engineering/ETL use case (55%), we found high rates of forward-looking and analytically sophisticated use cases like Real-time Stream Processing (44%), Exploratory Data Science (33%), and Machine Learning (33%). Support for the more traditional customer intelligence (31%) and BI/DW (29%) use cases wasn’t far behind. Adding those numbers up, you can see that many organizations are already applying Spark to more than one important type of use case at the same time – a good sign that Spark supports nuanced applications and offers some great efficiencies (sharing big data, converging analytical approaches).
 
Is Spark going to replace Hadoop and the Hadoop ecosystem of projects? A lot of folks run Spark on its own cluster, but we assess that this is mostly for performance and availability isolation. And that is likely just a matter of platform maturity – it’s likely that future schedulers (and/or something like Pepperdata) will solve the multi-tenancy QoS issues of running Spark alongside, and converged with, any and all other kinds of data processing solutions (e.g. NoSQL, Flink, search…).
 
In practice, converged analytics are already the big trend: nearly half of current users (48%) said they use Spark with HBase, and 41% also use it with Kafka. Production big data solutions are really pipelines of activities spanning from data acquisition and ingest through full data processing and disposition. We believe that as Spark grows its organizational footprint out from initial data processing and ad-hoc data science into advanced operational (i.e. data center) production applications, it will truly blossom, fully enabled by other supporting big data ecosystem technologies.
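For a flavor of what such a converged pipeline looks like, here is an illustrative Spark Structured Streaming job ingesting from Kafka (the topic, broker, and field names are invented, and it assumes the spark-sql-kafka connector package is available to the session):

```python
# Illustrative pipeline only (topic, broker, and field names are invented):
# Spark Structured Streaming ingesting from Kafka, the kind of converged
# ingest-plus-processing flow described above. Assumes the spark-sql-kafka
# connector package is available to the Spark session.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("converged-pipeline").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")
          .load()
          .selectExpr("CAST(value AS STRING) AS body", "timestamp"))

# Simple streaming aggregation; downstream this could land in HBase, feed an
# MLlib model, or join with batch data -- the "converged analytics" pattern.
counts = events.groupBy(F.window("timestamp", "1 minute")).count()

query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()
```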

…(read the full post)