I’m Going Fission

I just spent a couple of weeks in Boston at Red Hat Summit and OpenStack Summit. Containers are clearly the big thing this year – Kubernetes, OpenShift, etc. And increasingly, IT is learning how to take advantage of remote Management As A Service (MaaS) offerings that free folks up to focus more on business value and less on running complex stacks. On that front I talked with folks like Platform9, who also happen to sponsor a “serverless” computing solution called Fission (later in this post I’ll show how I got Fission deployed on my Mac).

Because I’m an industry analyst (in my day job), here is a big picture of the evolution happening in application infrastructure: Physically hosted apps (server and O/S) –> Virtual machines (in a hypervisor) –> Cloud platforms (e.g. OpenStack) –> Container “ships” (e.g. OpenShift, Docker, Kubernetes) –> Serverless Computing (e.g. AWS Lambda and Fission).

Applications have always been constructed out of multiple tiers and communicating parts, but generally we are moving towards a world in which functionality is both defined and deployed (distributable, scalable) in small, testable bits (i.e. “units” as in unit testing), while an application “blueprint” defines all the related bits and required service properties in operation.  Some folks are calling the blueprinting part “infrastructure as code”.

(BTW – the next evolutionary step is probably some kind of highly intelligent, dynamic IoT/Big Data/Distributed engine that inherently analyzes and distributes compute functionality out as far as it can go towards the IoT edge while centralizing data only as much as required. Kind of like a database query planner on IoT-size steroids).

So, onto my Mac deployment of Fission. I’ve already got VirtualBox installed for running Hadoop cluster sandboxes and other fun projects, but OpenStack is probably not something I really need or want to run on my own Mac (although apparently I could if I wanted more agility in spinning up and down big data clusters). But – Ah ha! – now a mental lightbulb goes on! (or rather, an LED goes on – gotta save power these days).

This Fission project means I can now run my own lambda services on my little desktop Mac too, and then easily deploy really cool stuff to really big clouds someday when I create that killer app (with lambdas that happily interface with other coolness like Spark, Neo4j, Ruby on Rails…).  Ok, this is definitely something I want to play with.  And I’m thinking, wait for it – Ruby lambdas!  (Ruby is not dead, you fools! You’ll all eventually see why Ruby is the one language that will be used in the darkness to bind them all!)

Well, we’ll come back to Ruby later.  First things first – we’ll start with the default node.js example. Let’s aim for a local nested stack that will run like this:

osx (-> virtualbox (-> minikube (-> fission (-> node.js))))

host server – hypervisor – container cluster – lambda services – execution environment

While the lambda execution will be nested, the CLI commands to interface with minikube/kubernetes (kubectl) and fission (fission) will be available locally at the osx command line (in a terminal window).

Ok, I’ve already got VirtualBox, but it’s out of date for minikube. So I download the latest directly off the web and install it – oops, first issue! Mac OSX now has a fancy SIP (System Integrity Protection) security layer that prevents anyone from actually getting anything done as root (I swear, if they keep making my Unix-based Mac work like iOS I’m gonna convert to Ubuntu!). So after working around security to get that update in place (and thank you, Oracle, for VirtualBox) we are moving on!

$ virtualbox
Oh, and make sure to also have kubectl installed locally. The local kubectl will get automatically configured (via its kubeconfig context) to talk to the minikube kubernetes environment that will be running inside virtualbox.
$ curl -Lo kubectl https://storage.googleapis.com/kubernetes-release/release/v1.6.0/bin/darwin/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
$ kubectl version

For the minikube install I used brew, which of course I had to update first. And of course, I had to again work around the Mac OSX SIP challenge above (hopefully this is a one-time fix) by setting /usr/local directory ownership to myself (then back to root:wheel after the dust settled).
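
For reference, the ownership dance looked roughly like this – just a sketch of what I did, so adjust to taste and be careful about handing /usr/local to yourself even temporarily:

$ sudo chown -R $(whoami) /usr/local     # temporarily take ownership for the install
# (run the brew / minikube install steps below)
$ sudo chown -R root:wheel /usr/local    # then put things back after the dust settles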

$ brew update
$ brew cask install minikube
$ minikube start 
# minikube stop
# minikube service [-n NAMESPACE] [--url] NAME
$ minikube ip
$ minikube dashboard

At this point you can deploy containerized apps with kubectl into the minikube “cluster”.  This next bit is an example of a simple “echo” server from the minikube GitHub readme.

$ kubectl run hello-minikube --image=gcr.io/google_containers/echoserver:1.4 --port=8080
$ kubectl expose deployment hello-minikube --type=NodePort
$ kubectl get pod
$ curl $(minikube service hello-minikube --url)
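
When you’re done poking at the echo server, you can clean it up before moving on – this is plain kubectl, nothing minikube-specific:

$ kubectl delete service hello-minikube
$ kubectl delete deployment hello-minikube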

(If you are following along, you might suggest that I should play here with minishift too, but now is not yet the time! Maybe I’ll climb into that PaaS arena in another post.)

Now it’s time for Fission. These next snippets are taken from the Fission GitHub readme. The first curl gets the fission client command line installed locally. The kubectl lines start the fission services up. The two shell variables are just for the convenience of the provided examples, and not part of the required install.

$ curl http://fission.io/mac/fission > fission && chmod +x fission && sudo mv fission /usr/local/bin/

$ kubectl create -f http://fission.io/fission.yaml
$ kubectl create -f http://fission.io/fission-nodeport.yaml

$ export FISSION_URL=http://$(minikube ip):31313
$ export FISSION_ROUTER=$(minikube ip):31314    # just for the examples below

Ok, so now we have our own lambda services host running. Next we can start deploying lambda functions. Fission does a number of things for us, like scale out our services and keep a few containers warm for fast startup, and probably a bunch of stuff I won’t figure out until some O’Reilly book comes out (oh, I could just read the code…).
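
If you’re curious what that install actually spun up, you can poke around with kubectl. On my setup the Fission components (controller, router, etcd, etc.) landed in their own namespace – namespace and service names can vary by Fission version, so treat the grep below as an assumption:

$ kubectl get pods --all-namespaces | grep -i fission    # fission controller, router, etcd pods
$ kubectl get svc --all-namespaces | grep -i fission     # the NodePort services behind 31313/31314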

$ fission env create --name nodejs --image fission/node-env
$ curl https://raw.githubusercontent.com/fission/fission/master/examples/nodejs/hello.js > hello.js

$ fission function create --name hello --env nodejs --code hello.js
$ fission route create --method GET --url /hello --function hello

First, we create a fission environment that associates a fission environment container image with the name “nodejs”. Then we create a fission function by loading our hello.js lambda “code” into that environment. Here we are using javascript and node.js, but there are other execution environments available (and we can make our own!). Finally, we map a web service route to our fission function.


module.exports = async function(context) {
    return {
        status: 200,
        body: "Hello, World!\n"
    };
}
hello.js

You can see that a Fission lambda function is just a javascript function. In this case all it does is return a standard HTTP response.

$ curl http://$FISSION_ROUTER/hello
 ->  Hello, World!

Testing it out – we hit the URL with a GET request and tada!  Hello World!

This is quite an onion we’ve built, but hopefully you can appreciate how each layer adds to an architecture that would enable easy deployment at large scale and wide distribution down the road. Next up though, I personally want Ruby lambdas!

I could build a full native ruby fission environment (it should be easy enough to start with an existing RH or Docker ruby container). There is a python fission example that wouldn’t be hard to emulate. I’d have to decide on key gems to pre-load, and that leads to a big question of what I’d actually like to do and how big and fat that environment might get (which could be slow and bloated). Or we could try to stay very small – there have been small embeddable Rubies like mruby (although that one looks dead since 2015). There is also some interesting advice out there for building minimal ruby app containers.

While not actually Ruby, CoffeeScript – transpiling ruby-like CoffeeScript code to JavaScript – seems the easiest route at the moment, and it just uses the vanilla fission node.js environment we already have above. I could also see embedding “coffee” in a fission environment easily enough so that I could send CoffeeScript code directly to fission (although that would require transpiling on every lambda execution – it’s always a trade-off). To get started with coffee, add it to your local node.js environment (install Node first if you don’t already have it).

$ npm install -g coffee-script
$ coffee

Using coffee is easy enough. Learning it might take a bit of study, although if you like Ruby and only suffer when forced to work with native JavaScript, it’s well worth it.
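
To make that concrete, here’s a rough sketch of the CoffeeScript route. The function and route names are mine, and I’m assuming the node environment is happy with a plain (non-async) function returning the same {status, body} shape as hello.js above – so treat this as a sketch, not gospel:

$ cat > hello.coffee <<'EOF'
# same shape as the hello.js example above, just in CoffeeScript
module.exports = (context) ->
  status: 200
  body: "Hello, CoffeeScript World!\n"
EOF
$ coffee -c hello.coffee      # transpiles hello.coffee into hello.js
$ fission function create --name hellocoffee --env nodejs --code hello.js
$ fission route create --method GET --url /hellocoffee --function hellocoffee
$ curl http://$FISSION_ROUTER/hellocoffee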

But CoffeeScript is not ruby.  Something like Opal (transpiling full ruby syntax to js) is an even more interesting project, and if it was ever solid it could be implemented here with fission in a number of ways – possibly embedding it in a unique Opal ruby fission environment, statically applying it upstream from a node.js fission environment like with CoffeeScript, or even using it dynamically as a wrapper with ruby code sent to the node.js environment.

Another idea is to build a small native ruby fission solution with something like a nested ruby Sinatra design. First create a local “super-fission-sinatra” DSL that would deploy sinatra-like web service definition code to an embedded ruby/sinatra fission environment. Kind of meta-meta, but maybe an interesting way to build scalable, instrumented APIs.

All right – that’s enough for now. Time to play! Let me know if you create any Ruby Fission examples!

Playing with Neo4j version 3.0

I’ve been playing again with Neo4j now that v3 is out, and hacking through some ruby scripts to load some interesting data I have lying around (e.g. the database for this website, which I’m mainly modeling as “(posts)<-(tags); (posts:articles)<-(publisher)”).

For ruby hacking in the past I’ve used the Neology gem, but now I’m trying out the Neo4jrb set of gems. And though I think an OGM (object-graph mapper) is where it’s at (the next Rails app I build will no doubt be using some graph db), I’m starting with just neo4j-core to get a handle on graph concepts and Cypher.

One thing that stumped me for a bit is that with the latest version of these gems – maybe now that they support multiple Neo4j sessions – I found it helped to add a “default: true” parameter to the session “open” to keep everything downstream working at the neo4j-core level. Otherwise Node and other neo4j-core classes seemed to lose the current session and give a weird error (depending on scope?).  Or maybe I just kept clobbering my session context somehow. Anyway, it doesn’t seem to hurt.

require 'neo4j-core'

@_neo_session = nil

# Memoize a single session; the default: true option keeps Node and the other
# neo4j-core classes bound to this session downstream
def neo_session
  @_neo_session ||= Neo4j::Session.open(:server_db,
    'http://user:password@localhost:7474',
    default: true)
end
#...
neo_session
Neo4j::Node.create({title: "title"}, :Blog)  # create a node labeled :Blog
#...
Neo4j-core Session

The Neo4j v3 Mac OSX “desktop install” has removed terminal neo4j-shell access in favor of an updated, slick browser interface. The browser interface is pretty good, but for some things I’d still really like to have a terminal window command shell.  Maybe I’m just getting old :)… If you still want the neo4j shell, apparently you can instead install the linux tarball version (but then you don’t get the browser client?). I’m not sure why product managers make either-or packaging decisions like this. It’s not as if the shell was deprecated (e.g. to save much dev time or testing effort).

Anyway, things look pretty cool in the browser interface, and playing with Cypher is straightforward as you can change between table, text, and graph views of results with just a click.

I’ve also been wanting to play with Gephi more, so I’m exporting data from Neo4j (using .csv files, though, as the Gephi community neo4j importer plugin isn’t yet updated for Gephi v0.9) using Cypher statements like these and the browser interface’s download button.

#for the node table export -> Gephi import
MATCH (n) RETURN ID(n) AS Id, LABELS(n) AS Label, n.title AS Title, n.url AS URL, toString(n.date) AS Date, n.name AS Name, n.publisher AS Publisher

#for the edge table export -> Gephi import
MATCH (l)-[rel]->(r) RETURN ID(rel) AS Id, TYPE(rel) AS Label, ID(l) AS Source, ID(r) AS Target
Cypher Queries for Importing into Gephi

Data Stream Mining with Cube

Time-series data analysis can be approached in two ways. Traditionally, time-series data is aggregated into partitioned historical databases and then reported on at scheduled intervals. Commonly, reports delivered today cover data collected yesterday. A modern (and perhaps most relevant to Big Data) approach is to recognize that time-series data just “keeps coming”. And since the timeliest analysis could theoretically deliver the most value, visualizations should update as soon as the data streams in.

Square’s evolving Cube library (still an early version 0) enables web developers to easily deliver real-time charting of streaming time-series data on dynamic web pages:

Cube is an open-source system for visualizing time series data, built on MongoDB, Node and D3. If you send Cube timestamped events (with optional structured data), you can easily build realtime visualizations of aggregate metrics for internal dashboards.
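
Feeding it data is the easy part. Going from my (possibly stale) memory of the Cube docs, the collector listens on port 1080 and accepts batches of JSON events over HTTP – something along these lines, but double-check the endpoint and event shape against the current docs before relying on it:

$ curl -X POST -H "Content-Type: application/json" \
    -d '[{"type": "request", "time": "2011-09-12T21:33:12Z", "data": {"duration_ms": 42}}]' \
    http://localhost:1080/1.0/event/put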

I’ve spent a large chunk of my professional life working at IT system management vendors, each of whom spent significant resources to build and deliver proprietary event and time-series data analysis and visualization tools. In the last few years there have been successful open source discrete event monitoring and management tools (thresholds, alerts, etc.) that really disrupted the market for old-school proprietary event solutions.  Open source time-series solutions like Cube have similar potential to disrupt proprietary time-series analysis markets.

Time-Series Data Stream Mining

Real-time time-series visualization is fundamentally data stream mining – maybe not at Big Data scale, but there are certainly some hints about the future of Big Data stream mining in the way Cube is architected.

What is the Question?

The answer, I’m sure, is innovation.

Practically, the first thing to do is figure out the questions to ask. Don’t stick to just the questions that are already hanging out there needing to be answered; create new questions that you couldn’t answer before you had your Big Data. And don’t forget that the data you have isn’t limited to what’s in-house – you can find and mash up “tons” of public, government, and licensed data sets.

Data mining, just like data visualization, is as much art as science…

When You Have a Traditional Question, All Data Looks Traditional

Mine near Woodburn, Oregon

Old Mine - Image by OSU Special Collections & Archives via Flickr

Is the challenge simply to map and reduce the Big Data into smaller data so we can look at it the same way we always have? So we can support the same business processes, the same decision-making? Answer the same questions but at larger scale perhaps?

The real challenge is to think differently – to ask different questions that can only be answered by unlocking the Big Information spread over the Big Data. The whole process, from data gathering through mining, analysis, visualization, and presentation, needs to be designed to help create and answer these new and different questions.


Why Didn’t We Already Find What We’re Looking For?

building the data plotter

Image by !mz via Flickr

What we primarily look for in data is to make sense of it – find summaries and statistics to help inform analytical decision-making or discover patterns and stories creating new insights into the larger world behind the data.

This should all sound familiar if you are a Flowing Data blog fan like I am.  From author Nathan Yau in his book Visualize This – “Whatever you decide visualization is… you’re ultimately looking for the truth.” But the truth is hard to come by. Basically, numbers don’t lie; people do – either on purpose or through incompetence.

Most of us have probably read How to Lie with Statistics, but with Big Data the dangers are multiplied by magnitudes. Search for the truth and always try to tell the truth, but beware of people saying they have the big truth.

Big Data Visual Exploration

There are lots of tools to analyze and visualize non-Big Data (smaller data?).  But when we approach Big Data our options are, almost by definition, limited. In fact, most definitions of Big Data are couched in terms of the inability of current “smaller data” tools to handle it effectively.  What we do have currently is centered around map/reduce processing (see Hadoop) that essentially first makes smaller datasets for analysis (e.g. check out the free Infobright/Pentaho VM).

This map/reduce approach requiring low-level distributed programming isn’t well suited to serendipitous discovery by amateur data scientists, although there is ongoing work in this area (see Pig and Hive). There are also emerging companies specializing in automating the deep “data scientist” geekery to provide a “small data” exploration experience over Big Data sets (Opera Solutions, still stealthy Zillabyte?).

The real challenge is still that we don’t really know what we are looking for in Big Data sets before we find it – discovery more than answers to questions. And whatever it is, it probably wasn’t in the smaller data we have already made optimal use of (or not – most data goes unexamined even in non-big databases).


Big Data Analytics – Intelligence for Disruption

Taken together, the V-word characteristics of Big Data both identify and shape the kinds of innovative solutions that can be created from Big Data opportunities.  These solutions will tend to provide intelligence more than absolute truth.

Disruption is the Real Opportunity

Hurricane Irene Makes Landfall in North Carolina

Image by NASA Goddard Photo and Video via Flickr

It’s worth keeping in mind that adding Big Data Analysis to a current business isn’t the whole enchilada. Having better intelligence than the next guy is a great competitive advantage, but in itself it isn’t “disruptive.” The idea that Big Data will enable game-changing new business opportunities – not simply add insight into current processes or decision-support practices – is why Big Data Analysis is exciting.

Entrepreneurs who create new ways of doing business fueled by Big Data intelligence will dominate. The difference between improving a current business and innovative disruption comes down to looking for answers to new and different questions. Sounds easy enough, but that is truly difficult creative work.

Big Data Doesn’t Come with an Instruction Manual

Big Data sets don’t start with a schema model that defines the answers “findable” within them. It’s not just a huge BI warehouse. Rather, it takes a cunning mind and a dedicated soul to explore Big Data – for example, trying various map/reduce algorithms to find new patterns, and assembling new visualizations that discover new ways of looking and seeing.

This skilled data mining and keen perceptive ability must be fused with an entrepreneurial mindset that is always evaluating how any new big data intelligence could be formed into new and ultimately disruptive innovation.

Big Data Defined by the V’s

There are lots of definitions of Big Data. Most of them are fuzzy marketing speak along the lines of “Big Data is just bigger than your old data, too big to deal with the same way you dealt with data before.”  Amusingly, a lot of the examples given for “historical” Big Data successes are based on traditional data methods and technologies applied to overly large amounts of traditional data.

Data Represented in an Interactive 3-D Form

Image by Idaho National Laboratory via Flickr

Clearly there is something new happening with the way we can get value out of very large data sets, but it’s hard to see where the line really is between Big Data and not-so-Big Data. Ironically, most pundits seem to be saying we can spot Big Data the same way we know what’s obscene – we’d simply recognize it when we see it. The irony, of course, is that Big Data is just too big to see, or to visualize, as it is.

Think how big a picture it would take to show a 5 PB Big Data set at one pixel per data point.

Big Data by the V-words

I’ve read more than a few definitions that talk about some clever V-word characteristics that Big Data scientists need to be concerned with:

  1. Volume – Obviously Big Data is Big.
  2. Variety – Many identified Big Data sets are internally heterogeneous (e.g. big data documents).  The data isn’t collected or authored according to a single master schema.
  3. Velocity – Big Data sets tend to grow rapidly, even as we use them.  This implies some dynamic and possibly real-time behavior as well.

I’d add a fourth V:

  4. Veracity – Or rather, the lack thereof.  Raw Big Data is often neither verified nor validated (until processed specifically for that goal, e.g. security fraud detection). Analysis can’t always be duplicated (as the data keeps growing/changing). Duplication, omission, and general incompleteness are to be expected.

It may be impossible to repeat the same analysis definitively on a truly big “big data” set.  If results can’t be exactly reproduced (or explained back to raw data), they can’t serve as literal truth.


Nowhere To Hide

Your life so far has been a big data trail for someone else to mine.

Trail

Image by Xpectro via Flickr

Google took what were essentially crumbs of data left by millions (billions?) of people as they navigated around the internet, and compiled and analyzed them into an index of the relevance and popularity of any place you might want to visit.  As they compile more bits of information about you and your social circles and browsing history (and recommendations and…), your lifetime becomes laid bare to their ultimately commercial interests.

Privacy is being hotly debated in some circles, but most people are not even aware of what is at stake. For some, the world has evolved and we can no longer apply past expectations of privacy to the constructs and capabilities emerging today – the new world is a shared one. For others, any data associated with their personal identification is off-limits.

There is a huge new privacy conflict dead ahead.

It Is a Small (and connected) World

Despite bigger and bigger data, the world is a small place and it is full of people – increasingly networked people. I like Clay Shirky’s thinking in Here Comes Everybody about the new ways people online can gather and form loose communities whose effectiveness is multiplied by newfound freedoms and capabilities for distributed but coordinated group action. (Twitter doesn’t topple governments; people linked by Twitter do.)

In Cognitive Surplus he writes about the ability to harness huge untapped human potential. For example, the average Westernized civilization’s tuned-out TV time represents a significant amount of lost “cognition”. If it were possible to recover just a small percentage of that wasted human capital in the pursuit of just about anything, tremendous things could happen. Given the emerging abilities of internet societies to both encourage and allow everyone to contribute, we might be at the start of a tremendous acceleration in human achievement (e.g. see how online gamers solved an AIDS protein puzzle).

It Is a Small World After All

small world #5

Image by bass_nroll via Flickr

It is no longer news that companies can (and must) look for competitive advantage and innovative, even disruptive, opportunities in their “big data”. We are flooded daily with press releases about new big data technology, much of it designed to make the analysis and visualization of big data easier – even for the non-data scientist. You might even call 2011 the start of a renaissance for data visualization gurus and infographic artists.  (And we are seeing data mining history being rewritten to cast any past complex analysis victory as a win for “big data”.)

But not that much is being said about the human psychology around big data analysis. Maybe a few cautionary stories about ensuring good design and not intentionally lying with big data stats (the bigger the data, the bigger the potential lie…). And some advice that the career of the future is “data scientist,” conflicting with emerging technology marketing hype indicating we won’t really need them.

The world is changing for the people who live here but we talk mostly about gadgetry.
