Your life so far has been a big data trail for someone else to mine.
Google took what were essentially crumbs of data left by millions (billions?) of people as they navigated around the internet, then compiled and analyzed them into an index of how relevant and popular any place you might want to visit is. As they compile more bits of information about you, your social circles, and your browsing history (and recommendations and…), your life is laid bare to their ultimately commercial interest.
Privacy is being hotly debated in some circles, but most people are not even aware of what is at stake. For some, the world has evolved and we can no longer apply past expectations of privacy to the constructs and capabilities emerging today – the new world is a shared one. For others, any data associated with their personal identification is off-limits.
There is a huge new privacy conflict dead ahead.
Despite bigger and bigger data, the world is a small place and it is full of people. Increasingly networked people. I like Clay Shirky’s thinking in Here Comes Everybody about new ways people online can gather and form loose communities whose effectiveness is multiplied by newfound freedoms and capabilities for distributed but coordinated group action. (Twitter doesn’t topple governments, people linked by Twitter do.)
In Cognitive Surplus he writes about the ability to harness huge untapped human potential. For example, the hours that the average Westernized society spends tuned out in front of the TV represent a significant amount of lost “cognition”. If it were possible to recover just a small percentage of that wasted human capital in the pursuit of just about anything, tremendous things could happen. Given the emerging abilities of internet societies to both encourage and allow everyone to contribute, we might be at the start of a tremendous acceleration in human achievement (e.g. see how online gamers solved an AIDS protein puzzle).
It is no longer news that companies can (and must) look for competitive advantage and innovative, even disruptive, opportunities in their “big data”. We are flooded daily with press releases about new big data technology, much of it designed to make the analysis and visualization of big data easier – even for the non-data scientist. You might even call 2011 the start of a renaissance for data visualization gurus and infographic artists. (And we are seeing data mining history being rewritten to cast any past complex analysis victory as a win for “big data”.)
But not much is being said about the human psychology around big data analysis. Maybe a few cautionary stories about ensuring good design and not intentionally lying with big data stats (the bigger the data, the bigger the potential lie…). And some advice that the career of the future is “data scientist,” which conflicts with the marketing hype around emerging technology suggesting we won’t really need them.
The world is changing for the people who live here but we talk mostly about gadgetry.
Like anything that changes our mental paradigm, D3 takes a bit of noodling to wrap your head around. It is similar to basic jQuery, with the twist that you can attach data to arbitrary DOM elements, add to and transform that data, and then use it to drive the visualization and behavior of the DOM dynamically.
D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then apply data-driven transformations to the document. As a trivial example, you can use D3 to generate a basic HTML table from an array of numbers. Or use the same data to create an interactive SVG bar chart with smooth transitions.
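To make the idea concrete without pulling in D3 itself, here is a minimal sketch in plain Python. It is not D3 and does not touch a live DOM; it only illustrates the principle that one array of numbers can drive both an HTML table and an SVG bar chart, with each bar's position and height computed from the data. The function names, the sample numbers, and the sizing parameters are all made up for illustration, and the chart function assumes a non-empty list of positive numbers.

```python
from html import escape


def render_table(values):
    """Return an HTML table with one row per data value."""
    rows = "".join(f"<tr><td>{escape(str(v))}</td></tr>" for v in values)
    return f"<table>{rows}</table>"


def render_bar_chart(values, bar_width=20, max_height=100):
    """Return an SVG bar chart whose bar geometry is computed from the data.

    Assumes a non-empty list of positive numbers (illustrative only).
    """
    peak = max(values)
    bars = []
    for i, v in enumerate(values):
        # Each attribute is a function of the datum (v) and its index (i),
        # rather than a hard-coded value.
        height = round(v / peak * max_height)
        x = i * bar_width
        y = max_height - height  # SVG's y axis points down
        bars.append(
            f'<rect x="{x}" y="{y}" width="{bar_width - 2}" height="{height}"/>'
        )
    width = len(values) * bar_width
    return f'<svg width="{width}" height="{max_height}">' + "".join(bars) + "</svg>"


# Hypothetical sample data: the same array feeds both renderings.
data = [4, 8, 15, 16, 23, 42]
table_html = render_table(data)
chart_svg = render_bar_chart(data)
```

Changing the numbers in data changes both outputs, which is the whole point: the markup is derived from the data instead of being written by hand. D3 applies the same principle to elements in a live page and adds transitions and interaction on top.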
Data Driven Functions
There are some clever things to be done in just a few lines of code when you use D3 to map what might normally be static attributes of your CSS/HTML/SVG (or other DOM elements) to data-driven functions.
If a Big Data set (or smaller data) is in the form of documents, then it’s difficult to store them in a traditional schema-defined row and column database. Sure, you can create large blob fields to hold large arbitrary chunks of data, serialize and encode the document in some way, or just store them in a filesystem, but those options aren’t much good for querying or analysis when the data gets big.
MongoDB Document Database
MongoDB is a great example of a document database. There are no predefined schemas for tables (the schema is considered “dynamic”). Rather you declare “collections” and insert or update documents directly into each collection.
A document in this case is basically JSON with some extensions (actually BSON – binary-encoded JSON). It supports nested arrays and other things that you wouldn’t find in a relational database. If you are object-oriented, this is fundamentally an object store.
Documents added to a single collection can vary widely from each other in terms of content and composition/structure (although an application layer above could obviously enforce consistency as happens when MongoDB is used under Rails).
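Here is a rough sketch of what that looks like, using nothing but plain Python dictionaries. It does not talk to MongoDB at all: the list named posts merely stands in for a collection, and the find and find_with_tag helpers are hypothetical names invented for this example, not MongoDB's API. The point is only that two documents in the same collection can have different shapes, can nest arrays and sub-documents, and can still be queried by field.

```python
import json

# A hypothetical "collection": an ordinary list standing in for a
# MongoDB collection. No schema is declared anywhere.
posts = []

# First document: has tags and a nested array of comment sub-documents.
posts.append({
    "title": "Big Data in a Small World",
    "tags": ["big data", "documents"],
    "comments": [{"author": "alice", "text": "Nice post"}],
})

# Second document: a different shape entirely, with a nested sub-document
# and no tags or comments.
posts.append({
    "title": "Data Driven Functions",
    "library": {"name": "D3", "language": "JavaScript"},
})


def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


def find_with_tag(collection, tag):
    """Return documents whose 'tags' array contains the given tag."""
    return [doc for doc in collection if tag in doc.get("tags", [])]


matches = find(posts, title="Data Driven Functions")
tagged = find_with_tag(posts, "documents")

# Each document serializes directly to JSON, nesting included.
serialized = json.dumps(posts[0])
```

Documents that lack a field are simply skipped by the query rather than causing an error, which is roughly the behavior you want when a collection holds documents of varying structure.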
MongoDB’s list of key features is a fantastic mini-tutorial in itself.
If you think of an HTML page as a structured “marked up” document, it’s basically a form of data. The structure, in this case represented by HTML tags like <a> and <div>, identifies various document elements which can be interpreted as data fields. In fact, HTML written to strict XML rules (known as XHTML) is well-formed XML, a well-understood data format.
Think of “documents” as structured data where the structure is included in the document itself. The structure is free-form in the sense that the document author decides what data fields are included, how they are organized, ordered, nested, related, and so on. If you are object-oriented, you can also view each document as an object (technically for web pages this is referred to as the DOM – “Document Object Model”).
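As a small illustration of treating markup as data, the sketch below uses the HTMLParser class from Python's standard library to pull every <a> element out of a page fragment and turn it into a record with an href field and a text field. The LinkExtractor class name, the sample fragment, and its URLs are invented for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect each <a> element as a small record: its href and its text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the link being read, if we are inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"href": dict(attrs).get("href"), "text": ""}

    def handle_data(self, data):
        # Text nested inside other tags within the link is still captured.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


# Hypothetical page fragment; the URLs are placeholders.
page = (
    '<div><a href="/d3">D3 notes</a> and '
    '<a href="/mongodb">MongoDB <b>notes</b></a></div>'
)

parser = LinkExtractor()
parser.feed(page)
parser.close()
links = parser.links
```

The tags carry the structure, so the "schema" for these records came from the document itself rather than from an external definition.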
Big Data in a Small World
Google perhaps led the way into this brave new world with their proprietary “Bigtable” database architecture backing many of their services. Apache’s Hadoop project (includes HDFS and HBase) is fundamentally based on Google’s published papers.
Most big data sets in our small world are going to be produced by numerous (countless?) authors and applications. Most big data is going to be in the form of documents rather than standardized (i.e. described by an external schema) transactional data. Since most current data handling, storage, and analysis technology is aimed at transactional, schema-controlled data, I’m thrilled to explore today’s emerging market of new “big data” solutions.