Documents Are Data

Mike Matchett Small World September 6, 2011

A graphical depiction of a very simple XHTML ...

Image via Wikipedia

If you think of an HTML page as a structured “marked up” document, it’s basically a form of data. The structure, in this case represented by HTML tags like <a> and <div>, identifies the document’s elements, which can be interpreted as data fields. In fact, HTML written to XML’s stricter syntax rules (known as XHTML) is well-formed XML, a well-understood data format.
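To make the idea concrete, here is a minimal Python sketch, assuming a small well-formed XHTML fragment. The fragment, its URLs, and the function name extract_links are invented for illustration, and the XHTML namespace declaration is left out to keep the example short. It parses the markup with the standard library's xml.etree.ElementTree and treats each <a> tag as a record with two fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment, invented for this illustration.
# A real XHTML page would also declare the XHTML namespace on <html>.
SAMPLE_XHTML = """
<html>
  <body>
    <div id="links">
      <a href="http://example.com/one">First link</a>
      <a href="http://example.com/two">Second link</a>
    </div>
  </body>
</html>
"""


def extract_links(markup):
    """Return one dict per <a> element: the tags act as data fields."""
    root = ET.fromstring(markup.strip())
    return [
        {"href": a.get("href"), "text": (a.text or "").strip()}
        for a in root.iter("a")
    ]


links = extract_links(SAMPLE_XHTML)
for link in links:
    print(link["href"], "->", link["text"])
```

Nothing here is specific to links: any tag or attribute could be pulled out the same way, which is what makes the page usable as data.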

Think of “documents” as structured data where the structure is included in the document itself. The structure is free-form in the sense that the document author decides which data fields are included and how they are organized, ordered, nested, and related. If you think in object-oriented terms, you can also view each document as an object (for web pages, this view is formalized as the DOM, the “Document Object Model”).
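As a rough sketch of what "the structure is included in the document itself" means in practice, the Python below reads two made-up documents whose authors chose different fields and discovers each one's structure by walking its element tree. The documents and the helper name field_paths are assumptions for illustration, and the tree comes from Python's xml.etree.ElementTree, which is similar in spirit to the DOM but is not the W3C DOM API.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents whose authors chose different fields.
DOC_A = "<post><title>Hello</title><body><p>Hi there</p></body></post>"
DOC_B = "<post><title>Notes</title><tags><tag>data</tag><tag>web</tag></tags></post>"


def field_paths(element, prefix=""):
    """Walk the element tree and list the path of every node, in document order."""
    path = f"{prefix}/{element.tag}"
    paths = [path]
    for child in element:
        paths.extend(field_paths(child, path))
    return paths


paths_a = field_paths(ET.fromstring(DOC_A))
paths_b = field_paths(ET.fromstring(DOC_B))

print(paths_a)
print(paths_b)
```

No external schema is consulted at any point: the reader learns which fields exist, and how they nest, only by opening each document.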

Big Data in a Small World

Google arguably led the way into this brave new world with its proprietary “Bigtable” database architecture, which backs many of its services. Apache’s Hadoop project (which includes HDFS and HBase) is fundamentally based on the papers Google published describing its own systems.

Most big data sets in our small world are going to be produced by numerous (countless?) authors and applications. Most big data is going to be in the form of documents rather than standardized transactional data (i.e., data described by an external schema). Since most current data handling, storage, and analysis technology is aimed at transactional, schema-controlled data, I’m thrilled to explore today’s emerging market of new “big data” solutions.

