MongoDB – Storing Big Data Documents

If a Big Data set (or a smaller one) is in the form of documents, then it’s difficult to store in a traditional schema-defined row and column database. Sure, you can create large blob fields to hold large arbitrary chunks of data, serialize and encode each document in some way, or just store the documents in a filesystem, but those options aren’t much good for querying or analysis when the data gets big.

MongoDB Document Database

MongoDB is a great example of a document database. There are no predefined schemas for tables (the schema is considered “dynamic”). Rather, you declare “collections” and insert or update documents directly in each collection.

A document in this case is basically JSON with some extensions (actually BSON – Binary encoded JSON). It supports nested arrays and other things that you wouldn’t find in a relational database. If you are object oriented, this is fundamentally an object store.

Documents added to a single collection can vary widely from each other in terms of content and composition/structure (although an application layer above could obviously enforce consistency as happens when MongoDB is used under Rails).
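To make that concrete, here is a minimal sketch in Python that models a collection as a plain list of dicts. It is not the MongoDB driver; the collection name “posts” and every field name below are made up for illustration.

```python
import json

# A "collection" modeled as a plain Python list (a stand-in, not a real MongoDB collection).
posts = []

# Two documents with quite different shapes can live side by side.
posts.append({
    "title": "Hello Mongo",
    "tags": ["nosql", "documents"],          # nested array
    "author": {"name": "Joe", "karma": 12},  # nested sub-document
})
posts.append({
    "title": "A photo post",
    "image_url": "http://example.com/cat.png",  # a field the first document lacks
})

# Each document serializes naturally to JSON (MongoDB itself stores the binary form, BSON).
for doc in posts:
    print(json.dumps(doc, sort_keys=True))

# The set of top-level fields differs from document to document.
field_sets = [set(doc) for doc in posts]
```

Nothing stops the second document from omitting “tags” and “author” entirely; any consistency has to come from the application layer.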

MongoDB’s list of key features is a fantastic mini-tutorial in itself:

  • Full Index support – You can create an index over any attribute with the same kind of advantages and constraints as a SQL database.
  • Document-based Queries – If a collection is named “users”, then instead of an SQL SELECT statement a MongoDB query might look like “db.users.find({name:/^Joe/})”. Yes, that’s a regexp right in the query (and it looks a lot like Rails ActiveRecord ORM).
    While you can’t “join” two tables together in Mongo for retrieval, you can query directly into nested JSON expressions within each document.  In practice, document databases are highly un-normalized!
  • Atomic Document-level updates/inserts and upserts – update or insert if not found.  (Does not support transactional consistency across documents.)
  • Built-in Map/Reduce (Note the difference here between MongoDB and CouchDB – Mongo provides SQL-like functionality for indexing/querying, as above, with map/reduce still available for “group-by” like functionality.)
  • Auto-sharding – Declare the sharding key much like an index and MongoDB manages the chunking and balancing of the shards.
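
The query and upsert ideas in that list can be imitated in a few lines of plain Python. This is only an in-memory sketch to show the semantics, not how MongoDB implements them; the helper names (get_path, find, upsert) and the sample data are hypothetical. In the mongo shell, the nested lookup would be written with dot notation, as in db.users.find({name: /^Joe/, "address.city": "Denver"}).

```python
import re

# Stand-in for a "users" collection (sample data invented for this sketch).
users = [
    {"name": "Joe Smith", "address": {"city": "Boston"}},
    {"name": "Joanna Lee", "address": {"city": "Denver"}},
    {"name": "Joe Brown", "address": {"city": "Denver"}},
]

def get_path(doc, dotted):
    """Follow a dotted path such as 'address.city' into nested dicts."""
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def find(collection, criteria):
    """Return documents where every criterion matches (plain value or compiled regex)."""
    results = []
    for doc in collection:
        matched = True
        for path, expected in criteria.items():
            actual = get_path(doc, path)
            if hasattr(expected, "search"):  # a compiled regular expression
                matched = isinstance(actual, str) and expected.search(actual) is not None
            else:
                matched = actual == expected
            if not matched:
                break
        if matched:
            results.append(doc)
    return results

def upsert(collection, criteria, changes):
    """Update the first match in place, or insert a new document if nothing matches.

    Simplification: assumes criteria are top-level fields with plain values.
    """
    matches = find(collection, criteria)
    if matches:
        matches[0].update(changes)
        return "updated"
    collection.append({**criteria, **changes})
    return "inserted"

joes = find(users, {"name": re.compile(r"^Joe")})
denver_joes = find(users, {"name": re.compile(r"^Joe"), "address.city": "Denver"})
first = upsert(users, {"name": "Ann Jones"}, {"visits": 1})   # no match yet, so insert
second = upsert(users, {"name": "Ann Jones"}, {"visits": 2})  # now matches, so update
```

The sketch scans every document on each query; a real index exists precisely to avoid that scan.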

MongoDB Schema Design

Wait just a minute.  I said MongoDB doesn’t use schemas!

MongoDB doesn’t use schemas to define tables in which columns are pre-defined. As far as Mongo is concerned you can put any document of any kind into any collection at any time. But as a database designer and document-oriented data modeler, you still need to think about the grand schema of things for what you are doing at the application level.

Since in a document database there is no concept of “join”, the cornerstone of relational databases, there are two options for relating documents – embedding and linking.

  • Embedding – SQL database wizards may have a hard time with this, but essentially each document can stand alone, with internal subdocuments/objects stored right inside it. Think of a blog post database with a nested comment “tree” stored right in each blog post document. Very not normal sometimes, but efficient to store and retrieve if you always work with that set of things together.
  • Linking – By emulating a foreign key in the document, the application can retrieve a linked document with a subsequent query.  This reduces duplication when content is associated with many documents.  However, the app is responsible for key management and for issuing the follow-up queries as necessary.
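
The two options can be sketched side by side with plain Python dicts. Again this is an illustration rather than driver code, and the field names (“comments”, “replies”, “author_id”) and the resolve_author helper are assumptions made up for the example.

```python
# Embedding: the whole comment tree lives inside the blog post document.
embedded_post = {
    "_id": 1,
    "title": "Why documents?",
    "comments": [
        {
            "author": "Ann",
            "text": "Nice post",
            "replies": [{"author": "Joe", "text": "Agreed", "replies": []}],
        },
    ],
}

def count_comments(comments):
    """Count every comment in a nested reply tree."""
    return sum(1 + count_comments(c["replies"]) for c in comments)

# Linking: the post stores only an author id, emulating a foreign key.
authors = {10: {"_id": 10, "name": "Joe"}}  # stand-in "authors" collection keyed by _id
linked_post = {"_id": 2, "title": "Linked", "author_id": 10}

def resolve_author(post, authors_by_id):
    """The follow-up 'query': fetch the linked author document by its id."""
    return authors_by_id.get(post["author_id"])

total = count_comments(embedded_post["comments"])  # one read gets the post and its comments
author = resolve_author(linked_post, authors)      # a second lookup done by the application
```

With embedding, one fetch returns everything needed to render the post. With linking, the author’s name is stored once, but the application has to make the extra lookup and cope with a missing id itself.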

Best of many DB Worlds

Any correlated structured metadata can live (or also live) in a corresponding schema-structured traditional database. Many data environments deploy multiple kinds of databases together to create a “mashup” of sorts – leveraging the best type of database for each set of data.

Look for more applications and architectures built on the database type that makes the most sense, rather than force-fit into a relational model. Even though there is a lot of inertia and investment in relational databases and SQL admins, Big Data will drive adoption of NoSQL solutions faster than most might expect. And once something like MongoDB is understood well within an organization, it will spread back to non-Big Data applications that would also be better served by a document database.

The same can be said about column-oriented, grid, and “wide-column” database alternatives; it is just that I think the biggest changes will happen with document/object stores first.