I have a theory that buzzwords are generally helpful: they usher in new concepts before they end up as meaningless marketing fluff and, eventually, punchlines. I think this is happening right now with the term “big data”, just as it did with “Web 2.0” a few years back.
Remember the hype around “Web 2.0”? It was a buzzword that got run into the ground while 99% of people wondered exactly what the heck it was. Remember blog posts like this blast from the past?
In the end, Web 2.0 became an umbrella term for things like user-generated content, AJAX, REST APIs, and social networking. I suppose in a sense it did serve the purpose of alerting non-techie folks to the fact that some kind of shift in technology was happening, but the term itself became so nebulous that it eventually turned into a punchline:
That doesn’t negate the fact that the Web 2.0 craze brought several useful things to our collective attention. They simply went on to live on their own, taking off as more viable concepts around the time “Web 2.0” was winding down:
(In case you’re unfamiliar with it, jQuery is the most popular library for doing rich client AJAX-y Web 2.0 things)
I’m predicting the exact same trajectory for the term “Big Data”. The kicker came a few days ago: I used the term in a conversation, and the other person said, “Oh, I’ve heard a lot about that. What does it mean?” Talk about déjà vu.
So what does “Big Data” mean, anyway?
There are a few key concepts that have ended up beneath the “big data” umbrella, all of them important. Probably just as important as everything under the Web 2.0 umbrella, but not as easy for a non-techie to grasp.
What is “Big Data”?
- Horizontally scalable non-relational data stores. These are newly popular types of data stores (often grouped under the “NoSQL” label) that store data in a much simpler, flatter, non-relational manner, which allows data repositories to be scaled up by adding more servers, typically in on-demand computing clouds like Amazon’s. In the past, scaling up a relational database involved complex clustering configurations and replication. The drawback to these data stores is that they do very little for you as a programmer aside from providing a place to put your data. You have to spend much more time up front to use them because their schemas have to be pretty much hard-coded (and I do mean coded), and programming for them is not as simple as writing SQL queries (although this is slowly changing). Popular examples include Apache Cassandra (a wide-column store), MongoDB (a document store), and the new Amazon DynamoDB (a key-value store), but there are many others.
- Distributed data analysis. The ability to analyze data as it comes in, and to distribute that analysis across a cluster, is quite different from the traditional ETL process used by data warehouses. This is where Hadoop is getting popular, because it allows you to take each chunk of data you receive and send it to a cluster for detailed analysis. Breaking up complex queries and running them across a cluster is much more efficient than running them in a single process, and this becomes very important if you need to analyze very large amounts of data. (Hadoop is not the only game in town for distributed analysis. For example, I’ve been working with the Storm project that Nathan Marz of Twitter open-sourced, and I much prefer its architecture to Hadoop’s. There are others as well.)
- Data synergy and augmentation. This is something that I’ve been thinking about a lot lately. It is the idea that the more data you add to your stockpile, the more valuable your existing data becomes. If you have the means to combine and overlay multiple data sets so that they feed off one another, the value of your data pool as a whole grows far faster than its size, and the insights you can derive from it become much richer.
- An improved ability to recognize patterns. The ability to store multiple data sets in one place and use distributed processes to analyze the whole allows you to do some interesting things with pattern recognition. Often patterns and trends only emerge as you add more and more data sets to the pool, which is why the ability to add an exponentially growing amount of data to the equation is important in the first place. By teasing out the patterns in the cumulative data sets you begin to expose the real value in the data: the insights and revelations that weren’t possible before.
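
To make the first two ideas a bit more concrete, here is a toy sketch in plain Python. It is not how Hadoop, Storm, or any of the data stores above actually work; it only illustrates the general shape, assuming records are spread across shards by a hash of a key, each shard is analyzed on its own (the "map" step), and the partial results are merged (the "reduce" step). The event records, field names, and function names are all made up for illustration.

```python
from collections import Counter
from functools import reduce
from zlib import crc32

# Hypothetical event records; the field names are invented for this sketch.
EVENTS = [
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "view"},
    {"user": "alice", "action": "view"},
    {"user": "carol", "action": "click"},
    {"user": "bob", "action": "click"},
]


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard from a stable hash of the key (a stand-in for a real partitioner)."""
    return crc32(key.encode("utf-8")) % num_shards


def partition(events, num_shards):
    """Spread records across shards, roughly as a horizontally scaled store would."""
    shards = [[] for _ in range(num_shards)]
    for event in events:
        shards[shard_for(event["user"], num_shards)].append(event)
    return shards


def map_shard(shard):
    """Map step: each shard counts its own actions without seeing the others."""
    return Counter(event["action"] for event in shard)


def reduce_counts(partials):
    """Reduce step: merge the per-shard counts into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())


def count_actions(events, num_shards=3):
    """Partition the events, analyze each shard, then combine the results."""
    shards = partition(events, num_shards)
    return reduce_counts(map_shard(shard) for shard in shards)
```

Calling `count_actions(EVENTS)` gives the same totals no matter how many shards you ask for, which is the whole point: in a real cluster each `map_shard` call could run on a different machine, and adding machines would not change the answer, only how quickly you get it.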
Of course, this is just my attempt to pick apart an ambiguous buzzword, and I’m sure others will have other interpretations of it. Regardless, I’m pretty sure the term “big data” will break up into more specific concepts at some point, once the buzzword trajectory has run its course.










