A Different Way of Thinking About Data

I think that at some point in the near future the tools we use to consume data are going to change completely from what we’ve grown accustomed to. They simply won’t be able to sift through the sheer volume of data that is being generated, which is only increasing. Eric Schmidt recently said that every two days we now create as much information as we did from the dawn of civilization up until 2003. Nobody’s ability to consume data has improved since 2003.

After thinking about this problem for a long time, I feel that the very concepts we use to think about data on the back end will change significantly. Instead of thinking about things like keys and indexes, we’ll be talking about things like similarity and focus.

Entities, as defined by “some data”, are effectively clusters of data that behave and are treated as one. They can be formed and ripped apart in real time as data becomes available or is invalidated by better data. Because of this, entities are effectively transient. However, instead of disappearing completely, their data is simply sucked through a wormhole to the data location of the entity they’re merging into. This allows the data to flow back to the original location in the future if needed.1
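A minimal sketch of that merge-and-return behavior, assuming each record simply remembers the location it came from. The `Entity` class and the `merge` and `split` helpers are hypothetical names made up for illustration, not part of any existing system.

```python
class Entity:
    """A cluster of records living at one data location (hypothetical model)."""

    def __init__(self, location, data):
        self.location = location
        # Each record remembers the location it originally came from.
        self.records = [(location, item) for item in data]


def merge(target, source):
    """Move source's records into target, keeping their original locations."""
    target.records.extend(source.records)
    source.records = []


def split(target, origin):
    """Pull the records that originated at `origin` back out into their own entity."""
    returned = [item for loc, item in target.records if loc == origin]
    target.records = [(loc, item) for loc, item in target.records if loc != origin]
    return Entity(origin, returned)


a = Entity((0, 0), ["a1"])
b = Entity((0, 1), ["b1", "b2"])
merge(a, b)                 # b's data flows into a
restored = split(a, (0, 1)) # and can flow back out later
```

Because the origin travels with each record, nothing is lost in the merge, and the split needs no extra bookkeeping.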

Data Locations are spots in n-dimensional data space where entities reside. Entities can be in the same location in some dimensions and in different locations in others, just as you might be standing in the exact place Napoleon once stood, only at a very different point in the time dimension. Or take two points on a cube which are at the exact same place in the 2D x,y dimensions but at very different places when viewed in three dimensions.
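The cube example can be sketched in a few lines. The coordinates and the `project` helper are assumptions chosen purely for illustration.

```python
from math import dist

# Two hypothetical corners of a unit cube, as (x, y, z) coordinates.
bottom = (1.0, 1.0, 0.0)
top = (1.0, 1.0, 1.0)


def project(point, dims):
    """Keep only the coordinates at the given dimension indexes."""
    return tuple(point[i] for i in dims)


# Same place when only x and y are considered...
distance_xy = dist(project(bottom, (0, 1)), project(top, (0, 1)))
# ...but clearly apart once z is included.
distance_xyz = dist(bottom, top)
```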

Similarity, then, determines how close two pieces of data are to each other, based on the “important” data dimensions (as defined by cardinality). Above a certain similarity threshold they are considered to belong to the same entity; below it they are considered two distinct entities. This is effectively measuring how close the two entities are in “enough dimensions to matter”. Which dimensions matter, and where the threshold sits, will differ from case to case depending on the data available.
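A rough sketch of a thresholded similarity check. The scoring formula (1 / (1 + distance)), the default threshold of 0.8, and the hand-picked `important_dims` are all assumptions for illustration; choosing the important dimensions by cardinality is left out here.

```python
from math import dist


def similarity(a, b, important_dims):
    """Score closeness over the chosen dimensions; 1.0 means identical there."""
    pa = [a[i] for i in important_dims]
    pb = [b[i] for i in important_dims]
    return 1.0 / (1.0 + dist(pa, pb))


def same_entity(a, b, important_dims, threshold=0.8):
    """Treat two pieces of data as one entity when they are similar enough."""
    return similarity(a, b, important_dims) >= threshold


first = (1.0, 1.0, 0.0)
second = (1.0, 1.0, 5.0)

merged_in_two_dims = same_entity(first, second, important_dims=(0, 1))
merged_in_three_dims = same_entity(first, second, important_dims=(0, 1, 2))
```

The same two points count as one entity when only the first two dimensions matter, and as two distinct entities when the third is included.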

Focus, then, is the act of looking at more or less data by loosening or tightening the filter parameters. If you’re looking at a point in data space, decreasing the similarity you require expands the result set. Increasing the similarity tightens your focus on a spot, reducing the amount of data returned.
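Focus can be sketched as a query with a similarity dial. The `focus` function, the sample points, and the 1 / (1 + distance) score are hypothetical, picked only to show the result set growing as the required similarity drops.

```python
from math import dist


def focus(points, center, min_similarity):
    """Return the points whose similarity to `center` is at least `min_similarity`."""
    return [p for p in points if 1.0 / (1.0 + dist(p, center)) >= min_similarity]


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (3.0, 4.0)]
center = (0.0, 0.0)

tight = focus(points, center, 0.9)  # high similarity: narrow focus
loose = focus(points, center, 0.4)  # lower similarity: wider focus
```

Everything in the tight result is also in the loose one, so widening the focus only ever adds data.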

This changes the inputs to all sorts of things: data integration points, data mining tasks, and alerting are all touched by this idea.

I know I’ve been veering into the deeply geeky here, but I find this stuff fascinating, sorry! I’ll leave you with one more Schmidt quote from the same talk:

“I spend most of my time assuming the world is not ready for the technology revolution that will be happening to them soon”

  1. This has some interesting implications for graph databases. If there are no solid entities as such, you’re indexing two different locations in n-dimensional space, not entities, which is not the way those databases operate at the moment.