When you start getting into multi-dimensional data analysis, you quickly run into something called the “Curse of Cardinality” (so yes, the post title is a geeky play on words). The gist of it is that the more possible values there are for something, the faster the index grows, until at some point you are storing far more data just to index the values than the original data itself. It’s why n-dimensional databases typically have hierarchies: they break those possible values down into digestible chunks.
This means it takes very little storage to index things like sex, state, height, or age, because there are relatively few possible values for each. Once you get into things that vary more (income, city, phone number, and so on), indexing that data becomes prohibitively expensive.
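To make the blow-up concrete, here’s a toy inverted index in Python, the structure most value-based indexes reduce to: one posting list per distinct value. The example columns are made up for illustration, but the shape of the problem is real: a two-value column needs two posting lists no matter how many rows there are, while a nearly-unique column needs roughly one list per row.

```python
from collections import defaultdict

def inverted_index(values):
    """Build a toy inverted index: one posting list of row ids
    per distinct value in the column."""
    index = defaultdict(list)
    for row_id, v in enumerate(values):
        index[v].append(row_id)
    return index

# Low-cardinality column: 1,000 rows, 2 distinct values -> 2 posting lists.
low = inverted_index(["M", "F"] * 500)

# High-cardinality column: 1,000 rows, all distinct -> 1,000 posting lists,
# so the index keys alone approach the size of the original column.
high = inverted_index(f"555-{i:04d}" for i in range(1000))

print(len(low), len(high))  # 2 1000
```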
This law, the curse of cardinality, came from computer science. It has very little to do with the underlying statistics, which don’t really care about cardinality at all. It’s an equipment-imposed restriction that is holding back all kinds of progress.
In fact, after looking at this problem for a while, I think cardinality is actually a very important piece of information that can be used to change the way data is worked with.
For example, suppose I want to know what a user finds interesting about an entity, say a person. We know the person’s name, city, state, income, height, hair color, and eye color. If the user indicates that she likes this person, which of those features does she actually like? This is a classic classification or recommendation problem.
Cardinality, in this case, is very helpful. If you know that income varies much more than state, you can assign more weight to it as you do your classification. Knowing that there is more variation in something makes it much more interesting, because that’s how you find the outliers. It’s virtually impossible to find outliers in, say, the state someone lives in, because there are only 50 choices. You can look at the unique factors around the thing you’re looking at and find more “like that”, leaving out the general, uninteresting stuff. This is the field of “similarity indexing”, which consists of a few research papers and whatever code is hidden away in skunkworks projects.1
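As a sketch of what cardinality-as-weight could look like, here is a minimal weighted similarity score in Python. The records, feature names, and log-based weighting are my own assumptions, not a published algorithm; the point is just that matching on a high-cardinality feature contributes more to the score than matching on a low-cardinality one.

```python
import math

def cardinality_weights(records, features):
    """Weight each feature by log2 of its number of distinct values,
    so high-variation features (income, city) count for more than
    low-variation ones (state, eye color)."""
    weights = {}
    for f in features:
        distinct = len({r[f] for r in records})
        weights[f] = math.log2(distinct) if distinct > 1 else 0.0
    return weights

def similarity(a, b, weights):
    """Sum the weights of the features on which two records agree:
    agreeing on a rare-valued feature is worth more."""
    return sum(w for f, w in weights.items() if a[f] == b[f])

# Hypothetical records: "state" has 2 distinct values, "city" has 4,
# so a city match outweighs a state match.
people = [
    {"state": "TX", "city": "Austin"},
    {"state": "TX", "city": "Dallas"},
    {"state": "CA", "city": "Fresno"},
    {"state": "CA", "city": "Irvine"},
]
w = cardinality_weights(people, ["state", "city"])
```

A real system would use something like inverse frequency per value rather than one weight per column, but the intuition is the same: variation is signal.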
All that to say: the data that varies the most is usually the most interesting, and the data you probably want to be paying attention to. Yet cardinality is typically treated as an obstacle to overcome rather than a potential asset, because the tools available today don’t actually let you use it for anything. The good news is, I think that is going to change. The maturation of some of the Big Data platforms is going to allow the types of solutions needed to get past this limiting mindset.
The nascent field of stream mining is doing some very interesting things with real-time stream sampling that allow this type of analysis to be done accurately and in real time.2 The problems are difficult and some of the solutions are still in the research stage, but those are also the most interesting kinds of problems.
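One concrete example from that research is the k-minimum-values (KMV) sketch, which estimates the number of distinct items in a stream in constant memory: hash each item to a uniform value in [0, 1), keep only the k smallest hashes, and infer the distinct count from how tightly packed they are. A rough Python sketch (class name and parameters are my own choices):

```python
import hashlib
import heapq

class KMVSketch:
    """K-minimum-values sketch: estimate the distinct count of a stream
    while storing only the k smallest hash values ever seen."""

    def __init__(self, k=256):
        self.k = k
        self.heap = []    # max-heap (values negated) of the k smallest hashes
        self.seen = set()  # hashes currently held in the heap

    def _hash(self, item):
        # Map the item to a (practically) uniform value in [0, 1).
        h = hashlib.sha1(str(item).encode()).hexdigest()
        return int(h, 16) / float(1 << 160)

    def add(self, item):
        x = self._hash(item)
        if x in self.seen:
            return  # duplicate of a retained item; no effect
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -x)
            self.seen.add(x)
        elif x < -self.heap[0]:
            # Smaller than the current k-th smallest: swap it in.
            self.seen.discard(-heapq.heappop(self.heap))
            heapq.heappush(self.heap, -x)
            self.seen.add(x)

    def estimate(self):
        if len(self.heap) < self.k:
            return len(self.heap)  # saw fewer than k distinct items: exact
        # For n distinct uniform hashes, the k-th smallest is about k/n,
        # so n is about (k - 1) / (k-th smallest hash).
        return int((self.k - 1) / -self.heap[0])
```

The relative error shrinks like 1/sqrt(k), so k = 256 gives estimates within a few percent while storing 256 numbers regardless of stream size, which is exactly the property that makes real-time cardinality work feasible.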
This kind of analysis, I’m convinced, is the future of data. It is all about recognizing what makes data interesting to you, and then giving you more of that.
- Fastbit is an implementation of some bleeding-edge ideas for getting around the Curse of Cardinality, but it doesn’t use cardinality to build a similarity index. Also, the license sucks. ↩
- One of the hardest things about working with cardinality is that you can’t work with the whole data set (it’s too big for real-time), and yet you want statistically valid results. This makes it very tricky to do things like estimate cardinality in real time, and those are the kinds of problems you have to solve if you actually want software that does this. ↩