At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is a challenge when the cardinality is large.

To better understand the challenge of determining the cardinality of large sets, let’s imagine that you have a 16-character ID and you’d like to count the number of distinct IDs that you’ve seen in your logs. Here is an example:

`4f67bfc603106cb2`

These 16 characters represent 128 bits. 65k IDs would require 1 megabyte of space. We receive over 3 billion events per day, and each event has an ID. Those IDs require 384,000,000,000 bits, or about 45 gigabytes, of storage. And that is just the space the ID field requires! To get the cardinality of IDs in our daily events we could take a simplistic approach. The most straightforward idea is to use an in-memory hash set that contains the unique list of IDs seen in the input files. Even if we assume that only 1 in 3 records is unique, the hash set would still take 119 gigs of RAM, not including the overhead Java requires to store objects in memory. You would need a machine with several hundred gigs of memory to count distinct elements this way, and that is only to count a single day’s unique IDs. We certainly don’t have a single machine with several hundred gigs of free memory sitting around, so we need a better solution.
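For illustration, the simplistic approach boils down to something like this minimal Python sketch (a toy illustration, not our production code):

```python
def exact_distinct(ids):
    """Naive exact distinct count: keep every unique ID in memory.

    Exact, but memory grows linearly with the number of distinct IDs,
    which is what makes this approach infeasible at billions of events.
    """
    seen = set()
    for item in ids:
        seen.add(item)
    return len(seen)
```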

One common approach to this problem is the use of bitmaps. Bitmaps can be used to quickly and accurately get the cardinality of a given input. While bitmaps drastically reduce the space requirements of the naive set implementation described above, they are still problematic when the cardinality is very high and/or you have a very large number of different sets to count. If we want to count to one hundred million using a bitmap, we will need one hundred million bits, or roughly 12 megabytes, for each counter. Sparse bitmaps can be compressed to gain space efficiency, but our bitmaps are not sparse. It is also not uncommon for our jobs to contain hundreds or even thousands of counters, so we require a more space-efficient solution.
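As a rough sketch of the bitmap approach (assuming IDs can be mapped to integers below a known capacity; this is illustrative Python, not our implementation):

```python
class BitmapCounter:
    """Exact distinct counter backed by a bitmap: one bit per possible ID.

    Counting to 100 million takes 100 million bits (~12 MB) no matter how
    many IDs are actually seen, which is the space problem described above.
    """
    def __init__(self, capacity):
        self.bits = bytearray((capacity + 7) // 8)  # one bit per ID slot

    def add(self, id_num):
        self.bits[id_num >> 3] |= 1 << (id_num & 7)  # set bit id_num

    def cardinality(self):
        # population count over the whole bitmap
        return sum(bin(b).count("1") for b in self.bits)
```

Adding the same ID twice sets the same bit twice, so duplicates are counted exactly once.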

Luckily for us, cardinality estimation is a popular area of research. We’ve leveraged this research and implemented, and open sourced, several distributed probabilistic cardinality estimation algorithms. We won’t go into the details of how these algorithms work in this post, but if you are interested you can find references to the relevant research papers in our GitHub project. In general these algorithms provide a space-efficient mechanism for estimating the cardinality of a set, at the cost of less-than-perfect accuracy. They do this by representing the count as a set of bit fields and using a hash function to determine the relevant bucket and bit for each element being added to the set. One algorithm, the LogLog cardinality estimator, is amazingly space efficient: using LogLog you can count up to 100 million distinct elements with 4% error using just 640 bytes.
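To make the idea concrete, here is a minimal LogLog sketch in Python (a toy illustration of the Durand–Flajolet algorithm, not the stream-lib implementation; the bucket count and bias constant below are standard textbook choices, not values from this post):

```python
import hashlib

class LogLog:
    """Minimal LogLog cardinality estimator (Durand-Flajolet), for illustration.

    Each element is hashed; the first k bits of the hash pick a bucket, and
    the bucket remembers the maximum "rank" (position of the leftmost 1-bit
    in the remaining bits) it has seen. Rare high ranks imply many distinct
    elements, and averaging across buckets smooths out the variance.
    """
    def __init__(self, bucket_bits=10):
        self.k = bucket_bits
        self.m = 1 << bucket_bits    # number of buckets (1024 by default)
        self.M = [0] * self.m        # max rank observed per bucket
        self.alpha = 0.39701         # asymptotic bias-correction constant

    def _hash64(self, value):
        # 64-bit hash from SHA-1 (deterministic, unlike Python's salted hash())
        digest = hashlib.sha1(str(value).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        h = self._hash64(value)
        bucket = h >> (64 - self.k)                 # first k bits pick the bucket
        rest = h & ((1 << (64 - self.k)) - 1)       # remaining 64-k bits
        rank = (64 - self.k) - rest.bit_length() + 1  # leftmost 1-bit position
        if rank > self.M[bucket]:
            self.M[bucket] = rank

    def cardinality(self):
        # estimate = alpha * m * 2^(mean of the per-bucket maxima)
        return self.alpha * self.m * 2 ** (sum(self.M) / self.m)
```

With 1,024 buckets of 5 bits each, the whole state is 640 bytes, and the standard error of LogLog, roughly 1.30/sqrt(m), works out to about 4%, matching the figures quoted above.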

So now that we can count the number of distinct elements in huge sets using probabilistic cardinality estimators, we have another problem to solve. Clearspring has close to 2 petabytes of storage spread across several hundred servers. We store partitions of our data on each server, so in order to get the cardinality of the full data set we need to combine the results from each of the servers. Let’s look at an example. Imagine that we have 3 servers and the table below represents the cardinality of the data on each of those servers:

| Machine | Cardinality |
|---------|-------------|
| 1       | 1000        |
| 2       | 1500        |
| 3       | 2500        |

We cannot simply add up the cardinality from each server and say that our total cardinality is 5000. That would assume that the data is partitioned in such a way that no ID appears on more than one machine. If that condition does not hold, as is often the case, summing the results would overestimate the cardinality. This is where the true power of cardinality estimators shines through. In addition to being space efficient, the cardinality estimators we use are also mergeable. This means that we can take the cardinality estimates, each represented as an array of bits, from multiple machines and merge them together to get a cardinality estimate for the global set.
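The merge itself is simple: for LogLog-style sketches, two registers built over different partitions with the same hash function and bucket count are combined by taking the element-wise maximum. A hedged sketch (the register arrays here stand in for the bit arrays mentioned above):

```python
def merge_registers(a, b):
    """Merge two LogLog-style register arrays built with the same hash
    function and bucket count: element-wise maximum.

    The merged array is identical to the one you would get by feeding both
    machines' raw data into a single estimator, so IDs that appear on more
    than one machine are never double-counted.
    """
    if len(a) != len(b):
        raise ValueError("sketches must use the same number of buckets")
    return [max(x, y) for x, y in zip(a, b)]
```

Because taking a maximum is associative and commutative, results from hundreds of servers can be merged in any order, which is what makes the estimators practical in a distributed setting.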

If estimating the cardinality of large sets is a problem you have, then we recommend using probabilistic counters, and if you happen to use Java then our stream-lib project is a great place to start.

Want to work on problems like these? We are hiring.
