Probabilistic Counting

At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is a challenge when the cardinality is large.

To better understand the challenge of determining the cardinality of large sets, let’s imagine that you have a 16-character ID and you’d like to count the number of distinct IDs that you’ve seen in your logs. Here is an example:

4f67bfc603106cb2

These 16 characters represent 128 bits. 65k IDs would require 1 megabyte of space. We receive over 3 billion events per day, and each event has an ID. Those IDs require 384,000,000,000 bits, or 45 gigabytes, of storage. And that is just the space the ID field requires! To get the cardinality of IDs in our daily events we could take a simplistic approach. The most straightforward idea is to use an in-memory hash set that contains the unique list of IDs seen in the input files. Even if we assume that only 1 in 3 records is unique, the hash set would still take 119 gigabytes of RAM, not including the overhead Java requires to store objects in memory. You would need a machine with several hundred gigabytes of memory to count distinct elements this way, and that is only to count a single day’s unique IDs. We certainly don’t have a single machine with several hundred gigabytes of free memory sitting around, so we need a better solution.
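
To make the naive approach concrete, here is a minimal sketch in plain Java. It assumes the IDs arrive one per line in a text file, and the class name and file handling are illustrative only; the point is that the heap footprint grows with every new unique ID:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Naive exact counting: every distinct ID is kept in memory, so memory use
// grows linearly with the number of unique IDs.
public class NaiveDistinctCount {
    public static void main(String[] args) throws IOException {
        Set<String> uniqueIds = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String id;
            while ((id = reader.readLine()) != null) {
                uniqueIds.add(id); // each new ID costs more heap
            }
        }
        System.out.println("Exact cardinality: " + uniqueIds.size());
    }
}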

One common approach to this problem is the use of bitmaps. Bitmaps can be used to quickly and accurately get the cardinality of a given input. While bitmaps drastically reduce the space requirements compared to the naive set implementation described above, they are still problematic when the cardinality is very high and/or you have a very large number of different sets to count. If you want to count to one hundred million using a bitmap you will need one hundred million bits, or roughly 12 megabytes, for each counter. Sparse bitmaps can be compressed to gain space efficiency, but our bitmaps are not sparse. It is also not uncommon for our jobs to contain hundreds or even thousands of counters, so we require a more space-efficient solution.
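
A small sketch of the bitmap idea using java.util.BitSet is below. It assumes the elements have already been mapped to integers in a known range, which real 128-bit IDs would not give you for free; each arriving ID sets its bit, and the number of set bits is the exact cardinality:

import java.util.BitSet;

// Bitmap counting: one bit per possible ID. Exact and far smaller than a hash
// set, but still about 12 MB per counter for an ID space of 100 million.
public class BitmapDistinctCount {
    private final BitSet bits;

    public BitmapDistinctCount(int idSpace) {
        this.bits = new BitSet(idSpace);
    }

    public void offer(int id) {
        bits.set(id); // idempotent: duplicates set no new bits
    }

    public int cardinality() {
        return bits.cardinality(); // number of set bits = number of distinct IDs
    }

    public static void main(String[] args) {
        BitmapDistinctCount counter = new BitmapDistinctCount(100_000_000);
        for (int id : new int[] {42, 42, 7, 99_999_999, 7}) {
            counter.offer(id);
        }
        System.out.println(counter.cardinality()); // prints 3
    }
}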

Luckily for us, cardinality estimation is a popular area of research. We’ve leveraged this research and implemented, and open sourced, several distributed probabilistic cardinality estimation algorithms. We will not go into the details of how these algorithms work in this post, but if you are interested you can find references to the relevant research papers in our GitHub project. In general these algorithms provide a space-efficient mechanism for estimating the cardinality of a set, but with less than perfect accuracy. They do this by representing the count as a set of bit fields and using a hashing function to determine the relevant bucket and bit for the element being added to the set. One algorithm, LogLog cardinality estimation, is amazingly space efficient: using LogLog you can count up to 100 million distinct elements with 4% error using just 640 bytes.
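
To give a feel for the bucket-and-bit idea without reproducing the papers, here is a toy LogLog-style estimator. This is not the stream-lib implementation; it assumes a simple MD5-based 64-bit hash, 2^k buckets, and the published asymptotic LogLog constant (about 0.39701), and it skips the small-range corrections a production implementation would need:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Toy LogLog-style estimator: hash each element, use the top k bits of the
// hash to pick a bucket, and remember the deepest "rank" (position of the
// first 1-bit in the remaining bits) seen in each bucket.
public class ToyLogLog {
    private static final double ALPHA = 0.39701; // asymptotic LogLog constant
    private final int k;          // log2 of the number of buckets
    private final int m;          // number of buckets
    private final byte[] buckets; // maximum rank observed per bucket

    public ToyLogLog(int k) {
        this.k = k;
        this.m = 1 << k;
        this.buckets = new byte[m];
    }

    public void offer(String element) {
        long hash = hash64(element);
        int bucket = (int) (hash >>> (64 - k)); // top k bits choose the bucket
        long rest = hash << k;                  // the remaining bits
        int rank = Long.numberOfLeadingZeros(rest) + 1;
        if (rank > buckets[bucket]) {
            buckets[bucket] = (byte) rank;
        }
    }

    public long cardinality() {
        double sum = 0;
        for (byte rank : buckets) {
            sum += rank;
        }
        // Estimate: alpha * m * 2^(average rank across buckets)
        return (long) (ALPHA * m * Math.pow(2.0, sum / m));
    }

    private static long hash64(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

With k = 10 this sketch keeps 1,024 one-byte buckets, about 1 kilobyte; the 640-byte figure quoted above corresponds to packing each of those 1,024 buckets into 5 bits rather than a full byte.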

So now that we can count the number of distinct elements in huge sets using probabilistic cardinality estimators, we have to solve another problem. Clearspring has close to 2 petabytes of storage spread across several hundred servers. We store partitions of our data on each server, so in order to get the cardinality of the full data set we need to combine the results from each of the servers. Let’s look at an example. Imagine that we have 3 servers and the table below represents the cardinality of the data on each of those servers:

Machine    Cardinality
1          1000
2          1500
3          2500

We cannot simply add up the cardinality from each server and say that our total cardinality is 5000. That would assume the data is partitioned in such a way that no ID appears on more than one machine. If that condition does not hold, as is often the case, summing the results would overestimate the cardinality. This is where the true power of cardinality estimators shines through. In addition to being space efficient, the cardinality estimators we use are also mergeable. This means that we can take the cardinality estimates, each represented as an array of bits, from multiple machines and merge them together to get a cardinality estimate for the global set.
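
Continuing the toy sketch from earlier, merging two estimators built with the same number of buckets is just a bucket-wise maximum, which gives the same buckets you would have gotten by streaming all of the underlying data through a single estimator. A merge method that could be added to the hypothetical ToyLogLog class above might look like this:

// Merging two ToyLogLog sketches built with the same k: take the bucket-wise
// maximum. The merged sketch estimates the cardinality of the union, so
// duplicate IDs that appear on several machines are not double counted.
public ToyLogLog merge(ToyLogLog other) {
    if (other.k != this.k) {
        throw new IllegalArgumentException("estimators use different bucket counts");
    }
    ToyLogLog merged = new ToyLogLog(k);
    for (int i = 0; i < m; i++) {
        merged.buckets[i] = (byte) Math.max(this.buckets[i], other.buckets[i]);
    }
    return merged;
}

Each server can then ship its small bucket array to a single machine, and calling cardinality() on the merged sketch gives the estimate for the global set.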

If estimating the cardinality of large sets is a problem you have, then we recommend using probabilistic counters, and if you happen to use Java, our stream-lib project is a great place to start.
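
A rough usage sketch is below. It assumes stream-lib’s HyperLogLog class and its offer/cardinality methods; check the project README for the current API and constructor parameters:

import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import com.clearspring.analytics.stream.cardinality.ICardinality;

public class StreamLibSketch {
    public static void main(String[] args) {
        // log2m = 12 -> 4,096 registers; larger values trade memory for accuracy
        ICardinality estimator = new HyperLogLog(12);
        estimator.offer("4f67bfc603106cb2");
        estimator.offer("4f67bfc603106cb2"); // duplicates do not change the estimate
        estimator.offer("4f67bfc603106cb3");
        System.out.println(estimator.cardinality()); // approximately 2
    }
}

The library also supports merging estimators, which is how the per-server sketches described above can be combined.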

Want to work on problems like these?  We are hiring.

  • http://www.quick-locksmith.com/ Nicholas

    I don’t know if I am just confused, but what exactly is this concept of cardinality used for?

  • Matt Abrams

    Nicholas, you can read more about the concept of cardinality here: http://en.wikipedia.org/wiki/Cardinality. The basic idea is that we want to count the number of unique elements in a set. If you take web access logs as an example, each log line will include the URL of the page that caused the access event. Imagine that you want to count the number of unique URLs in those access logs. You may have 100 lines in your log but only 5 unique URLs visited. So in this case the cardinality of unique URLs would be 5.

  • http://www.quick-locksmith.com/ Nicholas

    ahhh ic, that cleared things up a bit for me.

  • http://www.thatcouponblog.com Stephanie

    Well I learned something new today. I’m going to read some more about your stream-lib project.

  • http://www.ashwinjayaprakash.com/ Ashwin Jayaprakash

    How is this diff from a Bloom Filter?

  • devi

    as we are doing project on packet capturing please help me for program..

  • devi

    pls hlp how to program for packet capturing

  • http://www.3vbizsolutions.com/2013 Michael Lucy

    Curious as to how a post on cardinality ended up in the AddThis blog? How did you determine that a data set with a yearly volume 3 billion unique entries requires a 16 character data point? Was that part of the original math?

  • Matt Abrams

    Hi Michael. The data set is 3B a day not 3B a year. This is the data generated from our AddThis button hosted on 14M+ domains. The unique elements we are counting are UIDs. In our raw data feed we see 5B+ unique UIDs per month. This set is filtered to remove bots and other non-humans to get to our monthly unique user count.