Category Archives: Developers

Processing Our Data

Currently taking place on the Clearspring blog is a four part series about how our team processes tens of billions of unique, new data points on a daily basis.

In storage terms, we ingest 4 to 5 TB of new data each day; that could easily double in 12-18 months. That data must be processed as live streams, re-processed from the archive as new algorithms are developed and able to be queried in flexible ways. An interesting mix of batch, live and hybrid jobs are employed.

So head on over to the Clearspring blog to learn all about the wizardry!

Clearspring’s Big Data Architecture, Part 1

This is the first post in a three part series that will describe the data analytics architecture Clearspring has created to process large amounts of semi-structured data.  Check out post 2 and post 3 when you’re done reading this post.

On a daily basis, the Clearspring network yields tens of billions of new, unique data points. In storage terms, we ingest 4 to 5 TB of new data each day; that could easily double in 12-18 months. That data must be processed as live streams, re-processed from the archive as new algorithms are developed and able to be queried in flexible ways. An interesting mix of batch, live and hybrid jobs are employed. To keep pace with a rapidly expanding network, the architecture must not only efficiently handle the current set of tasks, but it must also enable significant growth while guarding against hardware and software faults. To satisfy these needs, we built an entire software suite that leverages open source technologies and many in-house software components.


Desired System Characteristics and Challenges

  1. Efficient and reliable storage of large amounts of data on unreliable machines in a distributed system with fast writes, data partitioning and replication, and data streaming capabilities.
  2. Knowledge discovery.  Our system consumes a large amount of raw data that in and of itself is not very interesting so we need to be able to extract the interesting information by analyzing large streams of data from multiple sources.
  3. Management of distributed resources.  We wanted the system to distribute jobs in the cluster, determine when it is safe to execute the job, report failures, and provide a nice GUI to visualize the current state of the cluster.
  4. Speed.  The small things add up when you do them over a billion times.  For example, if we have 2.5 billion messages and we shave 1 millisecond (.001 seconds) off the time it takes to process a single event we will have saved nearly 29 days of computation time.  In our case, milliseconds matter!

There are many systems available that provide some of the characteristics listed above, including  Enterprise SANsHDFSOpen Stack (Swift),CouchDBRedisElephantDBVoldemortHadoopCassandra, and others.   While we do use some of these projects for various components of our architecture, as an example we use Cassandra for our distributed counting system that keeps track of how many times a URL is shared and serves over 200M view requests daily, the primary system we use for our distributed processing needs is a project we call Hydra.



The Hydra system we created internally has four main functional areas:
  1. Capability to partition, store, and stream data in a distributed system.
  2. Software that takes an input data stream, processes that data (filtering, counting, membership detection, etc) and then stores the batch output in Berkley DB JE which uses sequential writes and indexes the batch output for fast reads.
  3. A set of software modules that enable ad-hoc queries against tree-based data structures.
  4. Command and control subsystem responsible for monitoring and executing jobs in the cluster.


Streaming and Splitting Data

We have too much raw data to store on a single machine and we were not interested in purchasing an enterprise SAN capable of storing petabytes of data.  While we looked at several distributed file systems like HDFS, GlusterFS, and others we ended up going with a standard ext3 filesystem and implemented our own distribution and replication framework on top of that standard file system. There are two standard modules, Splitters and Stream Servers.


The Splitter consumes an input stream from a data source.  The data source could be a flat file on the local system, an input stream sourced by a Stream Server, a firehose of data exposed via an API, or any other source you can think of.  The format of the source data from the stream is described to the splitter (in the near future we will be using Zookeeper to store stream formatting information).  Each splitter consumes a sub-set of the data based on a hash function. The Splitter then uses the stream format information and the job configuration script (usually a JSON file) to filter, transform, and create data elements which will be stored in n partitions on the local file system.  After a checkpoint is reached (all available input data has been consumed or some time limit has been reached), the files the Splitter stores are made available to other processes in the cluster via the Stream Servers.

Splitter processes are responsible for ensuring that they do not reprocess data that has already been consumed.  They do this by storing meta data in their local job directory that indicates which files they’ve consumed and an offset indicating how far into those files the process has read.  Because we only append to log files and never modify existing content the splitter process doesn’t need to re-read data it has already processed when more data is appended to a file.  We feel having each client track its own state is a better approach than having the StreamServer attempt to track the current state of every consumer in the cluster.

This simple setup can accomplish some very powerful things beyond the obvious capability of storing and partitioning raw data to a local node in the cluster.  One useful function the Splitter provides is recombining many small files into a smaller number of time sorted files.  This can be accomplished by having n splitter nodes consume from m sources where n < m.  All of the data from the data sources will be consumed but they will be persisted into a smaller number of files than originally required.

Stream Server

Stream Servers provide access to the files stored on one node to any other node in the cluster. The Stream Server transmits raw bytes to clients because they may be in a compressed format. This limits the the number of times the data in the file is decoded and  reduces the total amount of data transmitted over the network. This approach does prevent the Stream Server from applying fine grained filters to the individual log lines. However we think that this limitation is acceptable given CPU cycles saved by only decoding the file once on the client side and the massive reduction in network IO.

Requests for data come into the Stream Server in the form of regular expressions which describe which files the client would like to receive.  Here is an example of what a request might look like:



The snippet above would tell the Stream Server to find all files in any directory under ‘job/1/*/live/split/us’ that has a file that matches the regular expression defined in the match field.  The date format will be replaced at run-time with the set of dates the client is interested in and the mod variable will be replaced with the modulus for the client.  This approach enables a client to stream data for a specific partition and time range without requiring complex filtering on the server side.



This was part one in a series of posts where we will explain how Clearspring processes and stores multiple terabytes of data daily.  In upcoming articles we will explore how and why we use a tree based data structure, our distributed query system, and the command and control infrastructure that binds it all together.  Check out post 2 and post 3 to learn more.  Subscribe to our RSS feed or follow us on twitter to stay informed.


[BUG] The Facebook Like Button Scrollbar Issue

You may have noticed your Facebook Like buttons have a little something extra today: A scroll bar.

Facebook Like Button with Scroll Bar

Turns out Facebook changed the way they’re displaying the Like button and since we just pass it along – with a little special sauce to capture them for our analytics – it’s causing a problem for our users.

Luckily, we’ve got a couple solutions for this until we can put out an official fix, which should be soon.

First, you can make sure you have Facebooks FBML namespace in your <html> tag. For example:

<html xmlns:fb="">

You might have other namespaces in the HTML tag. Just make sure everything it includes xmlns:fb=”” before the greater-than sign.

Second, you can alter the code for the AddThis Facebook like button to specify the appropriate height by adding an fb:like:height attribute, like this:

<a class="addthis_button_facebook_like" fb:like:height="25">

If you have any questions regarding this issue please post them in the forum topic or feel free to email us at

Thanks in advance for your patience, and thanks for using AddThis!

New Open Source Stream Summarizing Java Library

Here at Clearspring we like to count interesting things. How many users have visited a site? How many unique URLs are shared every hour? It turns out that counting is a non-trivial problem when there are several billion things to count. And counting is only the first step. What about the frequency of the things you are counting? Maintaining a complete multiset — with billions of elements indexed by multiplicity — for each dimension of interest is rarely practical. So, we’ve developed a set of utilities to help make counting to a billion easy.

Today we are pleased to release those utilites under an open-source license. Stream Lib is a Java library for summarizing streams of data. Included are classes for estimating:

  • Cardinality (aka “counting things”): Instead of storing an entire set of elements it is possible to instead construct a compact binary object that will provide a tunable estimate to how many distinct elements have been seen. This reduces memory requirements by orders of magnitude. There is a significant body of academic literature on approaches to this problem. We have tried to provide useful implementations of those ideas.
  • Set membership: Bloom filters provide a space efficient way to test for set membership. They have the useful property that their are no false negatives, only false positives within whatever bounds you specify. The wikipedia page has a good section on the interesting variants that have been developed over the recent years. We have adapted Apache Cassandra’s well tested implementation for standalone use.
  • Top-k elements: While counting the number of distinct elements with a cardinality estimator is cool, sometimes (well often) you also want to know something about those elements (such as the most frequent ones). We have some early work in this area (a stochastic topper).

There is a Readme to get your started with the code. We hope that others find this as useful as we have. Feedback, comments, patches, bugs, and forks are all welcome.

Think this is cool? Apply for a job to join the team!

New Data Feeds and Analytics API

I’m excited to announce two new ways to put your sharing data to work for you.

First up, our new Content Feeds let you access RSS and JSON feeds of your trending, most shared and most clicked content. Integrate these feeds into your web pages to surface your top content to users, keeping them engaged and keeping them on your site. Once enabled, these feeds can be publicly accessed with no rate limits, making them a great complement to our Analytics API. The options will let you customize your feed for a particular time period or domain, or get service-specific content like “most shared to Facebook”.

Second, we’ve expanded our Analytics API to serve up audience data on your sharers, clickers and influencers and insights into their interests like Travel, Sports and Finance. These new metrics help you answer interesting questions like “what are my influencers interest in” and “which sharing service is sending the most clickers back to my site”. You can also access data on which search terms and referring domains generated the most shares on your site. This update brings the Analytics API up to speed with the audience insights reports we released in December.

We hope you enjoy these new analytics features, we have more on the way. If you come up with some cool use cases, drop us a note in the comments.

AddThis for Developers

We recently launched an AddThis for Developers page to encourage coders to get even more out of our sharing tools. In particular, we wanted to highlight our APIs and make them more discoverable:

Services API
The Services API provides a list of our supported sharing services, along with their icons and service meta-data in JSON, JSONP and XML formats.

Client API
The Client API is a client-side JavaScript API that allows you to customize the appearance and behavior of AddThis tools on your pages – with analytics.

Analytics API
The Analytics API is a set of web services that give you programmatic access to your analytics data. Integrate our metrics into your own tools and work-flow.

Also, if you are a developer who has created something awesome using the AddThis APIs, like the user who developed an AddThis plug-in for Opera, please send it our way and we’d love to feature it.

AddThis Developer Turns Into Speaking Machine

I’m excited to announce that I’m going to be giving three speeches in the week on completely different ends of the country. Some might call that crazy, but I’m going to call it a good time. Come say hi at either of these events and ask me any questions you have about AddThis and how to get the most out of our analytics or increase sharing on your site.

Later today I’ll be taking part in a panel discussion with the Web Content Mavens here in DC about WordPress and Drupal. Then this weekend I’ll be at WordCamp Phoenix where I’ll be giving two presentations. The first one, during the lightning talk session on Saturday, is about improving communication between developers and non-developers. I’ll then be speaking and helping run the dev day on Sunday. If you’re at either of these events, I’d love for you to come by and say hi. I’ll have some Clearspring stickers with me as well, all you have to do is ask for one.

Oh, and if you’re a WordPress user, you might want to stay tuned to this blog and our twitter stream . We’re going to have some exciting news for you very soon!

Using HTML 5 to Make the Analytics Charts

What happened when you attempted to access the old analytics on an iPhone
What happened when you attempted to access the old analytics on an iPhone.

When we decided to redesign and update our analytics, one of the first things we looked at was updating our charts. Our old charts were Flash-based and as such didn’t work on iPhones or iPads and tended to load a little slow. We wanted to fix that with this release.

I started by looking a wide range of options. The number of options for charts is staggering. In the end, we decided to implement a mixture of Flot for the daily line charts you see at the top of most pages, CSS for our bar charts, and Google Charts for our maps. Let’s see how we specifically implemented each of these.
Continue reading

Best Practices Guide for OExchange Targets

OExchange has a new Best Practices Guide for service providers, it has some suggestions for avoiding common UX and functional bugs we’ve seen on sharing sites over the years:

  • Welcome new users on your Offer page
  • Preserve URL parameters through the login process
  • Check your URL encoding and decoding
  • Don’t resize the browser window

These are great tips for any sharing site, they’ll improve the user experience and make it easier to share.  The guide also shows you how to you test your site using the OExchange tools, check it out:

We’re going to roll out improved Service Directory support for OExchange-compatible services in the next month or so, stay tuned!