Big Data Part I: Sailing Past The Clouds

This is the first in a three-part series on the processing and infrastructure of AddThis. Be sure to check out part two and part three.

Sailing Past the Clouds

The emergence and maturation of multi-tenant, virtualized, commodity computing (“the cloud”) has enabled entirely new business models by bringing large scale distributed computing to an affordable price point.  But by designing systems flexible enough to function well in most scenarios while tackling no specific problem, cloud vendors have made architectural trade-offs that are particularly challenging for us.  Over the last five years, we’ve built a lower cost, more effective, more scalable solution that is closely tailored to the needs of a large-scale-data, low-latency business like ours.  And in doing so we have gained control over performance, predictability, and price.

A Little History

AddThis’ success in the Widget business, starting in 2006 / 2007, catapulted us directly into scale problems that the early clouds of that era were not ready to handle.  At that time, our platform was providing real-time analytics for hundreds of thousands of widgets being viewed hundreds of millions of times a day.  It doesn’t seem like much compared to the billions of events we’re tracking today, but it was on par with the social giant of that time: MySpace.

Early decisions, which were formative to our thinking, were based on pragmatism.  We were small, and the price of upgrading databases to the scale we needed was out of the question.  The performance of scripted / hosted commodity web stacks could not predictably meet our needs.  Before long, we were forced to replace our SQL databases with a much faster (and simpler) in-house developed noSQL store just to keep up with massive data growth (the term noSQL, like Hadoop itself, didn’t exist yet).  We found we could get better scale, with operational safety and huge cost savings, by architecting around a few key learnings:

  • Databases can be fast, but network connections and serialization are slow
  • Transactional safety is untenable overhead for real-time anything
  • In-process local data CRUD is at least 1000 times faster than the same operations over the network
  • Have hot spares/replicas, stagger checkpoints
  • Backup only logs, not databases
  • Design around replay from logs

Continue to Part II: AddThis Today