This is the last in a three part series on the processing and infrastructure of AddThis. Make sure to check out part one and part two.
In the end, our decisions rest on cost and control. For our uses, CPU cycles in the cloud are about 3x more expensive than the fully loaded cost of dedicated cycles in our own data center. Data stored in the cloud is at least 10x more expensive than in our own clusters. Even if data storage were free, we would still be paying for CPU cycles to process the data and network costs to expose it. But the killer is still latency and IO bottlenecks. At our scale, there aren’t meaningful pricing options to overcome these in the cloud.
As both our business and cloud offerings have matured, we would still make the same choices around internal build-out vs shared cloud. If there is one reason that drives this decision, it’s the fact that we have a business in which our needs can be modeled and anticipated. And with the ability to predict comes the ability to optimize resources for both cost and performance. With our own build-outs, we are able to regain control and apply it to the bottom line. But to achieve this goodness requires a clear understanding of your needs, a skilled dev/ops team, a hacker mentality and focus.
- clouds remain king for spot capacity, typical web apps, development
- they do not solve complex management problems or magically scale apps
- operations at even modest scale still requires skilled, creative staff
- once a baseline of capacity is known, dedicated hardware can’t be beat
Depending on the profile of your business, cloud computing could be a game changer. Our decision to bypass the traditional cloud was game changing for us; we are noticeably stronger, faster and larger than our competitive set. And with our developed abilities in large scale data and processing, we are in control of our destiny, ready for what comes next.
If you have any questions or want to hear more about our data processing capabilities, please feel free to reach out to us.
This is the second in a three part series on the processing and infrastructure of AddThis. Make sure to check out part one and part three.
Fast forward to today. Our data infrastructure supports AddThis code deployments on more than 14 million domains, ingesting billions of events and terabytes of logs daily. Scores of machines has grown to around a thousand totalling tens of terabytes of RAM and petabytes of disk space.
The majority of our internal compute capacity is dedicated to our custom data processing platform. This platform is highly dependent on large IO capacity and low latency more than CPU cycles. Virtualization, generally speaking, can be manageable for many CPU and memory-bound tasks, but is almost always destructive to the performance of network and disk subsystems. As a result, we employ no virtualization save for a few test machines. And while we run relatively commodity network gear, our decisions around topology and infrastructure have been made deliberately and carefully to maximize performance. In our experience, the compounding effects of latency and IO bottlenecks quickly become a non-linear problem.
- even with the best of intentions, virtualization kills performance
- compress everything, keep it compressed even in RAM until it’s needed
- all data must stream (compressed!), remote procedure calls are evil
- data must be sharded for locality of processing, not physical locality (see previous)
- model around simple tasks that can checkpoint and replicate
- anything that provides transactional safety between checkpoints works against you
- design for (constant) failure whether in a public or private cloud
AddThis employs top-tier CDNs for most static web-asset serving, but we have still chosen to internalize our http-based APIs. Even though cloud offerings excel at many web-service scenarios, they mainly rely on traditional LAMP stacks with limited back-end commitments beyond scripting and modest databases. In these environments, virtualization is usually OK because most instances vastly under-utilize system resources. But when contention rears its head in the cloud, there is no recourse other than take the penalty or provision random new hardware.
Our “edge” and eventing API are http services that require maniacal tuning for performance and reliability. With billions of requests a day spread across millions of domains, our code is trusted by site owners to respond quickly. Latency *is* supreme. Flowing from our edge and eventing APIs to the backend are terabytes of real-time data. This live stream is forked into log storage and real-time systems. Proximity to our compute clusters enables both faster and more economical processing of data by avoiding repeating data across high-cost, high-latency links to remote data centers.
- commodity network gear with the right topology and software can work wonders
- optimization of networks, proxies and kernels remains a high art
- chrome does unspeakable things (for another day)
AddThis’ compute clusters are constantly processing and re-processing hundreds of terabytes of data, applying new algorithms, generating new metrics and finding deeper insights. Our current model supports complete visibility into optimizing our stack. With a top to bottom view on hardware and software, we can squeeze out more than 10 times the performance per dollar than we can get from cloud-based offerings. Equally important, we can do this reliably and predictably.
Continue to Part III: A Simple Decision – Performance, Predictability, Price
This is the first in a three part series on the processing and infrastructure of AddThis. Be sure to check out part two and part three.
Sailing Past the Clouds
The emergence and maturation of multi-tenant, virtualized, commodity computing (“the cloud”) has enabled entirely new business models by bringing large scale distributed computing at an affordable price point. But by designing systems flexible enough to function well in most scenarios, but tackling no specific problem, cloud vendors have made architectural trade-offs that are particularly challenging to us. Over the last five years, we’ve built a lower cost, more effective, more scalable solution that is closely tailored to the needs of a large scale data, low latency businesses like ours. And in doing so we have gained control over performance, predictability and price.
A Little History
AddThis’ success in the Widget business, starting in 2006 / 2007, catapulted us directly into scale problems that early clouds of that time were not ready to handle. At that time, our platform was providing real-time analytics for hundreds of thousands of widgets being viewed hundreds of millions of times a day. It doesn’t seem like much compared to the billions of events we’re tracking today, but it was on par with the social giant of that time: MySpace.
Early decisions, which were formative to our thinking, were based on pragmatism. We were small and the price to upgrade databases at the scale we needed was out of the question. The performance of scripted / hosted commodity web stacks could not predictably meet our needs. Before long, we were forced to replace our SQL databases with a much faster (and simpler) in-house developed noSQL store just to keep up with massive data growth (the term noSQL, like Hadoop itself, didn’t exist yet). We found we could get better scale with operational safety and huge cost savings by architecting around a few key learnings:
- Databases can be fast, but network connections and serialization are slow
- Transactional safety is untenable overhead for real-time anything
- In-process local data CRUD is at least 1000 times faster
- Have hot spares/replicas, stagger checkpoints
- Backup only logs, not databases
- Design around replay from logs
Continue to Part II: AddThis Today