Machine Learning – Scale is Relative


Yesterday, Amazon Machine Learning was announced on the AWS Official Blog. It’s great to see more tools developed to improve the data science workflow, and they’re enabling the Web to become much more personalized. Nvidia is doing it with DIGITS, and MATLAB has a toolbox too. The growth of tools supporting data science makes personalization better, more accessible to companies, and easier to scale. The better the tools, the better we can do our jobs. But, as always, we need to keep the whole picture in mind and watch the trade-offs.

At AddThis, the use case for machine learning is our content classification engine, which analyzes URLs that end users engage with across the Web. This sub-system enables content recommendation and other personalization products in our offering. The scale is huge (1.8B monthly uniques, 3B page views per day, etc… read more about us), and it demands careful thought about pricing decisions. Amazon Machine Learning’s pricing is great for getting a new project started, but at scale the costs become interesting. Let’s do some napkin math:

($0.10 per 1,000 predictions) × (50M URLs classified per day) = $5,000 per day

Over a year that adds up to roughly $1.8M, not including storage or compute resources, and assuming only one prediction model per URL. We’re not trying to bring back “unlimited unlimited” ridiculousness, but not all scale is created equal. It’s critical to have a picture of where your effort will end up and the costs involved. That being said, we can’t wait to try it out on a small dataset.
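For the curious, here’s a minimal sketch of the napkin math above. The $0.10 per 1,000 predictions rate and the 50M URLs per day come from the figures in this post; everything else (the variable names, the single model per URL, the 365-day year) is just illustrative.

```python
# Napkin math for batch prediction costs, using the figures cited in this post.
# The rate and volume below are the post's assumptions, not a quote from AWS.

PRICE_PER_1000_PREDICTIONS = 0.10   # USD per 1,000 batch predictions
URLS_PER_DAY = 50_000_000           # URLs classified per day
MODELS_PER_URL = 1                  # assuming a single prediction model per URL

daily_cost = (URLS_PER_DAY * MODELS_PER_URL / 1000) * PRICE_PER_1000_PREDICTIONS
annual_cost = daily_cost * 365

print(f"Daily cost:  ${daily_cost:,.0f}")    # Daily cost:  $5,000
print(f"Annual cost: ${annual_cost:,.0f}")   # Annual cost: $1,825,000
```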

This post was a collaborative effort between Matt, Otto, Aditya, and Gennady.