By now you may have heard of the security vulnerability CVE-2014-0160 popularly known as Heartbleed. This was a bug in the popular OpenSSL encryption library used by a majority of web sites, including AddThis. Continue reading
Today we are happy to announce that Hydra—the core of our data processing platform—is now open source and available on github. It’s freely available under the Apache License for anyone to use, and we look forward to seeing just what people do with it!
We are big fans of Coda Hale’s Metrics library. Virtually all of our applications emit copious metrics that we use to monitor application state, optimize performance and resource allocation, and keep tabs on business metrics. In short: to know what is going on. We like it so much for our Java applications that we made it easy for any language with a network stack to participate with MetricCatcher and Phetric.
All of these metrics are sent to Ganglia and Graphite for dashboards, big screens filled with graphs, and fault detection. Ganglia has a solid architecutre of distributed metrics collection out of the box, and Graphite is known for its sweet graphs. We have contributed to both the Metrics Ganglia reporter and one of the available Ganglia-Graphite bridges.
Unfortunately, we ended up with a messy slew of ways to configure where we were sending metrics–Java properties files, Spring XML, even the occasional hard coded in-app configuration. It also annoyed our engineers that the Ganglia configuration was repeated both in each app and in the local
Shocking though it may be, it’s also possible to have too many metrics. We have on occasion, for example, had a distributed service keep a Timer for every other service it talked through. This was totally awesome for in situ performance debugging, but the combinatorial explosion of unique host-metric-service tuples was too expensive to keep around.
Metrics-Reporter-Config tries to solve both of these problems:
- Configuration of multiple reporters is in a single yaml file. This should fit particularly nicely into any Dropwizard services you might have written.
udp_send_channels can be extracted from a gmond.conf instead of repeating them for each app.
- It’s easy to add white or black lists for which metrics go to which reporters. We usually use this to expose some of those timing metrics only via JMX.
There are a bunch of options there, but usually we just use a simple config that looks like this:
ganglia: - period: 60 timeunit: 'SECONDS' gmondConf: '/etc/ganglia/gmond.conf' predicate: color: "black" patterns: - ".*JMXONLY$"
Jars are already on their way to Maven Central. Check out the README and get started. We hope that others find this code as useful as we have. Feedback, comments, patches, bugs, and forks are all welcome.
Here at Clearspring we like to count interesting things. How many users have visited a site? How many unique URLs are shared every hour? It turns out that counting is a non-trivial problem when there are several billion things to count. And counting is only the first step. What about the frequency of the things you are counting? Maintaining a complete multiset — with billions of elements indexed by multiplicity — for each dimension of interest is rarely practical. So, we’ve developed a set of utilities to help make counting to a billion easy.
Today we are pleased to release those utilites under an open-source license. Stream Lib is a Java library for summarizing streams of data. Included are classes for estimating:
- Cardinality (aka “counting things”): Instead of storing an entire set of elements it is possible to instead construct a compact binary object that will provide a tunable estimate to how many distinct elements have been seen. This reduces memory requirements by orders of magnitude. There is a significant body of academic literature on approaches to this problem. We have tried to provide useful implementations of those ideas.
- Set membership: Bloom filters provide a space efficient way to test for set membership. They have the useful property that their are no false negatives, only false positives within whatever bounds you specify. The wikipedia page has a good section on the interesting variants that have been developed over the recent years. We have adapted Apache Cassandra’s well tested implementation for standalone use.
- Top-k elements: While counting the number of distinct elements with a cardinality estimator is cool, sometimes (well often) you also want to know something about those elements (such as the most frequent ones). We have some early work in this area (a stochastic topper).
There is a Readme to get your started with the code. We hope that others find this as useful as we have. Feedback, comments, patches, bugs, and forks are all welcome.
Think this is cool? Apply for a job to join the team!