Back in the 2010-2011 timeframe, the fine folks at Etsy published a series of blog posts about how they do monitoring, and articulated their philosophy in a clever and catchy fashion:
“If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it.”
Last week, we were reminded again of why this is a great way to think about monitoring. Tracking everything that moves, graphing the data that comes out of that tracking, watching those graphs when you can, and having them at your fingertips are all super-important, because you never know when you’re going to need them.
This proved to be true with one of our customers, who figured out there was an outage in Amazon S3 despite not having very good direct information about it from Amazon itself. In their case, they were making use of S3 for a pretty typical purpose — backing up a service and its data — and they used SignalFx to monitor the progress of that backup.
Now, to be clear, this kind of monitoring isn’t necessarily the first thing that every customer thinks to do. It’s not uncommon to have conversations with customers who think that their monitoring is in great shape because they look at system logs on a regular basis, or because they have health checks on the major components of their system, or because they have an APM system telling them about latencies in their application.
But in this case, the customer was living the Etsy philosophy, and was monitoring not only their application, but also the Elasticsearch systems that perform several key functions as part of the application. And they weren’t just monitoring the parts of Elasticsearch that the application depends on, but also the tertiary function of whether Elasticsearch snapshots were being taken periodically and stored in a repository on S3. Here’s the chart they were looking at:
Our customer had an alert set up to tell them when the rate of change for the error counts increased.
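To make the idea of alerting on a rate of change concrete, here is a minimal sketch in plain Python. It is not SignalFx's API or the customer's actual detector; the function names, the sampling interval, the threshold, and the `sustained` parameter are all hypothetical, chosen only to illustrate the technique on a cumulative error counter.

```python
from typing import List, Sequence


def rate_of_change(counts: Sequence[float], interval_s: float) -> List[float]:
    """Per-second change between consecutive samples of a cumulative counter."""
    return [(later - earlier) / interval_s
            for earlier, later in zip(counts, counts[1:])]


def should_alert(counts: Sequence[float], interval_s: float,
                 threshold: float, sustained: int = 2) -> bool:
    """Fire only if the last `sustained` rates all exceed `threshold`.

    Requiring several consecutive breaches (an assumption of this sketch)
    keeps a single noisy sample from paging anyone.
    """
    rates = rate_of_change(counts, interval_s)
    if len(rates) < sustained:
        return False  # not enough data to judge yet
    return all(rate > threshold for rate in rates[-sustained:])


# Hypothetical cumulative snapshot-error counts, sampled once per minute.
quiet = [0, 0, 0, 0, 0]
failing = [0, 0, 0, 6, 18]

print(should_alert(quiet, interval_s=60.0, threshold=0.05))
print(should_alert(failing, interval_s=60.0, threshold=0.05))
```

Alerting on the rate rather than the raw count means a counter that has been slowly accumulating errors for weeks does not fire, while a sudden burst of failures does.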
When they found out that the count of errors associated with the snapshots had increased, they used the high-resolution metrics they were sending into SignalFx to look into what was causing the backup failures. The max time to create a snapshot spiked at the same time that errors started flowing through, as seen in the chart below.
This gave them a timely signal that S3 was not functioning correctly for them, even though Amazon itself was light on details. Ironically, Amazon’s status page wasn’t showing any issues, because the icons that needed to be served up to show something other than ‘green’ were, you guessed it, stored on S3!
To Amazon’s credit, this issue made the headlines mostly because of how robust S3 is, and how much many of us feel we can count on it. But in this case, our customer was happy to be monitoring their use of it from an independent service like SignalFx. And to add to what Etsy said a few years back, they came to realize a couple of additional corollaries:
- Tracking everything that moves isn’t just about watching over the individual component you’re measuring. Often, the symptoms you see there are early warning signs of more severe problems that you wouldn’t otherwise know about.
- Graphing the data is great if you happen to be looking at the chart, but it’s even better to have alerting that you can trust to tell you, accurately and promptly, when you should be looking at the chart.