If you spend any time in the dojo of a site reliability or operations master, you’ve seen banks of LED screens awash in dashboards, charting millions of infrastructure and application metrics as hypnotic multi-color streams. A black-belt trained in the art of DevOps can pick out an anomaly or outlier from a real-time graph like The Karate Kid’s Mr. Miyagi picking a fly from the air with chopsticks. But imagine if Miyagi had been limited to picking flies all day. Despite his fly-picking expertise, he would not only have had to come to terms with missing more flies than he caught, but he would also never have had the time to mentor Daniel LaRusso or invent the crane kick. The Cobra Kai would still run turf throughout Reseda, California, and the world would be a much different place.
Thankfully, a modern monitoring system goes beyond simply sending and displaying operations metrics. Where the old school of monitoring relied on an engineer passively observing just a few basic metrics—like CPU usage or network throughput—to spot major changes or outages, today’s systems enable the ops team to become proactive and alert on trends before they become problems.
But as monitoring has become more effective at finding issues of every variety, that effectiveness has also given rise to a new, pervasive problem: itchy-trigger syndrome. If every metric beyond a certain threshold sends an alert—and today’s highly distributed, elastic cloud environments are, by nature, somewhat unpredictable as they scale up and down—how can you tell which, if any, of the notifications in an alert storm deserves your attention? Measuring issue duration is an essential but underappreciated way to get a higher level of confidence from your monitoring and alerts.
A Single Node Falls in the Forest
In the past, the main goal of alerting was to be kept up to date on any and all issues throughout the architecture. Before too long, however, the complexity of systems running at scale, particularly those built on ephemeral infrastructure, made alerting on the status of individual nodes a nightmare. Virtualized environments are inherently unpredictable, and a notification on a server failure doesn’t necessarily call for immediate action to improve the health or performance of the whole system. In the end, actionless alerts just become noise.
At massive-scale web companies like Facebook, folks like SignalFx’s co-founder, Phil Liu, started building monitoring-as-a-service platforms that both advanced and simplified the rules on which alerting was built. Rather than static, arbitrary thresholds that set off alerts every time a change occurred anywhere in the environment, users could aggregate systems data and apply analytic functions to get fewer, more meaningful notifications.
Chasing Ghost Anomalies
Once alerts were more efficient, the speed and scale required to take action became a double-edged sword. Alerts were more relevant and thresholds more fine-grained, but the elimination of latency could make the monitoring system more trigger-happy, particularly for tracking anomalies. Back at the dojo, Mr. Miyagi could rest his chopsticks until he got the notice of a fly to catch, but much of the time, the fly would just be a piece of dust or a shadow, leaving the sensei snapping at thin air.
The worst false-positive scenario is an alert storm resulting from good thresholds but too little qualification. Imagine it’s the night before your big karate tournament and your pager repeatedly goes off in the middle of the night because a crucial Elasticsearch workload suddenly hits your percentile threshold. You check your systems, but everything seems normal. Then it happens again. And again. You don’t get the rest you need, and even the den-den daiko meditation is no use. Your shot at winning the big tournament is ruined.
Measuring Duration → Precision Action
Chances are, a missing duration rule is at play. Duration is an essential additional condition for your alert triggers. For the most part, triggers are set to fire an alert immediately upon a threshold being reached, whether it’s a metric out of range or a significant (even if only momentary) increase or drop in the rate of change for a time series.
In many cases, these changes describe a true anomaly—a sudden and temporary outlier that, if sustained, will have a material impact on system health and performance, but may also normalize just as quickly. Without a duration setting, spikes in demand on parts of your infrastructure or interim changes made by your cloud provider could not only send a needless alert, but could start an alert storm. Alert storms are a primary contributor to burnout, eroded responsiveness, and, eventually, attrition from your ops team.
Effective monitoring not only amplifies the signal that’s relevant to your specific environment and use case, but also provides the control to qualify the signal before sending an alert. For example, requiring a condition that a spike in utilization for your Elasticsearch workloads lasts at least 90 seconds—or 90 minutes, hours, days, weeks, or months—increases the confidence level of your notifications, particularly when the unexpected but manageable demand came from an ad campaign for a karate tournament that was mistargeted at Southern Cairo rather than Southern California. Better yet, an effective duration condition prevents your Slack channel from getting bogged down and keeps your ops team’s pagers from needlessly blowing up in the middle of the night.
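The idea of qualifying a threshold with a duration can be sketched in a few lines. This is an illustrative example, not SignalFx’s implementation: `Sample`, `sustained_breach`, and the 90-second figure are stand-ins for whatever your monitoring system actually models.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float  # seconds since epoch
    value: float      # e.g. CPU utilization percent

def sustained_breach(samples, threshold, duration):
    """Return True only if `value` exceeds `threshold` continuously
    for at least `duration` seconds. `samples` must be in time order."""
    breach_start = None
    for s in samples:
        if s.value > threshold:
            if breach_start is None:
                breach_start = s.timestamp  # breach begins; start the clock
            if s.timestamp - breach_start >= duration:
                return True
        else:
            breach_start = None  # metric recovered; reset the clock
    return False

# A one-sample spike at t=2 never satisfies the duration condition...
spike = [Sample(t, 95 if t == 2 else 10) for t in range(6)]
# ...but six straight seconds above threshold does.
sustained = [Sample(t, 95) for t in range(6)]
```

The key design point is the reset in the `else` branch: any dip back under the threshold restarts the clock, so only genuinely sustained breaches page anyone.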
For another example, read about how a customer used a rate-of-change function with a five-minute duration on SignalFx’s ActiveMQ integration to “un-stick” a pesky low-throughput message queue.
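A rate-of-change condition with a duration window works the same way, just on the derivative of a counter instead of a raw value. The sketch below is a hypothetical reconstruction of the “stuck queue” pattern—the alert fires when throughput flatlines for too long, not when it spikes—and is not the actual SignalFx ActiveMQ integration.

```python
def stuck_queue_alert(samples, min_rate, duration):
    """Fire when the per-second rate of change of a cumulative message
    counter stays below `min_rate` for at least `duration` seconds.
    `samples` is a time-ordered list of (timestamp, cumulative_count)."""
    low_since = None
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rate = (c1 - c0) / (t1 - t0)
        if rate < min_rate:
            if low_since is None:
                low_since = t0  # throughput went flat; start the clock
            if t1 - low_since >= duration:
                return True
        else:
            low_since = None  # messages are flowing again; reset
    return False

# A counter frozen at 100 for five minutes trips the alert...
stuck = [(t, 100) for t in range(0, 360, 60)]
# ...while a steadily climbing counter does not.
healthy = [(t, t * 2) for t in range(0, 360, 60)]
```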
For even greater control, SignalFx includes a percent-of-duration trigger, which lets you set a condition for a minimum percentage of a moving duration window. For example, defining “80% of 10 minutes” would send an alert only when the threshold is breached for at least eight of the last 10 minutes. With the kind of control and insight SignalFx enables, your ops team will have the precision that comes from knowing an alert is actionable and will be sharp enough to quickly reach DevOps sensei status.
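Mechanically, a percent-of-duration check boils down to a sliding window over breach flags. Here is a minimal sketch, assuming evenly spaced samples (one per minute, so a window of 10 samples models “80% of 10 minutes”); it illustrates the idea and is not SignalFx’s implementation.

```python
from collections import deque

def percent_of_duration(values, window_size, threshold, min_fraction):
    """For each sample, yield True when at least `min_fraction` of the
    last `window_size` samples exceed `threshold`. With one-minute
    samples, window_size=10 and min_fraction=0.8 is '80% of 10 minutes'."""
    window = deque(maxlen=window_size)  # oldest flag falls off automatically
    for v in values:
        window.append(v > threshold)
        # Evaluate only once a full window of history has accumulated.
        yield len(window) == window_size and sum(window) / window_size >= min_fraction
```

Because the deque holds booleans, `sum(window)` counts breaching minutes directly; the `maxlen` argument makes the window slide with no bookkeeping. Eight breaching minutes out of the last ten fires the alert, while seven does not.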