
June 07, 2020


A Deep Dive Into Built-In Anomaly Detection: How the Algorithm Works

By Joe Ross

The release of Built-in Alert Conditions and Alert Preview allows cloud operations to exploit the full power of our real-time analytics engine in a way that is both intuitive and flexible. Much of the guesswork in applying analytics has been removed: users learn within seconds how an alert would have behaved over a historical time period and can quickly test different alert configurations if necessary.

Each built-in alert condition is a recipe for producing an alert. We have captured analytics pipelines seen across our customers (and our own teams) and exposed exactly those knobs needed to tune the alert to an environment. In this post, we describe the inner workings of one particular alert condition: Historical Anomaly.

Cyclical Thresholds for Cyclical Data

Across the wide variety of metrics emitted by modern computing environments, static thresholds are inadequate. Not only are static thresholds difficult to divine and maintain, but for many metrics it may not even be possible to define anomalous behavior in terms of a static value.

Consider the following two-week chart, which represents the total number of jobs being processed across our production analytics cluster.

This pattern — larger values midday on weekdays, smaller values overnight and on the weekends — will be familiar to many application developers. For applications that curate content, total views likely exhibits such a pattern; for social networking applications, total messages viewed or sent; for microfinance, loans originated; for search services, advertisements served.

In these scenarios, the difficulty of choosing a static threshold goes deeper than the domain knowledge it requires or the cost of maintaining it. The problem is that what counts as anomalous or worrisome depends entirely on the time of day and day of week, and a static threshold does not have this context.

The main idea of the Historical Anomaly alert condition is to use well-selected historical data to construct a threshold to compare against the current values of a signal. We explain the parameters in the alert condition that determine the selection of data and how to prevent past incidents from influencing the threshold.

The Basics of Parameters

All of the built-in alert conditions require the user to select a Signal to monitor; in this case we have chosen the total number of jobs across a cluster. We must also choose whether to alert when the signal’s values are too high, too low, or either too high or too low (the Alert when parameter). Too few jobs may indicate users are having trouble accessing the application, whereas too many jobs may indicate a need to increase capacity, so in this case either extreme could pose a problem.

There is also a Trigger sensitivity parameter which roughly corresponds to the number of alerts: higher sensitivity will typically result in more alerts. The sensitivity is really a shortcut for several of the parameters which we’ll discuss shortly.

Suppose we wish to compare values recorded on Monday at 10 a.m. with the values recorded at 10 a.m. on preceding Mondays. In the alert condition, this corresponds to a value of 1w (one week) for the parameter Cycle length. The weekly cyclicity shown above is so common that one week is the default value for cycle length.

The parameters discussed so far appear immediately below the condition summary, which explains when the alert will trigger and updates as the parameters change.

The Details of Parameters

In addition to the cycle length, we need to specify how many previous cycles are used to generate a baseline for comparison. To compare values recorded on Monday at 10 a.m. with the values recorded at 10 a.m. on the preceding four Mondays, for example, we use the value 4 for Number of previous cycles in the alert condition.

To avoid defining a threshold based on just four values, we use windows for both current and historical data. So, for example, to compare the 9:45-10:00 a.m. windows from the preceding four Mondays to the same window on the current Monday, we use the value of 15m (15 minutes) for the parameter Current window.
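
The window selection described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the product's implementation: the function name `historical_windows` and its arguments are hypothetical, chosen to mirror the Cycle length, Number of previous cycles, and Current window parameters.

```python
from datetime import datetime, timedelta

def historical_windows(now, cycle_length, num_cycles, window):
    """Return the current window and the matching window from each previous cycle.

    Each window is a (start, end) pair. All names here are illustrative.
    """
    current = (now - window, now)
    history = [
        (now - k * cycle_length - window, now - k * cycle_length)
        for k in range(1, num_cycles + 1)
    ]
    return current, history

# Monday, June 8, 2020 at 10:00 a.m., with a 1w cycle, 4 previous cycles, 15m window.
now = datetime(2020, 6, 8, 10, 0)
current, history = historical_windows(
    now, timedelta(weeks=1), 4, timedelta(minutes=15)
)

print("current:", current[0], "to", current[1])
for start, end in history:
    print("history:", start, "to", end)
```

Each historical window ends exactly one, two, three, or four weeks before the current one, so all five windows cover 9:45 to 10:00 a.m. on a Monday.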

To produce an alert condition, we will construct ranges of “normal” values and alert when the current signal values are outside that range. The option Mean plus percentage change for the parameter Normal based on is one of the methods of defining a range of normal values for the Historical Anomaly alert condition.

The first step is to take the mean of each window, both historical and current. In our example, this gives us four historical numbers, each summarizing a 15-minute window, spaced one week apart, and one current number, the 15-minute rolling mean.
The next step is to take either the mean or the median of the historical numbers, depending on the value of Ignore historical extremes (Yes uses the median, No uses the mean). We choose to ignore historical extremes and will explain why in the next section.
The final step is to construct a range of normal values; this is expressed as a percentage change relative to the median of the four historical means.
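
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual code: `normal_range` and its parameters are hypothetical names, the sample values are invented, and the baseline is assumed to be positive.

```python
from statistics import mean, median

def normal_range(windows, threshold_pct, ignore_extremes=True):
    """Build a (lower, baseline, upper) range from historical windows.

    windows: one list of samples per historical window.
    threshold_pct: allowed percentage change around the baseline.
    ignore_extremes: True uses the median of the window means, False the mean.
    """
    # Step 1: summarize each historical window by its mean.
    window_means = [mean(w) for w in windows]
    # Step 2: combine the window means into a single baseline.
    baseline = median(window_means) if ignore_extremes else mean(window_means)
    # Step 3: express the normal range as a percentage change of the baseline.
    delta = baseline * threshold_pct / 100
    return baseline - delta, baseline, baseline + delta

# Invented samples for four historical 15-minute windows.
windows = [
    [100, 110, 90],   # mean 100
    [120, 120, 120],  # mean 120
    [80, 80, 80],     # mean 80
    [100, 100, 100],  # mean 100
]
lower, baseline, upper = normal_range(windows, threshold_pct=25)
print(lower, baseline, upper)
```

With these numbers the median of the window means is 100, so a 25% threshold gives a normal range of 75 to 125.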

Choosing to Alert when the value is Too low and a value of 25% for Trigger threshold, for example, will alert when the 15-minute rolling mean is at least 25% smaller than the median of the historical means. Choosing Too high alerts when at least 25% larger, and Too high or Too low (our choice) alerts when at least 25% larger or smaller.
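
As a rough sketch of the comparison just described, the check below computes the percentage change of the current rolling mean against the historical baseline. The function name `is_anomalous` and the `alert_when` strings are hypothetical stand-ins for the Alert when options, and a positive baseline is assumed.

```python
def is_anomalous(current_mean, baseline, trigger_pct, alert_when="both"):
    """Return True when the current rolling mean falls outside the normal range.

    alert_when is one of "too_high", "too_low", or "both" (illustrative names).
    """
    change_pct = (current_mean - baseline) / baseline * 100
    too_high = change_pct >= trigger_pct
    too_low = change_pct <= -trigger_pct
    if alert_when == "too_high":
        return too_high
    if alert_when == "too_low":
        return too_low
    if alert_when == "both":
        return too_high or too_low
    raise ValueError(f"unknown alert_when value: {alert_when!r}")

baseline = 200.0  # median of the historical means (invented value)
print(is_anomalous(150.0, baseline, 25, "too_low"))   # 25% smaller
print(is_anomalous(160.0, baseline, 25, "too_low"))   # only 20% smaller
print(is_anomalous(250.0, baseline, 25, "both"))      # 25% larger
```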

While power users will feel comfortable experimenting with all of these parameters, we expect that many high-value use cases can be tackled by tuning the percentage change threshold alone.

Excluding Previous Incidents

We generally recommend ignoring historical extremes (i.e., using the median) since using the mean may render the threshold useless, for example, if there was an incident last week. In our running example, if we use the median, the threshold would only be contaminated if there were two incidents spaced exactly one, two, or three weeks apart.
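
A small numeric example, using invented values, shows how a single past incident distorts a mean-based baseline but barely moves a median-based one. The helper `lower_threshold` is a hypothetical name used only for this sketch.

```python
from statistics import mean, median

# Means of the four historical windows; the second reflects an incident (invented data).
historical_means = [1000.0, 400.0, 1020.0, 980.0]
trigger_pct = 25.0

def lower_threshold(baseline, pct):
    """Value at which the signal is pct percent smaller than the baseline."""
    return baseline * (1 - pct / 100)

mean_baseline = mean(historical_means)      # pulled down by the incident
median_baseline = median(historical_means)  # barely affected by the incident

current = 700.0  # current rolling mean during a new drop
fires_with_mean = current <= lower_threshold(mean_baseline, trigger_pct)
fires_with_median = current <= lower_threshold(median_baseline, trigger_pct)

print("mean baseline:", mean_baseline, "alert:", fires_with_mean)
print("median baseline:", median_baseline, "alert:", fires_with_median)
```

Here the incident drags the mean baseline down to 850, so the lower threshold falls to 637.5 and a current value of 700 goes unnoticed. The median baseline stays at 990, the lower threshold is 742.5, and the alert triggers.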

The benefit of using the median is demonstrated in the following example from one of our customers. The metric being monitored is the total number of messages sent by users of a social networking platform. Drops in this metric mean lower user activity, which may indicate trouble accessing the application.

In this chart, we can see the drop in the metric (the blue plot on the bottom) and the red triangle indicating an alert was triggered. Coincidentally, around the same time two weeks ago, the metric experienced a substantial, but less drastic, drop (the pink plot).

In the alert detail, we can see that the range of normal values (defined by upper and lower thresholds) does not react to the pink plot’s sudden descent. Had this prior incident influenced the threshold, the range of normal values would have been dragged downward, and the alert would have been delayed (or would have failed to trigger altogether). Note that the current plot is the rolling mean of the original signal.

Clear Conditions

One of the more powerful features of Splunk Infrastructure Monitoring is the ability to set distinct trigger and clear thresholds for alerts. When a signal hovers around the trigger threshold, other monitoring systems typically produce a sequence of alerts that clear and re-trigger in rapid succession. Splunk alerts, on the other hand, avoid this “flappy” behavior because an alert can be cleared on an explicitly set condition rather than simply on the negation of the trigger condition.

Built-in Alert Conditions exploit this feature. In this example, if we set Clear threshold to 15%, an alert will not clear until the 15-minute rolling mean is within 15% of the historical norm. Using distinct trigger and clear thresholds gives cloud operations greater confidence that when an alert clears, a new alert will not trigger moments later.
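
The effect of distinct trigger and clear thresholds can be illustrated with a small state machine. This is a sketch under assumptions, not the product's logic: `alert_states` is a hypothetical function, the data is invented, and setting `clear_pct` equal to `trigger_pct` is used here to imitate a system that clears on the negation of the trigger condition.

```python
def alert_states(rolling_means, baseline, trigger_pct=25.0, clear_pct=15.0):
    """Return, for each value, whether an alert is active after seeing it.

    The alert triggers at a deviation of at least trigger_pct and clears only
    once the deviation is back within clear_pct of the baseline.
    """
    active = False
    states = []
    for value in rolling_means:
        deviation_pct = abs(value - baseline) / baseline * 100
        if not active and deviation_pct >= trigger_pct:
            active = True
        elif active and deviation_pct <= clear_pct:
            active = False
        states.append(active)
    return states

# A signal hovering around the 25% trigger threshold, then recovering.
values = [90, 74, 76, 74, 76, 86]
with_clear_threshold = alert_states(values, baseline=100, trigger_pct=25, clear_pct=15)
without_clear_threshold = alert_states(values, baseline=100, trigger_pct=25, clear_pct=25)

print(with_clear_threshold)
print(without_clear_threshold)
```

With a 15% clear threshold the alert stays active through the hovering values and clears once, when the signal returns to within 15% of the baseline. With identical trigger and clear thresholds, the same data triggers and clears twice.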

Powerful Alerts for Everyone

Flexible analytics must be a component of any cloud monitoring solution: making sense of the stream of metrics produced by modern computing environments is at its heart an analytics challenge. With Built-in Alert Conditions and Alert Preview, users of Splunk Infrastructure Monitoring can rapidly experiment with different parameters, learn how the alert would have behaved, and easily customize complex analytical solutions in order to deploy alerts into production with great confidence.

