A few months ago at Monitorama 2016 PDX, we introduced the next evolution of SignalFx’s stream analytics computation language: SignalFlow. We gave a sneak peek at SignalFlow’s new syntax, features and capabilities, and opened up beta access to the new SignalFlow 2.0 stream APIs.

Since then, we have collected feedback from our beta users and have been hard at work polishing the language features and pushing the envelope of what’s possible to express and compute with SignalFlow 2.0. Soon, all dashboards, charts and anomaly detectors will be powered by SignalFlow 2.0 via those new APIs. Working with your data in SignalFx will be faster, more bandwidth-efficient, more interactive and even more powerful.

One area where SignalFlow 2.0 brings this power to bear is anomaly detection, centered around the detect() function and conditional when() expressions. In this blog post, we’ll explore how to leverage SignalFlow 2.0 and these new features to create powerful and intelligent anomaly detectors.

An Example Use Case: Managing Disk Space

Even in today’s age of containers and elastic infrastructures, it’s virtually impossible to escape monitoring the available disk space on your servers or instances. A full disk is often problematic for the applications running on the host – logging can be impacted, new versions of the application can fail to deploy, or worse, data can be lost.

It’s necessary for operations teams or service owners to be alerted when the available disk space on any host in the infrastructure dips below a certain threshold, so that they can free up some space.

A Simple Approach

disk_util_pct = data('disk.utilization')
disk_free_pct = 100 - disk_util_pct
detect(disk_free_pct < 15).publish('low disk space')

The above program should read fairly easily, whether or not you’re familiar with the SignalFlow language. Let’s break it down:

  • data('disk.utilization') gets data for all the timeseries of the disk.utilization metric. This particular metric, which will be used in all the examples in this blog post, is reported by collectd for each disk partition on each host and is represented as a number between 0 and 100.
  • disk_free_pct = 100 - disk_util_pct calculates the percentage of available disk space by simply subtracting the disk utilization percentage on each host from 100. disk_free_pct represents the same number of individual timeseries as disk_util_pct, which represents utilization per disk partition of each individual host.
  • detect(disk_free_pct < 15) emits an alert event (for the particular disk partition on a particular host) whenever the value of one of the timeseries of disk_free_pct is less than 15%.
  • Finally, publish('low disk space') instructs the SignalFlow program to publish those alert events with the given label.

This approach is easy and straightforward, but it has its drawbacks. The main problem is that this simple condition has no hysteresis, and that means the alert could flap on and off if the disk utilization hovers around the specified threshold.
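
To make the flapping concrete, here is a small Python simulation of a single-threshold check. It is only a sketch to illustrate the behavior, not SignalFlow and not how SignalFx evaluates detectors internally; the function name and the sample values are made up for illustration.

```python
def simple_threshold_events(free_pct_samples, threshold=15.0):
    """Return (index, 'fire' | 'clear') transitions for a single threshold.

    The alert is active exactly while the value is below the threshold,
    so every crossing in either direction produces an event.
    """
    events = []
    active = False
    for i, value in enumerate(free_pct_samples):
        below = value < threshold
        if below and not active:
            events.append((i, "fire"))
        elif not below and active:
            events.append((i, "clear"))
        active = below
    return events


# Hypothetical free-disk-space readings (%) hovering around the 15% threshold.
samples = [16.0, 14.8, 15.2, 14.9, 15.1, 14.7, 15.3]
events = simple_threshold_events(samples)
print(events)
# [(1, 'fire'), (2, 'clear'), (3, 'fire'), (4, 'clear'), (5, 'fire'), (6, 'clear')]
```

Seven readings that never move more than 0.6 points away from the threshold produce three separate alerts, each of which fires and clears almost immediately.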

For a Less Flappy Detector

A common way to tame flapping is to smooth the signal with a moving average. Although this can be effective in some cases, finding an appropriate duration for the moving average may prove tricky. If the duration is too short, the alert remains flappy. But if it’s too long, a sudden increase in disk consumption may not be caught in time!

SignalFlow offers a solution to this problem by making it possible to specify both an on condition and an off condition for the detect() function. This allows you to set alerting thresholds like before, but to additionally define that you only want the alert to clear if the available disk space goes back above a certain level.

disk_util_pct = data('disk.utilization')
disk_free_pct = 100 - disk_util_pct
detect(on=disk_free_pct < 15, off=disk_free_pct > 20).publish('low disk space')

  • detect(on=disk_free_pct < 15, off=disk_free_pct > 20) creates an alert event (for the particular disk partition on a particular host) when the on condition is met, and clears it only when the off condition is met. In this example, an alert event is fired when the value of one of the timeseries of disk_free_pct is less than 15% as before, but it will only clear when disk_free_pct goes back above 20%.
  • publish('low disk space') publishes and records those alert events under the “low disk space” label.
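
The effect of separate on and off conditions can be sketched as a tiny state machine. The Python below is an illustrative model of the behavior described above, under the assumption that each timeseries is evaluated sample by sample; it is not the SignalFlow implementation, and the function name and readings are hypothetical.

```python
def hysteresis_events(free_pct_samples, on_below=15.0, off_above=20.0):
    """Return (index, 'fire' | 'clear') transitions with separate thresholds.

    The alert fires when the value drops below `on_below` and stays active
    until the value rises above `off_above`.
    """
    events = []
    active = False
    for i, value in enumerate(free_pct_samples):
        if not active and value < on_below:
            active = True
            events.append((i, "fire"))
        elif active and value > off_above:
            active = False
            events.append((i, "clear"))
    return events


# Hypothetical readings (%): hover around 15%, then recover past 20%.
samples = [16.0, 14.8, 15.2, 14.9, 15.1, 18.0, 20.5, 19.0]
print(hysteresis_events(samples))
# [(1, 'fire'), (6, 'clear')]
```

The readings between 15% and 20% change nothing: the alert fires once and stays active until free space climbs back above 20%.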

Another approach is to specify how long the condition has to be true for the alert to fire or clear. In SignalFlow, this is done by wrapping the boolean expression in a when() call, setting the lasting argument to the desired duration:

disk_util_pct = data('disk.utilization')
disk_free_pct = 100 - disk_util_pct
detect(on=when(disk_free_pct < 15, lasting='1m'),
       off=disk_free_pct > 20).publish('low disk space')

  • detect(on=when(disk_free_pct < 15, lasting='1m'), off=disk_free_pct > 20) builds on the previous example: an alert event is emitted only when the value of one of the timeseries of disk_free_pct stays below 15% for a period of at least one minute.
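
As a rough mental model of the lasting argument, the Python sketch below fires only once the on condition has held continuously for the given duration. It assumes evenly spaced samples and measures the duration from the first sample of the current run; the exact evaluation semantics in SignalFlow may differ, and the function name and data are hypothetical.

```python
def lasting_events(samples, on_below=15.0, off_above=20.0, lasting_s=60):
    """Return (timestamp, 'fire' | 'clear') transitions.

    `samples` is a list of (timestamp_in_seconds, free_pct) pairs. The alert
    fires once the value has stayed below `on_below` for `lasting_s` seconds
    and clears as soon as the value rises above `off_above`.
    """
    events = []
    active = False
    below_since = None  # timestamp at which the current run below the threshold began
    for ts, value in samples:
        if value < on_below:
            if below_since is None:
                below_since = ts
        else:
            below_since = None
        if not active and below_since is not None and ts - below_since >= lasting_s:
            active = True
            events.append((ts, "fire"))
        elif active and value > off_above:
            active = False
            events.append((ts, "clear"))
    return events


# Hypothetical readings every 10 seconds: a 20-second dip, then a sustained one.
readings = [16, 14, 14, 16, 14, 13, 13, 12, 12, 12, 11, 25]
samples = [(i * 10, float(v)) for i, v in enumerate(readings)]
print(lasting_events(samples))
# [(100, 'fire'), (110, 'clear')]
```

The short dip at 10 to 20 seconds is ignored, and the alert fires at 100 seconds, one minute after the sustained dip began at 40 seconds.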

Comparing Against the Population

Looking at individual hosts in isolation is helpful, and you will get alerted if one of your hosts starts to run low on disk space. A complementary approach is to compare a host’s disk utilization to that of its peers, such as the other hosts within the same service. Hosts supporting the same workload are unlikely to have widely different disk usage, so an outlier is likely to indicate that something is amiss.

disk_util_pct = data('disk.utilization')
disk_util_pct_by_service = disk_util_pct.mean_plus_stddev(stddevs=3, by='service')
detect(disk_util_pct > disk_util_pct_by_service).publish('abnormal disk usage')

Here, we’re comparing the disk utilization of each host to the “mean plus three standard deviations” by service, which provides a good baseline to compare against (see mean_plus_stddev()). This is the same methodology used in identifying outliers with SignalFx’s Infrastructure Navigator feature.
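
For intuition, the same comparison can be sketched outside of SignalFlow in a few lines of Python. This is a simplified model, assuming a single snapshot of readings and the population standard deviation (we are not asserting which variant mean_plus_stddev() uses); the host names, service names and values are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev


def outliers_by_service(readings, stddevs=3):
    """Return hosts whose utilization exceeds mean + stddevs * stddev of their service.

    `readings` is a list of (host, service, disk_util_pct) tuples.
    """
    by_service = defaultdict(list)
    for _host, service, util in readings:
        by_service[service].append(util)
    # One baseline per service, analogous to grouping with by='service'.
    baseline = {
        service: mean(values) + stddevs * pstdev(values)
        for service, values in by_service.items()
    }
    return [host for host, service, util in readings if util > baseline[service]]


# Hypothetical snapshot: 19 "web" hosts at 40%, one at 95%, and 5 "db" hosts at 70-74%.
readings = [(f"web-{i}", "web", 40.0) for i in range(19)]
readings.append(("web-19", "web", 95.0))
readings += [(f"db-{i}", "db", 70.0 + i) for i in range(5)]
print(outliers_by_service(readings))
# ['web-19']
```

One caveat under these assumptions: with the population standard deviation, no value in a group of n members can sit more than sqrt(n - 1) standard deviations above the group mean, so a service needs at least 11 timeseries before any single one can exceed a three-sigma baseline.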

Putting It Together with SignalFx’s API

SignalFx’s REST API gives you programmatic access to, and control over, all of your metadata and objects within SignalFx, including anomaly detectors. The latest version of the detector API allows you to manage your anomaly detectors and create new ones using SignalFlow 2.0 programs.

In particular, this API lets you define notification rules: the severity and conditions of each alert, and how you want to be notified when it is triggered. Combining our last two examples, here’s how you would create a new, SignalFlow 2.0-powered anomaly detector via the API:

$ curl -v --request POST \
  --header 'Content-Type: application/json' \
  --header 'X-SF-Token: YOUR_API_TOKEN' \
  --data-binary @- \
  https://api.signalfx.com/v2/detector << EOF
{
  "name": "Disk space detector",
  "programText": "disk_util_pct = data('disk.utilization')\ndisk_util_pct_by_service = disk_util_pct.mean_plus_stddev(stddevs=3, by='service')\ndetect(disk_util_pct > disk_util_pct_by_service).publish('abnormal disk usage')\ndisk_free_pct = 100 - disk_util_pct\ndetect(on=when(disk_free_pct < 15, lasting='1m'), off=disk_free_pct > 20).publish('low disk space')",
  "rules": [{
    "detectLabel": "abnormal disk usage",
    "severity": "Major",
    "notifications": [{
      "type": "Email",
      "email": "[email protected]"
    }]
  }, {
    "detectLabel": "low disk space",
    "severity": "Critical",
    "notifications": [{
      "type": "Email",
      "email": "[email protected]"
    }]
  }]
}
EOF

For more details on how to configure each rule and its attached notifications, check out our Rule model documentation.
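
Hand-writing this JSON is error-prone: the program text spans several lines, which must be escaped inside a JSON string, and each rule’s detectLabel has to match a label published by the program. As a hedged sketch, the Python below builds the request body with the standard json module and checks the labels before serializing. The field names mirror the request above; the helper name and the email address are hypothetical placeholders, and nothing here calls the API.

```python
import json
import re

PROGRAM_LINES = [
    "disk_util_pct = data('disk.utilization')",
    "disk_util_pct_by_service = disk_util_pct.mean_plus_stddev(stddevs=3, by='service')",
    "detect(disk_util_pct > disk_util_pct_by_service).publish('abnormal disk usage')",
    "disk_free_pct = 100 - disk_util_pct",
    "detect(on=when(disk_free_pct < 15, lasting='1m'), off=disk_free_pct > 20)"
    ".publish('low disk space')",
]


def build_detector_payload(name, program_lines, rules):
    """Serialize a detector definition, verifying rule labels against the program."""
    program_text = "\n".join(program_lines)
    # Collect every label passed to publish('...') in the program text.
    published = set(re.findall(r"\.publish\('([^']+)'\)", program_text))
    for rule in rules:
        if rule["detectLabel"] not in published:
            raise ValueError(f"no detect() publishes label {rule['detectLabel']!r}")
    # json.dumps escapes the newlines in program_text as \n.
    return json.dumps(
        {"name": name, "programText": program_text, "rules": rules}, indent=2
    )


def email_rule(label, severity, address):
    """Build one rule that sends an email notification (placeholder address)."""
    return {
        "detectLabel": label,
        "severity": severity,
        "notifications": [{"type": "Email", "email": address}],
    }


rules = [
    email_rule("abnormal disk usage", "Major", "ops@example.com"),
    email_rule("low disk space", "Critical", "ops@example.com"),
]
payload = build_detector_payload("Disk space detector", PROGRAM_LINES, rules)
print(payload)
```

The resulting string could be saved to a file and passed to curl with --data-binary @file, and a mistyped label is caught locally before any request is sent.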

Just the Beginning

All the power of the SignalFlow language is now at your disposal to transform your metrics into actionable alerts! I hope this short walkthrough gives you a taste of how easy it is to build smarter, more accurate anomaly detectors with SignalFx. And we can’t wait to share some new use cases with SignalFx detectors in the very near future.

 



About the authors

Maxime Petazzoni

Maxime has been a software engineer for over 15 years. At SignalFx, Max is the architect behind our Microservices APM offering, and spent several years working on the core of SignalFx: its real-time, streaming SignalFlow™ Analytics. He is also the creator of MaestroNG, a container orchestrator for Docker environments.
