Metrics are one of the most critical tools for understanding how your systems are operating. They are designed to collect information on all the components of a system, to report on whether performance is as expected, and to guide where action should be taken when something undesired is observed.

The availability of the SignalFlow 2.0 API opens up access to both the metrics collected and the power of the SignalFx analytics engine. Rather than limiting metrics to a chart in a UI or a mechanism to alert someone, API access to the underlying data of your system means you can take this functionality anywhere, which opens up new use cases for metrics in operational tasks.

Beyond Charting, Reporting, and Alerting

There have historically been only a few key use cases built around metrics: charting, reporting, and alerting. Charting and reporting consist of collecting metrics from across your environment and visualizing them in charts and dashboards. Whether you are managing day-to-day operations or are on-call, you look at these charts periodically and make sure that things look as you would expect. In a post-mortem situation, you can look at the charts to determine whether there were trends that would have indicated an issue.

For alerting, rather than rely on human eyes to spot and track potential issues, metrics are collected and a piece of software determines whether or not to send an alert or fire off a process based on a set of conditions. However, in both these use cases, the primary consumer of the metrics data is either a chart in a UI or an internal alerting system. The use of metrics has typically been confined to these use cases.

The SignalFlow 2.0 API extends the use of metrics to anywhere within your operational workflow. Not only can you access data for the traditional use cases, but you can also access real-time streaming data and analytics computations, opening up new use cases for metrics.
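
For example, with the signalfx Python client you can execute an analytics computation and stream its output straight into your own tooling. Here is a minimal sketch; the metric name 'cpu.utilization' and the access token are placeholders:

# A minimal sketch of streaming a SignalFlow computation with the
# signalfx Python client; 'cpu.utilization' and the token are placeholders.
import signalfx

program = "data('cpu.utilization').mean().publish()"

flow = signalfx.SignalFx().signalflow('MY_ACCESS_TOKEN')
try:
    computation = flow.execute(program)
    for msg in computation.stream():
        if isinstance(msg, signalfx.signalflow.messages.DataMessage):
            # Each data message carries a timestamp and a map of timeseries values.
            print(msg.logical_timestamp_ms, msg.data)
finally:
    flow.close()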

Use Case 1: Cluster Management in a Distributed System

When part of a distributed system is restarted or a new node is brought into “the cluster”, there is often a time period where the cluster needs to adjust. This may be rebalancing of persisted data, such as in the case of systems like Cassandra or Elasticsearch, or priming, in the case of transactional systems.

Operations teams will typically come up with an arbitrary amount of time to “give” a node before it fully becomes part of the cluster. Tribal knowledge like “give 5 minutes between nodes when restarting to allow the node to get up to speed” or “wait for the transaction count to get above 500” is passed down from service owners and operations teams.

Time is easy to hard-code in scripts, but what happens if the guess is wrong? Transaction counts can be tracked and watched in a chart on a screen, but is this an efficient use of an operational engineer’s time? Plus, it is boring to stare at graphs for hours while a cluster rolls!

With the SignalFlow 2.0 API, there is no longer a need for either of these manual tasks. We can access the right metrics and perform the necessary computations in a programmatic manner so that operational tasks can be carried out automatically and efficiently.

An Example with Cassandra

For Cassandra, when a node is down, all the other nodes in the cluster will store up the writes that the node should receive. This mechanism is called “hinted handoff”. When that node comes back up, we at SignalFx have learned to wait for the count of hinted handoffs across the cluster to get back down to zero to know when the cluster is “ready” and “stable”. However, based on our experience operating Cassandra, we want the count of hinted handoffs to stay at zero for two minutes, as Cassandra can sometimes pause a bit when doing the final handoffs. (This is one place we can’t seem to get rid of all the “guessing”.)

Let’s do it for a Cassandra cluster roll!

Step 1: Identify which metric (or a good enough proxy) tells us when the node is “ready”.

For Cassandra, we will use gauge.cassandra.Storage.TotalHintsInProgress.Count as our metric. This can be collected with the collectd agent.

Step 2: Write a SignalFlow program to gather these metrics and apply any analytics necessary.

Our Cassandra cluster is made up of multiple nodes, so we need to get the metric data and sum it across the cluster, as “hinted handoffs” will accumulate on more than one node. Because the metric can be jittery, we also sum the data over one minute to smooth it out and make sure we don’t get fooled by little dips in the “hinted handoffs”.

hinted_handoffs = data('gauge.cassandra.Storage.TotalHintsInProgress.Count', 
filter('cluster', 'my_cluster')).sum().sum(over='1m')

Step 3: Define when the cluster is “not ready”.

We have to figure out what the data from Step 2 looks like when the cluster is not ready. For Cassandra, we are “not ready” when there are any “hinted handoffs”.

not_ready = hinted_handoffs > 0

Step 4: Figure out what the data from step 2 looks like when the cluster is “ready”.

Once we know what the data from Step 2 looks like when we are “not ready”, we can decide when the cluster is ready. Based on our experience, the cluster is ready when the “hinted handoff” count has stayed at zero for two minutes.

ready = when(hinted_handoffs == 0, '2m')

Step 5: Create a detector for the right conditions.

With the “ready” and “not ready” conditions from the previous steps, the detector itself is easy to define.

detect(on = not_ready, off = ready).publish('hinted_handoffs')

We can put all the steps together to get the metric data, sum it across the cluster, smooth out any little dips, and create the condition for the detector:

hinted_handoffs = data('gauge.cassandra.Storage.TotalHintsInProgress.Count', 
filter('cluster', 'my_cluster')).sum().sum(over='1m')
not_ready = hinted_handoffs > 0
ready = when(hinted_handoffs == 0, '2m')
detect(on = not_ready, off = ready).publish('hinted_handoffs')

Step 6: Execute the program.

Finally, we use the SignalFlow API to execute our program and wait for the detector to fire and clear.

import signalfx

program = '''
# Get the data and sum it across the cluster, as "hinted handoffs" will accumulate on more than one node in the cluster.
# Also sum the data over 1 minute to make sure that we don't get fooled by little dips in the hinted handoffs.

hinted_handoffs = data('gauge.cassandra.Storage.TotalHintsInProgress.Count',
filter('cluster', 'my_cluster')).sum().sum(over='1m')
not_ready = hinted_handoffs > 0
ready = when(hinted_handoffs == 0, '2m')
detect(on = not_ready, off = ready).publish('hinted_handoffs')
'''

# token is your SignalFx API access token
flow = signalfx.SignalFx().signalflow(token)
try:
    computation = flow.execute(program)
    went_anomalous = False
    for msg in computation.stream():
        if isinstance(msg, signalfx.signalflow.messages.EventMessage):
            state = msg.properties.get('is')
            if state == 'anomalous':
                # The detector fired: hinted handoffs are accumulating.
                went_anomalous = True
            if state == 'ok' and went_anomalous:
                # The detector cleared: handoffs have stayed at zero for 2 minutes.
                print('Hinted handoffs cleared')
                break
finally:
    flow.close()
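
With that in place, an entire cluster roll can be driven from a script: restart a node, run the program, and block until the detector clears before moving on to the next node. Below is a rough sketch, assuming a hypothetical restart_node() helper (SSH, your orchestration tool of choice, etc.) and a wait_until_stable() function that wraps the execute-and-stream loop above:

# A rough sketch of a fully automated cluster roll. restart_node() is a
# hypothetical helper, and wait_until_stable() wraps the detector loop above.
def roll_cluster(nodes):
    for node in nodes:
        restart_node(node)   # bring the node down and back up
        wait_until_stable()  # block until hinted handoffs stay at zero for 2 minutes
        print('{0} rejoined and the cluster is stable; moving on'.format(node))

roll_cluster(['cass-node-1', 'cass-node-2', 'cass-node-3'])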

Use Case 2: Addressing Operational Anomalies

In large, distributed systems, there are occasionally interesting things happening to your service when you are not looking. One-off anomalies may spike within a very short time period. This may be due to external forces, such as increased customer demand, or an internal condition, such as unfair load balancing.

Service owners or operations teams may not always be able to react to these anomalies, especially if they happen very quickly. To address these issues proactively, someone would have to be lucky enough to notice the anomaly and then manually take action to diagnose it. This takes a significant amount of time and is far more tedious than any service owner wants to deal with.

A simple script with SignalFlow 2.0 API eliminates the risk of missing an operational anomaly and the manual effort of trying to find a needle in a haystack. We can immediately identify the issue and programmatically perform the necessary actions.

An Example with Go

At SignalFx, Go is used for our front-end ingestion service, known as SignalBoost-Ingest. We chose Go because it’s fast, easy to maintain, and works great inside of containers.

At one point, there was a situation where, every 10 minutes, the max number of goroutines would spike for a few seconds.

[Image: active goroutines]

We could quickly see that many machines were experiencing load, but that just two of the machines were disproportionately seeing the spike.

[Image: active goroutine spikes]

With the spikes occurring on seemingly random machines, it’s very hard to know when to take action. Even with great tools like pprof to manually take a goroutine dump, the lag between observing the behavior in a chart and reacting with the right action can be seconds or even minutes. This is where SignalFlow comes in: we can detect the anomaly and react to it without any manual intervention.

The SignalFlow program is as follows:

ingests = filter('%s', 'signalboost-ingest*')
goroutines = data('num_goroutine', filter=ingests)
detect(goroutines > %s).publish('goroutines_too_high')
  • ingests = filter('%s', 'signalboost-ingest*'): This creates a filter object specifically for the SignalBoost-Ingest service.
  • goroutines = data('num_goroutine', filter=ingests): The SignalBoost-Ingest service reports the number of running goroutines with the metric ‘num_goroutine’. We only want those for a specific dimension, so we filter on that dimension for everything that starts with ‘signalboost-ingest’.
  • detect(goroutines > %s).publish('goroutines_too_high'): This detects when any machine’s value goes above a static threshold. The %s placeholders are filled in before the program is executed, as shown in the sketch below.
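
The two %s placeholders (the dimension name and the static threshold) are substituted before the program is sent to the API. Here is a minimal sketch of that substitution; the dimension name 'host' and the threshold of 10000 are illustrative values only:

# Fill in the SignalFlow template before execution. In our script the
# dimension name comes from a command-line argument (args.dimension);
# the values below are illustrative.
program_template = '''
ingests = filter('%s', 'signalboost-ingest*')
goroutines = data('num_goroutine', filter=ingests)
detect(goroutines > %s).publish('goroutines_too_high')
'''
program = program_template % ('host', 10000)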

When we detect the condition for each host, we want to automatically take a goroutine snapshot. We then maintain a view of all the machines currently in that state and take snapshots for every datapoint to find out exactly what is happening. As we detect each server going back below the static threshold, we remove that server from the saved state and take one final snapshot, exiting when all hosts have come back down to a normal state. To prevent the time it takes for a single snapshot from affecting the snapshotting of the other machines, we do the snapshotting in separate threads and join them all at the end before exiting.
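
The script relies on a few helpers that are not part of the SignalFlow API: a threadit() function, a shared threads list, and dump_goroutines()/dump_netstat() functions. Below is a rough sketch of what these might look like, assuming the service exposes Go’s net/http/pprof debug endpoint and the hosts are reachable over SSH; the port, paths, and file names are illustrative:

# Hypothetical helpers used by the script below; not part of the SignalFlow API.
import subprocess
import threading

import requests

threads = []

def threadit(func, args):
    # Run each snapshot in its own thread so one slow host doesn't delay the others.
    t = threading.Thread(target=func, args=args)
    t.start()
    threads.append(t)

def dump_goroutines(host, tag):
    # Fetch a full goroutine dump from the (assumed) pprof debug endpoint.
    resp = requests.get('http://{0}:6060/debug/pprof/goroutine?debug=2'.format(host))
    with open('goroutines-{0}-{1}.txt'.format(host, tag), 'w') as f:
        f.write(resp.text)

def dump_netstat(host, tag):
    # Capture socket state on the host; assumes passwordless SSH access.
    out = subprocess.check_output(['ssh', host, 'netstat', '-an'])
    with open('netstat-{0}-{1}.txt'.format(host, tag), 'wb') as f:
        f.write(out)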

Putting it all together, we use the SignalFlow data for the operational task:

try:
    print('Executing {0} ...'.format(program))
    computation = flow.execute(program)
    state = {}   # hosts currently above the threshold
    i = 0        # number of midway snapshots taken
    for msg in computation.stream():
        # Stop once we've taken at least one snapshot and every host has cleared.
        if i > 0 and not state:
            print('finished')
            break
        if isinstance(msg, signalfx.signalflow.messages.DataMessage):
            # For every datapoint, snapshot each host that is still anomalous.
            for k, v in state.items():
                print('midway goroutines {0} {1}'.format(k, i))
                threadit(dump_goroutines, (k, "mid%d" % i))
                threadit(dump_netstat, (k, "mid%d" % i))
                i += 1
        if isinstance(msg, signalfx.signalflow.messages.EventMessage):
            is_state = msg.properties.get('is')
            if is_state == 'anomalous':
                # A host crossed the threshold: snapshot it and start tracking it.
                inputSources = msg.properties["inputs"]
                for k, v in inputSources.items():
                    source = v["key"][args.dimension]
                    value = v["value"]
                    print('excessive goroutines {0} {1}'.format(source, value))
                    threadit(dump_goroutines, (source, is_state))
                    threadit(dump_netstat, (source, is_state))
                    state[source] = True
            if is_state == 'ok':
                # The host dropped back below the threshold: take one final
                # snapshot and stop tracking it.
                inputSources = msg.properties["inputs"]
                for k, v in inputSources.items():
                    source = v["key"][args.dimension]
                    value = v["value"]
                    print('back to normal {0} {1}'.format(source, value))
                    threadit(dump_goroutines, (source, is_state))
                    threadit(dump_netstat, (source, is_state))
                    del state[source]

    # Wait for any in-flight snapshots to finish before exiting.
    for t in threads:
        t.join()
finally:
    flow.close()

See the whole program here for your reference.

Using the utility gdiff, we were able to compare the goroutine snapshots and quickly determine that we were receiving a lot more requests to the thrift endpoint of the service, and were able to track down the offending client.

Conclusion

The availability of the SignalFlow 2.0 API opens up a whole new set of use cases, and we’ve highlighted just two examples of how we’ve used metrics in operational tasks. What was once a manual effort based on simple guesswork or extremely fast reactions can now be automated with more efficiency and effectiveness.

These are just the start of the new use cases we’ve experienced within our team, and we can’t wait to hear what new use cases you’ve discovered with the SignalFlow API!

 

Find your signal today »

About the authors

Ted Crossman

Ted is a software engineer with more than 20 years of experience working on foundational backend services with a strong DevOps perspective. Described as “one of my best engineers” by Ben Horowitz in his book The Hard Thing About Hard Things, Ted has worked on building high performance distributed systems at companies such as Opsware and Infoseek Japan. At SignalFx, Ted works on the core analytics engine SignalFlow and infrastructure code.

Matthew Pound

Matthew is a software engineer with more than 15 years of experience working on backend services in complex operational environments including Opsware. At SignalFx, Matt leads our Go codebase and majors in data ingest with a minor in integrations.
