At SignalFx, we use SignalFx to monitor SignalFx. Every engineer here is responsible for their code operating in production. This post is the first in a series about how we use instrumentation, metrics, and SignalFx’s streaming analytics capability to discover and solve problems with SignalFx itself.
Instrumenting systems and applications to report close-to-real-time metrics has brought a new level of observability into the behavior and health of distributed systems. At SignalFx, our philosophy is to instrument first and ask questions later, because these are two distinct processes that don’t need to depend on each other. Instrumenting code as it is written is easy and cheap; so is collecting the data. But asking questions without the data to answer them is downright impossible, and can be costly when things start breaking down. Here’s an example of how we caught a potential issue early, tracked it down, fixed it, and confirmed the fix, all with the help of good instrumentation and real-time metrics.
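As a minimal sketch of what “instrument first” can look like in code (the class and metric names below are illustrative, not our actual metrics library), a registry of cheap named counters lets any code path record what it is doing long before anyone asks a question of the data:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Minimal "instrument first" sketch: a registry of named counters that any
// code path can bump cheaply, whether or not anyone is querying the data yet.
final class Metrics {
    private static final ConcurrentHashMap<String, LongAdder> COUNTERS =
            new ConcurrentHashMap<>();

    // Bump a counter; creates it on first use so instrumentation is one line.
    static void increment(String name) {
        COUNTERS.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    // Read the current value (0 for a counter that was never incremented).
    static long count(String name) {
        LongAdder adder = COUNTERS.get(name);
        return adder == null ? 0 : adder.sum();
    }
}
```

The point of keeping the write path this cheap is that there is essentially no reason not to instrument: the cost is paid only once questions are actually asked of the collected values.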
The SignalFlow™ analytics engine is a core component of our platform. It backs all the computations, analytics and anomaly detection behind the charts and detectors of all our customers, continuously processing hundreds of thousands of data points every second. Keeping CPU utilization in check as more and more analytics jobs execute is key to its scalability. But as utilization climbed with beta usage in the weeks and months before our launch, we started to notice a strange pattern in our CPU utilization chart.
Every five minutes, we saw a 90-to-120-second spike in CPU utilization, roughly synchronized across the entire tier of analytics servers. From our 20% CPU utilization baseline, it would jump to almost 75% for the duration of the spike before finally coming back down. Until the next spike, that is. The only thing that happens at that frequency is our metadata refresh: every five minutes, each running job re-queries its metadata so that it automatically starts considering new time series that match the user’s query, and so that the known metadata of these time series, against which groupings and aggregations may be computed, stays up to date.
For jobs working on large numbers of time series, querying our metadata store in Elasticsearch can be somewhat expensive, but that alone would not explain such a long-lasting spike in CPU load. We had to dig deeper into what happened during the processing of this metadata. Once the query returns, the job’s metadata is updated with its results, triggering rounds of metadata invalidation down an internal metadata merge graph that we use to calculate the metadata of the job’s output time series. When we looked at the metrics from the instrumentation of this code, we saw many more invalidation rounds than we expected given the number of analytics jobs running in the system and their respective “size” (in number of time series ingested).
A quick look at the stack traces of the most active JVM threads during one of these spikes confirmed that they were spending a lot of time doing these invalidations. This prompted us to take a second look at our invalidation algorithm to see if we could improve it.
The metadata graph was built on the assumption that invalidations were rare, only happening on metadata refresh intervals, but that derivations were frequent and needed to be fast. For this reason, nodes only kept references to their (multiple) “parents”: this made deriving the metadata for an object a simple merge of the common metadata properties of its parents, which are directly accessible. But it made invalidations difficult: because an object did not keep references to its children, invalidating any object invariably meant checking all the others, multiple times, until the invalidation had propagated through the entire graph. And “checking” an object involved a set intersection, which isn’t cheap either.
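A hypothetical sketch of this parent-only structure (the classes, fields and property model are ours for illustration, not our actual code) shows the asymmetry: derivation is a cheap merge over directly accessible parents, while invalidating one node forces repeated sweeps over every node in the graph:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative parent-only metadata node: fast derivation, slow invalidation.
final class MetadataNode {
    final String id;
    final List<MetadataNode> parents = new ArrayList<>();
    Map<String, String> properties = new HashMap<>();
    boolean dirty = false;

    MetadataNode(String id) { this.id = id; }

    // Derivation is fast: keep only the properties common to all parents.
    Map<String, String> derive() {
        Map<String, String> merged = new HashMap<>(parents.get(0).properties);
        for (MetadataNode p : parents.subList(1, parents.size())) {
            merged.entrySet().retainAll(p.properties.entrySet());
        }
        return merged;
    }
}

final class SweepInvalidator {
    // With no child links, we must sweep ALL nodes until a pass marks nothing
    // new, checking each node's parent set on every pass.
    static int invalidate(MetadataNode source, Collection<MetadataNode> allNodes) {
        source.dirty = true;
        int passes = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            passes++;
            for (MetadataNode n : allNodes) {
                if (!n.dirty && n.parents.stream().anyMatch(p -> p.dirty)) {
                    n.dirty = true;
                    changed = true;
                }
            }
        }
        return passes; // number of full sweeps over the whole graph
    }
}
```

Even in this toy version, every invalidation costs a number of full-graph sweeps proportional to the depth of the propagation, each sweep touching nodes that have nothing to do with the change.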
The solution was pretty simple and involved the relatively common trade-off of memory versus CPU. By keeping references in both directions, an invalidation is now as simple as walking the graph through these reverse links, invalidating objects as we go with a classic breadth-first traversal. As it happened, keeping these reverse links also allowed for some minor refactoring of the data structure that helped reduce its memory cost.
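A sketch of the fix under the same illustrative model (again, names are ours, not the actual SignalFlow code): with reverse links from parents to children, invalidation becomes a breadth-first traversal that touches only the affected subgraph:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative node keeping links in BOTH directions.
final class Node {
    final String id;
    final List<Node> parents = new ArrayList<>();
    final List<Node> children = new ArrayList<>(); // the added reverse links
    boolean dirty = false;

    Node(String id) { this.id = id; }

    void addParent(Node p) {
        parents.add(p);
        p.children.add(this); // keep both directions in sync
    }
}

final class BfsInvalidator {
    // Classic breadth-first traversal from the changed node, following the
    // reverse (parent -> child) links; nodes outside the subgraph are never
    // visited at all.
    static int invalidate(Node source) {
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(source);
        int visited = 0;
        while (!queue.isEmpty()) {
            Node n = queue.remove();
            if (n.dirty) continue; // already invalidated via another path
            n.dirty = true;
            visited++;
            queue.addAll(n.children);
        }
        return visited; // only nodes reachable from the source
    }
}
```

The cost moves from “sweeps over the whole graph” to “one visit per affected node”, at the price of one extra list of references per node, which is the memory-versus-CPU trade-off described above.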
After deploying the change to our staging environment and to a production canary instance, we immediately saw the improvement. While the other instances still showed the strong, long-lasting spikes, the canary purred along at a flat, stable 18% CPU utilization, even a little below our previous baseline.
This isn’t about the complexity of the problem or the ingenuity of its solution. It was a fairly simple case of not structuring our internal data correctly, leading to far-from-optimal algorithms for situations whose frequency we had underestimated. In cases like this, we’ve found again and again that having good instrumentation everywhere in our code (even in the places we thought didn’t matter that much!) is extremely valuable for identifying a problem, closing in on it, and ultimately confirming that the fix works.
Interested in working on these kinds of problems? We’re hiring engineers for every part of SignalFx! [email protected]
This post is part of a series about how we use instrumentation, metrics, and SignalFlow™ streaming analytics to help us develop and monitor SignalFx.