The addition of the Microservices APM™ product to the SignalFx monitoring platform combines the power of real-time streaming analytics for problem detection with the detail and context provided by particular code executions. This post will explain some of the inner workings of the NoSample™ Architecture, with a particular eye towards the statistical challenges and requirements that arise in distributed tracing. We provide a sketch of the statistical device in the Smart Gateway™, a key piece of the NoSample Architecture, and explain some of its benefits.
Stop Flipping Coins. Make Troubleshooting Reliable.
The idea of distributed tracing is to stitch together the execution path traversed by a request.
Operations in the transaction are timed and are represented as spans. Each span contains the parameters and the beginning and ending timestamps of the corresponding operation. In addition, the span of an operation that was invoked from another operation is linked to its caller’s span through a ‘parent span’ relationship. Collecting span data imposes minimal overhead on the application process, but storing all of it is prohibitively expensive. This creates the need for a mechanism that selects certain traces to be retained and discards others. As in other domains, the end user is more interested in anomalous examples and accurate summary statistics, and has no need to interact with the full dataset.
Uniform random sampling is the first approach that comes to mind, and is common among both legacy and open source APM solutions. This means that, for some fixed probability p, every trace is retained with probability p, regardless of its content. It has the benefit that we can make a sampling decision on the initiating span (head-based sampling), and we need not buffer or analyze further spans if we have already decided to discard the containing trace altogether. Uniform random sampling also has the benefit that statistics calculated from the sample are unbiased estimates of the statistics of the full set of traces.
However, the most important use case of distributed tracing is understanding where and how executions degrade or fail, and therefore the most important attribute of a distributed tracing system is its ability to preserve traces that are exemplars of an underlying problem. Uniform random sampling is stymied by the fact that the overwhelming majority of code executions are completely uninteresting—they complete quickly and without errors. Unfortunately for the head-based approach, traces do not announce up front whether they will take an abnormally long time to complete, or whether a span 35 hops away will result in an error.
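To make the limitation concrete, here is a minimal sketch of head-based uniform sampling. The function names, the 1% rate, and the trace counts are illustrative assumptions, not details of any particular APM product. If k anomalous traces occur and each is kept independently with probability p, the chance that at least one survives is 1 - (1 - p)^k.

```python
import random

def head_sample(trace_ids, p, seed=0):
    """Keep each trace independently with probability p, decided up front."""
    rng = random.Random(seed)
    return [t for t in trace_ids if rng.random() < p]

def prob_at_least_one_retained(k, p):
    """Chance that at least one of k anomalous traces survives sampling."""
    return 1.0 - (1.0 - p) ** k

kept = head_sample(range(100_000), p=0.01)
print(len(kept))                              # close to 1,000 on average
print(prob_at_least_one_retained(5, 0.01))    # about 0.049
```

With a 1% sampling rate, an error that occurs five times has roughly a 95% chance of leaving no retained trace at all.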
A distributed tracing system that cannot consistently flag anomalies cannot be used for effective troubleshooting, because its awareness of the system is incomplete. We need instead a tail-based approach, in which the decision whether to retain a trace is deferred until the trace is complete, at which point its characteristics determine the likelihood that it is retained. In the next section we’ll look more deeply at what trace characteristics should be considered, and at other requirements on a tail-based system.
What Makes for an “Interesting” Trace?
The most important requirement is that the trace selection mechanism should prefer “interesting” traces. There are three characteristics that make a trace “interesting”:
- Duration: among traces for the same endpoint, did this one take an abnormally long time to complete? Does this trace contain a span that, compared to the typical duration for that operation, took an abnormally long time to complete? Slower operations should increase the probability of retaining a trace.
- Errors: does the trace contain an error? Containing an error should increase the probability of retaining a trace.
- Frequency of execution: is this trace an example of a very frequently executed code path, or is the path traversed relatively rarely? Infrequently traversed paths should have a higher probability of being retained.
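The three characteristics above could be combined into a single retention probability along the following lines. This is only a sketch: the function name, the multiplicative form, and every constant are assumptions chosen for illustration, and the post does not describe the Smart Gateway's actual formula.

```python
def retention_probability(duration_percentile, has_error, path_share, base=0.01):
    """Hypothetical score; all weights below are illustrative assumptions.

    duration_percentile: fraction in [0, 1] of same-path traces that were faster
    has_error: True if any span in the trace reported an error
    path_share: fraction in (0, 1] of recent traffic that followed this path
    """
    # Slower traces: the boost grows sharply near the tail (1x up to 100x).
    slow_boost = 1.0 / max(1.0 - duration_percentile, 0.01)
    # Traces containing an error get a fixed multiplier.
    error_boost = 20.0 if has_error else 1.0
    # Rarely traversed paths get a capped, square-root-damped boost.
    rare_boost = min(1.0 / max(path_share, 1e-6), 50.0) ** 0.5
    return min(1.0, base * slow_boost * error_boost * rare_boost)

typical = retention_probability(0.50, False, 0.5)
slow = retention_probability(0.90, False, 0.5)
failed = retention_probability(0.50, True, 0.5)
rare = retention_probability(0.50, False, 0.001)
print(typical < slow, typical < failed, typical < rare)   # True True True
```

Each factor only ever raises the probability above the base rate, so an unremarkable trace on a busy path is still retained occasionally.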
Almost as important, the trace selection mechanism should also produce an accurate summary of the behavior of various execution paths. Developers want to inspect anomalous traces to understand possible failure modes, but calculating (for example) percentiles on a set of traces specifically retained due to their longer latency gives a misleading (unduly pessimistic) view of the system. Similar remarks apply to estimating the error rate on a sample of traces in which errors are more likely to be retained.
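A small simulation illustrates the bias. The log-normal latency distribution and the retention rule (slower-than-median traces kept ten times as often) are assumptions made up for this example.

```python
import random
import statistics

rng = random.Random(42)
# Hypothetical latencies in milliseconds: log-normal, so most requests are fast.
observed = [rng.lognormvariate(3.0, 0.5) for _ in range(50_000)]
median = statistics.median(observed)

# Biased retention: traces slower than the median are kept 10x more often.
retained = [d for d in observed
            if rng.random() < (0.10 if d > median else 0.01)]

def p90(values):
    """90th percentile: the last of the nine decile cut points."""
    return statistics.quantiles(values, n=10)[-1]

print(f"p90 of all observed traces: {p90(observed):.1f} ms")
print(f"p90 of retained traces:     {p90(retained):.1f} ms")
```

The second figure comes out higher than the first, which is why summary statistics should be computed from everything observed rather than from the retained sample.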
How the Smart Gateway Works
The Smart Gateway maintains a collection of durations for each execution path. This provides a number of benefits:
- A method to estimate the percentile of a particular trace, which influences how likely it is to be retained.
- A source of metrics (various percentiles) for each execution path, ensuring accurate summary statistics regardless of retained traces.
- An intelligent threshold choice for deciding when a trace is complete based on observed latency for its path. 
The grouping by execution path ensures we compare like with like. The grouping is reflected in the Traces page of the SignalFx Microservices APM UI, as shown here.
We also maintain error rates and traffic per path, so we can appropriately adjust the likelihood of retention based on the frequency of execution and whether a particular execution results in an error. The error counts (and total execution counts) are also emitted as metrics, allowing for the accurate calculation of error rates, independent of the number of error traces that are retained.
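The per-path bookkeeping described in this section might look roughly like the sketch below. The class, its methods, and the path key `"checkout>cart>db"` are hypothetical. A production system would presumably bound memory with a sketch or sliding window rather than store every duration.

```python
from bisect import bisect_left, insort
from collections import defaultdict

class PathStats:
    """Per-path bookkeeping; a simplified stand-in, not the Smart Gateway's internals."""

    def __init__(self):
        self.durations = []   # kept sorted; unbounded here for simplicity
        self.total = 0
        self.errors = 0

    def record(self, duration_ms, has_error):
        insort(self.durations, duration_ms)
        self.total += 1
        self.errors += int(has_error)

    def percentile_of(self, duration_ms):
        """Fraction of recorded durations strictly below duration_ms."""
        if not self.durations:
            return 0.0
        return bisect_left(self.durations, duration_ms) / len(self.durations)

    def error_rate(self):
        return self.errors / self.total if self.total else 0.0

stats = defaultdict(PathStats)   # keyed by a hypothetical execution-path identifier

for duration, failed in [(12, False), (15, False), (14, True), (90, False)]:
    stats["checkout>cart>db"].record(duration, failed)

path = stats["checkout>cart>db"]
print(path.percentile_of(90))   # 0.75: three of four executions were faster
print(path.error_rate())        # 0.25: computed from all traffic, not the retained sample
```

Because every execution is recorded before any retention decision is made, the percentiles and error rates stay accurate no matter which traces are later kept.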
Maps and Curves
Elsewhere we have shown examples of the ability of the NoSample Architecture to retain more error traces, and more traces of abnormally long duration. Perhaps somewhat more surprisingly, the NoSample Architecture also provides a more accurate dependency map, due to its preference for infrequently executed paths. The dependency map on the left is created from a set of traces retained by a (uniformly random) head-based sampler, and that on the right is created by the Microservices APM Smart Gateway. The head-based sampler misses the ‘sfc’ and ‘shadow-quantizer’ services, and the dependency of ‘matt’ on ‘snowflake’.
In addition to the percentile metrics, which provide real-time insight into the behavior of individual paths and operations, the “observed” curve can be overlaid on the histogram of retained traces. As expected, the observed counts are much higher than the retained counts for shorter traces (towards the left-hand side), while the retained counts come much closer to the observed counts for longer traces (towards the right-hand side), reflecting the preference for traces of abnormally long duration. Note also that far more than 1% of retained traces lie above the 99th percentile.
Providing accurate summary statistics and flagging anomalies are fundamental statistical tasks in stream processing, and a well-designed system must be able to do both. Any analysis and troubleshooting rests on a shaky foundation if your system isn’t retaining the right traces in the first place. Our unique NoSample Architecture elegantly addresses both of these key issues, enabling a distributed tracing solution best suited to the needs of monitoring today’s complex microservices environments.
A data-driven completion threshold for each path is needed for the simple reason that a trace does not announce that it is complete: some traces typically complete in 20ms, while others regularly take 2s or even 20s. The cutoff time must therefore be learned from the data in order to make a principled tradeoff between timeliness and accuracy.
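One way such a cutoff could be learned is sketched below, assuming the rule "wait a fixed multiple of the path's observed 99th percentile". The function name, the slack factor, the fallback wait, and the minimum sample count are all hypothetical choices, not documented behavior of the Smart Gateway.

```python
import statistics

def completion_cutoff_ms(path_durations_ms, quantile=0.99, slack=1.5,
                         default_ms=30_000.0, min_samples=20):
    """Hypothetical rule: wait slack x the path's observed quantile before deciding."""
    if len(path_durations_ms) < min_samples:
        return default_ms          # too little history: fall back to a fixed wait
    cuts = statistics.quantiles(path_durations_ms, n=100)   # 99 cut points
    return slack * cuts[round(quantile * 100) - 1]

# Synthetic histories: one path around 20 ms, another around 2 s.
fast_path = [20 + (i % 10) for i in range(200)]
slow_path = [2000 + 10 * (i % 10) for i in range(200)]

print(completion_cutoff_ms(fast_path))   # tens of milliseconds
print(completion_cutoff_ms(slow_path))   # a few seconds
print(completion_cutoff_ms([10, 12]))    # 30000.0, the fallback for sparse history
```

A larger slack factor waits longer and risks fewer prematurely closed traces, while a smaller one yields faster decisions. That is the timeliness-accuracy tradeoff in miniature.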