This is the third chapter of our multi-part blog series on the shortcomings of traditional APM solutions for monitoring microservices-based applications.
Previously, we covered:
- How traditional APM solutions fail to provide complete visibility into application performance because they rely on random, head-based sampling.
- How traditional APM's instrumentation approach is a misfit for the modern containerized, microservices world because it relies on heavyweight, proprietary agents and does not leverage system-wide observability sources such as the service mesh.
This post explains how the alerting and troubleshooting capabilities of traditional APM do not address the evolving requirements of monitoring microservices-based applications.
A highly-functioning alerting system is the starting point for problem-resolution. Traditional APM vendors have built mature alerting functionality, which works fine for monolithic apps. However, microservices architectures pose significant alerting challenges for traditional APM solutions because of the dynamism and ephemerality of container environments and the distributed nature of microservices.
Fatigue and overhead from alert storms
Traditional APM tools are designed to collect and report on performance data at the individual component level. This works well when your applications are monolithic and run entirely on a single runtime. Cloud-native architecture, however, significantly increases the number of components to monitor: breaking up a monolith results in multiple microservices, which in turn run on many more containers. Alerting on an individual component basis is then a recipe for alert noise, because a performance issue in one component often creates a domino effect on upstream components. Each affected component fires its own alerts, which repeat indiscriminately and produce an alert storm. Alert storms inhibit the triage process to the point where getting no alert is perceived as the lesser evil.
According to Gartner, Inc.:
“Most APM solutions were designed for a prior generation of applications that were monolithic and long-lived. These approaches are ill-suited to the dynamism, modularity and scale of today’s emerging microservice-based applications.”
In microservices architectures, alerts should be highly contextual, with topology awareness and correlation. For example, alerts from upstream services should automatically be muted if a downstream service they depend on is found to have a performance issue.
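The muting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service names, the `DEPENDENCIES` map, and the `triage` helper are all hypothetical, and a real system would discover the topology automatically rather than hard-code it.

```python
# Hypothetical dependency map: each service lists the services it calls
# (its downstream dependencies). Names are made up for illustration.
DEPENDENCIES = {
    "web": ["checkout"],
    "checkout": ["payments", "inventory"],
    "payments": ["db"],
    "inventory": ["db"],
    "db": [],
}


def downstream_of(service, deps):
    """Return every service reachable from `service` by following call edges."""
    seen = set()
    stack = list(deps.get(service, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps.get(current, []))
    return seen


def triage(alerting, deps):
    """Split alerting services into probable root causes and muted symptoms.

    A service is muted when at least one of its downstream dependencies
    is also alerting; otherwise it is treated as a probable root cause.
    """
    alerting = set(alerting)
    roots, muted = [], []
    for service in sorted(alerting):
        if downstream_of(service, deps) & alerting:
            muted.append(service)
        else:
            roots.append(service)
    return roots, muted


roots, muted = triage(["web", "checkout", "payments", "db"], DEPENDENCIES)
print("page on:", roots)
print("muted:", muted)
```

In this toy topology, four simultaneous alerts collapse into a single page for `db`, while the three services that depend on it are muted as symptoms.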
Missed alerts for outlier anomalies
You can’t alert on an issue you don’t see to begin with. Traditional APM tools use head-based sampling, which randomly selects a subset of transactions for performance analysis before their outcome is known. Because most transactions are never analyzed, these tools will miss some outliers and intermittent issues, and no alert fires even when the end-user experience is being impacted. We covered the shortcomings of this random, head-based sampling approach in a previous blog.
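To make the risk concrete, here is a back-of-the-envelope sketch. It assumes a simplified model that is not taken from any particular APM product: each transaction is kept independently with a fixed probability, decided before anyone knows whether the transaction will be slow. The function name and the rates below are illustrative only.

```python
def miss_probability(sample_rate, affected_requests):
    """Chance that none of the affected requests is sampled.

    Assumes each request is kept independently with probability
    `sample_rate`, which is a simplification of head-based sampling.
    """
    return (1.0 - sample_rate) ** affected_requests


# Illustrative sampling rates and outlier counts, not vendor defaults.
for rate in (0.01, 0.10):
    for affected in (5, 50):
        p = miss_probability(rate, affected)
        print(f"rate={rate:.0%} affected={affected}: "
              f"chance of capturing none = {p:.1%}")
```

Under this model, a 1% sampling rate misses all five of five slow requests roughly 95% of the time, so an intermittent issue can persist without a single trace to alert on.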
“Your alerts should tell you about a performance problem before your customers will. The tools that used to work ten years ago are no longer sufficient to monitor p99 cases in distributed systems because these tools do not see everything across the system.”
Sr DevOps Engineer, Digital Marketing Platform Company
Slow alerts on detected anomalies
Traditional APM solutions require several minutes to notice a performance deviation and even more time to fire an alert. This is because they are built on a batch-and-query architecture, which has high latency and becomes even slower as the number of dimensions considered for alerting grows.
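By way of contrast, the sketch below evaluates an alert condition on every data point as it arrives instead of waiting for a periodic batch query. It is a minimal illustration, not a description of any vendor's implementation: the `StreamingThresholdDetector` class, the rolling-mean rule, and the sample latencies are all assumptions made for the example.

```python
from collections import deque


class StreamingThresholdDetector:
    """Fires when the rolling mean of the last `window` points exceeds `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.points = deque(maxlen=window)
        self.total = 0.0

    def observe(self, value):
        """Ingest one data point and return True if the alert condition holds."""
        if len(self.points) == self.points.maxlen:
            # The oldest point is about to be evicted by append().
            self.total -= self.points[0]
        self.points.append(value)
        self.total += value
        window_full = len(self.points) == self.points.maxlen
        return window_full and (self.total / len(self.points)) > self.threshold


def first_alert_index(values, detector):
    """Return the index of the first data point at which the detector fires."""
    for index, value in enumerate(values):
        if detector.observe(value):
            return index
    return None


# Made-up latency samples in milliseconds.
latencies_ms = [100, 110, 105, 400, 420, 450]
detector = StreamingThresholdDetector(threshold=300, window=3)
fired_at = first_alert_index(latencies_ms, detector)
print("alert fired at data point index:", fired_at)
```

Because the running total is updated incrementally, each evaluation costs constant time, so detection delay is bounded by the window length rather than by how often a batch query is scheduled.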
Siloed perspectives for infrastructure and application
Today’s microservices environments are increasingly dynamic, modular, ephemeral, and loosely coupled, making it difficult for domain-specific traditional APM solutions to provide a unified, single-pane-of-glass view across infrastructure, platform, and application monitoring. Even when infrastructure and application monitoring capabilities come from the same vendor, users are expected to connect the dots and manually correlate events, because APM and infrastructure performance metrics are usually displayed in separate tabs without automatic correlation or context.
A fragmented application and infrastructure perspective from APM vendors leads to lengthy, time-consuming war-room situations during root-cause analysis. There is nothing inherently wrong with creating a war room, but the APM tool should streamline collaboration.
A next-gen APM solution for monitoring microservices can solve this problem by providing an integrated application and infrastructure view in a single pane of glass, with everything correlated and in context.
Lack of prescriptive troubleshooting
Traditional APM tools lack the cross-domain analysis needed to recognize patterns and understand causal relationships across distributed systems. As a result, users are expected to examine individual traces manually and arrive at ‘aha’ moments themselves. Next-gen APM tools need to apply data science to recognize the underlying performance patterns and surface them to DevOps teams to expedite troubleshooting.
We are excited to join AWS at the largest gathering of the global cloud community, AWS re:Invent. We would love to share how our customers are leveraging SignalFx to speed up problem resolution and reduce MTTR.
To learn more, stop by the SignalFx booth and attend the following sessions at AWS re:Invent.
Nike’s Direct-to-Consumer Transformation
Nike recognized the need to accelerate its digital transformation, spur growth, and deepen its connection to consumers by individually tailoring content. In this session, learn how Nike adopted a proactive monitoring and observability culture to empower its engineering, SRE, and business teams to keep everything running smoothly at scale.
Fully Realizing the Microservices Vision with Service Mesh
The service mesh is becoming the most critical component of the cloud-native stack. While still early in terms of adoption, this new infrastructure layer has massive implications for the way companies build and operate distributed systems.