If you went back in time 25 years and talked to a database administrator, they would tell you there was one thing they needed to know about infrastructure performance: is it up or is it down?
Back then, applications were largely monolithic. When an element faltered, an IBM Tivoli, CA Unicenter, or BMC PATROL agent would pull data on the relevant infrastructure metric. If a server failed, a periodic health check from HP OpenView would send an alert, and whoever was on call would receive the dreaded page that something had gone wrong. IT monitoring largely consisted of waiting for the next event, and the management tools were dependable and reliably expensive.
Fast forward to today. We now live in a world of clouds and increasingly complex architectures. Infrastructure is virtualized and elastic, applications consist of many interdependent services with open-source middleware performing millions of tasks each minute, and updates are constantly shipped and reversed in Docker containers. The life expectancy of any resource in a scale-out architecture can be as short as an hour, compared to previous generations of scale-up architecture, which could run on the same machines for years.
The distributed and dynamic nature of modern applications has turned monitoring into an analytics problem. Autoscaling and load balancing make it impossible to judge system health and performance based solely on a view of server status and events. But most organizations still apply this outdated framework to modern environments.
A Framework for Understanding Need
So let’s propose a more evolved way to think about monitoring for the modern enterprise. Maslow’s Hierarchy of Needs is well known to most everyone who has taken an introductory psychology course. Only by satisfying the most basic needs first—such as water, food, and safety—can an individual progress through the various stages of the pyramid to the top, what Maslow called self-actualization (i.e., the pursuit of hopes and dreams).
It turns out this framework is quite analogous to the needs of modern operations and monitoring. Interestingly, organizations moving to the cloud, undergoing digital transformation, and embracing DevOps practices are not unlike humans in their tendency to focus on their future state, rather than ensuring their current, most pressing needs are met first.
I’ve simplified the Hierarchy of Monitoring Needs into three layers, each of which must be met before progressing to the next:
- Reactive Monitoring. At any given time, which parts of your service are up or down?
- Proactive Monitoring. Actively manage quality and performance. Identify and isolate issues before they impact users. Roll back changes or deploy fixes before a problem escalates.
- Business Value. Correlate user and business metrics with infrastructure and application metrics to make decisions that drive new outcomes and desired results.
Where Are We Today?
In the context of traditional, monolithic applications (e.g., Oracle, Siebel, SAP), monitoring has achieved some level of self-actualization for many enterprises. Infrastructure management is more straightforward in a scale-up context, the metrics are limited, and the applications and monitoring tools have had decades, starting in the 1980s and 1990s, to advance toward well-understood objectives of business relevance.
Among the small handful of massive webscale companies, a few have reached the higher stages of the hierarchy. At the second level of the pyramid, organizations begin to emphasize self-service and performance monitoring for their applications. Product teams can align development goals and roadmap to insights derived from operational metrics. Canary deployments become a common practice for new features. And performance changes are not only monitored in the context of issues arising, but also benchmarked against scalability, load demand, and infrastructure size. Many organizations believe they are at this stage of monitoring, but only companies like Facebook and Google tend to have the visibility to proactively and confidently experiment with new deployment models.
In truth, the vast majority of operations teams working with a cloud application are trying to cope with needs at the first level of the hierarchy. At this stage, production monitoring is about the primal fear that your customer or boss will know that a service is down before you do. It’s an unfortunate reality that systems and applications go down, whether modern or not. One of the first rules of operating in the cloud is that something is bound to fail somewhere sooner rather than later. But traditional monitoring tools are tailored to a more static environment. Health checks are especially noisy in elastic infrastructures and don’t provide advance warning of a troubling pattern. And the pace of change in scale-out architectures means that a traditional monitoring tool needs to be constantly reconfigured to reflect changes in architecture, service membership, and alert rules.
Metrics Monitoring for Modern Environments
To overcome the fear of not knowing that some part of your elastic, distributed architecture has gone down, you need a new way of monitoring. By shifting from a static view of your infrastructure to a metrics view, the entire organization can aggregate and interact with a breadth of streaming time series data across the entire stack. For the first time, you can alert on any metric that matters to your use case and compare to meaningful historical patterns and populations to evolve your signal as your services and infrastructure change.
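To make the idea of comparing a metric to its own history a little more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular monitoring product works: the function names, the sample latency values, and the three-standard-deviation threshold are all hypothetical, and a real system would pull its history from a time series store instead of a hard-coded list.

```python
import math
from statistics import mean, stdev


def deviation_from_baseline(history, current):
    """Return how many sample standard deviations `current` is from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is infinitely unusual.
        return 0.0 if current == mu else math.copysign(math.inf, current - mu)
    return (current - mu) / sigma


def should_alert(history, current, threshold=3.0):
    """Alert when the current value is at least `threshold` deviations from the baseline.

    The default of 3.0 is an assumed, illustrative value.
    """
    return abs(deviation_from_baseline(history, current)) >= threshold


if __name__ == "__main__":
    # Hypothetical login-service latencies in milliseconds for the same hour on past days.
    past_latency_ms = [120, 118, 125, 122, 119, 121, 123, 120]
    for reading in (123, 180):
        print(reading, should_alert(past_latency_ms, reading))
```

Because the baseline is recomputed from whatever history you pass in, the same rule keeps working as instances come and go, which is the property a static up/down check lacks.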
Once you’ve applied meaningful metrics and alerts to your modern environment and overcome the most basic challenges related to the pace of change, you’ll already have the tools to advance toward the next levels of the monitoring hierarchy. You’ll also have the context for a more meaningful analytical approach to your infrastructure and application operations as well as your business and customer objectives. Ultimately, operational intelligence rests on the ability to ask the right questions and correlate data from multiple sources in a single, real-time view:
- How does a 5% increase in the latency of my login service affect user retention?
- If load capacity for my mobile transactions service is 25% higher than the same time last year, but demand is 45% higher, what will be the effect on revenue?
- How many standard deviations below the mean for throughput on my data ingestion service do I need to be before I miss an enterprise SLA?
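The arithmetic behind the last two questions can be sketched in a few lines of Python. This is only a rough illustration: the function names and the throughput figures are hypothetical, and turning either number into a revenue or retention estimate would require business data that the calculation itself cannot supply.

```python
def relative_utilization(capacity_growth, demand_growth):
    """Ratio of this year's demand-to-capacity to last year's.

    Growth rates are fractions, so 0.25 means 25% higher than last year.
    A result above 1.0 means demand has outpaced capacity.
    """
    if capacity_growth <= -1:
        raise ValueError("capacity cannot shrink by 100% or more")
    return (1 + demand_growth) / (1 + capacity_growth)


def sigmas_above_sla(mean_throughput, stdev_throughput, sla_floor):
    """How many standard deviations the mean throughput sits above the SLA floor."""
    if stdev_throughput <= 0:
        raise ValueError("standard deviation must be positive")
    return (mean_throughput - sla_floor) / stdev_throughput


if __name__ == "__main__":
    # Capacity up 25%, demand up 45%: utilization is roughly 16% higher than last year.
    print(round(relative_utilization(0.25, 0.45), 2))

    # Hypothetical ingestion service: mean 50,000 events/s, standard deviation 4,000,
    # and an SLA floor of 40,000 events/s.
    print(sigmas_above_sla(50_000, 4_000, 40_000))
```

With these assumed numbers, a service that ran at 90% utilization last year would now be asked for more than it can deliver, and the ingestion service would miss its SLA once throughput drops 2.5 standard deviations below its mean.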
As more new applications are developed in the cloud with distributed, scale-out architectures from the outset, the objectives of monitoring will surely broaden, and web-native organizations will plan for some of the higher-level needs we already see in more traditional, static environments. Today, however, the vast majority of enterprises are just beginning to transition to elastic, ephemeral environments. For them, understanding the fundamental needs of operating modern infrastructure and applications, and how those needs differ from the needs of legacy ones, is among the most important steps toward cloud-readiness.
With a new approach to measuring, observing, analyzing, and alerting on operations data—aligned to the sophistication of modern architectures but aiming to remove the complexity of monitoring them—for the first time, it’s entirely possible to self-actualize and rise to the top of our Hierarchy of Monitoring Needs.