The Old Way: Element Managers & Health Checks

Before the rise of the cloud, infrastructure health was understood primarily through the lens of simple IT health checks. Tools like Nagios and HP OpenView would poll the machines and devices across the network for status updates. They’d report when a server or switch failed to respond to a ping or behaved abnormally, and the infrastructure team would respond accordingly.

As the hardware and software stack became more complex, operations teams relied on additional information to build a more complete IT infrastructure monitoring view. In addition to health status, specialized application testing, network management, and server monitoring tools helped with performance engineering and analysis at each layer of the stack. While a collection of element managers could provide specific insight into events at the database or storage layer, for example, so-called Manager-of-Managers technologies like IBM Tivoli, BMC PATROL, and CA Unicenter became necessary to capture, correlate, and make sense of the abundance of operations data.

Infrastructure Monitoring: The Old Way

Operations teams primarily used a Manager-of-Managers to determine when a problem was significant enough to page someone in the middle of the night. However, as infrastructure and applications shifted to elastic, distributed cloud environments, traditional element and systems managers began to fail under the increased variety of data and complexity of performance requirements. While pinpointing the location of a down server was the top priority for the infrastructure team under the old regime, the ephemeral nature of modern infrastructure requires a more analytical view of availability.

In an elastic environment, a series of alerts from a systems manager on host unavailability may be pure noise: the hosts may have been removed in a normal scale-down during a low-traffic period, or the service may be able to handle individual node failures. Despite the ill fit for monitoring cloud environments, traditional monitoring remains one of the largest categories of spend in the systems management space.
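
The difference between a host-level and a service-level view can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `HostStatus` record, the `service_alerts` function, and the 50% healthy-fraction threshold are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class HostStatus:
    """Hypothetical record: one host's latest health check result."""
    host: str
    service: str
    up: bool


def service_alerts(statuses, min_healthy_fraction=0.5):
    """Return the services whose fraction of healthy hosts is below the threshold.

    A single down host does not alert by itself; only degraded
    service-level capacity does. The 0.5 default is an arbitrary example.
    """
    totals = {}
    healthy = {}
    for status in statuses:
        totals[status.service] = totals.get(status.service, 0) + 1
        healthy[status.service] = healthy.get(status.service, 0) + (1 if status.up else 0)
    return sorted(
        service
        for service, total in totals.items()
        if healthy[service] / total < min_healthy_fraction
    )


statuses = [
    HostStatus("login-1", "login", True),
    HostStatus("login-2", "login", True),
    HostStatus("login-3", "login", True),
    HostStatus("login-4", "login", False),   # one node down, service still fine
    HostStatus("checkout-1", "checkout", True),
    HostStatus("checkout-2", "checkout", False),
    HostStatus("checkout-3", "checkout", False),
]
print(service_alerts(statuses))
```

A per-host check would page three times for this input. The service-level rule flags only the checkout service, where two of three nodes are down.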

Although element managers are able to send events and generate alerts when individual hosts encounter errors, they weren’t built for a service-wide view of the patterns and trends determining performance. Without analytics that aggregate metrics and provide a more dynamic view of performance relative to meaningful thresholds, even Manager-of-Managers systems are only monitoring at the surface of any environment. They don’t address the service-level infrastructure monitoring required to operate more sophisticated architectures made up of open-source stateful services, message buses, containers, and orchestration tools in the cloud.

The New Way: Metrics Aggregation & Intelligent Alerts

Analytics on time series data underlie a modern approach to infrastructure monitoring and are key to ensuring availability of today’s distributed, elastic environments in production. Analytics help aggregate service-level metrics, offering a better way to explore performance and spot outliers than a component view alone.

Infrastructure Monitoring: The New Way

Rather than waiting to poll for simple events or to consolidate and analyze alerts from a variety of noisy element managers (as alert aggregation tools do), a more effective solution delivers real-time alerts on the metrics that actually matter to your specific architecture. By computing and visualizing rates of change, percentiles, moving averages, or variance relative to historical benchmarks, you can isolate a pattern, measure its severity, and correlate the root cause with the trend you’re observing to prevent an issue before it affects availability.
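
As a rough illustration of the kinds of computations named above, here is a minimal Python sketch of a rate of change, a moving average, and a percentile over a list of metric samples. The function names and sample values are hypothetical, and the percentile uses the simple nearest-rank method, which is only one of several common definitions.

```python
import math


def rate_of_change(values, interval_seconds):
    """Per-second change between consecutive samples taken at a fixed interval."""
    return [(b - a) / interval_seconds for a, b in zip(values, values[1:])]


def moving_average(values, window):
    """Mean of each run of `window` consecutive samples."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request counter sampled once a minute.
requests_total = [100, 160, 280]
print(rate_of_change(requests_total, 60))

# Hypothetical latency samples in milliseconds.
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(moving_average(latencies_ms[:4], 2))
print(percentile(latencies_ms, 90))
```

A production system would compute these continuously over streaming data rather than over in-memory lists, but the arithmetic is the same.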

By aggregating metrics and comparing against dynamic thresholds (rather than the static limits used by element managers), you can troubleshoot and triage problems at any level of the stack in real time. Dynamic thresholds allow you to compare metrics against a chosen benchmark that may change over time—for example, the historical norm for a given time of day and day of week. The ability to spot and fix even a subtle change in latency, load, or throughput as it emerges is key to proactively operating modern applications in the cloud. For the first time, you can determine the difference between a normal change, an anomaly, and a threatening pattern to get alerts and address issues before they turn into emergencies and affect the end-user experience.
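
One way to picture a dynamic threshold is sketched below in Python, assuming the benchmark is the historical mean for each (day of week, hour) slot and that "anomalous" means more than three standard deviations from that mean. Both choices, along with the names `build_baseline` and `is_anomalous`, are illustrative assumptions rather than a description of any specific product.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def build_baseline(history):
    """Group (timestamp, value) samples by (weekday, hour) and summarize each slot."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(baseline, ts, value, num_stdevs=3.0):
    """True if value is far from the historical norm for this weekday and hour."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # no history for this slot, so this sketch stays quiet
    avg, spread = baseline[slot]
    return abs(value - avg) > num_stdevs * spread


# Hypothetical load history: busy Monday mornings, quiet Sunday nights.
history = [
    (datetime(2024, 1, 1, 9), 100), (datetime(2024, 1, 8, 9), 110),
    (datetime(2024, 1, 15, 9), 90), (datetime(2024, 1, 22, 9), 100),
    (datetime(2024, 1, 7, 3), 20), (datetime(2024, 1, 14, 3), 22),
    (datetime(2024, 1, 21, 3), 18), (datetime(2024, 1, 28, 3), 20),
]
baseline = build_baseline(history)

# The same reading of 100 is normal on a Monday at 9:00 but not on a Sunday at 3:00.
print(is_anomalous(baseline, datetime(2024, 1, 29, 9), 100))
print(is_anomalous(baseline, datetime(2024, 2, 4, 3), 100))
```

A static limit would have to be set either high enough to miss the Sunday spike or low enough to fire every Monday morning.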

Infrastructure monitoring built on analytics also helps eliminate the false alarms and alert fatigue that can result from simplistic health checks. By using a push model, where metrics and their corresponding metadata are reported at a regular cadence to an analytics system, an administrator can build an alert that’s based on a dynamic query (e.g., alert any time a machine reporting itself as part of the login service has a CPU anomaly). Unlike other monitoring and management tools that require reconfiguration every time you change your environment, charts and alert rules created through dynamic queries automatically survive those changes.
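
The idea of an alert rule defined by a dynamic query can be sketched in Python as follows. The datapoint layout, the `matches` and `firing_hosts` helpers, the metric name `cpu.utilization`, and the fixed threshold of 90 (a simple stand-in for real anomaly detection) are all hypothetical and chosen only for illustration.

```python
def matches(dimensions, query):
    """True if every key/value pair in the query appears in the datapoint's metadata."""
    return all(dimensions.get(key) == value for key, value in query.items())


def firing_hosts(datapoints, rule):
    """Hosts that match the rule's metadata filter and exceed its threshold."""
    return sorted(
        {
            dp["dimensions"].get("host", "unknown")
            for dp in datapoints
            if dp["metric"] == rule["metric"]
            and matches(dp["dimensions"], rule["filter"])
            and dp["value"] > rule["threshold"]
        }
    )


# The rule names a service, not a list of hosts.
rule = {"metric": "cpu.utilization", "filter": {"service": "login"}, "threshold": 90}

# Datapoints pushed by hosts, each carrying its own metadata.
datapoints = [
    {"metric": "cpu.utilization", "value": 35.0, "dimensions": {"host": "i-01", "service": "login"}},
    {"metric": "cpu.utilization", "value": 97.0, "dimensions": {"host": "i-02", "service": "login"}},
    {"metric": "cpu.utilization", "value": 99.0, "dimensions": {"host": "i-03", "service": "search"}},
]
print(firing_hosts(datapoints, rule))

# A newly launched login host is covered as soon as it starts reporting.
datapoints.append(
    {"metric": "cpu.utilization", "value": 95.0, "dimensions": {"host": "i-04", "service": "login"}}
)
print(firing_hosts(datapoints, rule))
```

Because the rule selects on the `service` metadata that each host reports about itself, the new host is picked up without any change to the rule.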

With the real-time insight introduced by modern infrastructure monitoring, application developers, infrastructure engineers, and operations teams can collaborate across the entire application lifecycle for the first time. Infrastructure monitoring complements services like application performance management (APM) and log management by filling a large gap not previously addressed: intelligent and timely alerting on service-wide issues and trends within your production environment.

Specifically, developers use an APM solution like New Relic or AppDynamics to instrument their applications and trace performance issues across transactions. However, APM data represents just one subset of information that a modern approach to infrastructure monitoring needs to process. By combining data from APM and several other element managers, a modern infrastructure monitoring solution can aggregate and alert on the metrics flowing directly from the constantly changing population that makes up most elastic, distributed architectures.

To evaluate an issue in production, log management tools like Splunk and the Elastic Stack help operations teams explore all the details of an event and determine root cause after the fact. But the massive detail that logs provide can’t realistically be processed quickly enough to deliver the meaningful, proactive, and timely alerts that are required to operate today’s distributed, scale-out environments.

A complete development and operations workflow requires real-time alerts that are triggered by the metrics you care about, aggregated at the service level. For every cloud application, infrastructure monitoring focused on time series analytics is essential to availability across the product lifecycle.

APM ≠ Infrastructure Monitoring

APM solutions should be used for what they are exceptional at doing: providing transaction traces and identifying bottlenecks in code. They were not designed for monitoring the service-level operations of today’s diverse environments, where several factors outside of your code can create real issues.

Although many APM solutions now come bundled with some basic infrastructure monitoring, they lack the breadth of coverage and context to provide adequate alerting in a heterogeneous production environment. Did you experience high latency between two services because the network was slow or because a load balancer was misconfigured? Was there an unusually high amount of load on that service to begin with? Were several of the nodes in that service down, leaving capacity degraded?

Moreover, most APM solutions require proprietary agents that perform byte-code injection. Though such a heavyweight approach might be acceptable in a development environment, most organizations prefer not to bear the expense of running a proprietary agent across the production fleet and choose to sample data from selected nodes for infrastructure monitoring instead. However, sampling doesn’t provide a reliable view of the production environment’s changing population or its performance and is, therefore, an insufficient source of data to drive effective alerts.

APM tools help organizations easily instrument and identify bottlenecks in their code. APM vendors focus most of their development resources on the instrumentation part of the problem (e.g., providing the best tracing for Java applications), but have not invested in the downstream analytics, correlation, and alerting required of a general-purpose monitoring solution. Ultimately, they provide another source of insight that is tremendously valuable when combined with other operational data in a complete, modern infrastructure monitoring solution.

Log Management Needs Infrastructure Monitoring

The immense volume of unstructured log data generated by modern infrastructure offers operations teams deep insight into the root cause of a systems problem. However, logs are not particularly useful for alerting on real-time infrastructure issues across distributed environments. At the time of an emergency, an infrastructure monitoring solution provides the necessary service-level details to triage and remediate the issue.

Infrastructure Monitoring vs. Log Management

Metrics are the best first line of defense when dealing with a problem. Streamed into an analytics-based monitoring solution, they help the viewer narrow down to the service and application causing problems in the most timely manner. Even more effectively, modern infrastructure monitoring can generate proactive alerts on patterns that foretell a mounting concern and provide the runway to isolate, assess, and address the underlying issue before a problem affects the end user.

Because logs are primarily unstructured data, they are well suited to batch data analysis of a discrete event. However, a big data approach to logs makes them poorly suited to the real-time search and stream processing required for timely alerts. The high volumes of disk I/O and network load needed for log exploration are much better aligned to post-hoc analysis, as opposed to the high metric throughput typical of a time series database used for infrastructure monitoring.

For cloud environments, whose goal is to scale infrastructure elastically, you need a purpose-built system focused on metrics and analytics. Real-time aggregation is not a job for batch analytics, because alerting requires much faster, more flexible insights. Log analysis for deeper exploration and investigation is ultimately a great complement to an infrastructure monitoring solution that handles real-time analytics and alerting on time series data.

APM + Infrastructure Monitoring + Log Management

Today, a more modern approach to infrastructure monitoring can help rationalize the role of the APM and log management tools that development and operations teams already use. The data and insights at each stage of the journey shouldn’t be viewed in three separate silos. Today’s smartest product organizations are managing both effectiveness and cost by flowing insights across all stages of the application lifecycle. SignalFx is the most advanced way to aggregate and alert on streaming metrics, helping today’s dev and ops teams fill the gap between APM’s pre-flight performance engineering and log management’s post-mortem event analysis. SignalFx’s real-time visibility into and analytics on the live production environment also help rationalize your existing investments with better overall results.

Modern Infrastructure Monitoring + APMs + Logs