Occasionally, I’ll talk to a developer or operations engineer who says they only use APM or logs as a monitoring solution. They’ll say something like, “We don’t look at logs much since we have New Relic.” Even worse, some rely on their customers to know when something is wrong: “We just check Google Analytics to see if popular pages or actions have gone down.” The surprising thing isn’t necessarily that they have a favorite go-to tool, but that they think it’s sufficient for all their needs. Have we finally achieved the holy grail: a single pane of glass that tells us everything we need to know about operations and performance?
While no monitoring solution is omniscient, vendors from different categories have been expanding their solutions. For example, both infrastructure monitoring and APM solutions can give you insight into server performance. Both APM and log management solutions can give you a summary of application errors. Some vendors claim they offer a unified view of both metrics and logs.
Do we really need solutions for both metric and log monitoring? We will explore which is best based on the sources of data, performance, cost efficiency, and key use cases. We’ll also consider hidden trade-offs in so-called “unified solutions.”
Sources of Metrics and Log Data
Monitoring solutions are typically built around either metrics-based or log-based sources. Metrics are often used to monitor things like system resources and performance because they are easily quantified. Logs, on the other hand, offer a text record of what happened in a system or application. You might be familiar with several of these sources:
Popular Metrics Sources
- System metrics (CPU, memory, disk)
- Infrastructure metrics (AWS CloudWatch)
- Web tracking scripts (Google Analytics)
- Application agents (APM, error tracking)
Popular Log Sources
- System logs (syslog, journald)
- Application logs (log4j, log4net)
- Server logs (Apache, MySQL)
- Platform logs (AWS CloudTrail)
There are dedicated monitoring solutions for each of these sources. A server or infrastructure monitoring solution typically won’t ingest your application log files. Likewise, a log management solution won’t automatically track application performance without you explicitly coding logs to track it. However, the lines continue to blur as monitoring companies add integrations for more sources, leaving users with tough decisions about which to use. Let’s take a deeper look at the differences and where their strengths lie.
Data Structure Differences
At a fundamental level, what’s the difference between a metric and a log? We often think of a metric as a numeric measurement of some quantity. It has a descriptive label and a timestamp at which it was measured. It may also include dimensions that provide extra categorical information. It’s usually stored in a structured data format. Here is an example from Amazon CloudWatch in a JSON format:
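To make that shape concrete, here is a minimal Python sketch that builds one metric data point and prints it as JSON. The field names loosely follow CloudWatch naming (a metric name, dimensions, a timestamp, summary statistics, and a unit), but the exact layout is an assumption for illustration, and every value, including the load balancer name `example-elb`, is made up rather than taken from real CloudWatch output.

```python
import json
from datetime import datetime, timezone

# Hypothetical data point: the layout and all values are illustrative,
# not actual CloudWatch output.
metric = {
    "Namespace": "AWS/ELB",
    "MetricName": "RequestCount",  # the descriptive label
    "Dimensions": [  # extra categorical information
        {"Name": "LoadBalancerName", "Value": "example-elb"},
    ],
    # The time at which the measurement was taken (UTC, ISO 8601).
    "Timestamp": datetime(2016, 10, 5, 23, 37, tzinfo=timezone.utc).isoformat(),
    "Sum": 42.0,      # summary statistics for the measurement period
    "Average": 1.0,
    "Unit": "Count",
}

metric_json = json.dumps(metric, indent=2)
print(metric_json)
```

Every field has a name and a type, so a backend can index, aggregate, and graph the value without having to interpret free text.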
We often think of logs as text files created while running an application. If you are a developer, you might also think of the text printed out on the console when you test your code. This is often unstructured or semi-structured text, and it ideally includes a header with extra metadata like the timestamp and host. Below is an example of an access log from Amazon Elastic Load Balancing (ELB).
[05/Oct/2016:23:37:22 +0000] "GET /index.html HTTP/1.1" 200 118 124 + 0 "-" "ELB-HealthChecker/1.0"
Flexibility of Modern Monitoring Solutions
Modern monitoring solutions blur the lines between logs and metrics because they tend to offer functionality that helps translate between them. For example, log management solutions can automatically parse each log field and convert it into metrics. In the example above, “118” is recognized as the size of the response, and “/index.html” is a dimension indicating the request URL. Log management tools are also able to provide summary metrics such as sums and averages, similar to what we see in the CloudWatch example.
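As a rough sketch of that log-to-metric translation, the Python below parses the timestamp, request, status, and size fields from an access log line and rolls the response sizes up into a count, sum, and average per URL. The regular expression, the function names `parse_line` and `summarize_sizes`, and the second sample line are assumptions made for illustration; a real log management product will have its own parsers.

```python
import re
from collections import defaultdict

# Assumed pattern covering only the timestamp, request, status, and size fields.
LOG_PATTERN = re.compile(
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def parse_line(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = int(fields["size"])
    return fields


def summarize_sizes(lines):
    """Group response sizes by URL and return count, sum, and average per URL."""
    sizes = defaultdict(list)
    for line in lines:
        fields = parse_line(line)
        if fields is not None:
            sizes[fields["url"]].append(fields["size"])
    return {
        url: {"count": len(v), "sum": sum(v), "average": sum(v) / len(v)}
        for url, v in sizes.items()
    }


sample = [
    '[05/Oct/2016:23:37:22 +0000] "GET /index.html HTTP/1.1" 200 118',
    # Made-up second entry so the summary has more than one value.
    '[05/Oct/2016:23:37:52 +0000] "GET /index.html HTTP/1.1" 200 122',
]
summary = summarize_sizes(sample)
print(summary)
```

Here the URL plays the role of a dimension and the response size becomes the measured value, which is the same structure a metrics-based tool would store natively.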
Many metrics-based monitoring solutions are also able to track unstructured text logs (e.g., errors) as events. They might store an example of the error as a dimension, along with the number of occurrences over time. This is how APM tools or error trackers work. As a more advanced option, SignalFx even allows you to track custom events with any data that you want.
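The idea of storing an error as a dimension plus an occurrence count can be sketched in a few lines of Python. This is a simplified illustration, not how any particular APM tool or error tracker is implemented; the function name `count_error_events` and the sample error messages are hypothetical.

```python
from collections import Counter
from datetime import datetime


def count_error_events(events):
    """Count occurrences of each error message per minute.

    `events` is an iterable of (timestamp, message) pairs. The message acts
    as a dimension and the count is the metric value for that minute.
    """
    counts = Counter()
    for timestamp, message in events:
        minute = timestamp.replace(second=0, microsecond=0)
        counts[(minute, message)] += 1
    return counts


# Hypothetical error events for illustration only.
events = [
    (datetime(2016, 10, 5, 23, 37, 22), "TimeoutError: upstream request timed out"),
    (datetime(2016, 10, 5, 23, 37, 48), "TimeoutError: upstream request timed out"),
    (datetime(2016, 10, 5, 23, 38, 3), "TimeoutError: upstream request timed out"),
    (datetime(2016, 10, 5, 23, 38, 10), "KeyError: 'user_id'"),
]

error_counts = count_error_events(events)
for (minute, message), count in sorted(error_counts.items()):
    print(minute.isoformat(), message, count)
```

Note what is lost in the process: only one representative message and a count survive per minute, while the individual log lines and their surrounding context are discarded.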
Additionally, both log-based and metrics-based monitoring tools offer at least basic features on the frontend UI, including some elements of analytics and visualizations. You can plot a time-series graph of average CPU usage over time with almost any modern solution.
Optimized for Efficiency or Level of Detail
At a high level, metrics-based solutions tend to be more efficient when speed and data storage costs are considered, while log-based solutions are better for drilling down into the details after the fact. These differences are often driven by data-processing needs on the backend.