End-to-End Observability with Metrics, Traces, and Logs
At SignalFx, we’re on a mission to be the leader in enterprise-grade, end-to-end, real-time Observability. We are already pioneers in real-time monitoring, where we lead the industry in alerting and troubleshooting based on metric and trace data, two of the three pillars of Observability. We also leverage log data for root cause analysis via contextual deep linking into Splunk and other log analytics tools. Now we’re advancing our Observability capabilities with the introduction of log metricization by way of an official integration with FireLens, the new log aggregation service from AWS.
Why capture metrics from logs?
As more and more companies become cloud-native, the need for real-time and accurate visibility into complex application environments has never been greater. Metrics and traces provide DevOps teams with the critical information they need to spot and troubleshoot errors in real time. But there are occasions where application metrics or traces aren’t available, or where the specific details about an error that developers and SRE teams need can’t be found in standard metrics or traces. That’s when DevOps teams need to rely on logs.
Situations like these arise when it is inconvenient or impossible to instrument code with metrics and traces. Some of the more common scenarios include:
- Lack of understanding. Developers are deploying new application functionality or services that are quickly evolving and don’t yet know what to instrument.
- Lack of skills or best practices. Developers don’t always know how to create the correct measurement, and may also not be familiar with instrumentation best practices established by the operations teams.
- Lack of time. In the move-fast cloud-native world, developers are under pressure to ship code as quickly as possible and may not have enough time to fully instrument their code.
- Legacy applications. It’s also common for developers to work with legacy applications that cannot be instrumented or are simply not worth the investment in time and effort to instrument.
In these situations, developers may find it easier to simply dump events and error information to logs in order to maintain as much visibility into their application as possible without making a full commitment to or investment in instrumentation. And why not? Logs are the earliest form of feedback and the easiest type of telemetry data to emit. Using logs to report on errors gives developers more time to think about the relevant application metrics and measurements. They can start small and discover over time what is important to the performance of their application and, more importantly, their business. In many cases, developers will also log the details and store them in a log analytics system like Splunk. Not having to track and push metrics has the additional advantage of reducing the upfront complexity of new code, which enables developers to move faster and be more productive.
Unfortunately, logs aren’t easy to interpret and it’s hard to spot trends and what’s happening across multiple application components or services from individual logs. Moreover, not everyone has access to or is able to understand logs. On-call engineers, SREs, and other operationally-oriented team members who are more accustomed to infrastructure dashboards or do not know how the code is written typically don’t use logs at all. This is where log metricization comes into the picture.
For log files that contain useful performance data, such as counters for application errors or log-in attempts, or log messages per error or log-in that can be counted, it is often valuable to transform log data into time-series metrics to make them more accessible to all team members and correlate them with other signals for more comprehensive Observability throughout your environment. It’s common for DevOps and SRE teams to have infrastructure metrics that need to be combined with the event-based information delivered via logs. For instance, an application may log every HTTP request; metricizing those logs aggregates the information as requests/second by response code, and an alert can then fire if there is a spike in errors. The SRE responding to the alert can get a more comprehensive understanding of the overall system by visualizing the rest of the infrastructure and application components on the dashboard and, if needed, drill down into the details to troubleshoot by viewing the log message itself.
What’s new: SignalFx Log Metricization
SignalFx now ingests metrics from log collectors and makes the data available instantly for visualization and alerting. With SignalFx log metricization, DevOps teams can now apply the advanced capabilities of the real-time SignalFx Observability platform and benefit from our streaming analytics, AI-driven alerts, and directed troubleshooting along with the rich set of data found in logs. They can now use log-based metrics to quickly consolidate error and performance information into pre-built dashboards for all users, easily slice and dice the data for visual inspection, and accurately detect anomalies and outliers to trigger alerts and investigatory workflows. Log metricization also gives developer and SRE teams much more flexibility in how they aggregate the raw data to suit their own needs, without the loss of granularity that can occur with pre-defined metrics.
SignalFx ingests data from common log routers, including fluentd-based AWS FireLens
While the majority of infrastructure metrics come out of the box, application and business-level metrics often require instrumentation. Before making the investment in upfront instrumentation, SignalFx log metricization provides a convenient and low-risk approach to discover what metrics are needed while making use of all the log data that you already have. With SignalFx log metricization, your logs can be used for more than just root cause analysis; they can be used for day-to-day monitoring and real-time observability as well.
New integration: AWS FireLens
SignalFx is an official launch partner of AWS FireLens, a new log aggregation service launched this week by AWS. Based on Fluent Bit, FireLens unifies log filtering and routing across all AWS container services, including Amazon ECS, Amazon EKS, and AWS Fargate. FireLens provides easy-to-configure plugins and eliminates the need to deploy separate sidecar agents for ECS and Fargate. SignalFx has published an output plugin based on the official Amazon Fluent Bit image. The image for this SignalFx plugin contains the Fluent Bit binaries and additional plugins for AWS Firehose and AWS CloudWatch provided by Amazon. SignalFx captures event metrics from FireLens logs and correlates them with other metrics and traces for real-time monitoring, accurate alerting, and directed troubleshooting across your entire cloud environment.
The following screenshots provide an example of how DevOps teams can leverage the streaming analytics capabilities of SignalFx for real-time monitoring and advanced alerting based on FireLens log-based metrics.
SignalFx ingests an error metric from the FireLens logs
For this example, a metric called com.firelensdemo.app.error is created by the SignalFx FireLens output plugin. This metric is a simple count of the number of times “error” is found in the FireLens logs.
SignalFx instantly visualizes error metrics found in FireLens logs
com.firelensdemo.app.error is visualized in a SignalFx chart, which provides a quick and easy way to visualize when errors are occurring as well as their frequency. DevOps teams can glance at these charts to quickly understand trends and spot problems.
SignalFx AI-driven ‘sudden change’ alert preview
With SignalFx’s templated approach to AI-driven alerting, it’s easy to create an alert for this log-based metric with just a few clicks. This example shows a preview for a ‘sudden change’ alert, which relies on sophisticated algorithms to pick up sudden spikes in the error count, not just simple thresholds. Based on historical data and the current alert settings, the alert preview shows that 14 alerts would have been triggered over the past week.
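To make the difference between a sudden-change check and a fixed threshold concrete, here is a deliberately simple Python sketch. It is not the algorithm SignalFx uses: it only compares the mean of the most recent points against a baseline built from earlier history. The `sudden_change` function, its `recent` and `threshold` parameters, and the sample error counts are all hypothetical.

```python
from statistics import mean, stdev


def sudden_change(series, recent=3, threshold=3.0):
    """Return True if the mean of the last `recent` points exceeds the
    historical mean by more than `threshold` historical standard deviations.

    A simplified stand-in for a 'sudden change' detector, for illustration.
    """
    if len(series) < recent + 2:
        return False  # not enough history to estimate a baseline
    history, current = series[:-recent], series[-recent:]
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        # Perfectly flat history: any increase counts as a change.
        return mean(current) > baseline
    return mean(current) > baseline + threshold * spread


# Made-up error counts per interval, for illustration only.
spiking = [2, 3, 2, 4, 3, 2, 3, 2, 3, 20, 25, 22]
steady = [2, 3, 2, 4, 3, 2, 3, 2, 3, 3, 2, 4]
spike_detected = sudden_change(spiking)
steady_detected = sudden_change(steady)
```

Because the baseline adapts to each metric's own history, the same settings can be applied to a noisy service and a quiet one, which is the practical advantage over a single static threshold.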
In most cases, organizations have more than one monitoring system in place. SignalFx’s open, flexible, lightweight, and agnostic approach to data collection offers maximum support for these heterogeneous monitoring environments. The AWS FireLens service is a welcome addition to our ecosystem of existing integrations, which includes a broad range of open source and commercial log collectors:
- Splunk. The SignalFx Forwarder runs as a Splunk app and captures metrics from logs that are stored in Splunk. By nature of running on the search end of the data pipeline, SignalFx is able to take advantage of Splunk’s advanced query language (SPL) to search and manipulate data prior to ingesting metrics. Users can schedule jobs to query for any facet (e.g. HTTP status code) and build data tables to aggregate and pre-process high cardinality data. For example, thousands of log lines that represent user login attempts and include multiple dimensions, such as user id and source IP address, can easily be aggregated into a single metric time series.
Example SignalFx web server dashboard powered by metrics from Splunk logs
- Fluent Bit. Fluent Bit is an open source, fast, and lightweight log collector that unifies log processing and forwarding, and is fully compatible with Docker and Kubernetes environments. SignalFx has published an output plugin for Fluent Bit that sends log-based metrics to SignalFx. The plugin enables you to filter your logs for specific terms, such as “error”, “exception”, etc. and have the plugin report metrics whenever any of those terms is present in a log stream.
The SignalFx Fluent Bit Output Plugin is available directly from the Integrations Page
Below is a sample configuration for the SignalFx Fluent Bit plugin that illustrates how to capture metrics from the log output. This example filters on ‘error’ and ‘exception’ to create two metrics, com.example.app.error and com.example.app.exception, respectively, as well as env, cluster, and container_name dimensions.
    Condition Key_value_matches log error
    Add MetricName com.example.app.error

    Condition Key_value_matches log exception
    Add MetricName com.example.app.exception

    Regex MetricName ^.+$

    Add env prod
    Rename ecs_cluster cluster

    Token <ACCESS TOKEN>
    Dimensions env, cluster, container_name
- Logstash. Similar to Fluent Bit, Logstash is an open source, server-side data processing pipeline that ingests, transforms, and sends data to a specified data visualization, storage, and analytics destination. The SignalFx Logstash-TCP monitor operates in a similar fashion to that of the Fluent Bit output plugin. It fetches events from the Logstash TCP output plugin and converts them into SignalFx data points and works in conjunction with the Logstash Metrics filter plugin that converts events into metrics.
- collectd. Collectd is a high performance and portable daemon that collects system and application performance metrics. The SignalFx tail plugin for collectd reads log files and counts occurrences of events that you identify using regular expressions. This is especially useful for measuring the frequency of particular errors, such as the number of failed login attempts.
- Heka. Heka is an open source stream processing software system developed by Mozilla. Clever, a SignalFx customer, developed a filter for Heka which extracts data from fields in messages and generates JSON-formatted data points that are sent to the SignalFx API.
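The term-matching approach shared by several of these collectors (matching log lines against patterns and counting the hits per set of dimensions) can be sketched in Python as follows. This is an illustration of the general technique, not the code of any of the plugins listed above. The record layout, the `PATTERNS` mapping, and the `count_events` function are assumptions; the metric and dimension names are borrowed from the sample Fluent Bit configuration. Unlike that configuration, this sketch matches case-insensitively and counts a record under every pattern it matches.

```python
import re
from collections import Counter

# Hypothetical mapping of metric names to the patterns that trigger them.
PATTERNS = {
    "com.example.app.error": re.compile(r"error", re.IGNORECASE),
    "com.example.app.exception": re.compile(r"exception", re.IGNORECASE),
}
# Dimensions attached to every emitted count.
DIMENSIONS = ("env", "cluster", "container_name")


def count_events(records):
    """Count pattern matches per (metric name, dimension values)."""
    counts = Counter()
    for record in records:
        dims = tuple(record.get(d, "unknown") for d in DIMENSIONS)
        for metric, pattern in PATTERNS.items():
            if pattern.search(record.get("log", "")):
                counts[(metric, dims)] += 1
    return counts


# Made-up log records for illustration only.
base = {"env": "prod", "cluster": "c1", "container_name": "web"}
records = [
    {**base, "log": "ERROR: db timeout"},
    {**base, "log": "Unhandled exception in handler"},
    {**base, "log": "request ok"},
    {**base, "log": "error: retry failed"},
]
event_counts = count_events(records)
```

Records that match no pattern contribute nothing, so only the lines of interest turn into data points, and the dimension tuple keeps counts from different environments, clusters, or containers separate.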
Get Started Faster with Log Metricization
Log metricization reduces the need for DevOps teams to do additional upfront instrumentation work in order to gain visibility into their systems. The wide range of SignalFx output plug-ins automatically send log-based metrics to SignalFx, letting developers and SRE teams capture useful metrics about important events for real-time visualization and analysis. SignalFx makes use of the data that already exists in the logs, so DevOps teams can quickly spot trends and receive alerts on applications and services that aren’t already instrumented, ultimately shortening time to value.
Capturing metrics from logs is yet another way that SignalFx brings together and correlates the three pillars of Observability. By capturing and combining information from multiple sources, SignalFx provides a more complete view of complex application environments (broad visibility across multiple services using metric and trace data, and deep visibility into individual services with the data found in logs), along with sophisticated AI-driven analytics and accurate alerting. With SignalFx log metricization, DevOps teams can get started faster with monitoring without needing to make the full investment in Observability on day one.