Effective strategies for monitoring containerized environments

Containerized environments are complex, which makes observability both more important and more difficult. Choosing the right tools and strategies calls for a top-down view of the challenges involved. In this article, Arijit Mukherji offers some practical advice.

Containerization and microservices have dramatically accelerated software innovation.

However, these environments are far more complex than monolithic deployments, which makes observability both more important and more difficult. The Kubernetes ecosystem has built-in support for logs, metrics, and traces. Nonetheless, observability challenges remain.

Choosing the right tools and strategies to address those challenges requires a top-down approach. This article gives an overview of three common challenges and discusses effective strategies for each.

Scale

The scale of monitoring data is exploding. Breaking up monoliths creates many more microservices to monitor, and Kubernetes bin packing means far more telemetry is emitted per host. Companies are asking themselves, “Where can we store all this data, and how do we keep our systems from bogging down under its weight?”

A popular scaling approach is “divide and conquer,” for example, running a separate monitoring infrastructure per Kubernetes cluster. While that initially solves the scale problem, it can result in uneven performance (so-called hotspots) and wasted capacity. It also introduces a problem that most fail to anticipate: queries cannot be run across fragmented data. In other words, if you need information that is not stored entirely within a single monitoring infrastructure, you’re out of luck.

Using an aggregator of aggregators (e.g., Thanos with Prometheus) can solve the fragmentation problem, and adding more capacity (i.e., over-provisioning) can alleviate the hotspot problem. But a much better strategy is to put all the data in a single monitoring cluster made up of multiple load-balanced nodes. This not only eliminates fragmentation but also prevents hotspots by spreading incoming data evenly across nodes. During bursts, the load on the cluster as a whole rises gradually, akin to pouring water into a lake, which gives operators time to react.
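
The idea of spreading incoming data evenly across nodes can be illustrated with a minimal Python sketch. The four node names and the series keys are made up for illustration, and the plain stable hash used here is only one possible placement scheme; a real system would more likely rely on consistent hashing or a dedicated load balancer so that adding nodes does not reshuffle existing data.

```python
import hashlib
from collections import Counter


def pick_node(series_key, nodes):
    """Map a series key to a node with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap around the node list.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Hypothetical monitoring cluster of four load-balanced nodes.
nodes = ["node-a", "node-b", "node-c", "node-d"]

# 10,000 made-up series, all emitted by one very busy Kubernetes cluster.
series = [f"cluster=prod-1,container=c{i},metric=cpu" for i in range(10_000)]

# Count how many series land on each node.
load = Counter(pick_node(key, nodes) for key in series)
print(dict(load))
```

Even though every series comes from the same Kubernetes cluster, each node should end up with roughly a quarter of them, so a burst from one source raises the load on all nodes a little rather than overwhelming one.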

Component churn

Rapid innovation means pushing code more and more often: instead of once a year, code changes once a day or even several times a day. Because every container has a unique ID, updating a containerized component looks to the monitoring system like a complete restart of the service, with each replacement container appearing as a brand-new source. The result? High-velocity “component churn” that causes huge bursts of “new” data and degrades performance over time.

Many systems lack the capacity to handle these bursts and will either drop data or slow to a crawl while they digest and index all the new metadata. Faced with this, the most common reaction is (say it with me): “Add more capacity.”

But storage is not the only thing stretched in this scenario. Because ‘old’ data cannot be deleted if any historical view is needed, the monitoring system buckles, or breaks, under the gradual but relentless accumulation of data over time. Query performance suffers too: for a service redeployed daily, a one-year chart requires stitching together 365 separate one-day segments from different containers, which is inefficient and slow.

A good strategy for the storage challenge is to separate the database for data points (timestamp, value) from the one for metadata (key=value pairs). This split can bring dramatic improvement: the data point store need only scale with the total number of data points received, while the metadata store need only scale with the total amount of metadata created over time.
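
As a rough illustration of this split, the sketch below keeps metadata and data points in two separate in-memory dictionaries linked by a series ID. The class name `SplitStore`, its methods, and the label names are hypothetical and stand in for two real databases; the point is only that labels are stored once per series while data points carry nothing but a timestamp and a value.

```python
class SplitStore:
    """Toy store that separates metadata from data points (illustrative only)."""

    def __init__(self):
        self.metadata = {}  # series_id -> {label: value}, written once per series
        self.points = {}    # series_id -> [(timestamp, value), ...]
        self._ids = {}      # frozenset of label pairs -> series_id

    def write(self, labels, ts, value):
        key = frozenset(labels.items())
        sid = self._ids.get(key)
        if sid is None:
            # First time this label set is seen: register its metadata once.
            sid = len(self._ids)
            self._ids[key] = sid
            self.metadata[sid] = dict(labels)
            self.points[sid] = []
        # Every write adds only a compact (timestamp, value) pair.
        self.points[sid].append((ts, value))
        return sid

    def query(self, **match):
        """Return all points from series whose labels match, ordered by time."""
        ids = [
            sid for sid, labels in self.metadata.items()
            if all(labels.get(k) == v for k, v in match.items())
        ]
        return sorted(p for sid in ids for p in self.points[sid])


store = SplitStore()
store.write({"service": "checkout", "container_id": "c1"}, 0, 0.2)
store.write({"service": "checkout", "container_id": "c1"}, 60, 0.3)
# After a redeploy, a new container ID creates one new metadata entry.
store.write({"service": "checkout", "container_id": "c2"}, 120, 0.4)
history = store.query(service="checkout")
```

Here the metadata side grows only when a new container appears, and the data point side grows only with the number of samples, so each can be sized independently.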

To reduce the performance degradation on queries, you can use pre-aggregation (e.g., Prometheus recording rules), in which common queries (e.g., the average CPU of the containers backing a given microservice) are pre-computed and stored as first-class data streams. This eliminates the need to “scatter-gather” many segments and provides an efficient way to query aggregate behavior.
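
To make pre-aggregation concrete, here is a small Python sketch of the idea (not of Prometheus recording rules themselves). The function name `preaggregate`, the sample tuple layout, and the sample values are assumptions made for illustration: per-container CPU samples are averaged per service and timestamp, and the result is kept as a single series that no longer depends on container IDs.

```python
from collections import defaultdict
from statistics import mean


def preaggregate(samples):
    """Average CPU per (service, timestamp), dropping the container dimension.

    samples: iterable of (service, container_id, timestamp, cpu) tuples.
    Returns {service: [(timestamp, average_cpu), ...]} ordered by timestamp.
    """
    buckets = defaultdict(list)
    for service, _container_id, ts, cpu in samples:
        buckets[(service, ts)].append(cpu)

    rollup = defaultdict(list)
    for (service, ts), values in sorted(buckets.items()):
        rollup[service].append((ts, mean(values)))
    return dict(rollup)


samples = [
    ("checkout", "c1", 0, 0.25),
    ("checkout", "c2", 0, 0.75),
    # After a redeploy, c1 and c2 are gone and c3 reports instead.
    ("checkout", "c3", 60, 0.5),
]
rollup = preaggregate(samples)
```

A long-range chart for the service can then read the one pre-computed series instead of stitching together the segments of every container that ever backed it.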

However, pre-aggregation suffers from a “delay vs. accuracy” tradeoff: pre-aggregates computed quickly are inaccurate because they don’t wait for all the relevant data to arrive, while those computed after a delay are accurate but hurt SLAs because of high alert latency. Getting both accuracy and timeliness requires a pre-aggregation layer that is aware of the timing behavior of each data stream individually and waits ‘just the right time’, thereby producing high-confidence and timely values.
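
One way to approximate waiting ‘just the right time’ is to track how late each stream's data tends to arrive and hold an aggregation window open only as long as the slowest contributing stream needs. The sketch below is a deliberately simplified assumption of how such a layer might work: the class `LagAwareWindow`, its methods, and the 5-second default are hypothetical, and it uses the largest lag seen so far where a production system would likely use a more robust statistic.

```python
class LagAwareWindow:
    """Decide when a window can be aggregated, per stream timing (toy model)."""

    def __init__(self, default_wait=5.0):
        self.max_lag = {}            # stream -> largest observed arrival lag (seconds)
        self.default_wait = default_wait  # wait used for streams never seen before

    def observe(self, stream, event_ts, arrival_ts):
        """Record how long after its own timestamp a data point arrived."""
        lag = arrival_ts - event_ts
        self.max_lag[stream] = max(self.max_lag.get(stream, 0.0), lag)

    def ready_at(self, window_end, streams):
        """Earliest time at which all listed streams are expected to have reported."""
        wait = max(
            (self.max_lag.get(s, self.default_wait) for s in streams),
            default=self.default_wait,
        )
        return window_end + wait


tracker = LagAwareWindow()
tracker.observe("fast-stream", event_ts=100, arrival_ts=101)  # 1 s late
tracker.observe("slow-stream", event_ts=100, arrival_ts=104)  # 4 s late

fast_only = tracker.ready_at(160, ["fast-stream"])
both = tracker.ready_at(160, ["fast-stream", "slow-stream"])
```

An aggregate that depends only on the fast stream can be emitted sooner than one that also needs the slow stream, instead of applying a single fixed delay to everything.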

Some concluding thoughts

Modern observability datasets come in many types, including infrastructure and VMs, containers, third-party and OSS components, application and business metrics, orchestrators like Kubernetes, transaction flows, and distributed traces. Being forced to monitor these datasets in isolation leads to alert noise and fatigue, and the inability to drill down from one dataset into another makes it difficult, if not impossible, to analyze root causes that span several of them or to monitor high-level KPIs and SLIs effectively.

Dealing with correlation involves data modeling and building integrations: standardizing on common metadata across all layers to power correlation (e.g., including instance_id and container_id with application metrics). This should extend across data types so that logs, metrics, and traces can all be correlated with one another. Finally, point-and-click integrations between datasets and data types reduce usability friction and allow operators to switch seamlessly between tools while maintaining context, a critical ability when debugging incidents.
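
The value of shared metadata can be shown with a short Python sketch. The `correlate` helper and all the records are invented for illustration; the only thing taken from the text is the idea that every data type carries the same key, here container_id, so that an operator can pivot from one dataset to the others.

```python
def correlate(key, value, **datasets):
    """Return, for each named dataset, the records whose metadata has key == value."""
    return {
        name: [record for record in records if record.get(key) == value]
        for name, records in datasets.items()
    }


# Made-up records; each data type carries the same container_id field.
metrics = [
    {"container_id": "c7", "instance_id": "i-1", "name": "cpu", "value": 0.93},
    {"container_id": "c8", "instance_id": "i-2", "name": "cpu", "value": 0.12},
]
logs = [
    {"container_id": "c7", "message": "request timed out"},
    {"container_id": "c8", "message": "request ok"},
]
traces = [
    {"container_id": "c7", "trace_id": "t-42", "duration_ms": 5200},
]

# Starting from a suspicious container, pull the matching records of every type.
context = correlate("container_id", "c7", metrics=metrics, logs=logs, traces=traces)
```

Without the common field, each of these lookups would need its own ad hoc mapping between tools, which is exactly the friction that slows down incident debugging.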