Looking forward to Kafka Summit next week? We are!

SignalFx consumes a massive amount of real-time streaming data from our users and Kafka is the foundation of our data pipeline. As SignalFx engineer Rajiv Kurian wrote in his guest post on the Confluent blog:

Kafka’s unique ability to combine high throughput with persistence made it ideal as the pipeline underlying all of SignalFx. Throughput is critical to the kind of data SignalFx handles: high-volume, high-resolution streaming time series. And persistence lets us smoothly upgrade components, do performance testing (on replayed data), respond to outages, and fix bugs without losing data.

Since we use SignalFx to monitor every part of SignalFx, we’ve developed some expertise in what to monitor and alert on, how to scale, and how to troubleshoot Kafka. We prioritize specific metrics, like log flush latency, as proxies for the health of the whole service.
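As a toy illustration of that proxy-metric idea (not our production detector logic), here’s a minimal Python sketch that flags when the 95th percentile of per-broker log flush latency drifts past a static bound. The sample values and the 500 ms threshold are made up for the example:

```python
# A toy threshold check, not SignalFx's detector engine: alert when the
# p95 of log flush latency across brokers drifts above a static bound.
# The sample latencies and the 500 ms bound are illustrative assumptions.
from statistics import quantiles

def check_log_flush_latency(latencies_ms, bound_ms=500.0):
    """latencies_ms: recent log flush latency samples in ms, e.g. one per broker."""
    p95 = quantiles(latencies_ms, n=20)[-1]   # 95th percentile
    if p95 > bound_ms:
        return f'ALERT: log flush p95 {p95:.0f} ms exceeds {bound_ms:.0f} ms'
    return 'ok'

print(check_log_flush_latency([12.0, 18.5, 900.0, 22.1, 15.7]))
```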

SignalFx - Kafka Dashboard

What’s true for us is also true for our users. We spent a little time with Enrico Canzonieri, Software Engineer at Yelp, to get his take on how they use Kafka and SignalFx together:

We handle billions of messages a day and have clusters that reach across regions. When we send metrics, we define the dimensions of our metric data in a hierarchical way so that we can easily aggregate across those dimensions and quickly see what is going on at any level. This enables us to look at, for instance, the input rate at the broker, topic, cluster, or datacenter level.

For capacity planning and provisioning, we track the most-used topics and partitions, as well as how load is spread across brokers. Being able to visualize and alert on trends in these metrics has enabled us to get very granular about resource planning and optimize how much infrastructure is allocated to Kafka.

In some cases we also apply arithmetic operators across time series to monitor things for which there isn’t a metric. For instance, we compute the mean and standard deviation of messages per second and bytes per second across all the brokers of a cluster, using the result to compute the coefficient of variation. This tells us how balanced a cluster is. Doing this kind of analytics fast enough to actually use for operational monitoring and alerting would be impossible without SignalFx.
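To make Enrico’s point about hierarchical dimensions concrete, here’s a minimal sketch of reporting a single Kafka datapoint, with its full set of dimensions, to SignalFx’s public v2 datapoint ingest endpoint. The metric name, dimension values, and token below are illustrative placeholders, not Yelp’s actual scheme:

```python
# Illustrative only: every datapoint carries the whole hierarchy
# (datacenter, cluster, broker, topic), so charts can later aggregate
# at any of those levels. Names and values are placeholders.
import requests

INGEST_URL = 'https://ingest.signalfx.com/v2/datapoint'
TOKEN = 'YOUR_ORG_ACCESS_TOKEN'  # placeholder access token

payload = {
    'gauge': [{
        'metric': 'kafka.messages_in_per_sec',
        'value': 4200,
        'dimensions': {
            'datacenter': 'us-west-1',
            'cluster': 'kafka-main',
            'broker': 'broker-7',
            'topic': 'ad-events',
        },
    }]
}

resp = requests.post(
    INGEST_URL,
    json=payload,
    headers={'X-SF-Token': TOKEN},
    timeout=5,
)
resp.raise_for_status()
```

Because the dimensions ride along with every datapoint, a chart can sum or average the same metric by topic, by broker, or by datacenter without changing anything on the sending side.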
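And the coefficient-of-variation calculation he describes is, at its core, simple arithmetic: standard deviation divided by mean. Here’s the same math shown offline in Python on a made-up snapshot of per-broker rates (SignalFx runs it continuously against the live streams); the 0.25 alert threshold is a hypothetical example:

```python
# Illustrative only: the balance check Enrico describes, computed here
# on a static snapshot of per-broker message rates (made-up numbers).
from statistics import mean, pstdev

broker_rates = {
    'broker-1': 51_200,
    'broker-2': 49_800,
    'broker-3': 50_500,
    'broker-4': 18_300,   # an under-loaded broker skews the spread
}

values = list(broker_rates.values())
cv = pstdev(values) / mean(values)   # coefficient of variation

# A CV near 0 means load is evenly spread across brokers; a large CV
# flags imbalance worth alerting on. The 0.25 bound is hypothetical.
if cv > 0.25:
    print(f'cluster imbalance: coefficient of variation = {cv:.2f}')
```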

We’ve combined what we’ve learned from our own experience, from users like Enrico, and from our partnership with Confluent to make monitoring easier for customers new to Kafka, with pre-built charts, dashboards, and alerts. Come by our booth at Kafka Summit next week to see it in person and talk with our engineers.

If you miss us there, check out our on-demand webinar, where Rajiv goes over how it all works and what to monitor and alert on, followed by questions from the audience.

Kafka Webinar
Watch our webinar on Running Kafka at 70 Billion Messages per Day » 


About the author

Aneel Lakhani

Aneel is a marketer. Previously he worked on marketing at other startups, served as a Research Director at Gartner, and did stints at big companies like Cisco and IBM doing everything from sales engineering to product management and large scale systems architecture.
