What is ZooKeeper?

Apache ZooKeeper is an open-source coordination service for distributed applications. It exposes a simple set of operations that applications can build on for service discovery, dynamic configuration management, synchronization, and distributed locking. ZooKeeper is used to serialize tasks across clusters so that synchronization doesn’t have to be built separately into each service and project.

Sending ZooKeeper Metrics

Use collectd and the collectd-zookeeper plugin to capture ZooKeeper metrics, including node count, packet count, latency, watch count, data size, and open file descriptors. SignalFx provides built-in ZooKeeper monitoring dashboards that display useful production metrics at the node, host, and cluster levels.
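
For context, a ZooKeeper server reports these statistics through its "mntr" four-letter command on the client port, and collectd plugins for ZooKeeper typically read that output. The Python sketch below shows one way to fetch and parse it by hand. The default host and port and the helper names fetch_mntr and parse_mntr are illustrative assumptions, not part of the collectd-zookeeper plugin, and on ZooKeeper 3.5 and later the command may need to be allowed through the 4lw.commands.whitelist setting.

```python
import socket


def parse_mntr(raw: str) -> dict:
    """Parse 'key<TAB>value' lines, as printed by the mntr command."""
    stats = {}
    for line in raw.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # skip blank or unexpected lines
        key, value = key.strip(), value.strip()
        # Try int, then float; anything else (version, server state) stays text.
        for convert in (int, float):
            try:
                stats[key] = convert(value)
                break
            except ValueError:
                continue
        else:
            stats[key] = value
    return stats


def fetch_mntr(host: str = "localhost", port: int = 2181, timeout: float = 5.0) -> dict:
    """Send 'mntr' to one ZooKeeper server and return the parsed statistics."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"mntr")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return parse_mntr(b"".join(chunks).decode("utf-8", errors="replace"))
```

Calling fetch_mntr() against a running server returns a dictionary with keys such as zk_avg_latency, zk_outstanding_requests, and zk_znode_count, which correspond to the metrics listed on this page.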

ZooKeeper Monitoring

The primary indicators of a healthy ZooKeeper service are disk usage, request metrics, active connections, and total znode count. In most cases, changes in these metrics first appear at the node level. Alerting on these leading indicators produces meaningful notifications as patterns emerge at the service level.

Disk Usage on ZooKeeper Instances: ZooKeeper keeps persistent copies of the znodes on disk as snapshots and transaction log files. A node can become non-operational when its disk fills up with a high volume of snapshots and transaction logs. Because old snapshots and logs are purged only periodically, if at all, a steady increase can exhaust the disk space available on the host.
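
As a rough illustration of a disk check, the Python sketch below measures how full the volume holding the ZooKeeper data directory is and compares it with a threshold. The 80% threshold and the function names are assumptions chosen for the example, and the path must be replaced with the dataDir (and dataLogDir, if separate) of the actual deployment.

```python
import shutil


def disk_used_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(used_fraction: float, threshold: float = 0.8) -> bool:
    """True when usage has reached the (assumed) alert threshold."""
    return used_fraction >= threshold


if __name__ == "__main__":
    # "." is a placeholder; point this at the ZooKeeper data directory.
    fraction = disk_used_fraction(".")
    print(f"disk used: {fraction:.1%}, alert: {disk_alert(fraction)}")
```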

Capacity through Request Metrics: ZooKeeper is intended to be used as a control plane, not as a heavy database with high throughput. A growing number of outstanding requests indicates either a lack of capacity to serve client requests or a misbehaving client service that is overwhelming the ZooKeeper cluster. Similarly, the longer requests take to process, the more likely it is that capacity is limited. Monitor the percentile distribution of request latencies to spot outliers, whether they are caused by a single machine or by an infrastructure issue across the service.
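
To make the percentile idea concrete, the Python sketch below computes nearest-rank percentiles over a batch of latency samples. The function names and the choice of the 50th, 90th, and 99th percentiles are assumptions for illustration; a monitoring system would normally compute these for you.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Median, 90th, and 99th percentile of request latencies."""
    return {p: percentile(latencies_ms, p) for p in (50, 90, 99)}


# One slow request barely moves the median but dominates the 99th percentile.
print(latency_summary([2, 3, 2, 4, 3, 2, 3, 250, 3, 2]))
```

A large gap between the median and the top percentiles suggests a few slow requests or a single slow machine rather than a cluster-wide slowdown.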

Active Client Connections: Alert on the number of active, connected sessions, and measure the growth rate over a specific time period. Too many client connections on a single node can cause bursts of traffic and limit scalability. A sudden decrease indicates network or server issues.
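
One way to express the growth-rate check is sketched below in Python. It compares the connection count at the start and end of a window and flags sharp rises or drops. The 50% and 30% limits and the function names are hypothetical values for illustration only.

```python
def growth_rate(previous: int, current: int) -> float:
    """Relative change between two samples; 0.25 means a 25% increase."""
    if previous == 0:
        return float("inf") if current > 0 else 0.0
    return (current - previous) / previous


def connection_alert(previous, current, max_growth=0.5, max_drop=0.3):
    """Return 'surge', 'drop', or None for a pair of connection counts."""
    rate = growth_rate(previous, current)
    if rate >= max_growth:
        return "surge"  # possible client misbehavior or traffic burst
    if rate <= -max_drop:
        return "drop"   # possible network or server issue
    return None
```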

Cluster Health Across All Nodes: For reliable service, ZooKeeper hosts are deployed in a cluster, and the service remains available as long as a majority of hosts are up. ZooKeeper's design requires exactly one leader host, so a cluster of n hosts should show n-1 followers. The znode count reported by each host should remain consistent across the cluster unless a host has died or a network partition has occurred.
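
The majority and single-leader rules can be sketched as a small check. The Python example below assumes you have already collected the server state ("leader" or "follower") reported by each reachable host, and that every host is a voting member; the function names and the shape of the result are illustrative.

```python
def has_quorum(ensemble_size: int, voters_up: int) -> bool:
    """A strict majority of the ensemble must be up."""
    return voters_up > ensemble_size // 2


def cluster_health(states, ensemble_size):
    """Summarize health from the states reported by reachable hosts."""
    leaders = states.count("leader")
    followers = states.count("follower")
    return {
        "quorum": has_quorum(ensemble_size, leaders + followers),
        "single_leader": leaders == 1,
        "followers_expected": ensemble_size - 1,
        "followers_seen": followers,
    }


# A three-host cluster with one host unreachable still has a majority.
print(cluster_health(["leader", "follower"], ensemble_size=3))
```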

The SignalFx Difference

Meaningful Alerts: Applying duration requirements to alert rules helps determine whether an issue requires attention. ZooKeeper clusters tend to be small with low throughput, so increasing latency is an indication of an emerging issue. SignalFx lets you set duration conditions so that you know when a problem persists longer than the window the cluster needs to self-adjust.
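
The idea behind a duration condition can be sketched in a few lines of Python: the alert fires only when every sample in the most recent window breaches the threshold, so a brief spike is ignored. This is a generic illustration with assumed names and values, not SignalFx's implementation.

```python
def breached_for_duration(samples, threshold, min_consecutive):
    """True if the last `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])
```

For example, with a 20 ms threshold and a window of three samples, the series 5, 6, 40, 7, 8 does not fire, while 5, 30, 35, 42 does.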

Instant Outlier Detection: With SignalFx’s Host Navigator view, anomalous nodes appear red in a heat map when automatic Outlier Detection is activated. Quickly isolate hosts or CPUs that deviate from the mean or median, and start drilling into the cause of an issue without the usual trial and error.
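
As a generic illustration of deviation from the median, the Python sketch below flags hosts whose metric lies more than a chosen number of median absolute deviations from the cluster median. The function name, the cutoff of 3, and the sample values are assumptions for the example; this is not a description of how SignalFx's Outlier Detection is implemented.

```python
from statistics import median


def find_outliers(values_by_host, max_deviations=3.0):
    """Hosts whose value is far from the median, measured in MADs."""
    values = list(values_by_host.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # Most hosts agree exactly; flag any host that differs at all.
        return [h for h, v in values_by_host.items() if v != center]
    return [
        host
        for host, value in values_by_host.items()
        if abs(value - center) / mad > max_deviations
    ]


# Average latency per host, in ms (made-up values).
print(find_outliers({"zk1": 2.0, "zk2": 3.0, "zk3": 2.5, "zk4": 2.2, "zk5": 40.0}))
```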

Jumpstart Success: There are many metrics specific to ZooKeeper, and knowing where to start and what to monitor can be tricky. SignalFx curates the ZooKeeper metrics that matter right out of the box alongside data from the other applications and cloud services in your infrastructure. SignalFx provides built-in dashboards and alert detector templates that give you a jumpstart on monitoring ZooKeeper in your environment.

ZooKeeper Metrics

Packets Sent
Packets Received
Size of the Data Tree
Average Request Latency
Number of Ephemeral Nodes in the Data Tree
Maximum Number of File Descriptors
Maximum Request Latency 
Minimum Request Latency
Number of Active Clients 
Number of File Descriptors 
Outstanding Requests 
Number of Watches
Number of Znodes

Start Your ZooKeeper Monitoring Trial

Try SignalFx for 14 days. No credit card required.