SignalFx is integrated with Apache Mesos, a cluster manager that allows you to distribute workloads across physical servers. Mesos abstracts computing resources like CPU and memory away from physical servers, instead providing dynamic resource allocation for each application running on the cluster. Mesos helps operators increase the resource utilization and efficiency of their servers.

 

[Image: Mesos on the Integrations page]

 

With collectd and the collectd-mesos plugin (originally written by SignalFx customer Grovo!), you can use SignalFx’s built-in dashboards for Mesos to monitor the health of your Mesos deployment. The plugin collects data about the overall cluster, each Mesos master and slave, resource utilization, and tasks. You can get started with our Mesos integration from the Integrations page in SignalFx or by downloading the plugin.

With collectd and the collectd-mesos plugin in place, the built-in dashboards display useful metrics about your Mesos cluster, including:

# Slaves/Cluster
Total # CPUs/Cluster
Total Memory/Cluster
Tasks Finished
Tasks Staging
Problematic Master Tasks
Slaves by Host CPU %
Slaves by Host Disk %
Top Hosts by Slaves CPU %
Top Hosts by Slaves Memory %
Top Slaves by # Tasks Failed
Top Slaves by Tasks Lost
Top Slaves by Uptime (sec)
Tasks Running
Slaves Connected
Resources %
Messages Dropped
Top Clusters by # Tasks Running
Connected vs. Active Frameworks
Top Hosts by Slaves Disk %
Tasks Running 1w Growth %
Slaves by Host Memory %
 

 

For complete documentation of the metrics available from Mesos, click here.
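
If you want to see the raw numbers behind these dashboards, the sketch below shows one way to read them straight from a Mesos master. It is not the collectd-mesos plugin's code. It assumes the master serves its metrics as a flat JSON map at /metrics/snapshot on port 5050 (the usual Mesos defaults); the address and the helper names fetch_snapshot and group_by_prefix are made up for illustration.

```python
import json
from urllib.request import urlopen

# Assumption: a Mesos master is reachable here; 5050 is the usual master port.
DEFAULT_MASTER_URL = "http://localhost:5050"


def fetch_snapshot(master_url=DEFAULT_MASTER_URL, timeout=5):
    """Fetch the flat JSON metrics map from a Mesos master (hypothetical helper)."""
    with urlopen(f"{master_url}/metrics/snapshot", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def group_by_prefix(snapshot):
    """Group flat names such as 'master/tasks_running' by the part before the slash."""
    grouped = {}
    for name, value in snapshot.items():
        prefix, _, short_name = name.partition("/")
        grouped.setdefault(prefix, {})[short_name] = value
    return grouped
```

Calling group_by_prefix(fetch_snapshot()) against a live master would give you one dictionary per metric family, which is a convenient starting point for spot checks.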

Using the SignalFx built-in dashboards for Mesos clusters as well as individual master and slave nodes, you can monitor the following important metrics:

 

[Image: Mesos cluster dashboard, task monitoring]

Task status. It’s important to keep track of the status of tasks in the cluster. An increase in failed tasks for a master or slave can indicate a problem with a framework.
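
To make "an increase in failed tasks" concrete, here is a minimal sketch that compares two metric snapshots taken some time apart. It assumes each snapshot is a plain dictionary and that failed tasks are reported under the key master/tasks_failed as a counter that starts over when the master restarts; both the key and the function name are assumptions for illustration, not part of the SignalFx dashboards.

```python
def failed_task_increase(previous, current, key="master/tasks_failed"):
    """Return how many more failed tasks the newer snapshot reports.

    Assumption: the value is a counter that resets to zero on a restart,
    so a drop between snapshots is treated as a reset rather than as a
    negative increase.
    """
    before = previous.get(key, 0)
    now = current.get(key, 0)
    if now < before:
        # Counter went backwards: assume a restart and count from zero.
        return now
    return now - before
```

A steadily growing result across successive snapshots is the kind of trend the task status charts are meant to surface.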

 

[Image: Mesos cluster dashboard, hosts and slaves]

Host performance. SignalFx helps you compare the performance of individual Mesos hosts in the cluster. If failed tasks increase across many masters and slaves on a single host, that host may have a hardware problem.

[Image: Mesos cluster dashboard, one-week task growth]

Week-over-week change. SignalFlow Analytics makes it easy to monitor the week-over-week growth of tasks in your cluster, so you can keep track of changing workloads.
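
The arithmetic behind a week-over-week growth percentage is simple, and the sketch below spells it out. This is plain Python rather than SignalFlow syntax, and the function name is made up for illustration; it only assumes you have the running task count now and the count from one week earlier.

```python
def week_over_week_growth(current, week_ago):
    """Return percentage growth relative to the value one week earlier.

    Returns None when the earlier value is zero, because the percentage
    is undefined in that case.
    """
    if week_ago == 0:
        return None
    return 100.0 * (current - week_ago) / week_ago
```

For example, going from 100 running tasks last week to 150 this week is 50 percent growth, while dropping to 80 is a 20 percent decline.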

 

[Image: Mesos master dashboard, connected slaves]

Cluster connections. An unexpectedly low number of connected slaves on a Mesos master can indicate a network problem preventing them from connecting. To verify this, check whether the number of dropped messages is also unexpectedly high.
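
That two-step check can be written down as a small rule. The sketch below is only an illustration of the reasoning: it assumes a snapshot dictionary with the keys master/slaves_connected and master/dropped_messages, and the expected slave count and dropped-message threshold are values you would choose for your own cluster.

```python
def diagnose_connectivity(snapshot, expected_slaves, max_dropped=0):
    """Classify a master snapshot using connected slaves and dropped messages.

    Assumptions: the metric keys below exist in the snapshot, and
    expected_slaves / max_dropped are thresholds picked by the operator.
    """
    connected = snapshot.get("master/slaves_connected", 0)
    dropped = snapshot.get("master/dropped_messages", 0)
    if connected >= expected_slaves:
        return "ok"
    if dropped > max_dropped:
        # Fewer slaves than expected and messages are being dropped.
        return "possible network problem"
    return "slaves missing, cause unclear"
```

A result of "slaves missing, cause unclear" suggests looking beyond the network, for example at whether the slave processes are running at all.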

[Image: Mesos master dashboard, connected frameworks]

Task detail. On the Mesos master dashboard, you can view in detail the number of tasks that are finished, failed, lost or errored out. Monitoring connected and active frameworks can help you determine the health of your Mesos scheduler.
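
As a rough illustration of the same two views, the sketch below pulls the task outcome counts out of a master snapshot and computes how many connected frameworks are not active. The metric keys (master/tasks_finished, master/tasks_failed, master/tasks_lost, master/tasks_error, master/frameworks_connected, master/frameworks_active) and the function names are assumptions for this example.

```python
# Assumed outcome suffixes, matching keys such as 'master/tasks_failed'.
TASK_OUTCOMES = ("finished", "failed", "lost", "error")


def task_outcomes(snapshot):
    """Return the count for each task outcome, defaulting to zero if absent."""
    return {outcome: snapshot.get(f"master/tasks_{outcome}", 0) for outcome in TASK_OUTCOMES}


def inactive_frameworks(snapshot):
    """Return how many frameworks are connected but not active."""
    connected = snapshot.get("master/frameworks_connected", 0)
    active = snapshot.get("master/frameworks_active", 0)
    return connected - active
```

A persistent gap between connected and active frameworks is worth a closer look at the affected scheduler.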

 



About the authors

Rebecca Tortell

Rebecca is a product manager with many years of experience helping startups make products that users love. Previously she worked at companies like Turn, Playdom, and Disney Interactive.
