At SignalFx, we keep finding new use cases for the product that help us optimize how we run SignalFx itself. In this post, we’ll look at how we used SignalFx to correlate the business, application, and infrastructure metrics in our service to come up with an effective way to model both capacity and cost for our big launch. 

A ton of organizations and users signed up within days. SignalFx kept processing the data and providing a real-time responsive experience, even under the significant load of new users (some small and some quite large) sending data and using our streaming analytics engine nearly all at once.

The Problem: Capacity Planning for a Complex, Distributed Service

Capacity modeling can be hard—especially for a scalable, distributed, and complex SaaS service like SignalFx. Most distributed services tend to share some or all of the following properties that contribute to making this a hard problem.

Variety of Components: A typical service has many self-made components, as well as utilizes a mix of open source or proprietary third party components. For example, SignalFx consists of more than 10 different self-made microservices and utilizes open source software like ElasticSearch, Cassandra, Kafka, Zookeeper, etc. Trying to model everything is like herding cats.

Complex Interactions: Having many, mostly independent components leads to complex interactions that are generally unpredictable. There are frequently implicit feedback loops, for example: high load triggers errors triggering retries which cause even more load. One SignalFx component, our TSDB, stores many tens of billions of data points each day and interacts with almost every other component. It holds not only data sent by users, but also the results of the SignalFlow™ analytics engine for use in charts, dashboards and detectors that fire off alerts. Understanding the interactions for even that one component is nontrivial.

Variety of Capacity Limiting Factors: Different components have different ways in which they run out of capacity. For example, the servers in our Analytics service are CPU bound, but our TSDB is predictably disk space bound, while ElasticSearch is memory bound.

Need for Both Operational and Predictive Models: The operational model is needed to answer questions your engineering manager may ask—like: are we running with enough capacity to handle outages at the host, cluster, service, or datacenter? The predictive model is needed to answer questions your product manager may ask—like: how much more capacity do we need to deploy to take on a large customer and how much would that customer cost us?

 

The Research: A Data Driven Approach to Modeling Capacity

Building on previous load testing on our infrastructure which helped us baseline the performance of SignalFx’s component microservices and model their interactions, we came up with these strategies.

Strategy #1: Reduce the Big Problem to N Simpler Problems:Loosely Coupled Evenly Load Balanced Systems” is a design principle that keeps giving when it comes to distributed systems, and one that we follow religiously at SignalFx. We can scale each component independently of the others. This also means we can model the capacity of each service independently of the others. The problem then becomes one of coming up with a capacity/cost plan for each component individually.

Strategy #2: Understand the Limiting Resource for Each Component: What will cause a component to run out of capacity? CPU, memory, disk space/IOPS, network? Load testing helps answer these questions, but one can also make reasonably educated predictions by observing some current behavior. SignalFx’s long data retention makes that easy—for example, here is a plot of load (axis 1) superimposed with CPU (axis 2) of one of our services (Quantizer) over the last two months.

Quantizer_load_vs_cpu

 

Observing different system metrics this way identifies the ‘culprit’. In the chart, clearly CPU and load are linearly correlated, which we expect for this service. Note that the purple CPU spikes are caused by new code push and restarts which cause high CPU temporarily. CPU went lower from March 5th because we pushed a performance improvement.

Things get more interesting with garbage collected processes (Java, Go) or log structured stores (Cassandra, Zookeeper) in which the limiting resource (heap memory, disk space used) keeps growing for a period of time until GC or compaction brings it down. The means or peaks of memory/disk utilization are not as interesting as the valleys, which represent the minimum amount of resource necessary. With SignalFx’s analytics, we applied a MIN transformation over time to identify the valleys.

 

JVM_heap

 

Strategy #3: Model Node/Service Capacity Based on Limiting Resource: Load tests provide the best answer, but are time and resource intensive to run. A service is also a living, breathing entity—new features get added, performance improvements get implemented, and so on. This means you need to be running load tests all the time. We’ve found judicious use of analytics to be a reasonably good substitute. Here we came up with a theoretical capacity per node of the Quantizer service. Coincidentally this number matched our previous load test results pretty well, validating the approach. The bump in capacity in early March illustrates the point about performance being an evolving entity.

Quantizer_capacity

 

Viewing capacity as a number instead of a line chart makes it more consumable for everyone. Here is a chart calculating % heap used for the Analytics service based on heap size and floor of heap used over time as shown earlier.

Analytics_heap_percentage 

 

 

 

 

   Analytics_num_jobs

 

Strategy #4: Represent Capacity in Terms of Business Metrics: Normalizing capacity based on business metrics makes this data usable by the business side of the company for forecasting and financial planning. Consider the previous chart – we could infer that the Analytics servers need 11% of JVM heap to run approximately 1000 jobs. However when a new customer signs up, we don’t know how much analytics they will do right then or over time. So how could we predict the amount of capacity to add or how much the customer will cost us? What we will know is a business metric—how many (max) datapoints they will send us per minute, because that is how SignalFx plans are tiered and priced. Representing capacity in terms of datapoint submission rate is more useful. If you’ve been following the instrument first, ask questions later strategy, this becomes an exercise of replacing one metric with another in the calculation. 

 

The Resolution

Combining the ideas together, we came up with a capacity dashboard of SignalFx components. Each component uses up one row and looks like the below. Each service’s ‘#nodes’ is obtained using a count aggregation per service, so it tracks capacity additions and reductions automatically. Here’s an example for a fictitious service “Bla”.

Bla

 

From capacity to cost: If you’re deployed in a public cloud and using mainly compute-oriented services like we are, then cost is based on the type and number of instances you have deployed. Knowing the kinds of instances used per service makes cost modeling an easy next step from here. 

As an example, for compute, you can multiply the per-hour cost by the number of hours in a month to find total cost of a component as well as the per node cost. Here is what a row in our cost dashboard might look like for “Bla”.

Bla_cost

 

Basing the model on observed data instead of a theoretical model means it adapts to user activity, capacity deployments, and code changes (good); but the numbers change continuously (bad, but is there an easy alternative?). The computed capacity limits for individual components are a little misleading, because chances are a service will run into issues far before it hits 100% memory, or 100% CPU, etc. However, we choose to run our infrastructure much below their limits to buffer against such problems and be able to maintain availability even during major failures, like datacenter outages. 

Finally, this gives us a fairly accurate idea about how much each microservice in our stack costs, which feeds into how we prioritize future performance work.

 

Interested in working on these kinds of problems? We’re hiring engineers for every part of SignalFx! [email protected]

This post is part of a series about how we use instrumentation, metrics, and SignalFlow™ streaming analytics to help us develop and monitor SignalFx.

 



About the authors

Arijit Mukherji

Arijit Mukherji is CTO at SignalFx and passionate about monitoring. He was one of the original developers of Facebook’s metrics solution (ODS), and subsequently managed the development of Facebook’s networking tools, data visualization, and other infrastructure monitoring software. While focused on the monitoring space for more than a decade, his diverse career of over 20 years also spans IP telephony, VoIP conferencing, and network virtualization.

Enjoyed this blog post? Sign up for our blog updates