In today’s connected world, there is an expectation that your application is always on. Product organizations are increasingly shifting to a microservices architecture to support faster time to market. However, load balancing challenges often result as your resource population changes and as your infrastructure grows to meet demand. If load is not shared evenly across nodes, application performance suffers and end-user experience can be impacted.

Challenge #1: Understanding Your Load

First and foremost, you need to determine how load is balanced across all the nodes for each service in your environment. Traditional monitoring tools have focused on per-node availability and errors, namely checking whether the nodes in the service are up. However, this type of status check captures neither the state of the service as a whole nor how the load on individual nodes compares with the rest of the cluster, and an imbalance there could be degrading performance for users.

One way to determine how effectively a cluster is load-balanced is to look at the ratio of the standard deviation of per-node load to the mean per-node load, called the coefficient of variation. This statistic gives you the effectiveness ratio for load across nodes. A low ratio indicates that there are only minimal differences in load among nodes and implies that things are well-balanced. A high ratio indicates the opposite: there are significant differences in load among nodes, suggesting cause for concern.

[Figure: Load balancing microservices example]

Calculating this load balancing effectiveness ratio gives you a clear understanding of the load behaviors for your specific application. By monitoring this ratio over time, you get visibility into how product or demand changes impact load balancing, as well as the ability to catch performance problems before they cause issues for users.
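
The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular monitoring product: the function name load_balance_ratio and the sample per-node load figures are assumptions made for the example, and it uses the population standard deviation because the cluster's nodes are the whole population being measured.

```python
from statistics import mean, pstdev


def load_balance_ratio(node_loads):
    """Return the coefficient of variation (std dev / mean) of per-node load.

    `node_loads` is a hypothetical list of one load reading per node,
    for example requests per second or CPU percent.
    """
    if not node_loads:
        raise ValueError("need at least one node load reading")
    avg = mean(node_loads)
    if avg == 0:
        # An idle cluster has no load to balance; treat it as balanced.
        return 0.0
    return pstdev(node_loads) / avg


# Illustrative sample data only.
balanced = [48, 50, 52, 50]   # every node carries about the same load
skewed = [10, 20, 30, 140]    # one node carries most of the load

print(f"balanced: {load_balance_ratio(balanced):.3f}")
print(f"skewed:   {load_balance_ratio(skewed):.3f}")
```

Both sample clusters have the same mean load, so a check on the average alone would not tell them apart; the ratio is what separates them.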

Challenge #2: Dealing with Dynamic Changes

By definition, the ephemeral, hosted infrastructure underlying a cloud environment grows and shrinks based on the demands of the application or of the customer. Change is a given, whether it’s expected based on the cyclicality of your business or unplanned due to an unexpected outage of a component in the environment or a sudden spike in demand.

“Like many other companies, Clever experiences seasonal demand, and so we’re subject to huge variances in load. Our load changes based on the school calendar, like when districts open for the year, or when schools start for the day, rolling across regions.”

Mohit Gupta, Product and Engineering Lead of Infrastructure, Clever

Measuring the load balancing effectiveness ratio over time gives a starting baseline for managing a steadily performant environment. To deal with change in a dynamic world, one strategy is to compare the current ratio with the average of ratios over a specified time period. Using a moving average function helps to smooth out any transient variations that the environment may experience without causing alert noise.
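
As a rough sketch of that comparison, the class below keeps a fixed-size window of recent ratios and reports how far each new reading sits from the average of the readings before it. The class name MovingAverageBaseline, the window size, and the sample values are all assumptions chosen for illustration; a production system would typically compute this inside its metrics pipeline.

```python
from collections import deque


class MovingAverageBaseline:
    """Compare each new ratio with the moving average of recent ratios."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        # deque(maxlen=...) silently drops the oldest value once full.
        self._values = deque(maxlen=window)

    def update(self, ratio):
        """Return `ratio` minus the average of the prior window, then record it.

        The first reading has no history, so its deviation is 0.0.
        """
        if self._values:
            baseline = sum(self._values) / len(self._values)
        else:
            baseline = ratio
        self._values.append(ratio)
        return ratio - baseline


# Illustrative readings: steady balance, then a sudden imbalance.
baseline = MovingAverageBaseline(window=3)
for reading in [0.1, 0.1, 0.1, 0.4]:
    deviation = baseline.update(reading)
    print(f"ratio={reading:.2f} deviation={deviation:+.2f}")
```

A longer window smooths more transient variation but reacts more slowly to a real shift, so the window size is a trade-off to tune per service.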

Challenge #3: Knowing When There’s an Issue

Every minute it takes to resolve an issue is more than just a minute of downtime—it’s another step towards an SLA breach, a rift in customer trust, and, ultimately, lost revenue.

“There was a huge spread of average to max load across the larger cluster for an extended period of time that showed it was not being used efficiently.”

Florian Berckemeyer, DevOps Manager, Sunrun

Alerting based on outliers or dynamic thresholds helps remove the burden of passive monitoring and provides a notification of an issue that requires immediate attention. Because a higher load balancing effectiveness ratio means a less balanced cluster, constructing an alert for when the ratio rises above a dynamic threshold based on time shift or percentile means that teams can begin triaging and remediating the issue as soon as the imbalance appears.
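
One way a percentile-based dynamic threshold might look is sketched below. Since a high ratio signals poor balance, this example fires when the current ratio exceeds a chosen percentile of recent history. The function names, the nearest-rank percentile method, the 95th-percentile default, and the sample history are assumptions for illustration only.

```python
import math


def percentile(values, pct):
    """Return the nearest-rank percentile of `values` (0 < pct <= 100)."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest-rank method: smallest value with at least pct% at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def should_alert(current_ratio, history, pct=95):
    """Alert when the current ratio is above the pct-th percentile of history."""
    return current_ratio > percentile(history, pct)


# Illustrative history of 20 past ratio readings: 0.01, 0.02, ..., 0.20.
history = [i / 100 for i in range(1, 21)]

print(should_alert(0.25, history))  # well above recent behavior
print(should_alert(0.15, history))  # within the normal range
```

Because the threshold is derived from the service's own history, it moves as normal behavior changes instead of needing to be reset by hand.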

Companies in the retail industry, for example, want to ensure a great customer experience and, therefore, need to be notified of any issues that may affect the end user. This is especially critical during the holiday shopping season when there is increased demand on the infrastructure. Retail companies must understand their load, deal with dynamic changes, and know when there is an issue.

While a moving average of the load balancing effectiveness ratio helps to smooth transient anomalies, alerts that compare the current ratio with its value a year ago and with its value a day ago work together to give notification of real issues. For example, the year-ago alert could tell you how you are performing relative to the same point in last year's holiday shopping season. The day-ago alert could let you know that a pattern is emerging and that you have to increase capacity to accommodate a new set of behaviors, such as increased demand from a one-time promotion during the holiday season.
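
A time-shift comparison of this kind could be sketched as follows. The function name timeshift_breaches, the max_growth factor of 1.5, the use of 365 days for "a year ago", and the timestamps and ratio values are all hypothetical; the example only shows the shape of the check, in which the current ratio is compared against the ratio recorded one shift earlier.

```python
from datetime import datetime, timedelta


def timeshift_breaches(ratios, now, shifts, max_growth=1.5):
    """Return the names of shifts where the ratio grew by more than max_growth.

    `ratios` maps a timestamp to the ratio recorded at that time.
    `shifts` maps a label (such as "day") to a timedelta to look back.
    Shifts with no recorded past value are skipped, not treated as breaches.
    """
    current = ratios[now]
    breaches = []
    for name, delta in shifts.items():
        past = ratios.get(now - delta)
        if past is not None and current > max_growth * past:
            breaches.append(name)
    return breaches


# Illustrative sample data only.
now = datetime(2016, 11, 25, 12, 0)
shifts = {"day": timedelta(days=1), "year": timedelta(days=365)}
ratios = {
    now: 0.30,
    now - shifts["day"]: 0.10,    # much better balanced yesterday
    now - shifts["year"]: 0.28,   # about the same as this time last year
}

print(timeshift_breaches(ratios, now, shifts))
```

In this sample the day-ago comparison trips while the year-ago comparison does not, which matches the case of a new short-term pattern emerging within an otherwise normal season.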

Challenge #4: Preventing an Emerging Issue

Eradicating a potential outage before it affects customers saves time, resources, and your reputation. For the first time, analytics-based alerting can help you effectively plan for business needs and take action before customers are impacted. This approach to cloud monitoring means that you can get an aggregate view of all your metrics on the entire production environment in one place. By identifying and tracking service-wide patterns, you can get notified of any significant trend before it materializes into a widespread issue, and take action proactively when it matters most.

“We are becoming more data oriented and proactive in understanding the entire system of the business, from the app down through the infrastructure.”

Stan Chan, Head of Core Infrastructure, Symphony Commerce

Strategies to address each of these load balancing challenges in a microservices world are rooted in real-time, streaming analytics against the live performance data coming from the entire stack of the application, down to the modern infrastructure underneath it. This kind of analytics allows you to derive meaningful insights across application performance, service availability, infrastructure capacity, and end-user experience. A more modern, proactive approach to infrastructure monitoring uses intelligent alerts built on analytics as the source of truth, allowing your entire team to focus on availability and performance with more timely, accessible, relevant, and actionable insights.



About the authors

Jessica Feng

Jessica works in product marketing at SignalFx. Previously she was at VMware in product marketing and has experience working at start-ups, venture capital and technology investment banking.
