In this post, Angad Singh, Software Engineer on the Infrastructure team at Viki, talks to us about using SignalFx to get to meaningful alerting.
Viki is a global TV site where millions of people watch their favorite shows, movies, celebrity news and more. Their content has been subtitled into 200+ languages by a community of avid fans. Viki is part of the Rakuten family.
Tell us about how Viki is built
Angad: Today, Viki comprises about 40 microservices, written in Ruby on Rails, Go, and Node, and built on open source technologies like RabbitMQ, Postgres, Redis, Elasticsearch, and Docker. Unlike many similar companies, we run our service on bare metal, with about 100 servers each running anywhere from 10-20 Docker containers.
Because Viki serves a large, global audience at all hours of the day, peaks in usage in one region, or for one show, are usually balanced out by troughs elsewhere. Looking back over years of experience with our load patterns, we’ve been able to build and tune our current infrastructure to meet demand very precisely. We’ve analyzed metric data and have a clear understanding of our capacity needs and our projected rate of growth as we expand, so we haven’t yet found the need to migrate to a cloud platform with autoscaling-type features.
What are your operational challenges?
Angad: Before SignalFx, we used a different monitoring product that was severely limiting: it only allowed up to one-minute granularity, offered no analytics capabilities, and could only trigger alerts on static thresholds.
I had worked at Twitter before joining Viki. They had a great observability stack that let me use analytics for alerting. We were looking for something to match those capabilities.
We wanted a real-time monitoring platform that would not only give us more granular visibility into how our entire platform was performing, but also let us combine metrics into stronger signals:
- Needed to see data at a higher resolution: performance issues within a video streaming service are noticed very quickly by our large global audience, and we can’t afford delayed detection of those issues, much less missing them altogether.
- Needed the ability to visualize and alert on trends: we wanted to manage performance in a proactive fashion, and were looking for a monitoring and alerting solution that would allow us to compute analytics, like percentiles and rates of change or success in real-time, and alert on those more meaningful signals instead of static thresholds on raw metrics from individual systems.
- Needed open-source instrumentation: to build our own tooling and draw on an active development community around metrics collection, we preferred an open source approach to instrumentation.
Why did you choose SignalFx?
Angad: We considered other solutions, but after seeing how SignalFx could enable us to monitor metrics from our entire stack at high resolution and perform complex analytics on data in real-time, we were convinced.
The most important thing was being able to do math. We wanted to do things like take five metrics, add them, and then divide them to compute rates of success. We couldn’t do this with other tools.
SignalFx would allow us to monitor our data at resolutions down to one second and do arbitrary analytics to compose metrics like percentiles, growth over time, and success rates, at that same speed.
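The kind of metric math Angad describes can be sketched in plain Python, with hypothetical per-service counters standing in for the raw time series (the function name and the numbers below are illustrative, not Viki's actual metrics):

```python
def success_rate(success_counts, total_counts):
    """Combine several raw per-service counters into one success-rate signal,
    as a percentage, instead of alerting on each counter individually."""
    total_success = sum(success_counts)
    total_requests = sum(total_counts)
    if total_requests == 0:
        return 0.0
    return 100.0 * total_success / total_requests

# e.g. five services' success and request counts for one time window
rate = success_rate([980, 450, 300, 120, 75],
                    [1000, 460, 310, 125, 80])  # ≈ 97.47
```

In SignalFx itself this arithmetic is done on live streams rather than static lists, but the principle is the same: the derived ratio is a more meaningful signal to alert on than any single raw metric.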
Also, SignalFx’s integration with open source metrics instrumentation and collection built on collectd, with its large ecosystem of plugins, would allow us to instrument metrics from our entire stack. We wouldn’t have to write our own instrumentation from scratch.
How do you use SignalFx?
Angad: SignalFx has become the primary monitoring service for all engineers at Viki. For the two of us on the infrastructure team, having SignalFx available for the other 30 engineers at Viki has proven invaluable. Everyone is empowered to instrument their own metrics, and customize their own dashboards and alerts specific to the needs of their own services.
Our entire engineering organization works in a DevOps model, with teams focused on building and operating each of the microservices that make up the application. We started by pulling in infrastructure metrics for key services, like load balancing. We use the SignalFx collectd Agent and plugins from the ecosystem to instrument metrics from Docker containers and services like Redis, Postgres, RabbitMQ, and Elasticsearch.
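As a rough illustration of the collectd-based approach, a read plugin such as the Redis one is enabled with a few lines of configuration; the hostname, port, and node name below are placeholders, and the SignalFx agent forwards the collected metrics on to the platform:

```
# Example collectd.conf fragment (values are placeholders)
LoadPlugin redis

<Plugin redis>
  <Node "example-node">
    Host "localhost"
    Port "6379"
    Timeout 2000
  </Node>
</Plugin>
```

Each service in the stack gets a similar stanza from its own plugin, which is what makes it practical for a two-person infrastructure team to cover the whole platform without writing custom collectors.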
The technology we use it most for is Docker: understanding container vs. machine behavior and performance, and alerting on container deployments and misbehavior.
We use SignalFx’s multi-dimensional capabilities to dig into performance and create alerts at the cluster, service, or geo level without having to create many different charts and duplicate alerting conditions for each level. Our engineers have set up alerts based on dynamic thresholds, like percentiles and standard deviations, that tell us the exact service at fault. Notifications are then shipped via PagerDuty and Slack.
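The idea behind a dynamic threshold can be sketched in a few lines of Python: instead of a fixed cutoff, the alert condition is derived from the recent behavior of the metric itself. The window values, function name, and the three-standard-deviations default below are illustrative assumptions, not Viki's actual detector settings:

```python
import statistics

def breaches_dynamic_threshold(window, latest, num_devs=3.0):
    """Return True if the latest value exceeds the recent mean by more
    than `num_devs` standard deviations of the recent window."""
    mean = statistics.mean(window)
    stdev = statistics.pstdev(window)
    return latest > mean + num_devs * stdev

# A stable latency window around 100ms: a 140ms reading stands out,
# while 101ms is within normal variation.
window = [100, 102, 98, 101, 99, 100]
breaches_dynamic_threshold(window, 140)  # True
breaches_dynamic_threshold(window, 101)  # False
```

SignalFx evaluates this kind of condition continuously over streaming data, but the payoff is the same: the threshold adapts to each service's normal behavior, so one detector definition works across clusters and regions.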
SignalFx provides support for everything we use, and that’s important for a small team since we don’t have the resources to be writing our own plugins.
How has SignalFx changed your day to day?
Angad: We have achieved higher visibility into our infrastructure by instrumenting metrics at a higher resolution, with multiple dimensions, and using SignalFlow analytics. We have been able to move beyond focusing on individual system metrics, and now look at trends — we can clearly see what has changed over the last few hours or days. This has been especially helpful in monitoring load balancers, Elasticsearch clusters, and Docker containers.
Before SignalFx, we could only be reactive, finding things as they came up and taking considerable time to really dig in and find root causes. Now we clearly and quickly see where the problem is or will be, and because we can act on trend- and service-based alerts, we frequently head off problems before the end-user experience is affected.
All of our engineers use SignalFx. It’s become our primary monitoring service.
We recently had an issue in which we saw all of our edge services crash except for one cluster. We couldn’t find the root cause at first, but we knew it had something to do with our custom functions in Redis. Using SignalFx to compare the timings of different services leading up to and during the crashes let us isolate what hiccuped first. By tracing the timing of problems in this way across services, we were able to find the exact source of the error.
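The comparison Angad describes boils down to ordering services by when each one first misbehaved: the earliest anomaly points at the likely origin of the cascade. A minimal sketch, with made-up service names and epoch timestamps standing in for the real incident data:

```python
# First timestamp (epoch seconds) at which each service's metrics
# went anomalous during the incident -- hypothetical values.
first_anomaly = {
    "edge-api": 1500000120,
    "sessions": 1500000090,
    "redis-proxy": 1500000060,
}

# The service whose anomaly appeared earliest is the best candidate
# for the root cause of the cascading failure.
origin = min(first_anomaly, key=first_anomaly.get)  # "redis-proxy"
```

High-resolution data matters here: with one-minute granularity, several of these onsets would collapse into the same sample and the ordering would be lost.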
SignalFx has changed the way we monitor our infrastructure — it’s become an invaluable tool for investigating issues in real-time. Next, we plan to instrument application level metrics so we can do some higher level profiling to see how the platform is affecting the business as a whole.