
In this post, Weston Jossey, Head of Operations at Tapjoy, talks to us about how he’s using SignalFx to move beyond Nagios checks and empower developers with analytics.

About Tapjoy

Tapjoy is an app LTV platform that empowers developers with the tools and automation they need to acquire great users and monetize them through its industry-leading advertising platform.

Tell us about Tapjoy and your team.

Weston: Tapjoy is a worldwide company focused on LTV optimization & app monetization. We have over 200 employees, over 70 engineers, and around 8 full-time operations engineers. My team is focused on all things production ops, which includes infrastructure, configuration management, deployment, and monitoring.

Tell us a little bit about the nuts and bolts of your application.

Weston: We’re one of the largest Ruby shops in the world, serving billions of requests per day and more than a trillion per year. We’re heavy users of AWS, running EC2, SQS, and RDS. We also run OpenStack private clouds in multiple colos (run by Metacloud, now part of Cisco) around the world. Our total infrastructure size is about 1500 VMs. Tapjoy is written primarily in Ruby, Scala, and Go.

What kind of challenges do you face with monitoring?

Weston: The main challenge we have is the complexity around using check-based monitoring. Writing good checks is not intuitive. We couldn’t have a new hire understand it all in the first couple of weeks. Monitoring is only successful if everyone is involved. Writing good-quality checks without causing alert fatigue is hard. So much noise and pressure was being put on the ops team that people started to ignore alerts rather than react quickly. You really need to be efficient: if an alert goes off, then it should be actionable. Checks are system-specific, making it difficult to alert on aggregations like percentiles, so we’ve been transitioning to metrics-based monitoring.
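To illustrate the difference Weston describes, here is a minimal Python sketch. It is not Tapjoy's or SignalFx's implementation: the host names, latency values, 100 ms threshold, and the function names `percentile` and `fleet_p95_alert` are all hypothetical. A per-host check fires on a single slow request, while an alert on the pooled 95th percentile stays quiet until the slowdown is widespread.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fleet_p95_alert(latencies_by_host, threshold_ms):
    """Pool samples from every host, then compare the fleet-wide p95 to a threshold."""
    pooled = [ms for samples in latencies_by_host.values() for ms in samples]
    p95 = percentile(pooled, 95)
    return p95, p95 > threshold_ms


# Hypothetical request latencies in milliseconds, keyed by host.
latencies_by_host = {
    "web-1": [12, 15, 11, 14, 13, 12, 16, 13, 14, 15],
    "web-2": [13, 12, 16, 15, 14, 11, 13, 12, 15, 14],
    "web-3": [14, 13, 250, 12, 11, 15, 13, 14, 12, 16],  # one slow request
}

# Check-style view: any host whose worst request crossed the threshold.
noisy_hosts = [h for h, s in latencies_by_host.items() if max(s) > 100]

# Metrics-style view: one aggregate across the whole fleet.
p95, firing = fleet_p95_alert(latencies_by_host, threshold_ms=100)

print(noisy_hosts)  # ['web-3']
print(p95, firing)  # 16 False
```

If every request on `web-3` slowed to 250 ms, the pooled p95 would cross the threshold and the aggregate alert would fire, which is the kind of actionable signal the answer above asks for.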

What does your monitoring stack look like now?

Weston: We use CollectD to gather metrics and send them to SignalFx for all our production monitoring of systems and services. We still use some checks for system-level monitoring but will gradually move as much of our monitoring off that method as possible. For code-level stack traces and performance monitoring we use New Relic.

"One of the biggest things is that the UI is so simple to use. We will empower engineers to write their own metrics, to monitor their own software better."

Weston Jossey, Head of Operations

How do you use SignalFx?

Weston: Our two big use cases right now are:
  • Monitoring our production Riak database that handles 250,000 ops/sec at peak
  • System-level monitoring for all prod systems with CollectD for things like CPU, disk, RAM, etc.

[Image: Tapjoy Riak dashboard]

Soon we’ll be sending business metrics so we can correlate those KPIs with service and infrastructure performance: for example, total impressions, total clicks, or total conversions at any moment, and whether any of those went up or down in a statistically significant way after a particular code change or rollout.

We frequently use aggregations like 95th percentile and min/max, as well as SignalFx’s “timeshift” capability, which allows us to stream those aggregations and show them alongside the exact same aggregations from a day, a week, or more ago, side by side within seconds of the raw data streaming in. This is very valuable when you’re trying to diagnose a problem because it becomes very evident what has changed.
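The idea behind a timeshift comparison can be sketched in a few lines of plain Python. This is only an illustration of the concept, not SignalFx's API: the functions `timeshift` and `compare_with_past`, the hourly p95 values, and the dates are all made up for the example.

```python
from datetime import datetime, timedelta


def timeshift(series, offset):
    """Move every timestamp forward by `offset` so past data lines up with now."""
    return {ts + offset: value for ts, value in series.items()}


def compare_with_past(series, offset):
    """Pair each point with the value seen `offset` earlier, where both exist.

    Returns {timestamp: (current, past, current - past)}.
    """
    shifted = timeshift(series, offset)
    return {
        ts: (value, shifted[ts], value - shifted[ts])
        for ts, value in sorted(series.items())
        if ts in shifted
    }


# Hypothetical hourly p95 latency (ms) for two consecutive days.
start = datetime(2015, 1, 1)
series = {start + timedelta(hours=h): 20.0 for h in range(48)}
series[start + timedelta(days=1, hours=9)] = 45.0  # a jump on day two at 09:00

rows = compare_with_past(series, timedelta(days=1))
for ts, (current, past, delta) in rows.items():
    if delta != 0:
        print(ts, current, past, delta)
```

Only the 09:00 point on the second day is printed, with a current value of 45.0 against 20.0 a day earlier, which is how a side-by-side view makes it evident what has changed.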

And we love looking at the number of raw ops/sec flowing through our service.

About the author

Aneel Lakhani

Aneel is a marketer. Previously he worked on marketing at other startups, served as a Research Director at Gartner, and did stints at big companies like Cisco and IBM doing everything from sales engineering to product management and large scale systems architecture.
