Today we sit down with co-founder and CTO Phillip Liu to get his thoughts on the launch and the technology vision of SignalFx.

How are you doing today?

Great after our successful launch! We have spent the last two years building some great software and solving some very difficult problems while working closely with a great set of beta customers. We are very excited to share our work with the rest of the world.

What inspired you to build SignalFx?

My career has always been directly or indirectly involved in monitoring systems or applications. Most recently I spent a number of years at Facebook and my experience there shaped many of our initial ideas for SignalFx.

At Facebook we started out like a lot of other web and SaaS companies using open source monitoring tools like Nagios and Ganglia. We quickly realized that, given our growth and scale, we would be spending more time and effort maintaining and customizing these open source tools than it would take to build something tailored to our needs. So we decided to go back to the drawing board and figure out what our ideal monitoring solution would look like if we built one from scratch.

It was during this process that the way we looked at monitoring a large-scale web application fundamentally changed at Facebook. We couldn’t just look at individual components anymore. The amount of noise we were generating from component-level alerts was staggering. At Facebook scale, there could be several thousand alerts going off at any given time even if there were no problems with the overall service. We needed something smarter. We also had huge amounts of useful information about the state of our applications in log files, but no easy way for our development teams to extract insights from that data without getting other teams involved.

We concluded that we needed to shift monitoring into a centralized service that could look at patterns across entire populations of systems and applications, instead of focusing on checks against individual systems. This organically evolved to become ODS, the metrics-based monitoring system at Facebook that processes trillions of metrics a day and is used by every developer and operations engineer at the company to monitor the production Facebook application. Time series metrics proved to be the most compact way of sending data. Once we were able to get all these metrics into a central data store, we started to ask ourselves “what do we want to do with the data?” From there we naturally started doing more and more sophisticated aggregations, analytics, and visualizations.
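To make the difference between per-system checks and population-level patterns concrete, here is a minimal Python sketch. It does not describe how ODS works internally; the host names, the 90% CPU threshold, and the 20% population limit are all made-up values for illustration.

```python
from typing import Dict, List


def per_host_alerts(cpu_by_host: Dict[str, float], threshold: float = 90.0) -> List[str]:
    """Component-level check: one alert for every host above the threshold."""
    return sorted(host for host, cpu in cpu_by_host.items() if cpu > threshold)


def population_alert(
    cpu_by_host: Dict[str, float],
    threshold: float = 90.0,
    max_fraction: float = 0.2,
) -> bool:
    """Population-level check: alert only when too large a share of hosts is hot."""
    if not cpu_by_host:
        return False
    hot = sum(1 for cpu in cpu_by_host.values() if cpu > threshold)
    return hot / len(cpu_by_host) > max_fraction


if __name__ == "__main__":
    # Hypothetical fleet of 100 web hosts, two of which are running hot.
    fleet = {f"web-{i:03d}": 50.0 for i in range(100)}
    fleet["web-007"] = 97.0
    fleet["web-042"] = 99.0
    print(per_host_alerts(fleet))   # the two hot hosts each raise an alert
    print(population_alert(fleet))  # the fleet as a whole is still healthy
```

With a fleet of thousands of hosts, the first function produces a steady stream of alerts even when the service is fine, while the second stays quiet until the pattern across the population changes.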

How did your experience at Facebook influence how you designed SignalFx?

When Karthik and I started SignalFx in 2013, we realized that technically the monitoring landscape had not changed dramatically since the analysis my team had done in 2008 at Facebook. Some open source projects had emerged, and some startups had begun to commercialize or mimic them, but largely web and SaaS companies were still struggling with the same challenges we’d had at Facebook.

How is SignalFx different?

We believe that monitoring modern applications is inherently an analytics problem. The investment to build a state-of-the-art, homegrown monitoring solution can be quite substantial, and our experience operating such systems at scale for large-scale web companies has enabled us to build a product with both greater capabilities and lower cost than most could achieve on their own. SignalFlow™, our core technology, is a streaming analytics engine that moves monitoring away from component-level alerts toward more meaningful signals. Users create SignalFlow analytics pipelines that perform statistical aggregations and transformations of time series data, both real time and persisted, as it flows through the SignalFx service. In addition, multiple analytics pipelines can be combined and compared to generate new time series. The output of SignalFlow analytics is usually available within two resolutions of time series reporting frequency. This responsiveness is important in reacting to anomalies as they’re detected. All the capabilities of SignalFlow are available in an interactive and intuitive user interface.
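As a rough illustration of what aggregating time series and combining pipelines can mean, here is a small Python sketch. It is not SignalFlow and does not use any SignalFx API; the `aggregate` and `combine` helpers, the host names, and the sample values are hypothetical, and the data is processed in memory rather than as a stream.

```python
from typing import Callable, Dict, List

# A time series is modeled as a mapping of timestamp (seconds) to value.
Series = Dict[int, float]


def aggregate(series_list: List[Series], fn: Callable[[List[float]], float]) -> Series:
    """Collapse many series into one by applying fn to the values at each timestamp."""
    timestamps = sorted({t for series in series_list for t in series})
    return {
        t: fn([series[t] for series in series_list if t in series])
        for t in timestamps
    }


def combine(a: Series, b: Series, fn: Callable[[float, float], float]) -> Series:
    """Derive a new series from two others, using only timestamps present in both."""
    return {t: fn(a[t], b[t]) for t in sorted(a.keys() & b.keys())}


if __name__ == "__main__":
    # Hypothetical per-host counters reported every 10 seconds.
    requests = [{0: 100.0, 10: 200.0}, {0: 100.0, 10: 200.0}]
    errors = [{0: 1.0, 10: 4.0}, {0: 3.0, 10: 4.0}]

    total_requests = aggregate(requests, sum)  # pipeline 1: sum across hosts
    total_errors = aggregate(errors, sum)      # pipeline 2: sum across hosts

    # Combining the two pipelines yields a new series: the service-wide error ratio.
    error_ratio = combine(
        total_errors, total_requests, lambda e, r: e / r if r else 0.0
    )
    print(error_ratio)
```

The derived error ratio says something about the service as a whole, which is usually a better alerting signal than the raw counters from any single host.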

[Figure: sfx-arch (architecture diagram)]

Without giving away too much of the secret sauce, what has been the greatest technical achievement of SignalFx so far?

We’ve spent the past two years building a state-of-the-art monitoring platform. There are many things I am proud of for the team along this journey to launch, but three things stand out the most:

  • Getting robust streaming analytics at scale, not just streaming data. That is really challenging and quite novel, especially when you think about handling all the nuances of data coming from all kinds of sources, whether it be too much, late, repeated or inconsistent. It’s actually a tremendous problem when you get to the scale of billions of data points coming in at a time.
  • Making the user experience easy and intuitive when you have a really powerful platform dealing with large masses of constantly moving data. I want people to think “This makes sense to me, I can explore the data and figure out anything.”
  • Providing accurate analytics in an ephemeral deployment environment. Many of our customers employ elastic techniques to quickly turn up and down compute capacity. In this environment, it’s difficult to keep monitoring configurations up-to-date. Our system detects when sources stop reporting metrics and automatically removes these stale sources from the analytics pipelines.
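Two of the ideas above, ignoring late or repeated data points and dropping sources that have stopped reporting, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SignalFx's implementation: the `FleetAverage` class, the rule that a source is stale after three missed reports, and the policy of simply discarding out-of-order points are all hypothetical choices.

```python
from typing import Dict, Optional, Tuple


class FleetAverage:
    """Average of the latest value per source, excluding sources that went quiet."""

    def __init__(self, report_interval: float, max_missed: int = 3) -> None:
        # Assumption: a source is stale once it has missed max_missed reports.
        self.stale_after = report_interval * max_missed
        self.latest: Dict[str, Tuple[float, float]] = {}  # source -> (timestamp, value)

    def report(self, source: str, timestamp: float, value: float) -> None:
        previous = self.latest.get(source)
        # Discard late or repeated points that are not newer than what we hold.
        if previous is not None and timestamp <= previous[0]:
            return
        self.latest[source] = (timestamp, value)

    def average(self, now: float) -> Optional[float]:
        # Remove sources that have not reported recently, e.g. terminated instances.
        stale = [s for s, (ts, _) in self.latest.items() if now - ts > self.stale_after]
        for source in stale:
            del self.latest[source]
        if not self.latest:
            return None
        return sum(value for _, value in self.latest.values()) / len(self.latest)


if __name__ == "__main__":
    fleet = FleetAverage(report_interval=10)
    fleet.report("host-a", 0, 10.0)
    fleet.report("host-b", 0, 30.0)
    print(fleet.average(now=5))    # both sources contribute

    fleet.report("host-a", 40, 10.0)
    fleet.report("host-a", 40, 999.0)  # repeated timestamp, ignored
    print(fleet.average(now=40))   # host-b went quiet and is dropped
```

Without the stale-source step, the last value from a terminated instance would keep skewing the average until someone updated the monitoring configuration by hand.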

We are looking forward to having teams try SignalFx, give us feedback, and start looking for patterns. The common and creative ways you use SignalFx will direct the evolution of the platform.



About the authors

Phillip Liu

Phil is the CTO and Co-founder of SignalFx. Previously he led the development of Facebook's Infrastructure-as-a-Service platform and several web-scale application management solutions. Phil has more than 20 years of experience in distributed systems, and was a Distinguished Technologist at HP.
