We are back from Portland after a fantastic three days at Monitorama. It’s been amazing to be part of this community that has grown over the last few years. As we’ve worked to continually improve our monitoring solution, develop best practices for our on-call engineers, and deliver awesome features to our customers, it’s always nice to have a community to share key learnings and discuss new ideas.

Monitorama is one of the best events to connect, engage, and learn from our broader community. We had lots of great conversations ranging from monitoring in theory to monitoring in reality.

Many of our conversations centered on how monitoring evolves once you’ve figured out how to scale your infrastructure. Using a mix of open-source tools and commercial solutions, organizations are now dealing with the reality of monitoring and managing their cloud infrastructure, microservices, and apps. This requires ingesting and analyzing a high volume of metrics from hundreds of web services, while also dealing with high cardinality.
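To make "high cardinality" concrete, here is a rough back-of-the-envelope sketch. It assumes cardinality means the number of distinct time series, which is bounded by the product of the number of values each dimension can take. The dimension names and counts below are made up for illustration and do not come from any real deployment.

```python
from math import prod

# Hypothetical dimension sizes for a single metric (illustrative numbers only).
dimensions = {
    "service": 200,
    "host": 50,
    "endpoint": 20,
    "status_code": 5,
}


def max_series(dimension_sizes):
    """Upper bound on distinct time series for one metric:
    the product of the number of values per dimension."""
    return prod(dimension_sizes.values())


print(max_series(dimensions))  # 200 * 50 * 20 * 5 = 1000000
```

In practice not every combination occurs, so the real number is usually lower, but adding a single new dimension still multiplies the upper bound.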

High cardinality is a technical hurdle for everyone. While already supporting class-leading scale, we’re working hard behind the scenes to make SignalFx the ultimate observability solution. New improvements to our TSDB storage systems, our metadata systems, and our real-time SignalFlow™ analytics will soon ensure that our customers never have to compromise on scale, context, granularity or visibility into their production environments.

We know (from our own experience!) what it takes to build a real-time monitoring and alerting solution. Getting to scale is time and resource intensive, and we applaud those looking to modernize their monitoring solutions. However, we would be remiss if we didn’t share some key considerations for those starting down the path of building their own cloud monitoring:

  • Upfront and incremental infrastructure costs. The nature and quantity of data being ingested and stored requires a significant amount of infrastructure, especially storage. There are compromises that could be made to lower costs, but when users complain that queries take too long, it is common to add incremental spend on high-end hardware to improve performance.
  • A dedicated monitoring team. We heard there are teams ranging from several to more than a dozen full-time engineers dedicated to infrastructure monitoring. Those engineers often end up focusing exclusively on the operational aspects of the system and not building out new features.
  • Real-time data and insight. While there is a time and place for ingesting and aggregating logs, we’re living in a world where fast-evolving anomalies can quickly turn into outages. For example, evaluating alert conditions only after the data has been fully collected and stored in a database means it may be minutes before you can start taking action.
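The last consideration, alert latency, can be illustrated with a toy comparison. This is only a sketch under simple assumptions: datapoints are `(timestamp_seconds, value)` pairs, the streaming check runs as each point arrives, and the batch check is a hypothetical job that queries stored data every `batch_interval` seconds. None of the function names refer to a real product API.

```python
def streaming_alert_time(points, threshold):
    """Timestamp of the first datapoint above threshold, checked on arrival."""
    for ts, value in points:
        if value > threshold:
            return ts
    return None


def batch_alert_time(points, threshold, batch_interval):
    """Time at which a periodic batch job would first notice the breach.

    The job is assumed to run at multiples of batch_interval and to see
    only datapoints already stored at that moment.
    """
    breach = streaming_alert_time(points, threshold)
    if breach is None:
        return None
    # Round up to the next scheduled run (ceiling division on integers).
    return -(-breach // batch_interval) * batch_interval


# Hypothetical CPU readings every 10 seconds; the spike starts at t=20.
points = [(0, 40), (10, 45), (20, 95), (30, 97)]

print(streaming_alert_time(points, 90))   # 20
print(batch_alert_time(points, 90, 300))  # 300
```

With a five-minute batch job, the same breach is noticed 280 seconds later than with per-datapoint evaluation, which is the kind of gap described above.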

We understand that outages can be unpredictable; we experienced that firsthand in Portland. We give props to the entire conference staff for seamlessly finding, organizing, and setting up a new venue within hours of the power outage. Just as the attendees were excited to get back to the talks, the people who rely on our service to manage their own services and applications need it to be available, and that is what we work to ensure.

This gave us even more appreciation for the talk from Alice Goldfuss and the origins of the #oncallselfie. Being on-call can be really challenging; a sophisticated monitoring system is your best ally against alert fatigue! We heard stories of painful, draining on-call shifts, dealing with hundreds of pages, often the result of poorly defined alerts that use static thresholds or fail to take context or history into account. All of this is a clear indication that a more advanced monitoring system is needed. Check out our new alerting features for a taste of what’s possible!
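As a small illustration of the difference between a static threshold and an alert that takes history into account, here is a minimal sketch. It assumes a simple rule (fire when a value is more than a few standard deviations away from the recent mean), which is just one of many possible approaches and not a description of any particular product's alerting. All names and numbers are hypothetical.

```python
from statistics import mean, stdev


def static_alert(value, threshold):
    """Fire whenever the value exceeds a fixed threshold."""
    return value > threshold


def history_alert(value, history, num_stddevs=3.0):
    """Fire only when the value deviates from recent history by more
    than num_stddevs sample standard deviations."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > num_stddevs * sigma


# A host that normally runs hot: the static rule pages, the history rule does not.
busy_host = [88, 91, 90, 89, 92, 90]
print(static_alert(91, 80))           # True
print(history_alert(91, busy_host))   # False

# A normally quiet host that jumps: the static rule misses it, the history rule fires.
quiet_host = [10, 12, 11, 9, 10, 11]
print(static_alert(45, 80))           # False
print(history_alert(45, quiet_host))  # True
```

The static rule pages on normal behavior for the busy host and stays silent on a real change for the quiet one, which is how poorly defined alerts end up contributing to the painful on-call shifts mentioned above.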

Even though Monitorama is over, we look forward to keeping the conversation going. We’ll of course see you all there next year. In the meantime, check out the videos and let us know which ones are your favorites!


About the authors

Maxime Petazzoni

Maxime has been a software engineer for over 15 years. At SignalFx, Max is the architect behind our Microservices APM offering, and spent several years working on the core of SignalFx: its real-time, streaming SignalFlow™ Analytics. He is also the creator of MaestroNG, a container orchestrator for Docker environments.

Matthew Pound

Matthew is a software engineer with more than 15 years of experience working on backend services in complex operational environments including Opsware. At SignalFx, Matt leads our Go codebase and majors in data ingest with a minor in integrations.
