At SignalFx we are working hard to help more folks appreciate the power of metrics. So, when we heard about the work that Cisco’s IOS XR team was doing around Streaming Telemetry, we knew that we would have a lot to talk about.
I sat down with Shelly Cadora, Principal Engineer at Cisco, to talk about what they are doing to bring network telemetry into the world of modern monitoring and analytics.
Zimman (SignalFx): Before Streaming Telemetry, what was the state of telemetry and operational data that you could get from networking equipment and software?
Cadora (Cisco): Collecting data for analysis and troubleshooting has always been an important aspect of monitoring the health of a network. IOS XR 6.0 provides all the traditional mechanisms such as SNMP, CLI and Syslog for pulling data. But these mechanisms have limitations that restrict automation and scale. The pull model, where the initial request for data from network elements originates from the client, does not scale when what you want is near real-time data. The polling mechanisms and the data transfer can cause performance problems on the target devices. Streaming Telemetry allows us to push data off of the device to a defined endpoint, as JSON or using Google Protocol Buffers (GPB), at a much higher frequency and more efficiently.
Zimman: We’re seeing a similar change for every part of the modern infrastructure stack, with a steady transition to instrumentation that pushes telemetry data, in metric or log format, out from the code and devices to monitoring and alerting services. In the compute tier, this is largely being driven by new platforms like Docker, Mesos, Kubernetes, and Cloud Foundry. The storage tier is being disrupted by how we store our data and access it through things like Cassandra, Spark, ElasticSearch, Redis, Kafka, and other lower level data stores like S3 and EBS. What’s driving the change in networking? Was this something that the networking community was asking for? Who are you building this for?
Cadora: In modern infrastructure design, we’re seeing network state indicators, network statistics, and critical infrastructure information being exposed to new control systems. These systems do things like automatically adding servers or resources to meet changing demand. The same data is also being fed into monitoring services, where it is correlated with server or app level performance and event data to reduce troubleshooting time. We’ve seen a steady increase in demand from Cisco customers for this capability.
IOS XR telemetry uses a push model to continuously stream interesting data out of the network to any service customers want. It provides a mechanism to identify data of interest, puts it into a structured format, and streams it to designated targets. This gives customers the ability to do things like automatic tuning of the network based on real-time data, the same way they tune other modern infrastructure platforms. The finer granularity of data, which is the equivalent of tens of thousands of OIDs per second, enables better performance monitoring and therefore better troubleshooting. This enables more service-efficient bandwidth utilization, link utilization, risk assessment and control, remote monitoring, and scalability. It lets people take advantage of real-time monitoring, alerting, and analytics services like SignalFx to improve decision-making.
The initial use cases we’ve seen have been for traffic optimization and preventive troubleshooting.
- Traffic optimization: When link utilization and packet drops in a network are monitored frequently, it is easier to add or remove links, redirect traffic, modify policing, and so on. With technologies like fast reroute, the network can switch to a new path and re-route traffic in less time than an SNMP poll interval. Streaming telemetry data helps provide the quick response time needed to keep up with traffic that changes this fast.
- Preventive troubleshooting: Helps to quickly detect and avert failure situations that result after a problematic condition exists for a certain duration.
That being said, our customers are creative, so we expect to see this capability being used in imaginative ways.
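To make the traffic optimization case concrete, here is a minimal Python sketch of how link utilization can be derived from two successive byte-counter samples. The function name, its parameters, and the 64-bit counter width are assumptions made for illustration; they are not taken from IOS XR or SignalFx.

```python
def link_utilization(prev_bytes, curr_bytes, interval_s, link_bps, counter_bits=64):
    """Return the fraction of link capacity used between two counter samples.

    prev_bytes / curr_bytes: byte counter values at the start and end of the interval.
    interval_s: seconds between the two samples.
    link_bps: link capacity in bits per second.
    counter_bits: assumed width of the counter, used to handle wrap-around.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    # Modular subtraction gives the right delta even if the counter wrapped once.
    delta_bytes = (curr_bytes - prev_bytes) % (2 ** counter_bits)
    return (delta_bytes * 8) / (interval_s * link_bps)


# Example: 125,000,000 bytes in 10 seconds on a 1 Gbit/s link.
utilization = link_utilization(0, 125_000_000, 10, 1_000_000_000)
print(f"{utilization:.0%}")
```

The shorter the sampling interval, the sooner a change in utilization shows up, which is where a high-frequency push model has the advantage over a slow poll cycle.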
Zimman: How does IOS XR Streaming Telemetry work and what kind of data does it provide?
Cadora: The components of IOS XR Streaming Telemetry include:
- Telemetry Policy: specifies the kind of telemetry data to be generated, at a specified frequency, and how it should be encoded.
- Telemetry Encoder: encapsulates the generated data into the desired format (either JSON or GPB) and transmits to the receiver.
- Telemetry Receiver: the remote management system that stores the telemetry data; in the case of SignalFx, this can be a collectd endpoint.
The Telemetry Policy is created by the administrator and stored on the IOS XR Route Processor. This policy is used by the Telemetry Encoder in IOS XR. The Telemetry Encoder streams JSON over TCP and/or Google Protocol Buffers over TCP. We have provided a LogStash receiver as an OSS project. Once in LogStash, or another receiver, data can be processed or reformatted for a variety of monitoring, alerting and visualization tools. We’ve built an integration directly with SignalFx to show off what can be done.
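As a rough illustration of the reformatting step a receiver might perform, the Python sketch below flattens one JSON message into metric datapoints with dimensions. The message layout (`node`, `interface`, `timestamp_ms`, `counters`), the `iosxr` metric prefix, and the function name are all hypothetical; the actual IOS XR JSON schema and the LogStash or SignalFx integrations are not shown here.

```python
import json


def flatten_telemetry(message, metric_prefix="iosxr"):
    """Turn one JSON telemetry message (hypothetical layout) into datapoints."""
    record = json.loads(message)
    # Dimensions identify where the measurement came from, so that a
    # monitoring tool can filter and group by them later.
    dimensions = {"node": record["node"], "interface": record["interface"]}
    datapoints = []
    for name, value in record["counters"].items():
        # Keep only numeric values; skip booleans, strings and nested data.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        datapoints.append({
            "metric": f"{metric_prefix}.{name}",
            "value": value,
            "timestamp": record["timestamp_ms"],
            "dimensions": dimensions,
        })
    return datapoints


# A made-up message, used only to exercise the function.
sample = json.dumps({
    "node": "router-1",
    "interface": "GigabitEthernet0/0/0/0",
    "timestamp_ms": 1450000000000,
    "counters": {"bytes_received": 1024, "packets_dropped": 3, "state": "up"},
})

for point in flatten_telemetry(sample):
    print(point["metric"], point["value"], point["dimensions"]["interface"])
```

Keeping the node and interface as dimensions, rather than folding them into the metric name, is what later makes it possible to slice the same metric across many devices.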
Zimman: As part of this goal to create more usable and accessible metrics, how easy or hard was it to build a data pipeline to SignalFx? What does that pipeline look like?
Cadora: SignalFx makes it very easy to get data in. The team here was very excited by how easy it was to create an integration to send the telemetry data from IOS XR to SignalFx. Once data was flowing, it was straightforward to create charts and dashboards, set alerts, and visualize the data with SignalFx’s real-time analytics. The ability to slice and dice the data based on dimensions is highly valuable in environments as large as those in which IOS XR is deployed. The ability to filter the data in real time and see what is happening across the infrastructure is game changing.
Zimman: What can you do with SignalFx that you couldn’t do without it?
Cadora: SignalFx was simple to set up; it’s cloud based and comes with a free 14-day trial. SignalFx provided us the tools to see the telemetry data at a granularity that is just unheard of for persistent network monitoring. We have always been able to do taps and captures, but those are not practical for extended periods on a production network. With IOS XR telemetry and SignalFx, we get close to that level of detail, in a form that’s useful for persistent real-time feedback and that doesn’t tax the network.
The other aspect that we have found hugely beneficial with SignalFx is the ability to easily set up alerting based on analytics. Oftentimes with network monitoring you are willing to tolerate variations, and protocols are built to handle retries, so you don’t want a hair trigger that fires an alert on a one-off anomaly. At the same time, you don’t want alert conditions that are too lenient, so that you end up reacting to an issue too late. SignalFx lets us set conditions using SignalFlow analytics such as: only fire an alert if throughput over an hour deviates severely from the same period yesterday or one week ago, and only if that deviation holds for more than five minutes. This gives operators a way to set conditions that actually reflect real network behavior in their environments and cut down on alert noise.
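The logic of such a condition can be sketched in plain Python. This is not SignalFlow and not how SignalFx evaluates detectors; the function names, the 50% threshold, the per-minute sampling, and the treatment of "more than five minutes" as the five most recent samples are all assumptions chosen for illustration.

```python
from statistics import mean


def deviates(value, baseline, threshold):
    """True if value differs from baseline by more than threshold (a fraction)."""
    if baseline == 0:
        return value != 0
    return abs(value - baseline) / baseline > threshold


def should_alert(recent, yesterday, last_week, threshold=0.5, hold=5):
    """Decide whether to fire, given per-minute throughput samples.

    recent: the current hour's samples, newest last.
    yesterday / last_week: non-empty samples from the same hour on those days.
    hold: how many of the newest samples must all deviate before firing.
    """
    if len(recent) < hold:
        return False
    window = recent[-hold:]
    baselines = [mean(yesterday), mean(last_week)]
    # Fire only if every sample in the window deviates from at least one
    # of the historical baselines, so a single spike is ignored.
    return any(
        all(deviates(v, baseline, threshold) for v in window)
        for baseline in baselines
    )


normal_hour = [100] * 60
print(should_alert([100] * 14 + [30], normal_hour, normal_hour))      # one-off dip
print(should_alert([100] * 10 + [30] * 5, normal_hour, normal_hour))  # sustained dip
```

The first call stays quiet because only one sample is out of line; the second fires because the deviation persists across the whole window.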
Zimman: We are excited to add Cisco IOS XR to the list of data sources that can be fed into SignalFx.
If you’d like to learn more about Cisco IOS XR Telemetry and SignalFx, you can check out a joint webinar on SDxCentral with Shelly Cadora and myself, as well as the IOS-XR Developer Page.