The growing complexity of network environments poses great challenges to effective network monitoring. This should come as no surprise: such environments can include bare-metal servers (on-prem solutions), data centers, and private and public clouds. Additionally, automation, orchestration and SDNs can greatly enhance your network management, but they also add yet another level of complexity. In this blog post, we will discuss the main challenges related to monitoring and how to solve them to streamline the entire monitoring process.
Network monitoring, like monitoring in general, is a broad subject. Its importance and the challenges it presents are hard to overestimate. Every network operator or service provider knows that having an exact view of their network, systems and services is key to their business. The fact is, one must know almost immediately if something is wrong in order to fix it and head off a negative impact on customers. Alternatively, one must be prepared for increasing service demand. Ideally, the latter should be known in advance, as not all resources can be scaled up easily and on demand. What we know works well in the clouds does not work so well for underlay networks.
So, in terms of monitoring challenges, our first consideration is what to monitor. What is currently being monitored and is there anything missing? Should we focus on the Control Plane, the Data Plane or both? And on which layer? Those are the questions we need to answer before we set out to optimize or expand a monitoring system.
Monitoring has obviously been done for years. We once used MRTG or Cacti to look at what was happening in the network. Now we work with different sources of data: we can use gRPC, and we still use SNMP. We have shifted our interest from the Data Plane towards different, narrower areas: the state of a BGP session, for example, or the prefixes it advertises. We verify the state of our workloads, especially those located in public clouds. Generally speaking, we want to see what is happening in our network from different angles.
In our experience, there are two most common needs: to expand the monitoring system to cover almost all devices, systems and services, and to integrate several existing (and actively used) systems into one tool.
Consider the following example. A company has its own network consisting of routers, switches and firewalls. They also own a data center where they use VMs or containers to host their workloads. But the network has a history: there are new devices, yes, but also some quite old ones that still work properly, so there is no need to replace them right away. The company has decided to implement a new monitoring system from among those available on the market, and they really like it: the user experience is very good, and the functions it supports meet their requirements.
The only problem is that the new monitoring system supports most of their hardware, but not all of it. This prompts the question: can they live with a blind spot covering part of their infrastructure, or should they stick with what they have now? The old system may not be perfect, but it at least offers 100% monitoring coverage. The answer is simple: no, they don't have to sacrifice functionality and user experience, because all they have to do is add the missing parts to the picture. The devices that are not yet supported can be integrated into the new system.
This won't be done using API calls, as the old devices do not support APIs. Instead, we can provide adapters to serve as an intermediate translation layer between the potentially problematic boxes and the new system. We'll need to use some other means of communication, such as SSH or SNMP, but in this way we can provide visibility into the whole network, and not just a part of it.
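To make the idea concrete, here is a minimal sketch of such a translation-layer adapter. All class and counter names (`LegacySwitchAdapter`, `ifInOctets`, and so on) are illustrative assumptions, and the transport is injected so a real SNMP GET or SSH command could be swapped in for the fake one used here:

```python
from abc import ABC, abstractmethod

class DeviceAdapter(ABC):
    """Translation layer between a legacy device and the new monitoring system."""

    @abstractmethod
    def collect(self) -> dict:
        """Return metrics normalized to the monitoring system's schema."""

class LegacySwitchAdapter(DeviceAdapter):
    """Hypothetical adapter for an old switch reachable only via SNMP or SSH."""

    def __init__(self, host: str, fetch_raw):
        self.host = host
        self.fetch_raw = fetch_raw  # injected transport (SNMP GET, SSH command, ...)

    def collect(self) -> dict:
        # Raw vendor output, e.g. 'ifInOctets=1200 ifOutOctets=3400'
        raw = self.fetch_raw(self.host)
        pairs = dict(kv.split("=") for kv in raw.split())
        # Normalize vendor counters to the schema the new system understands.
        return {
            "device": self.host,
            "rx_bytes": int(pairs["ifInOctets"]),
            "tx_bytes": int(pairs["ifOutOctets"]),
        }

# Simulated transport standing in for a real SNMP/SSH call.
def fake_snmp_get(host):
    return "ifInOctets=1200 ifOutOctets=3400"

adapter = LegacySwitchAdapter("sw-legacy-01", fake_snmp_get)
print(adapter.collect())
```

The key design point is that the new monitoring system only ever sees the normalized schema; how the numbers were obtained (SSH scraping, SNMP, a vendor CLI) stays hidden inside the adapter.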
With a view of the whole network (both the newer boxes and the legacy devices), the company wants to upgrade their internal DC to an L3-only solution. Until now, it has been a classic L2-based data center, but to better integrate with the IT world, they want to introduce a pure L3 underlay. From an implementation perspective, this means using SDN (not yet present) and supporting different types of workloads, such as legacy servers and VMs.
Having been covered in phase one, the underlay is no longer an issue, though the overlay layer, the servers and the VMs must still be integrated with the monitoring tool. The servers and VMs are not much of a challenge, as they are well supported. Rather, the problem arises with the SDN. Not all tools understand SDNs out of the box, but some do (and we are sure many more are on their way). So, while choosing an option, one must also take future needs into account, or look to add the missing functionality, just as was done for the underlay.
What is missing from this picture is public clouds, which are increasingly ubiquitous and should receive the same attention as other solutions, even if they are only being used temporarily. They should be included in your monitoring system as well, which will keep all of your components under a single hood instead of forcing you to rely on dedicated per-cloud solutions.
It should be clear how far beyond mere resource monitoring networks have evolved. In the past, the actual number of devices or workloads was quite small and they were manually deployed and managed by humans. Now, with automation in place, operators do not have precise knowledge of what is done, when and where it is done (and all the less so in the case of public clouds). But nor do they need it. What and where are crucial questions for infrastructure providers, especially for capacity planning. For the service provider, however, the focus is obviously on offering the best service it can, not on the specific building blocks that keep it humming along as optimally as possible. This brings an entirely new issue into focus: how to visualize the data to make it meaningful?
When it comes to visualizing monitoring data, is there a single reliable indicator of your environment's performance? An indicator to which you can assign "a smiley or sad face" (something like Apdex in the case of APM)? Theoretically yes, because you can compose an arbitrarily complex KPI. But if you consider it carefully, you will quickly conclude that it will be useful and interesting mainly for managers and business users, who need to determine quickly whether the environment is performing well or poorly.
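One way such a composite KPI could look is sketched below. The metric weights and thresholds are purely illustrative assumptions (they are not an Apdex implementation); the point is that many metrics collapse into one coarse verdict a business user can read at a glance:

```python
# Weighted composite KPI mapped to a coarse "smiley / sad face" verdict.
# Thresholds and weights are illustrative assumptions, not standards.

THRESHOLDS = {          # metric -> (good_limit, weight); lower values are better
    "latency_ms":  (100, 0.4),
    "packet_loss": (0.01, 0.4),
    "jitter_ms":   (30, 0.2),
}

def health_score(metrics: dict) -> float:
    """Return 0.0-1.0: the weighted fraction of metrics within their good limit."""
    score = 0.0
    for name, (limit, weight) in THRESHOLDS.items():
        if metrics[name] <= limit:
            score += weight
    return score

def verdict(metrics: dict) -> str:
    return ":)" if health_score(metrics) >= 0.8 else ":("

print(verdict({"latency_ms": 42, "packet_loss": 0.001, "jitter_ms": 12}))  # healthy
print(verdict({"latency_ms": 250, "packet_loss": 0.05, "jitter_ms": 12}))  # degraded
```

Exactly because everything is folded into one number, engineers will immediately ask which term dragged the score down, which is why the detailed metrics discussed next still matter.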
However, if you want to gain real insight into what is happening in your network, many more indicators or metrics are needed: latency (are we under 100 ms?), packet loss, bandwidth vs. throughput (what is the effective usage of our bandwidth?), jitter (what disrupts packet flow?), availability, connectivity, and ingress and egress traffic. And that's only a partial list.
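Several of these metrics fall out of the same raw data. As a minimal sketch, the snippet below derives packet loss and jitter from a list of RTT probe results, with `None` marking a lost probe; the jitter formula here is a deliberate simplification (mean absolute difference between consecutive RTTs), not the full RFC 3393 delay-variation metric:

```python
# Derive packet loss and (simplified) jitter from raw RTT probes.
# None marks a probe that timed out.

def packet_loss(samples):
    lost = sum(1 for s in samples if s is None)
    return lost / len(samples)

def jitter(samples):
    rtts = [s for s in samples if s is not None]
    deltas = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return sum(deltas) / len(deltas) if deltas else 0.0

probes_ms = [20.1, 22.3, None, 19.8, 21.0]  # one probe timed out
print(f"loss: {packet_loss(probes_ms):.0%}, jitter: {jitter(probes_ms):.2f} ms")
```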
What is the best way to visualize all these metrics? Of course, time series data can be visualized using the classic chart types: line, area, bar or stacked bar charts. But you can also use something fancier, like a matrix, to gauge latency. Top-n values for metrics can be visualized using pie charts, gauge meters or sunburst graphs.
On the other hand, if you need to visualize the health of an entire environment, e.g. get an overview of your data center, heat maps or honeycombs will be the right choice. Unfortunately, while they look nice, in large environments they offer poor readability unless you enhance them with additional features like multi-level grouping or data filtering.
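The grouping idea is simple to sketch. Assuming hypothetical per-device health scores keyed by `(rack, device)`, aggregating them per rack turns hundreds of heat-map cells into a handful of readable ones, with drill-down re-expanding a rack back into its devices:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-device health scores (0.0 = down, 1.0 = healthy).
device_health = {
    ("rack-a", "sw1"): 1.0, ("rack-a", "sw2"): 0.9,
    ("rack-b", "sw1"): 0.2, ("rack-b", "sw2"): 0.4,
}

def rack_heatmap(health: dict) -> dict:
    """Collapse per-device scores into one aggregated heat-map cell per rack."""
    cells = defaultdict(list)
    for (rack, _device), score in health.items():
        cells[rack].append(score)
    return {rack: round(mean(scores), 2) for rack, scores in cells.items()}

print(rack_heatmap(device_health))
```

Averaging is only one possible aggregation; taking the minimum per rack would instead surface the single worst device, which is often what an operator wants.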
As you can see, there are many components that enable the visualization of the most important metrics from the NPM world. But what else can you do to make life easier for users? One simple step is to let users customize the dashboards where all these metrics are visualized. A first-time user will most probably rely on the out-of-the-box dashboards, but as they gain experience, users will want to create their own dashboards, e.g. one dashboard for the staging environment and another for the production environment.
Bear in mind that visualization is not only about creating visual components or widgets. We need to design the entire flow for the user and include any integration with third-party or other system modules. To streamline the user's operations, this needs to be "one-click integration". To achieve it, use metadata or data tagging to automate complex actions for the user. Most often, these data are already in your inventory.
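A tiny sketch of the idea: the inventory's tags supply every parameter an integration action needs, so the user clicks once and never types an IP or site name. All device names, tags and the action string below are hypothetical:

```python
# Inventory metadata (tags) drives a "one-click" action: the integration
# call is assembled for the user from tags, not typed by them.

INVENTORY = {
    "edge-fw-01": {"role": "firewall", "site": "warsaw", "mgmt_ip": "10.0.0.1"},
    "core-sw-01": {"role": "switch",   "site": "berlin", "mgmt_ip": "10.0.1.1"},
}

def one_click_action(device: str, action: str) -> str:
    """Resolve everything the action needs from inventory tags."""
    meta = INVENTORY[device]
    return f"{action} --target {meta['mgmt_ip']} --site {meta['site']}"

print(one_click_action("edge-fw-01", "open-ticket"))
```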
Yet these visualization methods may not suffice. If you look at the environments companies are dealing with today, you will quickly realize how complex and varied they can be. We therefore need to enhance our monitoring tools for today's mixed environments, which join together bare-metal servers (on-prem solutions), companies' own data centers, and private and public clouds. Additionally, automation, orchestration and SDNs can help you build your environment, though they further complicate both the overall monitoring process and its visualization. For instance, Ixia research shows that only 15% of companies claim to have sufficient visibility into public clouds.
In such complex environments, your starting point with any tool should be a hybrid view, map or heterogeneous topology, whatever you call it. From there you should be able to drill down to any sub-environment specific data.
The current trend on the market is to tailor the monitoring system to the client's specific needs. Of course, many clients who have monitoring systems are happy with them. They may even have more than one, as the systems are integrated with their management tools and, of course, the network is never homogeneous. In any case, engineers working with those tools on a day-to-day basis can dive deep into a problem, if one occurs, just by looking at the vendor-specific tool. But this is not a holistic view. Again, we can monitor the building blocks, but not the service that crosses boundaries, even literally: with public clouds, an application can be… everywhere.
So, to streamline the monitoring process in such companies, several different solutions can be employed:
- Replace old tools with new ones that enable you to monitor heterogeneous environments. Such new monitoring solutions are available on the market, but companies are usually not willing to replace existing and battle-proven tools.
- Use an already existing platform that will serve both as an integrator and aggregator of metrics from various sources. It will have to be an open solution with a public REST API or plugin framework.
- Create your own dedicated solution that lets you aggregate data and visualize metrics coming from various sources: other monitoring tools, adapters for network devices collecting data via SNMP or SSH, and others.
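The second and third options share the same core: a thin aggregator behind which heterogeneous sources all look alike. A minimal sketch, with entirely hypothetical source names, could look like this:

```python
# Thin aggregator: heterogeneous metric sources register behind one
# interface; each collected metric is tagged with its origin for drill-down.

from typing import Callable, Dict, List

class Aggregator:
    def __init__(self):
        self.sources: Dict[str, Callable[[], List[dict]]] = {}

    def register(self, name: str, fetch: Callable[[], List[dict]]) -> None:
        """Plug in any source: another tool's API, an SNMP/SSH adapter, a cloud API."""
        self.sources[name] = fetch

    def collect_all(self) -> List[dict]:
        metrics = []
        for name, fetch in self.sources.items():
            for m in fetch():
                metrics.append({**m, "source": name})  # tag origin for drill-down
        return metrics

agg = Aggregator()
agg.register("legacy-snmp", lambda: [{"device": "sw1", "cpu": 12}])
agg.register("cloud-api",   lambda: [{"device": "vm-7", "cpu": 55}])
print(agg.collect_all())
```

In a real deployment the lambdas would be replaced by REST clients or the legacy-device adapters described earlier, but the aggregator itself would not need to change.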