
Safe network and infrastructure changes with pre- and post- checks

When thinking about network or infrastructure operation, one usually thinks of Day 0, Day 1 and Day 2 tasks. Day 0 stands for the initial setup, Day 1 is the working production configuration, and Day 2 covers ongoing support and maintenance of a working infrastructure as it changes according to business or operational needs.

But how can we make sure that the changes made to the existing setup do not cause any issues? This is a question network operators ask themselves every time a modification is made, even when similar changes have been made dozens of times before. Sometimes it is because they need to be sure no error was made, sometimes because the service being introduced is new and has not been used in the production environment in exactly the same way before. In any case, operational continuity must be preserved.

Let’s take a look at how it can be assured with pre- and post-checks automation.

Infrastructure - a constant change

A production environment is like a living organism - it evolves and changes over time. Changes are of two types: planned and unplanned. The first type is, in most cases, a planned configuration modification made when an operator wants to achieve a certain change in operational behavior. The other type is mostly related to some kind of failure, unplanned and unwanted, where the operational state changes despite a correct setup being in place. But an unexpected change of operational state can also be caused by an intended change made somewhere else, and such situations are more likely to go unnoticed without a proper, comparative check of “before” and “after”.

Planned configuration changes can be further divided into at least two subtypes: the first, when a change impacts only the managed network (or infrastructure); the second, when there is interaction with an external partner and the exchange of information can be disturbed or disrupted in the process.

An example of an internal change could be an IGP link metric modification - it is something an operator has full control of, and the change itself cannot easily be seen by external clients unless it disrupts traversing traffic. However, a modification of an eBGP session, being a direct relation with a peer, will be seen instantly. An update of BGP session parameters can cause sessions to go down and in turn impact traffic. But a session going down is something that can be easily spotted and corrected. A more problematic scenario is when we modify policies on the above-mentioned BGP session and the session itself stays up, but only part of the traffic is disrupted. In this scenario, it is quite difficult to find the issue immediately without intentionally looking for it.

On the other side of a BGP connection sits the second type of change: unintended and unwanted. In fact, let’s take the example above and, for a simple scenario, imagine we are an ISP’s customer using an L3VPN service. We would naturally have a BGP session in place over which we would advertise some prefixes. On the ISP side there would be import policies in place in order to pass on only the valid set of prefixes. Now, if an operator on the ISP side accidentally makes an error and updates the policy in a way that does not allow all of the (valid!) prefixes to pass through, some portion of the traffic will be lost.

If the drop in traffic volume is significant, it will probably trigger some kind of alarm in both the ISP’s and the customer’s systems, but when the drop in throughput is small it may go unnoticed for some time. AI-based systems with anomaly detection in place could help in such scenarios, but it is still better to verify that the operational state did not change after the modification was introduced, rather than wait for the alarms to go off.
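
As a simple illustration of what such a verification could look like, the sketch below compares per-neighbor received-prefix counts taken before and after a change. This is a minimal, hypothetical Python example: the snapshot format (a mapping of neighbor address to received-prefix count), the addresses and the drop threshold are assumptions made for illustration, not part of any specific tool.

```python
# Minimal sketch: flag BGP neighbors whose received-prefix count dropped
# noticeably between the pre- and post-check snapshots. The snapshot format,
# the neighbor addresses and the threshold are illustrative assumptions.

PREFIX_DROP_THRESHOLD = 0.05  # flag anything beyond a 5% drop (arbitrary)

def compare_prefix_counts(pre: dict, post: dict) -> list:
    """Return warnings for neighbors that lost a noticeable share of prefixes."""
    warnings = []
    for neighbor, before in pre.items():
        after = post.get(neighbor)
        if after is None:
            warnings.append(f"{neighbor}: session missing after the change")
        elif before and (before - after) / before > PREFIX_DROP_THRESHOLD:
            warnings.append(
                f"{neighbor}: received prefixes dropped from {before} to {after}"
            )
    return warnings

# Example usage with hypothetical snapshot data
pre_check = {"203.0.113.1": 1200, "203.0.113.5": 850}
post_check = {"203.0.113.1": 1200, "203.0.113.5": 310}
for warning in compare_prefix_counts(pre_check, post_check):
    print(warning)
```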

Aiding operational reliability

We have already mentioned that a network or infrastructure constantly changes, and this is normal and expected behavior. What must remain constant, though, is the reliability of the infrastructure and the services it provides, both internal and external. There are different techniques and approaches used to assure high reliability of the network/infrastructure:

  • Redundant network/infrastructure design allows for retention of operability even in a degraded state, with zero or minimal impact on the customers and services. Usually, however, a network or infrastructure is designed to sustain a single failure, not multiple. This will also not protect it from human/configuration error.
  • Prior-deployment service/feature testing means that every new service and feature is thoroughly tested in a lab environment before it is deployed to production. The testing can use physical hardware or a Digital Twin.
  • Configuration unification is an approach to configuring a device (or its software-based representation). At its core, the goal is to make sure that the same feature or service is configured exactly the same way everywhere. Of course parameter values change, but the “template” is unified. This approach ensures that whatever is configured has already been checked in practice and is working. Configuration unification is also one of the first steps to achieving network/infrastructure automation.
  • Making changes during maintenance windows is a process rigorously followed in all large organizations; it means that all impacting (or potentially impacting) change operations are performed only during the hours of least infrastructure usage. Performing any kind of work, even during a maintenance window, requires approval. The maintenance window is usually somewhere between 1 am and 4 am, but that depends on the organization and the type of business. As the change may be unsuccessful, time for a rollback should also be taken into account.
  • The goal of pre- and post-implementation checks is to make sure that the changes introduced do not cause any lasting service disturbance or problems. The process as a whole consists of two sets of checks or measurements: before the change and after. The pre-check aims at two things: first, to verify that the network/infrastructure is healthy and ready for a change, i.e. there are no ongoing issues that could impact the change, so the process can be stopped before it is even started; second, to gather data that can later be compared with the results of the post-check tasks once the modification has been introduced.
  • Change history tracking is needed to be sure of what happened and when, and for documentation purposes. We can look at the state of the network before and after the change and everything may look okay, but if someone reports that something is not right, we have access to data showing exactly what the status was before the change was implemented. For example, in the case of BGP we can say that the change did not affect the traffic because a specific prefix wasn't received even before the change started. Also, if changes are frequent, having access to all pre- and post-checks for a period of a few weeks or months helps with investigating issues, giving access to historical data that we can analyze to fix the problem.
  • AI-based anomaly detection helps with revealing events or patterns that should not be there. The process is similar in its goal to pre- and post-checks - to make sure all works as expected - but it uses different tools and methodologies, and is continuous; contrary to pre- and post- implementation checks which are done on request. To read more about anomaly detection, visit our AI and Machine Learning for Networks: classification, clustering and anomaly detection blog post.
  • AI-based root cause analysis (RCA) is used in situations when an issue in the network/infrastructure has already occurred. When using AI-based RCA, an operator is assisted by the software to search for changes introduced prior to the issue, known issues that may have surfaced with the existing configuration in place, and checks that take place automatically. The software may also suggest the potential cause of the problem or recommend the next steps to be taken to identify it.

Among all of the above, pre- and post-implementation checks are the ones done in direct relation to the change, and as such they can be tuned to verify whether the change was truly successful or not. Let’s see what should be checked.

What should and can be verified

When it comes to introducing a change to a network or infrastructure, it is crucial to make sure that the change does not cause any lasting issues. A way to do so is to check whether the intended configuration has been applied correctly and that the operational state of the network or infrastructure remained unchanged after the modification in areas that were not supposed to be impacted. Although each situation is different and there is no general rule as to what should be verified, some things are common:

  • Verification that the change date and time have been approved by the change management authorities, and that the ‘go’ decision is still in effect.
  • A network/infrastructure health check before any change is applied helps to decide whether to proceed with the changes or not. It is okay to start modifications as planned if all is well, but it probably is not a good idea to add more stress to a network that is already experiencing problems. Unless of course the reason for the upcoming change is to fix the existing issue.
  • Before any modifications are performed, it is necessary to gather as much information about the current state of the network as is reasonably possible during pre-check verification. This will vary based on the type of device. For example, for a router, one may want a snapshot of ISIS adjacencies, BGP sessions (up and down), the number of routes, traffic throughput, RIB and FIB size, and so on. Of course, in some cases the numbers will not be identical and that is okay (like the number of routes received in a full-feed advertisement), but in general it is still very helpful to have data to compare. If possible, pre- and post-checks should be standardized: every person involved in carrying out the change should know what the minimum set of checks per device type is. Depending on the change, additional commands can be executed, but having a list of the checks that always need to be performed helps (a minimal sketch of such a standardized check set follows this list).
  • When a change has been introduced, an intended configuration vs actually applied configuration check is a must. There are situations where for some reason (even as simple as a parser error) the configuration was not applied, or in a worse case, was applied partially. In the latter scenario, the resulting behavior can be erratic and quite difficult to diagnose afterwards.
  • Post-check verification is the complementary check to the pre-check - it should gather the same set of snapshots as was taken before the change and compare the results. Any (unexpected) discrepancies between the two should alert the operator to take a deeper look at the current state of the environment. Post-check verification must include a health-check overview of the whole infrastructure (the same as in the second point of this list) to make sure that our local changes did not affect any remote services.
  • Logs are also an important and very useful source of information. These should always be checked, even if all of the above shows no error. If there were warning messages, these should be taken care of.
  • Supervision, where all of the above steps are performed by one person with a second person supervising, makes human errors less likely.
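
To make the idea of a standardized minimum check set more concrete, here is a minimal sketch of how such a per-device-type definition could be kept in code. The device roles and the command list are hypothetical; actual show commands and their syntax vary by vendor and by organization.

```python
# Hypothetical minimum set of pre-/post-check commands per device role.
# The roles, commands and syntax are examples only; real deployments would
# keep this list in version control and extend it per change type.
STANDARD_CHECKS = {
    "core-router": [
        "show isis adjacency",
        "show bgp summary",
        "show route summary",
        "show interfaces statistics",
    ],
    "edge-router": [
        "show bgp summary",
        "show bgp neighbor received-routes",   # exact syntax is vendor-specific
        "show policy statistics",              # exact syntax is vendor-specific
    ],
}

def checks_for(device_role, extra=None):
    """Return the mandatory checks for a role plus any change-specific extras."""
    return STANDARD_CHECKS.get(device_role, []) + (extra or [])

# Example: checks to run on an edge router for a BGP policy change
print(checks_for("edge-router", extra=["show route receive-protocol bgp 203.0.113.1"]))
```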

Most of the tasks mentioned above can be automated. Today, let’s pay special attention to pre- and post-checks, and why automating them can save time and improve reliability.

How automation can help

Pre- and post-checks are snapshots of the operational state of the device that reflect not only the device itself but also the state of the surrounding environment. The intention is to combine pre- and post-checks to verify whether the modification was successful or not. However, as the network/infrastructure changes constantly, and the number of checks that are useful even for a relatively simple configuration change is large, doing it manually takes a lot of time. 

Moreover, if the time between the first and the last snapshot is too long it is quite possible that the state of the environment at the beginning is not the same as at the end. This is not a desirable situation; ideally one would like to have a snapshot from exactly the same moment in time, on all participating entities. Such a thing clearly cannot be achieved manually. This is where automation comes into play.

There are two main types of tasks one can automate: configuration tasks and operational tasks. Among the operational tasks are tasks dedicated to some maintenance work (like upgrades or applying patches) and tasks that gather network/infrastructure information. The latter can be used to fulfill the needs of pre- and post-checks.

Pre- and post-checks are in reality a set of show commands (or their equivalent) that tell the operator how the network/infrastructure looks from a device point of view. Taken from different devices (or software entities, as a network function may run in software form), they give reasonably complete information about the state of the environment. What is important is gathering the information as quickly as possible, and in parallel.

This can be achieved with different methods, depending on what the device/entity supports. It can be an Ansible playbook or an API call implemented in a Python script. The method itself is not that important; the outcome is. Even the way of presenting the information does not matter, as long as the format of the pre- and post-checks is the same (and contains the required information): it can be CLI-like output, a JSON file, an XML document or a database entry - as long as the type and content are the same for the pre- and post-check results, it is fine. However, JSON/XML output may be easier for scripts to parse in an automated environment, as CLI output emphasizes human readability.
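
As a sketch of how the collection could be automated and run in parallel, the example below uses Python's thread pool to snapshot several devices at once and store the result as JSON. The device names are hypothetical, and collect_snapshot() is a placeholder for whatever access method a given device supports (CLI over SSH, NETCONF, a REST API call, an Ansible run, and so on).

```python
# Minimal sketch of parallel snapshot collection. collect_snapshot() is a
# placeholder; the inventory and output layout are assumptions for illustration.
import json
import datetime
from concurrent.futures import ThreadPoolExecutor

DEVICES = ["edge-r1", "edge-r2", "core-r1"]  # hypothetical inventory

def collect_snapshot(device):
    """Gather the standard checks from one device and return them as a dict.
    Replace this stub with a Netmiko/NETCONF/REST/Ansible-based implementation."""
    return {"device": device, "bgp_sessions": {}, "isis_adjacencies": {}}

def take_snapshots(tag):
    """Collect snapshots from all devices in parallel and store them as JSON."""
    taken_at = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    with ThreadPoolExecutor(max_workers=len(DEVICES)) as pool:
        results = list(pool.map(collect_snapshot, DEVICES))
    with open(f"{tag}_{taken_at}.json", "w") as fh:
        json.dump({"taken_at": taken_at, "snapshots": results}, fh, indent=2)

take_snapshots("pre")    # before the change
# ... the change itself is applied here ...
take_snapshots("post")   # after the change
```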

The next step in automating pre- and post-implementation verification is automating the comparison of the before and after snapshots. If there were just one, an operator could do it manually, but with many lengthy snapshots in multiple formats, manual comparison could take a long time (not to mention being prone to error), and the whole idea is to get a quick answer about the success or failure of the change. In such cases, the comparison of the outputs must be automated. And again, it does not have to be very complex - sometimes even a simple diff will suffice - but it must be done in an automated way.
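
A simple comparison of two JSON snapshots could look like the sketch below. It walks two nested dictionaries and reports every difference, skipping fields that are expected to change; the "volatile" key names and the example data are assumptions for illustration only.

```python
# Minimal sketch of an automated pre/post comparison over nested snapshot
# dictionaries. Keys listed as volatile (counters, timestamps) are ignored;
# which fields those are depends on the data actually collected.
VOLATILE_KEYS = {"taken_at", "uptime", "byte_counters"}

def diff_snapshots(pre, post, path=""):
    """Return a list of human-readable differences between two nested dicts."""
    differences = []
    for key in sorted(set(pre) | set(post)):
        if key in VOLATILE_KEYS:
            continue
        here = f"{path}/{key}"
        if key not in post:
            differences.append(f"{here}: present only in the pre-check")
        elif key not in pre:
            differences.append(f"{here}: present only in the post-check")
        elif isinstance(pre[key], dict) and isinstance(post[key], dict):
            differences.extend(diff_snapshots(pre[key], post[key], here))
        elif pre[key] != post[key]:
            differences.append(f"{here}: {pre[key]!r} -> {post[key]!r}")
    return differences

# Example usage with hypothetical snapshot fragments
pre = {"edge-r1": {"bgp_sessions": {"203.0.113.1": "Established"}, "uptime": 100}}
post = {"edge-r1": {"bgp_sessions": {"203.0.113.1": "Idle"}, "uptime": 250}}
for difference in diff_snapshots(pre, post):
    print(difference)
```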

But how can we be sure that the snapshots will be taken correctly and that what is planned will cover all the verification needs? This is where having a Digital Twin can help. Discover the additional benefits that network automation as a service can bring to your organization.

We also encourage you to watch the panel discussion, where industry experts share their thoughts and insights on the network automation journey:

Using a Digital Twin

Having a lab where one can test new services, features and different deployment scenarios is a blessing. However, having a physical lab is not always possible, and certainly not at the scale that is sometimes needed. In such cases a Digital Twin is extremely useful. It is no different with pre- and post-check development.

When preparing for a network change one knows exactly what is to be done, which devices or services it involves, and what can be potentially impacted and how. Therefore a set of checks can be defined to be sure that all these potentially sensitive areas are covered during the verification phase. However, the development process is also prone to errors and it is necessary to check the verification tools before they are actually used. 

With a Digital Twin in use, one can first simulate the production environment (or the relevant part of it) and then apply the planned change to that simulation. The verification tests run before and after the change can show whether the checks offer sufficient coverage and are correct. If something is missing, they can be modified to fulfill the requirements. Moreover, with such a simulation, parallelism can also be verified: in a production environment the checks need to run on multiple devices at the same time, and the same can be simulated with a Digital Twin (and hardly ever with a physical lab, mainly because there are not that many boxes available), to ensure that it is doable and correct.

In other words a Digital Twin is as valuable for testing new features and services as for testing the tooling for management and maintenance of the network/infrastructure. It is also extremely helpful when working with automation, pre- and post-checks being no exception.

If you’re looking for more information about the Digital Twin concept, check out our previous publication about the practical applications of Digital Twins in modern networking.

Conclusion

This article is dedicated to pre- and post-check validation: how important these checks are and how automating them can add value. It is part of the Network Automation Journey series, where we cover different aspects of network and infrastructure automation.

Furthermore, this article intentionally focuses on verification of the operational state of the network and infrastructure, as this aspect is often overlooked when talking about operational tasks.

We want to emphasize that checking and ensuring the health of an infrastructure can be automated just as readily as basic invasive operations like upgrades.

Monika Antoniak

Director of Engineering and Professional Services

Monika’s background is in networking. She spent over 15 years in telcos designing, prototyping, testing and implementing solutions. She has worked with different vendors and thus looked at different realizations and different points of view. To keep abreast of rapidly evolving technology, she has broadened...
