2 September 2020

Network infrastructure planning

The modern, interoperable DC - Part 1: Solving "last mile" problems with BGP (video)

74 minutes


This video is a part of our webinar series "The modern, interoperable DC", which walks you through connectivity challenges between different types of resources.

Part 1: >>Solving “last mile” problems with BGP<< will guide you through a solution for DC connectivity based on a combination of FRR, unnumbered BGP (IPv4 with IPv6 link-local NH) and eBPF. This mix enables automated discovery of bare-metal and XaaS environments and can be run on any COTS hardware, as it uses only open-source software and standardized features.

In this video we explain:

  • The evolution of the data center and its impact on the last mile for operations

  • How to handle the growing number of devices and configurations needed

  • How to create a proper automated discovery in a data center using:

    • Unnumbered BGP (IPv4 with IPv6 link-local NH) - RFC 5549
    • Free Range Routing (with BGP neighbor auto-discovery enabled)
    • Juniper Networks and Arista network equipment
  • The type of issues you might encounter and how to overcome them

We will also talk about eBPF, BFD, ECMP, Spine&Leaf topology and Juniper Python automation, as all of these will play a major role here.

Source code, topology and configurations used during this presentation are available in our GitHub repo.

Adam

Hello everyone! We would like to welcome you to the first webinar in our three-part series, where we will cover the concept of a modern and interoperable data center. My name is Adam

Jerzy

and my name is Jerzy.

Adam

And we both work at CodiLime as network engineers specializing in these environments. We hope you will enjoy our presentation.

First, a few words about our company. CodiLime was founded in 2011; we now have about 200 people on board and we are still growing. While no longer a startup, we keep that spirit - a culture of agility, innovation and adaptability. Most of our team is located in Warsaw and Gdańsk in Poland. However, as we cooperate mainly with clients in the US, we always have part of our team visiting clients there.

So, all that said, we frequently work with modern DCs deployed in a Spine&Leaf fashion, often with some sort of SDN on top. Having some experience with older DCs made us wonder: is this architecture final, or can it be improved? What are the biggest issues right now? Today's presentation is all about that. While a lot of technologies will be introduced, don't worry - they are not new. We are just taking one step further in the evolution; we are not proposing any major changes.

Jerzy

That's right. And in this first webinar, we will focus mainly on the building blocks of a flexible, very scalable and easy-to-automate data center. This might surprise you, but one of these building blocks will be IPv6. We will leverage IPv6 addressing on the links between the networking devices, but also on the links between the servers and the top-of-rack switches. And while using IPv6 addresses, we'll still be able to advertise information about IPv4, and IPv4 will remain in use by our services, users and applications.

In order to advertise information about the available IP addresses within a data center, we will be using a dynamic routing protocol, which in our case will be BGP. Now, BGP is quite commonly used in large data centers. However, it is most often enabled only on the networking devices. We will also enable and run it on the servers themselves, and we will try to explain the benefits of such an approach.

OK, so in order to run the BGP protocol on a server, we need some kind of routing daemon, and we've chosen FRR, which is short for Free Range Routing and is open-source software.

Having this FRR, we will also show you how we can install and configure it automatically on each new server that is being added to the data center. At the end of the presentation we will also show you a demo of a working solution. So, we have a simple lab topology, presented here in the box on the right of the diagram, where we have three switches and two servers with the FRR routing daemon, as well as one legacy server that doesn't have any extra routing software installed and that has its IP address traditionally configured on a physical interface instead of a loopback interface. Now, if anyone is interested in the technicalities of this solution, we are going to put all of these configurations from the switches and from the servers on our GitHub page, and the link will be there at the end of the presentation.

Also, if you have any questions during the presentation, feel free to put them in the YouTube chat window. OK, now, before we continue to the main part of the presentation, a short teaser of what we can build when we run BGP on our networking devices as well as on our servers. This will actually be covered in the two webinars that are still to come. So, for example, we can run EVPN, short for Ethernet VPN. This will allow us to interconnect heterogeneous resources such as legacy servers without any routing daemons, containers and virtual machines.

And this interconnection will allow for Layer 2 connectivity, so they will be able to communicate as if they were connected to the very same switch. We will also be able to provide multi-tenancy with virtualization and VRFs, and we will also be able to extend this Layer 2 connectivity to resources located outside the data center. They might be located, for example, in some kind of public cloud, on edge servers in an edge computing use case, or on IoT devices.

And the communication will be encrypted, so it cannot be eavesdropped upon when being sent through a public network such as the Internet. So this was just a short teaser. And with that, let's continue to the main part of our presentation today.

Adam

However, before we go any further, let's take a few steps back to gain perspective. Let's take a small trip down memory lane and go back to the year 2003, when switches were simple, purely Layer 2 devices. This distinction strongly affected how networks were designed at the time. On top we had Layer 3 routers that were forwarding all the traffic between flat Layer 2 networks, and all the logic was placed at the top. Switches were very simple.

So a lot of Layer 2 issues existed at the time. Since routers were purely Layer 3 devices, most of the time each network consumed one router port, which affected scalability at some point. Also, networks back then were designed in a purely active/standby approach. This was forced by the STP and VRRP protocols used at the time. Now we are moving on to 2010.

Layer 3 switches were becoming more popular and more vendors began to offer them on the market. Since the price of a single routing port went down significantly and the VLAN routing interface became common, the network design adapted as well. On top, we had big modular Layer 2/Layer 3 devices doing switching and routing at the same time. On each access switch, we connected the networks using a VLAN trunk, and each switch was able to terminate any network available in the DC.

So at least some part of the work was moved down the chain. However, with multi-vendor equipment we were still limited to an active/standby design. All the routing was still done on the core switches and, as a result, saturation of the uplinks became a problem at some point. If you wanted to scale up, the only way was vertical scalability, which means faster uplinks and better core devices. This was pricey and time consuming. Now we are ending this short journey and arriving at the present times and the Spine&Leaf topology.

Today, combined Layer 2/Layer 3 devices are popular, small and not so expensive. As a result, we can put a routing device at the edge. This allows us to create a pure Layer 3 backplane with almost no Layer 2 traffic between network devices. To save uplink bandwidth, all routing between networks can be done on the leaves instead of the spines. Finally, in some cases, the edge can relieve the core of most of its work.

Since there is no vendor-specific protocol in this design, a fully Active/Active approach can be implemented without any vendor lock-in. The previous slides show us two things. First, in every step, a little bit of work was taken from the core devices and placed at the edge. The second thing was not shown at all, and that's our point as well: each time we built the perfect network for the time being, but we kept skipping the so-called last mile. The term "last mile" was originally used in the telecommunications industry to describe the difficulty of connecting end users' homes and businesses to the main telecommunications network.

This term was also used in logistics and transportation planning to describe the movement of people and goods from a transportation hub to the final destination. Here in our presentation, we use "last mile" to describe the link between the leaf and the server. As you probably remember, the servers were missing from the design each time. Why is that? Well, they are often managed by different people, and the people building the network are not among them. That distinction causes a lot of issues.

As the network and server departments are often placed in silos, a simple change request between them, such as a VLAN termination, can take weeks: a ticket must be created, accepted, processed, etc. The other issue is the need to terminate all networks and services on the leaves, which creates a complicated configuration there: BGP policies, VLANs, VTEPs, VRFs and so on must be configured and placed there. Networks are stretched at Layer 2 from the switch to the server, so Layer 2, with all its own problems, is still configured and still present on the switches.

This is an issue for us as well. Due to LACP limitations, failure detection on the server uplinks is slow: 90 seconds in slow mode, 3 seconds in fast mode. And even with all that, we cannot connect servers to multiple switches in an Active/Active manner without an EVPN already deployed in the data center. The other issue is that we still need to keep track of and maintain the IPv4 address space, which means that in a medium-sized data center we need to allocate over 200 prefixes just to interconnect the switches.

Not to mention, we need to run redundant DHCP or other IP management software for the servers. Finally, to introduce service load balancing or redundancy, we need to depend on either external services such as HAProxy or some kind of external hardware. So what steps can be taken to remedy those issues?

Jerzy

OK, so in our presentation we will show several such steps that lead us to a working solution, and now we'll take the time to present each one of them. First off, the IPv6 protocol, which has several advantages over IPv4. One of these advantages is its capability to automatically assign a so-called link-local address to every interface that has IPv6 enabled. A link-local address can be generated randomly.

However, it is usually based on the MAC address of the physical interface. So, here on the diagram, we have a MAC address on an interface and we can clearly see that the link-local address is derived from this MAC address. It is possible to have the same link-local address configured on different interfaces of a single device. This often happens when, for example, we have a physical interface with several VLAN subinterfaces, and each of them can share the same link-local address. Now, why is it important for us? Why are we using this in our approach?

Well, if we were to use IPv4, we would have to create a unique subnetwork for each of the connections between the switches and between the switches and the servers, and make sure not to duplicate the IP addresses and that they are configured correctly. That could be quite a lot of work. Now, thanks to IPv6, all of these links are automatically addressed and we do not need to configure them further; we just need to enable IPv6 on the interfaces. Another thing which is important for us is that routing protocols such as BGP can use these automatically assigned link-local addresses to exchange information about the IP addresses configured in our data center.
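As an aside, here is a minimal Python sketch (not part of the original webinar materials) of the MAC-based derivation described above, the so-called EUI-64 method: the MAC is split in half, ff:fe is inserted in the middle, the universal/local bit is flipped, and the result is attached to the fe80::/64 prefix. The example MAC address is purely illustrative.

```python
# Sketch: derive an EUI-64 based IPv6 link-local address from a MAC address.
import ipaddress

def mac_to_link_local(mac: str) -> ipaddress.IPv6Address:
    # Parse the MAC into six bytes.
    octets = [int(part, 16) for part in mac.split(":")]
    # Flip the universal/local bit in the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe in the middle to form the 64-bit interface identifier.
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
    iface_id = int.from_bytes(bytes(eui64), "big")
    # Attach the interface identifier to the fe80::/64 link-local prefix.
    return ipaddress.IPv6Address((0xFE80 << 112) | iface_id)

print(mac_to_link_local("52:54:00:12:34:56"))  # -> fe80::5054:ff:fe12:3456
```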

Another important thing that you can notice on the slide is the Neighbor Discovery Protocol, which is also part of the IPv6 standard. It allows devices acting as routers to periodically advertise information about the IP addresses configured on their interfaces, as well as the MAC addresses mapped to those IP addresses. So the devices here, the server with its FRR routing daemon as well as the switch, are able to learn about each other and put this information, the IP address and the MAC address of the neighbor, in the neighbor table. Now, this is quite important for us when we consider how we want to establish a BGP session, how we want to enable the BGP protocol to advertise the routing information.

In a normal BGP setup, we would actually need to manually set the neighbor's IP address as well as the so-called autonomous system number in order for the BGP session to be established. However, because we do have this information about the neighbor's IP address, we can leverage a mechanism supported by FRR, as well as by Arista, Cumulus Networks and Dell. This mechanism allows us to automatically establish a BGP session between neighbors that have discovered each other using the Neighbor Discovery Protocol. Now keep in mind that this feature, unfortunately, is not supported by all of the vendors.

So in some cases, this might require extra steps. In our case, we have a switch in the lab topology for which we created a Python script that detects changes in the neighbor table and automatically creates the appropriate configuration for a new BGP session; a simplified sketch of that idea follows below.
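The following is a hypothetical, simplified Python sketch of that idea: poll the IPv6 neighbor table and emit BGP neighbor statements for newly discovered link-local peers. The actual script used in the webinar is in the GitHub repo; the interface name and the emitted configuration lines below are illustrative placeholders, not exact vendor syntax.

```python
# Sketch: generate BGP neighbor configuration from discovered IPv6 neighbors.
import subprocess

def link_local_neighbors(interface: str) -> list[str]:
    """Return the IPv6 link-local addresses currently seen on an interface."""
    out = subprocess.run(
        ["ip", "-6", "neighbor", "show", "dev", interface],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split()[0] for line in out.splitlines()
            if line.startswith("fe80::")]

def render_bgp_neighbors(interface: str) -> str:
    """Render placeholder BGP neighbor statements for each discovered peer."""
    return "\n".join(
        f"# peer discovered via NDP on {interface}\n"
        f"neighbor {addr} remote-as external   # adapt to your platform's syntax"
        for addr in link_local_neighbors(interface)
    )

if __name__ == "__main__":
    print(render_bgp_neighbors("eth1"))
```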

Now, when the BGP session is established, our devices can start to advertise information about the IP addresses they know about. However, we have IPv4 addresses here, while the connection between the devices is IPv6. This used to be a problem; however, there is an extension to the BGP protocol, RFC 5549, that does allow this kind of connectivity. So BGP is able to advertise IPv4 prefixes with IPv6 next hops. This RFC is already 11 years old and is supported by many of the leading networking vendors. So the switch will advertise the IP address using the BGP protocol.

Our server will receive this message and record in its routing table that this network is reachable, and that packets destined for it should be forwarded to this IPv6 address, which in turn is mapped to the physical address of the neighboring switch. So thanks to that, it is perfectly possible to have a networking core configured with IPv6-only addresses, while the services, applications and users can still use IPv4.
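To make this concrete, here is a small sketch (assuming a reasonably recent Linux kernel and iproute2, which support IPv4 routes with IPv6 next hops) of what the RFC 5549 result looks like on the server: an IPv4 prefix whose next hop is the switch's link-local IPv6 address. FRR programs such routes itself; the prefix and addresses below are illustrative only.

```python
# Sketch: add and inspect an IPv4 route with an IPv6 (link-local) next hop.
import subprocess

prefix = "10.0.1.0/24"                    # IPv4 network learned over BGP
nexthop = "fe80::5054:ff:fe12:3456"       # switch's IPv6 link-local address
interface = "eth1"

# Equivalent to: ip route add 10.0.1.0/24 via inet6 fe80::... dev eth1
subprocess.run(
    ["ip", "route", "add", prefix, "via", "inet6", nexthop, "dev", interface],
    check=True,
)

# The route is displayed with "via inet6 <link-local> dev eth1".
print(subprocess.run(["ip", "route", "show", prefix],
                     capture_output=True, text=True).stdout)
```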

One more thing that I want to show on the slide is that we have configured the IP address for the applications on the loopback interface instead of a physical interface. The reason is that if a physical interface goes down, its IP address also becomes unreachable, and potentially so do the applications and services using that IP address. By putting the IP address on a loopback interface, which is always up and running, we ensure that as long as the server is connected to at least one of the switches in the network, this IP address will still be reachable and available in the data center, and users will be able to use the service behind it.

So this is obviously especially important when we have a topology with redundancy. Another thing about the BGP protocol is its capability to find more than one path to the destination network. So in this example, we've got this subnetwork over here and we can see that it is reachable through two interfaces, and the cost of each path, that is, the number of devices that need to be traversed in order to get to the destination network, is the same. So the BGP protocol is able to put both of these paths in the routing table simultaneously, and our server will know that it can reach the destination network through both Switch 1 and Switch 2.

And it will be able to forward the packets to either one of them, which is basically load balancing. This type of load balancing in the networking world is called equal-cost multipath (ECMP), and it is per-session load balancing, not per-packet. The sessions are split based on a hash calculated from values found in the packet header: usually the source IP, destination IP, protocol (for example ICMP, UDP or TCP) and, if the packet header includes them, the source port and the destination port (see the sketch below).
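Here is a minimal Python sketch of that per-session behaviour; it is a simplified model for illustration, not how a switch ASIC actually computes its hash. The addresses, ports and next hops are made up.

```python
# Sketch: per-session ECMP - hash the 5-tuple and pick one equal-cost next hop,
# so every packet of a given flow always takes the same path.
import hashlib

def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

paths = ["fe80::1%eth1", "fe80::2%eth2"]   # link-local next hops via two switches
print(pick_next_hop("10.0.0.1", "10.0.1.1", "tcp", 40001, 443, paths))
```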

If we wanted, for example, to add more throughput between our server and the switches, we could just add a new connection. BGP will automatically establish a session on this connection and add a new next hop to the routing table for better load balancing. Now, what would happen if one of the links or one of the switches went down? BGP has a built-in mechanism to detect such failures and, when using the default timers, the failover should be detected and executed within about 90 seconds. We could reconfigure that and go down to three seconds for failover. However, in the case of a data center, this is usually still not fast enough, because there might be applications running which have their own high-availability mechanisms.

And we would like the network to fail over so fast that these mechanisms do not even notice that there was some kind of problem in the network. We can achieve that using the Bidirectional Forwarding Detection (BFD) protocol, which is another open, interoperable standard. It is a simple protocol for basically detecting whether the remote device is reachable. It can cooperate with various mechanisms, and one of them is the BGP routing protocol. But most importantly, the keepalives that we can configure for BFD can go way down below one second.

In the case of FRR, the minimum value for keepalives is 10 milliseconds, so a potential failover can happen within 30 milliseconds. In the case of switches, it might depend on the vendor and on the model: some switches might allow millisecond-level keepalives, while others might allow 300-millisecond keepalives. There are also devices capable of hardware acceleration for BFD, and in such cases the keepalives can be set as low as 3 milliseconds. So either way, we have a very good mechanism to detect failures, and if any failure actually happens, the traffic will be very rapidly rerouted to the other available paths.
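The arithmetic behind those numbers is simple; the small sketch below assumes the common BFD detect multiplier of 3 (so, for example, 10 ms keepalives give roughly a 30 ms detection time, as mentioned above). The exact multiplier and limits depend on the platform.

```python
# Sketch: BFD detection time = keepalive (TX) interval * detect multiplier.
def bfd_detection_time_ms(tx_interval_ms: int, detect_multiplier: int = 3) -> int:
    """Time after which the session is declared down once keepalives stop."""
    return tx_interval_ms * detect_multiplier

for interval in (10, 300, 3):   # FRR minimum, a typical switch limit, HW-accelerated
    print(f"{interval} ms keepalives -> failover in ~{bfd_detection_time_ms(interval)} ms")
```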

Adam

So, in the solution that we will present in the next few slides, we will use the Free Range Routing software, or FRR in short. FRR is network routing software supporting OSPF, BGP, IS-IS, LDP and many other protocols. Unlike the BIRD alternative, FRR fully supports EVPN with full forwarding-plane integration: prefixes, MACs, VRFs can all be sent either to the kernel or to a DPDK forwarding plane to allow traffic to flow according to the control-plane directives. FRR was forked from Quagga, software well known to many people, as the pace of Quagga's development was frustrating to some of the developers. Currently, FRR contributors include Cumulus Networks, 6Wind, BigSwitch Networks and many others. The developers respond promptly: one of the issues we encountered during the demo was solved in less than two weeks. FRR is also a collaborative project of the Linux Foundation.

So, we mentioned FRR as a router. At first this seems like a complicated thing to consider and install on a server. However, there are already various tools that allow us to customize the operating system installation, for example MaaS, Cobbler, Foreman, cloud-init, etc. As FRR itself is available as a Linux package, it can easily be included in this process. As for the configuration, we can use one common template. The only difference between each bare-metal server is one line: the IP address of the loopback interface. Everything else is the same. You can see that line in orange on the current slide. This process of changing and assigning one IP address can be automated using the tools mentioned before, or using just plain Ansible and a Jinja template, as sketched below. With all that in mind, we can start putting all the pieces together.
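As a minimal sketch of that templating idea, here is a jinja2-based example in the spirit of the Ansible/Jinja approach mentioned above: one FRR configuration template is rendered for every server, and only the loopback IP changes. The template body, interface names and ASN below are illustrative, not the exact configuration from the repo.

```python
# Sketch: render a per-server FRR configuration from one shared template.
from jinja2 import Template

FRR_TEMPLATE = Template("""\
interface lo
 ip address {{ loopback_ip }}/32
!
router bgp {{ asn }}
 neighbor fabric peer-group
 neighbor fabric remote-as external
 neighbor eth1 interface peer-group fabric
 neighbor eth2 interface peer-group fabric
 address-family ipv4 unicast
  redistribute connected
 exit-address-family
""")

# Only the loopback IP (and possibly the ASN) differs between servers.
print(FRR_TEMPLATE.render(loopback_ip="10.0.100.11", asn=65101))
```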

Jerzy

Yes. So now that we have all the pieces of the puzzle, let's see how they fit. So here on the slide we have a topology where the leaf switches as well as Server2 are already configured. The BGP is up and running and we can see that the informat