
2 September 2020

Network infrastructure planning

The modern, interoperable DC - Part 1: Solving "last mile" problems with BGP (video)

74 minutes reading


This video is a part of our webinar series "The modern, interoperable DC", which walks you through connectivity challenges between different types of resources.

Part 1: >>Solving “last mile” problems with BGP<< will guide you through a solution for DC connectivity based on a combination of FRR, unnumbered BGP (IPv4 with IPv6 link-local NH) and eBPF. This mix produces an automated discovery of bare metal and XaaS environments and can be run on any COTS as it uses only open-source and standardized features.

In this video we explain:

  • The evolution of the data center and its impact on the last mile for operations

  • How to handle the growing number of devices and configurations needed

  • How to create a proper automated discovery in a data center using:

    • Unnumbered BGP (IPv4 with IPv6 link-local NH) - RFC 5549
    • Free Range Routing (with BGP neighbor auto discovery enabled)
    • Juniper Networks and Arista network equipment
  • The type of issues you might encounter and how to overcome them

We will also talk about eBPF, BFD, ECMP, Spine&Leaf topology and Juniper Python automation, as all of these play a major role here.

The source code, topology and configurations used during this presentation are available in our GitHub repo.


Hello everyone! We would like to welcome you to the first webinar in our three-part series, where we will cover the concept of a modern and interoperable data center. My name is Adam


and my name is Jerzy.


And we both work at CodiLime as network engineers specializing in these environments. We hope you will enjoy our presentation.

First, a few words about our company. CodiLime was founded in 2011 and now we have about 200 people on board and we are still growing. While no longer a startup, we continue to keep its spirit - a culture of agility, innovation and adaptability. Most of our team is located in Warsaw and Gdańsk in Poland. However, as we cooperate mainly with clients in the US, we always have a part of our teams visiting our clients there.

So all that said, we frequently work with modern DCs deployed in a Spine&Leaf fashion, often with some sort of SDN on top. Having some experience with older DCs got us thinking: is this architecture final or can it be improved? What are the biggest issues right now? Today's presentation is all about that. While a lot of technologies will be introduced, don't be afraid. They are not new. We are just taking one step further in the evolution. We are not proposing any major changes.


That's right. And in this first webinar, we will focus mainly on the building blocks of a flexible, very scalable and easy-to-automate data center. And this might surprise you, but one of these building blocks will be IPv6. We will leverage IPv6 addressing on the links between the networking devices, but also on the links between the servers and the top-of-rack switches. And while using IPv6 addresses, we'll still be able to advertise information about IPv4, and IPv4 will be available for use by our services, users and applications.

In order to advertise information about the IP addresses available within a given data center, we will be using a dynamic routing protocol, which in our case will be BGP. Now, BGP is quite commonly used in large data centers. However, it is most often enabled only on the networking devices. We will also enable it and run it on the servers themselves, and we will try to explain the benefits of such an approach.

OK, so in order to run the BGP protocol on a server, we need some kind of routing daemon, and we've chosen FRR, which is short for Free Range Routing and is open-source software.

So having this FRR, we will also show you how we can install and configure it automatically on each new server that is being added to the data center. And at the end of the presentation we will also show you a demo of a working solution. So, we have a simple lab topology, which is presented here in the box on the right of the diagram, where we have three switches and two servers with the FRR routing daemon, as well as one legacy server that doesn't have any extra routing software installed and that has an IP address traditionally configured on its physical interfaces instead of a loopback interface. Now, if anyone is interested in the technicalities of this solution, we are going to put all of these configurations from the switches and from the servers on our GitHub page, and the link will be there at the end of the presentation.

Also, if you have any questions during the presentation, feel free to put them in the YouTube chat window. OK, now, before we continue to the main part of the presentation, a short teaser of what we can build when we are running BGP on our networking devices as well as on our servers. This will actually be covered in the two webinars that are still to come. So, for example, we can run EVPN, short for Ethernet VPN, with VXLAN tunneling. This will allow us to interconnect heterogeneous resources such as legacy servers without any routing daemons, containers and virtual machines.

And this interconnection will allow for Layer 2 connectivity, so they will be able to communicate as if they were connected to the very same switch. We will also be able to provide multi-tenancy with virtualization and VRFs, and we will be able to extend this Layer 2 connectivity to resources that are located outside the data center. They might be located, for example, in some kind of public cloud, on edge servers in an edge computing use case, or on IoT devices.

And the communication will be encrypted so it cannot be eavesdropped upon when being sent through a public network such as the Internet. So this was just a short teaser. And with that, let's continue to the main part of our presentation today.


However, before we go any further, let's take a few steps back to gain perspective. Let's take a small trip down memory lane and go back to the year 2003, when the switches were simple, purely Layer 2 devices and the routers were purely Layer 3 devices. This distinction strongly affected how networks were designed at the time. On top here we had Layer 3 routers that were forwarding all the traffic between flat Layer 2 networks, and all the logic was placed on the top. Switches were very simple.

So a lot of Layer 2 issues existed at the time. Since routers were purely Layer 3 devices, most of the time each network consumed one router port, which at some point affected scalability. Also at the time, networks were designed in a purely active/standby approach. This was forced by the STP and VRRP protocols that were used at this time. Now we are moving to 2010.

Layer 3 switches are becoming more popular and more vendors begin to offer them on the market. Since the price of one routing port went down significantly and the VLAN routing interface became common, the network design adapted as well. On top, we have big modular Layer 2/Layer 3 devices doing switching and routing at the same time. To each access switch, we connect the networks using a VLAN trunk, and each switch is able to terminate any network available in the DC.

So at least some part of the work was moved down the chain. However, when dealing with multi-vendor equipment, we were still limited to an active/standby design. Still, all the routing was done on the core switches and, as a result, the saturation of the uplinks became a problem at some point if you wanted to scale up. The only way was vertical scalability, which means faster uplinks and better core devices. This was pricey and time-consuming. Now we are ending this short journey and arriving at the present time, as well as at the Spine&Leaf topology.

Today, Layer 2 and Layer 3 devices are popular, small and not so expensive. As a result, we can put a routing device at the edge. This allows us to create a pure Layer 3 backplane with almost no Layer 2 traffic between network devices. To save uplink bandwidth, all routing between networks can be done on the leaves instead of the spines. Finally, in some cases, the edge can relieve the core of most of its work.

Since there is no vendor-specific protocol in this design, a fully active/active approach can be implemented without any vendor lock-in. Those previous slides show us two things. First, in every step, a little bit of work was taken from the core devices and placed on the edge. The second thing was not shown at all, and that is our point as well. Each time, we were building a network that was perfect for the time being, but each time we were skipping the so-called last mile. The term "last mile" was originally used in the telecommunications industry to describe the difficulty of connecting end users' homes and businesses to the main telecommunication network.

This term was also used in logistics and transportation planning to describe the movement of people and goods from the transportation hub to the final destination. So, here in our presentation, we are using the "last mile" to describe the link between the leaf and the server. As you probably remember, the servers were missing from the design each time. Why is that? Well, they are often managed by different people, and the people creating networks are not among them. That distinction causes a lot of issues.

As the network and server departments are often placed in silos, a simple change request between the network and the server, such as a VLAN termination, can take weeks. A ticket must be created, accepted, processed, etc. The need for termination is the other issue. The need to terminate all networks and services on the leaves creates a complicated configuration there. BGP policies, VLANs, VTEPs, VRFs and so on must be configured and placed there. Networks are stretched at Layer 2 from the switch to the server, so Layer 2, with its own problems, is still configured and still present on the switches.

This is an issue for us as well. Due to LACP limitations, failure detection on the server uplinks is slow: 90 seconds in slow mode and 3 seconds in fast mode. But even with all that, we cannot connect servers to multiple switches in an active/active manner without EVPN already deployed in the data center. The other issue is that we still need to keep track of and maintain the IPv4 address space, which means that in a medium-sized data center we need to allocate over 200 prefixes just to interconnect the switches.

Not to mention, we need to have redundant DHCP running, or other IP management software for the servers. Lastly, to introduce service load balancing or redundancy, we need to depend on either external services such as HAProxy or some kind of external hardware. So what steps can be taken to remedy those issues?


OK, so in our presentation we will show several such steps that lead us to the working solution. And now we'll take time to present each one of them. So first off, the IPv6 protocol, which has several advantages over IPv4. One of these advantages is its capability to automatically assign a so-called link-local address to every interface which has IPv6 enabled. The link-local address can be generated randomly.

However, it is usually based on the MAC address of the physical interface. So, here on the diagram, we do have a MAC address on an interface and we can clearly see that this link-local address is based on this MAC address. It is possible to have the same link-local address configured on different interfaces of a single device. This often happens when we, for example, have a physical interface with several VLAN subinterfaces, each of which can share the same link-local address. Now, why is it important for us? Why are we using this in this approach?
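The way a link-local address is usually derived from a MAC address can be sketched in a few lines of Python. This is a minimal illustration of the modified EUI-64 scheme; the MAC address below is a made-up example and is not taken from the lab topology.

```python
import ipaddress


def mac_to_link_local(mac: str) -> str:
    """Derive the modified EUI-64 IPv6 link-local address from a MAC address."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address such as 52:54:00:12:34:56")
    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe in the middle to stretch 48 bits to a 64-bit interface ID.
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
    interface_id = int.from_bytes(bytes(eui64), "big")
    # Prepend the fe80::/64 link-local prefix.
    return str(ipaddress.IPv6Address((0xFE80 << 112) | interface_id))


# Hypothetical MAC address, used only for illustration.
print(mac_to_link_local("52:54:00:12:34:56"))  # fe80::5054:ff:fe12:3456
```

Because the result depends only on the MAC address, several subinterfaces that share one MAC address end up with the same link-local address, as mentioned above.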

Well, if we were to use IPv4, we would have to create a unique subnetwork for each of the connections between the switches and between the switches and servers, make sure not to duplicate the IP addresses and make sure that they are configured correctly. This could be quite a lot of work. Now, thanks to IPv6, all of these links are automatically addressed and we do not need to configure them any further. We just need to enable IPv6 on the interfaces. And another thing which is important for us: routing protocols such as BGP can use these automatically assigned link-local addresses to exchange the information about the IP addresses configured in our data center.

Another important thing that you can notice on the slide is the Neighbor Discovery Protocol, which is also part of the IPv6 standard. It allows the devices that are acting as routers to periodically advertise information about the IP addresses configured on their interfaces, as well as the MAC address that is mapped to each IP address. So the devices here, the server with the FRR routing daemon as well as the switch, are able to learn about each other and put this information, the IP address and the MAC address of the neighbor, in the neighbor table. Now, this is quite important for us when we consider how we want to establish a BGP session and how we want to enable the BGP protocol to advertise the routing information.

In a normal setup of BGP, we would actually need to manually set the neighbor's IP address as well as the so-called autonomous system number in order for the BGP session to be established. However, because we do have this information about the neighbor's IP address, we can leverage a mechanism which is supported by FRR. It is also supported by Arista, Cumulus Networks and Dell. This mechanism allows us to automatically establish a BGP session between neighbors that have discovered each other using the Neighbor Discovery Protocol. Now keep in mind that this feature, unfortunately, is not supported by all of the vendors.

So in some cases, this might require extra steps. In our case, we do have a switch in the lab topology for which we created a Python script that detects changes in the neighbor table and automatically creates the appropriate configuration for a new BGP session. Now, when the BGP session is established, our devices can start to advertise the information about the IP addresses that they know about.
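The idea behind such a script can be sketched as follows. This is not the script used in the presentation: the way the neighbor table is obtained, the group name and the set-style configuration lines (modelled on Junos syntax) are all assumptions made for illustration and should be checked against the device documentation.

```python
from typing import Dict, List

# Hypothetical BGP group name, used only for illustration.
BGP_GROUP = "servers"


def new_neighbors(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Return link-local neighbors (address -> interface) seen only in the current snapshot."""
    return {
        address: interface
        for address, interface in current.items()
        if address not in previous and address.lower().startswith("fe80:")
    }


def render_bgp_config(neighbors: Dict[str, str], group: str = BGP_GROUP) -> List[str]:
    """Render one illustrative configuration line per newly discovered neighbor."""
    return [
        f"set protocols bgp group {group} neighbor {address} local-interface {interface}"
        for address, interface in sorted(neighbors.items())
    ]


# Two snapshots of the neighbor table (made-up addresses and interfaces).
before = {"fe80::1": "ge-0/0/0.0"}
after = {"fe80::1": "ge-0/0/0.0", "fe80::5054:ff:fe12:3456": "ge-0/0/1.0"}

for line in render_bgp_config(new_neighbors(before, after)):
    print(line)
```

A real script would also have to remove configuration for neighbors that disappear and push the result to the device, which is left out here.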

However, we have IPv4 addresses here and the connection between the devices is in IPv6. This was a problem some time ago. However, there is an extension to the BGP protocol, RFC 5549, that does allow this kind of connectivity. So BGP is able to advertise IPv4 prefixes with IPv6 next hops. This RFC is already 11 years old and is supported by many of the leading networking vendors. So the switch will advertise the IP address using the BGP protocol.

Our server will receive this message and put information in the routing table that this network is reachable and that, if we need to send packets to this network, we should forward them to this IPv6 address. This address is in turn mapped to the physical address of the neighboring switch. So thanks to that, it is perfectly possible to have a networking core which is configured with IPv6-only addresses, while the services, applications and users can still use IPv4.
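The lookup chain described here (IPv4 prefix, then IPv6 link-local next hop, then MAC address from the neighbor table) can be modelled with a small Python sketch. All prefixes, addresses and MAC addresses below are made up for illustration; a real kernel forwarding table is of course more involved.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class NextHop:
    address: str    # IPv6 link-local address of the neighbor
    interface: str  # a link-local address is only meaningful per interface


# IPv4 prefixes learned over BGP, each with an IPv6 next hop (hypothetical values).
ROUTES: Dict[ipaddress.IPv4Network, List[NextHop]] = {
    ipaddress.ip_network("10.0.0.2/32"): [NextHop("fe80::1", "enp2")],
    ipaddress.ip_network("0.0.0.0/0"): [NextHop("fe80::2", "enp3")],
}

# Neighbor table filled in by the Neighbor Discovery Protocol (hypothetical values).
NEIGHBORS: Dict[Tuple[str, str], str] = {
    ("fe80::1", "enp2"): "52:54:00:aa:bb:01",
    ("fe80::2", "enp3"): "52:54:00:aa:bb:02",
}


def resolve(destination: str) -> Optional[Tuple[str, str]]:
    """Return (outgoing interface, destination MAC) for an IPv4 destination."""
    address = ipaddress.ip_address(destination)
    matches = [prefix for prefix in ROUTES if address in prefix]
    if not matches:
        return None
    best = max(matches, key=lambda prefix: prefix.prefixlen)  # longest prefix match
    hop = ROUTES[best][0]
    return hop.interface, NEIGHBORS[(hop.address, hop.interface)]


print(resolve("10.0.0.2"))
print(resolve("192.0.2.10"))
```

The IPv4 packet itself is unchanged; the IPv6 next hop is only used to find the outgoing interface and the MAC address for the Ethernet frame.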

One more thing that I want to show on the slide is that we have configured the IP address for the applications on the loopback interface instead of a physical interface. The reason for that is that if a physical interface goes down, its IP address also becomes unreachable, and potentially so do the applications and services which are using this IP address. Now, by putting the IP address on a loopback interface, which is always up and running, we ensure that as long as the server is connected to at least one of the switches in the network, this IP address will still be reachable and available in the data center, and the users will be able to use the service that is using this IP.

So this is obviously especially important in a topology where we've got redundancy. And another thing about the BGP protocol is its capability to find more than one path to the destination network. So in this example, we've got this subnetwork over here and we can see that it is reachable through two interfaces, and the cost of this path, so the distance, the number of devices that need to be traversed in order to get to the destination network, is the same. So the BGP protocol is able to put both of these paths in the routing table simultaneously, so our server will know that it can reach the destination network through both Switch 1 and Switch 2.

And it will be able to forward the packets to either one of them. And basically this is load balancing. This type of load balancing in the networking world is called equal-cost multi-path (ECMP), and it is load balancing per session, not per packet. The sessions are split based on a hash that is calculated from the values that can be found in the packet header. Usually these values are the source IP, the destination IP, the protocol (for example ICMP, UDP or TCP) and, if the packet header includes them as well, the source port and the destination port. If we wanted, for example, to add more throughput between our server and the switches, we might just add a new connection.
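Per-session load balancing of this kind can be sketched in Python as below. The hash function is an assumption: real switches and kernels use their own, usually much cheaper, hash functions, and SHA-256 only stands in for them here. The addresses and next-hop names are made up.

```python
import hashlib
from typing import Sequence


def pick_next_hop(src_ip: str, dst_ip: str, protocol: str,
                  src_port: int, dst_port: int,
                  next_hops: Sequence[str]) -> str:
    """Pick a next hop for a flow from a hash of its 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{protocol}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Every packet of the same session yields the same index.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical next hops: one per uplink of the server.
uplinks = ["leaf1 via enp2", "leaf2 via enp3"]

# Sessions that differ only in the source port may land on different uplinks,
# while each single session always stays on one uplink.
for port in range(40000, 40005):
    print(port, pick_next_hop("10.0.0.1", "10.0.0.2", "tcp", port, 80, uplinks))
```

Keeping a whole session on one path avoids packet reordering within that session, which is why the split is done per session and not per packet.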

BGP will automatically establish a session on this connection and add a new next hop to the routing table for better load balancing. Now what would happen if one of the links or one of the switches went down? BGP has a built-in feature that allows it to detect such failures, and when using the default timers, the failure should be detected and the failover executed within about 90 seconds. We could reconfigure that and go down to three seconds for failover. However, in the case of a data center, this is usually still not fast enough because there might be some applications running which have their own high availability mechanisms.

And we would like the network to fail over so fast that these mechanisms do not even notice that there was some kind of problem in the network. We can achieve that using the Bidirectional Forwarding Detection (BFD) protocol, which is another open, interoperable standard. It is a simple protocol for basically detecting if the remote device is reachable. It can cooperate with various mechanisms, and one of them is the BGP routing protocol. But most importantly, the keepalives that we can configure for BFD can go way down below one second.

In the case of FRR, the minimum value for keepalives is 10 milliseconds, so a potential failover can happen within 30 milliseconds. In the case of switches, it might depend on the vendor and on the model. Some switches might allow millisecond keepalives, while others might allow 300-millisecond keepalives. There are also devices which are capable of hardware acceleration for BFD, and in such a case the keepalives can be set as low as 3 milliseconds. So either way, we have a very good mechanism to detect failures, and in case of any failure actually happening, the traffic will be rapidly rerouted to other available paths.
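The relation between the keepalive interval and the failover time quoted above is simple arithmetic: a neighbor is declared down after a number of consecutive keepalives has been missed. The sketch below assumes a detect multiplier of 3, which is a common default but is not stated in the presentation.

```python
def bfd_detection_time_ms(keepalive_interval_ms: int, detect_multiplier: int = 3) -> int:
    """Time after which a silent neighbor is declared down.

    The detect multiplier of 3 is an assumed default, not a value from the talk.
    """
    return keepalive_interval_ms * detect_multiplier


print(bfd_detection_time_ms(10))   # 30, matching the 30 milliseconds quoted for FRR
print(bfd_detection_time_ms(300))  # 900
print(bfd_detection_time_ms(3))    # 9, for hardware-accelerated BFD
```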


So, in the solution that we will present in the next few slides, we will use Free Range Routing software, or FRR for short. FRR is network routing software supporting OSPF, BGP, IS-IS, LDP and many other protocols. Unlike the BIRD alternative, FRR fully supports EVPN with full forwarding plane integration. Prefixes, MACs, VRFs: all can be sent either to the kernel or to a DPDK forwarding plane to allow traffic to flow according to the control plane directives.

FRR was forked from Quagga, software well known by many people, as the pace of Quagga development was frustrating to some of the developers. Currently, the FRR contributors include Cumulus Networks, 6Wind, BigSwitch Networks and many others. Developers respond promptly, and one of the issues that we encountered during the demo was resolved in less than two weeks. FRR is also a collaborative project of the Linux Foundation.

So, we mentioned FRR as a router. At first, this seems like a complicated thing to consider and install on a server. However, there are already various tools that allow us to customize operating system installation, for example MaaS, Cobbler, Foreman, Cloud-init, etc. As FRR itself is available as a Linux package, it can be easily included in this process.

As for the configuration, we can use one common template. The only difference between the bare metal servers is one line: the IP address of the loopback interface. Everything else is the same. We can see that line in orange on the current slide. This process of changing and assigning one IP address can be automated using the tools mentioned before, or using just plain Ansible and a Jinja template. With all that in mind, we can start putting all the pieces together.
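The "one common template, one differing line" idea can be sketched with Python's built-in string.Template, which here stands in for the Jinja template mentioned above. The configuration text is an illustrative FRR fragment and not the configuration from the presentation; the AS number and the loopback addresses are made up, and the interface names follow the ones shown later in the demo.

```python
from string import Template

# Illustrative FRR configuration fragment; only the loopback address varies per server.
FRR_TEMPLATE = Template("""\
interface lo
 ip address $loopback_ip/32
!
router bgp $asn
 neighbor enp2 interface remote-as external
 neighbor enp3 interface remote-as external
 address-family ipv4 unicast
  network $loopback_ip/32
 exit-address-family
!
""")


def render_frr_config(loopback_ip: str, asn: int = 65001) -> str:
    """Render the per-server configuration (the AS number is a hypothetical default)."""
    return FRR_TEMPLATE.substitute(loopback_ip=loopback_ip, asn=asn)


# Hypothetical inventory: server name -> loopback address.
for server, loopback in {"bms1": "10.0.0.1", "bms2": "10.0.0.2"}.items():
    print(f"### {server}")
    print(render_frr_config(loopback))
```

Note that the neighbor lines name interfaces and not IP addresses, so nothing about the switches has to be known when the template is rendered.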


Yes. So now that we have all the pieces of the puzzle, let's see how they fit. So here on the slide we have a topology where the leaf switches as well as Server2 are already configured. BGP is up and running and we can see that the information about the IP addresses is already exchanged and available in the routing tables. Now we get a new server, we take it out of the box and we put it in a server rack. Then we connect the cables to the appropriate interfaces and we push the power button.

The server will get the operating system image from the provisioning server and it will also receive some configuration scripts. The scripts are there, for example, to make sure that IPv6 is enabled on the Ethernet 1 and Ethernet 2 interfaces and that FRR is installed and configured. They will also make sure to put an appropriate, unique IP address on the loopback interface of this new server. Within minutes of powering on, the server is up and running, and it will start to advertise its link-local IPv6 addresses using the Neighbor Discovery Protocol.

It will also receive such advertisements from both the Leaf1 and Leaf2 switches and it will be able to automatically establish a BGP session. Now it is important to keep in mind that all of this happens without any intervention from the networking guys. No configuration changes are needed on the networking devices in order for such new servers to automatically connect to the rest of the network. So as soon as BGP is up and running, the switches will receive information about the loopback address of the server and they will put this information in the routing table. They will also take this information and forward it to every other BGP neighbor that they have configured. In our case, it is just Server2 in this small example topology.

However, in an actual data center, this prefix would be forwarded to all of the other switches and all of the other servers that are running the BGP protocol. These devices will very quickly learn about this new IP address that has been added to the network. So we can see that Server2 indeed received this information from Leaf2 and Leaf1 and has put it in the routing table, knowing that it can forward the packets to the destination through both of these switches, so it can do the load balancing.

And this exchange of information will also take place in the opposite direction. So the leaf switches will also advertise the prefixes that they know to the new server, and the new server will again very quickly learn about all of the IP addresses configured throughout the data center. So the devices will be able to communicate with each other. They will use load balancing and, in case of any failure, the traffic will rapidly fail over to other available paths.


Now it's time for a simple demo. The demo itself is divided into five parts. First, we will show the BGP autodiscovery that we mentioned earlier, where we will show how a newly added BMS (bare metal server) becomes visible to the rest of the network topology. Then we will present a simple traffic flow between two bare metal servers equipped with FRR. Since this solution, as we mentioned before, is only an evolution, not a revolution, we will show that access from the legacy server equipped with an IPv4 address only is still possible.

We mentioned fast failover before and we will show it here as well, in the next part. If time allows, in the last part we will concentrate on IPv4 load balancing using anycast. But first, let's start with a simple topology presentation. Here we have six important parts of the topology. On the bottom, we have two bare metal servers, each one equipped with FRR. On the right, we have the legacy bare metal server with no routing software and only one active uplink towards the first leaf on the left. The second uplink is not active, so it will not be used. In the middle, we have two leaves.

On the left, we have the Juniper device equipped with a loopback and the VLAN address to the legacy server, and two links to the FRR bare metal servers. On the top, we have the Juniper spine switch; there is no specific configuration data that is worth mentioning right now. And on the right we have Leaf2, also equipped with the loopback and the VLAN2 and connected to each of the servers as well. With all that said, let's move to the demo.


OK, so here we've got four terminals. We've got a terminal for the BMS1 server and for BMS2, and for both switches, Leaf1 and Leaf2. Right now, in this topology, BMS1 does not have FRR running yet. The routing daemon is disabled and we can see that the routing table of this server is pretty much empty when compared to the routing table of BMS2, which already knows about the IP addresses that are available in the data center. Now let's check the information from the switches and see which BGP neighbors they see. All right. So we've got a Juniper switch which sees an established BGP session, and this is the session to BMS2. It also has one more session configured. However, it is not yet established; it is in the Active state.

So, it tries actively to establish this BGP session. And this is the session to BMS1, which has the routing daemon disabled. Now let's see if this is the same on the Arista switch, where the command outputs are a little bit different. Over here, we've got a link-local address neighbor. We see that there is an established BGP session and this is BMS2. We've got another session here, which is actually the spine switch. However, there is no session to BMS1.

Again, this is because the routing daemon is disabled there. Okay, so before we enable the routing daemon on BMS1, let's see the configuration of the routing software. This configuration is about which neighbors have been configured for BGP. So we can see that we do have a configuration saying that we want to have neighbors on the enp2 and enp3 interfaces. However, there are no IP addresses for these neighbors specified here. They will be automatically detected thanks to the Neighbor Discovery Protocol, so we do not need to put them there. OK, so let's start the routing daemon and see if anything happens over here.

OK, FRR is started. In our topology in this lab, we have set the router advertisements of the Neighbor Discovery Protocol to advertise information about the IP addresses every thirty seconds, so it might take up to thirty seconds to establish this BGP session. So let's see on Leaf1 whether or not a new session is established. It is indeed, and it has been up for four seconds right now. OK, on the other switch, it should also be established. Let's see that again. We've got an extra one here. We had one session to a link-local address BGP neighbor and now we have two. So, so far so good.

OK, BMS1, according to the topology, has an IP address configured on its loopback interface. So let's see if this address was advertised using the BGP protocol. So the show route does indeed show us that the leaf switch knows about this prefix. And it also knows that it can get to this destination using two different paths. The first path has one hop and goes through the ge-0/0/1 interface. Let's see this on the diagram. OK, so we've got this Leaf1 switch and the destination is reachable through the 0/0/1 interface, indeed one hop away. And this is the IP address configured on the loopback interface.

OK, so what is this second path? Let's see that. We've got an AS path with three hops, and we are going through the 0/0/0 interface. OK, let's go back to the diagram. So BGP automatically detected that this IP address is reachable through the 0/0/0 interface and it is one, two, three hops away. This is the alternative path that would be used in case of a failure of the quicker and shorter directly connected path. And BGP already has this route prepared in case of some issues.
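Why the one-hop path is preferred and the three-hop path is only kept as a backup can be sketched as below. This is a deliberately reduced model: real BGP best-path selection compares several attributes (for example local preference) before it looks at the AS path length, and the AS numbers here are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Path:
    interface: str
    as_path: Tuple[int, ...]  # hypothetical AS numbers, one per hop


def best_path(paths: List[Path]) -> Path:
    """Prefer the path with the shortest AS path (a simplification of BGP best-path selection)."""
    return min(paths, key=lambda path: len(path.as_path))


paths_to_bms1 = [
    Path("ge-0/0/0", (65100, 65002, 65011)),  # via the spine and the other leaf
    Path("ge-0/0/1", (65011,)),               # directly connected server
]

active = best_path(paths_to_bms1)
backup = [path for path in paths_to_bms1 if path != active]
print("active:", active.interface)
print("backup:", [path.interface for path in backup])
```

If the direct path is withdrawn, running the same selection on the remaining paths returns the longer one, which is the failover behaviour described above.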

OK, now just to make sure, let's also see if this prefix is visible on the Leaf2 switch. So we are also looking for the same loopback address. And again, we can see two paths, one short and the second one a little bit longer, a backup path to BMS1. OK, but we've also mentioned that BMS1, the new server, should be able to receive all the other routes from our data center. So before we had only one route over here, and right now, after enabling BGP, we are checking the routing table and indeed we've got the default route with the next hops. So we've got load balancing and we've got the IP address of the BMS2 server.

And on BMS2, let's check the routing table again to see that it received the information, the IP prefix from BMS1. So everything looks good, everything is as expected. Right. So having this information about the available IP addresses advertised, we can now check if communication is actually possible between these loopback addresses. So here we are on the BMS1 server and we want to ping the loopback address of BMS2. The simplest test of them all. And we can see that, indeed, this does work. OK, so let's try something a little bit more complicated. Let's try creating TCP sessions. So we are going to open a Web server on port 18 and we are going to open a new connection every second.

So what we are hoping will happen is that the traffic will be load balanced, and we can check that on the topology right now. So we've got BMS1 and it is communicating with BMS2. So it can go through this path, which takes two hops, and it can also go through this path, which is also two hops away. So it can load balance the traffic, and we can try to check if this per-session load balancing indeed works. So this is the output of the Web server that is located on BMS2. And right now, let's see if the traffic is received through an interface on BMS2.

So what we are going to do is sniff packets. The tcpdump utility shows us the packets arriving, in this case, on the enp2 interface. And we can see that indeed there is some Web server traffic seen over here. All right. But if the load balancing is working, the traffic should also be visible on the second interface connected to the second switch. And again, it does indeed work. So we're all good here as well. Now, what we would also like to make sure of is that we are able to connect from the legacy server, which does not have FRR running, and that we can reach the same Web server on BMS2.

So again, we are creating a new TCP session every second and let's see if it works. So the BMS2 server is reachable. However, there will be a difference here. Let's go back to the diagram and see how BMS2 is connected with the legacy BMS. The legacy BMS uses only one link, the primary interface. The backup is shut down, so the connection to BMS2 can go directly through Leaf1. However, there is also an alternative backup path through the spine switch, the Leaf2 switch and only then to the destination, the BMS2 server.

Now, this backup path actually shouldn't be used because we do not want to unnecessarily utilize the uplink bandwidth to this spine device. BGP protocol should use these shorter paths and this is what happens in the default configuration that we are using. So let's check if that is true and if it does indeed work. So we've got these TCP sessions running every second. And again, let's check the packets arriving on the first interface enp2 and we can see, OK, we've got the Web server communication and now let's check the second interface. So this will be the enp3.

OK, let's wait, just in case, a little bit longer, but we can see that there is no traffic. So that is actually the thing that we wanted to see. The traffic is not being forwarded through the longer path, it is going straight through Leaf1 to the destination, the BMS2 server. OK, so, so far so good. The next thing that we wanted to show is a failover scenario. We will try to create a ping probe every 0.1 second, so 10 times per second, and this connection will go from BMS1 to the legacy server. So before we do that, let's check the topology again just to make sure we know what we are doing. And we've got BMS1, which wants to send traffic to the legacy BMS.
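The preference for the direct path can be illustrated with a small Python sketch. It models only one step of BGP best-path selection, the AS path length comparison, and assumes all earlier tie-breakers (such as local preference) are equal. The AS numbers and next-hop labels are hypothetical.

```python
def best_paths(candidates):
    """Return the next hops whose AS path is the shortest.

    `candidates` is a list of (next_hop, as_path) tuples. If several
    candidates share the shortest length, all are returned, which is the
    case where ECMP load balancing becomes possible.
    """
    shortest = min(len(as_path) for _, as_path in candidates)
    return [hop for hop, as_path in candidates if len(as_path) == shortest]


# Hypothetical view from Leaf1 towards the BMS2 prefix.
ROUTES_TO_BMS2 = [
    ("bms2-direct", [65102]),                 # learned straight from BMS2
    ("via-spine", [65000, 65002, 65102]),     # spine -> Leaf2 -> BMS2
]

if __name__ == "__main__":
    print(best_paths(ROUTES_TO_BMS2))
```

With these inputs only the direct path is selected, so the spine uplink carries no traffic for this destination unless the direct path disappears.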

So again, there are two paths: a shorter one through Leaf1, and a longer one that goes through Leaf2 to the spine switch, through Leaf1 and only then to the destination. And normally, obviously, the shorter path will be used. However, we will disable the enp2 interface and check whether or not the traffic will be automatically rerouted to this longer path. So let's see what happens. We are issuing this ping command, so 10 pings per second, and we will also check which interface is being used for forwarding to the destination IP at this moment.

So right now the enp2 interface is being used. And what we are going to do right now, we are going to disable this enp2 interface. So let's do it. OK, and that's it. The traffic has failed over to the alternative path. Let's make sure. All right. We see that it is now being forwarded through the enp3 interface. Let's see how quickly it happened. So we've got 384 packets transmitted and we have received 374. So we missed 10 packets. So the failover lasted around one second. Now, our BFD configuration sets the failure detection to be about nine hundred milliseconds. So we are pretty much at what we expected. One second is within the norm. All right. So let's put things back up and let's now try to enable this interface and see if the traffic will come back to this shorter path, which is better for us. So right now, the BGP session will again be established between the BMS and the leaf switch.
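The arithmetic behind the failover estimate can be written down as a short Python sketch. The packet counts and the 0.1 second probe interval come from the demo; the BFD timer values (300 ms interval, multiplier 3) are an assumption chosen only because they yield the roughly 900 ms mentioned, since the exact timers are not stated in the talk.

```python
def outage_seconds(transmitted, received, interval_s):
    """Estimate the outage from ping statistics: lost probes times interval."""
    lost = transmitted - received
    return round(lost * interval_s, 3)


def bfd_detection_ms(interval_ms, multiplier):
    """BFD declares a neighbor down after `multiplier` missed intervals."""
    return interval_ms * multiplier


if __name__ == "__main__":
    # Values from the demo: 384 sent, 374 received, one probe every 0.1 s.
    print("estimated outage [s]:", outage_seconds(384, 374, 0.1))
    # Assumed timers, not confirmed by the presentation.
    print("BFD detection time [ms]:", bfd_detection_ms(300, 3))
```

The gap between the detection time and the measured outage is the time needed to withdraw the route and reprogram forwarding, plus the 0.1 second granularity of the probe itself.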

And soon we should see that we are now forwarding back through interface enp2. And most importantly here, the traffic has been rerouted to this primary path without any lost packets. So it is also safe to connect new servers and to, you know, repair connections while the data center is normally operating. OK. And the last thing that we wanted to show here is something a little bit unusual. What we have here is this legacy BMS, which will be trying to open a webpage located on this IP address 0.250. And let's see where this IP address is located. So this is an IP address that is configured on BMS1.

However, it is also configured on BMS2, the very same IP address. This is allowed. There is no problem with that. And this kind of addressing is called anycast addressing. When the legacy server tries to forward packets to this destination, it will say, all right, according to the routing table, I should send them to Leaf2, and Leaf2 will see that this destination is available with the same distance through BMS1 and also through BMS2. So what it will do, it will load balance the traffic between these two servers, a simple load balancing using routing protocols. So let's see if it does indeed happen.

OK, if a session arrives at BMS1, we will see BMS1 hostname. If it arrives at BMS2, we will see BMS2 hostname. So we can see that it does work as expected. We've got load balancing using so-called anycast addressing. OK, and this is it when it comes to the demo that we wanted to show you.


So now let's summarize what are the biggest pros and cons of our solution that we presented during this presentation and during the demo. First, we are strongly optimizing the change management, as said before, by removing the network and the server silos. Connecting a new server, terminating a VLAN or stretching a new network no longer has to take ages. This feature will become even more clear during the second webinar where we will concentrate on EVPN features on the server.

Second, failure recovery. In a production environment every second matters. With the BFD we can easily achieve subsecond link failover detection and recovery. Third, simplicity. We can have multiple server uplinks in active/active manner without EVPN involvement. Also, using BGP, we can easily have active/active or active/passive services without a third party load balancer. With pure Layer 3 data center we are no longer affected by Layer 2 issues. We have no spanning tree, no LACP, no BUM traffic, no Layer 2 loops. Having IPv6 link-local addressing we can limit the IPv4 addresses that we need to assign to each switch or each server.

With this solution we need only one /32 prefix for each server and one /32 prefix for each switch. We no longer have to deal with subnets, broadcast, gateways, DHCP. Since the leaf switches only route the unicast traffic, there is no need for a VXLAN license on those switches, which means a lower price. We moved all the work down to the edge servers. Again, more on that will be covered during the second webinar.
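One reason no per-link IPv4 addressing is needed is that every interface already has an IPv6 link-local address that can serve as the BGP next hop. The Python sketch below shows the classic modified EUI-64 derivation from a MAC address. This is an illustration only: modern Linux systems may be configured to generate link-local addresses differently (for example with stable privacy addresses), and the MAC address used is hypothetical.

```python
import ipaddress


def link_local_from_mac(mac: str) -> str:
    """Derive an fe80::/64 link-local address from a MAC (modified EUI-64)."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe in the middle to stretch 48 bits into a 64-bit interface ID.
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
    interface_id = int.from_bytes(bytes(eui64), "big")
    return str(ipaddress.IPv6Address((0xFE80 << 112) | interface_id))


if __name__ == "__main__":
    # Hypothetical MAC address of a server uplink.
    print(link_local_from_mac("52:54:00:12:34:56"))
```

Since such an address exists on every link without any planning, the only IPv4 address that has to be assigned by hand is the /32 loopback.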

No VXLAN on the switches also means no service termination. This translates to a much simpler network device configuration. We no longer have to define VLANs or VRFs, and we have a simpler BGP configuration. We are still backwards compatible with the Spine&Leaf topology, so we can still have some servers connected in the old manner and we can do the upgrade in small steps. Again, we are having an evolution, not a revolution.

The next argument has probably been heard multiple times, but still: we became IPv6 ready at this point. Last, on the FRR roadmap there is a feature called Flowspec, which is already partially implemented. As soon as it is finished, we will be able to use the BGP protocol to control the firewall on each server from one single controller. This is ideal for DDoS protection inside the data center.


Now, every solution that has advantages has also some disadvantages and it is only fair to mention them as well. So there's no hiding the fact that this solution is more complicated, at least when it comes to the server configuration, because we do have this extra component on the servers that we need to install, which is this FRR routing daemon.

Also, the system administrators should know at least a little bit about the BGP protocol in order to do basic troubleshooting. If there is a problem with connectivity in the network, this knowledge will help to quickly determine whether the problem is with the connectivity and the BGP protocol or maybe with something completely different.

Another thing is that BGP is a very flexible protocol. However, it can also be misconfigured. So a bad configuration on one server might affect some communication in other parts of the data center. Fortunately, it is quite easy to protect against such problems by using BGP policies configured on the leaf switches. For this presentation, for the lab environment, we have used quite a fresh Linux kernel, version 5.3, and it would probably also be possible to run this on older versions. However, we do recommend this newer kernel for compatibility.

Also, keep in mind that this is quite a new approach. Most of the things that we've shown here are based on standards and are supported by multiple vendors. However, in the case of this BGP autodiscovery, we actually had to write a script on one of the switches in order to achieve this functionality.

And also we did encounter some bugs and we have reported them to the proper maintainers. And we'd also like to say a little bit more about that, because some of the bugs were quite interesting.


As I mentioned before, some issues arose. There were a few of them. At some point, we had to use eBPF to fix incoming traffic in one scenario. We will cover that on the next slide. The second issue that we saw is that as the data center grows, the bare metal route table "pollution" might become troublesome. Scrolling down through 10,000 prefixes is no fun. Don't worry! The Linux kernel can handle way more without performance drops. But this issue will still be troublesome.

This issue can be easily mitigated using BGP policies that advertise only the default route toward the servers. The third issue that we encountered: there isn't yet native support for BGP discovery on the Juniper equipment. So at this point, we had to be creative. Luckily, the Juniper devices themselves are so flexible that it was possible to adapt them to this environment with just a few lines of Python code and the Junos library. We posted the source code on GitHub and the link will be provided at the end of this presentation. Last, there was some BGP incompatibility between FRR and one of the switches.

But as we mentioned, a patch from FRR was quickly proposed to us and we can confirm it is working. As I have mentioned, there was one scenario where the incompatibility was so big that it was not possible to work around it with proper configuration or scripting. Whenever IPv4 traffic was exchanged between the bare metal server and the Juniper control plane, the return IPv4 packet from the QFX/MX was incorrectly packed into a frame with the IPv6 EtherType. That packet was dropped by the kernel stack, resulting in no communication between the Leaf and the BMS. This issue was affecting only control traffic such as SSH/SNMP from the bare metal server to the QFX itself or to the MX itself. Transit traffic and IPv6 traffic were unaffected. So it wasn't a critical issue.

Still, it was a bit annoying. We should have been able to work around this issue with the simple tc utility along with the packet edit action. But tc lacks documentation and the existing documentation has errors. Also, tc actions are executed pretty late in the line of kernel hooks, so applications operating on raw data such as tcpdump would still see bad traffic. In the end, we applied eBPF code to handle this issue. We think that this approach is simpler. The eBPF code is small, below 40 lines, and extremely plain. In the code, there is a simple IF and one action that replaces 2 bytes in the incoming packet. The issue was solved fast. eBPF itself is blazingly fast; it can even be offloaded to some NIC hardware for greater performance. Last, this solution is transparent to the rest of the Linux kernel. After applying the eBPF code, all applications such as tcpdump/tc were seeing corrected packets.
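The logic of such a fix, one IF and a 2-byte replacement, can be illustrated in user space with a short Python sketch. This is not the eBPF program from the repository; it only simulates the same idea on raw frame bytes and assumes untagged Ethernet frames (no VLAN header). The function name and the sample frames are hypothetical.

```python
import struct

ETH_P_IP = 0x0800     # EtherType for IPv4
ETH_P_IPV6 = 0x86DD   # EtherType for IPv6
ETH_HEADER_LEN = 14   # destination MAC (6) + source MAC (6) + EtherType (2)


def fix_ethertype(frame: bytes) -> bytes:
    """Rewrite the EtherType when an IPv4 packet is labeled as IPv6.

    The first nibble of the IP header is the version field, so a frame that
    claims IPv6 but carries version 4 is the mislabeled case.
    """
    if len(frame) <= ETH_HEADER_LEN:
        return frame
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    ip_version = frame[ETH_HEADER_LEN] >> 4
    if ethertype == ETH_P_IPV6 and ip_version == 4:
        # Replace exactly two bytes: the EtherType field.
        return frame[:12] + struct.pack("!H", ETH_P_IP) + frame[ETH_HEADER_LEN:]
    return frame


if __name__ == "__main__":
    macs = bytes(12)  # placeholder destination and source MAC addresses
    broken = macs + struct.pack("!H", ETH_P_IPV6) + b"\x45" + bytes(19)
    print(fix_ethertype(broken)[12:14].hex())
```

Genuine IPv6 frames start with version nibble 6 and pass through unchanged, which is what keeps the fix safe for the rest of the traffic.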


And this is all the material that we managed to fit in this webinar. We don't want to let you go with too much information, so let's quickly summarize what we have talked about here today. So what we've seen is definitely not a revolution. However, it is a good step in the direction of a scalable and automated data center where we use openly available tools that are interoperable between vendors.

So when we use this approach, we get to have less configuration on the networking devices because, for example, we do not need to stretch VLANs throughout the data center. We do not have to manually configure BGP neighbors. No need to manually manage IPv4 subnets between the devices. This is all happening automatically. There is faster change management for sure when it comes to adding new servers to our data center. It is also possible to run the services not only on IPv4, but also on IPv6 and also both of these addressing schemes simultaneously.

And last but not least, if we are using the BGP protocol in our network as well as on the servers, we get to add these extra features to our data center, which will be covered in the webinars to come. And if anyone is already interested in this solution after this first webinar, just keep in mind that we're open to running a PoC or, you know, providing an FUT for you.

OK, again, just a short teaser. The next webinar we're thinking of doing in one month's time and we will keep you posted.


OK, so we arrive at the Q&A. At the bottom of the slides, you can see the link to the GitHub. Feel free to download and share, and post issues if you find any. And if you have some questions, please feel free to post them in the YouTube chat, so we can answer them here right now or later on.


So right now, we'll wait for a few minutes to see whether or not somebody has a question from the presentation. And if not, as already mentioned, you can reach us through our emails or GitHub and we'll be more than happy to answer the questions that you might have.


OK, so there is a question: since BGP neighbors are dynamically created, how do you ensure compromised servers won't announce IP address space that is not theirs or establish multiple BGP sessions using multiple IPv6 addresses?

So when it comes to these automatically created neighbors and established sessions, this is true, they are being created automatically. However, we get to control what sessions are being established on the switches if we use appropriate configuration. And we can also ensure that no IP address is advertised that we do not want, because we've got this capability of using routing policies where we can filter the information that is being advertised between networking devices.

And this is actually one of the reasons that we use BGP protocol here. It is very flexible and it does provide us with the capability of making sure that only the information that we want to advertise is forwarded throughout the data center. We can totally control that.
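As a rough illustration of what such a routing policy checks, here is a Python sketch of an import filter on a leaf switch. It assumes servers may only announce /32 host routes from one agreed range; the range 10.0.0.0/24 and the function name are hypothetical, and on real equipment this would be expressed as a vendor routing policy rather than code.

```python
import ipaddress

# Hypothetical address range from which servers may announce loopbacks.
ALLOWED_RANGE = ipaddress.ip_network("10.0.0.0/24")


def accept_route(prefix: str, allowed=ALLOWED_RANGE) -> bool:
    """Accept only IPv4 /32 host routes that fall inside the allowed range."""
    network = ipaddress.ip_network(prefix, strict=False)
    return (
        network.version == allowed.version
        and network.prefixlen == 32
        and network.subnet_of(allowed)
    )


if __name__ == "__main__":
    for candidate in ["10.0.0.5/32", "0.0.0.0/0", "192.168.1.1/32", "10.0.0.0/25"]:
        verdict = "accept" if accept_route(candidate) else "reject"
        print(f"{candidate:>16} -> {verdict}")
```

A compromised server announcing a default route or someone else's address would be rejected at the first switch, so the bad information never spreads further into the data center.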


Also, FRR, as far as we remember, supports the RPKI infrastructure, so we can enforce that it will accept only prefixes signed cryptographically by the private key, by the FRR itself. So a third party won't be able to access our network because they won't be in possession of a signing certificate.


Yes, so we hope that this answers this question. Maybe I will expand a little bit on the filtering while we wait for any other questions that might come. The filtering of routes is also important when we want to scale the data center to a really large number of servers, because we have the ability to use the routing policies to aggregate information about the available addresses. And thanks to that, even if we have hundreds of thousands of servers, we are able to advertise only the routing prefixes that need to reach specific devices and also make sure that we do not advertise every single IP address but aggregate them together into, for example, /24 or /16 subnetworks.
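The effect of aggregation can be shown with Python's standard `ipaddress` module. This is a minimal sketch: `collapse_addresses` only merges prefixes that exactly fill a larger block, whereas a BGP aggregation policy on a switch can advertise a covering prefix even when some addresses inside it are unused. The addresses are hypothetical.

```python
import ipaddress


def aggregate(host_routes):
    """Collapse a list of prefixes into the smallest equivalent set."""
    networks = [ipaddress.ip_network(route) for route in host_routes]
    return [str(network) for network in ipaddress.collapse_addresses(networks)]


if __name__ == "__main__":
    # 256 hypothetical server loopbacks, one /32 each.
    loopbacks = [f"10.1.1.{host}/32" for host in range(256)]
    print(len(loopbacks), "host routes ->", aggregate(loopbacks))
```

Here 256 host routes shrink to a single /24, which is the kind of reduction that keeps the forwarding tables on the switches small as the number of servers grows.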


We have another question: is there no IGP involved, like OSPF, and is this entirely BGP everywhere? Yes, this is a pure BGP solution, there is no OSPF or IS-IS running here, only BGP.


And again, OSPF and IS-IS could potentially be used, however, they are less scalable and they also are less flexible when it comes to this route filtering.

All right. And another question: what is this solution's scalability? So it will be in the thousands of servers, again, thanks to the capabilities of the BGP protocol. Now, BGP is also the protocol used in the public Internet, and right now it is handling 700,000 routes or even more. I haven't checked recently. However, the BGP protocol is able to handle this many routes being advertised throughout the network.


And when it comes to the server itself, Linux kernel tests have been made, and the Linux kernel can handle at least 500,000 prefixes without any performance drops. And whenever someone wants better performance or better throughput, they can use a DPDK data plane. It's called VPP, and it can increase throughput and reach line rate even for the smallest packets.


Mm hmm. And we also have got an answer to this question of scalability. And indeed, it is limited by the switches' TCAMs and forwarding information base. So we need to take care not to overflow the forwarding information tables on the networking devices. And again, a solution to that is route aggregation: not advertising every single IP address to every device in the data center, but aggregating the prefixes.

Yes, the configuration of VMX is available on GitHub. We've put the configurations that include the most important parts of what we've talked about today. These are not full configs. However, you should be able to, you know, take this configuration, implement in your lab environment and see how it works for yourselves as well.


If you have any issues after applying that configuration, feel free to reach us via GitHub issues—we will be watching them closely.


All right, the next question: IPv6 lookup scalability is lower than IPv4. Is that not an issue? Well, as you might have seen, well, maybe we went by this really fast. However, at least in this demo, we are using IPv4 addressing for applications and servers. And these are the addresses that are being advertised. IPv6 is used only for the next hops that are advertised using the BGP protocol, and the actual packets are using the MAC addresses. So we are not forwarding IPv6 packets. We are forwarding IPv4 packets. And the IPv6 lookup probably is a little bit slower. However, again, it is quite performant in the Linux kernel, so it shouldn't be a problem either. And also the networking devices, at least the switches, do perform this IPv6 packet forwarding in hardware as well. So this is the same performance. OK.

Any other questions from anyone?


OK, thank you for your time. Thank you for your attention. You can still reach us via GitHub or the email or YouTube comments. And thank you.


We hope it was interesting for you. And we hope that you will join us for the next two webinars, where we will continue to develop the solution in the data center.


Adam Kułagowski

Principal Network Engineer

Jerzy Kaczmarski

Senior Network Engineer