The modern, interoperable DC - Part 2: EVPN as a universal solution for VM, container and BMS networking (video)

Adam Kułagowski

Jerzy Kaczmarski

Reading time: 79 minutes

This video is a part of our series "The modern, interoperable DC", which walks you through connectivity between different types of resources.

In Part 1 we guided you through a solution for DC connectivity based on a combination of FRR, Unnumbered BGP (IPv4 with IPv6 link-local NH) and eBPF.

Now it’s time for Part 2!

  • In this video we will continue to enhance our DC with additional features made possible by open standards.
  • We will show how to leverage a BGP router running on servers to provide layer 2 connectivity between heterogeneous resources, such as virtual machines, containers and bare metal servers (both legacy and FRR-based).
  • You will also learn about CNI (Container Network Interface) and how it can be integrated with FRR in order to automatically advertise information about IPs and MACs of newly created containers.
  • The speakers will also explain:

    • What VXLAN tunnels are and how they carry layer 2 traffic
    • What an Ethernet VPN is and how it can be useful in DC
    • How to provide multi-tenancy through the use of VRFs and tunnelling protocols
    • How to interconnect VMs, containers, BMS and other resources through an IP fabric
  • We will present a demo showing how the solution works in practice.

The source code, topology and configurations used during this presentation are available in the GitHub repo.

Transcript

Hello and welcome to today's webinar, including those of you who missed the previous part. My name is Adam, and my name is Jerzy, and we both work at CodiLime. Today we'll introduce you to Ethernet VPN and show how it can be used to interconnect different resources in the data center. But first, a few words about CodiLime. CodiLime was founded in 2011 and now we have more than 200 people on board. We are no longer a startup, that's obvious, but we try to keep its spirit: a culture of agility, innovation and adaptability. Most of our team is located in Warsaw and Gdansk. However, we work with clients in six different time zones, so part of our team is also located there.

With all that said, we often work with data centers that are deployed in spine-and-leaf fashion. That experience got us thinking: is that architecture final? Can it be improved? What can be done to speed up deployments, speed up change and speed up things overall? The previous presentation and today's are all about that. A lot of technology will be used during this presentation and demo. However, do not be afraid. The technologies are not new. They are open source and they are well known. We are just connecting them all together to create something better, something newer.

Right. So let's start by revisiting the set of mechanisms that we introduced during the first webinar, as this is something that we will build upon and extend today. What we did is we used so-called IPv6 link-local addresses, which allow us to automatically assign IP addresses to the interfaces interconnecting the networking devices and also to the interfaces between the leaf switches and the servers. This is especially important in large data centers with thousands of connections and thousands of servers, because this IP addressing does not have to be done manually. Now, even though we are using IPv6 addressing, we can still achieve native IPv4 connectivity between the servers. So we've got the whole stack.
We can use IPv6 and we can use IPv4. Thanks to using IPv6, we also get the possibility to use the Neighbor Discovery protocol, which is part of this standard. Thanks to it, we can automatically establish BGP sessions between, for example, our servers and the networking devices, so we do not need to manually specify the IP addresses of peers in the BGP configuration. Now, thanks to using a dynamic routing protocol, we can achieve load balancing when forwarding traffic through our data center. We've got multiple paths to the destination IP, and these paths can be used simultaneously to achieve better throughput. And if one of the interfaces or one of the devices goes down, the failover will be really fast because we've introduced Bidirectional Forwarding Detection (BFD) into this topology, which allows for sub-second failovers.

The choice of BGP is very intentional here because it is a very scalable and very flexible protocol. It allows for route filtering and route aggregation, which further increases scalability, especially when we have thousands of IP addresses and thousands of servers. But it is also a prerequisite for some additional features such as Ethernet VPN, and this is what we are going to talk about in today's webinar.

You might know that for typical layer 2 communication in a data center, VLANs were traditionally used, so we needed to stretch VLANs throughout the DC. Here, however, we are using an IP fabric. In order to provide layer 2 connectivity through an IP fabric, we are going to use so-called VXLAN tunneling, where Ethernet VPN will allow us easier configuration and some additional benefits, which we will talk about in a second. Now, this layer 2 connectivity can be used to provide connections between IP addresses configured on physical interfaces. It might be addresses configured on some logical interfaces in the Linux operating system. It might be some MAC addresses configured on containers.
Or on virtual machines. So basically we can interconnect heterogeneous resources and provide this layer 2 connectivity between them, and at the same time we can also provide multitenancy. Multitenancy, in short, is the ability to separate the services of different customers, even if they are running on the same server and the same operating system, and even if the services are communicating over the same physical network. We'll show a working demo of what we are going to talk about today, and at the end of this presentation we'll have a short Q&A session. So if you have any questions during the presentation, please use the YouTube chat and we'll try to answer them at the end of this webinar. So, this is the agenda. We hope that you will enjoy this presentation and that you will find it interesting.

Let's start by introducing the two main actors that will play a major role in this demo and this presentation: VXLAN and EVPN. VXLAN stands for Virtual Extensible LAN. It's a network virtualization technology that attempts to address the scalability problems associated with large deployments. It uses a VLAN-like encapsulation technique: it encapsulates layer 2 frames inside UDP datagrams. The VXLAN endpoints, which terminate VXLAN tunnels, may be physical or virtual ports; in Linux the virtual ones are known as VXLAN interfaces. One implementation of VXLAN with a software switch in the operating system is Open vSwitch. It's a fine example, but we'll concentrate on using plain bridges in the Linux operating system along with FRR and VXLAN interfaces.

In Linux, VXLAN can be configured in three ways. It can be configured as a point-to-point interface, with a local and a remote IP address to connect both sides. It can be configured as a point-to-multipoint interface with a local address, where remote addresses are discovered using multicast.
Also, it can be used as a point-to-multipoint interface where all other endpoints are discovered and advertised via the BGP protocol or via some SDN controller. Since the solution with EVPN and the BGP protocol is the most flexible one and scales fine to thousands of nodes, we'll use EVPN inside the FRR software in our demo and in our presentation.

Now let's talk about Ethernet VPN. Ethernet VPN allows you to connect a group of VXLANs and extend layer 2 resources, layer 2 bridges, over a layer 3 network. EVPN is used for signaling, and for transport you can use VXLAN or MPLS. However, to avoid building an MPLS data plane and exchanging MPLS label information, we'll use VXLAN, which requires only a working IP fabric underneath. EVPN allows us to stretch, as I said before, layer 2 connectivity and provide segmentation and isolation just like VLANs, but without the limitations of traditional networks such as Spanning Tree. We no longer have to use it. We won't encounter layer 2 loops. There won't be active/passive links, and BUM traffic (broadcast, unknown unicast and multicast) will be kept to a minimum.

EVPN also has some additional features, such as MAC/IP advertisement and MAC mobility. As soon as a resource appears on the network, the rest of the nodes, the rest of the tunnel termination endpoints, will be informed immediately via the BGP protocol. So there won't be any flooding of unknown unicast: the endpoint address will be well known. Also, as soon as an endpoint moves and its MAC address appears on a different machine, the data plane will be updated using the BGP protocol. There is also layer 3 advertisement, just like in a layer 3 VPN. We are still using EVPN for both layer 2 and layer 3, just with different route types. For layer 3 information exchange, for layer 3 routing, we use type 5 prefixes, and we can advertise, for example, a /24 prefix into a different segment, a different VRF, to have routing.
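To make the encapsulation described above more concrete, here is a minimal Python sketch of the 8-byte VXLAN header as defined in RFC 7348: a flags byte with the "VNI valid" bit set, a 24-bit virtual network identifier (VNI) and reserved bits. The function names are ours and purely illustrative. A real sender would carry this header plus the original Ethernet frame in a UDP datagram to destination port 4789, which this sketch does not do.

```python
import struct

VXLAN_UDP_PORT = 4789         # IANA-assigned UDP destination port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08   # "I" flag: the VNI field carries a valid value


def build_vxlan_header(vni: int) -> bytes:
    """Pack the 8-byte VXLAN header for the given 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) followed by 24 reserved bits.
    # Word 2: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prepend a VXLAN header to an Ethernet frame (this is the UDP payload)."""
    return build_vxlan_header(vni) + inner_frame


def decapsulate(udp_payload: bytes):
    """Split a UDP payload into (vni, original Ethernet frame)."""
    flags_word, vni_word = struct.unpack("!II", udp_payload[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag is not set")
    return vni_word >> 8, udp_payload[8:]
```

The 24-bit VNI is what lets VXLAN address about 16 million segments, compared with the 12-bit VLAN ID.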
Last but not least, BGP EVPN is an open standard. It is not limited to proprietary vendors or proprietary equipment; it can be run on Linux. One example of a daemon that supports BGP EVPN is FRR, the Free Range Routing daemon. That daemon can be placed on machines that are hosting virtual machines, containers or Kubernetes, or on a bare metal server itself. All this will be shown during the demo.

OK, so let's take a look at how EVPN can work in practice, on this example topology. We've got three different servers and a firewall, and we want these resources to be able to communicate at layer 2. Now, the networking devices are configured as an IP fabric, so we've got only layer 3 connectivity between them. In order to provide these layer 2 connections for our servers and the firewall, we will use VXLAN tunneling, which will be configured on the leaf one and leaf two switches. What we'll also need is the BGP protocol with EVPN enabled. Thanks to it, the switches will automatically advertise information about the MAC addresses that each of them has learned, and also information on how to reach these MAC addresses, meaning which VXLAN tunnel to use. What we need to do in the configuration of the leaf devices is to associate their interfaces with specific VXLANs, with specific virtual network identifiers. So, port one is assigned to VXLAN one and port two is assigned to VXLAN two, and the same on the other device.

When we have this configuration, we can see how MAC address learning and connectivity work. For MAC address learning, take server three and let's assume that it needs to send traffic to the firewall on the interface with MAC E. It will create a new Ethernet frame with source MAC address C and destination MAC address E.
This frame will be forwarded to leaf two, where it will arrive on port one, and leaf two will perform standard Ethernet switch learning. It will take a look at the incoming frame and write down in the switching table the source MAC address and the interface through which this MAC address is reachable. Now, before forwarding this frame to the destination, EVPN will also take a look at the switching table and it will notice: all right, I've got a new MAC address, MAC address C. I need to advertise it to all other BGP peers that are also running EVPN. So it will take this MAC address and, in our case, advertise it to leaf one. In a large data center, this MAC address will be advertised to hundreds, sometimes thousands, of devices that are also running EVPN. Leaf one, upon receiving this advertisement, will take a look at it and see: I've got a new MAC address, MAC address C. It will be put into the switching table, and the next-hop interface in the direction of the destination will be the logical VXLAN one tunnel interface that leads to leaf two. OK, so this is the way that new MAC addresses are advertised throughout the data center.

Now, for the forwarding of traffic through the VXLAN tunnels. Let's say server one needs to communicate with server three. Again, a new Ethernet frame is created, with source MAC address A and destination MAC address C, and this frame is forwarded to the leaf one switch. It will take a look at the switching table and see: I need to forward this frame to MAC address C, so it needs to be encapsulated within the VXLAN one tunnel and forwarded to leaf two. So, as an IP packet with UDP and VXLAN headers, it is forwarded to the leaf two switch, where it is decapsulated. The original frame is taken out of the VXLAN encapsulation, and the leaf two switch checks the destination MAC of this Ethernet frame, again using its switching table.
So, we've got MAC address C reachable through port one, and this is exactly the interface through which the original frame will be forwarded to server three. This concept is actually quite simple when it comes to what we've seen on the slide. However, the capabilities that it gives us are quite far-reaching, and this is what we want to show you in the following examples.

All right, so before, we mentioned that VXLAN tunneling and EVPN can also be run in software, so not only on the networking devices but also, for example, on the Linux operating system. In our case, in order to do that, we are using FRR because it supports data plane integration, meaning that it can manipulate Linux bridge tables and routing tables. Now, let's take a look for a moment at server three. We can see that server three has one physical interface connecting it to the networking switch, but it also has some logical interfaces: bridge one and bridge two. In Linux, there are several types of logical interfaces that can be configured. A bridge interface is like putting a network segment inside a single server, inside a single operating system, where layer 2 forwarding can be done. It is also possible to assign an IP address to a logical interface, including a bridge interface. So, here we've got IP address B on the first bridge and IP address H on the second. Now, these IP addresses can communicate with each other, at least by default, so we ought to be able to ping one IP address from the other. However, in multitenancy scenarios we would actually require these bridges and these IP addresses to be separated because, for example, one of the IP addresses might be used to run a service for one customer and the second might be running a service for another customer, and they should not be able to communicate with each other. In the case of Linux there are several ways to achieve this functionality.
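Before moving on to how that separation is achieved, the MAC learning and advertisement flow from the earlier walkthrough can be modeled in a few lines of Python. This is a toy sketch, not how FRR or any switch implements EVPN: the `Leaf` class, its method names and the dictionary-based table are assumptions made for illustration, and a direct method call stands in for a BGP EVPN MAC/IP advertisement. Returning `("flood", None)` for an unknown destination is likewise a simplification.

```python
class Leaf:
    """Toy tunnel endpoint: one switching table keyed by (VNI, MAC)."""

    def __init__(self, name):
        self.name = name
        self.peers = []      # stand-ins for BGP EVPN neighbors
        self.mac_table = {}  # (vni, mac) -> ("port", port) or ("vxlan", remote leaf)

    def connect(self, other):
        self.peers.append(other)
        other.peers.append(self)

    def receive_local_frame(self, vni, port, src_mac):
        """Standard source-MAC learning, followed by an EVPN-style advertisement."""
        key = (vni, src_mac)
        is_new = self.mac_table.get(key) != ("port", port)
        self.mac_table[key] = ("port", port)
        if is_new:
            for peer in self.peers:
                peer.receive_advertisement(vni, src_mac, origin=self.name)

    def receive_advertisement(self, vni, mac, origin):
        """Install a remote MAC: it is reachable through the tunnel to `origin`."""
        self.mac_table[(vni, mac)] = ("vxlan", origin)

    def next_hop(self, vni, dst_mac):
        return self.mac_table.get((vni, dst_mac), ("flood", None))


# Server three (MAC C) sends a frame that arrives on port one of leaf two, VXLAN one.
leaf1, leaf2 = Leaf("leaf1"), Leaf("leaf2")
leaf1.connect(leaf2)
leaf2.receive_local_frame(vni=1, port="port1", src_mac="C")

print(leaf1.next_hop(1, "C"))  # leaf one forwards through the tunnel to leaf two
print(leaf2.next_hop(1, "C"))  # leaf two forwards out of the local port
```

Keying the table by VNI as well as MAC is what keeps the same MAC address in two different segments from colliding.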
And what we did here is we used so-called VRFs. VRF stands for virtual routing and forwarding, and basically creating a VRF means creating additional routing and switching tables. So, here we've got bridge one, which is assigned to VRF one, and that means that bridge one can now only communicate with other interfaces, be they logical or physical, that are also connected to the same VRF. It is no longer able to communicate with interfaces located in the global routing table or in other VRFs. Now, the same functionality is also available in most modern data center switches, where we can assign, for example, physical interface port one to VRF one and port two to VRF two. Thanks to that, server one and server two would also not be able to communicate with each other. So multitenancy can also be provided easily by the networking devices. It is also supported on the VXLAN tunnels: if we use different virtual network identifiers, then the network segments in the overlay topology that we create are also separated from each other, at least by default. So, this is OK for us.

Now the only thing that is left is to have EVPN sustain this separation, sustain this multitenancy. And it does support it. In order to do that, it uses so-called route targets. This is a feature of the BGP protocol, and basically it adds some extra values to the MAC and IP information that is being advertised by EVPN. So, for example, here we've got server three again, with bridge one, and this bridge one interface has MAC address B associated with it. This MAC address will be put by the operating system in the switching table for VRF one, and EVPN will notice: all right, I've got a new MAC address that I need to advertise. However, it will be advertised with the route target value that is associated with this VRF one routing table and switching table.
So, this information will be sent to the leaf one switch, and leaf one will take a look at the advertisement and also make sure to note the route target, because this determines which switching table it will put the information into. Here we see that MAC address B is being entered into the switching table for VRF one, and the destination is the VXLAN interface leading to server three. Thanks to this configuration we achieve an environment where server one, with an IP address configured on a physical interface, can communicate at layer 2 and layer 3 with an IP address that is configured on a logical interface on server three. Another server, server two, is able to communicate with an IP address configured on the second interface on server three. However, these interfaces are separated from each other using VRFs, so multitenancy is sustained here as well. One last thing to notice on this slide is that leaf two does not take part in VXLAN tunneling and doesn't need to run EVPN. In truth, if most of our servers ran FRR, we wouldn't need many networking devices that support EVPN or VXLAN, or, depending on the scenario, we wouldn't need them at all. So running EVPN and VXLAN in software can decrease the cost of the solution.

OK, so you might be thinking that this EVPN must be very complicated, and that creating these VRFs on Linux is probably hard work. But in truth, when you get the hang of it, it is not that hard, and not many commands are needed to configure it. For example, here we've got a server where we want to create a new bridge interface with an IP address assigned to it. In order to do that, we issue two commands in Linux: we create a new interface with the name bridge one and the type of network bridge, and we assign an IP address to it.
Now we want to separate this interface from all other IP addresses and interfaces that might be configured on the same server. So we create VRF one and put the bridge interface into it. These are the three commands: create the new VRF, enable it and associate the bridge one interface with VRF one. As the next step, we want to allow whatever is connected to bridge one to communicate through the VXLAN tunnel with all the other MAC and IP addresses that are also connected to VNI 1, and in the opposite direction: everything that comes from the VXLAN one tunnel to the server will be sent to the bridge one interface. In order to do that, we create the VXLAN interface with virtual network identifier one and the standard UDP destination port for VXLAN. We also make sure to specify that the local endpoint will be at the IP address 10 01 11, so the endpoint of the VXLAN tunnel is at the loopback interface over here. We enable this new interface and we attach it to the bridge one logical interface. So, this is it when it comes to the configuration of interfaces in Linux.

The last thing that we need to do is to enable advertisements using EVPN. Here we assume that we are starting from the place where we left off at the last webinar, so the FRR daemon is up and running and the BGP protocol is enabled. What we do is we just enable the EVPN address family, create the configuration for VNI one, where we specify the appropriate route targets for our VRF, and then we enable the EVPN advertisements. So, in truth, this is quite a few commands, but not a lot. And keep in mind that this kind of configuration can be simplified and automated using scripting, so it is not that hard to do at all.

OK, so we have Linux bridges connected and assigned to the VRFs. Now let's move to virtual machines. We can connect them, and there are two ways to do so.
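Before looking at those two ways, the Linux interface steps just described (bridge, VRF, VXLAN interface) can be summarized as a command sequence. The Python sketch below only builds and prints iproute2 command strings; it does not execute anything. The interface names (`br1`, `vrf1`, `vxlan1`), the routing table number and the addresses (taken from the documentation ranges 192.0.2.0/24 and 198.51.100.0/24) are placeholders chosen for illustration, not the values used in the demo. The `nolearning` option is also our assumption: it is a common choice when a control plane such as EVPN populates the forwarding entries.

```python
def vxlan_bridge_commands(bridge, bridge_cidr, vrf, vrf_table, vni, local_ip, dstport=4789):
    """Return iproute2 commands for one bridge placed in a VRF with a VXLAN interface."""
    vxlan = f"vxlan{vni}"
    return [
        # 1. Create the bridge and give it an IP address.
        f"ip link add {bridge} type bridge",
        f"ip addr add {bridge_cidr} dev {bridge}",
        # 2. Create the VRF, enable it and move the bridge into it.
        f"ip link add {vrf} type vrf table {vrf_table}",
        f"ip link set {vrf} up",
        f"ip link set {bridge} master {vrf}",
        # 3. Create the VXLAN interface, enable it and attach it to the bridge.
        f"ip link add {vxlan} type vxlan id {vni} dstport {dstport} local {local_ip} nolearning",
        f"ip link set {vxlan} up",
        f"ip link set {vxlan} master {bridge}",
    ]


# Placeholder values only (hypothetical names and documentation addresses).
commands = vxlan_bridge_commands(
    bridge="br1",
    bridge_cidr="198.51.100.1/24",
    vrf="vrf1",
    vrf_table=1001,
    vni=1,
    local_ip="192.0.2.11",
)
print("\n".join(commands))
```

Generating the commands from a function like this is one way to do the scripting mentioned above: the same template can be reused for each additional VNI and VRF.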
Each virtual machine in Linux is most frequently associated with a tap interface, a layer 2 interface. We can create a virtual machine and assign its tap interface to the bridge using the ip link set master command. The bridge must already be assigned to the VRF. So, as soon as a packet leaves the virtual machine, it appears on the tap interface, then goes to the bridge, and is then handled by Linux and routed according to the routing tables inside the VRF. Or we can use Libvirt: create networks in Libvirt, which creates the bridges, and after Libvirt starts the networks, assign those bridges to the VRFs. Then we create virtual machines using the Libvirt tools and assign them to the Libvirt networks, and those machines will appear in the VRF automatically. We can even have DHCP running in Libvirt, and the machines in that network will have IP addresses assigned. Here we can see that we are using virsh to list the interfaces of the alpine blue machine, and we can see it is assigned to the blue network. We have configured the blue bridge to be assigned to a separate VRF, so it's separated. So we can have two bridges, two networks in Libvirt, and the virtual machines will be totally isolated from each other as long as they belong to different VRFs. So that's virtual machines.

Now let's move to containers. Things with Docker or LXC containers are a little bit more complicated. A container uses the host kernel for the networking part, so we have to provide separation there. We have to have a separate space for the interfaces, for the routing table and for the layer 2 tables, and we can use Linux namespaces for that. Generally, a Linux namespace gives us separation of all the networking things together. Within the operating system we have a global namespace, which always exists, but we can create a new namespace.
Having a namespace is only half of the way. We have to put a network interface there. Putting a physical interface there will give us nothing, because we lose that interface from the global routing table and we lose connectivity. So we can use veth interfaces. Those are funny things; they are like virtual pipes: what comes in at one end comes out at the other end. So we create a veth pair. We move one endpoint to the namespace that is dedicated to the container. The other endpoint is left in the global namespace, and we assign it to the bridge that is already in the VRF. So as soon as the container in the namespace creates a packet, the packet goes to the veth, it appears on the other end of the virtual tube, and that other end is assigned to the bridge. So that packet appears on the bridge and is routed according to the routing tables of the VRF.

Now let's move to Kubernetes. Kubernetes was the hardest part to adapt. Although the concept of Kubernetes networking is pretty simple (there are containers connected to bridges, just like on the previous slide), it soon became obvious that the devil is in the details. All the pods share the same bridge, the same flat network. None of the existing CNIs supports BGP plus EVPN and VXLAN. Some of them support VXLAN, some of them support BGP, but we haven't found any that supports both. The kubelet must communicate with the pods on each node, for example for health checks. So, if a pod is in a dedicated VRF, separated from the kubelet in the global routing table, there is no communication and the pod is considered failed. Also, the subnet for the services, 10.96.0.0/20, is flat. It cannot be divided. All those things could be solved by writing a new CNI, but that would be a huge amount of work and it would feel like reinventing the wheel. So, let's deal with our problems one by one.
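As a side note on the last problem in that list, the flat service subnet: Kubernetes treats 10.96.0.0/20 as a single range, but arithmetically it contains sixteen /24 prefixes. The short Python sketch below enumerates them with the standard `ipaddress` module and looks up which /24 a given service address falls into. The helper name `slice_for` and the example address are ours, and the sketch makes no claim about which slice belongs to which tenant.

```python
import ipaddress

# The flat service range mentioned in the talk.
service_net = ipaddress.ip_network("10.96.0.0/20")

# All /24 prefixes contained in the /20 (2 ** (24 - 20) = 16 of them).
slices = list(service_net.subnets(new_prefix=24))


def slice_for(address: str) -> ipaddress.IPv4Network:
    """Return the /24 slice of the service range that contains `address`."""
    ip = ipaddress.ip_address(address)
    for candidate in slices:
        if ip in candidate:
            return candidate
    raise ValueError(f"{address} is outside {service_net}")


print(len(slices), slices[0], slices[-1])
print(slice_for("10.96.10.100"))  # example address, chosen for illustration
```

Because each /24 is an ordinary prefix, it can be advertised and filtered independently of the others.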
For the one flat network, we can use a CNI called Kube namespace CNI. It's a CoreOS project. It allows us to attach pods to bridges based on the Kubernetes namespace. This is different from a Linux namespace; do not mix them up. Since the bridges are already assigned to VRFs, we have a correlation between the Kubernetes namespace and the VRF. That part is done. As for the communication between the bridges, between the nodes, that is taken care of by FRR and BGP with advertisements using the EVPN address family. So that part is done as well. For the communication between the kubelet and the pods, we have to use a thing called route leaking, and again we can use a veth interface. One end is assigned to the global routing table, the other is placed in the VRF. We add some static routes, and then we can communicate between the global routing table and the pod. For the last part, the service network, we can divide that network manually into /24 prefixes, assign each prefix to a dedicated VRF, or namespace from the Kubernetes point of view, and use that subnet for advertising services. So, with all that together we have a working Kubernetes and we can move to the demo.

The demo agenda is divided into nine parts. First we'll show and try to explain the topology, and then we'll start creating resources such as LXC containers, Kubernetes pods and Kubernetes services, and we'll show that we have connectivity inside each VRF while having no connectivity between VRFs. The virtual machines will already be created because they take a lot of time to deploy and configure, so that part is skipped, but it can be done easily using virsh.

OK, this picture presents the whole topology with all the details. We all agree it's a little bit cluttered. So, let's break the information into separate slides and deal with them one by one. First, each node was assigned a different role. The first node was designated for Libvirt.
So, there will be virtual machines running there. The second compute node was designated for containers, the LXC containers. The fourth node was designated for Kubernetes; the fifth was for the controller. The other two were for the pod deployments. And the last node is a legacy bare metal server, which has no FRR running; its connectivity will be provided by the leaf one and leaf two switches with the proper configuration there.

OK. We mentioned the BGP protocol, so let's see how things are connected at this level. Most of the compute nodes, namely all of those that are running FRR, have two external BGP sessions to the leaf switches. Those sessions are used purely for advertising the node addresses and the switch addresses, and they are configured automatically, as explained in the first webinar. This is just a few lines of configuration in FRR. The rest happens automatically using IPv6 advertisements, and it was covered in detail last time. The second type of neighborship is internal BGP. All of the nodes that are hosting FRR are connected to the spine switch. The spine switch acts as a route reflector, and internal BGP is used for the EVPN address family to advertise the resources that are placed on each compute node.

So, we have BGP, we have the role assignment. Let's move to the VRFs. Each compute node has two VRFs deployed, VRF red and VRF green. We'll go into the details later on. The last compute node is the legacy one. There is no FRR and no VRF configuration there, so it is assigned to only a single VRF, and that assignment is done on the switches. That compute node is connected via a bonding interface, and it thinks it's connected to a single switch. So we have redundancy. Both of the links are active. However, if one switch fails, compute six will transparently move to the second switch while keeping all its connectivity: with a little bit less performance, but it will still have communication.
OK, so we have VRFs, so let's talk about IP address assignment. On compute one and compute two, and also on the legacy server, we have stretched one network, 10.1.0.0/24, so all resources on these compute nodes, VMs or containers, placed in VRF red will be visible at layer 2 and will be given IP addresses from that address space. The same goes for VRF green. There it is 10.2.0.0/24, so VMs and containers placed in VRF green will be given IP addresses from that subnet. On Kubernetes we decided to go with a simpler configuration: a different /24 prefix is assigned for each Kubernetes node and for each VRF.

OK, so let's take a look at how it will work in practice. In the first demo, in the first scenario that we show here, we've got only virtual machines configured, and they are running on the comp one server. We've got VM red here and VM green at the bottom. There are no LXC containers running. There are no Kubernetes pods running at this time. So, let's see on the terminal how it looks. Here we've got the terminal of the comp one server, where the virtual machines are running. As the first step, let's see what networks are visible to Libvirt. Libvirt sees the default network, the green network and the red network. So we've got the two networks that we are interested in during this presentation. The next thing that we want to check is the virtual machines that are running on the server. And we indeed see two VMs, alpine red and alpine green. Sorry. All right. The next thing that we would like to check is where these virtual machines are connected: to the default network, the green network or the red network. For the alpine red virtual machine, we've got a single interface assigned, and it is connected to the red network. And this is the MAC address that is used by this VM.
OK, and let's just check the second one: the second virtual machine is connected to the green network, also with a single interface and a different MAC address. What we can do now is log into these virtual machines. Let's start with alpine red and see the IP addresses that are configured on its interfaces. It's got the single interface that we are interested in, Ethernet one, with IP address 10.1.0.60. So, this is from the range of VRF red. Now, there are no containers from either LXC or Docker running at this time. However, there is a default gateway configured in both VRF red and VRF green. So, let's see if we can ping this default gateway. All right. We've got communication here. This is good. Let's continue with the green virtual machine. OK, sorry, we're going to log into it right now and, again, see the IP addresses. This is a different subnetwork. What we are going to check is whether or not we've got connectivity to the default gateway in VRF green. Yes, we do. However, we should not be able to ping the IP address of VM red. Even though both of these VMs are running on the same server, they are put in different VRFs. And indeed they cannot ping each other.

Now, the information about these virtual machines, their MAC addresses and their IP addresses, should be automatically advertised by EVPN. EVPN is already configured. It is up and running, and the advertisements should reach all other BGP daemons in our data center. This obviously includes the second server, comp two. So let's take a look at the neighbors, the ARP entries in this scenario, for VRF red on comp two. We are interested in the IP address of VM red. OK, and we've got it. We've got the IP, we've got its MAC address, and we also have the information that it has been learned from an external source, in this case EVPN. All right. This should also be true for the green VRF, but here we are looking for the IP address of VM green. And again, we've got it.
It has been advertised by EVPN and all other EVPN devices in the data center do know about it. All right. And finally, the IP routes of VRF red, where we can actually see different IP addresses that are already preconfigured and located in this VRF: not only individual IP addresses, but also subnetworks, which are advertised by EVPN type 5 routes. OK. And now let's continue to the second step, where we create LXC containers on the second server, and we are hoping that we will be able to ping between LXC container red and VM red, and that we will be able to ping between VM green and LXC green. But they should not be able to ping between the VRFs; this communication should be possible only within a single VRF. So, let's jump back to the terminals. And at the beginning, let's make sure that no LXC containers are running. All right. Now we want to create an LXC container in VRF red, with a configuration that already assigns the new container to the appropriate network associated with VRF red, and now in VRF green for the second container. Right. So they are created. Now we need to start them up. OK. And they should be running by now. So, LXC ls. We've got containers running, so let's log into LXC red and see the IP addresses configured. Sorry, one second. A slight technical trouble. OK, so these are the IP addresses. One more second, please be patient. All right, we are back on track. So, this is the LXC red container. It's got an IP address automatically assigned using DHCP. The DHCP server is actually running on the comp one server, on Libvirt. And we want to check whether or not it is able to communicate with the virtual machine that is also running in VRF red. And this is true. It can ping it without a problem. Now, just to make sure, let's see if it can communicate with the green virtual machine. And what we see is that it cannot do that, which is also something that we did expect.
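The ping results seen so far follow a simple rule: two endpoints can talk only if they sit in the same VRF. The snippet below is a minimal sketch of that expectation in Python, not part of the demo itself. It assumes that VRF membership can be read off the subnet (10.1.0.0/24 for red, 10.2.0.0/24 for green, as stated earlier); in the real setup membership comes from the interface-to-VRF assignment. Only 10.1.0.60 comes from the demo; the other two addresses are hypothetical.

```python
import ipaddress

# Subnets stretched across the servers, one per VRF (from the talk).
VRF_SUBNETS = {
    "red": ipaddress.ip_network("10.1.0.0/24"),
    "green": ipaddress.ip_network("10.2.0.0/24"),
}


def vrf_of(address):
    """Return the name of the VRF whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for vrf, subnet in VRF_SUBNETS.items():
        if ip in subnet:
            return vrf
    return None


def expect_reachable(src, dst):
    """Endpoints are expected to reach each other only inside one VRF."""
    src_vrf = vrf_of(src)
    return src_vrf is not None and src_vrf == vrf_of(dst)


VM_RED = "10.1.0.60"    # address shown in the demo
LXC_RED = "10.1.0.61"   # hypothetical address for the red LXC container
VM_GREEN = "10.2.0.60"  # hypothetical address for the green VM

if __name__ == "__main__":
    print("LXC red -> VM red:  ", expect_reachable(LXC_RED, VM_RED))
    print("LXC red -> VM green:", expect_reachable(LXC_RED, VM_GREEN))
```

A table like this can serve as the expected-results side of an automated connectivity test, with real pings on the other side.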
Now, one more thing that we can show you is that comp one has learned information about these containers. So, EVPN advertised information about these new MAC and IP addresses, which we can show in the routing table for VRF red on comp one. And here we can see the IP address of the red container, and we can see that it is reachable through the address 10.0.0.2, which is the loopback interface on server comp two. This is the IP address to which the traffic is being tunneled using VXLANs. And having finished with the containers, let's move to Kubernetes. During this part, we'll first create a deployment of two NGINX pods. First, let's check what resources are running. We can see the default pods running on Kubernetes here, and we have some namespaces preconfigured already, green and red. Again, those namespaces are typical Kubernetes namespaces. Nothing new, nothing special here. And we can use those namespaces to do a deployment. So, first let's create a deployment in the namespace called red. OK, the deployment has been successful. Let's see if it's true. And we can see that the pods in the namespace red have been deployed and have been assigned IP addresses from the networks that are assigned to compute four and compute five. We are good. Now let's move to the next part. Let's deal with the namespace green here. Again, we are using the same command, the same deployment, but we're specifying a different Kubernetes namespace. Again, it was successful, and you can see that the pods are running on compute nodes four and five and that they are given different IP addresses, dedicated to green. So there is a different second octet. So it looks fine at this point. OK, so we have pods running. Now let's create services. Let's expose the first set of pods, in the namespace red. And let's assign the static IP address 10.96.10.100.
As we said before, we divided that subnet, 10.96, into separate prefixes, and the 10 in the third octet was given to the namespace red. OK, the exposure was successful. Let's do the same with the namespace green. Done. OK, now let's go to the container on compute node two and see if the green container can reach the service that we exposed previously. OK, it took some time, but we can see that we have communication, and we can even have communication from the container directly to the pod placed in the namespace green. So it was successful. Lastly, let's check if there was some load balancing inside the namespace green. And you can see that the requests were divided almost equally between the two pods that were deployed in the namespace green. And finally, we can see the direct communication between the container and the pod itself. OK, so this concludes the Kubernetes part. Now let's move to the bare metal server. So the last thing that we want to show you during this demo is that we can also achieve this layer two communication from a device that doesn't have any support for VXLANs or EVPN. In our case it is a bare metal server, from which we would like to be able to reach the LXC container and the virtual machine at layer two, and also to be able to reach the cluster IP address in the red VRF, just like we did in the previous example. However, at the same time this bare metal server shouldn't be able to communicate with IP addresses in VRF green, and this is because it is assigned to VRF red on the networking devices. So, here the separation and multitenancy is provided by the Juniper and Arista switches, which appropriately assign traffic from the bare metal server to VRF red, and this is being done over the bond0 interface configured on the comp six server. So let's take a look at the consoles again. And here we've got the console of comp six.
We can see that there is an IP address configured, or actually received from the DHCP server, in VRF red: 10.1.0.200. And what we would like to be able to do is to ping, to communicate with VM red located in the same VRF. So this works without a problem. What should also work without a problem is communication with Kubernetes pods. So, bare metal server to Kubernetes, and it also works as expected. Now, what shouldn't work, if we've configured our switches correctly, is trying to communicate with the cluster IP located in VRF green. So, we'll try to ping it. And we can see that the message that we receive is that the destination net is unreachable. So, this subnetwork isn't even located anywhere in VRF red. It is only reachable in VRF green in this topology. So, everything is working as it should. Now, let's also take a short look at the configuration of the networking devices that allow for this connectivity. So first off, the leaf one switch, which is a Juniper, and one of the interesting commands that we might issue shows, for example, the EVPN MAC-IP table, where we can see the mapping between IP addresses and MAC addresses. And these are the addresses learned by using EVPN that are located on different servers in our data center. If someone is very curious and would like to see all of the information advertised by BGP with EVPN, we can actually show a table that lists all of this information. It might seem messy, however, you can actually see some mappings between the MAC addresses and IP addresses that are being advertised over here. What you can also see are the type 5 routes, where we advertise whole prefixes. And also, we can advertise individual IP addresses of hosts. So, all of this information is here for you if you need it for some extra configuration, for some other advanced scenarios. And just a short look at the configuration of the VRF on the networking device, over here on the Juniper.
So we've got a routing instance, which is a VRF. We've got a layer three interface, the default gateway, attached to this VRF. And we also have this magic value of the route target, which allows us to map the advertised information to a specific VRF. OK, so this is on Juniper. Now let's quickly see similar information on Arista. So, here we can also check the ARP table. What we can also check is the routing table for a specific VRF. This is VRF red, where we can see the individual host addresses as well as whole prefixes, whole subnetworks. All right. And also let's take a quick look at the configuration of the interface that connects the switch to the server comp six. So it is configured with an EVPN Ethernet segment identifier and it is carrying traffic on VLAN 20. Now, this VLAN 20 has a layer three interface, which is again a default gateway, and it is assigned to the VRF, VNI 10. So this is the red VRF. And finally, let's take a look at the BGP configuration for this VRF, where we will be able to see that it has the same route target assigned that was also used on the Juniper switch. So, as long as these route targets are the same, we've got the separation correctly configured and we can achieve connectivity for VRF red and make sure that the host will not be able to communicate with VRF green in this topology. OK. And this actually concludes our demo, so we hope that we have been able to show you how this can work in practice. Now, let's talk a little bit about the advantages and maybe some disadvantages of the things that we have shown you today. So, hopefully we were able to show that we do get this capability of transmitting layer two traffic over layer three only connections, such as an IP fabric. Now, we've shown this within a single data center. However, this concept can also be used between data centers, over data center interconnect links.
It can also be used to stretch layer two connectivity to resources located in public clouds and maybe also to some servers, edge servers, that are located somewhere on the Internet in general. So this is something that we'll cover in the next webinar. Now, another thing is the interoperability between hardware and software routing daemons. So, we've run EVPN on Juniper and Arista, and they were able to correctly communicate and exchange information with each other. And, for example, FRR can also be run on different Linux distributions, such as Red Hat, Ubuntu and others as well. So we do not depend on a single supplier of either software or hardware. And it is perfectly OK to have devices and solutions from multiple vendors in our data center. What we also showed in the presentation, in the demo, is that we can interconnect different kinds of resources. So BMSs, network appliances such as firewalls, logical interfaces on Linux, virtual machines and containers can all communicate at layer two. And this can be quite useful in scenarios when we need to migrate some applications from one platform to another. So, for example, from bare metal servers to virtual machines, especially if this layer two connectivity is required to be available during the migration. Also, it actually enables us to have a heterogeneous data center. So, we might have a data center where we are using VMs for one type of application and bare metal servers for other services, and we can actually have applications where some part of the cluster is running on VMs and another part of the cluster is running on containers, which might lead to higher uptime of such an application. EVPN also supports some extra built-in mechanisms, and it is important to know that the EVPN standard is still being extended. So, there are new drafts being written.
New functionalities might arrive in the future, and this is quite a future-proof solution because of that. EVPN is not going away. It is here to stay with us for many years to come. All right. And the last point that we wanted to make: as we've already mentioned, if we've got FRR running on most of our servers, then we do not need many switches that also support EVPN or VXLANs. OK, and now for a few disadvantages. First off, this is a solution where advanced configuration might require good networking knowledge. Obviously, this configuration can be simplified through scripts. We could use some playbooks. We could use some external management solution that does the EVPN configuration for us. But the truth is that sometimes you do need some extra knowledge in order to configure it appropriately, in a best-practice manner. Now, EVPN is relatively new, at least compared to protocols such as OSPF or IS-IS. However, it is stable, because it has a few years behind it. You might expect some minor bugs in some implementations, but nothing major. All right. And if we are using VXLAN for tunneling traffic over layer three, then we should expect slightly lower throughput. This is because we've got some extra headers, the extra UDP and VXLAN headers, in which we encapsulate layer two frames. So it means that the throughput will be a little bit lower. In case of large frames, it might be up to three percent lower, a little bit over three percent. However, if we use jumbo frames, where the maximum transmission unit of a frame is nine thousand bytes, then this effect is really minimized. It is also visible only for large packets; in case of small ones, it isn't really a problem at all. Another thing for VXLAN tunnels is extra CPU utilization. The servers need to use CPU cycles in order to encapsulate the layer two frames inside of VXLAN headers.
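The throughput figures quoted above can be checked with a little arithmetic. The sketch below builds the 8-byte VXLAN header defined in RFC 7348 and computes the share of the underlay MTU spent on encapsulation. It assumes an IPv4 outer header without options and untagged inner frames; VNI 10 is the value seen on the Arista switch in the demo, and the function names are ours, chosen only for illustration.

```python
import struct

# Bytes taken out of the underlay IP MTU by the encapsulation
# (assumption: IPv4 outer header without options, untagged inner frame).
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14
ENCAP_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET  # 50


def vxlan_header(vni):
    """Build the 8-byte VXLAN header: flags with the I bit set, 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (0x08) followed by 24 reserved bits.
    # Word 2: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", 0x08 << 24, vni << 8)


def overhead_percent(underlay_mtu):
    """Share of each full-size underlay packet spent on encapsulation."""
    return 100.0 * ENCAP_OVERHEAD / underlay_mtu


if __name__ == "__main__":
    print(vxlan_header(10).hex())
    for mtu in (1500, 9000):
        print(mtu, round(overhead_percent(mtu), 2), "%")
```

With a 1500-byte MTU the 50 bytes of headers come to roughly 3.3 percent, which matches the "a little bit over three percent" mentioned in the talk, and with 9000-byte jumbo frames the share falls to well under one percent.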
Now, this can be visible when there is a large number of small packets. However, this effect can again be minimized if we are using network interface cards which support hardware VXLAN offloading, and in this case the CPU utilization will be negligible. OK, so we are slowly coming to the end of the presentation. But first, let's talk about the problems that we have encountered. There were quite a few of them. On Juniper, we discovered that each EVPN node that is sending layer three traffic toward the Juniper switch must also advertise at least one unique type 5 route itself. It doesn't have to be a route matching the packets that the node is sending. However, it must be some kind of route. It's a bug, it's already known, and there is a PR for it as well. On the virtual Arista, the MTU has to be increased for the FWD interface and the VX interface. Those interfaces are not exposed in the configuration, so we have to switch to the shell and manually use the ip link set command to increase the MTU, to allow jumbo frame processing or even 1500-byte packet processing. To use that interface for the route leaking described in the Kubernetes scenario, we have to change the route table priorities, and we have to move the local routing table, which consists of the local interfaces, just behind the routing tables of the VRFs, as shown on the CLI output showing the defaults. So this route table has to be moved at least here to be able to do route leaking. This configuration is the default on Ubuntu, so it has to be changed by a script after the system has booted up. Another thing that we discovered is that, due to ECMP load balancing between Kubernetes nodes and the fact that the iptables rules used by kube-proxy are stateful, the masquerade-all parameter has to be enabled on kube-proxy to have services working properly.
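The routing table priority issue mentioned above is easier to see with a small model. On Linux, policy rules are evaluated in priority order and the first table that has a matching route wins. By default the local table sits at priority 0, ahead of the l3mdev (VRF) rule at priority 1000, so an address owned by the host is always answered from the local table. The sketch below is a simplified Python model of that lookup, not the kernel's implementation; the addresses, next-hop strings and table contents are hypothetical, and priority 32765 for the moved local rule is just one common choice.

```python
import ipaddress


def lookup(rules, tables, dst, vrf=None):
    """Walk policy rules by priority; return (table, next_hop) of the first hit."""
    ip = ipaddress.ip_address(dst)
    for _priority, selector in sorted(rules):
        # The l3mdev rule points at the table of the packet's VRF, if any.
        table = vrf if selector == "l3mdev" else selector
        if table is None:
            continue
        routes = tables.get(table, {})
        matches = [p for p in routes if ip in ipaddress.ip_network(p)]
        if matches:
            # Longest-prefix match inside the selected table.
            best = max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)
            return table, routes[best]
    return None


# Hypothetical table contents, for illustration only.
TABLES = {
    "local": {"192.0.2.1/32": "deliver locally"},
    "red": {"192.0.2.0/24": "via 10.0.0.2"},
    "main": {"0.0.0.0/0": "via fabric"},
}

DEFAULT_RULES = [(0, "local"), (1000, "l3mdev"), (32766, "main"), (32767, "default")]
# Local table moved just behind the VRF tables, as described in the talk.
REORDERED_RULES = [(1000, "l3mdev"), (32765, "local"), (32766, "main"), (32767, "default")]

if __name__ == "__main__":
    print(lookup(DEFAULT_RULES, TABLES, "192.0.2.1", vrf="red"))
    print(lookup(REORDERED_RULES, TABLES, "192.0.2.1", vrf="red"))
```

With the default ordering the packet in VRF red is answered from the local table; after the reordering the VRF table is consulted first, which is the behavior the route leaking setup relies on.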
The other thing that we discovered is that reverse path filtering has to be disabled on compute node two, or an SVI IP address has to be configured, just for DHCP to work. When we come to the details, this is understandable and easy to explain. However, we first had to discover that there is some kind of asymmetry that is causing packets to be dropped. Bridge hairpinning has to be enabled on the bridges on the Kubernetes nodes. Again, this is required for cluster IP load balancing; without it the packets could be dropped. And last but not least, a few issues were discovered inside the FRR daemon. However, the response from the FRR team on the Slack channel was prompt, and the fix came most of the time within 48 hours. So they are doing a great job here. OK, let's summarize what we achieved during this presentation. We have shown that Ethernet VPN is a great tool for layer two and also layer three connectivity. We can create a flexible deployment where everything can talk to each other. So, we can have pods, containers, virtual machines and bare metal servers all connected via a single layer two network or multiple layer three networks inside one VRF, with isolation between VRFs. And this solution is priceless when we're doing a migration from bare metal to VMs while keeping them in the same network. It also gives us good automation possibilities, so we can use Ansible or Terraform to create the configuration and automate the deployment even further. We have a separation between services, and this solution is very scalable. It is well known that deployments of BGP EVPN reach thousands of nodes and tens of thousands of entities inside the nodes, such as VMs or pods. OK, so we are at the end of this presentation. But before we go to the questions and answers, let's just take a look at what you might expect in the third webinar, the last one in this series. So what we will talk about is VXLAN routing between two overlay networks, how we can achieve that.
How we can communicate between the underlay network and the overlay networks of resources that are located behind those VXLAN tunnels. We'll also show how we can interconnect two or maybe even more data centers together, again using VXLANs and EVPN, and how we can extend layer two connectivity to public clouds such as Amazon and also to devices located somewhere on the Internet, such as edge computing servers or some IoT devices, also making sure that this communication is secured, that it is properly encrypted. And finally, something that we've called the branch office EVPN, where we can extend EVPN and layer two connectivity to different offices spread out throughout our company. OK, and this is it when it comes to the presentation itself. Now, let's go to the questions and answers session. If you have any questions, if something was interesting for you, if you would like to get a little bit more information about some specific part of the presentation, please ask. We'll be here for the next few minutes to answer your questions directly if it is possible. And if not, we can also answer them after the webinar has ended and put them in the comments as well, so they are kept there. One thing worth mentioning here as well is that all the configuration that was used during this presentation and the scripts that were used for the deployment are already placed on GitHub, at the link shown here. Feel free to browse through them, use them or extend them as you wish. So, if you have any questions or we can explain something even further, feel free to use the YouTube channel to reach us. Also on this GitHub page, you will find configurations of the networking devices as well. We've put the relevant EVPN configuration there. So, if you would like to configure a similar topology and test similar communication in your own lab, then you should be able to quickly adapt the configurations that we posted.
We would like to invite you to do that, because it is quite a fun experiment. OK. All right, so we have a question: Do you have any methods to enable traffic mirroring for pods/VMs in this environment? Well, since it's kernel based, at any point we can use pure tcpdump: we attach tcpdump to the veth interface, or the bridge, or the VRF itself. We can add some iptables rules that would provide us with mirroring, or we can do some kind of mirroring using sFlow. So there are multiple options, and you can use any of them. So, yes, it's possible and it is easily achievable. And if the traffic is going, for example, between a VM and a bare metal server through a networking switch, the networking switches, at least many of them, would also be able to copy such traffic and do port mirroring, sending it to some server that would be able to capture it, for example, again, using tcpdump or some other solution. OK, any other questions? OK, so thank you for watching. If you have any questions, then feel free to use the comments under that YouTube link and we'll check them frequently to provide answers. You can also reach us via GitHub by creating an issue. And the e-mail is shown on the first slide of this presentation. Feel free to use them as well. Thank you for watching. Thank you for being with us. And we invite you to the third part, which will be shown in a few weeks.

About the authors

Adam Kułagowski

Principal Network Engineer
CodiLime’s Principal Network Engineer, Adam likes to push network packets faster and faster, or to drop them on purpose. Also a reader of SciFi and an escape room enthusiast.

Jerzy Kaczmarski

Senior Network Engineer
As a Senior Network Engineer at CodiLime, Jerzy focuses on advanced solutions for Data Center and ISP environments. He is a big enthusiast of making life easier through automation, including network configuration and management. In his free time, he enjoys mountain biking and board games.
