We are almost always online in our hyper-connected world. In just 10 years, the number of active device connections rose from 8.8 billion in 2010 to 21.6 billion in 2020, and it is expected to reach 41.2 billion by 2025 (according to Statista data). This rapid growth raises new technical issues in network traffic control and processing, especially when combined with the development of 5G, a new standard offering peak data rates of up to 20 Gbps and reportedly supporting a 100-fold increase in traffic capacity and network efficiency. So how can network infrastructure be prepared for what’s coming?
The development of 5G comes with significant market expectations regarding the capabilities of the new technology, including:
- Many more devices connected to the internet (e.g. Internet of Things devices or other "smart" devices).
- Increased availability of high-quality services (e.g. 4K video streaming) requiring higher bandwidth.
- Decreased latency (a 1 ms end-to-end round trip delay at most).
Meeting these expectations, however, requires infrastructure that can handle the many devices generating this traffic. It turns out that 5G creates challenges not only at the wireless level, but also at the wired level. How come?
Network traffic is first received wirelessly by 5G antennas and is then transmitted over the wired network. As a result, wired network devices will have to carry much more traffic than in pre-5G times (read more about this here and here). For this reason, improvements in network traffic processing have begun to be incorporated into such devices.
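To put these rates in perspective, here is a back-of-the-envelope calculation of how much time a device has per packet. It assumes, for illustration only, that a wired link is saturated with minimum-size (64-byte) Ethernet frames, each also occupying 20 bytes of preamble and inter-frame gap on the wire; real traffic mixes packet sizes, so this is a worst case.

```python
# Back-of-the-envelope packet budget at full line rate.
# Assumptions (ours, not the article's): the link is saturated with
# minimum-size 64-byte Ethernet frames, and each frame also occupies
# 20 bytes of wire time (8-byte preamble/SFD + 12-byte inter-frame gap).

FRAME_BYTES = 64
WIRE_OVERHEAD_BYTES = 20


def packets_per_second(link_bps: float, frame_bytes: int = FRAME_BYTES) -> float:
    """Maximum number of frames per second the link can carry."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def ns_per_packet(link_bps: float, frame_bytes: int = FRAME_BYTES) -> float:
    """Time available to process one packet, in nanoseconds."""
    return 1e9 / packets_per_second(link_bps, frame_bytes)


if __name__ == "__main__":
    for gbps in (10, 20, 100):
        rate = gbps * 1e9
        print(f"{gbps:>3} Gbps: {packets_per_second(rate) / 1e6:6.2f} Mpps, "
              f"{ns_per_packet(rate):5.2f} ns per packet")
```

At 20 Gbps this comes to roughly 29.8 million packets per second, or about 33.6 ns per packet: only about 100 clock cycles on a 3 GHz core, which is why per-packet overheads such as interrupts and context switches matter so much.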
In this article, we focus on hardware improvements and on software improvements available on Linux PCs.
In a typical kernel network stack, packet arrival is signaled by an interrupt handled by the kernel (under high traffic, one interrupt may signal the arrival of multiple packets). The received packets then pass through the kernel's network stack, where they are modified, dropped, forwarded, and so on. Packets addressed to local applications are finally delivered to user space via sockets, where the application can read and process them.
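As a minimal illustration of the last step of this path, the Python sketch below receives UDP datagrams through an ordinary socket. It uses a loopback socket pair purely so that it runs anywhere; the point is that the application never sees the interrupt or the stack processing, only the payloads, and that each `recvfrom()` call is a system call into the kernel.

```python
from __future__ import annotations

import socket

# Classic kernel path, seen from user space: the kernel takes the interrupt,
# runs each packet through its network stack and queues the payload on the
# socket. The application only sees payloads, and every recvfrom() below is
# a separate system call, i.e. a switch into the kernel and back.


def receive_datagrams(sock: socket.socket, count: int) -> list[bytes]:
    """Read `count` UDP payloads, one recvfrom() system call per packet."""
    payloads = []
    for _ in range(count):
        data, _sender = sock.recvfrom(2048)
        payloads.append(data)
    return payloads


if __name__ == "__main__":
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))  # port 0: let the kernel choose a free port
    rx.settimeout(2.0)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for i in range(3):
        tx.sendto(f"packet {i}".encode(), rx.getsockname())
    print(receive_datagrams(rx, 3))
    tx.close()
    rx.close()
```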
What are the properties of this approach?
- Constant communication with kernel space (for configuration and for reading packet data), which results in many context switches.
- Thanks to interrupts, no CPU time is wasted on receiving packets under low traffic: we are notified when a packet arrives. Under high traffic, however, a significant amount of time is lost handling interrupts. In that case, it would be much more efficient to assume that packets arrive continuously and receive them in a constant loop.
- It is impossible to directly modify the packet processing algorithms without modifying the kernel.
- Traversing the kernel network stack can incur a performance cost that we would like to avoid under heavy network traffic.
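The interrupt trade-off described above can be captured in a toy cost model. The per-packet and per-interrupt costs below are hypothetical placeholders chosen only to make the shape of the trade-off visible; real values depend on the CPU, NIC and driver.

```python
from dataclasses import dataclass

# Toy cost model of interrupt-driven vs. busy-polling reception.
# The nanosecond costs are hypothetical placeholders, not measurements.


@dataclass(frozen=True)
class RxCostModel:
    per_packet_ns: float = 100.0   # hypothetical: useful work per packet
    interrupt_ns: float = 1_000.0  # hypothetical: overhead of taking one interrupt

    def interrupt_ns_per_packet(self, packets_per_interrupt: int = 1) -> float:
        # Interrupt overhead is shared by all packets signaled by one interrupt.
        return self.per_packet_ns + self.interrupt_ns / packets_per_interrupt

    def interrupt_cpu_load(self, pps: float, packets_per_interrupt: int = 1) -> float:
        """Fraction of one core spent on receiving `pps` packets per second."""
        return pps * self.interrupt_ns_per_packet(packets_per_interrupt) / 1e9

    @staticmethod
    def polling_cpu_load() -> float:
        """A busy-polling core spins at 100% whether or not packets arrive."""
        return 1.0

    def max_pps(self, polling: bool, packets_per_interrupt: int = 1) -> float:
        """Highest packet rate one core can sustain under this model."""
        ns = self.per_packet_ns if polling else self.interrupt_ns_per_packet(packets_per_interrupt)
        return 1e9 / ns


if __name__ == "__main__":
    m = RxCostModel()
    print(f"10 kpps with interrupts: {m.interrupt_cpu_load(10_000):.1%} of a core")
    print(f"max pps, interrupt per packet: {m.max_pps(polling=False):,.0f}")
    print(f"max pps, 64 packets/interrupt: {m.max_pps(polling=False, packets_per_interrupt=64):,.0f}")
    print(f"max pps, polling:              {m.max_pps(polling=True):,.0f}")
```

With these placeholder costs, 10,000 packets per second take about 1% of a core with interrupts versus a whole core with polling, while one interrupt per packet caps throughput near 0.9 million packets per second against 10 million for polling. Letting one interrupt cover 64 packets (about 8.6 million) narrows the gap, which is why the kernel batches packets per interrupt under load.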
So, how can we address these issues and improve network traffic processing? In this article, we briefly present two software solutions (XDP and DPDK) and one hardware solution (SmartNICs).
One software improvement is XDP (eXpress Data Path), introduced in Linux 4.8 (2016). XDP allows the programmer to write code that is executed in the kernel at a very early stage of packet reception, typically in the network driver itself. This code can decide the fate of a packet (for example, drop it, pass it on, or send it back out of the same interface) without the packet going through the kernel network stack. Under the hood, XDP is built on eBPF.
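A real XDP program is written in restricted C, compiled to eBPF bytecode and attached to a network interface; the kernel runs it for every received frame and acts on the returned verdict. The Python sketch below only models that decision logic so it can run anywhere. The action names and values mirror `enum xdp_action` from the kernel's `<linux/bpf.h>`; the filtering rule (drop UDP to port 9999), the choice to drop truncated frames, and the `build_udp_frame()` test helper are hypothetical.

```python
import struct
from enum import IntEnum


class XdpAction(IntEnum):
    """Names and values mirror `enum xdp_action` from the kernel's <linux/bpf.h>."""
    XDP_ABORTED = 0
    XDP_DROP = 1
    XDP_PASS = 2
    XDP_TX = 3
    XDP_REDIRECT = 4


ETH_HLEN = 14            # Ethernet header length (no VLAN tag)
ETH_P_IP = 0x0800        # EtherType for IPv4
IPPROTO_UDP = 17
BLOCKED_UDP_PORT = 9999  # hypothetical port to filter, chosen for illustration


def xdp_filter(frame: bytes) -> XdpAction:
    """Python model of an XDP program's decision logic: drop UDP packets sent to
    BLOCKED_UDP_PORT, pass everything else to the kernel network stack.
    Every read is preceded by a length check, as the eBPF verifier requires
    of real XDP programs; truncated frames are dropped in this sketch."""
    if len(frame) < ETH_HLEN:
        return XdpAction.XDP_DROP
    (ether_type,) = struct.unpack_from("!H", frame, 12)
    if ether_type != ETH_P_IP:
        return XdpAction.XDP_PASS

    if len(frame) < ETH_HLEN + 20:
        return XdpAction.XDP_DROP
    ihl = (frame[ETH_HLEN] & 0x0F) * 4      # IPv4 header length in bytes
    if frame[ETH_HLEN + 9] != IPPROTO_UDP:  # protocol field
        return XdpAction.XDP_PASS

    udp_offset = ETH_HLEN + ihl
    if ihl < 20 or len(frame) < udp_offset + 8:
        return XdpAction.XDP_DROP
    (dst_port,) = struct.unpack_from("!H", frame, udp_offset + 2)
    return XdpAction.XDP_DROP if dst_port == BLOCKED_UDP_PORT else XdpAction.XDP_PASS


def build_udp_frame(dst_port: int) -> bytes:
    """Hypothetical helper: a minimal Ethernet/IPv4/UDP frame for testing."""
    eth = b"\xff" * 6 + b"\x02" + b"\x00" * 5 + struct.pack("!H", ETH_P_IP)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, IPPROTO_UDP, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    udp = struct.pack("!HHHH", 40000, dst_port, 8, 0)
    return eth + ip + udp
```

For example, `xdp_filter(build_udp_frame(9999))` returns `XdpAction.XDP_DROP`, while a frame for port 53 gets `XdpAction.XDP_PASS` and continues into the kernel network stack.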
XDP solves several of the problems mentioned: packet processing becomes programmable without modifying the kernel, no time is wasted going through the stack, and far fewer context switches are needed (eBPF programs run in kernel space, so they do not require switching between kernel and user space). You can read more about XDP here.
Another software improvement, recently growing in popularity, is the use of DPDK (Data Plane Development Kit) user space poll mode drivers. DPDK allows you to configure NICs and handle all network traffic from user space. This bypasses the entire kernel stack and gives the programmer an easy way to modify the behavior of network applications without writing any code in kernel space. Additionally, almost all communication with the kernel is eliminated, so no time is wasted on context switching.
Using DPDK does have consequences, though: interrupt handling is a kernel domain, so from user space we have to poll for incoming packets. A polling core stays busy even when little traffic arrives, but for very heavy traffic, where interrupts are often just a waste of time, polling is actually an advantage.
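The receive side of a poll mode driver boils down to a busy loop that repeatedly asks the NIC queue for a burst of packets. The sketch below models that loop in Python; `SimulatedRxQueue` and its `rx_burst()` method are hypothetical stand-ins loosely modeled on DPDK's `rte_eth_rx_burst()`, which returns up to a requested number of packets without blocking, and the burst size of 32 is an assumption.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple

MAX_BURST = 32  # assumed burst size; DPDK examples commonly use 32


class SimulatedRxQueue:
    """Hypothetical stand-in for an RX ring owned by a user space poll mode driver."""

    def __init__(self) -> None:
        self._ring: deque = deque()

    def deliver(self, packets: Iterable[object]) -> None:
        """Simulates the NIC writing packets into the ring (done by DMA in reality)."""
        self._ring.extend(packets)

    def rx_burst(self, max_pkts: int) -> List[object]:
        """Non-blocking: returns up to max_pkts packets, possibly none."""
        burst = []
        while self._ring and len(burst) < max_pkts:
            burst.append(self._ring.popleft())
        return burst


def poll_loop(queue: SimulatedRxQueue, handle: Callable[[object], None],
              max_polls: int) -> Tuple[int, int]:
    """Busy-poll the queue: no interrupts and no per-packet system calls.
    A real PMD loop runs forever; max_polls bounds it for the example."""
    processed = empty_polls = 0
    for _ in range(max_polls):
        burst = queue.rx_burst(MAX_BURST)
        if not burst:
            empty_polls += 1  # the core keeps spinning even when the link is idle
            continue
        for pkt in burst:
            handle(pkt)
        processed += len(burst)
    return processed, empty_polls
```

Delivering 100 packets and polling 10 times yields four non-empty bursts (32 + 32 + 32 + 4) and six empty polls: at high load every poll does useful work, while at low load the empty polls are the price of never waiting for an interrupt.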
The hardware improvement is the SmartNIC, a programmable network card. A SmartNIC extends a regular network card with extensive configuration of the card's behavior for specific types of traffic. For example, you can program the SmartNIC from the driver level to perform a specific action for a certain type of packet without the CPU having to read the packet. The packet is then processed entirely by the network card, which greatly reduces latency and CPU load.
Unfortunately, SmartNICs are often relatively expensive and difficult to deploy: configuring a SmartNIC often requires writing a much more complicated driver than for a regular network card. This is slowly changing with the increasing popularity of P4, a language for programming the packet processing of network devices in a common, target-independent way (you can read more about P4 here and here).
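P4 programs, and many SmartNIC offloads, are organized around match-action tables: the control plane installs entries that map header fields to actions, and the data plane on the card looks up every packet and executes the matching action. The Python sketch below models one exact-match table; the key layout (IP protocol, destination port), the action names and the entries are hypothetical. In this model, only packets that end up with the default `send_to_host` action would ever reach the CPU.

```python
from typing import Dict, Tuple

# Hypothetical match-action table in the spirit of P4 / SmartNIC offload.
# Key: (IP protocol number, destination port). Action: (name, parameters).
Key = Tuple[int, int]
Action = Tuple[str, Dict[str, int]]


class MatchActionTable:
    """Exact-match table: the control plane (driver) installs entries,
    the data plane (the card) looks up each packet and applies the action."""

    def __init__(self, default_action: str) -> None:
        self._entries: Dict[Key, Action] = {}
        self.default_action = default_action

    def add_entry(self, key: Key, action: str, **params: int) -> None:
        self._entries[key] = (action, params)

    def apply(self, proto: int, dst_port: int) -> Action:
        # Unmatched packets fall back to the default action.
        return self._entries.get((proto, dst_port), (self.default_action, {}))


if __name__ == "__main__":
    table = MatchActionTable(default_action="send_to_host")
    table.add_entry((17, 9999), "drop")                 # UDP to port 9999
    table.add_entry((6, 80), "forward", egress_port=2)  # TCP to port 80
    for pkt in [(17, 9999), (6, 80), (17, 53)]:
        print(pkt, "->", table.apply(*pkt))
```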
The approaches described above are among the most popular ways to improve network traffic processing. However, there is another, less popular hardware-based approach with interesting potential that is still being researched: the GPU.
Some time ago, graphics cards began to be used for more general problems than just rendering images (so-called general-purpose computing on GPUs, or GPGPU). Today, GPUs are used for many other parallel problems that require substantial computing power, such as machine learning, weather forecasting, and physics simulations.
Graphics cards offer the programmer massive multithreading at the cost of operating in the SIMD (Single Instruction, Multiple Data) model, i.e. all threads execute the same instructions, but on different data. This model imposes limitations that must be taken into account when designing algorithms for the GPU. For problems requiring a lot of synchronization between threads, these limitations are hard to get around. However, tasks that involve many similar calculations without frequent communication may be much better suited to such a model.
Recently, research has focused on testing the use of GPUs for network traffic processing. Intuitively, this seems quite sensible – GPUs provide massive computing power which can be used to manage millions of packets per second. Also, the SIMD model doesn’t seem to be very restrictive in this field – network processing algorithms often perform the same procedure for each incoming packet, which should be easy to parallelize on the GPU.
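To make the "same procedure for each packet" idea concrete, the sketch below writes a per-packet computation in the style of a GPU kernel: each thread, identified by `tid`, runs identical code on its own packet header. The computation chosen here, the RFC 1071 IPv4 header checksum, is our example rather than one taken from the research mentioned above, and `launch()` is a hypothetical sequential stand-in for a GPU launch, so this runs on the CPU and shows only the programming model, not GPU performance.

```python
from typing import Callable, List


def ipv4_checksum(header: bytes) -> int:
    """RFC 1071 Internet checksum of an IPv4 header (one's complement of the
    one's complement sum of its 16-bit words)."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def checksum_kernel(tid: int, headers: List[bytes], out: List[int]) -> None:
    """Work of one GPU thread: the same instructions, applied to packet `tid`."""
    if tid < len(headers):                  # guard for surplus threads, as in CUDA kernels
        out[tid] = ipv4_checksum(headers[tid])


def launch(kernel: Callable[..., None], n_threads: int, *args: object) -> None:
    """Sequential CPU stand-in for launching `n_threads` GPU threads at once."""
    for tid in range(n_threads):
        kernel(tid, *args)


if __name__ == "__main__":
    # Example header with its checksum field zeroed; its checksum is 0xB861.
    hdr = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
    headers = [hdr] * 4
    out = [0] * len(headers)
    launch(checksum_kernel, 8, headers, out)  # more threads than packets is fine
    print([hex(c) for c in out])
```

Because no thread depends on another thread's result, this kind of per-packet work needs no synchronization, which is exactly the case the SIMD model handles well.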
Another argument in favor of using GPUs in traffic control is their popularity, which results in a lower price than, for example, a SmartNIC with comparable performance. Programming is also much easier than for SmartNICs, thanks to frameworks such as CUDA (for NVIDIA GPUs) and the vendor-independent OpenCL.
However, using GPUs in computer networks is still a very niche approach. This raises the question: is it just a technological novelty that sounds good conceptually but turns out to be unrealistic in practice?
Before considering the use of GPUs in networks as something viable, the following areas should be examined:
- Network traffic processing algorithms:
Algorithms running on GPUs are often significantly different from algorithms solving the same problem on CPUs. Sometimes using the GPU to solve a problem proves to be inefficient because, for example, the problem is difficult to distribute across thousands of GPU threads.
- Technological limitations:
We can create efficient packet processing algorithms that conceptually run up to 100 times faster on the GPU than on the CPU, but we will gain nothing if, for example, copying the packet data to the GPU takes significantly longer than solving the entire problem on the CPU. In such a situation, in addition to well-designed algorithms, hardware support for moving packet data to the GPU efficiently is also required.
- Software support:
Sometimes the hardware supports the technology, but there are software issues, such as missing support in device drivers or a lack of examples showing how to use it at all. Such gaps make the technology difficult to adopt and much less practical.
- Comparing the whole solution on GPU and CPU:
Even if all these problems are solved, a GPU-based setup may turn out to cost more than a CPU-only setup achieving the same results.
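The technological limitation described above comes down to a simple inequality: offloading pays off only if t_copy + t_cpu / s < t_cpu, where t_copy is the time to move packet data to (and possibly from) the GPU, t_cpu the CPU-only processing time and s the GPU compute speedup. The sketch below evaluates it with hypothetical numbers: a 1 MB batch, a 100x speedup (the upper figure mentioned above) and an assumed effective bus bandwidth of 12 GB/s, somewhat below the roughly 16 GB/s theoretical peak of a PCIe 3.0 x16 link.

```python
from dataclasses import dataclass

# Break-even check for offloading a batch of packets to a GPU.
# All numbers used with this model are hypothetical, not measurements.
# Note: 1 GB/s equals 1 byte per nanosecond.


@dataclass(frozen=True)
class OffloadCase:
    batch_bytes: float        # packet data copied to the GPU per batch
    cpu_ns: float             # time to process the batch on the CPU alone
    gpu_speedup: float        # GPU kernel speedup over the CPU, compute only
    link_bytes_per_ns: float  # effective host-to-GPU bandwidth

    def gpu_ns(self, copy_back: bool = True) -> float:
        """Copy in (and optionally back out) plus GPU compute time."""
        copies = 2 if copy_back else 1
        return copies * self.batch_bytes / self.link_bytes_per_ns + self.cpu_ns / self.gpu_speedup

    def worth_offloading(self, copy_back: bool = True) -> bool:
        return self.gpu_ns(copy_back) < self.cpu_ns


if __name__ == "__main__":
    # 1 MB batch, 100x kernel speedup, ~12 GB/s effective bus bandwidth (all assumed).
    light = OffloadCase(1e6, cpu_ns=100_000, gpu_speedup=100, link_bytes_per_ns=12)
    heavy = OffloadCase(1e6, cpu_ns=1_000_000, gpu_speedup=100, link_bytes_per_ns=12)
    for name, case in (("light", light), ("heavy", heavy)):
        print(f"{name}: CPU {case.cpu_ns:,.0f} ns, GPU {case.gpu_ns():,.0f} ns, "
              f"offload: {case.worth_offloading()}")
```

With these assumptions, the light workload stays on the CPU (about 167,700 ns on the GPU path versus 100,000 ns), while the heavy one benefits (about 176,700 ns versus 1,000,000 ns). The copy term does not shrink with the speedup, which is why hardware support for faster data movement matters as much as the algorithm.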
The idea of using GPUs in networking, although attractive, still seems to be a big unknown and requires further study, which is currently being conducted. Hopefully, more data on the matter will be available soon, especially given the huge potential of GPUs for network processing. With the growth of 5G and extreme connectivity, we need to look for cost- and time-efficient solutions.