Service mesh solutions are becoming an important element of modern applications built on a microservices architecture. But now and then you can also hear about a project called Network Service Mesh. Is it another variation on the service mesh theme or something completely different?
Over the past few years, we have seen a shift away from approaches based on monolithic code when designing software applications. Instead, modern design is based on microservice architecture. At the end of the day, it is about delivering basically the same business logic, not in the form of a large monolith but as a collection of loosely coupled and independently deployable services.
Why do this? Well, it has undoubted advantages. For example, microservices can have their own development cycles, can be developed using different programming languages and frameworks, and can be owned and managed by different teams. Such a decomposition into functional modules allows you to optimize the delivery process of the entire application. Extending the application with new functionality, or updating and replacing individual functional pieces, is much easier here. What is more, microservice architecture is by nature well suited to containerization-based deployments (e.g. in environments such as Kubernetes), which is certainly a very powerful feature, especially today.
However, it is worth knowing that this approach also brings considerable challenges. If we want an application based on microservices to work reliably and efficiently and to meet security requirements (this is kind of a “must” when building modern, scalable, server-side applications), each individual microservice must implement appropriate mechanisms to ensure this.
What is important here is that all aspects related to proper communication and interaction between microservices are complex. Theoretically, they could be handled by the microservices' development teams through appropriate implementations within the services themselves, but this is not an optimal approach.
Firstly, these tasks are time-consuming and, because they have to be repeated in every service, prone to errors. Secondly, developers should rather focus on implementing the actual business logic of the given microservice.
And here the service mesh concept comes in handy. This term is used to describe a kind of network that connects all microservices within a given application. In fact, it is extra software that acts as an intermediary layer between services and provides functions such as service discovery, service-to-service authentication, load balancing and traffic management, monitoring, resilience, etc.
The logical architecture of such a system is quite simple because it contains only two components: a control plane and a distributed dataplane (see Figure 1 below). The dataplane consists of proxies installed next to each service instance. They act as L7 proxies (working in both “forward” and “reverse” mode) and handle calls to and from the services. When a particular service mesh implementation (such as Istio or Linkerd) is deployed in a Kubernetes cluster, the proxies are sidecar containers that run in the same pods as the service containers. The control plane provides the core functionality of the service mesh system. In short, it configures and coordinates the behavior of the proxies.
Fig. 1 Service mesh architecture overview
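To make the split between the two planes more tangible, here is a minimal sketch in Python. It is not how Istio or Linkerd are implemented; the class names `ControlPlane` and `SidecarProxy`, their methods and the addresses are all hypothetical and serve only to illustrate the idea that the control plane holds the desired configuration and pushes it to every proxy in the dataplane.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SidecarProxy:
    """Hypothetical dataplane proxy running next to one service instance."""
    service: str
    routes: Dict[str, List[str]] = field(default_factory=dict)

    def apply_config(self, routes: Dict[str, List[str]]) -> None:
        # Keep a private copy, so the proxy only changes on an explicit push.
        self.routes = {name: list(addrs) for name, addrs in routes.items()}

    def resolve(self, target: str) -> List[str]:
        # Return the known addresses of the target service (empty if unknown).
        return self.routes.get(target, [])


@dataclass
class ControlPlane:
    """Hypothetical control plane: stores endpoints and configures proxies."""
    proxies: List[SidecarProxy] = field(default_factory=list)
    endpoints: Dict[str, List[str]] = field(default_factory=dict)

    def attach(self, proxy: SidecarProxy) -> None:
        # A newly attached proxy immediately receives the current configuration.
        self.proxies.append(proxy)
        proxy.apply_config(self.endpoints)

    def register_endpoint(self, service: str, address: str) -> None:
        # Record the new endpoint and push the updated view to all proxies.
        self.endpoints.setdefault(service, []).append(address)
        for proxy in self.proxies:
            proxy.apply_config(self.endpoints)
```

In this sketch, a proxy attached for a `frontend` service learns about a newly registered `orders` endpoint without the `frontend` code being involved at all, which is the main point of the sidecar approach.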
So what features can a service mesh offer? Well, they may differ slightly depending on the specific implementation you want to use. However, the key functions are listed below, broken down into categories.
Routing and Traffic Management
- performing dynamic service discovery and proxying/L7-load-balancing for various types of protocols, e.g. HTTP/1.x, HTTP2 (including gRPC), WebSocket and other TCP-based protocols
- supporting different load-balancing algorithms (e.g. those based on Round Robin, Least Connection, etc.) and mechanisms (percentage-based, header-based, path-based traffic splits, etc.)
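The two load-balancing algorithms named above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a fixed list of backends, no health checks); the class names are hypothetical and do not correspond to any particular service mesh.

```python
import itertools
from typing import Dict, Iterable


class RoundRobinBalancer:
    """Hands out backends in a fixed, repeating order."""

    def __init__(self, backends: Iterable[str]) -> None:
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backends)

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnectionBalancer:
    """Picks the backend with the fewest connections currently in use."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._active: Dict[str, int] = {b: 0 for b in backends}
        if not self._active:
            raise ValueError("at least one backend is required")

    def acquire(self) -> str:
        # On a tie, the backend listed first wins (dicts keep insertion order).
        backend = min(self._active, key=self._active.get)
        self._active[backend] += 1
        return backend

    def release(self, backend: str) -> None:
        # Call this when the request finishes, so the counter stays accurate.
        self._active[backend] -= 1
```

Round Robin ignores how busy each backend is, whereas Least Connection needs the proxy to track open connections; this is one reason why such logic is convenient to keep in the proxy rather than in every service.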
Reliability and Resilience
- supporting deployments based on a canary strategy (splitting traffic based on application versions)
- defining policies for request retries and timeouts (when calling the services instances)
- enabling artificial fault injection (to test the resiliency of applications)
Observability
- collecting various types of metrics for services, e.g. request volumes, success rates, latencies, etc.
- tracing (e.g. gathering data needed to troubleshoot latency issues)
- offering verbose logging capabilities
- integration with different observability backends, such as Prometheus (monitoring), Zipkin or Jaeger (tracing), Fluentd (logging), etc.
- drawing service topology graphs
Security
- enabling secure communication between service instances (through providing mutual TLS functionality)
- providing advanced authentication and authorization mechanisms (sophisticated policies for enforcing fine-grained security control within the cluster of microservices)
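Two of the resilience features listed above, retry policies and artificial fault injection, work well together: injected faults let you check that the retry policy behaves as intended. The following Python sketch shows the idea; the function names, the exponential backoff and the default values are assumptions made for illustration, not the configuration model of any specific mesh.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised by the fault injector instead of calling the real service."""


def with_fault_injection(func: Callable[..., T], failure_rate: float,
                         rng: random.Random) -> Callable[..., T]:
    """Wrap func so that a fraction of calls fails artificially."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func: Callable[[], T], attempts: int = 3,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func up to `attempts` times, doubling the delay between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

In a service mesh, this logic lives in the proxy and is driven by policy, so the same retry behavior applies to every service without any change to the application code.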
To see how widely service meshes are actually used, we will take the CNCF Survey 2020 report as our reference. As stated in this report:
“CNCF surveyed its extended community during May and June 2020 and received 1,324 responses. (...) Two-thirds of respondents were from organizations with more than 100 employees, and 30% were from organizations with more than 5,000 employees, showing a strong enterprise representation. The majority of respondents (56%) came from Software/Technology organizations. Other industries represented include Financial Services (9%), Consulting (6%), and Telecommunications (5%).”
The results of the survey show that the use of a service mesh in production grew from 18% in the previous year to 27% in 2020 (a 50% relative increase). This growth is expected to continue over the next few years, as 23% of respondents are currently evaluating service mesh-based solutions and another 19% have declared they will start using a service mesh in the near future.
Fig. 2 Interest in a service mesh amongst organizations (source: CNCF Survey 2020)
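As a quick check of the growth figure quoted from the survey: moving from 18% to 27% is a gain of 9 percentage points, which is 50% of the starting value. The small helper below (a hypothetical name, used only for this calculation) makes the distinction explicit.

```python
def relative_increase(old: float, new: float) -> float:
    """Return the growth from old to new as a fraction of old."""
    return (new - old) / old


# 18% -> 27% of respondents using a service mesh in production
print(f"{relative_increase(18, 27):.0%}")  # prints 50%
```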
A service mesh has an application-centric focus (layer 7 of the OSI model, with protocols like HTTPS or gRPC, and east-west traffic within the Kubernetes cluster) and solves many challenges related to higher-level networking. But what about use-cases that require lower-level networking functionality, e.g. L2/L3 network features or connectivity spanning outside the K8s cluster domain? Kubernetes by itself does not provide a solution, as it concentrates on container orchestration, and neither do service meshes, as already explained.
This is where Network Service Mesh (NSM) comes in. It aims to offer connectivity, observability, security, configurability and discoverability for lower layers, on a network service level (NSM focuses on processing and forwarding frames and/or packets rather than terminating connections and providing application services).
As its name indicates, Network Service Mesh was inspired by the service mesh concept and has many analogies to it. It is not another service mesh implementation but a parallel solution which can, in fact, work well alongside a service mesh like Istio (in the sense that both can be used in the same cluster).
Fig. 3 Comparison of a service mesh and Network Service Mesh - the network layers on which they work
Network Service Mesh gives applications/workloads much finer-grained control over the lower-level networking stack than a service mesh or a standard Kubernetes CNI does, while also simplifying the use of low-level features. It follows cloud native concepts with a declarative configuration that describes the intended network state (which is then deployed and configured); one could say, the standard “Kubernetes way”. No changes in applications are required to start working with NSM: pods can leverage NSM features by declaring which Network Services they are part of (and Network Service resources are defined in the same manner).
Network Service Mesh is an open source project and part of the Cloud Native Computing Foundation (CNCF). It is also used as an example use-case in the CNCF Testbed.
NSM has been created to solve some limitations of existing networking models in cloud native environments:
- It supports multi-cluster connectivity and connectivity for hybrid environments (e.g. K8s and VMs):
  - Application workloads can connect to Network Service(s), independent of where they run
  - Applications can connect to multiple service meshes at the same time
  - NSM can provide an inter-cluster connectivity domain for a service mesh like Istio
- It allows easy creation of Service Function Chaining (aka service composition)
- In the NFV context, NSM can provide support for high bandwidth and highly configurable environments
- It provides support for non-standard protocols (e.g. proprietary DB replication protocols)
These are example use-cases. It is worth noting that NSM can support complex networking cases that are not possible with “standard” solutions, which deal only with the higher layers of the networking stack.
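Service Function Chaining, mentioned among the use-cases, simply means that traffic passes through an ordered series of network functions. The Python sketch below illustrates the composition idea only; the packet representation (a plain dict) and the functions `firewall`, `tagger` and `chain` are hypothetical and are not part of NSM.

```python
from typing import Callable, Optional, Set

Packet = dict
NetworkFunction = Callable[[Packet], Optional[Packet]]


def firewall(blocked_ports: Set[int]) -> NetworkFunction:
    """Drop packets addressed to a blocked destination port."""
    def apply(packet: Packet) -> Optional[Packet]:
        return None if packet["dst_port"] in blocked_ports else packet
    return apply


def tagger(tag: str) -> NetworkFunction:
    """Return a copy of the packet with an extra tag appended."""
    def apply(packet: Packet) -> Optional[Packet]:
        return {**packet, "tags": packet.get("tags", []) + [tag]}
    return apply


def chain(*functions: NetworkFunction) -> NetworkFunction:
    """Compose functions in order; a None result means the packet was dropped."""
    def apply(packet: Packet) -> Optional[Packet]:
        current: Optional[Packet] = packet
        for fn in functions:
            current = fn(current)
            if current is None:
                return None
        return current
    return apply
```

With `chain(firewall({23}), tagger("inspected"))`, a packet to port 443 comes out tagged, while a packet to port 23 is dropped at the first function and never reaches the second.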
The NSM solution is not tied to a particular runtime domain (for example, it can be used in the VMs context as well as in K8s) though in this article we focus on container environments, such as Kubernetes.
Network Service Mesh provides additional features to K8s, though it does not replace the existing K8s networking model, CNI. Instead, both CNI and NSM can work in parallel. Also, NSM is complementary to traditional service meshes like Linkerd, Istio or Consul.
Within the Network Service Mesh concept, one can distinguish the following elements:
- Network Service Client (NSC) or simply Client - is an application workload which connects to a Network Service (a Client can connect to many Network Services at the same time). A Client can be a Pod, a VM or even a physical server.
- Network Service Endpoint (NSE) or simply Endpoint - provides a Network Service to a Client. It can be realized as a local Pod, a remote Pod (in a different cluster than the one where the client pod is located), a VM, any other function that processes packets, etc.
- vWire (virtual Wire) - connects a Client to an Endpoint (carries frames/packets between the Client and the Endpoint). A vWire provides simple functionality: a packet entering the vWire at one end (ingress) will leave at the other end (egress).
- Network Service - is defined as a collection of connectivity, security, and observability features applied to traffic. In its most basic form, it is just a distributed L3 domain that allows the workloads to communicate via IP.
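The vWire behavior described above (what goes in at one end comes out at the other) can be modeled very simply. The following Python sketch is a toy model under our own assumptions: the `VWire` class, its methods and the end names are hypothetical and only illustrate the contract between a Client and an Endpoint.

```python
from collections import deque
from typing import Deque, Dict, Optional


class VWire:
    """Toy point-to-point link between exactly one Client and one Endpoint."""

    def __init__(self, client: str, endpoint: str) -> None:
        if client == endpoint:
            raise ValueError("a vWire needs two distinct ends")
        self.client = client
        self.endpoint = endpoint
        # One delivery queue per end: frames waiting to leave at that end.
        self._queues: Dict[str, Deque[bytes]] = {client: deque(), endpoint: deque()}

    def _peer(self, end: str) -> str:
        if end == self.client:
            return self.endpoint
        if end == self.endpoint:
            return self.client
        raise ValueError(f"{end!r} is not attached to this vWire")

    def send(self, from_end: str, frame: bytes) -> None:
        # A frame entering at one end is queued for delivery at the other end.
        self._queues[self._peer(from_end)].append(frame)

    def receive(self, at_end: str) -> Optional[bytes]:
        # Frames leave in the order they entered; None means nothing is waiting.
        queue = self._queues[at_end]
        return queue.popleft() if queue else None
```

Note that the wire itself does not inspect or modify frames; any processing is the job of the Endpoint providing the Network Service.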
Fig. 4 Network Service Mesh components in a Kubernetes environment - high level view
Network Service Mesh components for Kubernetes environments are depicted in Fig. 4 (together with example Network Services). Their roles can be explained as follows:
- Network Service Registry (NSR) - contains a list of available Network Services and Network Service Endpoints. Additionally, the NSM architecture supports Registry Domains, allowing multiple independent registries to coexist.
- Network Service Manager (NSMgr) - is a control plane component (deployed as a daemon set on the K8s cluster) responsible for forming a full mesh, by establishing communication with other Managers (NSMgr) within a given domain. It manages Network Service requests coming from clients’ pods and the process of creating a vWire between client and endpoint.
- Network Service Mesh Forwarder - a dataplane component, responsible for providing forwarding mechanisms. NSM can use forwarding solutions like VPP, SR-IOV, kernel networking, etc.
- Admission Webhook - Network Service Mesh uses the K8s Admission Controller approach to monitor the deployment of client pods and reacts when their deployment manifest files include annotations related to NSM. In such a case, the Admission Webhook adds an NSM init container to the pod, which is responsible for setting up the requested Network Service. The NSM init container negotiates with the NSMgr to accomplish this and, as a result, a Network Service interface is injected into the client pod. The process is transparent from a client pod perspective.
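The Admission Webhook logic can be pictured as a small mutation applied to the pod manifest. The Python sketch below is a simplification made under several assumptions: the annotation key, the init container name and the image are placeholders (check the official repository for the values used by your NSM version), and a real Kubernetes mutating webhook returns a patch inside an AdmissionReview response rather than editing a dict directly.

```python
import copy

# Assumed annotation key; verify it against the NSM version you deploy.
NSM_ANNOTATION = "ns.networkservicemesh.io"

# Hypothetical init container; name and image are placeholders.
NSM_INIT_CONTAINER = {
    "name": "nsm-init",
    "image": "example.invalid/nsm-init:latest",
}


def mutate_pod(pod: dict) -> dict:
    """Return the pod with an NSM init container if it carries the annotation."""
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    if NSM_ANNOTATION not in annotations:
        return pod  # Not an NSM client: leave the pod untouched.

    mutated = copy.deepcopy(pod)  # Never modify the incoming object in place.
    init_containers = mutated.setdefault("spec", {}).setdefault("initContainers", [])
    already_injected = any(
        c.get("name") == NSM_INIT_CONTAINER["name"] for c in init_containers
    )
    if not already_injected:
        init_containers.append(dict(NSM_INIT_CONTAINER))
    return mutated
```

Because the decision depends only on the annotation, the application container itself stays unchanged, which is what makes the process transparent from the client pod's perspective.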
To make it work, two API endpoints are added to K8s:
- Network Service API - used to Request, Close, or Monitor vWire Connections between a Client and Endpoint providing the requested Network Service
- Registry API - Used to Register, UnRegister, and Find Network Services and the Network Service Endpoints that provide them
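The operations of the two APIs can be illustrated with an in-memory model. The Python sketch below mirrors the verbs named above (Register, UnRegister, Find and Request, Close, Monitor), but the classes, method signatures and connection identifiers are our own assumptions and do not reproduce the actual NSM API definitions.

```python
from typing import Dict, List, Set, Tuple


class Registry:
    """Toy registry: maps Network Service names to Endpoint names."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, Set[str]] = {}

    def register(self, service: str, endpoint: str) -> None:
        self._endpoints.setdefault(service, set()).add(endpoint)

    def unregister(self, service: str, endpoint: str) -> None:
        endpoints = self._endpoints.get(service, set())
        endpoints.discard(endpoint)
        if not endpoints:
            self._endpoints.pop(service, None)

    def find(self, service: str) -> List[str]:
        # Sorted so that the result is deterministic.
        return sorted(self._endpoints.get(service, ()))


class Manager:
    """Toy manager: creates and tracks Client-to-Endpoint connections."""

    def __init__(self, registry: Registry) -> None:
        self._registry = registry
        self._connections: Dict[str, Tuple[str, str]] = {}
        self._next_id = 1

    def request(self, client: str, service: str) -> str:
        candidates = self._registry.find(service)
        if not candidates:
            raise LookupError(f"no endpoint provides {service!r}")
        connection_id = f"conn-{self._next_id}"
        self._next_id += 1
        # Simplest possible selection: take the first matching Endpoint.
        self._connections[connection_id] = (client, candidates[0])
        return connection_id

    def close(self, connection_id: str) -> None:
        self._connections.pop(connection_id, None)

    def monitor(self) -> Dict[str, Tuple[str, str]]:
        # Return a snapshot, so callers cannot alter the internal state.
        return dict(self._connections)
```

The important design point is the separation of concerns: Endpoints only announce what they provide, Clients only name the service they need, and the Manager matches the two.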
Additionally, NSM integrates with SPIFFE/SPIRE to provide authentication and authorization functionality (this allows fine-grained security configuration, e.g. a workload can be connected only to the required Network Service(s) and separated from any other).
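The fine-grained control mentioned above boils down to a deny-by-default check of a workload identity against the Network Services it may use. The Python sketch below shows only that check; the policy structure, the identities and the function name are hypothetical, and in a real deployment the SPIFFE ID would come from a verified identity document issued by SPIRE rather than from a plain string.

```python
from typing import Dict, Set

# Hypothetical policy: SPIFFE ID -> Network Services the workload may join.
POLICY: Dict[str, Set[str]] = {
    "spiffe://example.org/ns/default/sa/web": {"secure-intranet"},
    "spiffe://example.org/ns/default/sa/db": {"secure-intranet", "db-replication"},
}


def is_allowed(spiffe_id: str, service: str, policy: Dict[str, Set[str]]) -> bool:
    """Deny by default; allow only services explicitly listed for the identity."""
    if not spiffe_id.startswith("spiffe://"):
        return False  # Not a SPIFFE ID at all.
    return service in policy.get(spiffe_id, set())
```

With this policy, the `web` workload may join `secure-intranet` but not `db-replication`, and an identity that is not listed at all is refused everywhere.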
Deployment scripts and manifests support different types of K8s environments, including local ones, GKE (on GCP), AKS (on Azure) and EKS (on AWS).
The official repository contains several deployment configuration examples, ranging from a basic deployment to more advanced NSM features and use-cases. Based on those examples, you can build your own solutions (by taking and modifying the required elements).
Is Network Service Mesh a service mesh? Well, strictly speaking, it is not. NSM extends the service mesh idea by introducing similar concepts into the lower level network stack. It should be seen as a complementary solution (which co-exists with current network models, including the K8s CNI and service meshes), allowing the support of complex network use-cases in a cloud-native fashion.