Debugging Nginx Ingress in Kubernetes – a study in (Codi)Lime
This story comes with everything one needs to tell the perfect noir detective story. There’s an investigation, a mysterious victim and a silent psycho mass-murderer. Only the setting is changed, with Kubernetes clusters instead of Victorian-era London and the CodiLime team smoking Sherlock Holmes’ pipe.
So pour yourself some whiskey, light up a cigar and enjoy your reading!
Kubernetes is currently one of the most popular open-source systems for deploying and managing applications. Yet it wouldn’t be so useful without Ingress, which lets the outside world reach the components inside a Kubernetes cluster over HTTP or HTTPS. Ingress is what allows an external front end to communicate with an app running in the cluster.
Ingress can be configured to give services an externally reachable URL, load balance traffic or offer name-based virtual hosting, among others. All the dirty work is done by the Ingress controller. There is no consensus about how an Ingress controller should be developed, so the community provides a bunch of solutions. On the other hand, Kubernetes provides a set of rules to follow when developing the controller, so there is no risk it won’t work with Ingress.
There are currently a few notable Ingress controllers:
- Nginx Ingress Controller – based on Nginx – a lightweight, high-performance web server/reverse proxy. Battle tested and extremely flexible.
- Kubernetes Ingress Controller for Kong – an Ingress controller based on Kong API Gateway. It combines all Kong API management capabilities with a Kubernetes native approach to ingress management.
- Ambassador – an open-source Kubernetes-native API gateway built on the Envoy Proxy.
- Traefik – an alternative to Nginx: a reverse proxy and load balancer that focuses on making configuration easier and has built-in Let’s Encrypt support.
Most cloud providers, including Amazon Web Services, Microsoft Azure and Google Cloud Platform, provide their own ingress controllers, which integrate directly with their own cloud solutions.
In our story, Nginx is the one to follow. CodiLime uses Nginx, as it is a proven and reliable technology used (also as a web server) by more than 24% of all websites on the Internet. It’s fast, comes with a low memory footprint and supports all the features our projects call for.
The minute before the crime
The most common way of working with an ingress controller is no different from a traditional web server configuration: done once and kept forever (or at least that is the hope). Run it and forget about it, especially considering the relatively static nature of most web-based services.
Our project, however, requires a different approach. The platform we build enables users to create workspaces for data scientists. Establishing a new project requires assigning a particular amount of computing power for machine learning purposes. This changes the configuration of the cluster. The new project becomes a new entity, accessible to a new group of users, who may or may not need access to the rest of the resources. To make that work, ingress needs to be reconfigured and reloaded.
Long story short – the ingress configuration needs to be changed. And that’s where Nginx shines. By most accounts, it never loses a connection after reloading, and that’s a feature perfectly suited to our needs.
The mysterious victim
A seed of doubt
It exists and it grows
We designed the app to enable users to create a new workspace by sending an API call to the Kubernetes app. The backend then starts the entire process of creating the project and assigning the resources. The front end, meanwhile, waits for the confirmation of the end of the process from the backend.
The connection, alas, is lost, bringing us to the victim: the app. But who is the killer?
“He was such a polite ingress controller”
After all the hypotheses were checked and the different scenarios tested, an unexpected answer came. Yeah, it’s true, Nginx keeps the connection after reloading. But for no longer than ten seconds.
Basically, the frontend sent the API call to the backend and waited for a response. The backend launched the procedure for building a workspace for the project. When all the work was done, the backend would notify the frontend that the job had been handled.
But no one was waiting there.
Nginx’s reputation as a web server and ingress controller that keeps connections alive across a reload was not necessarily wrong. The web server’s reload process is based on workers. New ones are started just after the reload to handle new connections. The old ones, kept around during the reload, are killed one by one with no mercy. Thus, the connection is kept, but only for ten seconds. That is enough for a smooth transition for every typical HTTP and HTTPS connection.
But not this time. Our app was not typical. The process of establishing a new workspace took up to 30 seconds, 20 seconds longer than the web server allowed. Basically, our frontend was waiting for a signal on a connection that was already lost.
A tricky beast you are, Nginx. The crime was almost perfect, but this time, Nginx, you messed with the wrong guys.
So what did we do? In the end, it was easier than we thought it would be: the timeout was manageable; we just changed it from ten seconds to ten minutes, which is more than enough time to handle the process.
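The arithmetic behind the fix can be sketched in a few lines of Python. This is only a toy model of the timeline described above, not Nginx code: the function name and its parameters are made up for illustration, and it assumes the old worker is killed exactly when the grace period ends. The article does not name the setting that was changed; in the ingress-nginx controller it is likely the `worker-shutdown-timeout` ConfigMap option, but treat that as an assumption.

```python
def survives_reload(request_seconds, reload_at, shutdown_timeout):
    """Toy model: does a request that starts at t=0 finish before its worker dies?

    request_seconds  -- how long the backend needs to answer
    reload_at        -- seconds after the request started when Nginx reloads
    shutdown_timeout -- grace period given to old workers after a reload
    All three names are hypothetical and exist only for this sketch.
    """
    if reload_at >= request_seconds:
        # The response was already sent before the reload happened.
        return True
    # The old worker holding our connection is killed at this moment.
    worker_killed_at = reload_at + shutdown_timeout
    return request_seconds <= worker_killed_at


# A 30-second workspace creation, with the reload 5 seconds in (assumed value):
print(survives_reload(30, 5, 10))    # ten-second grace period -> False
print(survives_reload(30, 5, 600))   # ten-minute grace period -> True
```

With a ten-second grace period the worker dies at second 15, long before the 30-second job answers. With ten minutes the connection comfortably outlives the job.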
Summary: keep calm and test out all the possibilities
The moral of the story may be this: never trust common knowledge. You never know if that polite Nginx, despite its squeaky clean reputation, isn’t killing processes you need right under your nose. We can almost hear it:
I didn’t know, I couldn’t hear the answer
My mind was blank, I should have known
Hold it back but somehow
There is someone else, another stranger me
Sometimes it appears that there is more detective work than programming in networking (at least when using Kubernetes). So what is our advice when we take off our long coats and hats to sit back down and tackle the backlog on our laptops?
The worst way to go about solving problems is to panic or act without a plan. To keep your sanity, remember to:
- Make a hypothesis and then test it out on a minimal product. Keep the model simple enough to exclude unnecessary variables, but complex enough to provide answers.
- Don’t launch long-running tasks in your web worker app. Consider some async communication such as WebSockets.
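The second piece of advice can be illustrated with a small Python sketch. It is not how our platform is built, and it uses polling rather than WebSockets to keep the example short. The idea is the same: the request that starts the job returns at once with a job ID, and the client asks for the result later, so no single connection has to survive a reload. `JobRunner` and `create_workspace` are hypothetical names invented for this example.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobRunner:
    """Runs slow tasks in background threads and reports their status by ID."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}  # job ID -> Future

    def submit(self, func, *args):
        """Start the task and return immediately with a job ID."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(func, *args)
        return job_id

    def status(self, job_id):
        """Cheap, short-lived call the frontend can repeat as often as needed."""
        future = self._jobs[job_id]
        if not future.done():
            return {"state": "running"}
        if future.exception() is not None:
            return {"state": "failed", "error": str(future.exception())}
        return {"state": "done", "result": future.result()}


def create_workspace(name):
    # Stand-in for the slow provisioning work (up to 30 seconds in the story).
    time.sleep(0.2)
    return f"workspace {name} ready"


runner = JobRunner()
job_id = runner.submit(create_workspace, "demo")  # returns right away
while runner.status(job_id)["state"] == "running":
    time.sleep(0.05)  # each poll is a new, short request
print(runner.status(job_id))
```

In a real service `submit` and `status` would sit behind two HTTP endpoints, and the job table would live somewhere more durable than a dictionary in memory.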
I hope you enjoyed the story. All the quotations I used come from Blind Guardian’s song Another Stranger Me.
If you are interested in our cloud native services, click here: Codilime’s Cloud Native Services.