Implementing a mechanism to preserve the performance and health of a Node.js Fastify application deployed to Kubernetes.
We continue to explore the benefits of the Fastify plugin under-pressure. Previously we used a custom Prometheus metric to build a simple backpressure mechanism in a Fastify application; now we look at integrating our backpressure mechanism into our infrastructure.
The sample code for both posts is available at nearform/backpressure-example. The code for this part is in the part-2 branch. Check out the sample code for this part:
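Assuming the repository is hosted on GitHub under that name, checking it out boils down to:

```sh
# Clone the sample repository and switch to the branch for this part
git clone https://github.com/nearform/backpressure-example.git
cd backpressure-example
git checkout part-2
```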
Requirements
Our infrastructure will consist of a Kubernetes workload deployed via Helm. It also requires Docker to create the image of the application that we’ll deploy to the cluster. If you don’t have a Kubernetes cluster available, a simple way to run a cluster in your local environment is to use Docker Desktop, which includes Kubernetes.
You can follow each individual tool’s setup instructions. Once you’re set up, the following CLI programs should be available in your terminal:
- docker
- kubectl
- helm
If you prefer to follow along without installing the tools, simply read on and look at the accompanying source code.
Kubernetes liveness and readiness probes
In the first part of the article we decided that we would open the circuit when the response times of our application’s /slow endpoint exceeded 4 times the expected response time of 200ms. When this happened, we returned a 503 Service Unavailable HTTP error via under-pressure.
This is a safety mechanism to prevent the application from being overwhelmed with requests, and not something that should happen when our application runs in production.
Instead, we want to make sure that our infrastructure stops serving requests to the application before we reach that point. To do this, we’ll use Kubernetes probes.
We’ll change the application code so that it exposes two additional endpoints, named /liveness and /readiness.
The /liveness endpoint is the simplest one because, based on how it’s expected to work from Kubernetes’ perspective, it should always return a successful response in our case.
The /readiness endpoint is more interesting because, based on its response, Kubernetes decides whether to serve requests to the Pod or not.
Earlier, we configured our safety mechanism to stop accepting requests at a threshold 4 times above the expected response times. Intuitively, we want to configure the readiness probe at a lower threshold — for example, twice the expected response time.
To do so, we change our application’s slow.js module as follows:
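The exact code is in the repository; a minimal sketch of the idea, with module and variable names as assumptions, looks like this:

```js
// slow.js — registers the /slow route and records its duration in the
// shared Prometheus summary (extracted into its own module below).
'use strict'

const { httpRequestDurationSeconds } = require('./metrics')

module.exports = async function slowRoutes (fastify) {
  fastify.get('/slow', async () => {
    const start = process.hrtime.bigint()

    // Simulate roughly 200ms of work; under heavy load the observed
    // duration grows well past that.
    await new Promise(resolve => setTimeout(resolve, 200))

    const seconds = Number(process.hrtime.bigint() - start) / 1e9
    httpRequestDurationSeconds.observe(seconds)

    return { ok: true }
  })
}
```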
We also encapsulate our custom Prometheus metric in its own module, which now reads as:
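Something like the following, using prom-client; the helper name and the sliding-window settings are assumptions:

```js
// metrics.js — the custom metric, now shared between the /slow route and
// the readiness check. The name matches the metric we later expose to
// Kubernetes, and the percentiles include the .999th one.
'use strict'

const client = require('prom-client')

const httpRequestDurationSeconds = new client.Summary({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  percentiles: [0.5, 0.9, 0.999],
  maxAgeSeconds: 60,
  ageBuckets: 5
})

// Returns the current value of the .999th percentile, in seconds.
async function get999thPercentile () {
  const { values } = await httpRequestDurationSeconds.get()
  const entry = values.find(v => v.labels && v.labels.quantile === 0.999)
  return entry ? entry.value : 0
}

module.exports = { httpRequestDurationSeconds, get999thPercentile }
```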
Then, in the root of our application we create the two endpoints that will be used by Kubernetes probes:
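A sketch of the two routes, reusing the helper from the metric module (file and helper names are assumptions):

```js
// probes.js — liveness and readiness endpoints used by Kubernetes.
'use strict'

const { get999thPercentile } = require('./metrics')

// Twice the expected 200ms response time, expressed in seconds.
const READINESS_THRESHOLD_SECONDS = 0.4

module.exports = async function probes (fastify) {
  // Liveness: nothing in this example leaves the process permanently
  // unhealthy, so we always report success.
  fastify.get('/liveness', async () => ({ status: 'ok' }))

  // Readiness: fail when the .999th percentile of response times goes
  // above the threshold, so Kubernetes stops routing traffic to this Pod
  // well before the 800ms circuit breaker opens.
  fastify.get('/readiness', async (request, reply) => {
    const p999 = await get999thPercentile()
    if (p999 > READINESS_THRESHOLD_SECONDS) {
      reply.code(503)
      return { status: 'not ready' }
    }
    return { status: 'ok' }
  })
}
```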
The readiness probe is configured to respond with an error before the circuit opens, so we are relying on the infrastructure to stop serving requests when the probe delivers such a response.
The circuit breaker is a safety net in case the infrastructure doesn’t respond quickly enough.
The liveness probe is simpler because it will always return a successful response. Our example has no errors from which the application cannot recover. A more realistic implementation of the liveness probe would take into account additional factors, such as a database connection that cannot be established, which would cause the application to be permanently unhealthy. In that case, the liveness endpoint should return an error.
Deploying the application to Kubernetes
The first thing we need to do to run our application in the Kubernetes cluster is create an image of the application using Docker:
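The image tag below is an assumption; it must match the image referenced by the Helm chart:

```sh
docker build -t backpressure-example:latest .
```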
Then we install all the Helm charts needed in our example, which include the application and the other services we’ll look at later.
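The release name and chart path below are assumptions; use the ones from the repository:

```sh
helm install backpressure ./charts/backpressure-example
```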
Finally, we can check the local port on which the application is running by executing:
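```sh
# List the services and the NodePort they are bound to
kubectl get services
```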
The command above gives an output similar to:
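(Service name and IPs are placeholders; the NodePort, 31470 here, is the part we care about.)

```
NAME         TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
app          NodePort    10.96.45.12   <none>        3000:31470/TCP   2m
kubernetes   ClusterIP   10.96.0.1     <none>        443/TCP          1h
```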
We can now access the application at http://localhost:31470 (the port will most likely be different on your machine).
Triggering the readiness probe
The configuration for the Kubernetes deployment can be found in the source code repository accompanying this article. The relevant section of the configuration file is:
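It looks roughly like this; the container port and timings are assumptions, while the probe paths are the ones we just implemented:

```yaml
livenessProbe:
  httpGet:
    path: /liveness
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readiness
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
```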
This configures the liveness and the readiness probes. We’re now going to trigger the readiness probe by putting the application under load with autocannon, as we did in the first part of this article.
If you haven’t used autocannon before, you can install it via npm:
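```sh
npm install -g autocannon
```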
Before hitting the application, let’s keep an eye on the status of the Kubernetes deployment so we can check when the single Pod we currently have turns from ready to non-ready due to the readiness probe:
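```sh
# Watch the Deployment; the READY column updates when the probe state changes
kubectl get deployment --watch
```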
This will show an output similar to:
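(The Deployment name is a placeholder; yours will match the Helm release.)

```
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
app    1/1     1            1           5m
```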
The above output means that there is 1 Pod ready out of a total of 1, which is what we expect because only one is deployed.
In another terminal window, we can now run autocannon in the usual way, making sure to use the HTTP port the service is bound to on our host machine:
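For example (the connection count and duration are just a starting point; the port must be the NodePort reported earlier):

```sh
autocannon -c 100 -d 30 http://localhost:31470/slow
```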
To make the /readiness endpoint return an error status code, we need to put enough load on the application to make the .999th percentile of the requests last at least 400ms, which is the threshold we configured.
You can check how long requests are taking by hitting the /metrics endpoint in your browser and adjust the autocannon options accordingly.
When the threshold is reached, Kubernetes will detect that the application is reporting that it’s not ready to receive more requests and will remove the Pod from the load balancer. The output of the earlier kubectl get deployment command will show something like this:
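```
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
app    0/1     1            0           8m
```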
When the autocannon run completes, the application will reflect the shorter response times in its metrics, which will cause Kubernetes to detect a successful readiness probe and put the Pod back into the load balancer, with the deployment reporting 1/1 ready Pods again.
Up to this point we’ve achieved the ability to stop overloading the application by means of an internal circuit breaker and via Kubernetes’ readiness probe. The next step is to automatically scale the application based on load.
Exposing custom metrics
To allow Kubernetes to scale our application, we will need to expose custom metrics that can be used by Kubernetes’ Horizontal Pod Autoscaler (HPA).
By default, the autoscaler can use a range of metrics built into Kubernetes, and we could use those metrics for autoscaling. In our example, we want to use a custom metric. Therefore, we need to make sure we expose that metric to Kubernetes and make it available to the autoscaler.
We achieve that by using Prometheus Adapter, which is already running inside our Helm deployment.
The relevant section of the configuration is:
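A sketch of such a rule, expressed as prometheus-adapter Helm values; the label names and the exact query depend on how Prometheus scrapes the application, so treat this as an outline rather than the actual configuration:

```yaml
rules:
  custom:
    - seriesQuery: 'http_request_duration_seconds{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: { resource: namespace }
          pod: { resource: pod }
      name:
        matches: '^(.*)$'
        as: 'http_request_duration_seconds'
      # Expose the .999th quantile of the application summary as the metric value
      metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>,quantile="0.999"}) by (<<.GroupBy>>)'
```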
With this configuration we can then query the metric:
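For example, through the custom metrics API (assuming the application runs in the default namespace):

```sh
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_request_duration_seconds"
```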
This will provide an output similar to:
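(Trimmed, with a placeholder Pod name and timestamp.)

```json
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "default",
        "name": "app-5d8f9c7b9d-abcde",
        "apiVersion": "/v1"
      },
      "metricName": "http_request_duration_seconds",
      "timestamp": "2021-01-01T12:00:00Z",
      "value": "0"
    }
  ]
}
```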
The output above shows a value of 0 for http_request_duration_seconds, which is the name of the metric we expose and which maps to the .999th percentile reported by our custom metric.
If you try hitting the /slow endpoint manually or with autocannon, you will see the value of the metric reflect the value reported by the /metrics endpoint. The values will not be perfectly in sync: the Kubernetes metric is updated with some delay, because it is polled and propagated from the application to Prometheus and then from Prometheus to Kubernetes.
Autoscaling
The last step in getting our infrastructure to handle the increasing load on the application properly is to enable automatic scaling via Kubernetes’ Horizontal Pod Autoscaler.
This requires a simple change in our Helm chart, which deploys a resource of type HorizontalPodAutoscaler. We will include an additional chart in our Helm deployment; this is available in the part-2-hpa branch.
The autoscaler needs metrics upon which to carry out its autoscaling logic. In our case, this will be our custom metric:
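A sketch of the resource (resource names are assumptions; on newer clusters the apiVersion is autoscaling/v2):

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_request_duration_seconds
        target:
          type: AverageValue
          averageValue: 300m   # 0.3 seconds, i.e. 300ms
```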
We have configured a minimum of 1 and a maximum of 4 replicas for the Pods running our application and a target value of 300ms for the custom metric exposed to Kubernetes via the Prometheus Adapter.
We can test the behaviour of the autoscaler by upgrading our deployment with:
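For example, after switching to the part-2-hpa branch (the release name and chart path are the same assumptions as before):

```sh
git checkout part-2-hpa
helm upgrade backpressure ./charts/backpressure-example
```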
We can now run autocannon against the application and, by watching the value of the /metrics endpoint, increase the load so that the response times go above 300ms.
If we keep an eye on the deployment...
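```sh
# Same command as before: watch the READY and AVAILABLE columns
kubectl get deployment --watch
```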
...we will see that, when the metric value exceeds the threshold, the autoscaler increases the number of Pods, up to the configured maximum of 4.
To confirm this, we can look at the output of:
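```sh
kubectl describe hpa
```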
This will tell us the reason why the autoscaler increased the number of replicas:
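The Events section reports something along these lines (ages and sizes will differ; this shows the shape of the event, not a verbatim log):

```
Events:
  Type    Reason             Age   From                       Message
  ----    ------             ----  ----                       -------
  Normal  SuccessfulRescale  30s   horizontal-pod-autoscaler  New size: 4; reason: pods metric http_request_duration_seconds above target
```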
Putting it all together
Here is a summary of how our application will behave using the circuit breaker, the readiness probe and the autoscaler:
- When the average value across Pods of the .999th percentile of the response time is above 300ms, the autoscaler will increase the replicas up to a maximum of 4.
- When the .999th percentile of the response times of an individual Pod is above 400ms, the Pod will fail the readiness probe and will be taken out of the load balancer by Kubernetes. It will be added back to the load balancer when the response times decrease below the threshold.
- When the .999th percentile of the response times of an individual Pod is above 800ms, the application’s circuit breaker will open as a safety mechanism, and the application will reject further requests until the circuit is closed. This happens when the response times fall back below the threshold and is handled by under-pressure.
Though seemingly arbitrary, the threshold values are chosen so that:
- The autoscaler kicks in first (300ms threshold).
- If for any reason a Pod keeps receiving more requests than it can handle, it will fail the readiness probe, causing Kubernetes to stop serving it requests (400ms) in order to preserve the responsiveness of the Pod for the outstanding requests.
- If for any reason a Pod keeps being served requests despite failing the readiness probe, it will trigger the circuit breaker which will cause further requests to be rejected (800ms).
In this pair of articles, we’ve outlined how to create a complex mechanism capable of preserving the performance and health of a Node.js Fastify application deployed to Kubernetes.
The mechanism consisted of an in-application circuit breaker implemented via under-pressure, a readiness probe handled by Kubernetes and an autoscaling algorithm handled by Kubernetes HPA.
We used a custom metric calculated and exposed via Prometheus to define whether the application was healthy and responsive.
This allowed us to scale our application automatically when the response times increased, preserve the performance of the application by temporarily excluding it from the load balancer when response times were higher than normal and stop responding to requests when doing so would compromise the health of the application.