Monitoring Kubernetes HPA Utilization

In the past few weeks, I was working on migrating a legacy micro-service to the Kubernetes platform. The migration process was relatively simple – mainly porting the code from .NET Framework 4.5 to .NET Core 2.2. After making sure the service was deployed and working as expected, I started to gradually move production traffic to the new instance. The new service handled the traffic well, and I was happy – it looked like this task was about to be complete!

After a few days of gradual rollout, I felt confident enough to move all the traffic to the new service. And then it hit me: will the new service be able to handle the full production load? I mean, I had configured a Horizontal Pod Autoscaler (HPA) for this service – but is that enough? Apparently not. But before I explain why, let’s do a quick recap on HPA.
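For context, here is a minimal sketch of what such an HPA might look like, assuming the autoscaling/v2 API – the deployment name, replica bounds, and CPU target below are hypothetical, not the values from the actual service:

```yaml
# Hypothetical HPA: scale the deployment between 2 and 10 replicas,
# aiming for 70% average CPU utilization across the pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: legacy-service-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: legacy-service            # hypothetical deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```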

Continue reading “Monitoring Kubernetes HPA Utilization”

Keeping Prometheus in Shape

Prometheus is a great monitoring tool. It can easily scrape all the services in your cluster dynamically, without any static configuration. For me, the move from manual metrics shipping to Prometheus was magical. But, like any other technology we’re using, Prometheus needs special care and love. If not handled properly, it can easily get out of shape. Why does it happen? And how can we keep it in shape? Let’s first do a quick recap of how Prometheus works.
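As an illustration of that dynamic scraping, here is a minimal sketch of a scrape job that discovers pods through the Kubernetes API instead of listing targets statically – the job name and the annotation-based filtering are assumptions for the example (a common convention, not something built into Prometheus):

```yaml
# Hypothetical scrape job: discover pods via the Kubernetes API and keep
# only those annotated with prometheus.io/scrape: "true".
scrape_configs:
  - job_name: kubernetes-pods       # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```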

Prometheus Monitoring Model

Prometheus works differently from most other monitoring systems – it uses a pull model rather than a push model. The push model is simple: just push metrics from your code directly to the monitoring system, for example – Graphite.

The pull model is fundamentally different – the service exposes metrics on a specific endpoint, and Prometheus scrapes them periodically (the scrape interval – see here for how to configure it). While there are good reasons to prefer the pull model over push, it has its own challenges: each metric scrape operation takes time; what happens if a scrape takes longer than the scrape interval?

For example, let’s say Prometheus is configured to scrape its targets (that’s what services are called in Prometheus terminology) every 20 seconds; what will happen if one scrape takes more than 20 seconds? The result is out-of-order metrics: instead of having a data point every 20 seconds, you get one whenever the scrape completes. What can we do?
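As a minimal sketch of how that interval is configured, assuming a plain prometheus.yml – the 20-second value mirrors the example above, and scrape_timeout is the setting that caps how long a single scrape may run:

```yaml
# Hypothetical global settings: scrape every 20 seconds, and abort any
# single scrape that runs longer than 15 seconds.
global:
  scrape_interval: 20s
  scrape_timeout: 15s   # must not exceed scrape_interval
```

Both settings can also be overridden per scrape job.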

Continue reading “Keeping Prometheus in Shape”