Keeping Prometheus in Shape

Prometheus is a great monitoring tool. It can easily scrape all the services in your cluster dynamically, without any static configuration. For me, the move from manual metrics shipping to Prometheus was magical. But, like any other technology we’re using, Prometheus needs special care and love. If not handled properly, it can easily get out of shape. Why does that happen? And how can we keep it in shape? Let’s first do a quick recap of how Prometheus works.

Prometheus Monitoring Model

Prometheus works differently from other monitoring systems: it uses a pull model rather than a push model. The push model is simple: just push metrics from your code directly to the monitoring system, for example Graphite.

The pull model is fundamentally different: the service exposes metrics on a specific endpoint, and Prometheus scrapes them periodically (the scrape interval, set with scrape_interval in the Prometheus configuration). While there are reasons to prefer the pull model over push, it has its own challenges. Each scrape operation takes time; what happens if a scrape takes longer than the scrape interval?

For example, let’s say Prometheus is configured to scrape its targets (that’s what services are called in Prometheus language) once every 20 seconds; what will happen if one scrape takes more than 20 seconds? The result is irregular metrics: instead of a data point every 20 seconds, you get data points at uneven intervals, or gaps when a scrape times out. What can we do?
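To make the 20-second example concrete, here is a minimal sketch of a prometheus.yml. The job name and target address are hypothetical placeholders; scrape_interval and scrape_timeout are standard Prometheus settings. Prometheus rejects a configuration where scrape_timeout exceeds scrape_interval, so a scrape that runs too long is cut off and recorded as failed.

```yaml
global:
  scrape_interval: 20s   # how often each target is scraped
  scrape_timeout: 15s    # must not exceed scrape_interval

scrape_configs:
  - job_name: my-service              # hypothetical job name
    static_configs:
      - targets: ["my-service:8080"]  # hypothetical target address
```

Both settings can also be overridden per job inside a scrape_configs entry.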

Monitoring the Scrape Duration

The first step is visibility into the scrape duration, using a very simple query:

[Figure: the scrape duration graph]
avg(scrape_duration_seconds)

The next step is setting an alert, for example with a Prometheus alerting rule (Alertmanager then routes the notification):

alert: ScrapeDuration
expr: max(scrape_duration_seconds) > 15
for: 5m
labels:
  severity: high
annotations:
  summary: "Prometheus Scrape Duration is getting near the limit"

Adjust the threshold based on the scrape_interval configuration: the threshold should be lower than this value. E.g., if the scrape interval is 20 seconds, raise an alert when the scrape duration reaches 15 seconds. Now that we have an alert, it’s time to ask: what can we do when the alert fires?
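For context, an alert definition like the one above lives inside a rule group in a rules file that Prometheus loads through rule_files. The following is a sketch of such a file; the group name is a hypothetical placeholder, and the 15-second threshold assumes a 20-second scrape interval.

```yaml
groups:
  - name: prometheus-health          # hypothetical group name
    rules:
      - alert: ScrapeDuration
        # 15s threshold assumes scrape_interval is 20s
        expr: max(scrape_duration_seconds) > 15
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Prometheus Scrape Duration is getting near the limit"
```

A file like this can be validated with promtool check rules before Prometheus loads it.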

Taking Action

When the scrape duration gets near the limit, we need to take action. What can we do? One option is to increase the scrape interval. This is a valid solution, but it has one downside: losing resolution. The scrape interval defines the metric resolution, so you cannot get a finer resolution than the scrape interval. For example, if the scrape interval is one minute, you will have one data point per minute. So increasing the scrape interval is possible, but only up to a point; eventually you can’t increase it any more.

An alternative is to investigate what makes the scrape duration increase. One possible reason is a resource limit, for example not enough CPU. You can use queries like rate(process_cpu_seconds_total{job="prometheus"}[5m]) * 100 for CPU usage (as a percentage of one core) or process_resident_memory_bytes{job="prometheus"} for memory usage; the latter is a gauge, so it is queried directly rather than through rate(). Use these queries to check for spikes, and consider adding resources if needed.

Another reason could be a single scrape target with a lot of time series. For example, an API that uses a user identifier as a label, which has high cardinality. This can be the culprit, as it increases the load on Prometheus. The solution is to drop the problematic metrics, or modify the code when possible. This solves the immediate issue but not the root cause: a single service can do real damage to Prometheus, and this is dangerous. All it takes is one unaware developer (and I have made similar mistakes in the past) who decides to use a user identifier as a label (it makes sense!). How can we protect Prometheus from such a mistake?
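One way to drop a problematic metric on the Prometheus side is metric relabeling. The sketch below is not taken from the original setup: the job name, target address, and metric name are hypothetical, while metric_relabel_configs and the drop action are standard Prometheus configuration.

```yaml
scrape_configs:
  - job_name: my-api                          # hypothetical job name
    static_configs:
      - targets: ["my-api:8080"]              # hypothetical target address
    metric_relabel_configs:
      # Drop every series of the offending metric before it is stored.
      - source_labels: [__name__]
        regex: api_requests_per_user_total    # hypothetical metric name
        action: drop
```

Metric relabeling runs after the scrape, so the target still exposes these series and Prometheus still has to parse them. Fixing the code remains the better long-term option.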

Limiting the Scrape Size

Prometheus allows us to limit the maximum scrape size. If one of the APIs exposes too many samples and breaches this limit, Prometheus treats the whole scrape as failed and stores none of its samples. This protects Prometheus from such cases, which is exactly what we wanted! It’s relatively simple to add this limit: just add sample_limit to the job definition (it is documented in the scrape_config section of the Prometheus configuration reference).
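In the Prometheus configuration reference, the per-scrape limit is the sample_limit field of a scrape job. As a minimal sketch, with a hypothetical job name and target and an illustrative limit value:

```yaml
scrape_configs:
  - job_name: my-api               # hypothetical job name
    sample_limit: 1000             # illustrative value; 0 (the default) means no limit
    static_configs:
      - targets: ["my-api:8080"]   # hypothetical target address
```

The limit is checked after metric relabeling, so series dropped by metric_relabel_configs do not count toward it.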

Having a limit is great, but not all namespaces are equal. For example, namespaces like kube-system or kube-public run cluster components such as the Kubernetes API server, kube-state-metrics, and others. In these namespaces, we either don’t want to limit the scrape (because it’s too risky to lose metrics), or we need to set higher thresholds. It was a bit tricky, but with community help I was able to solve it.
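The post does not show the solution that was eventually used. One possible approach, sketched here purely as an assumption, is to split pod discovery into two jobs and route namespaces to them by relabeling on __meta_kubernetes_namespace. The job names and limit value are hypothetical.

```yaml
scrape_configs:
  # System namespaces: no sample_limit, so these scrapes are never rejected for size.
  - job_name: kubernetes-pods-system        # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        regex: kube-system|kube-public
        action: keep

  # Everything else: scrapes with more than 1000 samples are rejected.
  - job_name: kubernetes-pods               # hypothetical job name
    sample_limit: 1000                      # illustrative value
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        regex: kube-system|kube-public
        action: drop
```

The trade-off of this design is duplicated job definitions: any other relabeling rules have to be kept in sync across both jobs.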

Monitoring Scrape Size

So now we have an alert on scrape duration, and we have limited the maximum scrape size. Are we good? Well, almost. One final question is left: when the limit is breached, every scrape of the target fails until the API exposes fewer metrics. How can the team responsible for this API know? And more importantly, how can they get an alert before the limit is breached?

Alerting on the scrape size is easy with the scrape_samples_scraped metric. With it, teams can monitor how many data points are exposed by the services they own, selecting them by labels like app or kubernetes_namespace. Consider, for example, the following alert definition:

alert: TeamAwesomeScrapeSampleSize
expr: max(scrape_samples_scraped{kubernetes_namespace="awesome"}) > 1000
for: 5m
labels:
  severity: high
annotations:
  summary: "Oh No! One of our services is exposing too many metrics!"

Now team “awesome” will get an alert if one of their services is about to breach the limit. Other interesting metrics to consider:

  • The up metric’s value will be 0 when the limit is breached, so take that into consideration if you’re using up to monitor service health.
  • The prometheus_target_scrapes_exceeded_sample_limit_total metric counts the scrapes that were rejected for exceeding the sample limit, so it can be used for a generic alert.
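A generic alert on that counter could look like the sketch below. The group name, alert name, and 10-minute window are assumptions for illustration. The metric is exposed by Prometheus itself, so this only works if Prometheus scrapes its own /metrics endpoint.

```yaml
groups:
  - name: scrape-limits                      # hypothetical group name
    rules:
      - alert: ScrapeSampleLimitExceeded     # hypothetical alert name
        # Fires if any scrape was rejected for size in the last 10 minutes.
        expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 0
        labels:
          severity: high
        annotations:
          summary: "At least one scrape was rejected for exceeding its sample limit"
```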

Wrapping Up

To keep Prometheus in shape, you need to:

  • Use scrape_duration_seconds to monitor scrape duration
  • Use sample_limit to reject scrapes from problematic targets
  • Use scrape_samples_scraped to monitor the size of metrics exposed by a specific target

These are fundamental for keeping a single Prometheus instance healthy while having many different services, owned by different teams, running on our cluster.
