Today I want to talk about Thanos, a hero that will help us with an impossible mission: a production-grade Prometheus deployment. Prometheus is an amazing tool that can do a lot of things, from metrics to alerting. But there is one problem that is a bit harder to solve: longer-term storage for Prometheus metrics. After all, having metrics for only a day or two is not that useful. This is where Thanos fits in.
At Soluto, where I work, we have multiple production Kubernetes clusters. On each of those clusters we have a Prometheus instance that is responsible for collecting metrics and monitoring everything on that cluster. This raises two interesting problems:
Long-term persistent storage for Prometheus metrics
Centralized instance, where you can view metrics from each one of the clusters.
Our first solution was simple but powerful. Prometheus can scrape metrics from another Prometheus instance, a mechanism called “Federation”:
As you can see, we ended up with one central Prometheus instance that scrapes each of the Prometheus instances running on our production clusters. Users could access it to view metrics directly in the UI or through Grafana, and our monitoring system (Icinga2) could query it and raise alerts.
This setup also helped with persistence: all we had to do was attach a persistent disk to the VM running the centralized instance. Prometheus would take care of the rest. Now we had a solution for both issues. A solution that worked, but was far from perfect.
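For reference, a federation scrape is an ordinary HTTP GET against the /federate endpoint of the source Prometheus, with one or more match[] selectors choosing which series to pull. The sketch below only builds such a URL. The host name and the selector are made up for illustration and are not taken from our setup.

```python
from typing import Iterable
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: Iterable[str]) -> str:
    """Build the URL a central Prometheus would scrape on a cluster instance."""
    # Every selector is sent as a repeated, URL-encoded match[] parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical cluster instance and selector, for illustration only.
url = federate_url(
    "http://prometheus.cluster-a.example.com:9090",
    ['{job="kubernetes-pods"}'],
)
print(url)
```

In a real deployment this lives in the central instance's scrape configuration rather than in code, but the request that goes over the wire has this shape.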
Scrape Duration Issues
In this setup, the centralized Prometheus instance scrapes each of the instances running on the clusters every minute. This works fine as long as the time it takes to scrape an instance (the scrape duration) is less than one minute, but what happens when it gets longer? We start to lose metrics. So we had to monitor the Prometheus scrape duration and take action when it became too long.
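The check we needed boils down to comparing each target's scrape duration with the scrape interval. Prometheus records the duration per target as the scrape_duration_seconds metric. Here is a minimal sketch of that comparison, assuming the durations have already been fetched into a dictionary. The 80% warning threshold and the cluster names are arbitrary choices for the example.

```python
def slow_targets(durations, interval_seconds=60.0, warn_ratio=0.8):
    """Return the targets whose scrape duration is above warn_ratio of the interval."""
    limit = interval_seconds * warn_ratio
    return sorted(target for target, seconds in durations.items() if seconds > limit)


# Hypothetical scrape durations in seconds, keyed by federated cluster.
sample = {"cluster-a": 12.5, "cluster-b": 51.0, "cluster-c": 63.2}
print(slow_targets(sample))  # cluster-b is close to the limit, cluster-c is already over it
```

Warning before the duration reaches the full interval leaves some time to react before metrics actually go missing.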
One of the reasons for an increase in scrape duration is the number of metrics to scrape. So we had to put a lot of effort into limiting the number of metrics each service could expose, or just look for the biggest metrics and try to remove them. This required ongoing maintenance work.
This leads to the second issue: the need to maintain a Prometheus instance running outside the cluster. It wasn’t very hard, but it was a lot harder than maintaining the instances running on the cluster.
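Looking for the biggest metrics can be done by counting how many series each metric name contributes in a service's /metrics output. The following is a rough sketch that works on the Prometheus text exposition format. The sample payload is invented, and the parsing is deliberately naive: it takes everything before the first "{" or space as the metric name.

```python
from collections import Counter


def series_per_metric(exposition_text):
    """Count the sample lines per metric name in Prometheus text-format output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or the first space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Invented sample payload, for illustration only.
SAMPLE = """\
# HELP http_requests_total Total requests.
# TYPE http_requests_total counter
http_requests_total{code="200",path="/a"} 10
http_requests_total{code="500",path="/a"} 1
http_requests_total{code="200",path="/b"} 7
up 1
"""

top = series_per_metric(SAMPLE).most_common(1)
print(top)
```

Note that histogram suffixes such as _bucket are counted as separate names here, which is usually what you want when hunting for high-cardinality metrics.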
So we started to look for alternatives. There are a lot of them out there, but the most popular are Thanos and Cortex. Both solve the same problem, a persistence layer for multi-cluster Prometheus instances, but Thanos seemed a bit simpler to use, so I decided to start testing it.
Thanos is indeed simpler than Cortex, but it is still a complex system. For the simplest setup, you need three new components:
Thanos sidecar, running alongside each Prometheus instance. The sidecar is responsible for uploading Prometheus data to a bucket (in our case, AWS S3) every 2 hours. It is also responsible for serving real-time metrics that are not yet in the bucket.
Thanos store, responsible for serving metrics from the bucket.
Thanos query/querier, responsible for handling the Prometheus query API. It also has a UI that is very similar to the Prometheus UI, and it can do more, such as deduplication (in case you run more than one Prometheus instance for high availability). The querier queries both the store and the sidecar (based on the query time range) and returns the relevant metrics.
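To make the "based on the query time" part more concrete, here is a toy model of how a querier could decide which components to ask. It assumes each component knows the time range it can serve and picks the ones that overlap with the query. This is an illustration of the idea, not Thanos's actual code, and the Component class, the names and the timestamps are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    min_time: int  # oldest timestamp served, in seconds
    max_time: int  # newest timestamp served, in seconds


def components_for(query_start, query_end, components):
    """Return the names of the components whose time range overlaps the query range."""
    return [
        c.name
        for c in components
        if c.min_time <= query_end and c.max_time >= query_start
    ]


HOUR = 3600
NOW = 1_000_000  # arbitrary "current time" for the example

# The sidecar serves the last 2 hours, the store serves everything already uploaded.
fleet = [
    Component("sidecar", NOW - 2 * HOUR, NOW),
    Component("store", 0, NOW - 2 * HOUR),
]

print(components_for(NOW - 1 * HOUR, NOW, fleet))  # recent data only
print(components_for(NOW - 6 * HOUR, NOW, fleet))  # spans both components
```

A query over the last hour only needs the sidecar, while a query over the last six hours has to be answered by both and merged.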
That is a lot of moving parts to deploy and monitor. It gets a bit simpler with kube-thanos and the Thanos mixin, which make it easier to deploy all the required components and monitor them. But this solves only one part of the problem: we now have persistence for Prometheus, but what about a centralized instance? The setup described here works only for a single cluster!
Multi-cluster Thanos Setup
A multi-cluster setup is just a bit more complex. Querier instances can be stacked, meaning a querier can query other querier instances. So we can have another querier instance that queries all the querier instances running on all the clusters:
Why do we need another querier?
Service discovery. Thanos querier can use DNS service discovery to find the relevant targets. So our multi-cluster querier can use a single DNS SRV record to discover all the querier instances running on all the clusters and query them for metrics. If we had only one querier, we would have to figure out how to do discovery in a way that keeps a querier instance from querying itself.
Security. Because we are going outside the cluster, authentication is important, as we can’t simply solve it with network policies (we can, but it’s a bit harder). Thanos querier supports mTLS authentication, but it enforces mTLS for all the stores it queries. So, if we had only one querier instance, we would also have to set up mTLS between Thanos store and the querier, and between Thanos sidecar and the querier. A lot of headaches that we don’t really need.
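Put together, the multi-cluster querier needs two things on its command line: an SRV-based store address and client TLS settings. The sketch below assembles such an argument list. The flag names are the ones I remember from the thanos query documentation of that period (newer releases replace --store with --endpoint), so treat them as an assumption and check thanos query --help for your version. The SRV record name and the certificate paths are hypothetical.

```python
def querier_args(srv_record, cert_file, key_file, ca_file):
    """Assemble arguments for a multi-cluster `thanos query` process (flag names assumed)."""
    return [
        "query",
        # The dnssrv+ prefix asks the querier to resolve the SRV record into store addresses.
        f"--store=dnssrv+{srv_record}",
        # Client-side TLS settings, applied to every store this querier talks to.
        "--grpc-client-tls-secure",
        f"--grpc-client-tls-cert={cert_file}",
        f"--grpc-client-tls-key={key_file}",
        f"--grpc-client-tls-ca={ca_file}",
    ]


# Hypothetical record name and file paths, for illustration only.
args = querier_args(
    "_grpc._tcp.thanos-query.example.com",
    "/certs/client.crt",
    "/certs/client.key",
    "/certs/ca.crt",
)
print(" ".join(["thanos"] + args))
```

Because the TLS settings apply to every store the querier talks to, keeping them on a separate multi-cluster querier leaves the in-cluster store and sidecar connections untouched.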
The multi-cluster querier instance can run on all the clusters, as it’s stateless. Now we can have a single Route 53 record that points to the multi-cluster querier (using whichever of the supported routing policies works for you; we chose failover). This endpoint is a full replacement for the old centralized Prometheus instance and can be used from Grafana or Icinga. And, because the UI is very similar to Prometheus, devs can use it without noticing.
Keeping Old Data
The last thing we need to solve is what to do with the historical data that exists on our old centralized Prometheus instance. One solution is to keep running the old Prometheus instance alongside Thanos until Thanos has accumulated enough data and we can get rid of it, but this takes too long. If you’re feeling brave enough, Thanos sidecar can upload the historical data to the bucket, which solves the issue for you. Just run it, upload all the data and kill the old instance. One thing to note: this feature is still experimental.
An alternative is to run the sidecar on the centralized Prometheus instance in “read-only” mode, without bucket access. We can also connect this sidecar to the multi-cluster querier instance, and now all the historical data is available there too. The querier will fan out the query to all the queriers in the clusters and to the sidecar, deduplicate, and return the results.
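The deduplication step can be pictured as merging series that are identical except for a label marking which replica produced them. The sketch below is a heavily simplified illustration of that idea. Thanos's real deduplication is more involved, and the label name "replica" as well as the sample data are assumptions for the example.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only in replica_label; one value is kept per timestamp."""
    merged = {}
    for labels, samples in series_list:
        # Identity of a series once the replica label is ignored.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        points = merged.setdefault(key, {})
        for timestamp, value in samples:
            # The first value seen for a timestamp wins; later duplicates are dropped.
            points.setdefault(timestamp, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# The same series as returned by two hypothetical sources with overlapping time ranges.
responses = [
    ({"job": "api", "cluster": "a", "replica": "0"}, [(1, 1.0), (2, 2.0)]),
    ({"job": "api", "cluster": "a", "replica": "1"}, [(2, 2.0), (3, 3.0)]),
]

result = deduplicate(responses)
print(result)
```

The two overlapping responses collapse into a single series covering timestamps 1 through 3, which is what the user of the central endpoint expects to see.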
Now we have a full production Thanos deployment that can replace our old centralized Prometheus instance. Scrape duration is no longer a problem (finally!) and we can focus on other things. Setting up Thanos was not easy – but I hope that sharing my journey will help you set it up. Feel free to share your experience! Did you do something similar?