Using GitHub PR Flow with Terraform

In the past week, I was working on enabling Google Kubernetes Engine Workload Identity on our clusters. Workload Identity is a solution for connecting a Kubernetes Service Account to a Google Cloud Service Account and, by doing so, granting specific permissions to a specific workload on the cluster. While enabling Workload Identity is relatively simple, the hard question is how to enable it at scale: how do we let devs use it easily and securely?

This is where Terraform comes in handy: using it, I can easily build an abstraction (a module) that developers can use to create all the resources required for Workload Identity. Writing this module allows me to carefully choose what to expose, building a paved road for developers. Finally, there are very interesting developments in the area of SAST for Terraform (see this talk, as one example), making it an even more interesting tool. So, I decided to try Terraform for this.

Writing the module was pretty easy (there are even public modules, like this one), but how will devs use it? This is where the GitHub PR flow comes in handy: using the pull request (PR) mechanism, we let everyone request permissions (self-service) while ensuring those changes go through a defined process of review and testing before they are applied. Let’s see how we can build the same flow for Terraform!
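To make the Workload Identity link described above a bit more concrete, here is a minimal Python sketch of the two pieces that tie a Kubernetes Service Account (KSA) to a Google Service Account (GSA): the IAM member that gets the `roles/iam.workloadIdentityUser` role on the GSA, and the annotation placed on the KSA. The function name and the project, namespace and account names are hypothetical; a Terraform module would produce the equivalent resources.

```python
def workload_identity_binding(project_id, namespace, ksa_name, gsa_name):
    """Build the identifiers that link a KSA to a GSA (illustrative helper)."""
    gsa_email = f"{gsa_name}@{project_id}.iam.gserviceaccount.com"
    return {
        # IAM member representing the Kubernetes Service Account
        "iam_member": f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]",
        # Role granted to that member on the Google Service Account
        "iam_role": "roles/iam.workloadIdentityUser",
        # Annotation added to the Kubernetes Service Account object
        "ksa_annotation": {"iam.gke.io/gcp-service-account": gsa_email},
    }


# Hypothetical names, for illustration only
binding = workload_identity_binding(
    "my-project", "payments", "payments-ksa", "payments-gsa"
)
```

Wrapping these details in one function mirrors the idea of a module: developers supply a few names, and the abstraction decides what is exposed.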

AppSec Learning SRE principles: Metrics and Measurements

The field of SRE (site reliability engineering) is relatively mature, and there is a lot written about it. In particular, Google released a really good book discussing how SRE works at Google. What can we, AppSec engineers (ASE?), learn from SRE principles to improve our field? Today I want to focus on one aspect: metrics and measurements – something that is my personal focus right now. What metrics do SREs define and measure, and can we define something similar?

Extending Kubernetes with CRDs – The Hard Way

This is a post I planned to write a while ago, when I worked on the Kamus CRD feature. CRD, or Custom Resource Definition, is a way to extend Kubernetes with a new resource. In my case, I wanted to add a new resource, KamusSecret, which is very similar to a regular Secret, just encrypted. Let’s see how this can be easily done – using my beloved language, C# 🙂

A Production Thanos Deployment

Today I want to talk about Thanos, a hero that will help us with an impossible mission: a production-grade Prometheus deployment. Prometheus is an amazing tool that can do a lot of things – from metrics to alerting. But there is one problem that is a bit harder to solve: longer-term storage for Prometheus metrics. After all, having metrics for only a day or two is not that useful. And this is where Thanos fits in.

Solving Trust Issues at Scale

Microservices are social constructs: they can’t function without talking with other services. This also raises an interesting question: do we trust all of our microservices? Not all microservices are the same: some are more sensitive – for example, services that handle personal user data or payment information. Others are user-facing and therefore riskier. We shouldn’t treat all services as equal. A robust mechanism that describes who can talk to whom is required. Let’s see how!

Istio in Production?

Istio is one of the most popular service meshes. It can help solve many issues that surface when running a lot of microservices – things like authentication, authorization, observability and traffic routing. It all sounds really promising, so we decided to give it a try at Soluto. During the process of deploying it on an existing cluster and enabling it on existing workloads, I faced a lot of interesting issues. Let me share some of them with you.

Do we really need threat modeling?

I’m a huge fan of threat modeling. It’s a very powerful tool that can find a lot of security issues. If you’re not familiar with it, check out my earlier post on the subject. For the past few years, I have been struggling with one simple question: when should we conduct threat modeling? After all, threat modeling has a price – it takes time to conduct and usually involves a few people. We can’t conduct a full threat model for every feature – we need a way to identify the “interesting” features that require a threat model.

One very interesting solution to this hard problem was proposed by Izar Tarandach in this talk. In short, he proposes tagging features as “threat model worthy” and, once in a while, going over all the features with this tag and reviewing them. This is a really interesting approach, and I highly recommend watching the entire talk. However, in my experience, it’s not a silver bullet for this problem, and I want to propose an alternative approach.

Debugging iOS apps with Zaproxy

The other day I was debugging a really nasty bug that happened only in our iOS app. I was really frustrated because I couldn’t figure out why it happened. Everything looked good when debugging the iOS code, but for some reason the server failed to deserialize the request body. I freaked out – nothing I tried seemed to solve the issue. If only there was an easy way to view the actual request and response, maybe I could understand what the issue was…

This is where a proxy comes in handy: a proxy can inspect the traffic and print it in an easy-to-understand manner. There are a lot of available proxies you can use (like Charles (commercial) or Fiddler), but OWASP Zaproxy (Zap) is the best open source proxy that I know. Let’s see how easy it is to set it up:

Monitoring Kubernetes HPA Utilization

In the past few weeks, I was working on migrating a legacy microservice to Kubernetes. The migration process was relatively simple – mainly migrating the code from .NET Framework 4.5 to .NET Core 2.2. After making sure the service was deployed and working as expected, I started to gradually move production traffic to the new instance. The new service handled the traffic well, and I was happy – it looked like this task was almost complete!

After a few days of gradual rollout, I felt good enough to move all the traffic to the new service. And then it hit me: will the new service be able to handle the load of production traffic? I mean, I configured a Horizontal Pod Autoscaler (HPA) for this service – but is that enough? Apparently not. But before I explain why, let’s do a quick recap of HPA.
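As a rough illustration of why an HPA alone may not be enough, the sketch below implements the scaling rule from the Kubernetes documentation, `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, clamped to the configured bounds. The `hpa_utilization` helper (current replicas divided by the maximum) is an assumption about what "utilization" could mean here, not necessarily the definition the post uses, and the sketch ignores the tolerance and stabilization behavior of the real controller.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Documented HPA rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


def hpa_utilization(current_replicas, max_replicas):
    """Hypothetical metric: share of the allowed replicas already in use."""
    return current_replicas / max_replicas


# 4 pods at 90% CPU with a 60% target would need 6 pods,
# but the HPA is capped at 5, so it cannot scale any further.
replicas = desired_replicas(4, 90, 60, min_replicas=2, max_replicas=5)
utilization = hpa_utilization(replicas, 5)
```

When this ratio reaches 1.0, the autoscaler has no headroom left, which is the kind of situation worth alerting on.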

Keeping Prometheus in Shape

Prometheus is a great monitoring tool. It can easily scrape all the services in your cluster dynamically, without any static configuration. For me, the move from manual metrics shipping to Prometheus was magical. But, like any other technology we’re using, Prometheus needs special care and love. If not handled properly, it can easily get out of shape. Why does it happen? And how can we keep it in shape? Let’s first do a quick recap of how Prometheus works.

Prometheus Monitoring Model

Prometheus works differently from other monitoring systems – it uses a pull model rather than a push model. The push model is simple: just push metrics from your code directly to the monitoring system (Graphite, for example).
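As a minimal sketch of the push model, the snippet below formats a data point in Graphite's plaintext protocol (`<path> <value> <timestamp>`, one per line) and sends it over TCP. The function names, metric path and host are hypothetical, and port 2003 is assumed to be the Carbon plaintext listener, which is its usual default.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def push_metric(host, port, path, value):
    """Push a single data point; the service initiates the connection."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))


# Example call (hypothetical host, needs a reachable Graphite instance):
# push_metric("graphite.example.internal", 2003, "app.requests.count", 42)
```

The key point is the direction of the connection: the service decides when to send, and the monitoring system only receives.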

The pull model is fundamentally different: the service exposes metrics on a specific endpoint, and Prometheus scrapes them periodically (the scrape interval – see here how to configure it). While there are reasons to prefer the pull model over push, it has its own challenges: each scrape operation can take time, so what happens if a scrape takes longer than the scrape interval?

For example, let’s say Prometheus is configured to scrape its targets (that’s what services are called in Prometheus terminology) once every 20 seconds; what will happen if one scrape takes more than 20 seconds? The result is out-of-order metrics: instead of a data point every 20 seconds, there is one whenever a scrape completes. What can we do?
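The effect can be seen in a toy simulation. This is only a sketch under simplifying assumptions: scrapes of a single target run one at a time, a data point is recorded when its scrape completes, and a late scrape delays the next one. It is not Prometheus's actual scheduler, and it ignores the `scrape_timeout` setting; the function names are hypothetical.

```python
def sample_times(interval, durations):
    """Completion time of each scrape under a toy sequential scheduler."""
    times = []
    finished = 0.0
    for i, duration in enumerate(durations):
        scheduled = i * interval
        # A scrape cannot start before the previous one has finished
        start = max(scheduled, finished)
        finished = start + duration
        times.append(finished)
    return times


def gaps(times):
    """Spacing between consecutive data points."""
    return [round(b - a, 3) for a, b in zip(times, times[1:])]


# 20-second interval; the third scrape takes 25 seconds
times = sample_times(20, [1, 1, 25, 1, 1])
spacing = gaps(times)  # [20.0, 44.0, 1.0, 15.0]
```

A single slow scrape turns a steady 20-second cadence into gaps of 44, 1 and 15 seconds before the schedule recovers.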
