AppSec Learning SRE principles: Metrics and Measurements

The field of SRE (site reliability engineering) is relatively mature, and there is a lot written about it. In particular, Google released a really good book discussing how SRE works at Google. What can we, AppSec engineers (ASE?), learn from SRE principles to improve our field? Today I want to focus on one aspect: metrics and measurements – something that is my personal focus right now. What metrics do SREs define and measure, and can we define something similar?

It’s all about SL*

Chapter 4 of Google’s SRE book focuses on defining service levels as a measurement tool. I know, this is not an SRE post, but bear with me for a few moments, so we can all speak the same language. What are those service levels?
  • SLI – service level indicator: “A carefully defined quantitative measure of some aspect of the level of service that is provided”. SLI is just a fancy name for a metric that lets us measure how well a service is functioning – for example, the percentage of failed requests out of all requests.
  • SLO – service level objective: “A target value or range of values for a service level that is measured by an SLI”. Given an SLI, the SLO defines what the value (or range of values) of that metric should be. For example, given the previous SLI (percentage of failed requests), one can define an SLO of 5%.
  • SLA – service level agreement: “An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain”. The SLA is how we communicate our SLOs to the outside world. Continuing the previous example, our SLO might be 5%, but we can decide our SLA is a bit less strict – for example, 10%. This gives us some flexibility, because breaching an SLA usually has a business impact (for example, our SLA might be part of the contract we signed with our users).
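To make the three levels concrete, here is a minimal sketch of the failed-requests example above. The function name and the thresholds are my own, purely illustrative choices, not something from the SRE book:

```python
# A minimal sketch of the SLI -> SLO -> SLA chain from the examples above.
# All names and thresholds here are illustrative.

def sli_failed_request_pct(failed: int, total: int) -> float:
    """SLI: percentage of failed requests out of all requests."""
    return 100.0 * failed / total if total else 0.0

SLO = 5.0   # internal target: at most 5% failed requests
SLA = 10.0  # external contract: deliberately looser than the SLO

sli = sli_failed_request_pct(failed=30, total=1000)
print(f"SLI = {sli:.1f}%")          # 3.0%
print("SLO met:", sli <= SLO)
print("SLA met:", sli <= SLA)
```

The gap between the 5% SLO and the 10% SLA is exactly the flexibility described above: we can breach our internal target without immediately breaching the contract.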
Ok, so this is interesting – but we are AppSec engineers, not SREs. Why should we care about all those SL*?

Measuring our AppSec program

You can’t manage what you can’t measure

Peter Drucker
Defining metrics is critical to a successful AppSec program (and you can see that both SAMM and BSIMM specify it). Measuring our AppSec program serves two purposes:
  • Allows us to declare what is important to us, and clearly communicate it
  • Allows us to track our progress, especially to measure if our money/time is well spent
Which brings us to one of the most challenging questions – what should we measure? Here there is no clear framework like the one we have for SRE. But can we reuse that framework? What if we try to measure service risk the way we measure service level? We could have the following:
  • SRI – service risk indicator: a carefully defined quantitative measure of some aspect of the risk of the service that is provided (almost the same definition). This should be a single metric that lets us measure the risk of a given service (or maybe a team/company/feature?). Examples could be the number of monthly security incidents, the number of monthly reports from a bug bounty, SAST/DAST findings, the percentage of PRs with a security code review, or maybe even the percentage of features that conduct a threat model.
  • SRO – service risk objective: a target value or range of values for a service risk that is measured by an SRI. This is where things become interesting: what if each team defined their own SRIs and SROs? One team might need a stricter objective, and choose the number of monthly bug bounty reports as their SRI. That team could choose 1 as their SRO, and act accordingly.
  • SRA – service risk agreement: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SROs it contains. By defining SRAs, we can better communicate, for example, the security of our product to our users. Imagine having a status page, like the one you already have for availability, just for SRAs – it gives your users a lot of visibility into what you care about and how secure the product they’re using is.
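As a rough illustration of how a team might encode its own SR* definitions, here is a hypothetical sketch. The `ServiceRisk` class and all the values in it are invented for this example:

```python
# Hypothetical sketch: per-team service risk definitions (SRI/SRO/SRA).
# The metric name and thresholds below are made-up examples.
from dataclasses import dataclass

@dataclass
class ServiceRisk:
    sri_name: str  # what we measure (the SRI)
    sro: float     # the internal target for that SRI (the SRO)
    sra: float     # the external, usually looser, commitment (the SRA)

    def check(self, measured: float) -> dict:
        """Compare a measured SRI value against the SRO and SRA."""
        return {"sro_met": measured <= self.sro,
                "sra_met": measured <= self.sra}

# The stricter team from above picks monthly bug bounty reports as its SRI
# and targets at most 1, while committing externally to at most 3:
team_risk = ServiceRisk(sri_name="monthly bug bounty reports", sro=1, sra=3)
print(team_risk.check(measured=2))  # SRO breached, SRA still met
```

The point of the structure is the same as for SL*: each team can pick the SRI that matches its risk profile, and the SRO/SRA gap gives room to react before any external commitment is broken.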
But defining metrics is only the first step. How should we act upon these metrics? This is where another SRE tool comes in handy – error budgets.

Error Budget

Chapter 3 of Google’s SRE book discusses the idea of the “error budget”: “The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter”. The error budget allows us to have an informed discussion about technical issues. For example, should we spend time in the next sprint on rewriting a service? Without data, it’s basically a fight between the product manager and the developers.


So how does this relate to the SL* discussion above? The error budget is tightly coupled to our SLO. Each time the SLO is breached, some of the error budget is burned – depending on how long it was breached. We can now use this when planning features. Do we think rewriting a service is critical? Let’s define, together, how much of the error budget we are willing to burn each month/week before stopping and investing time in technical issues. 5%? 10%? The error budget lets us set the balance we agree upon between features and technology. There is no right answer here – each team has the answer that is right for them. But once we’ve agreed on a number – for example, 5% – we can use it when planning and discussing what we should do next. More importantly, we can easily communicate it inside and outside our team.
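One simplistic way to account for the burn – assuming a daily-sampled SLI, and counting each percentage point over the SLO as budget burned. This is just one possible accounting scheme, not the book’s definition:

```python
# Sketch of tracking error-budget burn against an agreed monthly allowance.
# Numbers are illustrative: a 5% failed-request SLO, sampled once per day.

SLO = 5.0             # max % failed requests
MONTHLY_BUDGET = 5.0  # % of budget we agreed to burn before acting

daily_sli = [3.2, 4.8, 6.1, 5.5, 4.0]  # % failed requests per day (example data)

# Each day the SLO is breached burns budget, proportional to how far
# we exceeded the target on that day.
burned = sum(max(0.0, sli - SLO) for sli in daily_sli)
print(f"Budget burned so far: {burned:.1f}%")
print("Pause features, invest in reliability:", burned > MONTHLY_BUDGET)
```

Once the team has agreed on the allowance, the planning conversation stops being a fight and becomes a comparison between `burned` and `MONTHLY_BUDGET`.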

Defining the risk budget

So how is this relevant to our AppSec program? I’m sure we’re all familiar with the recurring discussion with product managers (and developers) about security controls. It’s always a hard discussion, and usually not a very informed one. Now that we have our SR* framework, we can define our risk budget. It works the same way: each time an SRO is breached, some of the risk budget is burned. Now we just need to agree on how much risk budget we are willing to burn each month/week – let’s say 5%. Planning is easy now – we look at how much budget we’ve already burned and use that to decide how much to invest in AppSec. The risk budget also helps us monitor how good our SRIs/SROs are – if it burns too slowly or too quickly, we should probably redefine them (which is also true for SRE, and is why the error budget is also used for monitoring – read more about it here). If we feel we’re not spending enough on security, let’s look at our SR*. Or maybe we should redefine our risk budget? Or, even better, maybe it’s just a feeling and we’re actually in a pretty good state?
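A toy sketch of how the burn rate could drive both the planning decision and the sanity check on the SRO itself – all thresholds and messages here are made up:

```python
# Hypothetical sketch: using risk-budget burn to steer AppSec investment
# and to sanity-check the SRO itself. Thresholds are invented examples.

def plan(burned_pct: float, monthly_allowance: float = 5.0) -> str:
    """Decide next steps from how much risk budget we burned this month."""
    if burned_pct > monthly_allowance:
        return "pause features, invest in security"
    if burned_pct == 0.0:
        return "budget never burns: consider a stricter SRO"
    return "on track: keep the current feature/security balance"

print(plan(burned_pct=7.0))  # over budget
print(plan(burned_pct=0.0))  # SRO may be too loose
```

The middle branch is the monitoring idea from above: a budget that never burns is just as much a signal as one that burns too fast.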

Putting it all together

We started with the need to define metrics for our AppSec program. Rethinking SRE frameworks like SL* and the error budget, and redefining them for AppSec, provides a lot of value. Defining our SRI/SRO/SRA helps us communicate what we care about – and measure it. The risk budget allows us to have an informed discussion on how much we need to invest in security – it’s a tool to clearly define the balance between features and security. What I’ve shared here are just some thoughts I’ve been having recently – not something I have a lot of experience with. I’m sharing it to start a discussion, and I’m looking forward to hearing feedback: Is it interesting? Is it something you think is worth trying? What other SRE practices can we bring into AppSec (hint: chaos engineering!)? Please share your ideas with me – let’s discuss on the OWASP Slack (join us in the #appsec-program channel!), on Twitter, or here in the comments.
