It’s all about SL*
Chapter 4 of Google’s SRE book focuses on defining service levels as a measurement tool. I know, this is not an SRE post, but bear with me for a few moments, so we can all speak the same language. What are those service levels?
- SLI – service level indicator: “A carefully defined quantitative measure of some aspect of the level of service that is provided”. SLI is just a fancy name for a metric that lets us measure how well a service is functioning. For example, the percentage of failed requests, out of all requests.
- SLO – service level objective: “A target value or range of values for a service level that is measured by an SLI”. Given an SLI, the SLO defines what should be the value (or range of values) for these metrics. For example – given the previous SLI (percentage of failed requests), one can define an SLO of 5%.
- SLA – service level agreement: “An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain”. The SLA is how we communicate our SLOs to the outside. Given the previous example, our SLO might be 5%, but we can decide our SLA is a bit less strict – for example, 10%. This gives us some flexibility, because breaching an SLA usually has a business impact (for example, our SLA might be part of the contract we signed with our users).
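To make the three definitions concrete, here is a minimal sketch in Python, using the article’s own example numbers (a 5% SLO and a 10% SLA over the failed-request SLI). The function names are illustrative, not a standard API:

```python
# SLI: a metric – here, the percentage of failed requests out of all requests.
# SLO: our internal target for that metric (5% in the article's example).
# SLA: the looser, externally communicated threshold (10% in the example).

def failed_request_sli(failed: int, total: int) -> float:
    """SLI: percentage of failed requests out of all requests."""
    return 100.0 * failed / total

SLO_PERCENT = 5.0   # internal objective
SLA_PERCENT = 10.0  # external, contractual threshold

def evaluate(failed: int, total: int) -> str:
    sli = failed_request_sli(failed, total)
    if sli > SLA_PERCENT:
        return "SLA breached"   # business consequences
    if sli > SLO_PERCENT:
        return "SLO breached"   # internal alarm, SLA still holds
    return "ok"

print(evaluate(3, 100))   # 3% failed  -> ok
print(evaluate(7, 100))   # 7% failed  -> SLO breached
print(evaluate(12, 100))  # 12% failed -> SLA breached
```

Note how the SLO being stricter than the SLA creates the buffer the article describes: the team gets an internal warning well before any contractual consequence kicks in.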
Measuring our AppSec program
“You can’t manage what you can’t measure” – Peter Drucker
Defining metrics is critical to a successful AppSec program (and you can see that both SAMM and BSIMM specify it). Measuring our AppSec program serves two purposes:
- Allows us to declare what is important to us, and clearly communicate it
- Allows us to track our progress, especially to measure if our money/time is well spent
Borrowing from the SL* definitions above, we can define equivalent risk measures:
- SRI – service risk indicator: a carefully defined quantitative measure of some aspect of the risk of the service that is provided (almost the same definition). This should be a single metric that lets us measure the risk of a given service (or maybe a team/company/feature?). Examples could be: the number of monthly security incidents, the number of monthly reports from a bug bounty program, SAST/DAST findings, the percentage of PRs with a security code review, or maybe even the percentage of features that went through threat modeling.
- SRL – service risk level: a target value or range of values for a service risk that is measured by an SRI. This is where things become interesting: what if each team defined their own SRIs? One team might need a stricter SRL, and choose the number of monthly reports as their SRI. This team could choose 1 as their SRL, and act accordingly.
- SRA – service risk agreement: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SRLs they contain. By defining SRAs, we can better communicate to our users, for example, the security of our product. Imagine having a status page, like the one you already have, just for SRAs – it gives your users a lot of visibility into what you care about and how secure the product they’re using is.
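The per-team flavor of SRIs and SRLs sketched above could look something like this. This is a hypothetical illustration of the idea, not an existing tool; the class, field names, and numbers are all assumptions:

```python
# Hypothetical sketch: each team picks its own risk indicator (SRI) and a
# target value for it (SRL), mirroring how SLIs/SLOs are chosen per service.

from dataclasses import dataclass

@dataclass
class RiskTarget:
    team: str
    sri_name: str   # which indicator this team chose to track
    srl: float      # the team's target value for that indicator

def within_srl(target: RiskTarget, measured: float) -> bool:
    """True if the measured SRI value meets the team's SRL."""
    return measured <= target.srl

# A stricter team, as in the example: at most 1 monthly bug bounty report.
payments = RiskTarget(team="payments",
                      sri_name="monthly bug bounty reports", srl=1)

print(within_srl(payments, 0))  # True  – within the risk level
print(within_srl(payments, 3))  # False – risk level breached, act accordingly
```

Because each team owns its `RiskTarget`, two teams can track entirely different SRIs and still be compared on the same question: are you within your agreed risk level?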
Error Budget
Chapter 3 of Google’s SRE book discusses the idea of an “Error Budget”: “The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter”. The error budget allows us to have an informed discussion about technical issues. For example, should we spend time in the next sprint re-writing a service? Without data, it’s basically a fight between the product manager and the developers.
So how does this relate to the SL* discussion above? The error budget is tightly coupled to our SLO. Each time the SLO is breached, some of the error budget is burned – depending on how long it was breached. We can now look at the error budget when planning features. Do we think re-writing a service is critical? Let’s define, together, how much of the error budget we are willing to burn each month/week before stopping and investing time in technical issues. 5%? 10%? The error budget lets us set the balance we agree upon between features and technology. There is no right answer here – each team has the answer that is right for them. But once we agree on a number – for example, 5% – we can use it when planning and discussing what we should do next. More importantly, we can easily communicate it outside and inside our team.
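The burn calculation behind that discussion can be sketched in a few lines. This is a simplified model, assuming the budget is measured as time spent out of SLO within a period; the 5% threshold is the article’s example, and the function name is illustrative:

```python
# Error budget burn, modeled as the share of a period spent in breach of
# the SLO. If the burn exceeds the agreed number, the team pauses feature
# work and invests in technical issues instead.

def budget_burned_percent(minutes_out_of_slo: float,
                          minutes_in_period: float) -> float:
    """Percentage of the period spent in breach of the SLO."""
    return 100.0 * minutes_out_of_slo / minutes_in_period

AGREED_BURN_PERCENT = 5.0  # the number the team agreed on up front

minutes_in_month = 30 * 24 * 60  # 43,200 minutes
burned = budget_burned_percent(2_500, minutes_in_month)

if burned > AGREED_BURN_PERCENT:
    print("Budget overspent: pause features, invest in technical issues")
else:
    print("Within budget: keep shipping features")
```

The point is not the arithmetic but the agreement: once 5% is written down, the prioritization conversation starts from a shared number instead of opinions.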