Source: Understanding error budget overspend – CRE Life Lessons from Google Cloud Platform
By Adrian Hilton, Alec Warner and Alex Bramley
In previous CRE Life Lessons blog posts, the Google Customer Reliability Engineering (CRE) team has spent a lot of time talking about service level objectives (SLOs), which measure whether your service is meeting its reliability targets from the point of view of its end users. Your SLO lets you specify how much downtime your service can have in a given period—for example, 43 minutes every 30 days for a service that needs to be available 99.9% of the time. This downtime allowance is your error budget. Like a household budget, it’s OK to spend this error budget over those 30 days, as long as you don’t spend more than that.
If you do run out of your error budget, either by spending a bit too much each day, or by having a major outage that blows it all at once, that tells you that your service’s users are suffering too much and it’s time to give them a break. How do you do that? Here are a few questions to consider to see if you need to recalibrate your error budget.
Your SLOs will be target values for corresponding service level indicators (SLIs), which are the measurements of the critical parts of the end-user experience. One SLI for the 99.9% available example system above might be “the percent of HTTP responses which are successful (200), out of all 20x and 50x HTTP responses.” You calculate your error budget spend by the percent of the measurement period where your service fails to reach all of its SLO targets; depending on the granularity and accuracy of your SLI measurement, this might be done on a per-minute, per-hour or even per-day basis.
When you analyze your error budget spend day-by-day, you should try to attribute the main causes of error budget spend over the measurement period:
For each of these cases, you have an objective measurement of whether the problem has been sufficiently addressed: you will expect your SLIs to stay high in circumstances where previously they plummeted.
Something else you should consider: Did the outage reflect real user pain? If you have a strong indicator that users weren’t concerned by a major outage that spent a chunk of your error budget, then you may not have to change your development practices or architecture, but you still have something to fix. Either you should determine a new, lower target level for your SLO, or you should find a different SLI that better represents the user experience.
Suppose you’re trying to run your service at a 99.9% availability level, with the corresponding 43-minute-per-month error budget, but you’re consistently failing to meet that; you’re spending 50-60 minutes per month. How much does that actually matter?
You probably have business intelligence channels for measuring customer happiness in terms of time spent on your site, purchase rate, support tickets raised and other fairly direct measurements of user happiness. Evaluate those statistics against your SLIs: Are your budget overspend periods correlated with less user happiness, and if so, what’s the correlation function? If a 50% error budget overspend corresponds to a 1% decrease in customer revenue, then you may feel that you can adjust your SLO target and aim for a 99.5% availability level, rather than spend a lot of engineering effort trying to raise your availability to the original target.
What is important in this case is to have, and document, the data used to determine the SLO target. You don’t want to fall into the trap of increasing your error budget by 50% each period because “users don’t really care”—you need to articulate the tradeoff in user happiness/spend vs. reliability in your SLO definition. An SLO specification shouldn’t just contain numbers and metric names. It should also reference the logic and data used to determine the SLO target.
It may be true that the customer is always right— but what if your service’s users are part of your company? In some cases, the overall business decision may be that continuing to build and release the software is in the best interest of the company as a whole, even if you’re consistently going over budget. The error budget spend may cause an inconvenience to employees, but failing to release new versions of the software would have a significant cost to the company that outweighs user inconvenience.
This can occur when there’s a disconnect between what the users of the software are perceived to need (for example, the 99.9% availability target of this example service) and what the executives who pay for the development of the software think these users should tolerate in the name of greater velocity.
Now that we understand what messages an error budget is telling us, in part two of this post we will look at how best to keep a positive balance.
Interested to learn more about site reliability engineering (SRE) in practice? We’ll be discussing how to apply SRE principles to deploy and manage services at Next ‘18 in July. Join us!