Source: CRE Life Lessons – Good housekeeping for error budgets from Google Cloud Platform
By Adrian Hilton, Alec Warner and Alex Bramley
In a previous CRE Life Lessons blog post, we talked about the error budget for a service and what it tells you about how close your service is to breaching its reliability targets, the service-level objectives (SLOs). Once you’ve done some digging to understand why you may be consistently overspending your error budget, it’s time to fix the root causes of the problem.
Those of us who have held a significant balance on a credit card are familiar with the large bite it can take out of a monthly household budget. Good housekeeping practice means that we should be looking to pay down as much of that debt as possible in order to shrink the monthly charge. Error budgets work the same way.
Once your error budget spend analysis identifies the major contributors to the spending rate, you should be prepared to redirect your developers’ efforts from new features to addressing those causes of spend. This might mean an improved QA environment or test set to catch more release errors before they hit production, or better roll-out automation and monitoring to detect and roll back bad releases more quickly.
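As a rough illustration of that kind of spend analysis, the sketch below totals error budget spend per cause and ranks the causes by their share of the total. The incident data, the cause labels and the function name `spend_by_cause` are all hypothetical; your own incident tracker or postmortem database will have its own categories.

```python
from collections import defaultdict

# Hypothetical incident log: (cause, minutes of error budget spent).
incidents = [
    ("bad release", 12.0),
    ("platform outage", 6.0),
    ("bad release", 9.5),
    ("config push", 2.5),
]


def spend_by_cause(incidents):
    """Return (cause, minutes, share_of_total) tuples, largest spend first."""
    totals = defaultdict(float)
    for cause, minutes in incidents:
        totals[cause] += minutes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(cause, minutes, minutes / grand_total) for cause, minutes in ranked]


for cause, minutes, share in spend_by_cause(incidents):
    print(f"{cause}: {minutes:.1f} min ({share:.0%} of spend)")
```

In this made-up data set, releases account for most of the spend, which would point toward release-related fixes rather than, say, platform resilience work.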
The effect of this approach is likely to be that you make less frequent releases, or each release has fewer changes and hence is less likely to contain an error-budget-impacting problem. You’re slowing down release velocity temporarily in order to allow safer releasing at the original velocity in future.
Another issue to consider is: What if the error budget overspend wasn’t the developers’ fault? If your data center or cloud platform has a hardware outage, there’s not much the developers can do about it. Sure, your end users don’t care why the service broke, and you don’t want to make their lives worse, but it seems harsh to ding your developers for someone else’s failure. This should surface in your analysis of error budget spend, as described above.
What next? You may need to talk to the owners of that platform about their historical (measured) reliability and how it squares with you trying to run your service at your target SLO. It may be that changes are needed on both sides: You change your system to be able to detect and tolerate certain failures from them, and they improve detection and resolution time of the failure cases that impact you.
Often, a platform is not going to change significantly, so you have to decide how to account for that error spend in future. You may decide that it’s significant enough that you need to increase your resilience to it, e.g., by implementing (and exercising!) the option to fail your service automatically out of an affected cloud region over to an unaffected region. (See our “Defining SLOs for services with dependencies” blog post, which dealt with this problem in depth.)
It could be, however, that your analysis leads you to the conclusion that software releases are a major source of your error budget spend. Certainly, our experience at Google is that binary rollouts are one of the top sources of outages; many a postmortem starts “We rolled out a new release of the software, which we thought did <X>, which our users would love, but in fact it did <Y>, which caused users to see errors in their browser/be unable to view photos of their cat/receive 100 email bounces a minute.”
The canonical response to a series of bad releases that overspend the error budget is to impose a freeze on the release of new features. This can be seen as a last-resort action: it acknowledges that existing efforts to pay down debt have not delivered sufficient reliability improvement, so the rate of change must be lowered instead to protect the user experience. A freeze of this nature can also give development teams the space and direction to refocus their attention away from features and onto reliability improvements. However, it’s a drastic step to take.
Other ways you can avoid freezing include:
If you really have to impose a new features freeze, how long should it last? Generally, it should last until you have removed the overspend in your error budget, and have confidence it will not recur. We’ve seen two principal methods of error budget calculation: fixed intervals (say, each calendar month) and rolling intervals (the last N days).
If you operate a fixed interval for your error budget calculation, your reaction to an error budget overspend depends on when it happens. If it happens on day 1, you spend the whole month frozen; if it’s on day 28, you may not actually need to stop releasing because your next release may be in the next month, when the error budget is reset. Unless your customers are also sensitive to outages on a calendar-month basis, this is not a good way to optimize their experience.
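To make the date dependence concrete, here is a minimal sketch that assumes the budget resets on the first day of each calendar month and that a freeze starts on the day of the overspend. The function name `frozen_days_fixed_interval` is our own invention for illustration.

```python
import calendar
from datetime import date


def frozen_days_fixed_interval(overspend_date: date) -> int:
    """Days frozen, counting the overspend day, until the budget resets.

    Assumes the error budget resets on the 1st of the following month.
    """
    days_in_month = calendar.monthrange(overspend_date.year, overspend_date.month)[1]
    return days_in_month - overspend_date.day + 1


# The same overspend leads to very different freeze lengths in a 30-day month.
print(frozen_days_fixed_interval(date(2018, 6, 1)))   # 30
print(frozen_days_fixed_interval(date(2018, 6, 28)))  # 3
```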
With a rolling 30-day error budget measurement period, each new day N your 99.9% available service regains whatever error budget it spent on day N-30, the day that has just dropped out of the window. So if your budget is 20 minutes overspent, you need to wait until 20 minutes of past spend has aged out of the window. If you spent 15 minutes of your budget on day N-29 and five minutes on day N-28, you’d need to wait two more days to get out of debt, assuming no further outages. In practice, you’d probably wait until you accumulate a buffer of 20% of your error budget so you are resilient to minor unexpected spends.
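The arithmetic can be sketched as follows. A 99.9% SLO over 30 days allows 0.1% of 43,200 minutes, or 43.2 minutes (2,592 seconds), of unavailability. The code works in whole seconds to avoid floating-point rounding, and the function names, the layout of the `window` list and the placement of the third outage on day N-10 are assumptions made purely for illustration.

```python
def error_budget_seconds(slo: float, window_days: int = 30) -> int:
    """Allowed seconds of unavailability for the given SLO over the window."""
    return round((1.0 - slo) * window_days * 24 * 60 * 60)


def days_until_recovered(daily_spend, budget, buffer_fraction=0.0):
    """Days to wait until the remaining budget reaches the desired buffer.

    daily_spend lists the seconds spent on each day of the current window,
    oldest day first. Assumes no further spend while waiting.
    """
    target = buffer_fraction * budget
    remaining = budget - sum(daily_spend)
    for days_waited, oldest_spend in enumerate(daily_spend):
        if remaining >= target:
            return days_waited
        # One more day passes: the oldest day's spend ages out of the window.
        remaining += oldest_spend
    return len(daily_spend)


budget = error_budget_seconds(0.999)  # 2592 seconds, i.e. 43.2 minutes

# Window for days N-29 .. N: 15 min on N-29, 5 min on N-28, 43.2 min on N-10.
window = [0] * 30
window[0] = 15 * 60
window[1] = 5 * 60
window[19] = 2592

print(budget - sum(window))                                        # -1200 (20 minutes overspent)
print(days_until_recovered(window, budget))                        # 2
print(days_until_recovered(window, budget, buffer_fraction=0.2))   # 20
```

Note how much longer the wait becomes in this made-up scenario once you insist on a 20% buffer: the two old outages aging out only brings the balance back to zero, so the buffer appears only when the larger outage on day N-10 also leaves the window.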
Following this guidance, if you have a major outage that spends your entire month’s budget in one day, then you’d be frozen for an entire month. In practice, this may not be acceptable. At the very least, though, you should be drastically down-scaling your release velocity in order to have more engineering time to devote to fixing the root causes (see “Paying off your error budget debt” above). There are other approaches, too: Check out the discussion about blocking releases in an earlier episode of CRE Life Lessons, where we analyzed an example escalation policy.
As you can see, the rolling period for error budget measurement is less prone to a varying reaction depending on the particular date of an outage. We recommend that you adopt this approach if you can, though admittedly it can be challenging to accumulate this kind of data in monitoring tools currently.
Freezing the release of new features isn’t free of cost. In a worst-case scenario, if your developers are continuing new feature development but not releasing those features to users, the changes will build up, and when you finally resume releases it is almost inevitable that you’re going to see a series of broken releases. We’ve seen this happen in practice: if we impose a freeze on a service over an event like Black Friday or New Year’s, we expect that the week following the freeze will be unusually busy with service failures as all the backed-up changes reach users. To avoid this, it’s important to re-emphasize to teams affected by the freeze that it is intended to provide space to focus on reliability development, not feature development.
Sometimes it’s not possible to freeze all releases. Your company may have a major event coming up, such as a conference, and so there’s a compelling need to push certain new features into production no matter what the recent experience of its users. One process you could adopt in this case is the concept of a silver bullet: The product management team has a (very limited) right to override a release freeze to deploy a critical feature. To make this approach work well, that right needs to be expensive to exercise and limited in frequency: The spend of a silver bullet should be regarded as a failure, and require a postmortem to analyze how it came about and how to mitigate the risk of it happening again.
An error budget is a crucial concept when you’re taking a principled approach to service reliability. Like a household budget, it’s there for you (the service owner) to spend, and it’s important for the service stakeholders to agree, before it happens, on what to do when you overspend it. If you find you’ve overspent, a feature freeze can be an effective tool to prioritize development time toward reliability improvements. But remember that reflexively freezing your releases when you blow through your error budget isn’t always the appropriate response. Consider where your budget is being spent, how to reduce the major sources of spend and whether some loosening of the purse strings is in order. The most important principle: Do it based on data!
Interested in learning more about site reliability engineering (SRE) in practice? We’ll be discussing how to apply SRE principles to deploy and manage services at Next ’18 in July. Join us!