By Alex Bramley, Customer Reliability Engineer; Will Tipton, Site Reliability Engineer
In an earlier blog post, we discussed the spectrum of engineering effort between reliability and feature development and the importance of describing when and how an organization should dedicate engineering time towards the reliability of a service that is out of SLO. In this post, we show a lightly-edited SLO escalation policy and associated rationales from a Google SRE team to illustrate the trade-offs that particular teams make to maintain a high development velocity.
This SRE team works with large teams of developers focused on different areas of the serving stack, which comprises around ten high-traffic services and a dozen or so smaller ones, all with SRE support. The team has shards in Europe and America, each covering 12 hours of a follow-the-sun on-call rotation. The supported services have both coarse top-level SLOs representing desired user experience and finer-grained SLOs representing the availability requirements of stack components; crucially the SRE team can route pages to dev teams at the granularity of an individual SLO, making “revoking support” for an SLO both cheap and quick. Alerting is configured to page when the service has burned nine hours of error budget within an hour, and file a ticket when it has burned one week of error budget over the previous week.
It’s important to note that this policy is just an example, and probably a poor one if your SRE team supports a service with availability targets of 99.99% or higher. The industry that this Google team operates in is highly competitive and moves quickly, making feature iteration speed and time-to-market more important than maintaining high levels of availability.
Before getting into the specifics of the escalation policy, it’s important to consider the following broad points.
The intent of an escalation policy is not to be completely proscriptive; SREs are expected to make judgement calls as to appropriate responses to situations they face. Instead, this document establishes reasonable thresholds for specific actions to take place, with the intent of reducing the likely range of responses and achieving a measure of consistency. It’s structured as a series of thresholds that, when crossed, trigger the redirection of more engineering effort towards addressing an SLO violation.
Furthermore, SRE must focus on fixing the class of issue before declaring an incident resolved. This is a higher bar than fixing the issue itself. For example, if a bad flag flip causes a severe outage, reverting the flag flip is insufficient to bring the service back into SLO. SRE must instead ensure that flag flips in general are extremely unlikely to threaten the SLO in the future, with staged rollouts, automated rollbacks on push failures, and versioned configuration to tie flags to binary versions.
For the following four thresholds in the escalation policy, “bringing a service back into SLO” means:
In other words, a plan for manual remediation is not sufficient to consider the service back within SLO. Bear in mind that you usually need to understand the root cause of a violation to conclude that it’s unlikely to recur or to automate remediation.
Threshold 1 – wherein SRE are notified that an SLO is potentially impacted
SRE will maintain alerting so as to be notified of danger to supported SLOs. Upon being notified, SRE will investigate and attempt to find and address the root cause. SRE will consider taking mitigating actions, including redirecting traffic at the load balancers and rolling back binary or configuration pushes. SRE on-call engineers will notify the dev team about the SLO impact and keep them updated as necessary, but no action on their part is required at this point.
Threshold 2 – wherein SRE escalates to the developers
Threshold 3 – wherein SRE pauses feature releases and focuses on reliability
Threshold 4 – wherein SRE may escalate or revoke support
SREs are first responders, and there’s an expectation that they’ll make a reasonable effort to bring the service back within SLO before escalating to developers. As such, threshold 1 applies when the SRE team is notified about a violation, despite the one-week ticket alert indicating the seven-day budget is already exhausted. SRE should wait no longer than one week from the initial violation notification before escalating to developers, but they may exercise their own judgement as to whether escalation is appropriate before this point.
Every time SRE escalates, it’s important to ask developers whether the availability goals still represent the desired balance between reliability and development velocity. This gives them the choice between preserving availability goals by rolling back a new feature and temporarily relaxing them to preserve the availability of that feature for users if the latter is the desired user experience. For repeated violations of the same SLO in a short time window, you probably don’t need to ask the question over and over again, though that’s a strong signal that further escalation is necessary. It’s also OK to insist that developers take back the pager for the service until they’re willing to restore the previously-agreed availability targets—if they want to run a less reliable service temporarily so that a business-critical feature remains available while they work on its reliability, they can also shoulder the burden of its failures.
Blocking releases is an appropriate course of action for three main reasons:
SRE should prefer to unblock feature releases sooner rather than later, once the root cause(s) of a violation has been found and fixed. Giving our dev teams the benefit of the doubt that there will be no further service degradation before the SLO is in compliance over a 30-day window strikes a more acceptable balance between reliability and velocity. This is effectively “borrowing” future error budget to unblock the release before the service is compliant, with the expectation that it will be within a reasonable timeframe. Absent any push-related outages, new features should increase user happiness with the service, repaying some of the unhappiness caused by the SLO violation.
SRE may choose not to unblock releases if pre-violation error-budget burn rates were close to the SLO threshold. In this case, there’s less future budget to borrow, thus the risk of further violations is higher and the time until the service is SLO compliant will be significantly longer if releases are allowed to continue.
We hope that the above example gives you some ideas about how to make trade-offs between reliability and development velocity for a service where the latter is a key business priority. The main concessions to velocity are that SRE doesn’t immediately block releases when an SLO is violated, and provides a mechanism for them to resume before the SLO has returned to compliance with the informed consent of SRE. In the final post of the series, we’ll take these policy thresholds out for a spin with some hypothetical scenarios.