By Adrian Hilton, Customer Reliability Engineer
In Part 1 of this blog post we looked at why an SRE team would or wouldn’t choose to onboard a new application. In this installment, we assume that the service would benefit substantially from SRE support, and look at what needs to be done for SREs to onboard it with confidence.
Q: We have a new application that would make sense for SRE to support. Do I just throw it over the wall and tell the SRE team “Here you are; you’re on call for this now, best of luck”?
That’s a great approach, if your goal is failure. At first, your developer team’s assessment of the application’s importance, and of whether it’s in a supportable state, is likely to be rather different from your SRE team’s assessment, and arbitrarily imposing support for a service onto an SRE team is unlikely to work. Think about it: you haven’t yet convinced them that the service is a good use of their time, and people don’t enthusiastically embrace work that they don’t really believe in, so they’re unlikely to be active participants in making the service materially more reliable.
At Google, we’ve found that to successfully onboard a service into SRE, the service owner and SRE team must agree to a process for the SRE team to understand and assess the service, and identify critical issues to be resolved upfront. (Incidentally, we follow a similar process when deciding whether or not to onboard a Google Cloud customer’s application into our Customer Reliability Engineering program.) We typically split this into two phases:
It’s important to remember the motivations of the various parties in this process:
During an SRE entrance review (SER), also referred to as a Production Readiness Review (PRR), the SRE team takes the measure of a service currently running in production. The purpose of an SER is to:
An SRE team typically designates a single person or a small subset of the team to familiarize themselves with the service, and evaluate it for fitness for takeover.
The SRE looks at the service as-is: its performance, monitoring, associated operational processes and recent outage history, and asks themselves: “If I were on-call for this service right now, what are the problems I’d want to fix?” They might be visible problems, such as too many pages happening per day, or potential problems such as a dependency on a single machine that will inevitably fail some day.
A critical part of any SRE analysis is the service’s Service Level Objectives (SLOs), and associated Service Level Indicators (SLIs). SREs assume that if a service is meeting its SLOs then paging alerts should be rare or non-existent; conversely, if the service is in danger of falling out of SLO then paging alerts are loud and actionable. If these expectations don’t match reality, the SRE team will focus on changing either the SLO definitions or the SLO measurements.
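To make the SLO expectation above concrete, here is a minimal Python sketch that compares a measured availability SLI against an SLO target and decides whether a page seems warranted. Everything specific in it is an assumption for illustration: the function names, the idea of expressing headroom as "error budget remaining", and the `page_below` threshold are hypothetical and are not taken from Google's alerting practice.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (the SLI)."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no failures at all, 0.0 means the SLO is exactly met,
    and a negative value means the service is out of SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


def should_page(sli: float, slo_target: float, page_below: float = 0.1) -> bool:
    """Page only when the service is close to, or already, out of SLO.

    page_below is a hypothetical threshold: page once less than 10%
    of the error budget is left.
    """
    return error_budget_remaining(sli, slo_target) < page_below
```

With a hypothetical 99.9% target, a service measuring 99.95% availability has about half its budget left and stays quiet, while one measuring 99.85% is out of SLO and pages. If real pages fire while `should_page` would return `False`, that mismatch is the kind of signal that prompts a review of the SLO definition or its measurement.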
In the review phase, SREs aim to understand:
The SRE team also considers:
The SRE entrance review typically produces a prioritized list of issues with the service that need to be fixed. Most will be assigned to the development team, but the SRE team may be better suited for others. In addition, not all issues are blockers to SRE takeover (there might be design or architectural changes that SREs recommend for service robustness that could take many months to implement).
There are four main axes of improvement for a service in an onboarding process: extant bugs, reliability, automation and monitoring/alerting. On each axis there will be issues which will have to be solved before takeover (“blockers”), and others which would be beneficial to solve but not critical.
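One way to picture the review's output is as a small data model: each issue sits on one of the four axes, is either a blocker or not, and has a priority and an owner. The following is a rough Python sketch under those assumptions. The field names, the `"dev"`/`"sre"` owner labels, and the rule that takeover is possible once every blocker is resolved are illustrative choices, not a description of any actual review tool.

```python
from dataclasses import dataclass
from enum import Enum


class Axis(Enum):
    EXTANT_BUGS = "extant bugs"
    RELIABILITY = "reliability"
    AUTOMATION = "automation"
    MONITORING_ALERTING = "monitoring/alerting"


@dataclass
class Issue:
    title: str
    axis: Axis
    blocker: bool        # must be solved before takeover
    priority: int        # 1 = most urgent (hypothetical scale)
    owner: str = "dev"   # "dev" or "sre" (hypothetical labels)
    resolved: bool = False


def prioritized(issues: list[Issue]) -> list[Issue]:
    """Blockers first, then by ascending priority number."""
    return sorted(issues, key=lambda i: (not i.blocker, i.priority))


def open_blockers(issues: list[Issue]) -> list[Issue]:
    """Blocking issues that are still unresolved."""
    return [i for i in issues if i.blocker and not i.resolved]


def ready_for_takeover(issues: list[Issue]) -> bool:
    """True once no unresolved blockers remain; non-blockers may stay open."""
    return not open_blockers(issues)
```

Sorting on `(not i.blocker, i.priority)` keeps a long-running architectural recommendation on the list without letting it hold up takeover, which matches the distinction drawn above between blockers and beneficial-but-not-critical work.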
The primary source of issues blocking SRE takeover tends to be action items from the service’s previous postmortems. The SRE team expects to read recent postmortems and verify that a) the proposed actions to resolve the outage root causes are what they’d expect and b) those actions are actually complete. Further, the absence of recent postmortems is a red flag for many SRE teams.
Some reliability-related change requests might not directly block SRE takeover, as many reliability improvements relate to design, significant code changes, a change in back-end integrations or migration off a deprecated infrastructure component, and are targeting the longer-term evolution of the system towards a desired reliability increase.
The reliability-related changes that block takeover would be those which mitigate or remove issues which are known to cause significant downtime, or mitigate risks which are expected to cause an outage in the future.
Automation is a key concern for SREs considering takeover of a service: how much manual work needs to be done to “operate” the service on a week-to-week basis, including configuration pushes, binary releases and similar time-sinks.
The best way to find out what would be most useful to automate is for the SREs to get practical experience of the developers’ world. This means that the SREs should shadow the developer team’s typical week and get a feel for what routine manual work their on-call actually involves.
If there’s excessive manual work involved in supporting a service, automation usually solves the problem.
The dominant concern with most services undergoing SRE takeover is the paging rate: how many times the service wakes up the on-call staff. At Google, we adhere to the “Treynor Maximum” of an average of two incidents per 12-hour shift (for an on-call team as a whole). Thus, an SRE team looks at the average incident load of a new service over the past month or so to see how it fits with their current incident load.
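The arithmetic behind that comparison is simple enough to sketch. The Python below computes a service's average incidents per 12-hour shift over a window and checks whether adding it to a team's existing load stays within two incidents per shift. The function names and the way incidents are represented (a list of timestamps) are assumptions for illustration. Only the 12-hour shift length and the limit of two come from the text.

```python
from datetime import datetime, timedelta

SHIFT_HOURS = 12                 # shift length quoted in the text
MAX_INCIDENTS_PER_SHIFT = 2.0    # average limit quoted in the text


def incidents_per_shift(
    incident_times: list[datetime],
    window_start: datetime,
    window_end: datetime,
) -> float:
    """Average number of incidents per 12-hour shift within the window."""
    shifts = (window_end - window_start) / timedelta(hours=SHIFT_HOURS)
    if shifts <= 0:
        raise ValueError("window_end must be after window_start")
    in_window = [t for t in incident_times if window_start <= t < window_end]
    return len(in_window) / shifts


def fits_incident_budget(
    current_rate: float,
    new_service_rate: float,
    limit: float = MAX_INCIDENTS_PER_SHIFT,
) -> bool:
    """True if the team's combined average load stays within the limit."""
    return current_rate + new_service_rate <= limit
```

For example, a 30-day window contains 60 shifts, so a service with 45 incidents in that period averages 0.75 per shift. A team already averaging 1.0 per shift could absorb it, while a team averaging 1.5 could not without first reducing the paging rate.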
Generally, excessive paging rates are the result of one of three things:
SREs generally want to see several weeks of low paging levels before agreeing to take over a service.
More general ways to improve the service might include:
Ultimately, an SRE entrance review should produce guidance that’s useful to developers even if the SRE team declines to onboard the service. In that event, the guidance from the review should still help developers make their service easier to operate and more reliable.
SREs need to understand the developers’ service, but SREs and developers also need to understand each other. If the developer team has not worked with SREs before, it can be useful for SREs to give “lightning” talks to the developers on SRE topics such as monitoring, canarying, rollouts and data integrity. This gives the developers a better idea of why the SREs are asking particular questions and pushing particular concerns.
One of Google’s SREs found that it was useful to “pretend that I am a dev team novice, and have the developer take me through the codebase, explain the history, show me where the main() function is, and so on.”
Similarly, SREs should understand the developers’ point of view and experience. During the SER, at least one SRE should sit with the developers, attend their weekly meetings and stand-ups, informally shadow their on-call and help out with day-to-day work to get a “big picture” view of the service and how it runs. It also helps remove distance between the two teams. Our experience has been that this is so positive in improving the developer-SRE relationship that the practice tends to continue even after the SER has finished.
Last but not least, the SRE entrance review document should also state clearly whether the service merits SRE takeover, and why (or why not).
At this point, the developer team and SRE team both understand what needs to be done to make a service suitable for SRE takeover, if it is indeed feasible at all. In Part 3 of this blog post, we’ll look at how to proceed with a service takeover so that both teams can benefit from the process.