Source: Transitioning a typical engineering ops team into an SRE powerhouse from Google Cloud
Perpetually adding engineers to ops teams to meet customer growth doesn’t scale. Google’s Site Reliability Engineering (SRE) principles can help, bringing software engineering solutions to operational problems. In this post, we’ll take a look at how we transformed our global network ops team by abandoning traditional network engineering orthodoxy and replacing it with SRE. Read on to learn how Google’s production networking team tackled this problem and consider how you might incorporate SRE principles in your own organization.
Scaling to the limit
In 2011, a talented and rapidly growing team of engineers supported Google’s production network: a constellation of technology, constantly growing, and constantly in need of attention. We were debugging, upgrading, upsizing, repairing, monitoring, and installing 24 hours a day, seven days a week. We were spread across three time zones, and we followed the sun.
In a 100-person team, communication was hard, and decision making was even harder. As a consequence, a tendency toward resisting change crept in. With resistance to change came difficulty in supporting Google’s agile development teams. Therefore, as a logical next step, we broke this large group into smaller teams, each with more focus. That was certainly necessary, and it helped us to go deeper into the technology and make better decisions, but this, too, had a time limit. The technology evolved on a weekly basis, and eventually the workload started to outstrip the available engineering resources. The constant demand for specialist expertise meant that it wasn’t possible to simply throw more people at the problem.
As an example, let’s say that Google’s network had 100 routers that carried its production traffic, and we wanted to upgrade each router, each quarter. Well, that’s roughly 33 routers divided between 33 people in each engineering site, or one per person. That’s a piece of cake: We all got to upgrade one router each quarter.
That doesn’t sound burdensome, but let’s say we found a bug in the latest release and we needed to roll back. Further, what would happen when we got to 1,000 routers? Each engineer now has to upgrade 10 routers every quarter. How about 10,000 routers? You’ve got to be kidding me. Upgrading router software every day for the whole year? It became clear that we would eventually be performing this work to the exclusion of other important work, and struggling to hire enough people (and train them!) to keep up with the demand.
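To make the scaling arithmetic concrete, here is a minimal sketch in Python. It assumes the figures from the example (a fixed team of 100 engineers and one upgrade per router per quarter); the function name and the 250 working days per year are illustrative assumptions, not figures from Google.

```python
def upgrade_load(routers, engineers=100, working_days_per_year=250):
    """Per-engineer upgrade workload when every router is upgraded once a quarter.

    engineers=100 comes from the example above; working_days_per_year=250
    is an assumed round number used only for illustration.
    """
    per_quarter = routers / engineers          # upgrades per engineer, per quarter
    per_year = per_quarter * 4                 # four quarters in a year
    per_working_day = per_year / working_days_per_year
    return {
        "per_quarter": per_quarter,
        "per_year": per_year,
        "per_working_day": per_working_day,
    }


if __name__ == "__main__":
    for fleet_size in (100, 1_000, 10_000):
        load = upgrade_load(fleet_size)
        print(
            f"{fleet_size:>6} routers: "
            f"{load['per_quarter']:.0f} per engineer per quarter, "
            f"{load['per_working_day']:.2f} per working day"
        )
```

With the team size held constant, the per-engineer load grows linearly with the fleet: 1, then 10, then 100 upgrades per quarter. Under the assumed 250 working days, 10,000 routers works out to more than one upgrade per engineer per working day, before counting any rollbacks.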
Finding a new hope
Indeed, the idea of upgrading router software day-in, day-out with no reprieve didn’t sound like a job we could hire for in the long term. What we noticed about this particular task was that it looked:
Check, check, check. You may be wondering, “What does upgrading routers have to do with transforming a team?” Domain expertise in operating Google’s network is hard-earned; we wanted to transform our network engineers into SREs (rather than merely replace engineers with SREs), so we could retain them and their expertise in networks and systems. We approached this carefully and grew our confidence through a series of engineering wins, with this router upgrade challenge being our first.
Upgrading routers was a good candidate for a software engineering project, and fit well with the description of SRE from Google’s VP of Engineering Ben Treynor: “Fundamentally, it’s what happens when you ask a software engineer to design an operations function.”
But we were a team of network engineers, experts in the likes of Cisco and Juniper routers. What did we know about writing software? At the time, not a lot. Along with having network ops backgrounds, we didn’t think about our problems as though they were simply a software system waiting to be built.
We decided to take a risk: We were going to write software to solve our problem. As engineers who cared for the network, we were genuinely worried that we might run out of people to upgrade our routers, and that would have been a much greater risk to the business. After a few months, we got a prototype working, leaning on our partners in adjacent SRE and development teams for advice.
Our operations group’s senior leaders empowered us to take the project further, but not without careful exploration of the risks involved. Automation can perform work that’s laborious for humans in record time, but if it fails, it can cause record damage. Because we were one of the last domains inside Google to turn to an SRE approach, we were able to build on past experience in the machine world.
At first, it felt unnatural to lose the direct experience of connecting to a network device and have software do it, even more so for those network engineers who hadn’t worked on the software project. That is a feeling that anyone who replaces human operations with software is going to encounter. Eventually, our persistence paid off, and by publishing our designs and demonstrating the system’s safety features, we won the trust of the rest of the network ops group.
Twelve months later, having a network engineer upgrade a router manually became the exception rather than the rule. In fact, the system was so much more reliable that manual upgrades demanded some rationale. In a short period of time, the small team that built the upgrade system found itself at the forefront of solving a novel set of problems: scaling the system up and engineering for reliability.
Once we reached this point, we had proven that SRE principles could be applied in our domain. Essentially, there was nothing special about networks that made them unsuitable for SRE.
Soon after that, more engineers followed, until around 10% of the team was successfully building systems to automate toil. We built metrics to quantify the impact, and yet it was clear we still couldn’t keep abreast of the ever-mounting toil from our growth.
Embarking on a full-scale conversion
SRE execution was driving our ability to meet demand, so we decided that what we needed was more of it—a lot more. What came next was the largest full-scale conversion of an operations team to SRE at Google. We recognized that the network engineer job role was not a good fit for the team or our business needs any longer, and so we set a deadline to transition all staff in the team to a more appropriate role—SRE. This was probably the most powerful signal that a real change in execution was afoot.
Over the course of 18 months, our team leaders made plans to split into four separate SRE teams, each responsible for a different part of Google’s network infrastructure. Rather than following the sun (each of the three sites handing issues to the next site as its working day ended), each team would be spread across two locations, with each location covering a 12-hour on-call shift.
There were trade-offs in switching to this model, and it took the team time to adjust. On the one hand, engineers in the team now needed to carry an off-duty pager, and their regular working group shrank. But on the other, collaboration increased and decision making became much easier because the working groups were smaller. Operational issues that were previously handed to a different person each day were now handed back and forth between a smaller set of on-callers, which improved ownership and the willingness to make forward progress. Meetings between the sites could now largely happen during time zone overlap when everyone was in their usual working hours, and this helped to build a single-team identity that just happened to stretch across two time zones.
We discovered some real skill gaps in the team. For starters, most staff had no software engineering experience (but they did know the network!), so we spun up internal programs to educate everyone, leaving room for self-study time and giving ourselves lots of room to ramp up, fail safely, and ask questions. We didn’t do that alone: A lot of help came from our peers in software development and SRE teams, who provided classes, exercises and hand-holding until we built a critical mass of talent and could teach internally. We recruited a handful of willing teachers who could guide the journey—experienced SREs and SWEs from the engineering pool.
We learned that conducting job interviews for our network engineers to transition to the site reliability engineering role was inefficient and slowed us down. Interviews are targeted at external hires and we already knew we wanted to keep our staff. We also didn’t need our staff to prepare for interviews—we wanted them to build systems to replace operations functions. After all, we still needed to do our day job. To compensate for this, we created a new process to submit work evidence that demonstrated the key competencies of a Google SRE. If the evidence stacked up, engineers were switched to SRE.
Router upgrades, and many other successful new systems, were born of this journey, and these engineering projects were what drove our success. It became a self-perpetuating culture cycle: build systems, lower toil, become an SRE, build more systems, make them reliable, and so on.
Getting started on your own SRE journey
If you’re reading this page, you’ll notice that Google’s network is still delivering packets; in fact, it grew by an order of magnitude. This transformation wasn’t at all straightforward. There were many logistical issues to solve, careful teasing apart of workloads, planning of new systems, training of staff, dealing with fear, uncertainty and doubt, and learning to grow in ways nobody quite imagined at the beginning of their careers. Ultimately, though, it was possible.
We got a lot of help along the way. If you’re starting an SRE function from scratch, this help may not be immediately available. Having existing SREs and SWEs who could help us with training and cultural transfer was an enormous win. Setting job titles aside, what really mattered was having talented engineers on the team who were determined to understand and adopt SRE principles and practices—and importantly, who could code.
Our thesis held water: Yes, it is possible to take a traditional engineering operations team and turn that team into a successful—might I even say wonderful—team of SREs.
If you’re contemplating a change like this in your own organization, here’s one final thought: Do you have people in your operational teams who can solve problems with code? If so, have you empowered them to try?
Thanks to Steve Carstensen, Adrian Hilton, Dave Rensin, John Truscott Reese, Jamie Wilkinson, David Parrish, Matt Brown, Gustavo Franco, David Ferguson, JB Feldman, Anton Tolchanov, Alec Warner, Jennifer Petoff, Shylaja Nukala, and Christine Cignoli, among others, for their insights and contributions to this blog.