谷歌中国开发者社区 (GDG)
  • 主页
  • 博客
    • Android
    • Design
    • GoogleCloud
    • GoogleMaps
    • GooglePlay
    • Web
  • 社区
    • 各地社区
    • 社区历史
    • GDG介绍
    • 社区通知
  • 视频
  • 资源
    • 资源汇总
    • 精选视频
    • 优酷频道

Transitioning a typical engineering ops team into an SRE powerhouse

2019-10-04adminGoogleCloudNo comments

Source: Transitioning a typical engineering ops team into an SRE powerhouse from Google Cloud

Perpetually adding engineers to ops teams to meet customer growth doesn’t scale. Google’s Site Reliability Engineering (SRE) principles can help, bringing software engineering solutions to operational problems. In this post, we’ll take a look at how we transformed our global network ops team by abandoning traditional network engineering orthodoxy and replacing it with SRE. Read on to learn how Google’s production networking team tackled this problem and consider how you might incorporate SRE principles in your own organization.

Scaling to the limit
In 2011, a talented and rapidly growing team of engineers supported Google’s production network: a constellation of technology, constantly growing, and constantly in need of attention. We were debugging, upgrading, upsizing, repairing, monitoring, and installing 24 hours a day, seven days a week. We were spread across three time zones, and we followed the sun.

In a 100-person team, communication was hard, and decision making was even harder. As a consequence, a tendency toward resisting change crept in. With resistance to change came difficulty in supporting Google’s agile development teams. Therefore, as a logical next step, we broke this large group into smaller teams, each with more focus. That was certainly necessary, and it helped us to go deeper into the technology and make better decisions, but this, too, had a time limit. The technology evolved on a weekly basis, and eventually the workload started to outstrip the available engineering resources. The constant demand for specialist expertise meant that it wasn’t possible to simply throw more people at the problem.

As an example, let’s say that Google’s network had 100 routers that carried its production traffic, and we wanted to upgrade each router, each quarter. Well, that’s roughly 33 routers divided between 33 people in each engineering site, or one per person. That’s a piece of cake: We all got to upgrade one router each quarter.

That doesn’t sound burdensome, but let’s say we found a bug in the latest release and we needed to roll back. Further, what would happen when we got to 1,000 routers? Each engineer now has to upgrade 10 routers every month. How about 10,000 routers? You’ve got to be kidding me. Upgrading router software every day for the whole year? It became clear that we would eventually be performing this work to the exclusion of other important work, and struggling to hire enough people (and train them!) to keep up with the demand.

Finding a new hope
Indeed, the idea of upgrading router software day-in, day-out with no reprieve didn’t sound like a job we could hire for in the long term. What we noticed about this particular task was that it looked:

  • Repetitive
  • Mundane
  • Automatable

Check, check, check—you may be wondering, “What does upgrading routers have to do with transforming a team?” Domain expertise in operating Google’s network is hard-earned; we wanted to transform our network engineers into SREs (rather than merely replace engineers with SREs), so we could retain them and their expertise in software and systems. We approached this carefully and grew our confidence through a series of engineering wins, with this router upgrade challenge being our first.

Upgrading routers was a good candidate for a software engineering project, and fit well with the description of SRE from Google’s VP of Engineering Ben Treynor: “Fundamentally, it’s what happens when you ask a software engineer to design an operations function.”

But we were a team of network engineers, experts in the likes of Cisco and Juniper routers. What did we know about writing software? At the time, not a lot. Along with having network ops backgrounds, we didn’t think about our problems as though they were simply a software system waiting to be built.

We decided to take a risk: We were going to write software to solve our problem. As engineers who cared for the network, we were genuinely worried that we might run out of people to upgrade our routers, and that would have been a much greater risk to the business. After a few months, we got a prototype working, leaning on our partners in adjacent SRE and development teams for advice.

Our operation group’s senior leaders empowered us to take the project further, but not without careful exploration of the risks involved. Automation can perform work that’s laborious for humans in record time, but if it fails, it can cause record damage. Being one of the last domains inside Google to turn to an SRE approach, we were able to build on past experience in the machine world.

At first, it felt unnatural to lose the direct experience of connecting to a network device and have software do it, even more so for those network engineers who hadn’t worked on the software project. That is a feeling that anyone who replaces human operations with software is going to encounter. Eventually, our persistence paid off, and by publishing our designs and demonstrating the system’s safety features, we won the trust of the rest of the network ops group.

Twelve months later, having a network engineer upgrade a router manually became the exception rather than the rule. In fact, the system was so much more reliable that manual upgrades demanded some rationale. In a short period of time, our small team that built the upgrade system found that they were at the forefront of solving a very new and novel set of problems: scaling the system up and engineering for reliability.

Once we reached this point, we had proven that SRE principles could be applied in our domain. Essentially, there was nothing special about networks that made them unsuitable for SRE.

Soon after that, more engineers followed, until around 10% of the team was successfully building systems to automate toil. We built metrics to quantify the impact, and yet it was clear we still couldn’t keep abreast of the ever-mounting toil from our growth.

Embarking on a full-scale conversion
SRE execution was driving our ability to meet demand, so we decided that what we needed was more of it—a lot more. What came next was the largest full-scale conversion of an operations team to SRE at Google. We recognized that the network engineer job role was not a good fit for the team or our business needs any longer, and so we set a deadline to transition all staff in the team to a more appropriate role—SRE. This was probably the most powerful signal that a real change in execution was afoot.

Over the course of 18 months, our team leaders made plans to split into four separate SRE teams, each responsible for a different part of Google’s network infrastructure. Instead of following the sun—meaning each of the three sites handing issues to the next site as the working day ended—each of the teams would instead be spread across two locations, each covering a 12-hour on-call shift.

There were trade-offs in switching to this model, and it took the team time to adjust. On the one hand, engineers in the team now needed to carry an off-duty pager, and their regular working group shrunk. But on the other, collaboration increased and decision making became much easier because the working groups were smaller. Operational issues that were previously handed to a different person each day were now handed back-and-forth between a smaller set of on-callers, which resulted in an improvement in ownership and willingness to make forward progress. Meetings between the sites could now largely happen during time zone overlap when everyone was in their usual working hours, and this helped to build a single-team identity that just happened to stretch across two time zones.

We discovered some real skill gaps in the team. For starters, most staff had no software engineering experience (but they did know the network!), so we spun up internal programs to educate everyone, leaving room for self-study time and giving ourselves lots of room to ramp up, fail safely, and ask questions. We didn’t do that alone: A lot of help came from our peers in software development and SRE teams, who provided classes, exercises and hand-holding until we built a critical mass of talent and could teach internally. We recruited a handful of willing teachers who could guide the journey—experienced SREs and SWEs from the engineering pool.

We learned that conducting job interviews for our network engineers to transition to the site reliability engineering role was inefficient and slowed us down. Interviews are targeted at external hires and we already knew we wanted to keep our staff. We also didn’t need our staff to prepare for interviews—we wanted them to build systems to replace operations functions. After all, we still needed to do our day job. To compensate for this, we created a new process to submit work evidence that demonstrated the key competencies of a Google SRE. If the evidence stacked up, engineers were switched to SRE.

Router upgrades, and many other successful new systems, were born of this journey, and these engineering projects were what drove our success. It became a self-perpetuating culture cycle: build systems, lower toil, become an SRE, build more systems, make them reliable, and so on.

Getting started on your own SRE journey
If you’re reading this page, you’ll notice that Google’s network is still delivering packets; in fact, it grew by an order of magnitude. This transformation wasn’t at all straightforward. There were many logistical issues to solve, careful teasing apart of workloads, planning of new systems, training of staff, dealing with fear, uncertainty and doubt, and learning to grow in ways nobody quite imagined at the beginning of their careers. Ultimately, though, it was possible.

We got a lot of help along the way. If you’re starting an SRE function from scratch, this help may not be immediately available. Having existing SREs and SWEs who could help us with training and cultural transfer was an enormous win. Setting job titles aside, what really mattered was having talented engineers on the team who were determined to understand and adopt SRE principles and practices—and importantly, who could code.

Our thesis held water: Yes, it is possible to take a traditional engineering operations team and turn that team into a successful—might I even say wonderful—team of SREs.

If you’re contemplating a change like this in your own organization, here’s one final thought: Do you have people in your operational teams who can solve problems with code? If so, have you empowered them to try?

Learn more about SRE fundamentals.


Thanks to Steve Carstensen‎, Adrian Hilton, Dave Rensin, John Truscott Reese, Jamie Wilkinson, David Parrish, Matt Brown, Gustavo Franco, David Ferguson, JB Feldman, Anton Tolchanov, Alec Warner, Jennifer Petoff, Shylaja Nukala, and Christine Cignoli, among others, for their insights and contributions to this blog.

除非特别声明,此文章内容采用知识共享署名 3.0许可,代码示例采用Apache 2.0许可。更多细节请查看我们的服务条款。

Tags: Cloud

Related Articles

How the energy industry is using the cloud

2018-12-18admin

Google Cloud networking in depth: Cloud CDN

2019-06-07admin

How to transfer BigQuery tables between locations with Cloud Composer

2018-10-04admin

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">

Recent Posts

  • Admin Essentials: know your options for Modern Enterprise Browser Management
  • TheVentureCity and Google Consolidate Miami as a Tech Powerhouse
  • Keep a better eye on your Google Cloud environment
  • Using HLL++ to speed up count-distinct in massive datasets
  • Season of Docs Announces Results of 2019 Program

Recent Comments

  • admin on Using advanced Kubernetes autoscaling with Vertical Pod Autoscaler and Node Auto Provisioning
  • Martijn on Using advanced Kubernetes autoscaling with Vertical Pod Autoscaler and Node Auto Provisioning
  • Martijn on Using advanced Kubernetes autoscaling with Vertical Pod Autoscaler and Node Auto Provisioning
  • Chen Zhixiang on Concurrent marking in V8
  • admin on 使用 Android Jetpack 加快应用开发速度

Archives

  • December 2019
  • November 2019
  • October 2019
  • September 2019
  • August 2019
  • July 2019
  • June 2019
  • May 2019
  • April 2019
  • March 2019
  • February 2019
  • January 2019
  • December 2018
  • November 2018
  • October 2018
  • September 2018
  • August 2018
  • July 2018
  • June 2018
  • May 2018
  • April 2018
  • March 2018
  • February 2018
  • January 2018
  • December 2017
  • November 2017
  • October 2017
  • September 2017
  • August 2017
  • July 2017
  • June 2017
  • May 2017
  • April 2017
  • March 2017
  • February 2017
  • January 2017
  • December 2016
  • November 2016
  • October 2016
  • September 2016
  • August 2016
  • May 2016
  • April 2016
  • March 2016
  • February 2016
  • January 2016
  • December 2015
  • November 2015
  • October 2015
  • September 2015
  • August 2015
  • July 2015
  • June 2015
  • January 1970

Categories

  • Android
  • Design
  • Firebase
  • GoogleCloud
  • GoogleDevFeeds
  • GoogleMaps
  • GooglePlay
  • Google动态
  • iOS
  • Uncategorized
  • VR
  • Web
  • WebMaster
  • 社区
  • 通知

Meta

  • Log in
  • Entries RSS
  • Comments RSS
  • WordPress.org

最新文章

  • Admin Essentials: know your options for Modern Enterprise Browser Management
  • TheVentureCity and Google Consolidate Miami as a Tech Powerhouse
  • Keep a better eye on your Google Cloud environment
  • Using HLL++ to speed up count-distinct in massive datasets
  • Season of Docs Announces Results of 2019 Program
  • Admin Insider: What's new in Chrome Enterprise, Release 79
  • Discover insights from text with AutoML Natural Language, now generally available
  • Introducing Storage Transfer Service for on-premises data
  • How Mynd uses G Suite to manage a flurry of acquisitions
  • W3C Trace Context Specification: What it Means for You

最多查看

  • 如何选择 compileSdkVersion, minSdkVersion 和 targetSdkVersion (25,371)
  • Google 推出的 31 套在线课程 (22,456)
  • 谷歌招聘软件工程师 (22,336)
  • Seti UI 主题: 让你编辑器焕然一新 (13,823)
  • Android Studio 2.0 稳定版 (9,420)
  • Android N 最初预览版:开发者 API 和工具 (8,036)
  • 像 Sublime Text 一样使用 Chrome DevTools (6,323)
  • 用 Google Cloud 打造你的私有免费 Git 仓库 (6,076)
  • Google I/O 2016: Android 演讲视频汇总 (5,608)
  • 面向普通开发者的机器学习应用方案 (5,539)
  • 生还是死?Android 进程优先级详解 (5,228)
  • 面向 Web 开发者的 Sublime Text 插件 (4,341)
  • 适配 Android N 多窗口特性的 5 个要诀 (4,311)
  • 参加 Google I/O Extended,观看 I/O 直播,线下聚会! (3,620)
© 2019 中国谷歌开发者社区 - ChinaGDG