Google China Developer Community (GDG)

How SREs find the landmines in a service – CRE life lessons

2017-06-30 | admin | GoogleCloud

By Adrian Hilton, Customer Reliability Engineer

In Part 1 of this blog post we looked at why an SRE team would or wouldn’t choose to onboard a new application. In this installment, we assume that the service would benefit substantially from SRE support, and look at what needs to be done for SREs to onboard it with confidence.

Onboarding review

Q: We have a new application that would make sense for SRE to support. Do I just throw it over the wall and tell the SRE team “Here you are; you’re on call for this now, best of luck”?

That’s a great approach, if your goal is failure. At first, your developer team’s assessment of how important the application is, and of whether it’s in a supportable state, is likely to differ considerably from your SRE team’s assessment, and arbitrarily imposing support for a service onto an SRE team is unlikely to work. Think about it: you haven’t yet convinced them that the service is a good use of their time, and people rarely embrace work that they don’t really believe in, so they’re unlikely to be active participants in making the service materially more reliable.

At Google, we’ve found that to successfully onboard a service into SRE, the service owner and SRE team must agree to a process for the SRE team to understand and assess the service, and identify critical issues to be resolved up front. (Incidentally, we follow a similar process when deciding whether or not to onboard a Google Cloud customer’s application into our Customer Reliability Engineering program.) We typically split this into two phases:

  • SRE entrance review: where an SRE team assesses whether a developer-supported service should be onboarded by SRE, and what the onboarding preconditions should be.
  • SRE onboarding/takeover: where a dev and SRE team agree in principle that the SRE team should take on primary operational responsibility for a service, and start negotiating the exact conditions for takeover (how and when the SREs will onboard the service).

It’s important to remember the motivations of the various parties in this process:

  • Developers want someone else to pick up support for the service and make it run as well as possible. They want users to feel that the service is working properly; otherwise those users will move to a service run by someone else.
  • The SRE team wants to be sure that it’s not being “sold a pup” with a hard-to-support service, and wants a vision for making the production service lower in toil and more robust.
  • Meanwhile the company management wants to reduce the number of embarrassing service outages, as long as it doesn’t cost them too much in engineer time.

The SRE entrance review

During an SRE entrance review (SER), also referred to as a Production Readiness Review (PRR), the SRE team takes the measure of a service currently running in production. The purpose of an SER is to:

  1. Assess how the service would benefit from SRE ownership
  2. Identify service design, implementation and operational deficiencies that could be a barrier to SRE takeover
  3. If SRE ownership is determined to be beneficial, identify the bug fixes, process changes and service behavior changes needed before onboarding the service

An SRE team typically designates a single person or a small subset of the team to familiarize themselves with the service, and evaluate it for fitness for takeover.

The SRE looks at the service as-is: its performance, monitoring, associated operational processes and recent outage history, and asks themselves: “If I were on-call for this service right now, what are the problems I’d want to fix?” They might be visible problems, such as too many pages happening per day, or potential problems such as a dependency on a single machine that will inevitably fail some day.

A critical part of any SRE analysis is the service’s Service Level Objectives (SLOs), and associated Service Level Indicators (SLIs). SREs assume that if a service is meeting its SLOs then paging alerts should be rare or non-existent; conversely, if the service is in danger of falling out of SLO then paging alerts should be loud and actionable. If these expectations don’t match reality, the SRE team will focus on changing either the SLO definitions or the SLO measurements.
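
To make the relationship between an SLI and its SLO concrete, here is a minimal sketch that computes an availability SLI from request counts and reports how much of the error budget has been used. The function and class names, the 99.9% target and the request counts are all assumptions chosen for illustration; they are not taken from Google's tooling.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good requests
    slo: float              # target fraction of good requests
    budget_consumed: float  # share of the error budget used (1.0 = all of it)

    @property
    def meeting_slo(self) -> bool:
        return self.sli >= self.slo


def availability_report(good: int, total: int, slo: float) -> SloReport:
    """Compute an availability SLI and the share of error budget consumed."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    sli = good / total
    allowed_bad = (1.0 - slo) * total  # failures the SLO tolerates
    actual_bad = total - good
    return SloReport(sli=sli, slo=slo, budget_consumed=actual_bad / allowed_bad)


# Hypothetical month: 1,000,000 requests, 400 failures, 99.9% target.
report = availability_report(good=999_600, total=1_000_000, slo=0.999)
print(f"SLI={report.sli:.4%} meeting_slo={report.meeting_slo} "
      f"budget_consumed={report.budget_consumed:.0%}")
```

Under these assumptions, a service that is comfortably inside its budget but still paging frequently would suggest that either the alerts or the SLO definition need attention.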

In the review phase, SREs aim to understand:

  • what the service does
  • day-to-day service operation (traffic variation, releases, experiment management, config pushes)
  • how the service tends to break and how this manifests in alerts
  • rough edges in monitoring and alerting
  • where the service configuration diverges from the SRE team’s practices
  • major operational risks for the service

The SRE team also considers:

  • whether the service follows SRE team best practices, and if not, how to retrofit it
  • how to integrate the service with the SRE team’s existing tools and processes
  • the desired engagement model and separation of responsibilities between the SRE team and the SWE team. When debugging a critical production problem, at what point should the SRE on-call page the developer on-call?

The SRE takeover

The SRE entrance review typically produces a prioritized list of issues with the service that need to be fixed. Most will be assigned to the development team, but the SRE team may be better suited for others. In addition, not all issues are blockers to SRE takeover (there might be design or architectural changes that SREs recommend for service robustness that could take many months to implement).

There are four main axes of improvement for a service in an onboarding process: extant bugs, reliability, automation and monitoring/alerting. On each axis there will be issues which will have to be solved before takeover (“blockers”), and others which would be beneficial to solve but not critical.
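
One way to picture the output of the review is as a list of issues, each tagged with its axis and with whether it blocks takeover. The sketch below is only an illustration of that idea; the `Issue` type, the `open_blockers` helper and the example issues are hypothetical and do not describe any real tracking system.

```python
from dataclasses import dataclass
from enum import Enum


class Axis(Enum):
    EXTANT_BUGS = "extant bugs"
    RELIABILITY = "reliability"
    AUTOMATION = "automation"
    MONITORING = "monitoring/alerting"


@dataclass
class Issue:
    title: str
    axis: Axis
    blocker: bool      # must be fixed before takeover
    done: bool = False


def open_blockers(issues: list[Issue]) -> dict[Axis, list[str]]:
    """Group unresolved blocking issues by axis of improvement."""
    result: dict[Axis, list[str]] = {}
    for issue in issues:
        if issue.blocker and not issue.done:
            result.setdefault(issue.axis, []).append(issue.title)
    return result


# Invented example issues, one per axis.
issues = [
    Issue("Finish action items from last postmortem", Axis.EXTANT_BUGS, blocker=True),
    Issue("Remove dependency on a single machine", Axis.RELIABILITY, blocker=True, done=True),
    Issue("Automate configuration pushes", Axis.AUTOMATION, blocker=False),
    Issue("Stop paging on task restarts", Axis.MONITORING, blocker=True),
]

remaining = open_blockers(issues)
ready_for_takeover = not remaining
for axis, titles in remaining.items():
    print(f"{axis.value}: {titles}")
```

Non-blocking issues stay on the list but do not affect `ready_for_takeover`, which mirrors the distinction between blockers and beneficial-but-not-critical fixes.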

Extant bugs

The primary source of issues blocking SRE takeover tends to be action items from the service’s previous postmortems. The SRE team expects to read recent postmortems and verify that a) the proposed actions to resolve the outage root causes are what they’d expect and b) those actions are actually complete. Further, the absence of recent postmortems is a red flag for many SRE teams.

Reliability

Some reliability-related change requests might not directly block SRE takeover. Many reliability improvements involve design changes, significant code changes, a change in back-end integrations or migration off a deprecated infrastructure component, and target the longer-term evolution of the system towards a desired reliability increase.

The reliability-related changes that block takeover would be those which mitigate or remove issues which are known to cause significant downtime, or mitigate risks which are expected to cause an outage in the future.

Automation

This is a key concern for SREs considering takeover of a service: how much manual work needs to be done to “operate” the service on a week-to-week basis, including configuration pushes, binary releases and similar time-sinks.

In order to find out what would be most useful to automate, the best way is for the SRE to get practical experience of the developer’s world. This means that the SREs should shadow the developer team’s typical week and get a feel for what routine manual work is actually involved for their on-call.

If there’s excessive manual work involved in supporting a service, automation usually solves the problem.
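
As a small sketch of how observations from shadowing might be turned into automation priorities, the example below totals the time spent on each manual task and ranks the tasks by cost. The log format, task names and durations are invented for illustration.

```python
from collections import Counter

# Hypothetical log from shadowing the developer on-call for a week:
# (task, minutes spent). All entries are invented for illustration.
toil_log = [
    ("config push", 45),
    ("binary release", 120),
    ("config push", 30),
    ("restart stuck task", 15),
    ("binary release", 90),
    ("config push", 40),
]


def rank_toil(log):
    """Total the minutes per task and return (task, minutes), largest first."""
    totals = Counter()
    for task, minutes in log:
        totals[task] += minutes
    return totals.most_common()


for task, minutes in rank_toil(toil_log):
    print(f"{task}: {minutes} min/week")
```

The tasks at the top of the ranking are the likeliest candidates to automate first, although frequency and error-proneness may matter as much as raw time.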

Monitoring/alerting

The dominant concern with most services undergoing SRE takeover is the paging rate: how many times the service wakes up the on-call staff. At Google, we adhere to the “Treynor Maximum” of an average of two incidents per 12-hour shift (for an on-call team as a whole). Thus, an SRE team looks at the average incident load of a new service over the past month or so to see how it fits with their current incident load.
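
The arithmetic behind that comparison can be sketched as follows. The limit of two incidents per 12-hour shift comes from the text; the function names and the incident counts are assumptions for the example.

```python
TREYNOR_MAX_PER_SHIFT = 2.0  # from the text: average of two incidents per 12-hour shift
SHIFT_HOURS = 12


def incidents_per_shift(incident_count: int, days: int) -> float:
    """Average incidents per 12-hour shift over a period of whole days."""
    if days <= 0:
        raise ValueError("days must be positive")
    shifts = days * 24 / SHIFT_HOURS
    return incident_count / shifts


def fits_incident_budget(current: float, new_service: float,
                         limit: float = TREYNOR_MAX_PER_SHIFT) -> bool:
    """True if the team's existing load plus the new service stays within the limit."""
    return current + new_service <= limit


# Hypothetical numbers: the team already handles 66 incidents in 30 days,
# and the candidate service produced 45 incidents over the same period.
team_load = incidents_per_shift(66, days=30)     # 66 incidents / 60 shifts
service_load = incidents_per_shift(45, days=30)  # 45 incidents / 60 shifts
print(f"team={team_load:.2f} service={service_load:.2f} "
      f"fits={fits_incident_budget(team_load, service_load)}")
```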

Generally, excessive paging rates are the result of one of three things:

  1. Paging on something that’s not intrinsically important e.g., task restart or hitting 80% capacity of disk. Instead, downgrade the page to a bug (if it’s not urgent) or eliminate it entirely. Moving to symptom-based monitoring (“users are actually seeing problems”) can help improve this situation.
  2. Page storms where one small incident/outage generates many pages. Try to group related pages for an incident into a single outage, to get a clearer picture of the system’s outage metrics.
  3. A system that’s having too many genuine problems. In this case SRE takeover in the near future is unlikely, but SREs may be able to help diagnose and resolve the root causes of the problems.
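
For the page-storm case, a simple way to group related pages is by time: any page that fires shortly after the previous one is counted as part of the same incident. The sketch below assumes a 30-minute window and uses invented timestamps; real systems would more likely group on alert labels or affected components as well.

```python
from datetime import datetime, timedelta


def group_pages(page_times: list[datetime],
                window: timedelta = timedelta(minutes=30)) -> list[list[datetime]]:
    """Group pages so that a page within `window` of the previous page
    joins the same incident."""
    incidents: list[list[datetime]] = []
    for t in sorted(page_times):
        if incidents and t - incidents[-1][-1] <= window:
            incidents[-1].append(t)
        else:
            incidents.append([t])
    return incidents


# Invented example: three pages in quick succession, then one much later.
pages = [
    datetime(2017, 6, 1, 3, 0),
    datetime(2017, 6, 1, 3, 5),
    datetime(2017, 6, 1, 3, 20),
    datetime(2017, 6, 1, 14, 0),
]
incidents = group_pages(pages)
print(f"{len(pages)} pages -> {len(incidents)} incidents")
```

Counting incidents rather than raw pages gives a fairer picture when comparing a service's load with the team's incident budget.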

SREs generally want to see several weeks of low paging levels before agreeing to take over a service.

More general ways to improve the service might include:

  • integrating the service with standard SRE tools and practices e.g., load shedding, release processes and configuration pushes
  • extending and improving playbook entries to rely less on the developer team’s tribal knowledge
  • aligning the service’s configurations with the SRE team’s common languages and infrastructure

Ultimately, an SRE entrance review should produce guidance that’s useful to developers even if the SRE team declines to onboard the service. In that event, the guidance from the review should still help developers make their service easier to operate and more reliable.

Smoothing the path

SREs need to understand the developers’ service, but SREs and developers also need to understand each other. If the developer team has not worked with SREs before, it can be useful for SREs to give “lightning” talks to the developers on SRE topics such as monitoring, canarying, rollouts and data integrity. This gives the developers a better idea of why the SREs are asking particular questions and pushing particular concerns.

One of Google’s SREs found that it was useful to “pretend that I am a dev team novice, and have the developer take me through the codebase, explain the history, show me where the main() function is, and so on.”

Similarly, SREs should understand the developers’ point of view and experience. During the SER, at least one SRE should sit with the developers, attend their weekly meetings and stand-ups, informally shadow their on-call and help out with day-to-day work to get a “big picture” view of the service and how it runs. It also helps remove distance between the two teams. Our experience has been that this is so positive in improving the developer-SRE relationship that the practice tends to continue even after the SER has finished.

Last but not least, the SRE entrance review document should also state clearly whether the service merits SRE takeover, and if so, why (or why not).

At this point, the developer team and SRE team both understand what needs to be done to make a service suitable for SRE takeover, if it is indeed feasible at all. In Part 3 of this blog post, we’ll look at how to proceed with a service takeover so that both teams can benefit from the process.



Source: How SREs find the landmines in a service – CRE life lessons

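Unless otherwise noted, the content of this article is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For more details, see our Terms of Service.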
