By Adrian Hilton and Gwendolyn Stockman, Customer Reliability Engineers, and Dave Rensin, Director of Customer Reliability Engineering
In our previous post we discussed the benefits of sharing internal postmortems outside your company. You may adopt a one-to-many approach, with an incident summary that tells all your customers what happened and how you’ll prevent it from happening again. Or, if the incident impacted a major customer, you may share something close to your original postmortem with them.
In this post, we consider how to review a postmortem with your affected customer(s), both to gather more actionable data and to help customers improve their systems and practices. We also present a worked example of a shared postmortem based on the SRE Book postmortem template.
How to get outages to benefit everyone
Even if the fault was 100% on you, the platform side, an external postmortem can still help customers improve their reliability. Now that we know what happens when a particular failure occurs, how can we generalize this to help the customer mitigate the impact, and reduce MTTD and MTTR for a similar incident in the future?
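To make "reduce MTTD and MTTR" measurable, both sides need to agree on how the numbers are derived from incident timelines. The following is a minimal sketch, not a description of any Google tooling: the `Incident` record and its field names are hypothetical, and it treats MTTR as the mean time from the start of impact to resolution, which is one common reading (some teams measure from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime    # when user-visible impact began
    detected: datetime   # when an alert or a person noticed the problem
    resolved: datetime   # when user-visible impact ended


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    return timedelta(seconds=mean(
        (i.detected - i.started).total_seconds() for i in incidents))


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve, measured here from the start of impact (an assumption)."""
    return timedelta(seconds=mean(
        (i.resolved - i.started).total_seconds() for i in incidents))
```

Computing these from the customer's timeline as well as the platform's can show where the two diverge, for example when the customer noticed the impact well before the platform's alerts fired.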
One of the best sources of data for any postmortem is your customers’ SLOs, with their ability to measure the impact of a platform outage. Our CRE team talks about SLOs quite a lot in the CRE Life Lessons series, and there’s a reason why: SLOs and error budgets inform more than just whether to release features in your software.
For customers with defined SLOs who suffered a significant error budget impact, we recommend conducting a postmortem review with them. The review is partly to ensure that the customer’s concerns were addressed, but also to identify “what went wrong” and “where we got lucky,” and to agree on actions that would address these for the customer.
For example, suppose the platform’s storage service suffered increased latency for a certain class of objects in a region. This is not the customer’s fault, but they may still be able to do something about it.
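One way to judge whether an incident had a "significant error budget impact" is to express the bad minutes as a fraction of the budget for the SLO window. This is a rough sketch under stated assumptions: a time-based availability SLO, a 30-day window by default, and hypothetical function names. What counts as "significant" is left to you and the customer to decide.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_consumed(slo_target: float, bad_minutes: float,
                    window_days: int = 30) -> float:
    """Fraction of the window's error budget used up by one incident."""
    return bad_minutes / error_budget_minutes(slo_target, window_days)
```

With a 99.9% target over 30 days the budget is about 43.2 minutes, so a 20-minute outage would consume a little under half of it.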
The internal postmortem might read something like:
What went well
What went wrong
Where we got lucky
When a customer writes their own postmortem about this incident, using the shared postmortem to understand better what broke in the platform and when, that postmortem might look like:
What went well
What went wrong
Where we got lucky
A postmortem isn’t complete until the root causes have been fixed
Sharing the current status of your postmortem action items is tricky. It’s unlikely that the customer will be using the same issue tracking system as you are, so neither side will have a “live” view of which action items from a postmortem have been resolved, and which are still open. Within Google we have automation which tracks this and “reminds” us of unclosed critical actions from postmortems, but customers can’t see those unless we surface them in the externally-visible part of our issue tracking system, which is not our normal practice.
Currently, we hold a monthly SLO review with each customer, where we list the major incidents along with the postmortem or incident report for each one. We use that occasion to report on the status of open critical bugs from previous months’ incidents, and to check how the customer is doing on their own action items.
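When neither side can see the other's issue tracker, even a simple exported list of action items can stand in for a "live" view at the monthly review. The sketch below is illustrative only and is not the automation described above: the `ActionItem` fields, the priority labels, and the output format are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical action item exported from either side's tracker."""
    postmortem: str       # identifier of the postmortem it came from
    summary: str
    owner: str            # e.g. "platform" or "customer"
    priority: str         # e.g. "critical" or "normal"
    due: date
    closed: bool = False


def open_critical(items: list[ActionItem]) -> list[ActionItem]:
    """Unclosed critical items, earliest due date first."""
    pending = [i for i in items if not i.closed and i.priority == "critical"]
    return sorted(pending, key=lambda i: i.due)


def review_lines(items: list[ActionItem], today: date) -> list[str]:
    """One line per open critical item, flagging those past their due date."""
    lines = []
    for item in open_critical(items):
        status = "OVERDUE" if item.due < today else "open"
        lines.append(
            f"[{status}] {item.postmortem}: {item.summary} "
            f"(owner: {item.owner}, due {item.due.isoformat()})")
    return lines
```

Because each item carries an owner, the same list covers both the platform's open bugs and the customer's own actions.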
Opening up is an opportunity
There are practical reliability benefits of sharing postmortems, but there are other benefits too. Customers who are evolving towards an SRE culture and adopting blameless postmortems can use the external postmortem as a model for their own internal write-ups. We’re the first to admit that it’s really hard to write your first postmortem from scratch; having a collection of “known-good” postmortems as a reference can be very helpful.
At a higher level, shared postmortems give your customer a “glimpse behind the curtain.” When a customer moves from on-premises hardware to the cloud, it can be frightening; they’re giving up a lot of control of and visibility into the platform on which their service runs. The cloud is expected to encapsulate the operational details of the services it offers, but unfortunately it can be guilty of hiding information that the customer really wants to see. A detailed external postmortem makes that information visible, giving the customer a timeline and deeper detail, which hopefully they can relate to.
If you want joint operations, you need joint postmortems
The final step in the path to shared postmortems is creating a joint postmortem. Until this point, we’ve discussed how to externalize an existing document, where the action items, for example, are written by you and assigned to you. With some customers, however, it makes sense to do a joint postmortem where you both contribute to all sections of the document. It will reflect not only your thoughts from the event but also the customer’s thoughts and reactions. It will even include action items that you assign to your customer, and vice versa!
Of course, you can’t do joint postmortems with large numbers of your customers, but doing so with at least a few of them helps you (a) build shared SRE culture, and (b) keep the customer perspective in your debugging, design and planning work.
Joint postmortems are also one of the most effective tools you have to persuade your product teams to re-prioritize items on their roadmap, because they present a clear end-user story of how those items can prevent or mitigate future outages.
Sharing your postmortems with your customers is not an easy thing to do; however, we have found that it helps:
To get you started, here’s an example of an external postmortem for the aforementioned storage frontend outage, using the SRE Book postmortem template. (Note: Text relating to the customer (“JaneCorp”) is marked in purple for clarity.) We hope it sets you on the path to learning and growing from your outages. Happy shared postmortem writing!