By Adrian Hilton, Customer Reliability Engineer
Editor’s note: One of the most common causes of service outages is releasing a new version of the service binaries; no matter how good your testing and QA might be, some bugs only surface when the affected code is running in production. Over the years, Google Site Reliability Engineering has seen many outages caused by releases, and now assumes that every new release may contain one or more bugs.
As software engineers, we all like to add new features to our services; but every release comes with the risk of something breaking. Even assuming that we are appropriately diligent in adding unit and functional tests to cover our changes, and undertaking load testing to determine if there are any material effects on system performance, live traffic has a way of surprising us. These are rarely pleasant surprises.
The release of a new binary is a common source of outages. From the point of view of the engineers responsible for the system’s reliability, that translates to three basic tasks:
For the purpose of this analysis, we’ll assume that you are running many instances of your service on machines or VMs behind a load balancer such as nginx, and that upgrading your service to use a new binary will involve stopping and starting each service instance.
We’ll also assume that you monitor your system with something like Stackdriver, measuring internal traffic and error rates. If you don’t have this kind of monitoring in place, then it’s difficult to meaningfully discuss reliability; per the Hierarchy of Reliability described in the SRE Book, monitoring is the most fundamental requirement for a reliable system).
The best case for a bad release is that when a service instance is restarted with the bad release, a major fraction of improperly handled requests generate errors such as HTTP 502, or much higher response latencies than normal. In this case, your overall service error rate rises quickly as the rollout progresses through your service instances, and you realize that your release has a problem.
A more subtle case is when the new binary returns errors on a relatively small fraction of queries – say, a user setting change request, or only for users whose name contains an apostrophe for good or bad reasons. With this failure mode, the problem may only become manifest in your overall monitoring once the majority of your service instances are upgraded. For this reason, it can be useful to have error and latency summaries for your service instance broken down by binary release version.
Before you plan to roll out a new binary or image to your service, you should ask yourself, “What will I do if I discover a catastrophic / debilitating / annoying bug in this release?” Not because it might happen, but because sooner or later it is going to happen and it is better to have a well-thought out plan in place instead of trying to make one up when your service is on fire.
The temptation for many bugs, particularly if they are not show-stoppers, is to build a quick patch and then “roll forward,” i.e., make a new release that consists of the original release plus the minimal code change necessary to fix the bug (a “cherry-pick” of the fix). We don’t generally recommend this though, especially if the bug in question is user-visible or causing significant problems internally (e.g., doubling the resource cost of queries).
What’s wrong with rolling forward? Put yourself in the shoes of the software developer: your manager is bouncing up and down next to your desk, blood pressure visibly climbing, demanding to know when your fix is going to be released because she has your company’s product director bending her ear about all the negative user feedback he’s getting. You’re coding the fix as fast as humanly possible, because for every minute it’s down another thousand users will see errors in the service. Under this kind of pressure, coding, testing or deployment mistakes are almost inevitable.
We have seen this at Google any number of times, where a hastily deployed roll-forward fix either fails to fix the original problem, or indeed makes things worse. Even if it fixes the problem it may then uncover other latent bugs in the system; you’re taking yourself further from a known-good state, into the wilds of a release that hasn’t been subject to the regular strenuous QA testing.
At Google, our philosophy is that “rollbacks are normal.” When an error is found or reasonably suspected in a new release, the releasing team rolls back first and investigates the problem second. A request for a rollback is not interpreted as an attack on the releasing team, or even the person who wrote the code containing the bug; rather, it is understood as The Right Thing To Do to make the system as reliable as possible for the user. No-one will ask “why did you roll back this change?” as long as the rollback changelist describes the problem that was seen.
Thus, for rollbacks to work, the implicit assumption is that they are:
How do we make the latter true?
If you haven’t rolled back in a few weeks, you should do a rollback “just because”; aim to find any traps with incompatible versions, broken automation/testing etc. If the rollback works, just roll forward again once you’ve checked out all your logs and monitoring. If it breaks, roll forward to remove the breakage and then focus all your efforts on diagnosing the cause of the rollback breakage. It is better by far to detect this when your new release is working well, rather than being forced off a release that is on fire and having to fight to get back to your known-good original release.
Inevitably, there are going to be times when a rollback is not straightforward. One example is when the new release requires a schema change to an in-app database (such as a new column). The danger is that you release the new binary, upgrade the database schema, and then find a problem with the binary that necessitates rollback. This leaves you with a binary that doesn’t expect the new schema, and hasn’t been tested with it.
The approach we recommend here is a feature-free release; starting from version v of your binary, build a new version v+1 which is identical to v except that it can safely handle the new database schema. The new features that make use of the new schema are in version v+2. Your rollout plan is now:
Now, if there are any problems with either of the new binaries then you can roll back to a previous version without having to also roll back the schema.
This is a special case of a more general problem. When you build the dependency graph of your service and identify all its direct dependencies, you need to plan for the situation where any one of your dependencies is suddenly rolled back by its owners. If your launch is waiting for a dependency service S to move from release r to r+1, you have to be sure that S is going to “stick” at r+1. One approach here is to make an ecosystem assumption that any service could be rolled back by one version, in which case your service would wait for S to reach version r+2 before your service moved to a version depending on a feature in r+1.
We’ve learned that there’s no good rollout unless you have a corresponding rollback ready to do, but how can we know when to rollback without having our entire service burned to the ground by a bad release?
In part 2 we’ll look at the strategy of “canarying” to detect real production problems without risking the bulk of your production traffic on a new release.