By Adrian Hilton, Customer Reliability Engineer
The first part of any reliable software release is being able to roll back if something goes wrong; we discussed how we do this at Google in last week’s post, Reliable releases and rollbacks. Once you have that under your belt, you’ll want to understand how to detect that things are starting to go wrong in the first place, with canarying.
|Photo taken by David Carroll|
The concept of canarying first emerged in 1913 when physiologist John Scott Haldane took the caged bird down into a coal mine, to detect for carbon monoxide. This fragile bird is more susceptible to the odorless gas than humans, and quickly falls off its perch in its presence — signaling to the miners that it’s time to get out!
In software, a canary process is usually the first instance that receives live production traffic about a new configuration update, either a binary or configuration rollout. The new release only goes to the canary at first. The fact that the canary handles real user traffic is key: if it breaks, real users get affected, so canarying should be the first step in your deployment process, as opposed to the last step in testing.
The first step in implementing canarying is a manual process where release engineers trigger the new binary release to the canary instance(s). They then monitor the canary for any signs of increased errors, latency and load. If everything looks good, they then trigger a release to the rest of the production instances.
We here on Google’s SRE teams have found over time that manual inspection of monitoring graphs isn’t sufficiently reliable to detect performance problems or rises in error rates of a new release. When most releases work well, the release engineer gets used to seeing no problems and so, when a low-level problem appears, tends to implicitly rationalize the monitoring anomalies as “noise.” We have several internal postmortems on bad releases whose root cause boils down to “the canary graph wasn’t wiggly enough to make the release engineer concerned.”
We’ve moved towards automated analysis, where our canary rollout service measures the canary tasks to detect elevated errors, latency and load automatically — and roll back automatically. (Of course, this only works if rollbacks are safe!)
Likewise, if you implement canaries as part of your releases, take care to make it easy to see problems with a release. Consider very carefully how you implement fault tolerance in your canary tasks; it’s fine for the canary to do the best it can with a query, but if it starts to see errors either internally or from its dependency services then it should “squawk loudly” by manifesting those problems in your monitoring. (There’s a good reason why the Welsh miners didn’t breed canaries to be resistant to toxic gases, or put little gas masks on them.)
If you’re doing releases of client software, you should have a mechanism for canarying new versions of the client, and you’ll need to answer the following questions:
A solution for question 2 is for clients to identify themselves to your backend service — ideally, by including information in each request about the client’s operating system and application version ID — and for the server to log this information. If you can make the clients identify themselves specifically as canaries, so much the better; this lets you export their stats to a different set of monitoring metrics. To detect that clients are failing to send queries, you’ll generally need to know what the lowest plausible amount of incoming traffic is at any given time of the day or week, and trigger an alert if inbound traffic drops below that amount.
Typically, alerting rules for canaries for high-availability systems use a longer evaluation duration (how long you listen to the monitoring signals before deciding you have a problem) than for the main system because the much smaller traffic amount makes the standard signal much noisier; a relatively innocuous problem such as a few service instances being restarted can briefly push the canary error rate above the regular alarm threshold.
Your release should normally aim to cover a wide range of user types but a small fraction of active users. For Android clients, the Google Play Store allows you to deploy a new version of your application package file (APK) to an (essentially random) fraction of users; you can do this on a country-by-country basis. However, see the discussion on Android APK releases below for the limitations and risks in this approach.
If your end users access your service via desktop or mobile web rather than an application, you tend to have better control of what’s being executed.
New versions of an Android app can be rolled out to a % of current users using staged rollouts in the Play Store. This lets you try out a new release of an app on a small subset of your current users; once you have confidence in that release, you can roll it out to more users, and so on.
The % release mechanism marks a percent of users that are eligible to pick up the new release. When their mobile device next checks into the Play Store for updates, it will see an available update for the app and start the update process.
There can be problems with this approach though:
Following the canarying process described above, you can determine whether your new client release has a problem once your active user base of the canary grows enough for the characteristics of the new traffic become clear: Is there a higher error rate? Is the latency rising? Has traffic to your server mysteriously increased sharply?
If you have a known bad release of your app at version v, the most expedient fix (given the inability to roll back) might be to build your version v-1 code branch into release v+1 and release that, stepping up quickly to 100%. That removes the time pressure to fix the problems detected in code.
When you perform a gradual release of a new binary or app, you need to decide in what percentage increments to release your application, and when to trigger the next step in a release. Consider:
If you’re looking to roll out an Android app update to most of your users within a few days, you might choose to use a Play Store staged update starting with a 10% rollout that then increases to 50% and finally 100%. Plan for at least 24 hours between release stages and check your monitoring and logging before the next step. This way, a large fraction of your user base picks up the new release within 72 hours of the initial release, and it’s possible to detect most problems before they become too big to handle. For launches where you know there’s a risk of significant traffic increase to a service, choose to use steps of 10%, 25%, 50% and 100% — or even more fine-grained increases.
For internal binary releases where you update your service instances directly, you might instead choose to use steps of 1%, 10% then 100%. The 1% release lets you see if there’s any gross error in the new release, e.g., if 90% of responses are errors. The 10% release lets you pick up errors or latency increases that are one order of magnitude smaller, and detect any gross performance differences. The third step is normally a complete release. For performance-sensitive systems — generally, those operating at 75%+ of capacity — consider adding a 50% step to catch more subtle performance regressions. The higher the target reliability of a system, the longer you should let each step “bake” to detect problems.
If an ideal marketing launch sequence is 0-100 (everyone gets the new features at once), and the ideal reliability engineer launch sequence is 0-0 (no change means no problems), the “right” launch sequence for an app is inevitably a matter of negotiation. Hopefully the considerations described here give you a principled way to determine a mutually acceptable rollout. The graph below shows you how these various strategies might play out over an 8-day release window.
In short, we here at Google have developed a software release philosophy that works well for us, for a variety of scenarios:
At the end of the day, though, perhaps the best kind of launch is one where the features launched can be enabled independent of the binary rollout. That’s a blog post for another day.