By AJ Ross and Adrian Hilton, Customer Reliability Engineers, and Dave Rensin, Director of Customer Reliability Engineering
Last week on CRE life lessons, we discussed how to come up with a precise numerical target for system availability. We term this target the Service Level Objective (SLO) of our system. Any discussion we have in future about whether the system is running sufficiently reliably and what design or architectural changes we should make to it must be framed in terms of our system continuing to meet this SLO.
We also have a direct measurement of SLO conformance: the frequency of successful probes of our system. This is a Service Level Indicator (SLI). When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load balancing between the two.
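To make this concrete, here's a minimal sketch of turning probe results into an availability SLI and checking it against an SLO. The probe counts and the 99.9% target are illustrative assumptions, not values from the Shakespeare example.

```python
# Sketch: compute an availability SLI from probe outcomes and compare
# it to an SLO. The data and the 99.9% SLO are illustrative.

def availability(probe_results):
    """The SLI: fraction of probes that succeeded."""
    if not probe_results:
        return 0.0
    return sum(probe_results) / len(probe_results)

def within_slo(probe_results, slo=0.999):
    """True if the measured SLI meets or exceeds the SLO."""
    return availability(probe_results) >= slo

# 10,000 probes over the week, 5 of which failed.
week = [True] * 9995 + [False] * 5
print(availability(week))   # 0.9995
print(within_slo(week))     # True: 99.95% meets a 99.9% SLO
```

If `within_slo` returns False for the week, that's the signal to consider availability work such as the second load-balanced instance mentioned above.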
Suppose that we decide that running our aforementioned Shakespeare service against a formally defined SLO is too rigid for our tastes; we decide to throw the SLO out of the window and make the service “as available as is reasonable.” This makes things easier, no? You simply don’t mind if the system goes down for an hour now and then. Indeed, perhaps downtime is normal during a new release and the attendant stop-and-restart.
Unfortunately for you, customers don’t know that. All they see is that Shakespeare searches that were previously succeeding have suddenly started to return errors. They raise a high-priority ticket with support, who confirm that they see the error rate and escalate to you. Your on-call engineer investigates, confirms this is a known issue, and responds to the customer with “this happens now and again; you don’t have to escalate.” Without an SLO, your team has no principled way of saying what level of downtime is acceptable; there’s no way to measure whether or not this is a significant issue with the service, and you cannot terminate the escalation early with “Shakespeare search service is currently operating within SLO.” As our colleague Perry Lorier likes to say, “if you have no SLOs, toil is your job.”
A common pattern is to start your system off at a low SLO, because that’s easy to meet: you don’t want to run a 24/7 rotation, your initial customers are OK with a few hours of downtime, so you target at least 99% availability — 1.68 hours downtime per week. But in fact, your system is fairly resilient and for six months operates at 99.99% availability — down for only a few minutes per month.
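The downtime arithmetic behind these targets is simple enough to sketch. The helper below is illustrative; the 99%-per-week figure matches the example above.

```python
# Sketch: how much downtime a given availability target permits.
# The 99% weekly number matches the text; the rest is illustrative.

HOURS_PER_WEEK = 7 * 24     # 168
HOURS_PER_MONTH = 30 * 24   # 720, assuming a 30-day month

def downtime_budget(slo, period_hours):
    """Hours of allowed downtime for an SLO over a given period."""
    return (1 - slo) * period_hours

print(round(downtime_budget(0.99, HOURS_PER_WEEK), 2))          # 1.68 hours/week
print(round(downtime_budget(0.9999, HOURS_PER_MONTH) * 60, 1))  # 4.3 minutes/month
```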
But then one week, something breaks in your system and it’s down for a few hours. All hell breaks loose. Customers page your on-call complaining that your system has been returning 500s for hours. These pages go unnoticed, because on-call leaves their pagers on their desks overnight, per your SLO which only specifies support during office hours.
The problem is, customers have become accustomed to your service being always available. They’ve started to build it into their business systems on the assumption that it’s always available. When it’s been continually available for six months, and then goes down for a few hours, something is clearly seriously wrong. Your excessive availability has become a problem because now it’s the expectation. Thus the expression, “An SLO is a target from above and below” — don’t make your system very reliable if you don’t intend, and commit, to it being that reliable.
Within Google, we implement periodic downtime in some services to prevent a service from being overly available. In the SRE Book, our colleague Marc Alvidrez tells a story about our internal lock system, Chubby. Another example is the set of test front-end servers that internal services use to make themselves externally accessible during testing. These front-end servers are convenient, but are explicitly not intended for use by real services; they have a one-business-day support SLA, and so can be down for 48 hours before the support team is even obligated to think about fixing them. Over time, experimental services that used those front-ends started to become critical; when we finally had a few hours of downtime on the front-ends, it caused widespread consternation.
Now we run a quarterly planned-downtime exercise with these front-ends. The front-end owners send out a warning, then block all services on the front-ends except for a small whitelist. They keep this up for several hours, or until a major problem with the blockage appears; the blockage can be quickly reversed in that case. At the end of the exercise the front-end owners receive a list of services that use the front-ends inappropriately, and work with the service owners to move them to somewhere more suitable. This downtime exercise keeps the front-end availability suitably low, and detects inappropriate dependencies in time to get them fixed.
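As a rough illustration of the mechanics of such an exercise, the sketch below blocks requests from any service not on a small whitelist and records the offenders for follow-up. All service names here are hypothetical.

```python
# Sketch of a planned-downtime exercise: requests from services not on
# the whitelist are rejected, and the blocked service names are
# collected so their owners can be contacted afterward.
# All service names are hypothetical.

WHITELIST = {"critical-monitoring", "release-tooling"}

def handle(request, blocked_log):
    """Serve whitelisted services; block and record everything else."""
    service = request["service"]
    if service in WHITELIST:
        return "served"
    blocked_log.add(service)  # an inappropriate dependency, detected
    return "blocked"

blocked = set()
for req in [{"service": "critical-monitoring"},
            {"service": "experimental-shakespeare"},
            {"service": "experimental-shakespeare"}]:
    handle(req, blocked)
print(sorted(blocked))  # ['experimental-shakespeare']
```

The `blocked` set at the end of the exercise is the list of services whose owners need to be moved somewhere more suitable.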
At Google, we distinguish between a Service-Level Agreement (SLA) and a Service-Level Objective (SLO). An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLA is going to hurt the service team, so they’ll push hard to keep it within SLA.
Because of this, and because of the principle that availability shouldn’t be much better than the SLO, the SLA is normally a looser objective than the SLO. This might be expressed in availability numbers: for instance, an availability SLA of 99.9% over one month, with an internal availability SLO of 99.95%. Alternatively, the SLA might only specify a subset of the metrics that make up the SLO.
For example, with our Shakespeare search service, we might decide to provide it as an API to paying customers, in which a customer pays us $10K per month for the right to send up to one million searches per day. Now that money is involved, we need to specify in the contract how available they can expect the service to be, and what happens if we breach that agreement. We might say that we’ll provide the service at a minimum of 99% availability, following the definition of successful queries given previously. If the service drops below 99% availability in a month, we’ll refund $2K; if it drops below 80%, we’ll refund $5K.
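The tiered refund schedule is easy to express in code. The tiers ($2K below 99%, $5K below 80%) come from the example above; the function itself is an illustrative sketch.

```python
# Sketch of the tiered refund schedule from the Shakespeare SLA example.
# Tiers come from the text; the function shape is illustrative.

def monthly_refund(availability):
    """Refund in dollars owed for a month at the measured availability."""
    if availability < 0.80:
        return 5000
    if availability < 0.99:
        return 2000
    return 0

print(monthly_refund(0.995))  # 0 -- within SLA
print(monthly_refund(0.97))   # 2000
print(monthly_refund(0.75))   # 5000
```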
If you have an SLA that’s different from your SLO, as it almost always is, it’s important for your monitoring to measure SLA compliance explicitly. You want to be able to view your system’s availability over the SLA calendar period, and easily see if it appears to be in danger of going out of SLA. You’ll also need a precise measurement of compliance, usually from logs analysis. Since we have an extra set of obligations (in the form of our SLA) to paying customers, we need to measure queries received from them separately from other queries (we might not mind dropping queries from non-paying users if we have to start load shedding, but we really care about any query from the paying customer that we fail to handle properly). That’s another benefit of establishing an SLA — it’s an unambiguous way to prioritize traffic.
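Here's a sketch of what that separate accounting over log records might look like. The record shape (a `paying` flag and an `ok` outcome) is an assumption for illustration.

```python
# Sketch: measure SLA compliance over paying-customer traffic only,
# separately from overall traffic. The log-record shape is assumed.

def sla_availability(log, paying_only=True):
    """Success rate over the queries covered by the SLA."""
    covered = [r for r in log if r["paying"] or not paying_only]
    if not covered:
        return 1.0
    return sum(r["ok"] for r in covered) / len(covered)

log = [
    {"paying": True,  "ok": True},
    {"paying": True,  "ok": True},
    {"paying": True,  "ok": True},
    {"paying": True,  "ok": False},
    {"paying": False, "ok": False},  # non-paying query shed under load
    {"paying": False, "ok": False},  # non-paying query shed under load
]
print(sla_availability(log))                     # 0.75 over SLA-covered traffic
print(sla_availability(log, paying_only=False))  # 0.5 over all traffic
```

The two numbers can diverge sharply once you start load shedding, which is exactly why the SLA-covered traffic needs its own measurement.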
When you define your SLA, you need to be extra-careful about which queries you count as legitimate. For example, suppose that you give each of three major customers (whose traffic dominates your service) a quota of one million queries per day. One of your customers releases a buggy version of their mobile client, and issues two million queries per day for two days before they revert the change. Over a 30-day period you’ve issued approximately 90 million good responses, and two million errors; that gives you a 97.8% success rate. You probably don’t want to give all your customers a refund as a result of this; two customers had all their queries succeed, and the customer for whom two million out of 32 million queries were rejected brought this upon themselves. So perhaps you should exclude all “out of quota” response codes from your SLA accounting.
On the other hand, suppose you accidentally push an empty quota specification file to your service before going home for the evening. All customers receive a default quota of 1,000 queries per day. Your three top customers are served constant “out of quota” errors for 12 hours, until you notice the problem when you come into work in the morning and revert the change. You’re now showing 1.5 million rejected queries out of 90 million for the month, a 98.3% success rate. This incident is entirely your fault: counting it as 100% success over the remaining 88.5M queries misses the point, and an honest SLA measurement has to count those failures against you.
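Putting the two quota stories together, the rule is to exclude out-of-quota errors from SLA accounting only when the customer caused them. The sketch below is illustrative: the `our_fault` attribution field is an assumption, and the counts are scaled-down versions of the numbers above.

```python
# Sketch: SLA accounting that excludes out-of-quota errors only when
# the customer caused them. The record shape, including the
# `our_fault` field, is an illustrative assumption.

def sla_success_rate(records):
    """Success rate, excluding customer-caused out-of-quota errors."""
    counted = [r for r in records
               if not (r["status"] == "out_of_quota" and not r["our_fault"])]
    if not counted:
        return 1.0
    return sum(r["status"] == "ok" for r in counted) / len(counted)

# Buggy-client case (scaled down): 30 good, 2 customer-caused quota errors.
buggy_client = ([{"status": "ok", "our_fault": False}] * 30
                + [{"status": "out_of_quota", "our_fault": False}] * 2)
print(sla_success_rate(buggy_client))  # 1.0 -- customer's own errors excluded

# Bad quota push (scaled down): 885 good, 15 quota errors we caused.
bad_push = ([{"status": "ok", "our_fault": False}] * 885
            + [{"status": "out_of_quota", "our_fault": True}] * 15)
print(round(sla_success_rate(bad_push), 3))  # 0.983 -- counts against us
```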
SLIs, SLOs and SLAs aren’t just useful abstractions. Without them you cannot know if your system is reliable, available, or even useful. If they don’t tie explicitly back to your business objectives then you have no idea if the choices you make are helping or hurting your business. You also can’t make honest promises to your customers.
If you’re building a system from scratch, make sure that SLIs, SLOs and SLAs are part of your system requirements. If you already have a production system but don’t have them clearly defined then that’s your highest priority work.
As an SRE (or DevOps professional), it’s your responsibility to understand how your systems serve the business in meeting those objectives and, as much as possible, to control the risks that threaten them. Any measure of system availability that ignores business objectives is worse than worthless, because it obfuscates actual availability, creating a false sense of security that leads to failure.
For those of you who wrote us thoughtful comments and questions from our last article, we hope this post has been helpful. Keep the feedback coming!
N. B. Google Cloud Next ’17 is fewer than seven weeks away. Register now to join Google Cloud SVP Diane Greene, Google CEO Sundar Pichai, and other luminaries for three days of keynotes, code labs, certification programs, and over 200 technical sessions. And for the first time ever, Next ’17 will have a dedicated space for attendees to interact with Google experts in Site Reliability Engineering and Developer Operations.