Service Level Objectives

We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs).

They descibe metrics that matter, what values we want those metris to have and how we'll react if we can't provide the expected service.

Choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.

Indicators

A carefully defined quantitative measure of some aspect of the level of service that is provided.

Common SLIs:

Request latency
Error rate
System throughput
Availability - often defined in terms of the fraction of well-formed requests that succeed, sometimes called yield

Note for availability that availabilities of 99% and 99.999% can be referred to as "2 nines" and "5 nines" availability. GCE aims for "three and a half nines" — 99.95% availability.

Objectives

A target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI <= target, or lower bound <= SLI <= upper bound.

For example, we might decide that we will return Team search results "quickly," adopting an SLO that our average search request latency should be less than 80 milliseconds.

Choosing an appropriate SLO is complex. For incoming HTTP requests from the outside world to your service, the queries per second (QPS) metric is essentially determined by the desires of your users, and you can’t really set an SLO for that.

On the other hand, you can say that you want the average latency per request to be under 100 milliseconds, and setting such a goal could in turn motivate you to write your frontend with low-latency behaviors of various kinds or to buy certain kinds of low-latency equipment.

Agreements

an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial — a rebate or a penalty — but they can take other forms.

SRE doesn’t typically get involved in constructing SLAs, because SLAs are closely tied to business and product decisions. SRE does, however, get involved in helping to avoid triggering the consequences of missed SLOs.