Site reliability engineering (SRE) is a trendy concept in the cloud world. Google started this practice in the early 2000s and shared it with the world in the seminal book Site Reliability Engineering.
It turns out that managing big systems at scale comes with a whole lot of challenges. Similar to the DevOps movement, SRE is about bridging the gap between development and operations. Let’s talk about SRE, shall we?
What Does Site Reliability Engineering Do?
Here’s what the book has to say:
SRE is what you get when you treat operations as if it’s a software problem.
A site reliability engineer helps application teams operate their applications at scale. It’s a dual role with two distinct responsibilities:
- A software engineer who predominantly works on automating systems to make them more reliable and easier to operate
- An operator that monitors and handles incidents for production systems as part of an on-call rotation
It’s important to note that in the original model, SRE is optional. Reliability engineers advise teams when they are preparing a release. Once the product hits a certain scale, SREs join the on-call rotation and work on improving the product’s operability.
At some point, their help is no longer needed, and they disengage. It’s not a static role. SRE isn’t about maintaining the status quo. Reliability engineers constantly work on improving the system.
Important Concepts in Site Reliability Engineering
There is enough information about SREs to fill books. I’m sticking to two here: toil and service level indicators (SLIs).
Toil is the kind of work tied to running a production service that tends to be manual and repetitive and that doesn’t provide enduring value. Toil is a necessary part of supporting live software. However, it can be dangerous. Engineers who only work on toil won’t have time to work on automation or improved observability. It can become a self-reinforcing cycle that stops progress and demotivates people until they burn out. Google aims to limit toil to at most 50 percent of an SRE’s time.
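To make the idea concrete, here’s a minimal sketch of automating away a classic piece of toil: instead of an on-call engineer eyeballing dashboards and restarting unhealthy instances by hand, a script makes the repetitive decision. All names, the health-check data, and the restart threshold are invented for illustration.

```python
# Hypothetical sketch: automating a repetitive restart decision.
# The instance names, health data, and threshold are made up.

from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    consecutive_failed_checks: int


# Assumed policy: an instance is restarted after 3 failed health checks.
FAILURE_THRESHOLD = 3


def instances_to_restart(fleet: list[Instance]) -> list[str]:
    """The decision an operator would otherwise make by hand."""
    return [
        i.name
        for i in fleet
        if i.consecutive_failed_checks >= FAILURE_THRESHOLD
    ]


fleet = [
    Instance("web-1", 0),
    Instance("web-2", 4),
    Instance("web-3", 3),
]
print(instances_to_restart(fleet))  # ['web-2', 'web-3']
```

Once a check like this runs on a schedule, the manual work disappears, and the time it frees up can go into further automation rather than more toil.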
A service level indicator (SLI) is a quantitative measure of some aspect of the service provided, such as availability or latency. On top of that, you define service level objectives (SLOs): target values or ranges for an SLI. This is a crucial part of maintaining agreements between teams. If you miss your SLOs, you need to work on bringing the system back into compliance. SLOs formalize an interesting intuition: you don’t need more reliability than necessary. Adding an extra nine of availability adds an order of magnitude of effort. It’s only worth doing if business requirements demand it.
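A short sketch may help here: computing an availability SLI from request counts, checking it against an SLO, and deriving the remaining error budget (the allowed unreliability, 1 minus the SLO). All the numbers are made up for illustration; this is not a prescribed formula from the book, just one common way to express it.

```python
# Hedged sketch: an availability SLI checked against an SLO.
# The request counts and the 99.9% target are invented examples.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent.

    The error budget is the allowed unreliability (1 - SLO);
    spending all of it means the SLO is missed.
    """
    allowed_failure = 1 - slo  # e.g. 0.001 for a 99.9% SLO
    actual_failure = 1 - sli
    return 1 - actual_failure / allowed_failure


sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
slo = 0.999  # a "three nines" target

print(f"SLI: {sli:.4%}")  # 99.9500%
print(f"SLO met: {sli >= slo}")  # True
print(f"Error budget left: {error_budget_remaining(sli, slo):.0%}")  # 50%
```

The error budget is what makes the “no more reliability than necessary” intuition actionable: while budget remains, the team can spend it on risky releases; once it’s gone, the focus shifts back to reliability work.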
A Warning About SRE
There’s one important fact that many organizations forget: they aren’t Google. They don’t have the same scale, and thus they don’t have the same problems. Dedicated SRE teams make sense if you have a significant technological footprint.
What often happens in practice is that companies rename their operations team SRE. Mind you, they keep working on the same things for the most part. But at least they get a new title.
Needless to say, this doesn’t make much sense. Let’s say you run your applications in the cloud using managed services. In that case, following the “you build it, you run it” model might be a better approach. Application teams operate their services, and operations specialists support the teams as needed.
Acting as an SRE
Even if your organization doesn’t quite require SRE teams, you can still benefit from the principles behind it. Managing risk, SLOs, monitoring, automation, and reducing toil are all worthwhile.
As an engineer, the easiest way to get experience in this area is to operate production software. Many teams have that kind of ownership, and people are generally happy whenever new team members are interested in learning more. Additionally, you might want to consider some training. If you want to develop yourself in that direction, be sure to check out Cprime’s SRE training. This course is about learning SRE best practices based on Google’s approach to the role.