What is Site Reliability Engineering?

Site reliability engineering (SRE) is a trendy concept in the cloud world. Google started this practice in the early 2000s and shared it with the world in the seminal book Site Reliability Engineering.

It turns out that running large systems comes with a whole lot of challenges. Similar to the DevOps movement, SRE is about bridging the gap between development and operations. Let’s talk about SRE, shall we?

What Does Site Reliability Engineering Do?

Here’s what the source has to say:

“SRE is what you get when you treat operations as if it’s a software problem.”

A site reliability engineer helps application teams operate their applications at scale. It’s a dual role with two distinct responsibilities:

A software engineer who predominantly works on automating systems to make them more reliable and easier to operate
An operator who monitors production systems and handles incidents as part of an on-call rotation

It’s important to note that in the original model, SRE is optional. Reliability engineers advise teams when they are preparing a release. Once the product hits a certain scale, the SREs join the on-call rotation and work on improving the product’s operability.

At some point, their help is no longer needed, and they disengage. It’s not a static role. SRE isn’t about maintaining the status quo. Reliability engineers constantly work on improving the system.

Important Concepts in Site Reliability Engineering

There is enough information about SRE to fill books. I’m sticking to two concepts here: toil and service level indicators (SLIs).

Toil is the kind of work tied to running a production service that tends to be manual and repetitive and that doesn’t provide enduring value. Toil is a necessary part of supporting live software. However, it can be dangerous. Engineers who only work on toil won’t have time to work on automation or improved observability. It can become a self-reinforcing cycle that stops progress and demotivates people until they burn out. Google aims to limit toil to 50 percent of an SRE’s time at most.
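To make the 50 percent cap concrete, here is a minimal sketch of how a team might check it. The function names and the idea of logging hours per week are assumptions for illustration, not something Google prescribes.

```python
# Hypothetical helper names; the 50 percent cap is the figure quoted above.
TOIL_CAP = 0.5


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on manual, repetitive operational work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_toil_cap(toil_hours: float, total_hours: float, cap: float = TOIL_CAP) -> bool:
    """True when toil takes up more of the time than the cap allows."""
    return toil_fraction(toil_hours, total_hours) > cap


if __name__ == "__main__":
    # An engineer who logged 24 hours of toil in a 40-hour week.
    print(over_toil_cap(24, 40))
```

In this example, 24 of 40 hours is 60 percent toil, which is above the cap and a signal to invest in automation.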

A service level indicator is a quantitative measure of some aspect of the service provided. On top of that, you define service level objectives (SLOs), each a target range for an SLI. This is a crucial part of maintaining agreements between teams. If you miss your SLOs, you need to work on bringing the system back to the right place. SLOs formalize an interesting intuition: you don’t need more reliability than necessary. Adding an extra nine of availability adds an order of magnitude of effort. It’s only worth doing if business requirements demand it.
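The relationship between an SLI, an SLO, and the "nines" can be sketched in a few lines. This is an illustration under stated assumptions: the SLI here is request availability (successful requests divided by total requests), the window is 30 days, and all function names are made up for the example.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in the assumed window


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


def allowed_downtime_minutes(slo_target: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime the SLO tolerates within the window."""
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    sli = availability_sli(good_requests=99_950, total_requests=100_000)
    print(f"SLI: {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")

    # Each extra nine shrinks the tolerated downtime by a factor of ten.
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} -> about {minutes:.2f} minutes per 30 days")
```

Over 30 days, 99 percent leaves roughly 432 minutes of tolerated downtime, 99.9 percent roughly 43, and 99.99 percent a little over 4. That shrinking margin is one way to see why each additional nine costs so much more effort.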

A Warning About SRE

There’s one important fact that many organizations forget: they aren’t Google. They don’t have the same scale, and thus they don’t have the same problems. Dedicated SRE teams make sense if you have a significant technological footprint.

What often happens in practice is that companies rename their operations team SRE. Mind you, they keep working on the same things for the most part. But at least they get a new title.

Needless to say, this doesn’t make much sense. Let’s say you run your applications in the cloud using managed services. In that case, following the “you build it, you run it” model might be a better approach. Application teams operate their services, and operations specialists support the teams as needed.

Acting as an SRE

Even if your organization doesn’t quite require SRE teams, you can still benefit from the principles behind it. Managing risk, SLOs, monitoring, automation, and reducing toil are all worthwhile.

As an engineer, the easiest way to get experience in this area is to operate production software. Many teams have that kind of ownership, and people are generally happy whenever new team members are interested in learning more. Additionally, you might want to consider some training. If you want to develop yourself in that direction, be sure to check out Cprime’s SRE training. This course is about learning SRE best practices based on Google’s approach to the role.

Implementing Site Reliability Engineering

Mario Fernandez

Mario develops software for a living, then he goes home and continues thinking about software because he just can't get enough. He’s passionate about tools and practices, such as continuous delivery. He has also been involved in frontend, backend, and infrastructure projects.
