In today’s highly technical world, companies need websites that are reliable, efficient, and scalable. In 2003, Google tasked a team of software engineers with addressing this, and they did so well that other big tech companies copied them.
This was the birth of site reliability engineering (SRE), which has now become an important IT domain that develops solutions for operational aspects to improve the reliability and performance of websites, especially larger ones. The goal is to automate as much of operations as possible while continuing to develop systems to improve the site. SRE is not DevOps, although it can be part of it.
For many companies, this means hiring a site reliability engineer, and larger companies need an entire SRE team. For best results, your SRE team needs to be high-performing and agile, and it needs to follow the best practices that were started by Google and expanded by others. Companies that don’t need an SRE team can either hand SRE over to development or outsource it to an SRE-as-a-service provider, but for the purposes of this article we are assuming that you are indeed building a full SRE team for your company.
SRE Best Practices for Establishing your Team
Google has some suggestions for setting up your SRE team. They recommend a team of at least eight people for on-call/operational duties. SRE teams, however, should spend no more than 50% of their time on operational work. Rather than inflicting the overflow on your SRE team, include the development team in the on-call rotation. This not only frees the SRE team up for their real job, but also helps the development team better understand the needs and desires of your end users.
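One way to keep an eye on the 50% cap is to compare operational hours with total hours for each engineer. The following is only a minimal sketch: the function names, the engineer labels, and the sample hours are hypothetical, and it assumes you already log time in some form.

```python
OPS_CAP = 0.50  # the 50% ceiling on operational work described above


def ops_fraction(ops_hours: float, total_hours: float) -> float:
    """Return the share of working time spent on operational work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return ops_hours / total_hours


def over_cap(ops_hours: float, total_hours: float, cap: float = OPS_CAP) -> bool:
    """True when operational work exceeds the cap and overflow should be shared."""
    return ops_fraction(ops_hours, total_hours) > cap


# Hypothetical week for two engineers: (operational hours, total hours)
week = {"engineer_a": (24, 40), "engineer_b": (16, 40)}
flagged = [name for name, (ops, total) in week.items() if over_cap(ops, total)]
print(flagged)  # ['engineer_a']
```

Anyone who shows up in the flagged list is a signal that some on-call load should move to the development team's rotation.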
For large companies, it’s recommended that you have two SRE sites in different time zones to reduce the number of pages during the night. Obviously, this is not feasible for everyone, but it does improve quality of life if it can be managed. If you do have people on call at night, work with your team. Some people don’t mind being on the night shift, while others can get physically ill if they try. Take needs and preferences into account when scheduling on-call shifts.
Assume that your team can only handle two events per shift, but remember that the idea of SRE is to bring the number of events as close to zero as possible. If they are doing their job, your team may start getting rusty on incident response, so make sure to drill them regularly to help them stay sharp and improve incident handling. When establishing the team, though, you probably want enough load to make sure they have to deal with an incident at least weekly.
Don’t fall into the trap of just renaming part of your development team to SRE without doing the proper training and upskilling. Make sure you implement SRE properly and set up the balance of responsibility between SRE and development.
Treat IT Operations as a Value Item
SRE is worth it. SRE is also expensive. Even Google often struggles to hire good site reliability engineers. IT has traditionally been treated as an area where costs can be cut. Instead, you need to consider the costs of downtime, and what having a good site can do to maximize your revenue. Downtime can cost even a small business an average of $10,000 an hour.
SRE teams are best kept small and you should hire the best people you can afford and/or give people the best training available. SRE is very much about finding out what the mundane activities to run your site are and automating those activities away. It’s about reducing not just downtime but the human effort put into reducing downtime and restoring service.
By measuring your IT budget by how much money good IT saves you, you can better justify the expense of hiring and training good people.
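The arithmetic behind that justification is simple enough to sketch. The example below is illustrative only: it reuses the $10,000-an-hour figure quoted above, while the hours of avoided downtime, the team cost, and the function names are hypothetical placeholders for your own numbers.

```python
def downtime_cost(hours_down: float, cost_per_hour: float) -> float:
    """Estimated revenue lost to an outage of the given length."""
    return hours_down * cost_per_hour


def net_savings(hours_avoided: float, cost_per_hour: float, team_cost: float) -> float:
    """Downtime cost avoided minus what the team costs over the same period."""
    return downtime_cost(hours_avoided, cost_per_hour) - team_cost


# Hypothetical year: 60 hours of downtime avoided, $400,000 spent on the team.
print(net_savings(hours_avoided=60, cost_per_hour=10_000, team_cost=400_000))  # 200000
```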
A great SRE team, then, is not made up of lots of low-level people, but of a few really good ones. Given the difficulty of hiring trained SREs, you may want to start with people you already have and make them into site reliability engineers.
What Should You Look For in People to Add to the Team?
So, if hiring fully-fledged SREs is difficult, what should you look for in people to add to the team? Maybe you can find and afford one already-trained person, but need to look elsewhere for candidates. Here are some considerations:
- The general assumption is that you need people with a computer science degree, but this is not necessarily the case. Opensource.com did a survey of SREs and found that 20% of them had no degree at all and another 19% had a degree in some other area including, of all things, zoology. In other words, don’t restrict your search by degree.
- The top non-technical skills needed are problem solving, teamwork, composure under pressure, written communication, and verbal communication. It is easier to train technical skills than these transferable skills, which are as much about how somebody works as what they are doing.
- Candidates need to be able to solve problems quickly, including ones they might not know how to solve right away. They should not be people who know, or rather think they know, all the answers.
- For technical skills, look for experience with cloud-based software and infrastructure automation.
- Try to get a team that has different backgrounds, both in terms of education and in terms of life. This helps ensure that somebody on the team knows the answer, while also bringing together different methods of problem solving.
The most important qualities are an eagerness to learn and a desire to understand how the system works, as well as the urge to automate away the boring stuff. So avoid bringing people onto the team who find doing rote work relaxing, as they’re unlikely to have the drive needed. A good SRE, according to Google, is trying to automate themselves out of a job. The goal is to reduce human involvement, and thus you need people who are determined to make their own lives easier.
SLOs and Targets
Another quality of a good team involves your expectations. It’s important that these are made clear to both internal and external hires being brought onto the SRE team (some companies have slightly different definitions of what site reliability engineering covers).
A key way of establishing these is to have a service level objective. As a note, 100% uptime is not a reasonable objective. Downtime will happen, despite every effort to minimize it. It can happen due to circumstances completely beyond your control, such as a power outage. Higher automation reduces human error but can increase the risk of downtime due to instabilities in your code.
Saying you want 100% reliability will simply put too much stress on your team. Instead, set a reasonable target that takes into account the needs of your company and your customers. For example, your highest priority might be making sure that your ecommerce store is always able to take orders. For another company it might be a back end database that needs to be available consistently. Your SLOs should be personalized and take into account your past reliability issues, encouraging stepped goals to get reliability as high as feasible.
Instead of demanding perfection, you set an “error budget.” This is calculated as one minus your SLO, so if your current objective is 98%, you have a 2% error budget. Going over the error budget should not be punitive, but rather treated as a way to prioritize team time. When you exceed the error budget, your SRE team should stop working on changes and releases, except for vital patches, and focus instead on fixing the SLO miss. This is particularly important if a bug is causing the budget to be exceeded or if you find a hard dependency that can be softened. However, you should give a pass if the outage was caused by, for example, a company-wide networking problem, a service maintained by a different team, or external factors such as internet problems. The budget may also appear to be exceeded, or appear healthy, only because errors have been miscategorized, so categorization should always be checked in both directions.
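To make the error budget concrete, here is a minimal sketch of the calculation for a time-based SLO. The 30-day window, the function names, and the idea of subtracting "excused" minutes for outages outside the team's control are assumptions made for illustration, not a prescribed policy.

```python
def error_budget(slo: float) -> float:
    """Error budget as a fraction: one minus the SLO."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return 1 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime the budget permits over the window."""
    return error_budget(slo) * window_days * 24 * 60


def releases_frozen(slo: float, downtime_minutes: float,
                    excused_minutes: float = 0.0, window_days: int = 30) -> bool:
    """True when counted downtime exceeds the budget, so only vital patches ship.

    excused_minutes covers outages that get a pass, such as a company-wide
    networking problem or a service maintained by a different team.
    """
    counted = downtime_minutes - excused_minutes
    return counted > allowed_downtime_minutes(slo, window_days)


print(round(allowed_downtime_minutes(0.98), 1))              # 864.0
print(releases_frozen(0.98, downtime_minutes=1000))          # True
print(releases_frozen(0.98, 1000, excused_minutes=200))      # False
```

With a 98% SLO, the 2% budget works out to roughly 864 minutes of downtime in a 30-day window, which is the threshold the freeze decision is compared against.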
You may eventually be able to tighten your SLO target as your team works on the automation necessary to reduce the number of errors. If you are consistently way under your error budget, then it is probably time to move the goalposts. Another metric to look at is, of course, operational involvement or toil. The lower this is, the better your team is doing. Both measures together provide actionable, quantifiable goals that your team can use to ensure they are doing their job well.
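Deciding when you are "consistently way under" the budget can be sketched as a simple check over recent measurement windows. Everything specific here is an assumption: the request-based measurement, the 50% threshold, the sample figures, and the function names are hypothetical and should be replaced with whatever your team agrees on.

```python
def budget_consumed(bad_events: int, total_events: int, slo: float) -> float:
    """Fraction of the error budget used: observed error rate / allowed error rate."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return (bad_events / total_events) / (1 - slo)


def should_tighten(history: list[float], threshold: float = 0.5) -> bool:
    """True when every recent window used less than `threshold` of its budget."""
    return bool(history) and all(used < threshold for used in history)


# Hypothetical last three windows at a 98% SLO: (failed requests, total requests)
windows = [(300, 100_000), (450, 100_000), (200, 100_000)]
history = [budget_consumed(bad, total, slo=0.98) for bad, total in windows]
print(should_tighten(history))  # True
```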
Again, though, you can’t expect 100% reliability from any service. Give your team the space to improve without putting that level of pressure on them. Calculating reliability cost is an important part of establishing a good SRE team and making sure that the goals they have are reasonable.
Which brings us to:
A Great SRE Team Needs Respect
We all know that there is a tendency to take IT for granted until something goes wrong. Then, of course, it’s entirely their fault. This leads to a high turnover in IT departments and often to people getting out of IT altogether.
It’s vital to ensure that this does not happen to your site reliability engineering team. Because of the high level of teamwork and system understanding they need, you need to keep your team together as much and as long as possible.
A really good team will eventually be less needed, but a team which is hemorrhaging members is not going to get the job done. You need to make sure that your SRE team feels wanted and needed.
Ways to do so include:
- Hold regular meetings with development leadership so that everyone is on the same page.
- Have SRE and developers work together on project planning and execution. This makes the work done by SRE more visible; work that prevents downtime often goes unnoticed until something fails.
- Share all reviews and postmortems between developers and SRE.
- Have the SRE team produce a long-term plan, usually annually, working on it with the developers. The plan should be part of how they work towards automating incident response and reducing operational involvement.
- Give team members credit for improvements they make. Eventually this should include improvements that go beyond “firefighting” and normal operations. An advanced SRE team should have at least one member who has contributed a major improvement.
- Do periodic performance reviews that include positive feedback, constructive criticism, and time spent listening to what each team member wants.
- Conduct postmortems in a way that is always blameless; make sure whoever is moderating the postmortem meeting knows to steer the topic away from fault and step in if team members are in conflict.
- Embed site reliability engineers in development teams, whether long term or for specific projects. This helps each side learn what the other does and allows the SRE team to help with communication between development teams, which can improve efficiency and make everyone feel more integrated and valued.
Overall, it’s very important to recognize the importance of your SRE team and honor them for their achievements. Make sure that everyone in the company knows how important these people are and be transparent about cost savings and improved revenue that come from increased reliability.
All of these are important things to consider either when starting a new SRE team or improving the one you already have. At Cprime, we offer our three-day Implementing Site Reliability Engineering course either to individuals or entire teams. We will run the course either face to face or online and offer private corporate training in both formats. There are no prerequisites for the course, which we recommend to both prospective team members and IT leadership including CIOs/CTOs. Contact us to find out more and discuss our corporate training options.