What Is Site Reliability Engineering

Google has collected the experiences of its employees, and what they learned from them, in a book: “Site Reliability Engineering”.

The book both highlights where system administration and software development are heading and includes a great deal of valuable information about how Google turns its experience into money.

To learn a profession, we study for years and commit thousands of pages of theory to memory, and it is not enough. As interns, we try to consolidate all the theoretical knowledge we have learned, and again it is not enough. We need at least a couple of years of working experience before we feel confident enough to say “I have learned it”.

But at the end of the whole process, the main thing we learn is that theory and practice often do not match. Over time we come to understand that the theoretical knowledge we gained in school is only enough to get us to the starting line of our working life. What matters is what we learn through experience.

In this article, I will talk about the parts of the book that I found interesting and important, but I recommend that you read the whole book. The book is also available to read online.

I first heard the term “Reliability Engineering” at a workshop at the end of the 90s. To be honest, I did not expect it to become such an influential discipline. At the time its only use case was in airplanes and space vehicles, and I did not expect it to have such a deep influence on our lives.

Reading in “Site Reliability Engineering” about how this discipline is applied in the IT sector was like opening a big treasure chest for me.

While reading the experiences in the book, I saw that I had faced problems and solutions similar to those of the people who shared them. What was incredible was that I also saw some very unusual perspectives and incredibly creative solutions. In this respect, it was a very eye-opening read for me. The three parts of the book that impressed me the most are:

  • Embracing Risk
  • Service Level Objectives
  • Eliminating Toil

I believe these topics are very much to the point, because companies that supply cloud services or IT infrastructure have to position these three features correctly in their solutions. If they do not build these principles into their solutions, I am afraid they will not be able to survive.

My biggest benefit: Principles

 

In the book, the revolutionary change that has happened in modern technology is summarized under the terms DevOps and SRE, and the principles that emerged from real experience in this field are discussed.

I find this approach really inspiring, because your approach as an engineer to technologies and technical topics can change with time and place, but principles take a very long time to form and do not change easily.

I would like to share the principles that impressed me the most and that I found most noteworthy while reading.

The Principles of DevOps

1. “No More Silos (Versatilists)”

Until the beginning of the 2010s, it was emphasized that specializing in one branch was very important in your working life. Just as you cannot carry two watermelons under one arm, the saying went, it was wrong to pursue other skills at the same time.

But this point of view has changed in the last 10 years. Nowadays the opposite view is accepted as correct. Gartner predicts that 40% of IT employees will be “versatilists” by 2021. A versatilist is “a person who has specialties in more than one discipline”.

In Site Reliability Engineering, it is said that the occupational divisions that exist today, such as software developer, operator, infrastructure developer, and network engineer, will evolve into the single role of “SRE” in the coming years. I can say that I have been experiencing this in the sector for a while.

Today a system administrator should be someone who can both develop software and make changes on a storage network.

The “division into silos” defined by ITIL has been badly damaged, and moving in an “Agile” way is only possible when different disciplines are combined within teams.

For example, the Unix team can no longer assign issues to the Network team, because in the customer’s IT team the titles “Network” and “Unix” no longer exist. All kinds of issues are expected to be solved by the SRE.

2. “Accidents are normal”

 
I don’t know whether it is natural or not, but whenever something goes wrong at work, we always look for someone to blame. Once we find that person, we expose them, even if doing so does not help at all in finding a solution. Exposing someone as responsible for the issue gives us a kind of psychological relief.
 

The book, by contrast, underlines that accidents are normal and do not occur as a result of personal mistakes. I completely agree with that.

Exposing and blaming the person who caused the accident is the biggest mistake.

Doing so signals the existence of mobbing within the company, causes teams to hide information from each other, and prevents the real reasons for these mistakes from being found. To avoid this troublesome situation, you need to account for possible accidents at the very beginning, when you build your systems.

I would like to give a simple example from daily life. Let’s say you are setting up a server and configuring its disks. If you use LVM, you can resize your volumes online. And if you later have a performance issue, you can move your data to another disk live, on the running operating system.

In this way, you will have foreseen situations like capacity growth and performance issues from the beginning. This means that mistakes are completely normal and that they come from the design, because you need to foresee the disk growth from the start of your design.

3. “Change should be gradual”

 

Everyone knows the monthly project meetings and their updated documents. Being well documented does not ensure that the big decisions are implemented or that the participants are prepared for the next meeting.

For this reason, Site Reliability Engineering stresses continuous change and the importance of making change happen in baby steps. As an example, suppose you would like to change the monitoring tool that you use in your company.

Trying to do it all at once, with big meetings on the topic, would be a significant challenge. By spreading the change over time, it is much easier to adopt the new monitoring tool gradually.

Instead of holding big monthly meetings, holding small meetings during the process and receiving feedback from users helps you get the results you actually need.
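As an illustration of such a gradual switch of the monitoring tool, here is a minimal Python sketch. It is only one possible approach and not something taken from the book: the host names, the `rollout_bucket` and `uses_new_tool` functions, and the stage percentages are all hypothetical. Each host gets a stable bucket between 0 and 99, and a host moves to the new tool once the rollout percentage passes its bucket, so every stage only adds hosts and never moves one back.

```python
import hashlib


def rollout_bucket(host: str) -> int:
    """Map a host name to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def uses_new_tool(host: str, rollout_percent: int) -> bool:
    """Return True if this host should already report to the new tool."""
    return rollout_bucket(host) < rollout_percent


# Hypothetical inventory of 500 servers.
hosts = [f"srv{i:03d}" for i in range(500)]

# Raise the percentage step by step and collect feedback between stages.
for percent in (0, 10, 50, 100):
    migrated = [h for h in hosts if uses_new_tool(h, percent)]
    print(f"{percent:3d}% stage: {len(migrated)} of {len(hosts)} hosts migrated")
```

Because the bucket depends only on the host name, the set of migrated hosts at 10% is always contained in the set at 50%, which keeps every step small and easy to roll back.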

4. “Tooling and Culture are Interrelated”

 

The tools that you use and your corporate culture are related to one another. But your corporate culture should not change even when the tools change.

The tools that you use should not determine your company’s work culture. Over time, other toolsets may change the content of the work you are doing.

Say you have established a company and your principle is to use automation. For that, you choose the Puppet automation software. In the SRE approach, your designs and processes should not be built around Puppet.

Whatever features Puppet has today might not be there tomorrow. And for the service you have created, you should be prepared to develop the features that Puppet does not provide.

If, on the other hand, you define your change management principles independently of the tool in use, they will change little over time. Do not forget that if we shape our principles around our toolset, then our principles disappear when we change our tools. We would have to change our principles to meet the requirements of each new tool.

5. “Measurement is crucial”

 

The title of one of the chapters in the book is “Service Level Objectives (SLO)”. When I read this part for the first time, I remembered a lesson from something I experienced in my junior years. The lesson came from an awful outage that lasted for days.

“If you cannot evaluate the work you have done, then you are not the one who is doing that work.”

Indeed, evaluating your achievements and failures is as important for you as it is for the people you report to.

In the IT industry, people always talk about how systems are not working, never about how they are working. If you don’t determine your own success criteria, knowing what you have done, then someone will always set the bar much higher than where you are supposed to reach. That constantly reduces the perceived quality and the comprehensibility of your work.

So Service Level Objectives (SLOs) should be evaluated at the level of employees, departments, and the whole company, and should be presented to users. Otherwise, whoever you report to (your manager, your boss, or the customers) will continue to treat you unfairly on every issue.

The Principles of SRE

 

1. “Operations is a Software problem”

 

In fact, the original version of this quote in Google’s book is: “All problems are software problems.” An amazing, very effective, and long-established principle.

Everything related to IT is actually running software in the background. Therefore, when Site Reliability Engineers (SREs) create solutions to their issues, they should use the principles of software engineering.

Here it is more meaningful to give an example from outside Google. AWS develops all of its services as software, independent of the hardware stack. Its cloud software is an overlay on top of whatever hardware it uses.

In this way, AWS has been the leader in the public cloud sector since it was established. Since all the services it develops are software services, its operational problems are solved with software solutions. It is a huge proof of this principle.

2. “Manage by Service Level Objectives (SLOs)”

Designate a “Service Level Objective” for every service you manage. Based on this target, you can easily measure the uptime of the services you run in your company and judge whether they are successful or not. An SRE should never promise 100% uptime. What needs to be done is to specify an SLO for every service produced and to track it with a report on a dashboard that measures the service against this specification.
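To make the uptime target concrete, the following Python sketch turns an SLO percentage into the downtime it allows over a reporting period. It is a simplified illustration under my own assumptions: the function names, the 30-day period, and the 99.9% target are hypothetical and are not taken from the book. Note that the function refuses a 100% target, in line with the rule that an SRE should never promise it.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Return the downtime (in minutes) that an uptime SLO allows in a period."""
    if not 0 < slo_percent < 100:
        # A 100% target leaves no room for any failure, so it is rejected.
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def slo_met(slo_percent: float, downtime_minutes: float, period_days: int = 30) -> bool:
    """Check whether the measured downtime stayed within the SLO."""
    return downtime_minutes <= allowed_downtime_minutes(slo_percent, period_days)


budget = allowed_downtime_minutes(99.9)  # about 43.2 minutes in 30 days
print(f"Allowed downtime at 99.9%: {budget:.1f} minutes")
print("SLO met with 30 minutes of downtime:", slo_met(99.9, 30))
print("SLO met with 60 minutes of downtime:", slo_met(99.9, 60))
```

A dashboard report of the kind described above would compare the measured downtime of each service with this allowance for the same period.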
 

3. “Work to minimize toil”

 
For an SRE, all manual work is an abomination, yet unfortunately there is always something that requires manual intervention.

Operations teams always face this dilemma. A new project arrives, and you are asked to drop the improvements you are working on to take care of it. In this way you might, for example, add 100 servers to your 500 potentially problematic servers and end up with 600 potential problems. This kind of workload should be eliminated, because every manual intervention made for operational reasons steals precious time. That time could be spent on projects that fix larger problems affecting many more systems.
 

4. “Automate this year’s job away”

Google caps the manual operational work of its Site Reliability Engineers at 50% of their total working time, so that at least half of their time goes into engineering work that eliminates toil. So as an SRE you need to develop automation software that solves the problems you keep running into.
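The 50% limit can be tracked with very little code. The Python sketch below assumes that the figure is a ceiling on manual operational work and that hours are logged in two categories. The `WorkLog` class, its field names, and the example numbers are hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass

TOIL_CAP = 0.5  # the 50% ceiling mentioned in the text


@dataclass
class WorkLog:
    """Hours one engineer logged in a period (hypothetical structure)."""
    toil_hours: float         # manual, repetitive operational work
    engineering_hours: float  # automation and project work

    @property
    def toil_share(self) -> float:
        """Fraction of the logged time that went into toil."""
        total = self.toil_hours + self.engineering_hours
        return self.toil_hours / total if total else 0.0


def over_toil_cap(log: WorkLog, cap: float = TOIL_CAP) -> bool:
    """Return True when manual work exceeds the allowed share."""
    return log.toil_share > cap


week = WorkLog(toil_hours=24, engineering_hours=16)
print(f"Toil share: {week.toil_share:.0%}, over cap: {over_toil_cap(week)}")
```

When the check reports that the cap is exceeded, that is the signal to stop taking on more manual work and to invest in automation instead.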

Companies monitor how much time their employees spend on which kind of work. The biggest problem with outsourcing companies is that they do not count the jobs that could be automated, because they sell “person-hours”.

Outsourcing companies (MSPs) operate on a failed business model: when they are assigned a problem, their engineer spends a lot of time solving it, and that is how they earn money. A totally wrong business model!

In today’s world, this model is no longer valid. When defining the service content with the customer, the service provider must include automation in the service. The automation should be treated like an employee resource: it is the know-how that your company provides to the customer.

For example, installing a fully secure web service manually takes 8 hours, while automation can do it without mistakes in 1 hour. The service provider should still charge for 8 hours, with an explanation to the customer.

In conclusion, “Site Reliability Engineering” contains much more than what I have covered here. I have only shared some of the parts that I underlined, to give an idea of the book. The CEOs, CIOs, and CTOs of IT companies, and IT employees who are trying to see their future, can benefit from this book significantly. The principles should be decided company-wide; otherwise individual engineers will never be able to apply these practices alone.

I hope you enjoy reading it!
