The Book of Google Experience: Site Reliability Engineering
Google has collected the experiences of its employees, and what they learned from them, in a book: “Site Reliability Engineering”. The book both highlights where system administration and software development are heading and includes a lot of significant information about how Google turns its experience into money.
To learn a job, we study for years and memorize thousands of pages of theory, and it is not enough. As interns, we try to reinforce all the theoretical knowledge we have learned, and again it is not enough. We need to gain experience by working for at least a couple of years before we can confidently say “I learned”.
But at the end of the whole process, the only thing we learn is that, most of the time, theory and practice do not match. In time we understand that the theoretical knowledge we learned in school is only enough to get us to the starting line of our working life. What matters is what we learn through experience.
In this article, I will talk about the parts of the book that I found interesting and very important, but I recommend that you read the whole book. You can reach it online through this link.
I first heard the term “Reliability Engineering” at a workshop in 2001. To be honest, I could not have predicted that a discipline used mainly in airplanes and space vehicles would come to play such a large part in our lives. Reading in Site Reliability Engineering about how this discipline is practiced in the IT sector was like opening a big treasure chest.
As I read the book, I saw that I had traveled roads similar to those of the people who shared their experiences. But in some of them I also saw a very unusual point of view and incredibly creative solutions. In this respect, it was a very eye-opening read for me. The three parts of the book that impressed me the most are:
- Embracing Risk
- Service Level Objectives
- Eliminating Toil
I believe that these titles are quite to the point, because companies that supply cloud services or IT infrastructure have to position these three elements correctly within their organization and their business in order to sustain their existence.
My biggest gain: Principles
In the book, the revolutionary change taking place in modern technology is summarized by the terms DevOps and SRE, and the principles gained from real experience in this field are discussed.
I care about this a lot because our approach to technologies and technical topics can change with time and place, but developing our principles takes a very long time. I would like to share the principles that impressed me the most and that I found most noteworthy while reading.
The Principles of DevOps
“No More Silos (Versatilists)”
Until the beginning of 2010, it was emphasized that specializing in one branch is very important in working life. Just as we cannot carry two watermelons under one arm, it was said to be wrong to pursue other proficiencies at the same time.
But this point of view has changed in the last 10 years. Nowadays the exact opposite view is accepted as correct. Gartner foresees that 40% of IT employees will be “Versatilists” by 2021. The term Versatilist means “a person who has specialties in more than one discipline”.
Site Reliability Engineering says that all the occupational divisions that exist today, such as software developer, operator, infrastructure developer, and network engineer, will evolve into the “SRE Engineer” role in the coming years. I can say that I have also been experiencing this in the sector for a while.
Today a system administrator should be someone who can both develop software and make changes on the storage network. The “Division of Silos” defined with ITIL has been badly damaged, and being “Agile” is only possible when teams combine different disciplines. For example, the Unix team can no longer simply assign an issue to the Network team.
This is because the titles “Network” and “Unix” no longer exist in the customer’s IT team. All kinds of issues are expected to be solved by SREs.
“Accidents are normal”
The book underlines that accidents are normal and that they do not occur as a result of personal mistakes. I completely agree with that.
Exposing the person who caused an accident and blaming him or her for it is the biggest mistake. It signals the existence of mobbing within the company, causes information to be hidden between teams, and prevents the real causes of the mistakes from being found. To avoid these troublesome situations, you need to account for possible accidents at the very beginning, when you build your systems.
I would like to give a simple example from daily life. Let’s say you are setting up a server and configuring the disks. If you use LVM, you can resize your disk online. And if you have a performance issue, you can move your data to another disk while the operating system keeps running.
In this way, you will have foreseen situations such as capacity growth and performance issues from the beginning. You will also not be obliged to shut down your systems to carry out work on your server.
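The LVM example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the volume group, logical volume, and device names are hypothetical, and the helper functions are mine, not something taken from the book. The script only builds and prints the commands and does not run them, so it is safe to try on any machine.

```python
import shlex


def online_grow_command(volume_group, logical_volume, extra_gib):
    """Build (but do not run) the command that grows a logical volume
    and its filesystem while the volume stays mounted."""
    return [
        "lvextend",
        "--resizefs",                 # also grow the filesystem on top of the volume
        "--size", f"+{extra_gib}G",   # "+" means: add to the current size
        f"/dev/{volume_group}/{logical_volume}",
    ]


def online_move_command(source_pv, target_pv):
    """Build (but do not run) the command that moves data from one
    physical volume to another while the system keeps running."""
    return ["pvmove", source_pv, target_pv]


def render(argv):
    """Turn an argument list into a shell-safe command line."""
    return shlex.join(argv)


if __name__ == "__main__":
    # Hypothetical names, chosen only for the illustration.
    print(render(online_grow_command("vg_data", "lv_app", 10)))
    print(render(online_move_command("/dev/sdb1", "/dev/sdc1")))
```

The first command covers the capacity case and the second covers the performance case. Neither of them requires the server to be shut down, which is the point of planning for such accidents at the start.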
“Change should be gradual”
Everyone knows the monthly project meetings and the updated documents. Being well documented does not ensure that the big decisions taken will be implemented or that the participants will be ready for the next meeting.
For this reason, Site Reliability Engineering talks about the continuity of change and says that making change happen in baby steps is very important. As an example, suppose you would like to change the monitoring tool that you use in your company.
Trying to do it all at once, with big meetings about the topic, would be a significant challenge. By spreading the change over time, it is much easier to gradually adopt the second monitoring tool. Instead of holding big monthly meetings, holding small meetings throughout the process and collecting feedback from users helps you get the results you need.
“Tooling and Culture are Interrelated”
The tools you use and your corporate culture are related to one another. But your corporate culture should not change even if the tools change. The tools you use should not determine your company’s work culture. Over time, other toolsets may change the content of the work you do.
Suppose you have established a company and your principle is to use automation. For that, you choose Puppet as your automation tool. In terms of the SRE approach, your designs and processes should not be built around Puppet.
The features Puppet has today might not be there tomorrow, and for the service you have created you may have to develop features that do not exist in Puppet.
If, instead, you define your change management principles independently of the tool you work with, they will change little over time. Do not forget that if we shape our principles around our toolsets, then when we change our tools our principles disappear and the new tools become our principles.
“Measurement is crucial”
The title of one of the chapters in the book is “Service Level Objectives (SLO)”. When I read this part for the first time, I remembered a lesson I learned from an experience at the beginning of my professional life, when I was a very young and passionate engineer: “If you cannot evaluate the work you have done, then you are not the one who is doing that work.” Indeed, evaluating your achievements and failures is as important for you as it is for the people you report to.
In the IT sector, people always talk about how systems are not working, not about how they are working. If you do not set your own success criteria, aware of what you have done, then someone will always set the bar much higher than where you are supposed to reach. That constantly reduces the quality and the comprehensibility of your work.
So, Service Level Objectives (SLOs) should be evaluated at the level of employees, departments, and the whole company, and should be presented to users. Otherwise, whoever you report to (your manager, your boss, or your customers) will keep treating you worse with every issue.
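To make the idea of setting your own success criteria concrete, here is a minimal sketch in Python. It assumes a request-based availability objective; the function name, the 99.9% target, and the request counts are illustrative assumptions of mine, not figures from the book. The “error budget” in the sketch is simply the share of failures that the objective still allows.

```python
from fractions import Fraction


def slo_report(good_events, total_events, slo_target):
    """Compare measured availability with an SLO target.

    slo_target is given as a decimal string such as "0.999" so that
    the arithmetic stays exact. All inputs here are illustrative.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    target = Fraction(slo_target)
    if not 0 < target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    availability = Fraction(good_events, total_events)
    allowed_failures = (1 - target) * total_events   # failures the SLO tolerates
    actual_failures = total_events - good_events
    budget_remaining = 1 - actual_failures / allowed_failures

    return {
        "availability": availability,
        "slo_met": availability >= target,
        "budget_remaining": budget_remaining,  # negative means the budget is overspent
    }


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 400 of them failed.
    report = slo_report(999_600, 1_000_000, "0.999")
    print(float(report["availability"]), report["slo_met"],
          float(report["budget_remaining"]))
```

With a report like this, the bar is a number you have agreed on in advance and can show to your users, instead of one that someone else raises after every incident.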
The Principles of SRE
“Operations is a Software problem”
In fact, the original version of this quote in Google’s book is: “All problems are software problems.” It is an amazing, very effective, and long-established principle, because everything related to IT actually has software running in the background. Therefore, when Site Reliability Engineers (SREs) create solutions to their issues, they should use the principles of software engineering.
Here it is more meaningful to give an example from outside Google. AWS develops all of its services independently of the hardware and uses only a software stack. This is how it has led the public cloud sector since it was established: all the services it develops are software services, and the operational problems it has solved were solved with software.
“Manage by Service Level Objectives (SLOs)”
“Work to minimize toil”
For example, if you add 100 servers to your 500 potentially problematic servers, you now face 600 potential problems. This kind of workload should be eliminated, because every manual intervention you perform for operational reasons steals time from the projects that could fix the big pitfalls affecting many more systems.
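The arithmetic behind the server example can be sketched as follows. All the numbers (30 minutes of manual work per server, 40 hours to build the automation, 1 minute of residual work per server) are assumptions I made up for the illustration; only the 500 and 600 servers come from the example above.

```python
def manual_minutes(servers, minutes_per_server):
    """Manual work grows in step with the number of servers."""
    return servers * minutes_per_server


def automated_minutes(servers, build_minutes, residual_minutes_per_server):
    """Automation costs a fixed amount to build, plus a small
    amount of remaining work per server."""
    return build_minutes + servers * residual_minutes_per_server


if __name__ == "__main__":
    # Assumed figures, for illustration only.
    minutes_per_server = 30
    build_minutes = 40 * 60
    residual = 1

    for servers in (500, 600):
        by_hand = manual_minutes(servers, minutes_per_server)
        automated = automated_minutes(servers, build_minutes, residual)
        print(servers, by_hand, automated)
```

Under these assumptions, going from 500 to 600 servers adds 3,000 minutes of manual work but only 100 minutes of automated work. The gap widens with every server you add, and that gap is the time taken away from the bigger projects.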
“Automate this year’s job away”
Companies monitor roughly how much time their employees spend on what kind of work. The biggest problem here is that outsourcing companies do not have this kind of expectation. Their business model is: we receive a problem, our engineer spends a lot of time solving it, and that is how we earn money.
Unfortunately, in today’s world this model is no longer valid. When you define the service content with your customer, you must include automation in it. In other words, automation is like an employee resource that your company, with its know-how, provides to your customer.
In conclusion, “Site Reliability Engineering (SRE)” contains much more than what I have covered here. I have only shared some of the parts I underlined, to give an idea of the book. CEOs, CIOs, and CTOs of IT companies, and IT employees who are trying to see their future, can especially benefit from this book.
Enjoy your reading!