SRE 101

Luke Leslie Swithenbank
Expert360 Engineering
Jul 11, 2018


I’d like to try and distill the most basic concepts of what SRE does when it comes to deploying code. This is my attempt using a 1995 analogy.

I have some code I wrote and I want to get it to our customers.

Well, I’m still living in 1995, so the only way to get my code to customers is to put it on a CD. Once it’s on a CD I can mail it to our customers.

The CD you sent is scratched?

Look, we should probably send multiple CDs. It’ll cost a bit more, but it means that we don’t have to keep paying for shipping if one is scratched.

The computer doesn’t have a monitor attached. How do I know the CD is booting our code?

You’ll have to go to the computer and listen to the CD drive spinning. We realise this might not be the best check, so we are going to upgrade your CD drive to get an LED light. Green for working, red for not working.

The green light went on but the CD still isn’t running?

Look, I think we may have a problem with this version of the CD. Do you have an old copy of the CD you can use instead? I’ll send you a newer version soon to try and figure out why this isn’t working for you. Somehow this CD works in our office but not in yours. I guess you could say this “works on my machine”.

This new version works! But after 10 minutes it stops working.

I’ve upgraded the CD drive again. If the CD drive detects that the LED light has gone red and the version has stopped working, it reboots the CD.

The CD rebooting every 10 minutes is really jolting for someone using your software. They seem to race to try and do anything before the 10 minutes is up. Can we do anything to make it better?

Alright alright, I’ve upgraded the CD drive and now it loads 2 CDs at once. If one of the CDs fails, it immediately switches to the other CD without anyone noticing. The second CD is in the exact place that the first one stopped at.

This software is amazing!!! There are 10 of us now using the computer so that we can run your software. It seems that when 10 people use your software though, it freezes.

We didn’t design the CD to be used by multiple people but we love that you love our code sooooo much! Since you are a valued customer, we decided to buy you a larger computer so that all 10 of you can use it.

Too many customers want a larger computer. This is costing us too much.

Ok, how about instead of sending them a new computer, we just pay for leasing someone else’s larger computer? That way we only need to pay for what we use.

So based on the conversation above you can see how SRE was formed. It came out of a need to keep the software running when things go wrong. Here are the formal definitions for all the concepts above.

Formal SRE Definitions

Packaging

Packaging code is the art of putting that code into a form in which we can run it on any computer that the customer may have. Back in the day, everyone would assume you were running Windows 95 and just make a CD-ROM with a .exe file that booted up when you gave it to them. This is essentially what packaging code is.
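If you want to see what the modern version of burning a CD might look like, here is a minimal sketch in Python. It bundles a few files into a single zip archive in memory. The `package` function and the file names are made up for illustration, and real packaging (containers, installers and so on) does a lot more than this.

```python
import io
import zipfile
from typing import Dict


def package(files: Dict[str, str]) -> bytes:
    """Bundle a set of files into one zip archive, our stand-in for the CD."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    # The archive is finalised when the with-block exits, so the bytes are complete.
    return buffer.getvalue()


# Hypothetical application files, invented for this example.
bundle = package({"app/main.py": "print('hello')\n", "app/VERSION": "1.0.0\n"})

# Anyone who receives the bundle can list (and unpack) exactly what we shipped.
names = zipfile.ZipFile(io.BytesIO(bundle)).namelist()
print(names)
```

The point is that the customer gets one thing to run, not a pile of loose files.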

Deployment

A deployment is sending code to where it needs to run. This might be mailing someone the CD you made for them or it might be sftp-ing the files directly onto the server. A deployment also includes making sure that once it’s on the machine it is actually running. Usually a simple health check (like the green light above) is what is used here to make sure that the thing you sent got to the other end without any issues.
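The green light can be sketched in a few lines of Python. This is only an illustration: `deploy_with_health_check`, `start_app` and the `state` dictionary are names I made up, and a real health check would usually be an HTTP request or similar, not a local variable.

```python
import time
from typing import Callable


def deploy_with_health_check(
    start: Callable[[], None],
    is_healthy: Callable[[], bool],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> bool:
    """Start the app, then poll its health check until it passes or we give up."""
    start()
    for _ in range(attempts):
        if is_healthy():
            return True  # green light: the deployment is running
        time.sleep(delay_seconds)  # wait before looking at the light again
    return False  # red light: time to go back to the old CD


# Hypothetical app that only reports healthy once it has been started.
state = {"running": False}


def start_app() -> None:
    state["running"] = True


deployed = deploy_with_health_check(start_app, lambda: state["running"])
print("green" if deployed else "red")
```

If the function returns `False`, that is the cue to roll back to the previous version, just like asking the customer to use their old copy of the CD.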

Redundancy

Often during a deployment, we want more than one copy of the application running. The analogy of having a scratched CD is an example. By not relying on one specific CD, but on making sure that at least one of them works, you can improve the reliability.
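To put a rough number on that, here is a small Python sketch. It assumes each CD gets scratched independently of the others, and the 10% scratch rate is a figure I invented purely for the example.

```python
def chance_at_least_one_works(failure_rate: float, copies: int) -> float:
    """Probability that at least one of `copies` independent copies works."""
    # All copies must fail for the customer to be stuck: failure_rate ** copies.
    return 1.0 - failure_rate ** copies


# Assumed scratch rate of 10% per CD (not a real statistic).
for copies in (1, 2, 3):
    print(copies, round(chance_at_least_one_works(0.10, copies), 4))
```

Each extra copy cuts the chance of total failure by another factor of ten in this example, which is why a little redundancy goes a long way. In real systems failures are often not independent, so treat this as the best case.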

Quick Restarts

Rebooting your computer used to take 10–20 minutes. Engineers know the value in simply turning it off and on again, so instead of throwing this idea away, we just make it quicker. Often we can reboot something within 5 seconds or less if it’s done right.
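Here is a minimal Python sketch of the "turn it off and on again" idea, where a supervisor restarts a task every time it crashes. `run_with_restarts` and `flaky_task` are hypothetical names, and a real supervisor would restart a whole process, not call a function again.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_restarts(task: Callable[[], T], max_restarts: int = 3) -> Tuple[T, int]:
    """Run `task`, restarting it after a crash, up to `max_restarts` times."""
    restarts = 0
    while True:
        try:
            return task(), restarts
        except Exception:
            if restarts >= max_restarts:
                raise  # give up and let a human take a look
            restarts += 1  # turn it off and on again


# Hypothetical flaky task: crashes twice, then works.
calls = {"count": 0}


def flaky_task() -> str:
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RuntimeError("crashed")
    return "ok"


result, restarts_used = run_with_restarts(flaky_task)
print(result, restarts_used)
```

The cap on restarts matters. Without it, something that is truly broken would just reboot forever and nobody would find out.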

High Availability

Alright, so rebooting helps, but it is still noticeable when things are being rebooted. This is a bit more complex, but systems are now being designed to be highly available. All a highly available system is doing is swapping from a broken thing to something that’s working. An example might help here:
We can run 4 apps simultaneously but let the customer only see 1. If the one they are using breaks, something that is highly available will swap it out for one of the other 3. That swapping is done extremely quickly (1 second or less) and is done in such a way that the customer doesn’t notice.
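The four-apps example can be sketched in Python like this. `FailoverPool`, `broken_app` and `healthy_app` are names I made up, and the sketch assumes the apps keep no state of their own, so it skips the hard part of picking up in the exact place the broken one stopped.

```python
from typing import Callable, Sequence


class FailoverPool:
    """Send requests to one active replica and swap to the next if it breaks."""

    def __init__(self, replicas: Sequence[Callable[[str], str]]) -> None:
        self.replicas = list(replicas)
        self.active = 0  # index of the replica the customer currently sees

    def handle(self, request: str) -> str:
        # Try each replica at most once, starting with the active one.
        for _ in range(len(self.replicas)):
            try:
                return self.replicas[self.active](request)
            except Exception:
                # The active replica is broken: swap to the next one and retry.
                self.active = (self.active + 1) % len(self.replicas)
        raise RuntimeError("every replica failed")


def broken_app(request: str) -> str:
    raise ConnectionError("scratched CD")


def healthy_app(request: str) -> str:
    return f"handled {request}"


# Four apps running at once; the customer only ever talks to the pool.
pool = FailoverPool([broken_app, healthy_app, healthy_app, healthy_app])
response = pool.handle("login")
print(response, pool.active)
```

From the customer’s side there is a single call to `handle` and a single answer. The swap happens inside the pool, which is the whole trick.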

Scaling

Scaling is the ability to serve more than what you anticipated without making changes to the code. This means that I can just run more applications on larger machines and be smart about when and how to run them so that I don’t pay for more than what I use.
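As a rough Python sketch of "being smart about when and how to run them", here is the arithmetic for deciding how many copies of the app to run. The `instances_needed` function and the capacity of 4 users per copy are assumptions I picked for the example, and real autoscalers look at things like CPU and latency, not just user counts.

```python
import math


def instances_needed(active_users: int, users_per_instance: int, minimum: int = 1) -> int:
    """How many copies of the app to run for the current number of users."""
    # Round up so nobody is left without a seat, but never drop below the minimum.
    return max(minimum, math.ceil(active_users / users_per_instance))


# Assumed capacity: each copy of the app comfortably serves 4 users.
for users in (0, 3, 10, 100):
    print(users, instances_needed(users, users_per_instance=4))
```

Because the number is recalculated as users come and go, I only pay for the larger setup while those 10 people are actually using it.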

Hopefully this gives you a basic idea of what SRE does and the terms we use.
