Agility Requires Safety
To go faster in a car, you need not only a powerful engine, but also safety mechanisms like brakes, air bags, and seat belts. This is a talk about the safety mechanisms that allow you to build software faster.
The talk is based on the book “Hello, Startup”, which you can find here: http://www.hello-startup.net/
- Yevgeniy (Jim) Brikman, Co-founder, Gruntwork
Hi, everyone. This talk is called Agility Requires Safety. Where this comes from is that during my career I’ve had the opportunity to talk with an awful lot of tech companies, and I often hear this very weird sentence that sounds something along the lines of “we don’t have time for best practices” or sometimes you hear “we don’t have time to do it right.” And so you ask about monitoring and alerting and you get basically laughed out of the room, right? Don’t bother asking about documentation. That’s not even an option on the table. Maybe you thought they might be using devops practices.
Nah, they’re just tossing things over the wall and hoping for the best and there’s no tests. There are no tests of any meaningful kind. They just kind of throw everything into production. And so the result is the experience of building software at so many tech companies is something that looks a little bit like that, right? You’re trying to build something simple. You’re trying to do some basic operation and things are breaking. Things are falling apart. Everything is coupled. Everything is on fire and it’s just, it’s an awful, painful experience. So, one of the realizations I’ve had in my career, is I think software people think that if we just throw away all these best practices and we just kind of slam down and go as fast as we can, we’re somehow going to get things done faster. And, in general, and especially in the long term, I don’t think that’s true.
I don’t think you can go faster by being reckless, right? I shudder to think of what would happen if a construction team that’s building a skyscraper decided we don’t have time for best practices. We don’t have time to get this right. We’ll just get it done as quick as we can, right? I shudder to think of what people would do if they’re on the highway and they start thinking like this, right? You’re sitting in traffic, you’re bored and you’re like, I want to get to work faster. You know what I’m going to do, forget these best practices, forget speed limits and laws. I’m just going to slam down on my gas pedal and go as fast as I can.
The result is pretty predictable, right? And honestly, this is what software engineering often looks like and often feels like. The key insight that I want to share and capture in this talk is that the reality, for most people, is that what limits your speed in a car isn’t the power of the engine. Modern cars have really powerful engines and most of us aren’t using even half of what those engines can do. What limits our ability to go fast is that we would die if we went too quickly, right? It’s actually our safety mechanisms that limit our speed. So it’s things like brakes. Fast cars need really powerful brakes.
You also need things like seatbelts and bumpers and airbags and autopilot. And the more of these we have, the faster we’ll be able to go. We’re not limited by the engine. And I would say in software the same is also true. We’re not limited by typing speed. You can definitely type out way more code than you can actually ship, because if you tried to ship all of it, it would just break everything, right? That’s the limit. Safety is the actual speed limit for most of us. So, the question I’m asking in this talk is: what are the seatbelts, the brakes, and other mechanisms of software? What are the safety mechanisms that we should be using? And specifically, what are the safety mechanisms we can put in place that will allow us to go faster? Putting the safety mechanism in place has a cost.
It’ll take some time, but we want these mechanisms that pay off massively, and let us go much faster as a result. I’m Yevgeniy Brikman, often go by the nickname Jim. I’m the co-founder of a company called Gruntwork, where we provide devops as a service and we help a lot of companies with infrastructure and safety mechanisms. Also, the author of a couple books, Terraform: Up & Running, which is all about infrastructure as code. You’ll hear about that later in the talk and, Hello Startup, which is about a lot of startup topics, but has a whole chapter dedicated to software delivery where I talk about a lot of the same ideas.
So, here’s the outline for the talk. I’m going to go through four safety mechanisms. I’m going to use an analogy for each one, and then as we get into it, you’ll see what the software equivalent is of each one of these. So, let’s get rolling. We’re going to start with brakes. As we talked about, good brakes are essential on cars. In fact, the faster the engine, the bigger the engine, the better the brakes need to be and they prevent you from running into things you really don’t want to run into.
The equivalent in the software world to brakes is continuous integration and automated testing. So that’s what we’re going to focus on here. And, I want to pause and spend a little bit of time on continuous integration because I think a lot of people don’t deeply understand what continuous integration really is. A lot of people just think, oh, it’s Jenkins or it’s GitLab, and there’s a little more to it than that. So let’s look at an analogy and get a good sense of what continuous integration really is. Imagine you were assigned to build out the International Space Station, right? This giant spacecraft and you decided the way you’re going to do it, is you’re going to split up into a bunch of pieces and you’re going to assign each piece to one country.
That country is going to look at the plan and they’re going to go away and for years, maybe decades, they’re going to work on that thing completely in isolation. They’re not really going to talk to each other, check, nothing. They’re just going to work in isolation. And when everybody’s done, you’re going to launch things into outer space and put them all together. How’s that going to work out? Probably not very well, right? One of the teams is going to go, oh, wait a minute, I thought that the Russians were gonna be the ones that are going to do the bathrooms and didn’t they do it? No?! Someone’s going to say, wait a minute, I thought the French team was responsible for all the wiring.
And of course, most of the teams are gonna be like, well, it’s okay, everyone’s using metric, right? There isn’t like one country out there that just happens to not be using the metric system, right? Here’s the issue, when teams are working for a long time in isolation, they start to create these false, incorrect assumptions. And, figuring out what assumptions you got wrong at the very end, when you’re trying to launch, is way too late. That’s a very expensive way to learn that lesson. So, this idea, what I just showed you here, that’s essentially late integration, and a lot of teams build software this way.
They use feature branches, right? Each team has its own branch and they’re all working completely in isolation sometimes for months at a time, building whatever it is, not really integrating with each other, not putting their work together until the very end. At the very end, maybe once every three months or six months, they try to do some kind of massive release, and to do that they have to merge all their work together and the result is a gigantic merge conflict. And I don’t mean just a merge conflict that is, you know, a little text got changed here and here, and how do we put it together? I mean these teams have giant fundamental conflicts in what they’re putting together. Maybe the team in this blue branch at the top, they were working using a library that the team in the green branch just deleted, and so now you have 10,000 lines of code written around a library that doesn’t even exist anymore.
These assumptions can cause these fundamental issues that can take weeks or months to resolve, and they’re very hard to resolve, and you might not even realize it until you put the code in prod and that you’ve had these crazy incorrect assumptions. So, the alternative to building things that way is what’s known as continuous integration. And the core of continuous integration is this sentence right here: the goal is not about C-I servers or any of that. It’s about regularly merging your work together. Very, very regularly. Ideally every single day, but the key is don’t go for months without merging together. Very regularly you put all your work together, and so all of those incorrect assumptions get flushed out immediately.
Now, there’s a bunch of ways to do continuous integration. And the most popular is what’s called trunk-based development. And the idea here is that the way you merge work together is you basically force everybody to work on a single branch, typically called trunk, or master, or main. So everybody on your team, all the developers, are merging their work on a very regular basis, perhaps daily, to this one branch. Now, when I tell people about this and explain what trunk-based development is, I usually get one of two reactions. One is people who have done it and they’re like, makes sense, love it. And then the other is from folks who have never done it and they simply do not believe it’s possible. It sounds ridiculous. And so, I start getting all sorts of questions about it.
One of those questions is: okay, there’s no way this can scale. Sure, you can do trunk-based development with a team of three but I have dozens, hundreds, thousands of developers on my team. There’s no way it can scale to that. The reality is trunk-based development might be the only thing that scales. Most, or I would say many, of the major software companies in the world use trunk-based development. LinkedIn. Facebook. Google. Amazon. They all have thousands and thousands, maybe even hundreds of thousands of developers committing to the same repo, to the same branch. They all do trunk-based development. So, it definitely scales. Google’s numbers in particular are just astonishing. These are numbers they published back in 2015, so I’m sure the numbers have grown since. But they have a single repo with two billion lines of code and 45,000 commits per day. All around trunk-based development.
So yeah, it scales. You don’t need to worry about that. Then the next question I get is: okay fine, fine, maybe it scales, but wouldn’t you just have merge conflicts all the time, right? If the merge conflict was the big thing, well, if we’re all just merging together, then I’d be dealing with conflicts every day. The reality is that’s actually not what happens. When you’re doing feature branches, merge conflicts are pretty likely because maybe you have two teams, and for three months, they’re working across the code-base, and so the odds that those two teams touch the same files, in perhaps incompatible ways, are pretty high over that period. But with continuous integration, if you’re merging code into master every day, and you’re pulling the latest from master every day, the odds that two of you happen to modify the same files at the same time are a lot lower, and even more importantly, if you did modify those files at the same time, well, it’s only a day of work to merge.
It’s something you just did yesterday, right? So it’s actually really easy to fix these merge conflicts. They don’t result in, you know, these cascading thousands of lines of code that need to be cleaned up. And the thing to remember here is: merge conflicts are part of the process. There’s no way to avoid them, right? You’re going to be touching the same code. So the whole point of continuous integration is you’re solving these early and often, and that’s a really big deal. In fact, this is a common practice. This is another big part of a safety mechanism, this committing early and often. Small commits have huge advantages, right? They’re easier to merge. They’re easier to test. They’re easier to revert. They’re much easier to code review as well, right? We’ve all seen the code review that looks like this, right? You put up a pull request and it’s ten lines of code. You have ten comments on it.
You put up a pull request with 500 or 5,000 lines of code. That looks fine, ship it, right? That’s how code reviews work. So small commits are really, really valuable and continuous integration encourages and makes heavy use of small commits. Okay, so then the next question is: okay fine, maybe it scales, maybe the merge conflicts aren’t a big deal, especially if the commits are small, but wouldn’t the code on trunk always be broken? And so now, this is where those automated tests come in. This is the other key, incredibly important part of this particular safety mechanism. So the idea is, you configure a self-testing build.
In other words, after every single commit, the build runs a set of checks, right? It compiles the code, it runs linter tools, it runs automated tests, it does a whole bunch of checks to make sure the code is actually working the way you expect. So this is where those C-I servers like Jenkins finally come into the picture. And the key point here is that if the build fails, if some test fails, then more or less you kick the code out of trunk, right? You might revert it automatically immediately or maybe give the developer a little window of time to try to fix it if it’s something minor, but at the end of the day, broken code does not stay in trunk for more than a matter of, you know, minutes, and usually it’s kicked out right away. That’s a really, really big deal.
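To make the self-testing build a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the check names, the `run_build` and `gate_commit` functions, and the "keep"/"revert" strings are hypothetical, and a real C-I server like Jenkins would express this through its own configuration rather than this code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical check: a callable that returns True when the commit passes.
Check = Callable[[], bool]


def run_build(checks: Dict[str, Check]) -> Tuple[bool, List[str]]:
    """Run every check and return (passed, names_of_failed_checks)."""
    failed = [name for name, check in checks.items() if not check()]
    return (not failed, failed)


def gate_commit(checks: Dict[str, Check]) -> str:
    """Decide what happens to a commit on trunk after the build runs."""
    passed, failed = run_build(checks)
    if passed:
        return "keep"
    # Broken code does not stay on trunk: report what failed and revert.
    return "revert: " + ", ".join(failed)


if __name__ == "__main__":
    # Stand-in checks; real ones would compile, lint, and run the test suite.
    checks = {
        "compile": lambda: True,
        "lint": lambda: True,
        "unit tests": lambda: 1 + 1 == 2,
    }
    print(gate_commit(checks))
```

The point of the sketch is only the shape of the rule: every commit goes through the same checks, and a failing commit gets kicked out of trunk instead of sitting there broken.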
Now, of course, getting benefits from this does depend on having a good suite of automated tests. And this is where a lot of the investment into this particular safety mechanism comes in: creating the C-I system that’s going to run your tests and building a solid suite of automated tests. So an important question to ask is, what should you test? Now, there are some testing purists who will tell you everything. You have to have 100% code coverage. You have to do everything through T-D-D, etc. etc. I don’t really believe that, and at most of the companies I’ve worked with that’s not what happens in the real world.
The reality is you choose what to test by considering it as a trade-off. And it’s a trade-off between a few key things, and those are the likelihood of bugs, the cost of those bugs, and the cost of testing. So likelihood of bugs. Certain types of code are more likely to have bugs than others, right? Really complicated algorithmic solutions, you’re more likely to mess those up than some really basic, straightforward for-loop that does something simple. But, even more importantly, the likelihood of bugs goes up significantly as the team size grows and as the code-base itself grows. So we’ll come back to that point a little later in the talk, but just remember that as the code-base grows, you’re going to need more and more tests. And this is pretty similar to a car that has a bigger engine needing bigger brakes to stop you in time.
Second factor is the cost of bugs, and here the thing to remember is that there are some parts of your code where bugs are just not that big of a deal. Sure, some user might get annoyed, it’s a little bit irritating, but it’s not the end of the world. And then there are other parts of your code where you just cannot have bugs, right? In your payment systems, for example, you don’t want to be charging users two times or zero times. In security, right, authentication, authorization, you do not want to get those things wrong. That’s a very costly error that might be a company-ending event.
So there, you’re going to invest way more time in testing because the cost of bugs is really high. And then the third factor is: how much does it cost to do the testing? For some types of tests, like unit tests, the cost is really low, right? Most modern programming languages have unit testing frameworks readily available or even built-in. It’s easy to write them. They tend to run really fast. So the cost is really low and you should almost always write some amount of unit tests.
But integration testing can be more expensive, and U-I and end-to-end testing can be very, very expensive, and sometimes the cost of the test is higher than the cost of the bug. Like, it would have taken you five minutes to fix the bug, and, you know, no users would’ve really complained, whereas it would have taken you five hours to write the test. In those cases, it actually might make sense to skip the test or to reduce the tests to just a small number of high-value ones. So those are the key trade-offs. But, if you do a good job of those trade-offs, so you are doing continuous integration, everybody’s regularly merging into the same branch, and you have a self-testing build that basically runs tests after every commit, and rejects things that fail, there’s something really powerful that this safety mechanism does, which is you leave the world of late integration, where the default state of your software is that it’s broken, right? The default state is you just assume whatever code you have in all these feature branches, it probably doesn’t work.
And it doesn’t work until you do weeks and weeks of effort to merge it all together and then somehow manually prove that it works, and it’s kind of an awful process that actually slows teams down considerably. If you do continuous integration, now there’s this incredible shift where the default state of your code, assuming you have good tests, is that it works and you can deploy it anytime you want. You can deploy 10 times a day, 1,000 times a day, and that’s really the key. That’s why these large companies do trunk-based development: with a good self-testing build and everybody merging together regularly, you can deploy every day, many, many times a day, and really get software shipped very, very quickly.
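The three-way testing trade-off from earlier (likelihood of bugs, cost of bugs, cost of testing) can be sketched as a rough expected-cost comparison. This is only an illustration under stated assumptions: the `worth_testing` function is hypothetical, the probabilities and hour figures are made up, and real decisions involve judgment that a single formula doesn't capture.

```python
def worth_testing(bug_probability: float,
                  bug_cost_hours: float,
                  test_cost_hours: float) -> bool:
    """Return True when the expected cost of the bug exceeds the cost of the test."""
    expected_bug_cost = bug_probability * bug_cost_hours
    return expected_bug_cost > test_cost_hours


if __name__ == "__main__":
    # A cosmetic U-I glitch: five minutes to fix, five hours to write the test.
    print(worth_testing(bug_probability=0.5,
                        bug_cost_hours=5 / 60,
                        test_cost_hours=5.0))

    # A payment bug: unlikely, but enormously expensive if it happens.
    print(worth_testing(bug_probability=0.1,
                        bug_cost_hours=1000.0,
                        test_cost_hours=8.0))
```

With these made-up numbers the cosmetic glitch is not worth a five-hour test, while the payment code clearly is, which matches the intuition of spending your testing budget where bugs are likely or costly.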
Okay, let’s move on to the second safety mechanism, and the analogy for this one is bulkheads. So, bulkheads are a part of a ship. Usually when you build a ship, you separate the ship into these areas and put these giant walls between them, which are called bulkheads. And the idea here is, if you get a hole in the ship, if you hit something for example, and the water starts rushing into one part of the ship, the bulkheads prevent the water from getting into the entire ship and so you have a good chance of surviving that collision. And so basically, damage in one part does not cause a disaster everywhere. The equivalent in the software world is splitting up your code-base, so that if you make a mistake somewhere over here in the code-base, it doesn’t affect everything.
Now, why do we need this? Well, it turns out, and it’s a little bit weird as a software engineer, but the more code you write, the slower you go. So this is one of the things that actually slows you down: more code. In the book Code Complete there’s some great research around this, and what they did is they looked at the number of bugs relative to the size of a project. Now, of course as a project gets bigger, you expect there to be more bugs, but what they looked at was actually the bug density. So that’s the number of bugs per 1,000 lines of code, and what they found was that bug density went way up as project size increased.
So for example, if you had a project that was less than 2,000 lines of code, you’d expect there to be between zero and 25 bugs per 1,000 lines of code, but by the time the project reached over half-a-million lines of code, now we were looking at between four and 100 bugs per 1,000 lines of code, right? 100 bugs per 1,000 lines of code. That’s every time you write 10 lines there’s a bug. In another ten lines of code, there’s another bug. What that means is, as the code-base grows, the number of bugs, the density, actually grows much faster. So, bigger code-bases are going to be much buggier, which means you’re going to go much slower if you don’t do something to solve this. Now, the reason for all these bugs, like why would a bigger code-base have higher bug density? The reason is that we don’t really do software development in an I-D-E, or on a chart, or in some tool.
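To put numbers on that, here is a small worked calculation using the upper ends of the bug-density ranges quoted above. The two helper functions are hypothetical names for illustration, and the project sizes (2,000 and 500,000 lines) are just example inputs.

```python
def expected_bugs(lines_of_code: int, bugs_per_kloc: float) -> float:
    """Total bugs implied by a density given in bugs per 1,000 lines."""
    return lines_of_code / 1000 * bugs_per_kloc


def lines_per_bug(bugs_per_kloc: float) -> float:
    """How many lines you write, on average, per bug at a given density."""
    return 1000 / bugs_per_kloc


if __name__ == "__main__":
    # Small project, worst case of 25 bugs per 1,000 lines.
    print(expected_bugs(2_000, 25))      # 50.0
    # Large project, worst case of 100 bugs per 1,000 lines.
    print(expected_bugs(500_000, 100))   # 50000.0
    # At 100 bugs per 1,000 lines, that is one bug every 10 lines.
    print(lines_per_bug(100))            # 10.0
```

So in the worst case, the code-base grew 250 times bigger while the number of bugs grew 1,000 times bigger, which is the "density grows faster than size" effect.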
It’s really happening in your head. That’s how you code, right? You build some mental model of what’s happening in the code base in your head, then you figure out how to modify, and then eventually you put that into the I-D-E. But the real work is happening in your head. The problem is, our minds can only handle so much complexity, right? We just can’t handle it when we’re over half-a-million lines of code. You just can’t fit all of that into your head. You can’t consider all the ways the different parts of the code-base interact with each other. So you start having more and more bugs and you start going slower and slower. So, to solve that, you want to split up a code-base and specifically what you want to be able to do is, let’s say you have a million lines of code.
You want to find a way to organize things so that you can focus on one part of that code-base at a time and safely ignore the rest, and I do mean safely. So, obviously you can always ignore the rest of the code-base and make random changes, but then you create bugs all over the place and that makes you actually go a lot slower. What I’m looking for, is a way where I can ignore the rest of the code-base, while looking over here and be confident that as long as this little universe that I’m looking at is okay, that everything else will be fine too. And so there’s two primary mechanisms to accomplish that, and one is to move to versioned artifacts and the second is to move to services. So let’s look at these. So versioned artifacts.
You could keep everything in one repo and just publish artifacts. And artifacts are really the key difference, because when Module A no longer depends on the source of Module B, now you can modify the two of them independently, because they’re essentially looking at these, like, frozen-in-time versions of each other. And that has some really nice advantages. And by the way, we already do this all the time, right? This isn’t some new crazy thing that I’m suggesting, you do this all the time. If you’re using any open-source or third-party libraries, you’re probably not depending on the source code of those libraries directly. You’re probably pulling them in at some specific version. So, you know, Google Guava 18 or React J-S 16.5, you’re looking at a specific version.
And that open-source project is able to develop itself completely independently and go as fast as they can and not have to worry too much about you, and you can develop your own project without having to worry about breaking React J-S or Guava, right? That’s how we use open-source and third-party libraries already. You can do the same thing for your code-base inside of your own company and that has some nice advantages. One is isolation, right? The ability to work largely independently from the other parts of the code-base and even to ignore those fairly safely. The one place where you can’t ignore them is your public A-P-I.
So for example, if Module B over here exposes some A-P-I and Module A is using it, you can’t just change that willy-nilly. You do have to think through backwards compatibility and what happens if Module A eventually updates to the new version of Module B. But still, all the internals of Module B you can build by yourself, and you can even make backwards-incompatible changes as long as you provide a reasonable migration path to the new version. So isolation is good. You can go faster within your one module at a time. De-coupling is an interesting side effect. If you have a large code-base and you start breaking it up, you’ll often find that things are really tightly coupled together. You know, I like to think of it as like pulling a wire out of a box of wires, right, and everything seems to come up with it as well.
Breaking that stuff up actually has huge benefits. A lot of the bugs and issues that you’re often running into are because the code is unnecessarily coupled together. And so breaking it into these artifacts forces you to split it up, and often has some really nice benefits in reducing bugs and issues and in cleaning up A-P-Is. And then the third thing is another fun side effect: your builds get faster. Instead of having to build this entire code-base every time you make a change, if you’re changing Module B, you only need to build the code that’s in Module B, which is a really nice advantage. But, there are drawbacks. So, the first one is really important.
Those of you that have been really paying attention in this talk hopefully noticed that what I’m discussing here is more or less the opposite of continuous integration, right? In section one I said continuous integration, merge everything together on a regular basis, and now here I’m saying split everything into these artifacts so that you can do 1,000 commits in Module B and the people in Module A will never see those until much later on. And hopefully what you’re realizing from this is there aren’t silver bullets here, right? You have to pick the right tools for the job.
In some cases continuous integration is the best fit and source dependencies and everything working together, in other cases these sort of versioned artifact dependencies are going to be a much better fit. Usually the way that breakdown works is where are you spending your time? If, for example, these things are completely separate from each other, right, Module A is maybe a whole separate product or it’s a separate library that you could actually potentially open-source into the world, then separating that into a versioned artifact makes sense because they’re going to be developed separately, you’re going to be doing most of your work within Module A and a separate team will do most of its work in Module B. And so yes, they have public A-P-Is and how they interact, but that’s really the only interaction.
In those cases versioned artifacts will give you some advantages over continuous integration. But if this whole thing is one product that’s deployed together and versioned together and tested together and you do basically everything together, then separating into these versioned artifacts will actually be a really bad trade-off and you should instead stick with source dependencies and stick with continuous integration. Other drawbacks to versioned artifacts. You do get a little bit of dependency hell. There’s a lot of ways that this plays out, but for example, let’s say Module A depends on B, and also has a direct dependency on E. And let’s say it depends on E version one. Module B depends on E version two. So now when A pulls in B and E, which version of E should it get? One or two? Depending on the language and the framework and the tooling you’re using you’ll get different answers to that and different bugs as a result.
And you can run into all sorts of issues like diamond dependencies, you can run into circular dependencies, and, you know, this used to be called D-L-L hell. There are a bunch of weird things that happen when you break up into these versioned artifacts. And whether it’s worth dealing with them or not depends again on the type of software you’re building. Finally, more or less by design, it’s much harder to make global changes, right? If you needed to update every single one of these modules, maybe there’s some security fix that came out, that’s going to take a long time if they’re all separate versioned artifacts and they all have interdependencies.
You basically have to build a dependency graph. You have to start at the bottom of the graph, update the lowest layer, release new versions of those, then you go up one layer, update everything in that middle layer to use the newer versions, release new versions in the middle layer, and so on and so forth, and it just takes ages and ages. So if you often have to do global changes across the set of modules, versioned artifacts are not going to help you. They’re going to slow you down. But if global changes are extremely rare and 99.9% of your work is local within a module, then it’ll actually make you go faster.
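The "which version of E?" situation can be sketched as a small check over a dependency manifest. This is a minimal illustration, not how any particular package manager works: the `manifest` structure and the `find_version_conflicts` function are hypothetical, and the version of B ("1") is an assumption, since only E's versions are given above.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical manifest: module -> list of (dependency, required version).
Manifest = Dict[str, List[Tuple[str, str]]]


def find_version_conflicts(manifest: Manifest, root: str) -> Dict[str, Set[str]]:
    """Walk everything reachable from root and report dependencies
    that are requested at more than one version."""
    requested: Dict[str, Set[str]] = defaultdict(set)
    seen: Set[str] = set()
    stack = [root]
    while stack:
        module = stack.pop()
        if module in seen:
            continue  # also keeps circular dependencies from looping forever
        seen.add(module)
        for dep, version in manifest.get(module, []):
            requested[dep].add(version)
            stack.append(dep)
    return {dep: versions for dep, versions in requested.items()
            if len(versions) > 1}


# A depends on B and on E version one; B depends on E version two.
manifest = {
    "A": [("B", "1"), ("E", "1")],
    "B": [("E", "2")],
    "E": [],
}
conflicts = find_version_conflicts(manifest, "A")
print(conflicts)
```

The sketch only detects the conflict on E. What to do about it (pick the newest, pick the nearest, fail the build) is exactly the part that differs between languages and tools.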
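The bottom-up release order described above is a topological sort of the dependency graph. Here is a minimal sketch using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or newer). The three-module graph is a hypothetical example reusing the module names A, B, and E.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: module -> modules it depends on.
depends_on = {
    "A": {"B", "E"},
    "B": {"E"},
    "E": set(),
}

# static_order() yields each module only after everything it depends on,
# so this is the order in which to update and release new versions.
release_order = list(TopologicalSorter(depends_on).static_order())
print(release_order)  # ['E', 'B', 'A']
```

Even in this tiny graph a global change means three sequential releases. With dozens of modules and several layers, that chain of update, release, update, release is what makes global changes so slow.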
Okay, so that was one way to split up a code-base. The second way is to use services, or what these days have become known as microservices. I don’t know why that’s the cool new word, but we’ll go with services for now. So what’s the idea here? The idea is normally when you start building an app it is a monolith, and I don’t say that as a bad thing by the way, and you’ll see why in a minute. But it’s a monolith. It’s a single app. You deploy it essentially as one process and all the different parts of that app talk to each other through in-memory function calls.
As you grow, you might want to break this down into microservices, and so now each part of your application lives in a separate process, usually runs on a separate server as well, and now instead of communicating through function calls, they communicate through message-passing, usually over the network. So for example, these might be H-T-T-P calls that pass JSON data around. So that’s the idea with services: you move to this sort of network-based architecture. There are some advantages to this. One is, once again, you get isolation. So you could have one team that owns Service A and another team that owns Service B and they can more-or-less work independently from each other within their own little service worlds. Again, the exception is the public A-P-I.
In fact, we’ll talk about that in a second, but the public A-P-I with services is even harder to update. But other than the public A-P-I, you can more-or-less do what you want, have your own coding practices, and go at your own pace within each of these, which is a really nice advantage, especially for larger companies where you want teams to be able to run at their own speeds. Second advantage is services are technology agnostic. Since each of these things is typically a separate process on a separate server, you can build them using completely different technologies. A could be Java, B could be Python, E could be Node.
You can use the best tool for the job. Also, that’s useful if you’re acquiring companies that may have used a different technology than you. And then the final advantage is scalability. Services allow you to scale each one differently. So for example, maybe Service A can only be vertically scaled, so you just have to keep giving it more C-P-U and more memory. Whereas Service B maybe that’s easy to horizontally scale and you can just spin up a whole bunch of little servers and scale it that way. And by having them as separate services you have that ability, whereas if everything was in one monolith you’re basically stuck at the lowest common denominator, you’d have to scale everything vertically essentially.
So, those are some really nice powerful advantages of services, but they also have a ton of drawbacks. For one thing, you have a lot of operational overhead. Instead of having one thing to deploy and manage, the monolith, you now have “n” things. One, you know for each microservice. In each one you have to deploy it separately, configure it separately, monitor it separately, do security patches separately and so on and so forth. Everything gets multiplied. There’s a huge performance overhead. Services are better in some cases from a scalability perspective, but they’re generally much worse from a performance perspective. And the reason for that is we’ve switched from function calls in memory, to calls over the network. And if you go look up your latency numbers, you’ll see that network calls take two orders of magnitude longer than in memory and sometimes more.
So we’re talking something that used to take nanoseconds now takes an appreciable chunk of a second. We’re talking thousands of times slower. And so if you just try to naively switch to microservices, your code gets really, really slow. And so then to fix it you have to rewrite a lot of the code. You have to think about batching and caching and then you start dealing with things like thread pools or maybe non-blocking I-O, which is a different programming model. You have a whole new set of errors to deal with, right? A function call usually just works. A network call could fail. You might have to retry it, it could be slow, you could get half a response back. There’s all these new failure modes. I mentioned this earlier: backwards compatibility is another big drawback.
If we go look at this diagram, if Service B exposes some public A-P-I that A uses, you can’t just change that A-P-I. You can’t just delete, for example, the A-P-I or change some parameter, because as soon as you do that, since these are live services talking to each other, A will start getting errors. So the way you evolve A-P-Is in a service architecture is much more complicated and expensive. So for your public A-P-I, you’re actually likely going to go slower. But if most of your work is internal and the public A-P-I is pretty consistent, then you might go faster. And once again, for the same exact reason, by design it’s harder to make global changes. So splitting up a code-base: lot of advantages, several different ways of doing it, lot of drawbacks. So just make sure that you’re making the right trade-offs between having the code-base split up and having everything together with continuous integration.
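One common way to evolve an A-P-I without breaking live callers is to make changes additive: keep accepting the old request shape while introducing the new one. Here is a minimal Python sketch of that idea. The handler and the field names (`user_id`, `account_id`) are hypothetical and are not from the talk.

```python
def get_profile(request: dict) -> dict:
    """Hypothetical Service B endpoint that is renaming `user_id` to `account_id`.

    Old callers (like Service A) keep working because the old field is still accepted.
    """
    # Prefer the new field, but fall back to the legacy one.
    account_id = request.get("account_id", request.get("user_id"))
    if account_id is None:
        return {"status": 400, "error": "account_id is required"}
    # Return both the new and the legacy field until every caller has migrated.
    return {"status": 200, "account_id": account_id, "user_id": account_id}
```

Only once every caller has moved to the new field can the old one be removed safely, which is part of why public A-P-I changes are slower in a service architecture.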
Alright, third item we’ll go over is autopilot. Some cars these days and a lot of airplanes have autopilot to basically automate the things that the car is doing or the plane is doing, and the idea here is to remove people from the equation because human beings make mistakes all the time, and you don’t want to be slowed down by mistakes. Also, humans aren’t very fast at doing things whereas computers can do things very quickly, very accurately without mistakes. So the equivalent of autopilot in the software world is the automated deployment. The idea is to remove human beings from your deployment process. That’s the goal.
We don’t want manual steps in the process. That allows us to do it faster, and makes it a lot safer as well because the computer is not going to accidentally make a mistake. So, if you’re familiar with the idea of code smells, where you look at a piece of code and something just really seems off, kind of like it smells, well, there are also smells in the devops world. So one of them I would say is if you see that the way your team deploys things is by S-S-H-ing into servers and manually running a bunch of commands and configuring things by hand. Sometimes called clickops. That’s a smell. You’re going to have a lot of errors as a result and you’re going to go a lot slower as a result. Similarly, if you see your team members deploying things by going to a web U-I, maybe A-W-S, this is the A-W-S console, or Azure, Google Cloud, and they’re clicking all day to deploy things.
That’s also a smell that’s going to be slow and error-prone. What you want, the deployment process you should be aiming for, is this, it is a single, big, fat deploy button. You click it and that’s it, you as a human being, your role is completed. The rest happens automatically. In fact, if you want to get really fancy, you might even get rid of the deploy button, right? You might just deploy automatically as soon as your continuous integration and automated tests have passed the build. So, as little human involvement as you can get away with, that’s the goal. Now to do that, you have to automate things and you have to automate a lot of things.
You have to automate the infrastructure itself, how that gets configured, the configuration of your apps, the actual deployment itself, and so on and so forth. So there’s an awful lot that needs to be automated to make this happen. This is the investment for this safety mechanism. So I’m going to go over some of the tooling in the space that may be useful for automating these things. And I’m going to go over this roughly in the order of how these tools were developed historically, and so the ones towards the end are the more modern ones that you probably should be using. So, the first category was ad hoc scripts. When people first decided, okay, I need to automate the deployment of my software, you turn to your favorite scripting language, whether that’s Bash or Python or whatever else, and you just write a whole bunch of code to automate whatever that process is.
So here’s for example a simple Bash script that you can run on a Linux server to install some software on it. Now, the advantage is these are general purpose programming languages, so you can do whatever you need. The drawback is these are general purpose programming languages and you can do whatever you need. If you’ve ever had to maintain a large code-base of scripts for automation, especially Bash scripts, you’ll find that it’s very, very painful. You constantly have bugs, everybody writes the code in a different way. Most people don’t take into account some of the really important concepts that are essential for managing infrastructure deployments, like state management and idempotency. People miss these in these ad hoc scripts because they’re general purpose tools, so you just have to be aware of these things and it takes a long time to learn. So generally speaking, these should not be your primary tools.
You will use them. There’s all sorts of glue code and stuff that you’re still going to be doing with these general purpose tools, but these probably shouldn’t be your primary option for managing infrastructure deployments and configuration. Now, a lot of people realize this, so the second set of tools that we built out in the world, what are called configuration management tools. These are things like Chef, Puppet, Ansible, etc. And these were purpose-built for configuring the software that gets installed on a server. So here for example is some Ansible code. It’s Yaml for doing something similar to that Bash script, that’s basically installing some software on a Linux server.
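The slide code is not reproduced in this transcript. To make the idempotency concept mentioned a moment ago concrete, here is a minimal Python sketch of an idempotent configuration step, the kind of behavior configuration management tools aim to give you out of the box. The function name, file name, and setting are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already in the desired state.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already configured, so do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "app.conf"
        print(ensure_line(conf, "max_connections=100"))  # first run changes the file
        print(ensure_line(conf, "max_connections=100"))  # second run is a no-op
```

Running the step twice leaves the file in the same state as running it once, which is the property ad hoc scripts tend to forget.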
Now, the advantage of these tools is they are purpose-built for configuring servers, which means they have a lot of tooling built-in. So your code is a lot shorter. They have a bunch of patterns that you can use, so that it’s not just completely random. There’s certain expectations you can have about the code-base, and they solve some of the problems out of the box that people forget to do when they’re just using general-purpose tools, like idempotency. Like, don’t install the thing a second time if it’s already installed on a server. The drawback to these tools: they’re certainly better than just ad hoc scripting, but the tools themselves are pretty complex. It’s an extra thing to learn, and many of them require you to run extra infrastructure. So like a Chef Master Server or Puppet Master Server or multiple servers. They require you to open all sorts of ports and be able to connect to things.
You have to think about authentication and encryption a lot more, and one of the biggest issues I think is, most of these tools were designed to configure your production environments, but they kind of left your dev environment, which is where developers spent a lot of time, out of the equation. Very few people use these tools in dev so you didn’t really have a good parity between what production had and what you were doing in dev. So, the next layer that people developed were machine images and this is something that I think is extremely popular today. And I think this is what we’re mostly using in the modern world. And there’s different types of machine images. You can have virtual machine images and you can also have Docker images and there’s a variety of tools you can use to build these things. And so this is a bit of a mindset shift.
Instead of using a tool like Chef to go and configure that server and then that server, and that server, you basically just create a machine image. You create a single image that represents everything you want already installed and configured and then you can take that image and you can run it on all of those servers and you can also run it in the dev environment. Those are the big differences. So here’s an example of a Packer template that can be used to build an Amazon machine image, virtual machine for A-W-S and it installs a bunch of software on it and now you have this like immutable hermetically-sealed little artifact and you can now deploy it all over the place. So the strengths are these tools tended to be a bit simpler to use than Chef and Puppet.
They gave you these immutable versioned images, so they were a really effective way to get into immutable infrastructure, which has just a whole bunch of benefits, and you could run these images in every environment. Dev. Even your own laptop, you can run a Docker image on really easily. You can run it in the Q-A environment, staging and prod, so they gave you good parity across all your environments. So that’s why these are very popular these days, especially Docker. Now, the drawback is there are extra layers of abstraction. Certainly when running a virtual machine you have to virtualize the whole operating system and hardware. So that has all sorts of performance implications. But even more to the point, these tools are very useful, but they don’t solve the whole problem.
Just because you have a machine image doesn’t solve everything. For example, how do I get the underlying infrastructure? Where’s my server come from? Something still has to solve that, and then even once I have the server, how do I take my image and put it on the server and keep it running there? So you still need to figure out infrastructure and orchestration. So that’s where the next few tools come in. So, we have a set of tools that I call provisioning tools. These are for managing the infrastructure. So these are tools like Terraform and Pulumi and what they let you do is spin up all of your servers, configure your network, and your load balancers, your databases, all of that basic hardware, some of which may be virtualized in the cloud. These are the tools that are custom-built to manage that stuff.
Here’s an example of some Terraform code that deploys an E-C-2 instance, basically a server in A-W-S, and attaches a static I-P address to it. So it’s this very simple declarative language for doing these things. And so the advantages: these are purpose-built for managing infrastructure. Doing it with ad hoc scripts is hard and not fun. Doing it with configuration management tools, some of those had some first-class support for this, but they did it very poorly. These tools are purpose-built for managing infrastructure, and they do a really nice job of it, including handling a very hard problem which is, to manage infrastructure you have to maintain state. You have to remember what you deployed before so that you can update it in the future.
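To see why state matters, here is a small Python sketch that compares what the code says should exist against what was recorded after the last deployment, and works out what has to change. This is an illustration of the general idea only, not how Terraform or Pulumi is implemented, and the resource names are made up.

```python
def plan(desired: dict, state: dict) -> list:
    """Compare desired resources with the recorded state and list the actions needed.

    Both arguments map a resource name to its configuration dict.
    """
    actions = []
    for name, config in desired.items():
        if name not in state:
            actions.append(("create", name))
        elif state[name] != config:
            actions.append(("update", name))
    for name in state:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# State recorded after the previous deployment (hypothetical resource names).
state = {"web_server": {"size": "small"}, "old_queue": {"size": "small"}}
# What the code says should exist now.
desired = {"web_server": {"size": "large"}, "static_ip": {"attach_to": "web_server"}}

print(plan(desired, state))
```

Without the recorded state, the tool could not tell that `old_queue` needs to be deleted or that `web_server` only needs an update rather than a fresh create.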
The drawbacks to these tools, they’re new. These tools have only come out in the last few years. They’re still fairly immature. They still are missing a lot of the features you want. They have a bunch of bugs. Eventually, they’ll get better. But right now they’re still pretty new. They also introduce their own complexity. They’re new tools. There’s sometimes new languages. So learning how to do and manage these things is not always easy. Final category of tools are orchestration tools. These are things like Kubernetes, Mesos, E-C-S and Nomad. And these are designed specifically for managing apps.
So they assume the infrastructure is already in place. Maybe you used Terraform to spin up a Kubernetes cluster and then these tools will take your machine images, those Docker images and V-Ms, and they will deploy them across your hardware and they’ll monitor them, and they will redeploy them if they crash, and they’ll do rolling deployments, and a whole bunch of the other things that you need to solve to run apps in the real world. So here’s an example of code for Kubernetes, which is YAML, which says okay, I want to run a Docker image that has NGINX installed and I want to run it at a specific version.
This is this immutable infrastructure idea. I want it to listen on port 80, and I want to have three copies of it somewhere in my cluster. So it’s this very nice declarative language for capturing all of that complexity. So strengths: these tools are built for managing apps and they’re very good at it. They’re going to do a much better job of it than you would with ad hoc scripts or configuration management tools, and part of the reason they’re so good at it is, they maintain state. Again, they remember what you deployed before, how many copies of it you want deployed, they monitor it, they solve all of these very important problems with managing apps.
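The "remember how many copies you want and keep it that way" behavior can be sketched as a reconciliation pass. The Python below is a simplified illustration under assumed names (`nginx-1` and so on); it is not Kubernetes code and ignores scheduling, health checks, and rolling updates.

```python
import itertools


def reconcile(desired_replicas: int, running: list, new_ids) -> list:
    """Return the list of running copies after one reconciliation pass.

    `running` holds (copy_id, healthy) pairs; unhealthy copies are dropped and
    replaced, and extra copies are trimmed, until the count matches the target.
    """
    healthy = [copy_id for copy_id, ok in running if ok]
    while len(healthy) < desired_replicas:
        healthy.append(next(new_ids))  # start a replacement copy
    return healthy[:desired_replicas]  # stop any copies beyond the target


# Hypothetical cluster: three copies wanted, one of them has crashed.
ids = (f"nginx-{n}" for n in itertools.count(4))
observed = [("nginx-1", True), ("nginx-2", False), ("nginx-3", True)]
print(reconcile(3, observed, ids))
```

An orchestration tool runs something like this continuously, which is how a crashed copy gets replaced without a human being involved.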
The drawbacks: these are probably going to sound familiar. These tools are all relatively new so missing features and bugs are to be expected, and these tools introduce a lot of their own complexity. Learning something like Kubernetes, it’s like its own cloud. So you just have to take a lot of time to really understand it. So, key point with all of these tools is they allow you to define and manage all of your infrastructure as code, and that’s an incredibly powerful safety mechanism because with code you can version it, you can code review it, you can write automated tests, you can have continuous integration, you can reuse the code, you can apply all the other safety mechanisms we’re talking about to this code as well. So that will let you go much, much faster.
Okay, final piece we’ll talk about, and this is what’s called the safety catch. So I’ll explain what that is. Back in the 19th century, we had invented the elevator, but nobody, no human being was really willing to use it. And the reason was people were deathly afraid that if the cable snapped the elevator would plunge and you would die. And Elisha Otis invented what is called the safety elevator, and had this amazing demonstration for it where in front of tons and tons of people he had this giant open elevator shaft. You can see in this picture here. And he rode way, way up, was up really high, and was standing on his little elevator and then he had an assistant up here cut the cable in front of the whole audience. And the elevator dropped, but only a little bit and then immediately came to a stop and Elisha was completely fine.
So, how’d the, how did the safety elevator work? And by the way this thing transformed the world, this is what allows skyscrapers, this is what made people confident in the elevator. So here’s an image from the patent for the safety elevator, and what we’re looking at here is kind of a side view of the elevator shaft. You can see the elevator in the middle of the shaft. And if you notice along the sides of the shaft, there are these metal teeth that stick out, and in the elevator itself, there are these metal safety catches that stick out. And here’s the key point: by default these safety catches, their position is out.
So they stick out into the elevator shaft by default. And because of that they catch those teeth and the elevator can’t move at all, and the only way to pull those catches in, and allow the elevator to move, is if somebody pulls up with enough force on the cable. So, only when there’s an intact cable do those catches get pulled in and can the elevator move. And if the cable snaps, they pop right back out and the elevator comes to a stop. So here’s the key about this idea. So I think this is actually a really cool invention. It’s very clever. But to me what strikes me about it is these safety catches, they make the elevator safe by default.
It’s not some extra safety mechanism that jumps in at the last second. It’s actually safe by default. That’s a really powerful concept that I think we should copy in a lot of engineering, and one of the ways you can copy it in software engineering is what are called feature toggles. Feature toggles give your code some degree of safety by default. One of the reasons you might want to use a feature toggle, by the way, is this question: so often when I talk about trunk-based development, which I was talking about earlier in the talk, one of the questions that comes up that I didn’t answer then is, let’s say you were building a new feature that was huge.
It would take six months to a year to complete. How do you commit that to trunk all the time? Right, if it’s not done you don’t want to commit it and have it shipped to users. Well the answer to that is the feature toggle and it’s actually really simple, I’m sure you’ve invented it yourself in the past. So let’s say this is the code for some app you’re building and at the bottom in this H-T-M-L, we have the original code for our website, and then at the top, this is that new feature you’re building that’s going to take six-to-twelve months to complete. Well, what do you do so you can check this in without users seeing it before it’s done? Hopefully the answer is pretty easy. You put an “if statement” around it, right? Nothing fancy, wrap it in an “if statement”.
Have the “if statement” look up a feature toggle. And here’s the key: by default that feature toggle will return false. In other words, this feature will be off and so this “if statement” will evaluate to false and this new section will not be visible to any users. So with this tiny simple little “if statement”, now you can commit this code even before that feature is done. The code still needs to compile. It should be syntactically valid. So, there’s kind of some bare minimum that needs to be working, but the whole feature doesn’t have to be complete, it doesn’t have to be working, doesn’t have to be pretty because no user will see it.
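The slide code is not reproduced in this transcript, so here is a minimal Python sketch of the same idea: an unfinished section wrapped in an "if statement" that looks up a toggle which is off unless someone turns it on. The toggle name and the page content are hypothetical.

```python
# Hypothetical toggle values; anything not listed here is treated as off.
TOGGLES = {}


def is_enabled(name: str) -> bool:
    """Look up a feature toggle, defaulting to False (off) when it is not set."""
    return TOGGLES.get(name, False)


def render_homepage() -> str:
    sections = []
    if is_enabled("new_homepage_module"):
        # Unfinished feature: committed to trunk, but hidden from users.
        sections.append("<div>New homepage module (work in progress)</div>")
    sections.append("<div>Original homepage</div>")
    return "\n".join(sections)


print(render_homepage())
```

Because a missing toggle evaluates to off, forgetting to configure it can only hide the new feature, never expose it.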
That’s the key. So it’s safe by default. And this does something pretty magical. If you wrap all of your new features in these “if statements” in these feature toggles that are off by default, well, what you’ve done now is you’ve separated the act of deploying code from the act of releasing new features. Now, you can take your code and deploy it all day, every day, every server around your entire fleet. And none of the new features will be visible until you separately turn them on by flipping that feature toggle. And this is like a super power to have, this is an incredible safety mechanism. So how do you turn feature toggles on-and-off? There’s different ways to do it.
One of the ways is to just have a config file which you probably have for your app, and in some environments, maybe dev, you turn the feature on so you can code it, and then in other environments like production, it’s off and oh by the way it’s also off by default so I just list this just to be explicit so it’s more clear what’s going on. Configuration is good and that’s probably a good initial step for a company to do, but the next level up from that is even more powerful, which is you create some sort of a service maybe a data store where you’re storing the data for these feature toggles. So you can ping it and say hey should this be on in Environment X and should it be on in Environment Y? Even more importantly, if you have a service like that, you can actually return different results for different users. So maybe for user 1-2-3 you turn the feature on, but for user 4-5-6 you turn it off.
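Both steps can be sketched together in Python: a per-environment config as the first level, and a per-user lookup standing in for the toggle service as the next level. The dictionaries below are hypothetical stand-ins for a real config file and data store.

```python
# Hypothetical per-environment config. Production lists the toggle as off only
# to be explicit; off is also the default.
ENV_TOGGLES = {
    "dev": {"new_homepage_module": True},
    "production": {"new_homepage_module": False},
}

# Hypothetical per-user overrides, as a toggle service might store them.
USER_OVERRIDES = {"new_homepage_module": {"123": True, "456": False}}


def is_enabled(name, environment, user_id=None):
    """Per-user override wins; otherwise use the environment value; otherwise off."""
    overrides = USER_OVERRIDES.get(name, {})
    if user_id in overrides:
        return overrides[user_id]
    return ENV_TOGGLES.get(environment, {}).get(name, False)


print(is_enabled("new_homepage_module", "production", user_id="123"))
print(is_enabled("new_homepage_module", "production", user_id="789"))
```

In a real system the lookups would be network calls to the toggle service, usually cached, rather than in-memory dictionaries.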
So now you have this really powerful ability, where if you put a little web U-I in front of this feature toggle service, now, you can turn features on-and-off dynamically after the deployment has happened, right? So this is how you release new features: using a web U-I, and you can turn them on-and-off for specific users. So this is a screenshot of a tool called XLNT, from when I was working at LinkedIn, and this was a U-I that we used to turn feature toggles on-and-off, and so here I can turn this feature toggle, show new homepage module, on for 1% of users in the U-S, as an example. And this is incredibly powerful because now I have the ability to quickly turn things on-and-off whenever I want to. And the way we used that was like this: all new features were wrapped in a feature toggle, off by default, so we could deploy them anytime we wanted.
When we thought the feature was ready for use, we might turn it on and maybe initially we just turn it on for employees of our own company, so the rest of the world doesn’t see it, but our employees start testing it. If things seem to be working well, now we can turn it on for public users, maybe for 1% of users, and we look at the logs and we look at the metrics and we see is it working? Are there any issues? If not, now we ramp it up to 10%, 50% and eventually to 100%. If at any point we had an issue, we have this unbelievable safety mechanism where in a couple clicks we can turn that feature off again. And, sure, users aren’t going to be thrilled that they lost access to some feature, but nobody has to be woken up in the middle of the night, we don’t have to rush and work all night to fix some severe bug. We just shut it off and we revisit it and fix it when we can later on.
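One way to implement that kind of percentage ramp is to hash each user into a stable bucket from 0 to 99 and turn the feature on for buckets below the current percentage. The Python sketch below is an assumption about how such a ramp could work; it is not a description of the LinkedIn tool mentioned above.

```python
import hashlib


def bucket(feature: str, user_id: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature: str, user_id: str, rollout_percent: int) -> bool:
    """True if the user falls inside the current rollout percentage."""
    return bucket(feature, user_id) < rollout_percent


# Ramp a hypothetical feature from 1% to 100% and count how many of 1,000 users see it.
for percent in (1, 10, 50, 100):
    visible = sum(is_enabled("new_homepage_module", str(uid), percent) for uid in range(1000))
    print(percent, visible)
```

Because the bucket depends only on the feature name and user I-D, a user who sees the feature at 10% keeps seeing it at 50%, and setting the percentage to 0 turns it off for everyone.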
That’s a really powerful ability to dark launch things, to ramp them up slowly, and to turn them off again. One of the other things you can do with feature toggles is A-B Testing, or more generally bucket testing, where you can show different versions of your product to different users, and see if one version helps your metrics or if one version performs better the way you expect it to, so you can do data driven development. So feature toggles are really, really powerful safety mechanisms. There’s some nice tools out there that you can use to help build those web U-Is, and those data stores so you don’t have to build them from scratch. There’s Split I-O, LaunchDarkly, a bunch of others. So check them out, okay? So those are the safety mechanisms I wanted to go over.
There are of course many others, but these are four of my favorites. Brakes. Bulkheads. Autopilot and Safety Catch. And just to recap: brakes were continuous integration and automated tests. These are what stop broken code from getting out into the real world and doing a lot of damage. Bulkheads were how you separate different parts of your code-base so you can focus on one part at a time and safely ignore the rest and you can do that by using versioned artifacts, or you can do that by using services, or both.
Autopilot, this is infrastructure as code. This is the ability to automate the deployment of everything you’re doing. Automate your infrastructure, automate the deployment process, automate the configuration. Capture all of that as code and let the computer do it instead of a human being and you will avoid many, many errors and it will run a lot faster. And then the safety catches. These are the feature toggles. These allow you to separate deployment from release, these allow you to dark launch things, to ramp them up gradually, to shut them off if there’s any issues.
A really, really powerful safety mechanism. So, to recap things, speed is limited mostly by safety I think in the software world. If you want to go faster, you do need to think through these safety mechanisms. If you feel like your team is just not shipping code fast enough, think about what happened. Why? Right, what’s slowing you down? In a lot of cases it’s that when you go faster everything breaks and then you’re slow again. So, you really need to think about these safety mechanisms and it’s worth the time to put these things into place.
Basically don’t turn into this team, right? Take the time, put these in place, you’ll end up going faster. If you want to learn more, my two books talk about these concepts quite a bit. So Terraform: Up & Running and Hello, Startup. If you need help with any of these infrastructure and safety mechanism things, feel free to ping us at Gruntwork, and that’s it. Thank you very much.