Towards Application Driven Infrastructure


The evolution of infrastructure tooling is reaching the point where we can move beyond automatically provisioning infrastructure from static definitions, to dynamically generating infrastructure to fulfill the requirements of the applications that run on it.


  • Kief Morris
    Global Director of Cloud Engineering, Thoughtworks

Hi, thanks for joining me to talk about Towards Application Driven Infrastructure. What I mean by this title is that I want to take a quick look through the history of infrastructure as code: where it’s come from, particularly in relation to some of the trends that I think are relevant to where we’re going.

Then we can talk a little about where we are now in terms of infrastructure technology, the tools and languages for defining infrastructure. And then, where does this give us the opportunity to go? That’s the interesting bit, and it’s what I mean by application driven infrastructure in the title: I think the trend is now towards giving us the ability to take a more application-oriented approach to how we design and build our infrastructure. The important thing to keep in mind, as we work with infrastructure and with the tools, is that we’re on a journey. I don’t think these tools, and the ways of working, are quite where they’re going to end up.

I think we still have a lot of lessons to learn. One of my particular themes is that we need to draw more on practices, patterns, and principles from software engineering and software design, and work out how to apply them to how we define and manage our infrastructure when we’re using code and languages. I don’t think we’ve fully brought those in yet. There are a lot of lessons that have been learned in the software world that we haven’t really applied to infrastructure. This is why I wrote the book Infrastructure as Code.

The first edition came out about four-and-a-half years ago. The second edition is due out at the end of this year. The idea of the book was to think about ways to use code in defining our infrastructure based on design patterns and experiences of software delivery, particularly agile ways of working and agile engineering. Things like test-driven development and continuous delivery are a big part of how I approach thinking about infrastructure as code.

I work at ThoughtWorks, and I’ve been here for about 10 years. I’m the Global Director of Infrastructure Engineering, which basically means I work with teams and clients around the globe exploring ways of using cloud more effectively, ways of doing infrastructure and running projects, and these engineering practices. So I’m drawing a lot on what I’ve seen from working with different teams and clients. So, to go into a brief history of infrastructure as code. We originally focused on server configuration as the thing to do with code and scripting, because that’s where the action was. That’s where we spent a lot of our time and energy: installing applications, upgrading applications, and all kinds of things on servers. It was also the easiest place to run code that you could use to manage things on the server.

That was a natural thing to do, whereas things like networking devices and storage devices were a lot harder in the early days to apply code to effectively. But then, starting with virtualization and particularly with the cloud, we started having APIs that we could use to manage a broader part of our infrastructure. That meant our focus shifted up, and this is where we started looking at how to define infrastructure stacks: collections of networking, storage, and compute combined together.

How do we manage that as code? So that’s been one trend in the flow of infrastructure as code over the years. Looking at servers in particular, what did we do in the early days? We tended to write a lot of scripts, right? And these were imperative scripts: shell scripts, Perl, Python, whatever. This was before the term infrastructure as code was coined, because we didn’t really use these things in the holistic and comprehensive way that we do these days with infrastructure code. What we did was more task-oriented. We might have a script that runs in a cron job and reports on disk usage. Or we might have another script that we run to install some bit of software and put some configuration files in place.

So it was very task-oriented in those early days. Then in the mid-2000s, around 2006 onwards, infrastructure as code emerged as a thing. Actually, CFEngine was the first tool that implemented this approach, and that was in the nineties. Mark Burgess created this tool and really pioneered the whole idea of infrastructure as code, even before that name emerged. And then around 2006 to 2009, infrastructure as code, DevOps, and cloud all emerged, and they really complemented one another to drive this.

This is when we saw server-oriented configuration tools like Puppet, Chef, Ansible, and SaltStack emerge and let us do this kind of thing better, basically. The approach these tools tended to take was declarative: they let you state what you want to have on your system. I want to have this software package installed, I want to have this service running on these ports, I want to have some files in place and here are the permissions. So it was very much what I want to have, as opposed to how to do it, whereas those scripts that we used to write were very much step-by-step: do this, that, and the other. Another characteristic of these tools was that they tended to use domain-specific languages, DSLs. They invented a new language that was very stripped down, to focus on the concepts you need to declare for a server, expose those as language concepts, and not have very much else in there.
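To make the "what, not how" distinction concrete, here is a minimal sketch in Python. It is not how Puppet, Chef, or any real tool is implemented; `ServerState` and `converge` are hypothetical names invented for illustration, and the "server" is just an in-memory object. The declaration says what should be present, and a separate routine works out which steps are needed.

```python
from dataclasses import dataclass, field


@dataclass
class ServerState:
    """Hypothetical model of what is present on a server."""
    packages: set = field(default_factory=set)
    running_services: set = field(default_factory=set)


def converge(actual: ServerState, desired: ServerState) -> list:
    """Bring `actual` in line with `desired` and return the actions taken."""
    actions = []
    for pkg in sorted(desired.packages - actual.packages):
        actual.packages.add(pkg)
        actions.append(f"install {pkg}")
    for svc in sorted(desired.running_services - actual.running_services):
        actual.running_services.add(svc)
        actions.append(f"start {svc}")
    return actions


# The declaration: what we want, with no steps spelled out.
desired = ServerState(packages={"nginx", "openssl"}, running_services={"nginx"})

# A server that already has part of what we want.
server = ServerState(packages={"openssl"})

first_run = converge(server, desired)
second_run = converge(server, desired)
print(first_run)   # the steps needed to reach the declared state
print(second_run)  # empty: the server already matches the declaration
```

Because the routine only acts on the difference between the declared and actual state, running it a second time does nothing.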

A third characteristic of these tools and languages was that they were idempotent. You can run the tool over and over again, and the idea was that it was meant not just to carry out a task, like installing a server, but also to keep the server configuration in a known state that matches what you’ve declared. A lot of these themes have continued in tools since then. Another branch in this evolution of managing servers has been immutable servers. The idea of the immutable server is that when you want to make a change to the configuration of a server, rather than running a tool which changes an existing server instance (it’s not allowed to do that, or you don’t do that as a practice), you apply the configuration change to a new server, remove the old one, and swap them out.

The idea of this approach is around consistency: we don’t want to make changes to a running server, which could potentially introduce an error. Instead, we want to be able to test it before we load traffic onto it. There are a couple of aspects to how this relates to those tools. One is that it potentially simplifies the kind of code that you use. Those declarative tools, Puppet and Chef and so on, were written for the case where we’re not really sure what the starting state of the server is when we run the tool. It might be a fresh server with a certain version of an operating system on it. Or it might be a server that’s been running for a while, which may have had an unknown previous version of the code applied to it.

So those tools are designed to handle a fairly complex scenario, which isn’t necessarily as complex when you create immutable servers. One of the big use cases for these kinds of servers is more simplified cases, like creating Docker hosting nodes for a Kubernetes container cluster, right? In these cases the server is very much simplified: you can start with a very basic operating system, install Docker and maybe a couple of agents, and you don’t need to do too much more. From a known starting state, having a set of simple scripts to install and configure the different aspects is actually more practical. So it simplifies the configuration tasks. And I think one of the big influences of this mindset of immutable servers has come with containers, where we’re creating a very stripped down and simple server image. You can focus on creating the image, and within a Docker instance, for instance, you don’t tend to make changes to the configuration of that running instance. Instead, you create a new version of the image and you push that out.

So the immutable idea lives on at that level. Now, a little bit about how infrastructure stacks have evolved. Again, we started out writing imperative scripts. We wrote them in things like Python with the Boto library, or Ruby with Fog, and so on. You would basically write a procedural script, using a library which lets you interface with the API of the cloud or infrastructure platform, and then you would write the logic of how, as well as what. So this was where you would have to write the logic to decide, for instance: if I’m running a script for a server, is it going to create a new server every time I run it? What if I have an existing server that I want to change? Do I have to implement logic in there to decide whether to change an existing server, and how to handle those different cases?

So there was a fair bit that you needed to do in that script, and it wasn’t that easy to understand, by looking at your code, what the thing was versus how you were creating it. And so then came tools like Terraform, which was the first of these tools that I came across and started using. This was similar to what happened with tools like Puppet and Chef back in the day for server configuration. These were tools which split out defining what you want to have in your infrastructure, and then let the tool manage how to make that happen.

So again, you would have a declarative language, usually a DSL. You would use that language to express: here’s what I want my server to look like, here’s what I want my networking to look like. Then you apply it and let the tool work out whether to create a new one or change an existing one, how to handle error scenarios, waiting for things to provision, and all that kind of stuff. This was a big step forward, and I think it’s where the mainstream of infrastructure coding has been up until now. There’s been another branch in all of this, which is containers and the cloud native world. This is the idea that we can focus on our applications, which is a really valuable thing to do. We build our applications and package them up in a way that means we don’t need to care too much about what the infrastructure is. That’s abstracted for us.
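The "let the tool work out whether to create a new one or change an existing one" step can be sketched as a comparison between declared and existing resources. This is an illustrative simplification, not Terraform’s actual planning algorithm; the `plan` function and the resource names are hypothetical.

```python
def plan(desired: dict, current: dict) -> list:
    """Compare declared resources with existing ones and list the changes."""
    steps = []
    for name in sorted(desired):
        if name not in current:
            steps.append(("create", name))
        elif desired[name] != current[name]:
            steps.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            steps.append(("delete", name))
    return steps


# What the code declares (hypothetical resources and attributes).
desired = {
    "web-server": {"size": "medium", "image": "base-v2"},
    "app-network": {"cidr": "10.0.0.0/16"},
}

# What already exists on the platform.
current = {
    "web-server": {"size": "small", "image": "base-v2"},
    "old-bucket": {"versioned": False},
}

steps = plan(desired, current)
for action, name in steps:
    print(action, name)
```

The person writing the declaration never writes the create-or-update logic; it lives once in the tool rather than in every script.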

The important thing is the infrastructure still exists. Even with serverless, you still have servers underneath, even if they’re hidden away from you. And you probably have other infrastructure that you require: things like networking, if you’re going to have requests coming in to trigger your serverless code (I’m talking about function as a service here), or maybe some storage to store and read data, message buses, and those kinds of things. So there tends to still be some infrastructure that you need to configure. The nice thing is that you’ve got a nice contract, a nice division between those, so that it simplifies writing and packaging applications. And it also simplifies how you build the infrastructure and platform underneath, because you don’t need to worry so much about, say, what versions of run-time executables for Java or Ruby or whatever you need. That’s been separated out.

Those concerns have been nicely separated. People often ask me: with cloud native, do we not need to worry about infrastructure as code anymore? Is that no longer a thing? Well, as I mentioned, somebody still needs to provision some things. Who creates that Kubernetes cluster, and who manages it? We tend to still do that using some form of code at some level. Whether we’re using a cluster provided by a cloud provider, like AKS or EKS, or using a package to install the cluster onto servers that we’ve created, we still have work to do, and we probably still define that as code.

That’s where my focus tends to be when I think about infrastructure as code for these things. So where we’re at now, with our sponsors Pulumi and other tools like Cloud Development Kit, there’s a new paradigm, which is writing our infrastructure as software. In a way it looks like we’re going back to writing procedural code in an imperative language to define our infrastructure, so our code maybe has to do a little bit more of the how. But I think this generation of tools is a bit different in that they provide a lot more of the basics under the covers, so that you still focus more on what you want.

And they also bring some new things to the table, namely the ability to define, provision, and create infrastructure dynamically, which is where I think some cool opportunities come from. The question is, is this the end of declarative infrastructure? Is it all going to be these kinds of tools now? Are we going to go back to using general-purpose languages, TypeScript or JavaScript or what have you, to write our infrastructure code, and do it procedurally rather than declaratively? I think the answer is actually that there are different tools for different jobs.

At ThoughtWorks, we’re very much an application development-oriented organization, so I work with a lot of developers who are really gung-ho on the idea of having a real programming language for their infrastructure. And it is very easy to find infrastructure code that is just absolutely horrific, right? Especially when you’re using a declarative DSL like the HCL that Terraform uses, or YAML or JSON, where a toolmaker has crammed in mechanisms to make it programmable, so you have loops and conditionals and so on in these markup languages. That just creates a big mess, right? So I think there are problems with that. But I think a lot of the problems we have with infrastructure code today come from mixing concerns.

We’re doing multiple things in a single language, a single tool, a single bit of code, and then it doesn’t matter which type of language you use, or even which language: it’s going to be a mess no matter what. To give an example, there are a couple of different concerns, different things that you tend to need to do in your infrastructure codebase that maybe need to be addressed differently. One is defining the shape of an environment. This is where you write some code that says my environment, or a part of my environment, has these web servers, and this is what the web servers look like and how they’re built. Maybe some host nodes for running Docker instances, database nodes, and here are the networking structures around them, right?

So you’re defining what the environment looks like, but then you also want to have multiple instances of this shape, perhaps. If you create an environment, or a stack, or what have you, that provides the infrastructure to run an application, you’re going to want to reuse it across multiple environments, from dev and test through to production, so that it’s built the same way in each of them. But if you’re going to do that, you need to have some differences between those environments, right? If you have a cluster of servers, you’re probably not going to have as big a cluster in your non-production environments as you have in production. So you need to be able to configure aspects of these.

And I think where we often go wrong with infrastructure as code is when we embed that configuration into the same code. So you have code which says here’s my application server cluster, here’s the base image to use to create it, here’s the networking structure, the load balancer, and all these things. And, by the way, here’s some code which needs to work out how many nodes to have in my cluster based on certain things, like which environment it is. That’s where you take code which declares a thing and start cramming logic into that same code. And that’s where you get the mess, or one of the reasons why you get the mess. I’ll talk about another of the main cases in a moment.
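One way to keep those two concerns apart, sketched in Python with hypothetical names (`EnvironmentConfig`, `app_server_stack`, and the image name are all invented for illustration): the stack definition describes the shape and contains no per-environment logic, while the values that differ between environments live in separate configuration that is passed in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentConfig:
    """Per-environment values, kept apart from the stack definition."""
    name: str
    cluster_size: int


def app_server_stack(config: EnvironmentConfig) -> dict:
    """Declare the shape of the stack; no 'if environment == ...' in here."""
    return {
        "cluster": {
            "name": f"appserver-{config.name}",
            "base_image": "appserver-base",  # hypothetical image name
            "nodes": config.cluster_size,
        },
        "load_balancer": {"name": f"lb-{config.name}", "port": 443},
    }


# The differences between environments are data, not logic in the stack code.
ENVIRONMENTS = [
    EnvironmentConfig(name="dev", cluster_size=1),
    EnvironmentConfig(name="test", cluster_size=2),
    EnvironmentConfig(name="prod", cluster_size=6),
]

stacks = {env.name: app_server_stack(env) for env in ENVIRONMENTS}
for name, stack in stacks.items():
    print(name, stack["cluster"]["nodes"])
```

Every environment is built from the same definition, so the shape stays identical and only the injected values vary.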

So where I think we’re going is to have some different models for how we structure infrastructure projects. I think they differ based on teams. I’ve seen different organizations and different teams approach these in different ways, and it tends to depend on what they’re doing and who their people are. The first model is what most Terraform, CloudFormation, and similar projects are, which is what I’m calling a low-level stack definition. The point here is that those languages directly expose the low-level concepts from your infrastructure platform. They’re essentially a wrapper over the API that lets you declare the different things that the cloud vendor’s API lets you define, right?

When you define an environment with this kind of language, you’re really going into the details: what are the networking routes, routing tables, permissions. It’s very fine-grained. And with this kind of project, you’re defining the end environment that you want by assembling those low-level elements together directly. So it’s very thin as well. I think the use case for this is when you have infrastructure specialists who are the ones building the environments.

So when you have people who understand infrastructure concepts really well and who want to get down into that level of detail to be able to map things out, these kinds of tools, the declarative stack tools, are the tools for them. And I think another benefit of the declarative language, the DSL, is that it really simplifies things for them. They don’t need to know too much or think too much about how to write software, about software design and code design.

It really is just stripped down. It simplifies things to: define this piece of infrastructure, this piece of infrastructure, the connection between the two, and that’s it, right? Another characteristic of projects where this is the appropriate model is that the environments you’re defining, or the stacks, the parts of the environment, tend to be pretty static. They might vary in terms of some parameters (maybe you’re going to inject parameters that specify things like that cluster size I talked about before), but what kind of infrastructure elements get created, and the details of how they’re configured, aren’t going to vary very much. It’s fairly static. Then another model is a higher-level stack definition.

This is where you’re defining more of a domain concept, a domain entity, right? Think about, say, application hosting. I have an application, and I want to deploy it on some infrastructure. I will define the things that I need for my application. I might need a virtual machine, so I’m going to tell you what kind of operating system (maybe I’m running a Linux server, or Windows) and how much memory I need. Maybe some details on the traffic, like where requests are going to come in from. So you’re defining that at a high level, and then underneath are the components which dynamically create the infrastructure accordingly, right? This is obviously where tools like Pulumi come in, which let you write that intermediate layer.
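As a rough sketch of that intermediate layer, the following Python takes a high-level hosting declaration and expands it into low-level resource descriptions. It deliberately does not use the Pulumi SDK; `HostingSpec`, `build_infrastructure`, the resource types, and the address ranges are all hypothetical, and a real implementation would create platform resources rather than dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostingSpec:
    """What an application team declares (hypothetical fields)."""
    app_name: str
    operating_system: str  # for example "linux" or "windows"
    memory_gb: int
    public_traffic: bool


def build_infrastructure(spec: HostingSpec) -> list:
    """Intermediate layer: expand the high-level spec into low-level resources."""
    resources = [{
        "type": "virtual_machine",
        "name": f"{spec.app_name}-vm",
        "os": spec.operating_system,
        "memory_gb": spec.memory_gb,
    }]
    # Where requests come from decides the ingress rule (illustrative ranges).
    source = "0.0.0.0/0" if spec.public_traffic else "10.0.0.0/8"
    resources.append({
        "type": "firewall_rule",
        "name": f"{spec.app_name}-ingress",
        "allow_from": source,
        "port": 443,
    })
    # Only publicly reachable applications get a public load balancer.
    if spec.public_traffic:
        resources.append({
            "type": "public_load_balancer",
            "name": f"{spec.app_name}-lb",
        })
    return resources


spec = HostingSpec(app_name="catalog", operating_system="linux",
                   memory_gb=4, public_traffic=True)
resources = build_infrastructure(spec)
resource_types = [r["type"] for r in resources]
print(resource_types)
```

The application team only ever touches `HostingSpec`; the decisions about firewall rules and load balancers are made once, in the layer underneath.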

I think the usage of this kind of model is where you have application developers who need infrastructure and may not have the expertise in their teams. If you have a whole bunch of application teams, not every team is going to have very deep infrastructure knowledge embedded in it, and it’s not their focus, right? Developers tend to like platforms like Heroku, where they can write their code and push it, and they don’t need to get bogged down in what’s going on and in configuring the infrastructure at a low level. It’s a convenience for them to be able to focus on what they need to focus on. So I think this kind of model will appeal to those types of users.

Then I think underneath that, you’re going to have those infrastructure libraries, components, frameworks, what have you, built by specialists. From what I’ve seen, the teams that end up doing this type of work tend to be a mix. There’s infrastructure domain expertise: going back to that previous model, we had infrastructure experts defining environments, and now you’ve got infrastructure experts working within these teams, helping with how to pull together those different infrastructure elements. They’re looking at the code that these teams have and working out what kind of infrastructure elements to assemble and how to assemble them. They know how to do that very well. And these teams also tend to have software development knowledge within them.

So, people who are really comfortable with programming languages, and the tools, and how to test, and all that. These teams tend to blend these kinds of expertise. You have some individuals who are strong in both, and some who come from one side or the other, and as they work together they tend to learn from each other and build up their knowledge. So it’s this combined kind of thing. A note on this: I think one of the pitfalls we’ve seen, one of the sources of terrible, horrific infrastructure code, is trying to use declarative tools to write modules, right? This is where you say, okay, we created some different application servers, and they tend to have some common code.

We’ll pull that out into something like a Terraform module, or a cloud template of some kind, that we can reuse across other projects. And that model is very limited, right, because you’re writing those modules in a declarative language rather than an imperative language. If those modules are just reusing code, if it really is a static thing, like here’s a bit of code that creates a service that’s pretty much the same every time, that works out alright, because it’s essentially a declarative thing inside that module. But then you start trying to make that dynamic and say, well, let’s create the networking depending on different things. Is the traffic coming from public or internal sources? Maybe we have some different policies. Maybe we need to dynamically generate security roles or what have you.

It’s when you have to dynamically generate those things to handle different use cases that declarative code really doesn’t handle it well. When you see declarative modules that try to do this, that try to create an abstraction layer for other people to define infrastructure, it just doesn’t work, right? It’s just a poor way of doing it. I mentioned mixing concerns as one reason why infrastructure code gets really nasty and people want to go and use a real language. I think this is the other big case: where people are trying to do something more dynamic and create libraries, frameworks, and abstraction layers, you really do need a real language for that, an imperative language, and ideally a general-purpose language with a good ecosystem of support, right? So I think this is one of the strengths here.
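The public-versus-internal example is the kind of thing that is straightforward in a general-purpose language. The sketch below is a hypothetical library function, not taken from any real framework: `security_rules`, the `require_tls` policy name, and the address ranges are assumptions made for illustration.

```python
def security_rules(service: str, ports: list, exposure: str,
                   extra_policies: tuple = ()) -> list:
    """Generate ingress rules for a service based on how it is exposed."""
    if exposure not in ("public", "internal"):
        raise ValueError(f"unknown exposure: {exposure}")
    source = "0.0.0.0/0" if exposure == "public" else "10.0.0.0/8"
    rules = [{"service": service, "port": port, "allow_from": source}
             for port in ports]
    if "require_tls" in extra_policies:
        # Hypothetical policy: drop plain HTTP when TLS is required.
        rules = [rule for rule in rules if rule["port"] != 80]
    return rules


public_rules = security_rules("storefront", [80, 443], "public",
                              extra_policies=("require_tls",))
internal_rules = security_rules("inventory", [8080], "internal")
print(public_rules)
print(internal_rules)
```

Validation, branching on exposure, and policy filtering are ordinary functions here, and they can be unit tested like any other library code, which is hard to do when the same logic is squeezed into a declarative module.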

There’s a third model I’ll talk about, which is where you have specialized requirements, right? One of the pitfalls when you have teams using libraries, frameworks, and abstraction layers for building their infrastructure is that in some cases those might not meet their needs. An abstraction tends to limit what you can do. In some ways that can be a good thing: you want to simplify what people can do, and also make sure that everything is built properly according to your policies, good design, good operability, and those kinds of things. But sometimes you get teams who have more of an edge case and need to do something different. An example of that that I’ve seen is teams working on, say, machine learning.

So you have this abstraction layer in place, or a platform, that is focused on, say, application servers for your Java and .NET or whatever application stacks, doing web things and services and RESTful things and so on. And then you’ve got a team who’s trying to do machine learning, and they’re using weird tools with weird requirements for what they run on, and maybe using some unusual services from the cloud provider. These are cases where you need to have the option for people to do things a little bit outside the platform, right? There are a couple of ways this can go. In those teams, alongside the developers, you tend to have people who do have some infrastructure knowledge, or you need to bring people with infrastructure knowledge in to support these teams a little more closely.

And they can do that in two ways. In some cases these teams will go and use that static, declarative type of language, a low-level tool like Terraform or CloudFormation, because it gives them the level of control they need to do their thing. In other cases, it might be that there’s a rich enough set of libraries around, from a more dynamic library of infrastructure code, that the team can use for a lot of what they do, and then maybe they need to write some code just for the specific things that it doesn’t cover. So they write dynamic code for that, right? And maybe they write libraries of components that can then be reused.

And I think what’ll be interesting over the next few years, as these things all gain more traction, is to see what market of shared, reusable components and frameworks emerges, whether it’s open-source components or commercial or some mixture. I think we’ll start to see a lot more of those, so you’ll have a lot more options to draw on, maybe for your domain. So maybe frameworks that are tailored to some of the requirements of, say, financial services, set up to help you with compliance with regulations and those kinds of things. I think that’ll be an interesting space to watch. And this comes to the idea of application driven infrastructure, right? Again, this is what I talked about with that high-level stack project model, but applied potentially at a higher level. So you might define what your application needs.

So in this case, we’ve got a service which is a product browser for our online store. It’s going to have requests coming in from the public, so directly from end users. It needs a database, so we define a bit of information about what that is. It needs a run time: it runs on an application server, a Java application server. So we simply declare all that, and then the layer underneath can work out whether it needs to provision things. Maybe it needs to dynamically provision some specific components (maybe it’s going to provision a MySQL database cluster for this application), or maybe it’s going to reuse existing infrastructure. Maybe there’s shared networking that’s already out there, created by a common stack, or, say, a Kubernetes cluster created by another stack.
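The product browser example can be sketched as a declaration plus a resolver that decides, per requirement, whether to reuse something that already exists or provision something new. The field names, the `resolve` function, and the inventory of existing infrastructure are all hypothetical; this only illustrates the shape of the idea, not any particular tool or specification.

```python
# What the application declares about itself (hypothetical schema).
APP_SPEC = {
    "name": "product-browser",
    "traffic": "public",
    "database": {"engine": "mysql"},
    "runtime": "java-application-server",
}

# Infrastructure already created by other stacks (hypothetical inventory).
EXISTING = {"shared-network", "container-cluster"}


def resolve(spec: dict, existing: set) -> dict:
    """Decide, per requirement, whether to reuse or provision infrastructure."""
    decisions = {
        "network": "reuse" if "shared-network" in existing else "provision",
        "runtime": "reuse" if "container-cluster" in existing else "provision",
    }
    if "database" in spec:
        # An application-specific database is provisioned unless one exists.
        db_name = f"{spec['name']}-{spec['database']['engine']}"
        decisions["database"] = "reuse" if db_name in existing else "provision"
    return decisions


decisions = resolve(APP_SPEC, EXISTING)
print(decisions)
```

With the shared network and cluster already in place, only the database is provisioned; run against an empty inventory, the same declaration would result in everything being created.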

What will happen when this definition, the specification, gets read is that it’ll allocate: okay, I’m going to grab you some space on that cluster, or what have you, and make it work. Again, the idea is that you can think first about the application and what its requirements are, and then the infrastructure can be generated to satisfy those requirements. And this is what cloud native is, right? We have some emerging standards, things like the Open Application Model and things along those lines, which provide ways, frameworks, and specifications for doing that: for saying what your application needs and then creating the infrastructure underneath.

And I think this is good, right? This is a valuable thing. It’s a really useful approach. For new applications it tends to work well. I think the issue is that when people talk about cloud native, there’s the concept, which is kind of generic, it’s not really implementation specific, but then in practice when most people talk about cloud native, they’re really talking about containers, container clusters, and service meshes. And for a lot of people, when they say cloud native, it’s Kubernetes, right? It’s Kubernetes and Docker and maybe Istio. And so it’s kind of a fairly, I would say narrow, that’s a bit unfair, but it’s, you know, a certain kind of architecture, which is fine if your applications can fit into that architecture or if you’re building new applications for that.

That’s generally my advice, you know, when building new things: target this kind of a platform, because this is where things are going. But a lot of the clients that I work with at ThoughtWorks, you know, they’ve been around for, in some cases, decades. They’ve got a lot of stuff, right, and it’s not all going to fit into containers or run on Kubernetes, or it’s going to need to be ported, and that’s not always trivial, right? And so, you know, it’d be nice to have a way to satisfy the needs of existing applications and to kind of help with that pathway.

And so this is where, when I talk about application driven infrastructure, I’m thinking that we need to go broader than purely the cloud native, Kubernetes-based things, and create the ability to generate, you know, so we have that specification for an application and what infrastructure it requires. We should be able to support the idea that those requirements might be virtual machines, or they might be static networking structures rather than a service mesh. It might even be bare metal. We might need to have, you know, a tool like Crowbar or something go and provision a server on a rack in order to make it ready for this particular application, right? So I think this is kind of the value of the dynamic infrastructure programming tools like Pulumi: they open up the possibility to do this kind of thing.
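One way to picture going broader than containers is a single specification that can be satisfied by different kinds of hosting. The Python sketch below is only an assumption about how that dispatch might be structured; the function names and hosting labels are invented, and each function returns a description where a real implementation would drive a provisioning tool.

```python
from typing import Callable, Dict


def on_container_cluster(app: str) -> str:
    # Stand-in for allocating space on an existing cluster.
    return f"allocate space for {app} on a shared container cluster"


def on_virtual_machine(app: str) -> str:
    # Stand-in for creating a VM through a cloud API.
    return f"provision virtual machine for {app}"


def on_bare_metal(app: str) -> str:
    # Stand-in for a bare-metal provisioning tool racking a server.
    return f"provision racked server for {app}"


# Hypothetical hosting labels mapped to the code that satisfies them.
PROVISIONERS: Dict[str, Callable[[str], str]] = {
    "container": on_container_cluster,
    "vm": on_virtual_machine,
    "bare-metal": on_bare_metal,
}


def satisfy(app: str, hosting: str) -> str:
    """Pick the provisioner that matches the application's stated need."""
    if hosting not in PROVISIONERS:
        raise ValueError(f"no provisioner for hosting type {hosting!r}")
    return PROVISIONERS[hosting](app)


print(satisfy("product-browser", "container"))
print(satisfy("legacy-billing", "bare-metal"))
```

The application still states what it needs; only the mapping underneath changes when an older application cannot run in a container.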

And I think a critical thing that we need to do to make this all work is we need to get better at infrastructure design. We need to get better at drawing on lessons from the software engineering and software design world, and applying those to our infrastructure. And for me the fundamental thing, and this runs throughout the book, it’s something I’m constantly talking about: we need to build our systems so that they’re easy to change. I think one of the pitfalls we fall into with infrastructure is to kind of think of it as something that’s not really going to change.

We’re going to build an environment, we’re going to build a cluster, we’re going to build a thing, and then we’re done with it, right? So when, you know, I’ve talked to people about, well, how are we going to deliver changes to this? Should we make pipelines? Should we have some automated tests for this? It’s often dismissed. People often dismiss it and say, well, we’re not going to need to do this. We don’t need to have automated tests for our infrastructure, because it’s built, we test it, and we’re done. And then what you find is that actually teams who manage infrastructure spend a lot of time on things that are changes, even if they’re not thinking about them that way. So they’re rolling out patches. They’re making fixes.

They’re making improvements. They’re upgrading elements of the system, and this is constant. And I think you see this when you look at what happens in many organizations where the infrastructure function and capacity is seen as a real bottleneck. A lot of times organizations want to move to the cloud because they’re saying, well, you know, we just can’t get the environments we need, we can’t get enough, we’re running on really old versions of core software like operating systems and databases and application servers. We’re just not able to keep up, and it’s because in our infrastructure world, we haven’t really built it in a way that change is a routine thing.

We still view changes as an exception. And so that is something that we need to kind of get over, right? And so the important thing is to make sure that we’re able to change our infrastructure rapidly, that we can make frequent and quick changes, and that we can do them reliably, and safely, and repeatably. And these two things complement each other. They’re not things that we have to choose between, because actually the faster you can make changes to infrastructure, the more reliable you can make it, the better the quality; you can remove technical debt, you can improve it more rapidly.

And then the better the quality that you build into your system, the easier it is to make changes. So these are really important things, and that’s why I really tend to emphasize things like testing as a part of infrastructure as code ways of working. And another thing we need to do, to get into the design issue, is about making smaller units of change. So I was with an organization a while back which had a large set of infrastructure that they managed in Terraform, and it had grown so large that running terraform apply took anywhere from an hour to two hours. And so it was a very big Terraform project all in a single state file. And so when they first started looking at, well, how do we make this more manageable, you know, their answer was modules, right? Let’s break our Terraform project into modules and that will make it better organized and easier to change.

I think the issue there is that those things, like modules and other kinds of components that you use to assemble and create a stack, make a smaller unit of change at the code level. Okay, my code is now in a smaller project that I can define, manage, and version and even test, which is great, but it doesn’t change the unit of delivery, because, you know, the unit of delivery is still that massive stack that takes an hour or two to apply: you’re pulling a bunch of modules together and you’re still applying them in one go, with one state file, and so on. So really the path is to split things apart and treat stacks as components of the overall environment. And so this is where you have multiple projects, stacks, what have you, that have to integrate together.
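The difference between the unit of code and the unit of delivery can be shown with a deliberately crude model. In this Python sketch the component names and apply times are invented for illustration, and the assumption that a single-stack apply costs the sum of its parts is a simplification, not a measurement of any real tool.

```python
# Hypothetical apply times in minutes for three components.
APPLY_MINUTES = {"networking": 45, "database": 30, "app-server": 15}


def delivery_minutes(changed: str, layout: str) -> int:
    """Estimate how long it takes to deliver a change to one component.

    "monolith": every component is a module inside one stack with one
                state file, so each apply covers the whole stack.
    "split":    every component is its own stack, applied on its own.
    """
    if changed not in APPLY_MINUTES:
        raise ValueError(f"unknown component {changed!r}")
    if layout == "monolith":
        return sum(APPLY_MINUTES.values())
    if layout == "split":
        return APPLY_MINUTES[changed]
    raise ValueError(f"unknown layout {layout!r}")


for layout in ("monolith", "split"):
    minutes = delivery_minutes("app-server", layout)
    print(f"{layout}: change to app-server takes about {minutes} minutes")
```

Under these assumed numbers, modules alone leave the delivery time for a small change at the full 90 minutes, while separate stacks bring it down to the 15 minutes of the stack that actually changed.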

And so then this is where design comes in, because what often happens again in these cases is people say, okay, you know, I’ve got my stack that defines this part of the environment, but in order to test that stack, I’ve got to create a whole bunch of other stuff. I’ve got to create all the rest of the stuff, because of the dependencies on there. So we need to make sure that each stack is independently deliverable. And this is where I use a pipeline, to say that for each infrastructure project, each stack project, I can spin up and test an instance of that on its own without the other things around it. Part of the reason for doing that is to make sure that it’s correct, but stacks tend not to have that much variation.

So it doesn’t tend to need, like, loads of test coverage. It’s more of kind of a sanity check, but the really important thing is just to prove that, yes, I can create an instance of this stack on its own, because it forces the design, it forces loose coupling. And so this is the key thing here, and one of the things that I think we need to get better at with infrastructure, is saying, you know, how do I make my stack composable and more loosely coupled, where I can bring it up on its own. So let’s say I have application server infrastructure: I build an application server and maybe, you know, some networking around that, but I’ve got to deploy that into shared networking, VPCs, subnets, and all that kind of stuff.

So I depend on that, right? How can I test my application server stack project without also having to create the full networking stack, you know, from another stack project, in order to do that? How can I make that loosely coupled? And so there’s a few things to do. One is to avoid integrating deeply with other stacks. So you have things like the Law of Demeter, which says, like, my code in my stack shouldn’t know the details of what you’re providing in your infrastructure stack, right, in your shared networking stack. And so this is where I really dislike this whole thing of integrating a stack by looking at another stack’s remote state, for example, because you’re basically integrating at the data level.

So in the application development world, people used to do this with applications, where my application needs to integrate with your application, so I’m going to go connect to your database, and I’m going to look at your database schema, and integrate at that level. Then that means that it’s very difficult for your application to change its database schema without breaking my application. So that creates a coupling at the database, and in the software world we’ve learned that’s not a good thing, right? We’ve learned to kind of stop doing that and to identify that as a problem, as an anti-pattern.

In the infrastructure world, there’s still a lot of people recommending, go ahead and integrate your stack with somebody else’s stack at the level of, you know, the state file data structures, and so on. So I think what we need to do is to kind of abstract that out, and say, I’m going to build my stack so that it takes a parameter, so it needs a subnet to create my servers in, but I kind of don’t care how you create that subnet, what you name it, or anything. So you need something else to kind of, like, pull those together, and what that means is I don’t have to create an instance of the full shared networking stack, which might have a lot of things in it, you know, to have a fully robust production-ready hosting and networking infrastructure.

Maybe to test my stack I can just create a subnet and hook it into that and test it. I can use it like a fake, you know, what we have in the software testing world. We have, you know, fakes and mocks and these kinds of things. So you can do that with infrastructure as well, if you really kind of think about it and work on it. And this is one of those things that I think, again, as an industry, we need to work towards: recognizing that we need to have better patterns for this, and then implementing our tools and our ways of working to support that.
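To make the parameter and fake idea concrete, here is a small Python sketch under stated assumptions. `Subnet`, `AppServerStack`, `fake_subnet`, and `pipeline_stage` are hypothetical names, and `plan` returns descriptions of servers where a real stack project would create them through an infrastructure tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Subnet:
    """The only thing the application server stack knows about networking."""
    subnet_id: str


class AppServerStack:
    """Hypothetical stack that takes its subnet as a parameter.

    It does not read another stack's state or care how the subnet was made.
    """

    def __init__(self, subnet: Subnet, server_count: int = 2) -> None:
        self.subnet = subnet
        self.server_count = server_count

    def plan(self) -> List[str]:
        return [
            f"server-{i} in {self.subnet.subnet_id}"
            for i in range(1, self.server_count + 1)
        ]


def fake_subnet() -> Subnet:
    """Test fixture standing in for the full shared networking stack."""
    return Subnet(subnet_id="subnet-test-fake")


def pipeline_stage(stack: AppServerStack) -> bool:
    """Sanity check: the stack can be brought up on its own."""
    servers = stack.plan()
    return len(servers) == stack.server_count and all(
        stack.subnet.subnet_id in server for server in servers
    )


stack_under_test = AppServerStack(subnet=fake_subnet(), server_count=2)
print(stack_under_test.plan())
print("independently deliverable:", pipeline_stage(stack_under_test))
```

In production, whatever pulls the stacks together would pass in the real subnet; the stack code is the same in both cases, which is what keeps it loosely coupled.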

So this is the kind of stuff that I’ve got in my book. So please do have a look at it, pre-order it or, you know, have a look at it when it comes out. And also, you can reach me on Twitter @kief. Thanks a lot for taking the time to listen to what I have to say, and I hope to have some good conversations around this. Thank you.
