Upgrade Strategies: An Introduction for IaC
When you’re working with infrastructure, you’re inevitably going to need to upgrade or update that infrastructure. Whether it’s an operating system update or a desire to get CPU or memory upgrades, you will need the ability to pick resources and change them as necessary. In the past, this kind of upgrade would be done on the basis of individual resources, with each one being updated and checked either by hand or programmatically before moving on to the next resource. If you’ve ever done a database migration or followed the recommended way of upgrading your computer’s operating system, including all of the backup steps, you’re familiar with this process. Stand up the new resource. Check everything works. Move over the data. Check again. Tear down the old infrastructure. In a cloud computing environment, though, you’re often dealing with hundreds or thousands of resources, and replacing them one by one is a nightmare that takes ages. However, there are other options, many borrowed from the application deployment world, that we have available to us because we write infrastructure as code.
Generally, there are a few strategies for replacing something in a cloud computing environment. You’ll often hear about them as deployment strategies. These strategies generally differ based on the order of operations: Do you create a new resource before you delete the old one? Do you replace some and pause to gather data? Do you YOLO and just toss everything out and build new? All of these strategies are worth considering depending on the needs of your situation. A production system likely needs to maintain uptime, or the amount of time a system is available to the end user, to meet service-level agreements (SLAs), which might include an availability promise such as 5 nines (99.999%) of uptime. In that case, a YOLO strategy won’t be very acceptable, will it?
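To put a number on that availability promise, here is a small sketch of the arithmetic. The function name and the 365-day year are assumptions made for illustration; real SLAs define their own measurement windows.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, days: int = 365) -> float:
    """Return the minutes of downtime permitted over `days` at a given availability."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return days * MINUTES_PER_DAY * unavailable_fraction


# Compare a few common availability targets over an assumed 365-day year.
for label, percent in [("3 nines", 99.9), ("4 nines", 99.99), ("5 nines", 99.999)]:
    minutes = allowed_downtime_minutes(percent)
    print(f"{label} ({percent}%): {minutes:.2f} minutes of downtime per year")
```

Under those assumptions, 5 nines leaves a little over five minutes of downtime for the entire year, which is why tearing everything down first is off the table for a system with that kind of promise.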
Many of these deployment strategies initially came from the application world, but we can’t use them exactly as they appear for an application because there’s a bit more going on under the hood with infrastructure. We need to consider what we’re building and where we are on our stack. If we have multiple instances of our application running on identical containers, for example, we can treat the containers a bit differently than, say, the single load balancer in front of all of the containers or the node underneath a single-node Kubernetes cluster.
To illustrate all of the following deployment scenarios, let’s imagine we have a system where an application is deployed on some infrastructure. What the application is doesn’t really matter; we’re going to focus on the infrastructure here. We’ll pick a simple situation: There’s a security update for one of your pieces of infrastructure.
Let’s say the infrastructure you’re working with is a large number of virtual machines (VMs) that are running an outdated, insecure version of an operating system. If you weren’t concerned about uptime or the stored data on those systems, you could certainly just tear them down and stand up new ones. That might be the case if you’re working with an ephemeral development environment that isn’t in use right now, or one that doesn’t have any load just yet. Some folks refer to this kind of upgrade strategy as a “big bang” strategy, though folks who run data centers would likely say that the big bang strategy still at least required testing and careful consideration of data transitions, and that you would rarely completely wipe the original system before replacing it with another. If you’re wondering where this upgrade strategy came from, it was originally used in non-cloud-native systems where you likely didn’t have the hardware available to run other deployment strategies. In these cases, you would have maintenance windows and change processes, and the whole process was mapped out well ahead of time. This kind of fast switchover, especially with little to no testing, really is not a good idea for a cloud-native or cloud-based production environment, and it’s frowned upon if you’re working with a cloud-based environment that others are also using. There are so many better options than scheduling maintenance windows and ripping apart systems when you have the capability to stand up parallel virtual hardware.
Now, let’s say that instead of tearing down and then standing up new VMs, you instead were to create an almost identical environment with the new operating system, test it under load, and then transition your traffic over and wait for success before tearing down the old version. This is known as a blue-green deployment, and it’s a fairly popular strategy. A blue-green deployment is, in short, a create-check-delete process. You may have heard this strategy called “red/black deployment” or “a/b deployment.” The rationale behind using “blue” and “green” as names originally comes from needing easy names that didn’t have any kind of connotation of one group of systems being “better” than the other (e.g., “red” is the same color as alert lights, so don’t you want to keep the “black” deployment?). So you certainly can name it whatever you’d like in your company, but the concept is the same. The “blue” environment is currently running. To upgrade, you stand up an almost identical environment with the upgrade you want to perform. That environment is your “green” environment. Then, you run any test that you can against that green environment to ensure it’s ready. Finally, you switch your traffic over to the green system and monitor for any issues. Once you’re sure the environment is stable and can handle the load of your normal traffic (typically measured in hours to days depending on when you made the switch and what your traffic patterns demonstrate), you tear down the blue environment.
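The create-check-delete flow described above can be sketched in plain Python. This is a toy model rather than real provisioning code: the Environment and Router classes and the blue_green_upgrade function are hypothetical names invented for this illustration, and the monitoring period is reduced to a comment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Environment:
    """Hypothetical stand-in for a complete set of infrastructure."""
    name: str
    os_version: str
    running: bool = True


class Router:
    """Hypothetical stand-in for whatever directs traffic (e.g., a load balancer)."""

    def __init__(self, live: Environment) -> None:
        self.live = live

    def switch_to(self, target: Environment) -> None:
        self.live = target


def blue_green_upgrade(
    router: Router,
    new_os_version: str,
    checks: List[Callable[[Environment], bool]],
) -> Environment:
    """Run create-check-delete and return the environment left serving traffic."""
    blue = router.live

    # Create: stand up green next to blue. Blue keeps serving traffic.
    green = Environment(name="green", os_version=new_os_version)

    # Check: run every test against green before it sees real traffic.
    if not all(check(green) for check in checks):
        green.running = False  # discard green; blue was never touched
        return blue

    # Switch traffic over. A real rollout would now monitor green for
    # hours to days before moving on to the final step.
    router.switch_to(green)

    # Delete: tear down blue only once green has proven stable.
    blue.running = False
    return green


blue = Environment(name="blue", os_version="1.0")
router = Router(blue)
live = blue_green_upgrade(router, "2.0", [lambda env: env.os_version == "2.0"])
print(f"Traffic now goes to {live.name}, running OS version {live.os_version}")
```

The point of the ordering is that blue is only ever removed in the last step, so a failed check leaves the original environment exactly as it was.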
Note that, while a blue-green deployment traditionally means standing up and tearing down entire systems, you can apply it to a subsystem instead, so long as that subsystem is self-contained. The idea behind the rollout strategy is the same: You use a load balancer to transition traffic from one complete system to another. Since we’re talking infrastructure, the more traditional version needs to be modified by thinking of subsystems. For example, if you’re replacing the load balancer as well, you would stand up the green load balancer, point it at the rest of the blue deployment, switch the traffic to the green load balancer, and then complete the move from the rest of the blue deployment to the green deployment. Then, finally, the blue load balancer and the rest of the blue deployment can be decommissioned.
Now, a blue-green deployment isn’t the only upgrade strategy, and it’s not always the best one for a cloud-native system. There’s also a strategy called a canary deployment. In a canary deployment, you stand up new infrastructure and move traffic over in small increments, such as 5% of overall traffic. That small slice of traffic is considered a “canary,” an indicator of failure or success of such a move. While it’s often used in application deployment as a way to gather user feedback on an application change, this strategy is also useful for infrastructure as it highlights issues that only occur under true, random load before they become a problem for a full deployment. This strategy is fairly popular in microservices architectures and with platforms like Kubernetes, where it’s much easier to implement than on more traditional, VM-based systems.
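As a rough sketch of that incremental shift, the function below walks the new infrastructure through increasing shares of traffic and rolls back on the first bad signal. The canary_rollout name, the health callback, and every step size other than the initial 5% are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence


def canary_rollout(
    is_healthy_at: Callable[[int], bool],
    steps: Sequence[int] = (5, 25, 50, 100),
) -> List[int]:
    """Shift traffic to new infrastructure in increasing increments.

    Returns the history of traffic percentages sent to the new
    infrastructure. The history ends in 0 if the rollout was rolled back.
    """
    history: List[int] = []
    for percent in steps:
        # Send this share of traffic to the new infrastructure.
        history.append(percent)
        # The canary slice tells us whether it is safe to continue.
        if not is_healthy_at(percent):
            history.append(0)  # roll back: all traffic returns to the old side
            return history
    return history


# A rollout where the new infrastructure stays healthy at every step.
print(canary_rollout(lambda percent: True))

# A rollout where problems only show up once half the traffic arrives.
print(canary_rollout(lambda percent: percent < 50))
```

In the second call, the problem surfaces while half of the traffic is still on the old infrastructure, which is the kind of early warning the canary slice is meant to provide.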
The final upgrade strategy we’ll consider is called a rolling deployment. In a rolling deployment, each element of a system is replaced one at a time, with each new instance being checked before decommissioning and removing the old instance. That check is called a health check on some platforms, and the basic idea is that the deployment tool sends a request to the new element and waits for a response that indicates the system is functional and responsive as expected. Once the health check on the new element clears, the old one is removed, and the deployment tool “rolls” to the next element in the system being updated. We find this one in many Kubernetes-based applications, and it can work well for cloud-native infrastructure, as well.
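The rolling loop described above can be sketched in a few lines. This is a simplified model, not how any particular deployment tool implements it: the rolling_upgrade function, the "name@version" instance labels, and the health check callback are all hypothetical.

```python
from typing import Callable, List


def rolling_upgrade(
    instances: List[str],
    new_version: str,
    health_check: Callable[[str], bool],
) -> List[str]:
    """Replace instances one at a time, halting at the first failed health check."""
    result = list(instances)
    for index, old in enumerate(instances):
        # Stand up the replacement while the old instance is still in place.
        name = old.split("@")[0]
        candidate = f"{name}@{new_version}"

        # Health check: only retire the old instance once the new one responds.
        if not health_check(candidate):
            break  # keep the old instance and stop rolling

        # The old instance is removed and we roll on to the next element.
        result[index] = candidate
    return result


fleet = ["vm-1@v1", "vm-2@v1", "vm-3@v1"]

# Every replacement passes its health check, so the whole fleet is upgraded.
print(rolling_upgrade(fleet, "v2", lambda candidate: True))

# The replacement for vm-2 fails its check, so the rollout halts there.
print(rolling_upgrade(fleet, "v2", lambda candidate: not candidate.startswith("vm-2")))
```

Halting on the first failure is one possible design choice. It leaves a mix of old and new instances running, so the system stays up while someone investigates.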
In the next article in this series, we’ll explore how these three strategies are similar and different for infrastructure as code, and why you would use one over the other. Stay tuned!
In the next parts of the series, we’ll try these deployment strategies with Pulumi, exploring how to use code to define each kind with a test system. Watch this space!
While you’re waiting, we did a few videos on this topic over at PulumiTV, like this one on blue-green deployments with Pulumi and Python on GCP.