Creating a Cost Aware Culture to Drive Cloud Optimization

As companies transition to AWS, the flexibility in your costs can often bring concerns on how you manage it and how do you avoid that surprise bill at the end of the month. Cultural change is often needed in organizations as they move from IT being primarily fixed costs to pay-as-you-go pricing. A great cost culture is where IT, Finance, and your Business are working closely together, reviewing the IT strategy and spend. Having this cost culture will ensure that the cost of running on AWS is forefront in any strategic decisions or even during the everyday monitoring of the environment.

Presenters
  • Alex Head
    OPTICS Manager, Global & Strategic Accounts, AWS

Transcript

My name is Alex Head and I’m talking today about creating a Cost Aware Culture to Drive Cloud Optimization. I work for Amazon Web Services and I run a team called OPTICS, Optimization Intelligence for Cloud Systems, and we connect the dots between technology, finance, and business for some of our largest global and strategic customers. To give you a little bit more detail on what a typical engagement for my team would look like, here are some examples. So things like a well optimized review, which is where we would go over the low-hanging fruit and the top-five opportunities to optimize and become more efficient.

A cost intelligence dashboard, which is a tool that my team created to give customers a little bit more insight into their cost and usage data and ways to visualize it that might be beneficial for leaders in an organization and also the people whose hands are on the keyboards. Learning opportunities, and we call one of ours FinHack, which is kind of a hack-a-thon to save money and drive that efficiency and optimization. And a huge topic that we talk about is developing and integrating cost culture into a company. And that’s a lot of what we’re going to go through today.

So working with a wide range of customers and also having previously been a customer myself, I’ve seen a lot of different cloud journeys. I’ve seen companies that are born in the cloud. I’ve seen companies that have done big migrations or have had multi-cloud environments. It all varies with the company and the industry and the size. But one thing that is pretty consistent is they’ll start out, as you can see on this graph, kind of slowly growing their footprint and testing around, and then all of a sudden they have two months where they just go way up. And they realize that they need to put in some controls to track their footprint and track their cost and really grab insights from that. And so when developing that cost aware culture, we really want to start at the basics and grow to, you know, where you should be doing this on a day-to-day basis. Even if there are already things in play and mechanisms that are going on, it really helps to go through and redefine this and make sure that it’s accessible to everyone.

So today I’m going to break this up into four different topics. First is establishing the visibility, then defining success, implementing controls, and then how do we drive that accountability? How do we get people to care? And all that really leads to implementing this cost aware culture that doesn’t necessarily have a fire drill like you might see in this graph of oh no, I spent too much, or oh no, this footprint grew really big, but makes it a day-to-day thing that’s implemented in development processes, being really aware of what we’re doing and the cost implications of that.

So first, let’s start with the visibility, and this has to be the first step. You have to know where people are getting their data and how they’re viewing it, and does it make sense for every part of the business? The key things are, we want to be consistent, right? We don’t want to change every week: hey, now look at this report for your data, or hey, finance is going to use this tool, but teams, you can use this tool. You want to find something and create something that people across the organization can use. Next would be accessibility. Is it something that just team leads or managers can see? Or can anyone go see it? And that’s really important because if you want people to care about their cost and their optimization, they need to be able to see the nitty-gritty of it. And detailed, right? We want as much detail as possible. And if we can offer an overview for someone who might not want to see that, oh, this individual E-C-2 instance is doing exactly this with cost, then allow that view too, but the detail is important in case people want to dig down into it.

If people understand the data, they’re more likely to work with the data. My team created the Cost Intelligence Dashboard. It’s a Well-Architected lab that anyone can do, and it creates these views that you’re seeing here. So, things like what’s my usage cost? And how is that growing? Where do I need to be aware? You know, what does my deep dive into storage look like, or compute? And having different team views. So maybe you log on and you’re part of Project A and you just see the account details for Project A. If we’re reporting back to the organization on these views, then we can give people this tool as a dashboard to really drive those. And it’s important, with a tool like this for example, that anyone can go in and create views that only they can see, right, but the underlying data is consistent and is what everyone is using.

And so it really gives that happy medium of, we’ve defined the data, we’ve defined the visibility, but we’re also giving people the flexibility to learn in their own way and get to know this data and present it in a way that makes sense for what they’re doing and what their goals are. So once that visibility is established, and not everyone will get it on the first try, right? Be open to trying multiple tools, or messing around with the raw data yourself, or combining different data sources. But it’s important once that visibility is established to then define success. What does good look like for you and your organization? And what does good look like for each team? So here I listed out some of the top K-P-Is that I see customers track across the board.

And the first one is percent growth, which is not necessarily going to tell you how optimized you are, but it’s always a good one to see, right? Are we tracking normal? Are we looking back historically? When we release a product, our growth percentage usually goes up to this. Or when we sunset something, we’re able to see this change. And it’s just a good metric to consistently watch, and also great when it comes to forecasting and budgeting for your next quarter or year. Serverless growth is also a big one, right? A lot of teams might have a goal around going serverless. Or the products that have higher serverless growth, what do they look like? How do we change this architecture? And how do we define that? The next two are two that I think everyone, no matter what cloud platform you’re on or the size of your team, should be tracking. And that is storage and compute unit cost.

And that’s important because you can grow or you can turn things off and your spend can go up and down, and up and down, but the unit cost is going to be a consistent measure of how efficient you are. So say you’re tracking your storage unit cost and you know that for every gigabyte stored, it costs you this amount. And then you do a huge push to move a bunch of stuff into storage, but you do it in a way that you’re utilizing different tiers and you’re putting a lot into cold storage. Even though those storage costs went up, you’re also going to watch that storage unit cost go down, because you’ve become more efficient in how you’re doing it. And the same thing goes for compute. So maybe you’re using instances that are better for your environment, that are right-sized better, and you’re going to see that unit cost get better even though your footprint might be growing.
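As a rough illustration of the storage unit cost point above, here is a minimal Python sketch. The volumes and per-gigabyte prices are made-up assumptions for the example, not actual AWS pricing, and the `unit_cost` helper is hypothetical.

```python
def unit_cost(total_cost: float, units: float) -> float:
    """Cost per unit, for example dollars per gigabyte stored per month."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


# Before: 100,000 GB, all in one standard tier at an assumed $0.023 per GB.
before_gb = 100_000
before_cost = before_gb * 0.023

# After: the footprint grows to 250,000 GB, but most of it sits in cheaper
# tiers. Each entry is (gigabytes, assumed price per GB).
after_tiers = {
    "standard": (50_000, 0.023),
    "infrequent_access": (80_000, 0.0125),
    "cold": (120_000, 0.004),
}
after_gb = sum(gb for gb, _ in after_tiers.values())
after_cost = sum(gb * price for gb, price in after_tiers.values())

print(f"before: ${before_cost:,.0f} total, ${unit_cost(before_cost, before_gb):.4f} per GB")
print(f"after:  ${after_cost:,.0f} total, ${unit_cost(after_cost, after_gb):.4f} per GB")
```

With these assumed numbers the total storage bill goes up while the cost per gigabyte goes down, which is the pattern the unit cost metric is meant to surface.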

It’s also really helpful for when you have a new team or new project coming on because they’ll have a benchmark. They’ll know, okay, these teams have a unit cost of about this when using storage and compute. So we need to make sure that that’s our benchmark, that we have to be at that point or better. And then my personal favorite is elasticity, and that’s because with this one it is so easy to calculate the savings, to really mess around with it and watch the changes and watch how teams get better at it. And so, if you’re building an application in the cloud, one of the reasons you’re doing it is because you’re getting that elasticity, but not everyone uses it, and a lot of things get left on 24/7. It might be a sandbox environment or a non-production environment that really is only needed during core hours of the day. Or maybe it’s something that can size down during certain hours and then size back up.
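To show why elasticity savings are easy to calculate, here is a small sketch that compares an always-on environment with one that only runs during core hours. The hourly rate, instance count, and schedule are assumptions chosen for illustration, and `weekly_schedule_savings` is a hypothetical helper.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_schedule_savings(hourly_rate, instance_count, hours_on_per_day, days_on_per_week):
    """Return (dollars saved per week, fraction of the always-on cost saved)."""
    if not (0 <= hours_on_per_day <= 24 and 0 <= days_on_per_week <= 7):
        raise ValueError("schedule must fit inside a week")
    always_on = hourly_rate * instance_count * HOURS_PER_WEEK
    scheduled = hourly_rate * instance_count * hours_on_per_day * days_on_per_week
    saved = always_on - scheduled
    fraction = saved / always_on if always_on else 0.0
    return saved, fraction


# Assumed example: 20 non-production instances at $0.10 per hour,
# needed 12 hours a day, Monday through Friday.
saved, fraction = weekly_schedule_savings(0.10, 20, 12, 5)
print(f"weekly savings: ${saved:.2f} ({fraction:.0%} of the always-on cost)")
```

The same function can be rerun with a different schedule to see how the savings move as teams tighten or loosen their core hours.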

And here in the graph, you can see a typical Saturday and Sunday, and if usage drops, what those savings might look like. And with elasticity it is so easy to track those savings and to track that impact that you’re going to have and to make improvements. So I always say, even if you don’t necessarily have a mechanism like a full instance scheduler or something like that that’s turning things on, see your elasticity and see how it gets better or worse, or maybe how it changes with the time of year, just to really map that out and track that. So some A-W-S specific K-P-Is that I look at are, first, by tags. So, you know, if you’re tagging your resources, what percentage of your environment is tagged? Or what’s the minimum number of tags that everything needs to have? Or the ratio of your production resources to non-production resources.

Does that make sense? Are your non-production resources way higher? Should that be the case? Should we make them more elastic? What is that conversation that you should be having? For E-C-2, I always look at the max C-P-U when starting an analysis on someone’s E-C-2 environment, and that’s because average is a little bit more disputable, and maybe in your environment it makes sense to look at average. But to me, if I’m looking at the last 30 or 45 days of your environment and I see that you never hit a max C-P-U of more than 15%, then that’s probably something we should look at. And maybe there’s a reason, right? Maybe this is a memory intensive instance and you don’t really need that C-P-U, and we have a further conversation. It’s a good thing to track and a good baseline to set, right? What max C-P-U do you want your instances to be hitting? Obviously, we don’t want them all sitting at 100%.
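The max C-P-U screen described above can be sketched in a few lines of Python. This is only an illustration: the instance names and utilization numbers are invented, the `CpuHistory` type is hypothetical, and in practice the daily peaks would come from whatever monitoring data you already collect.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CpuHistory:
    instance_id: str
    daily_max_cpu: List[float]  # one peak CPU percentage per day


def low_peak_instances(histories, threshold_pct=15.0, min_days=30) -> List[Tuple[str, float]]:
    """Flag instances whose peak CPU never went above the threshold."""
    flagged = []
    for history in histories:
        if len(history.daily_max_cpu) < min_days:
            continue  # not enough history to judge yet
        peak = max(history.daily_max_cpu)
        if peak <= threshold_pct:
            flagged.append((history.instance_id, peak))
    return flagged


histories = [
    CpuHistory("i-web-example", [9.0, 12.5, 14.0] * 10),    # 30 quiet days
    CpuHistory("i-batch-example", [20.0] * 29 + [82.0]),    # one busy day
    CpuHistory("i-new-example", [5.0] * 5),                 # only 5 days of data
]

flagged = low_peak_instances(histories)
for instance_id, peak in flagged:
    print(f"{instance_id}: peak CPU {peak:.1f}% over the window, worth a conversation")
```

A flagged instance is a prompt for a conversation rather than a verdict, since a memory-bound workload can legitimately show low C-P-U.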

We want that kind of happy medium. But what is that, and what are you defining it as? And again, setting that benchmark is important when bringing on a new product or a new team. You know, we hit a max C-P-U of this. That is something that may not be as easy to calculate savings-wise as elasticity. But it is easy to calculate as, okay, when we’re at this max C-P-U, our unit cost looks like this. And then if we were to increase that max C-P-U by only 5%, this is what our unit cost looks like. And really driving those decisions with data that you have at your hands.

Spot to on-demand ratio. So with spot instances, you know, if your compute or your E-C-2 environment is growing, is your spot environment growing? Are you trying different mechanisms to bring that E-C-2 cost down? If we see a customer, for example, start using a lot more spot instances, then we usually see their E-C-2 unit cost go down. And spot is going to be a good way to also calculate savings, right? You can go in and say, okay, well, if I ran this instance on-demand, it would have cost me 60% more. And tying savings back to these metrics is that consistent mechanism of having I-T and finance and business talk and stay in the loop with each other and really make decisions that help all parts of that business.

Instance age. This isn’t necessarily one that can help you right off the bat, but I always like looking at it because, what’s the average age of your instances? I mean, sometimes you can go in and the average age would be, you know, 300-plus days, which means that there were probably some E-C-2 instance types that were released since then that you might benefit from. Or it might mean that there’s some really old stuff out there that is skewing that average, but it’s just a good thing to track and know and watch how that benefits you. Usually if you have a lower average age for your E-C-2 instances, then that compute unit cost is going to go down a bit because you’re using less expensive instances and newer instances.

And a lot of times, when you do that migration, maybe you realize, oh, I didn’t really need this size of instance, and you go to an even smaller instance type. And that also plays into the generation of instances, right? I mean, if you’re running something that came out ten years ago, you can probably benefit by moving to an instance type that was released last year and has, you know, better technology and a better pricing structure just because of how old the other one is. It’s something to set benchmarks around. When I look at customers’ environments, I always say if there’s anything that’s original E-C-2, like an M-1 or a C-1, then we probably need to change it up. Or maybe it got left on, which can happen. And it’s a good benchmark to say, hey, we’re watching this, so keep innovating, right? Keep changing and keep adding different instance types and really seeing what works best.
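The compute K-P-Is from the last few paragraphs (spot to on-demand ratio, spot savings, average instance age, and original-generation instances) can be sketched together. Everything in the inventory below is a made-up assumption, including the rates; the `Instance` type and helper names are hypothetical, and real data would come from your cost and usage reporting.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Instance:
    instance_id: str
    instance_type: str     # for example "m1.large"
    launched: date
    lifecycle: str         # "spot" or "on-demand"
    hours: float           # hours run in the reporting period
    hourly_rate: float     # rate actually paid (assumed)
    on_demand_rate: float  # equivalent on-demand rate (assumed)


ORIGINAL_FAMILIES = {"m1", "c1"}  # the original-generation families named in the talk


def spot_share(instances: List[Instance]) -> float:
    """Fraction of instance hours that ran on spot."""
    total = sum(i.hours for i in instances)
    spot = sum(i.hours for i in instances if i.lifecycle == "spot")
    return spot / total if total else 0.0


def spot_savings(instances: List[Instance]) -> float:
    """Dollars saved versus running the same spot hours on-demand."""
    return sum((i.on_demand_rate - i.hourly_rate) * i.hours
               for i in instances if i.lifecycle == "spot")


def average_age_days(instances: List[Instance], today: date) -> float:
    return sum((today - i.launched).days for i in instances) / len(instances)


def original_generation(instances: List[Instance]) -> List[str]:
    return [i.instance_id for i in instances
            if i.instance_type.split(".")[0] in ORIGINAL_FAMILIES]


today = date(2021, 1, 1)  # fixed so the example is reproducible
fleet = [
    Instance("i-a", "m1.large", date(2019, 1, 1), "on-demand", 720, 0.175, 0.175),
    Instance("i-b", "m5.large", date(2020, 12, 2), "spot", 720, 0.0625, 0.10),
    Instance("i-c", "c5.xlarge", date(2020, 10, 3), "on-demand", 360, 0.17, 0.17),
]

print(f"spot share of hours: {spot_share(fleet):.0%}")
print(f"spot savings:        ${spot_savings(fleet):.2f}")
print(f"average age:         {average_age_days(fleet, today):.0f} days")
print(f"original generation: {original_generation(fleet)}")
```

In this invented fleet a single two-year-old instance pulls the average age up sharply, which is the skew the talk warns about when reading that number.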

There are so many options out there for these teams that you want to encourage yourself, or your team, or your company, to keep trying new things and taking advantage of that. And then storage. So if I’m tracking storage growth and I’m tracking storage unit cost, in A-W-S terms, you know, that’s S-3 cost. Are people using different tiers, you know? We have less expensive tiers than just S-3 Standard. So here I show percent in S-I-A, as in Infrequent Access, growth and Glacier growth, which is our cold storage. So if your storage footprint is growing, are those tiers growing too, right? Are people thinking, oh, I can also move these things to a less expensive tier? And just making that a part of the process.
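One way to check whether the cheaper tiers are growing along with the footprint is to compare each tier’s share month over month. This is a minimal sketch with invented volumes; the tier names and the `tier_share` and `growth` helpers are assumptions for the example.

```python
from typing import Dict


def tier_share(gb_by_tier: Dict[str, float]) -> Dict[str, float]:
    """Each tier's fraction of the total stored gigabytes."""
    total = sum(gb_by_tier.values())
    if total <= 0:
        raise ValueError("total storage must be positive")
    return {tier: gb / total for tier, gb in gb_by_tier.items()}


def growth(previous: float, current: float) -> float:
    """Fractional growth from previous to current."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous


# Assumed gigabytes stored per tier in two consecutive months.
last_month = {"standard": 80_000, "infrequent_access": 15_000, "glacier": 5_000}
this_month = {"standard": 90_000, "infrequent_access": 30_000, "glacier": 30_000}

footprint_growth = growth(sum(last_month.values()), sum(this_month.values()))
print(f"total footprint growth: {footprint_growth:.0%}")

before, after = tier_share(last_month), tier_share(this_month)
for tier in this_month:
    print(f"{tier}: {before[tier]:.0%} -> {after[tier]:.0%} of footprint, "
          f"{growth(last_month[tier], this_month[tier]):.0%} growth")
```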

If you track what good looks like, if you say, okay when your storage is growing these things should also be growing, then it becomes a part of that thought process as people build new things and add things and grow your business. So as a recap kind of when it comes to the best practices, when saying what does good look like? We first— we don’t want to pick too many K-P-Is. So we just looked at 10 or 15 K-P-Is. We wouldn’t want to say hey, let’s go track all of that. We want to make sure it makes sense to our environment and what’s important to you. Also quality over quantity, right? I mean if you’re tracking a bunch just to see, then people aren’t going to care as much, and people aren’t going to, you know, think of 12 things that they need to look at before they start a proof of concept.

Next is defining that cadence. Now, this doesn’t mean that people only see these metrics quarterly or monthly. But when are they being reported back? And when are we going to show those successes? We want to make sure, going back to the first step, that the visibility is there for anyone to check at any time, but also that they’re aware of the cadence of what you care about, right? If we’re tracking these every day, we’re probably not going to see much of a shift. Whereas if we were to track them maybe once a month, then we’re able to show, okay, this changed, now, why did it change? Or these were the monthly savings that you received because of those changes. That really plays into number three on this list, which is to calculate the benefit of the benchmarks.

If our elasticity goes here, we’re going to save this much on our current environment. If we can make our E-C-2 unit cost go to this point, then our environment’s going to get to this point. So really showing the benefit there, because that’s going to tie back to finance, right? These are adjustments that really the technology side of the company is going to have to make. But we also want to keep those dots connected and make sure that finance understands why they should care, and really also understands the benefit of the work that someone put in. If they did all this work to, you know, put in policies to change storage tiers, we want to be able to celebrate that and say, hey finance, because the team did this, we saw this amount of savings.
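Calculating the benefit of a benchmark can be as simple as the sketch below: take each team’s current unit cost, compare it to the benchmark, and multiply the gap by their usage. The team names, usage figures, and the benchmark value are all invented for illustration.

```python
def projected_monthly_savings(units: float, current_unit_cost: float, benchmark_unit_cost: float) -> float:
    """Savings if a team reached the benchmark; zero if it is already there or better."""
    return max(0.0, (current_unit_cost - benchmark_unit_cost) * units)


BENCHMARK = 0.050  # assumed target cost per compute hour

# Assumed (compute hours per month, current cost per hour) for each team.
teams = {
    "team-a": (50_000, 0.062),
    "team-b": (20_000, 0.048),
    "team-c": (35_000, 0.055),
}

savings_by_team = {
    name: projected_monthly_savings(hours, cost, BENCHMARK)
    for name, (hours, cost) in teams.items()
}
total_savings = sum(savings_by_team.values())

for name, saving in savings_by_team.items():
    print(f"{name}: ${saving:,.2f} per month if it reaches the benchmark")
print(f"total opportunity: ${total_savings:,.2f} per month")
```

A number like this gives finance something concrete to react to, and it also shows which teams are already at or ahead of the benchmark.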

And then last is granular but not too granular. Right? We don’t want to necessarily look at these K-P-Is across-the-board. Say you have 20 accounts or three big products and they have multiple accounts working with them. We don’t want to just look across the board at those metrics. We want to get a little bit more granular so that people can actually make changes and also drive where something might be coming from. Maybe one product is significantly more expensive than the other and then you can really drive to where that comes from. And also set those benchmarks, right? So most of our teams are about here when it comes to these K-P-Is.

So why aren’t you there? Or you know, why is this team doing better and really defining that success? So to get that granularity you really have to implement some controls, because as I said, for most customers looking across spend isn’t that helpful, right? It’s not giving us that many insights and it’s more of a general number. And so we want to make sure that we’re using resources to achieve granularity. So some of the A-W-S resources would be A-W-S Organizations, or our linked account structure, or tagging. And as an example of— kind of some successful ways that I’ve seen this done and levels of granularity, would be using A-W-S Organizations to kind of define products. So here in the example I show, say the product is S-D-K and teams.

So we know that accounts that fall into these buckets are those products. Then we take it a step further and we look at, okay, let’s name our linked accounts so that we know and understand what they are. I see a lot of people name those linked accounts based off of the environment. So for teams, say we have a production and a non-production environment. Okay, that’s good, we’re getting that next level of granularity. How do we take it one more step further? That’s going to be by tagging the resources in those accounts.

Now, these are just examples of some that I’ve seen be important for customers. Things like, what version of the product is this? V-2, V-1, or maybe a future version, so you can really track, okay, V-1 cost us this much, or V-1 has this unit cost but V-2 has this unit cost. Cost center. Again, you want to bring finance and business back into that technology decision. If you’re going to mandate certain tags so that you get those insights from a technical standpoint, you also want to see how it’s going to benefit finance, right, and business, and really be able to relay those costs back. It’s also super helpful when it comes to budget season and forecasting, because you’re going to have it right then and there: okay, I have to present a budget for cost center 80, now let me go see what resources fall into that.

And then schedule. This is one that I always tell people to do even if you’re not using an instance scheduler or anything like that, just labeling a resource: is this a 24/7 resource? Is this something that is only ever touched during the day? Or maybe it’s something that, you know, does have to be on 24/7 but has some flexibility. Defining that is going to help a lot to really see what kind of flexibility and elasticity you have opportunity-wise, and give you a little bit more granularity.

And now here I picked three tags to mandate, right? I think it gets tough when you get more than three tags, definitely more than five, because that’s asking a lot of questions, right? So, you know, say you’re checking out at the store and they ask you six questions before you can buy something. You’re a little bit more hesitant to check out there. So make sure that if you’re mandating some of this, you’re being reasonable from that granularity standpoint, right? Like I said, you’re being granular, but you’re not being so granular that things are going to get lost and maintenance and hygiene of that data gets tough.
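As an illustration of mandating the three tags above, here is a plain Python sketch of the kind of check a tagging rule could run, along with the tag coverage percentage mentioned earlier. This is not Pulumi’s policy API; the tag keys, the allowed schedule values, and the resource list are all hypothetical.

```python
from typing import Dict, List

REQUIRED_TAGS = ("version", "cost-center", "schedule")       # assumed tag keys
ALLOWED_SCHEDULES = {"24x7", "business-hours", "flexible"}   # assumed values


def tag_problems(tags: Dict[str, str]) -> List[str]:
    """List what is wrong with a resource's tags; empty means compliant."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    schedule = tags.get("schedule")
    if schedule and schedule not in ALLOWED_SCHEDULES:
        problems.append(f"unknown schedule value: {schedule}")
    return problems


def tag_coverage(resources: List[dict]) -> float:
    """Fraction of resources that carry all required tags with valid values."""
    if not resources:
        return 0.0
    compliant = sum(1 for r in resources if not tag_problems(r["tags"]))
    return compliant / len(resources)


resources = [
    {"name": "api-server", "tags": {"version": "v2", "cost-center": "80", "schedule": "24x7"}},
    {"name": "sandbox-db", "tags": {"version": "v1", "schedule": "weekdays"}},
    # Keys are matched exactly, so a capitalized "Version" counts as missing.
    {"name": "batch-worker", "tags": {"Version": "v2", "cost-center": "80", "schedule": "flexible"}},
]

for resource in resources:
    issues = tag_problems(resource["tags"])
    print(f"{resource['name']}: {'ok' if not issues else '; '.join(issues)}")
print(f"tag coverage: {tag_coverage(resources):.0%}")
```

Keeping the required list to three keys with a small set of allowed values keeps the check easy to satisfy and the resulting data easy to keep clean.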

So to mandate that tagging and enforce it, to make sure that people are doing this and you’re getting accurate data: you know, that visibility and that defining success isn’t going to mean anything if it’s not clean, accurate data that gives you the information you need. And so I gave some examples here of two things that you can do in Pulumi that help with that enforcement. First, using policy as code and enabling a policy pack to say you have to have these three tags when you create this. And one of the things that I like about that is, when it comes to tagging, if people go rogue, you could end up with hundreds of thousands of tags, right? So you could have environment spelled five different ways. Capitalized. All-caps. Abbreviated. Whereas when you’re enforcing tags this way, you can say, okay, this is the tag that you’re adding. So, you know, don’t go rogue. And then infrastructure as code to automate some of that tagging, looking at some of those things that you’ve created through Pulumi and being able to go back and change them. Or maybe you’ve shifted some cost centers and you can go back and automate that.

So last is driving that accountability. You know, why do people care? How do you actually make this a part of your culture? And, you know, not just something where people say, oh, it’s budget season, we have to do this.

But something that people think about when they’re drawing up that architecture plan and when they’re thinking of new products or new versions of things to bring on. And these are, you know, some of the ways that I have seen customers be successful here. And the first thing is really to gamify, right? So make it fun to save money. So go and, you know, show when people have saved more money, or put a reward out there. It’s so common for companies to say, hey, here’s a list of everyone who needs to right-size, or who has idle instances sitting out there.

But if you’re going to do that, you also need to reward the good, you know? Say, okay, here’s a list of all the teams that were able to bring their unit cost to this point. Or this team started using spot instances and we saw this amount of savings. And so, you know, gamifying the process of really saving money and optimization. One of the things I referenced before is we sometimes do an event called a FinHack, which is where we learn some of these levers of optimization. Maybe it’s a spot instance. Maybe it’s tiered storage, or maybe it’s diving into this cost and usage data. And then we all break up into groups and use that new knowledge to find ways to save money and see who can find the most savings, right? Make it fun to save money.

Then you want to reward, right? So, like I said, you don’t just want to send a list of, hey, here’s all the people that need to save money or that aren’t doing a good job. You want to say, here’s the people that were successful and here’s the opportunity. And then you want to set a regular cadence, right? You want people to know that on the 30th of every month, or maybe it’s the first Monday of every month, they’re going to know where they sit, right? They’re going to know that someone is looking at this and not only noticing if something might be bad, or there might be something that needs to be fixed, but also noticing when you’re doing something good and that you’ve put effort in there.

And driving that accountability, right, you give them the right tools, now how are you getting them to care? And so as a recap for all this, so, you know, we talked first about that visibility piece and making sure that customers or that people on your team can see the cost and usage data, right? You’re not going to be able to make anyone aware of their cost if they can’t see it. And also if they’re not all using the same consistent way to view it. And once you kind of establish that visibility and that way that people can go, then you define what does that good look like? So when using this data, you want to make sure that you are looking at these top-five things because that is what we’ve defined as success. And those can change right? You’re not locked into them.

But you have to define them and you have to document them. Okay, so we’ve been given visibility. We know what it looks like to be good. But, how do we get that granularity to actually action on that right? So— so to actually make that data and that success K-P-I important to us and that is where we implement controls. Right? We implement ways that it’s— doesn’t have to be an afterthought to add that level of granularity through something like tagging. It’s just automatically a part of the process, you know, taking that one less step out of it and making people automatically get that granularity when they build something. Then driving that accountability, right? You’ve— you’ve automated the granularity.

You’ve told people what it looks like to be good and successful, and you’ve given them the right visibility so that they can go through and— and see that data and really in real-time react to it. Right? Not just get a report at the end of the month that they’re going to react to when they could have started making changes before. And you’ve done all those steps and now you’re going to drive them to be accountable and to actually put it in their day-to-day, right? This last step is really where you make it a part of your culture. And that is it for my talk today and let me know if you have any questions. Thank you for having me.
