Testing in Production


Testing in production: it’s gotten a bad rap. People seem to think it’s all about irresponsible YOLO-ing and taking shortcuts around the sacred processes that you rely on to catch bugs in staging, before they make it to prod. Nice theory: completely wrong. Staging areas and controlled environments will never turn up the interesting bugs: that takes real data, real workloads, real concurrency and real chaos. In other words: production.

But it gets worse! Staging isn’t harmless; it’s a black hole for limited engineering cycles. By sinking time there you starve yourself of the cycles you should be using to engineer tooling and guard rails for production.

Production is quite literally the only environment that matters, so every moment you spend working anywhere else is time spent absorbing the wrong instincts, running the wrong workflows, and gaining more false confidence in your code. Only production is production, and the only way to gain confidence in your code is to bake it in production over time and a range of workloads. So let’s talk about how to do this responsibly, without impacting your users, using tools ranging from capture/replay to canaries, production load testing to chaos engineering, and the instrumentation-based observability that must brace and validate every effort if you plan to sleep well at night.


  • Charity Majors
    CTO and Co-founder, Honeycomb

This talk is about Testing in Production. I Test in Production. So do you. Everyone does it, whether they admit it or not. And I actually feel like the problem is not that we do it; it’s not Testing in Production that is the problem. It’s the fact that we’re too reluctant to admit it, which prevents us from naming what we’re doing, identifying it and improving upon it. Things that happen in the dark don’t tend to improve.

My name is Charity Majors. I am an Operations Engineer by trade. I am the co-founder of Honeycomb, the world’s first observability tool. I have spent a career being the first infrastructure engineer who comes into a small team of software engineers and helps them grow up. I really enjoy doing that. I also do database stuff. I wrote the Database Reliability Engineering book with Laine, and with Liz Fong-Jones I have the observability book coming out from O’Reilly in the next few months. Pre-release copies are available now. And you can tell that I’m from Ops because this is how I feel about software.

The only good diff is a red diff. Testing in Production has really gotten a bad rap. I mostly blame this dude. I feel like it’s a funny meme, of course it is. I used to have this poster on my wall, it’s hilarious. “I don’t always test my code, but when I do, I Test in Production.” That’s fantastic. But there’s this implicit dichotomy there, this false implication that Testing in Production somehow means you don’t test in any other way. In fact, you can do both, and you must do both. And I would argue that Testing in Production is no less important than unit tests, integration tests, all that jazz. In fact, if I had to choose only one, and I do not, let’s make that perfectly clear.

I do live in a world where I can have both. If I had to choose only one, I would go with the ability to Test in Production, because it is the one that is most embedded and grounded in reality. If I could only have unit tests and integration tests and had no ability to look at my software while it was running, I would be worse off than if I could not do unit and integration tests but had full access to a range of rich production testing stuff. But the idea that you really have one or the other, or that good engineers do one and not the other, is wrong. It’s misleading, and it wastes energy.

It makes people waste energy in the wrong places. But given that we all do it, and given that, officially, anytime anyone ships any change, that change is a test, right? There’s a bespoke complexity there, made up of that unique and non-repeatable intersection of artifact, deploy process, environment, and system state. If you have production, you Test in Production. So what’s the big deal, right? What’s all this fear and fuss about? Well, when you say Test in Production, people tend to hear this. You know, they hear cowboy coding.

They hear people logging in and just getting a root shell on the database and hand-editing the table schema, right? They hear you not giving a shit about your users. And to be clear, those things are really bad. It’s fine for knees to jerk a little bit over that. Testing in Production, like any powerful thing, can be done really badly, really easily. Pretty much anything you do in production can fall under this category. I would actually argue that the ability to successfully and safely Test in Production requires a significant amount of architectural and automation sophistication: an understanding of the best practices, and the ability to design systems and tweak them from the ground up to lend themselves well to this form of testing.

It’s not like you just flip a switch one day, or you just buy a tool, and suddenly you’re Testing in Production well. No, that sounds really scary, and that is not possible. Some caution is wise. And it’s also true that in some ways you must be this high to ride this ride. You know, like Sam Newman, no, Martin Fowler, who famously said that about microservices: you must have mastered the fundamentals in order to move on to the advanced concepts. You must have mastered unit tests and integration tests to be able to move on to actually running, you know, continuous 20% load tests in production on the same hardware as your production services. You need to have a really strong background in operations.

You need that, I think, in order to build the systems that can do this, to do it successfully, to know how to use them, and then to pass on that knowledge throughout your internal tribe. Each one of our systems is this really complex, wonderful, beautiful little snowflake, right? It is an intrinsically unique, complex socio-technical system. And what that means is that you and I can learn broad principles from each other. We can tell specific stories, but the interpretation of those stories, the application of those techniques, has to be left up to the hearer, the recipient, because your system is not my system.

I don’t know what’s best for your system. You don’t know what’s best for my system. So a fair dash of humility is often required too, because we can tell you what the rules are all day long, but rules are made to be broken. And again, it is never a substitute for pre-production testing. You can have both. “I always test my code, and then I test it again in production.” This is the man we need in our life. I started giving talks about Testing in Production three or four years ago. Mostly, I’ll be perfectly honest, because it made people lose their shit. And I found that hilarious.

It seems to push people’s buttons a little bit less these days, which is probably good, but it also makes it less fun. Anyway, I don’t think that this is just a fun argument over a provocative phrase. I think that it is really kind of a struggle for the soul of our industry. And while there is no question in my mind that my side will win, there’s quite a lot of variance in how quickly we could win, and how decisively. So many engineers out there today are burning themselves out. They’re giving their all, they’re going through the fire, they’re going through hell, and it’s not necessary.

A lot of it is not necessary. And that’s what wakes me up every day: just this burning anger at how much of our lives, how much of my life, was wasted on bullshit that I didn’t need to do, that a computer should have been doing. This profession can be very inhumane at times. I love it. People always ask, what would you be doing if you weren’t in tech? I’m like, I don’t fucking know. Obviously I’d be in tech. I was born to be in tech. But we need to make it more humane. We need better tools. And production itself, in general, I think needs a bit of a rebrand.

You know, we in Ops, we were so terrified of, you know, [indistinct] going down every time a bad actor showed up that we became very non-welcoming, shall we say. Very, like, stand back, this is my turf, I’ll cut you if you dare to step on it. So we’ve got some ground to make up there as well. I really feel like we should try to make production less of a glass castle, right? And make it a little bit more of an adult playground. Something like that. Software has some pretty big problems. I assume you all are familiar with the DORA reports and with Accelerate. Let me give the canonical four questions. How often do you deploy? How long does it take for code to go live? How many of your deploys fail? How long does it take to recover? I would add a fifth.

I think every team should be tracking these five, the fifth being: how often are you paged outside of work hours? I think every manager should be graphing these. You should be looking at these at least weekly to see if you’re headed in the right direction or the wrong direction. And if you haven’t read Accelerate... I assume all of you have, you seem like smart people. I can’t see you, but smart people. But the key finding from Accelerate was just that these first four metrics can tell you, roughly, how high-performing a team you are or aren’t, for most people.
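As a rough illustration of tracking those five numbers, here is a minimal Python sketch. The `Deploy` record, its field names, and the `team_metrics` function are hypothetical, invented for this example; it also assumes that "how long does it take for code to go live" is measured from merge to deploy, which is only one of several reasonable definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deploy:
    """One deploy, as a hypothetical record exported from a deploy pipeline."""
    merged_at: datetime                      # when the change was merged
    deployed_at: datetime                    # when it went live
    failed: bool = False                     # did this deploy cause a failure?
    recovered_at: Optional[datetime] = None  # when service was restored, if it failed


def team_metrics(deploys: List[Deploy], off_hours_pages: int, weeks: float) -> dict:
    """Summarize the four questions from the talk, plus off-hours pages."""
    if not deploys or weeks <= 0:
        raise ValueError("need at least one deploy and a positive time window")
    failures = [d for d in deploys if d.failed]
    recoveries = [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]
    return {
        # How often do you deploy?
        "deploys_per_week": len(deploys) / weeks,
        # How long does it take for code to go live? (assumed: merge to deploy)
        "median_lead_time": median(d.deployed_at - d.merged_at for d in deploys),
        # How many of your deploys fail?
        "change_failure_rate": len(failures) / len(deploys),
        # How long does it take to recover?
        "median_time_to_recover": median(recoveries) if recoveries else None,
        # The fifth: how often are you paged outside of work hours?
        "off_hours_pages_per_week": off_hours_pages / weeks,
    }
```

Computing these once a week and graphing the results is enough to see which direction a team is heading; the absolute values matter less than the trend.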

And furthermore, you can become a more high-performing team by chasing those metrics, by teaching to the test, right? By making it so that your deploys go out much more quickly after your code was written, making it so that they, you know, blah, blah, blah. I’ll skip past that in order to just show you the main thing we want to focus on, which is just that there’s a big gap between the [indistinct] elite teams and the rest of us, and it’s getting bigger. And if you look at those numbers, like deployment frequency: on demand, multiple deploys per day for elite; once per week to once per month for low to medium. That’s a lot of wasted hours. That’s a lot of time that engineers are spending not doing anything interesting or new, not focusing on learning new skills, not doing anything to move the business forward.

They’re just fighting against the tyranny of their own internal systems. All of this, all of it, is pre-production, spoiler alert. It’s big and it’s getting bigger. And we waste a lot of time. This is the Stripe Developer Report. I recommend spending a few minutes with this if you really want to be shocked sober, or whatever the equivalent is for you. We waste so much time on stuff that doesn’t move the business forward, that doesn’t make you learn anything new, that doesn’t make you create anything interesting. It’s just simply doing the work that you have to do in order to get to the work that you want to do. It’s reproducing bugs and errors, figuring out what to do, doing the wrong thing because you couldn’t see what you were doing, having to redo it all. Dealing with technical debt, orienting yourself in time and space, figuring out what the last people to work on this code base were doing.

You know, the glasses metaphor comes into play big time here. If you can’t see where you’re going when you drive down the road, you don’t drive very fast, right? When you’ve got full visibility, yeah, you can sit back and cruise. Without it, your feedback loop is going to be long and lossy, and you’re going to spend more time just steadying yourself and looking for evidence that you’re still on the right path than actually doing the work. What’s messed up is that we think this is normal. We think that this is just the way it is when you’ve got a job doing software. It’s not normal and it’s not inevitable. And basically the whole theme of this talk, what I want to talk about, is just: how do we get there? Because there’s this enormous shift underway.

Testing in Production is part of it. And we’re going to talk about that a bit, but it’s not all of it by any means. And you know, the delta that you see with the elite teams that are just ballooning and reaching escape velocity: these are the teams that are leaning in and adopting all of these new best practices and tools around production. These are the teams that are abandoning the wasteland of crap that, you know, we’ll get to that. Final takeaway from the Stripe report: 42%. 42% of the average engineer’s time. And that’s the average engineer.

There are people who are way worse than that. Half of your day, half your week, half your life goes to bullshit. I think we can do better. And for individuals, it really pays off to be on a high-performing team. This is not a question of how good you are as an engineer. That’s not the difference between high-performing teams and elite ones. A 1000% guarantee [indistinct], because I’ve been on both sides of that. It’s about the team. It’s about the context of the team. It’s about, you know, the hoops you have to jump through in order to get a pull request accepted.

It’s about the strength of your CI/CD pipeline. How automated is it? How good is your automated error checking? All the stuff that enables you to move quickly with confidence: that’s what makes you a good engineer, right? It’s not that you get to become an elite team by being the best engineer. You get to become an elite engineer by finding one of the best teams and joining it. And what are those elite teams doing? They’re investing in production: observability, instrumentation, picking up the pace, shrinking the time to deploy, educating themselves, making sure that every person on the team is production literate.

Is production capable. They can sit here writing code, knowing what they’re building. They’ve got that original intent up in their head, it’s beautiful. And then they save their code, right? They merge it to main. They stand up and stretch. A couple of minutes later, it’s in production, and they go look at it using the same eyes that they just used to write the code. They go and they look at it through the lens of the instrumentation they just wrote while asking themselves, how am I going to know if this is working when it’s in production in a few minutes?

And then they go and they look and they ask themselves, how can I know if it’s doing what I wanted it to do or not? And, you know, does anything else look weird while I’m here? If you can get a team full of people who can do that, who are trained to do that, who have learned to associate dopamine hits with doing that, you’re going to be doing well. By the way, the Honeycomb team’s [indistinct] metrics are about an order of magnitude better than those of the elite performers in this bubble. And we did not go out and hire all of the best engineers. We hired good engineers.

Engineers who are good at communication, who like learning new things, who were collaborative, and who wanted to pick up this crazy shit we were trying to sell people on, you know, observability. And now they’re some of the best engineers in the world. They didn’t start out that way. That’s ’cause that’s not the direction that causal loop goes. If you’re on a shitty team... never mind. This is not my excuse to tell everybody to quit their jobs. That’s a different talk. Anyway, there are a few different ways to talk and think about running in production. There’s more than one way. When I say, you know, get your shit into production, I don’t mean make sure everybody sees it immediately after you write it. ’Cause you’re right.

That would be stupid. There are three ways, and I grabbed this from one of Cindy’s great articles on Testing in Production because I really liked it. It really maps to the way I’ve always thought about it. And she grabbed it, of course, from the dearly departed Turbine Labs, who had some great writing about releases. So for deploys, here’s how they defined it: it is the process for installing the new version of your service’s code on production infrastructure.

No, that’s not for re-imaging all of your, you know, everything from scratch. It’s your business logic, right? Your code that you’re actively developing. Deployment doesn’t have to expose customers to a new version of your service, right? And it probably shouldn’t. Nevertheless, you’re getting that code into production, right? That counts, it totally counts. One thing this does very nicely is it minimizes, possibly even entirely eliminates, the need to maintain separate dev, test and staging environments, which invariably become dependencies and need to be kept in sync with production, which takes up half your engineering time.

And [indistinct], it also applies a certain design pressure on engineers to decouple their services, so that the failure of a test run in production on a given instance of a service does not lead to cascading failures or user-impacting failures of other services, right? Designing data models and database schemas with non-idempotent requests in mind, right? Especially writes. Baking all this stuff into your tooling and into the way you write code is life changing. And then there’s what they had to say about releases. And often people say deploys when they mean releases.

And they say releases when they mean deploys, less often. But if you’ve been around the block a few times and seen some software get deployed a few times, you probably have some scars from it, which are legitimate. I would never take that wisdom away from you. The ability to safely release code is one thing. But the ability to do it in a controlled manner, to expose increasing waves of people to it, is a different issue. And things that come into play here would be feature flags, canary groups, you know, rolling groups of users.

And you need to worry about this for the front end’s sake, you know, the UI and UX. Maybe you want to start with a canary group that you selected because they’re friends and family, or they’re internal users or something, and you want to hit them first, see if anybody notices anything with this new thing, and then roll it out to other users. There’s also a version of this story that you can apply to the backend. Say you’re deploying a change that you know is going to be hitting the caching layer a little bit harder. It’s not enough to just do a canary of, you know, 10% of your hosts and then deploy the rest. It’s not even enough to deploy slowly up to 50% and then turn on the rest, right?

No, it’s actually that last 10, 20, 30% where you’re most likely to encounter any, shall we say, edge cases, right? So we’re just getting a lot of different kinds of controls here. This has been referred to by monkchips and others as progressive delivery. It was really itching for a name, so I’m delighted to see that it has acquired one. And then there’s Post-Release. These are often called Experiments, and I will not lie: I have done a string substitution from “Test in Production” to “run some experiments” more than once to get buy-in from other teams or higher-ups. And the thing to remember after you’ve shipped your code is that it’s broken. It’s already broken. It’s broken whether you know it or not, right? Your distributed system exists in a continuous state of partial degradation. And that’s the best case scenario.
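The controlled release described above (a hand-picked canary group first, then increasing waves of users) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular feature-flag product: the names `bucket` and `is_enabled`, the flag name, and the choice of SHA-256 hashing are all hypothetical choices made for the example.

```python
import hashlib
from typing import AbstractSet


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int,
               canary_users: AbstractSet[str] = frozenset()) -> bool:
    """Decide whether this user sees the released code path.

    Canary users (friends and family, internal users) always get it.
    Everyone else gets it once the rollout percentage passes their bucket.
    """
    if user_id in canary_users:
        return True
    return bucket(flag, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and the user ID, raising `rollout_percent` from 10 to 50 to 100 only ever adds users; nobody who already had the feature loses it partway through. The same idea applies to hosts instead of users for a backend change, where the last steps of the ramp deserve as much attention as the first.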

So this is where we get into stuff like chaos engineering, fault injection, A/B tests, all these different kinds of experiments. And Cindy had this really great little list that I have copied some stuff from here, about just what we’re talking about. So there’s a wide range of risk profiles here, right? A wide range of types of tests. And you can see how, by confining yourself to everything that happens before you hit merge, look at this incredibly rich world of powerful tools that you are starving yourself of.

You know, there are so many things that you’re never going to encounter in staging. You just won’t, even with the gold standard. And I will point out that the closer you get to laying bits down on disk, the more paranoid you should be and the more pre-production testing you should do. So for example, anytime you’ve got a database major version upgrade, you bet your ass I’m going to do some offline testing.

In fact, I have written this particular piece of software not once, not twice, but three times, for three different databases, for accomplishing major version upgrades safely. All it does is sniff 24 hours’ worth of traffic to the database and capture it, and then it replays it against a snapshot of the database that was taken at the time that you began sniffing the queries. And then we just run it all and adjust certain knobs, like concurrency and transactions and, you know, whatever, just to see: is it faster or slower than my old version was for this workload? Nothing else matters for me, for my workload, which means that no standard off-the-shelf benchmark testing tool was ever going to work.

That’s the gold standard. Even if you had that for your entire distributed system, it would still not solve the Michael Jackson problem, where one day Michael Jackson is alive and the next day he’s not, and you couldn’t predict it. I assume. Unless you killed Michael Jackson; then you could have predicted it. You probably shouldn’t try. This is where you start to see the wisdom in just giving up control, right? It’s very Zen. It’s very Buddhist. It’s also just what you do when you’re exhausted.
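To make the capture-and-replay idea from the database upgrade story concrete, here is a minimal Python sketch. It is not the speaker's tool: the `replay` and `compare` functions, the `Query` shape, and the `execute` callback are assumptions made for illustration, and a real replayer would also have to preserve things this sketch ignores, such as transaction boundaries and the original timing between queries.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

# One captured statement: (SQL text, bound parameters), as sniffed from live traffic.
Query = Tuple[str, tuple]


def replay(queries: Sequence[Query],
           execute: Callable[[str, tuple], object],
           concurrency: int = 1) -> List[float]:
    """Run every captured query through `execute` and return per-query latencies in seconds.

    `execute` is whatever runs a statement against the restored snapshot,
    for example a thin wrapper around a database driver's cursor.
    `concurrency` is the knob for how many queries run at once.
    """
    def timed(query: Query) -> float:
        sql, params = query
        start = time.perf_counter()
        execute(sql, params)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map returns results in the same order as the input queries.
        return list(pool.map(timed, queries))


def compare(old_latencies: List[float], new_latencies: List[float]) -> float:
    """Ratio of total replay time, new over old. Below 1.0 means the new version was faster."""
    return sum(new_latencies) / sum(old_latencies)
```

The intended use is to call `replay` twice with the same captured queries, once with an `execute` that targets a snapshot on the old version and once against the same snapshot on the new version, and then compare the results at several concurrency settings. The point of the design is that the workload is your own captured traffic rather than a synthetic benchmark.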

You just go, I give up, I can’t predict what’s going to happen. I officially quit trying, that is, and I will just learn to get better at handling whatever the hell does happen, right? And from all of that, we got microservices. So you could say it’s a blessing and a curse, I don’t know. You know the answer: you should experiment and learn where you can, under controlled experimental conditions. This was Cindy’s super cute graphic. And I found the Twitter thread. I always refer to this as the fourth trimester, which is this term from, you know, evolutionary biology about how human babies are born so weak and helpless and dumb. And they just stand there squalling all the time and breaking constantly.

And they’ve got all these terrible failure conditions. So you’ve just got to run around picking up after them all the time while they grow and slowly learn to stand on their own two feet. And that’s your code, is what I’m saying. Your code is a cute little newborn baby that’s shitting itself all the time and needs you to run around after it cleaning up. It would be incredibly rude for you as a new code parent to just go, okay, have fun, I’ll be back with another code baby in, you know, a week. That’s not cool. They didn’t consent to that. You take care of your own code baby until your code baby can sleep through the night and wipe its own ass. That should have just been the talk right there.

That’s enough. Nevertheless, we still have time. So let’s talk about some things that people say about staging that drive me batshit. Naive staging questions: “Why Test in Production when you could just be testing in staging?” “Shouldn’t you always default to the side of adding more confidence by testing in staging, though?” “Isn’t that just the mature, responsible, sane thing to do?” Or my favorite: “We prefer to find our bugs in staging, not production.” Oh gosh. I call this the “more staging is always better/safer” bias. And it seems to reign unquestioned, but it’s very stupid. Sorry.

That’s not very nice. Unexamined: people live an unexamined life over there. In fact, anyone who’s spent any time at all with staging environments knows that you can decrease confidence by running in more staging environments just as well as you can add confidence. Because very often, let me emphasize, very, very, very, very often, something will break in production and will not break in staging. And just as likely, things will not break in production and will break in staging. And you can choose to spend the rest of your life chasing down differences between the two environments and realizing every time that it was, oh, the instance type. Oh, it’s the network. Oh, it’s anything I can think of that was different.

Oh, it’s the environment. I ask you to ask yourself: how is any of that moving the business forward? Another claim that I find naive and laughable is, “We keep staging in sync with production. It is a representative environment. That’s why we test in staging, so that we can be sure.” To which I would say: no, staging has far more in common with your laptop than it does with production. And if you’re a fan of mythical creatures, maybe that explains your devotion to staging environments. Staging facts of life: non-prod environments will never look, feel, or behave just like prod.

Each additional environment will surface as many novel bugs as production does. And you will have to repro on all of them. The best way to find those bugs, assuming what you care about is bugs in production, the only way to find most of those bugs, is to consistently practice observability-driven development and to allow, or force, you know, two sides of the same point, developers to own their software all the way out to watching users run it in production. That has nothing to do with staging environments. Ultimately you’re gonna find most production bugs in production. And here’s the thing: I’m not saying staging is worthless.

I’m not saying that at all. I’m not trying to get you to stop using staging. I think it has some really valid use cases. What burns me is when I see production get the leftovers. When I see engineers slogging away for days and weeks trying to get these environments to match up, trying to figure out the differences in the environments, trying to bring up another staging environment, trying to bring up, I shit you not, a staging cluster for each developer. The quantity, the hundreds of years of developer energy that have gone into these stupid things over my life, boggles the imagination.

But then we come down to building some guardrails into production, building some better tools for production, getting some sweet, sweet software engineering energy to really tend to production, to shrink that [indistinct], to fulfill the promise of continuous delivery. And it’s like, oh no, we don’t have the time. Oops, that’s not gonna fit into our sprints. That’s not one of our goals. We’ve spent all this time, we have lost it gambling the dice on staging, so we’ve got nothing left to put in our savings account in production.

All I’m asking is that you reverse the order of importance. All I’m asking is that you start taking production seriously. That you start putting it first, that you ask yourself: how can we give our engineers what they need to really understand what their code is doing while users are interacting with it in production? What do you need? What kind of visibility do you need? What kind of observability do you need? What kind of instrumentation must you do? What kind of tooling should you adopt? What kind of shiny young upstart startup named after... sorry, doing it really badly. Whatever. Just know what you want.

Have ideas. Try to be unafraid to fail, be unafraid to sink that kind of devotion into production and to feel that it is yours. You know, I feel like sometimes software engineers just haven’t really made the leap to embodying that personal identification with it being done well, right? ’Cause it’s Ops’ job, you know. And I actually care less about you actually getting paged and woken up, dear software engineers, and I care more about you feeling that ownership and that concern in your gut. It needs to be your baby. Your squalling, naked baby on the floor, because it is. It’s your baby. Where was I? Right, production.

All I’m asking is that you give the crumbs to staging, because if confidence is what you’re looking for, production is how you get it. And I just read through the entire, like, second half of that. You have your constraint. Engineering cycles are the scarcest resource in your firmament. I know it’s my scarcest resource too. And it can feel galling to spend lots of developer cycles on what feels a little like navel gazing, but it’s not. It is an investment account in the glorious First National Bank of Technical Debt: going faster and safer, right? That’s what we learned. So also, Caitie McCaffrey gave this great talk that people love, there’s a link there, where she showed that you can catch 80% of the bugs with 20% of the effort. And you should. Where are you gonna catch them? Use your energy in staging. I started writing something here and I realized it was almost a haiku, so I needed to do a staging haiku.

A brief note on observability, because the shift from, you know, these tightly controlled staging environments to the more loosey-goosey production also tracks our industry’s shift from monitoring to observability, which is the direct consequence of our shift from known unknowns to unknown unknowns, right? Once our architecture, our infrastructure, stops looking like that LAMP stack on the left and starts looking more like that national electrical grid on the right, you’re inviting so much chaos to live in your house that you just have to cede control. You just have to keep your sanity, right? You just have to go, alright, it’s yours.

It’s a simple function of complexity, but that means you need real observability to do it, right? And just as chaos engineering is a form of testing in production, I’ve been saying for ages: if you don’t have observability, all you have is chaos. Not the engineering part, just the chaos. The reason that you need observability comes with these very specific requirements. And I’m trying to be very specific because it’s not about this vendor or that vendor, it’s about: can it do the job? If you can’t break down by high-cardinality dimensions, if you can’t have high dimensionality, what you can’t do at a low level is compare this test that I just did, this experiment that I just ran, this chaos that I just injected, with the baseline, right? You have to be able to compare those exact rows with baseline rows and see exactly what is different.

Are there 50 different things that are different about these errors than the baseline? I need to know. Is there one thing different? I need to know that. These are the things that will get you there. It is life changing. It is a great leap forward. It is what allows software engineers to speak to systems in the language that they understand, the language they speak all day long: the language of variables and functions and API endpoints. You can’t expect software engineers to translate to low-level systems stuff, you know, all the stuff in /proc, [indistinct].

You just can’t, and they shouldn’t have to. They should be able to interact with their code at the level of observability, to be able to ask the simple questions: I just injected this test, what happened? Right? High cardinality is not a nice-to-have. You must be able to break down by, like, one in a million things, and then break down again, right? Okay, so you get the picture. I’ve written more about this. I just want to emphasize: if you try to do this with a monitoring tool, you will be sad. You will not get the intended effects. ’Cause again, you’ll just be firing off very sophisticated tests and then driving down the road with a blindfold on, because you can’t see what it’s doing, right? I’m wrapping up Testing in Production.
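The comparison described above, experiment rows against baseline rows across every dimension, can be illustrated with a small Python sketch. This is a deliberately simplified toy and not any vendor's implementation: the `Event` shape, the field names in the usage note, and the functions `value_shares` and `biggest_differences` are all assumptions made for the example, and it assumes every field value is hashable.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple

# One wide event per request: arbitrary fields, any cardinality.
Event = Dict[str, Hashable]
Pair = Tuple[str, Hashable]


def value_shares(events: List[Event]) -> Dict[Pair, float]:
    """Fraction of events that carry each (field, value) pair."""
    if not events:
        return {}
    counts = Counter((field, value) for event in events for field, value in event.items())
    return {pair: n / len(events) for pair, n in counts.items()}


def biggest_differences(experiment: List[Event], baseline: List[Event],
                        top: int = 5) -> List[Tuple[Pair, float]]:
    """Rank (field, value) pairs by how much more or less common they are
    in the experiment rows than in the baseline rows."""
    exp, base = value_shares(experiment), value_shares(baseline)
    diffs = {pair: exp.get(pair, 0.0) - base.get(pair, 0.0)
             for pair in set(exp) | set(base)}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)[:top]
```

Given events with fields such as `endpoint`, `status` and `build`, the function returns the pairs that differ most between the two groups, which answers the question "I just injected this test, what happened?" in terms of fields an engineer already knows from the code. It only works if the raw events are kept with all of their dimensions; pre-aggregated metrics have already thrown away the rows being compared.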

Why do I think you should care about this? This is just one side of the elephant. You know, the industry-wide shift is that the center of gravity is swinging towards production for everyone. It’s hard to get right. It is advanced. It is not as easy as running a LAMP stack was; I absolutely will not argue with you there. I will also not argue with you that you can’t just drop things on people. It has to be a process. You have to do it step by step. You have to have consent. You have to have buy-in. You have to have excitement. You have to have results.

But if you can get here, it’s so much better for everyone. Like, the competitive advantage of being able to move this quickly, to not have to retrace ground over and over, going back to fixing bugs and redoing work and refactoring. So, you know, it makes all the difference, and it’s hard to explain until people have seen it. And so I’m just gonna assume that I’ve made my point and move on.

If you’re on a team that is not high performing, you know, according to the DORA metrics, believe me, I would be antsy if I were you. I would be trying to get out of there and find a place that could bring me up to a higher level. ’Cause that’s how that shit works, sorry. Do you treat your deploys like the mission-critical product that they are? This is another really important thing. Deploy code is production code, just like every feature that you write. Every manager should watch these metrics; they are accountable.

Every software engineer should be on call. That’s my personal belief. There are ways to accomplish ownership and responsibility without that, necessarily; they’re not as good. The whole point is to hook up really tight feedback loops that are not lossy, right? So that the people who have the power and the context and the ability to change something are the people that get the alerts, and they make the changes, just like that.

And everybody’s happy, right? No, it takes months. And it’s baked into being the new normal. This is table stakes. If you don’t like this, if you’re a software engineer, don’t go work on a 24/7 available service. Easy. Plenty of those. It’s also a question of training, bringing everyone along, right? Don’t just drop them in the deep end and go, “So, I’m sorry,” right?

Everyone who has privileges should know what normal looks like. If you’re only looking at your metrics and your telemetry when things are bad, you don’t know what normal looks like. Everyone should know how to deploy, how to get to a known good state, and how to do this in a controlled way, so that you have fine-grained knobs around canaries. They should know how to debug in production. They should know to share this knowledge with their coworkers.
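As one illustration of a fine-grained canary knob, here is a minimal Python sketch. It assumes a hash-based percentage rollout, which is a common approach but is not something the talk specifies; the function name `in_canary` and the `user-…` keys are hypothetical.

```python
import hashlib

def in_canary(request_key: str, canary_percent: float) -> bool:
    """Deterministically decide whether request_key is served by the canary.

    The key is hashed into one of 10,000 buckets, so canary_percent can be
    tuned in steps of 0.01. Raising the percentage only adds keys: anything
    already in the canary at 1% is still in it at 5%.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100

# Hypothetical usage: count how many of 10,000 users land in a 10% canary.
keys = [f"user-{i}" for i in range(10_000)]
canary_share = sum(in_canary(k, 10) for k in keys) / len(keys)
print(f"share of users in canary: {canary_share:.1%}")
```

Setting the percentage to 0 turns the canary off entirely, which is the simplest way back to a known good state in this sketch.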

If you must have staging environments, are you monitoring them? Why do people sink so much time into these environments when they can’t even tell if it’s... Here’s your cheat sheet if you want to get better at this. This is how to have a high-performing team, an elite-performing team. These are all a good use of your time, for almost every definition of [unclear]. Thank you. (upbeat music)
