The Microsoft DevOps Journey (so far)
“This will never work here” is the sentiment we often hear from companies just starting their DevOps transformation. The good news is that change is possible. In this talk, Sasha will explain how Microsoft moved from the 3-year waterfall software delivery cycle to deploying multiple times a day! Using the example of the live engineering environment for Azure DevOps Services, Sasha will walk through the process of updating older systems, transforming the automated tests, implementing CI/CD, and the major cultural changes that were needed to make it all possible.
- Sasha Rosenbaum, Product Manager, GitHub
Hi everyone, my name is Sasha Rosenbaum and we’re going to be talking about the Microsoft DevOps Journey So Far today. Before we get started, let me just introduce myself. So I started my career as a developer. Then I did a stint in ops. I was an architect for Microsoft for a number of years and then now I’m a product manager for GitHub. And that’s how I’m wearing a GitHub t-shirt today and talking to you about Microsoft dev-ops. So let’s talk about why we actually need dev-ops. Well, you know how every company in the world is a software company today and this is not just a cliche.
It’s actually happening, right, you used to be able to just produce car tires and just you know not have any digital presence and be okay. But today if you don’t have a website, a mobile app and something that is very good user experience and stuff like that, you’re going to be disrupted by someone who does, right? But then when we move into software and every big enterprise in the world moves into building software, they discover that hey there’s this paradox, right? We can either deliver something quickly or we can deliver it reliably or at least that’s what we used to think, right? So we said we can’t have both quality and speed at the same time, but actually if we look at the dev-ops research, that we’ve collected numbers for a number of years now, then we see that companies that move fast are actually the companies that also perform better.
It’s like developing a muscle, right? If you are deploying to production every day, you’re actually going to get really, really good at it, versus a situation in which you’re deploying to production every six months or even every year, right, and that’s a really, really traumatic event and lots of people have to work really hard on that. So if we look at the numbers, we see that deployment frequency is closely correlated with actually having a lower percentage of failure on change and faster times for recovery. So all of the good things that come from continuous integration, continuous delivery and in general like dev-ops practices.
Then if we talk about Microsoft, why do we actually need to talk about transformation? I mean Microsoft was doing pretty well, right? Microsoft was one of the only companies that’s been on a top market cap list of S&P 500 for the last 40 years, that’s insane, right? So why did we need to transform this? Well, this comic is drawn by someone who used to actually work at Microsoft and it looks uncannily true for many companies but also looks true for anyone who worked at Microsoft a few years ago, right? This is what it used to look like. In all these orgs it was actually kind of everyone for themselves, right, and people didn’t trust each other and didn’t want to rely on each other’s sort of tools. Because if my bonus is sort of tied up in your team’s performance, then you can just, you know, go away and not work on my goals and then my bonus gets affected.
So as a result of this, basically every team at Microsoft developed their own deployment structure and their own testing framework and everybody was kind of working on their own things. And this was lots and lots of duplicated work, and that was a problem for a lot of people, and so when Satya Nadella took over, one of the things that he said was hey, I would give up working on features any day just so we can work on the tools that improve our own productivity. And that’s I think an advantage that comes from having a C-E-O with an engineering background, that he understood that shiny bells and whistles of features are only great as long as your software is reliable, as long as your productivity doesn’t suffer.
So these are internal Microsoft numbers and I don’t update this slide that often. So this I think is February 2020, but we deploy to somewhere in the world, internally at Microsoft, roughly 82,000 times a day and we have roughly 110,000 engineers using our own what’s called 1-E-S, One Engineering System, including like Azure dev-ops and our own repos and stuff like that, right? So now we’re learning from our own experience. Right? Every engineer at Microsoft is testing our own software, which is a great, great advantage that we can have. And so if we talk about productivity, if we take just a swag number, right, 110,000 engineers are using Microsoft One Engineering System. So saving one second a day is saving 3.7 people.
I don’t know what 0.7 means, but basically saving four hires for the entire company. If we save one minute a day on these tools, then we can talk about 163 people that we just added to Microsoft. And if we save one hour a day for everybody at Microsoft, then we are talking about almost $3-billion a year. And these are not the numbers that we use to incentivize engineers. That’s a very bad idea. What’s a good idea is if you show up to your leadership meeting with these numbers then really people started buying into hey, it’s not just about the features. It’s not just about, you know, delivering the latest shiny thing, but also working on our own productivity and better delivery. So I’m going to just really, briefly show you a timeline.
So we started in 2010 with a tool called T-F-S at the time, some people still use it today, and it shipped on D-V-Ds at the time, right, and it took us about two years to get to the first version of that D-V-D that even worked, okay? Because we started with like it works on my machine, and like okay, barely passed the test, and then compile them and ship, right? And then you needed to hire a consultant to even install the T-F-S server because it just wouldn’t work out of the box. And then we basically made a major, major transformation from this and we moved into live site services, right? We went into software as a service with V-S-T-S in 2012 and we started providing that tool as a service in the cloud and that meant that we had to learn a completely new muscle, right? We had to learn how to actually run the service for customers, not just ship a D-V-D and then not worry about that.
And then there was a bunch of other things on this timeline. I’m not going to go deeply into many of them. But we switched Windows development to Git, to Distributed Version Control instead of T-F-V-C, which was a major kind of crazy thing. We joined the Linux Foundation, which again sounds crazy to a lot of people still. And in 2018, Microsoft acquired GitHub which is how I’m here. But yes, we’re roughly at Sprint 177 right now, and we have about 25,000 engineers working for Microsoft, on Microsoft payroll, contributing to open-source projects, which is a crazy thing for a company like Microsoft that used to say that open-source is cancer. But okay, let’s go into how we actually made this happen because it’s not an easy transformation.
So we are going to talk about, for instance, creating clarity. So how do you actually align on your goals in your organization? Well, so this book comes from John Doerr, he talks a lot about Google actually, but so basically O-K-Rs are all the rage today, and what O-K-R stands for is objectives and key results. So you start with an objective, which is like grow a happy customer base, right? So you define what your goal really is and then you go into key results. So typical good key results are something that you can measure, right? So you have to be able to attach a number to it and that has to be a sort of objective number, as objective as you can get it. So something like net promoter score, customer satisfaction, or queue time or something like that.
But don’t say we are improving things, right, improving is very vague and you can’t really measure that. And so we’re talking about measuring results, not activity. So for instance, measuring activity is, you know, we delivered five new features this month, okay. Well, you delivered features, is anyone using them? Right? So a better metric is our customers are happy with the features. Now, you have to ask yourself how you measure that customers are happy, but that’s a conversation that we’re going to go into a little bit. And then another thing is, there’s committed K-Rs and there’s aspirational K-Rs. So, when you say that everyone is committed to delivering everything 100%, people start sand-bagging, right, because I don’t want to be committed to delivering every single aspirational goal that our organization has, because I don’t want to lose my performance indicators and stuff like that.
So people start telling you that things are harder than they really are. But at the same time, you do have some committed K-Rs, because let’s say my site reliability is a committed K-R, I strive for 100% availability. Maybe I won’t hit that, but I definitely need to strive for that every single day. Right? So we have aspirational K-Rs where you can deliver some part of it and you want to strive for bigger kind of half-stretch goals, and we have some committed K-Rs which are we must deliver this. And we also have this product alignment. So it kind of starts at the top and then goes into server, and then service, and then it goes into teams, and then sometimes even into individual K-Rs. So basically you have leadership responsible for the big picture, right, a strategic goal is that Microsoft delivers X, Y, and Z, but you have teams responsible for particular details, right? So the leadership doesn’t micro-manage every single team delivery.
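The distinction between committed and aspirational key results can be sketched as a small data model. This is only an illustration of the idea from the talk, not how Microsoft tracks O-K-Rs: the class names, the example numbers, and the 70% bar for aspirational key results are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class Commitment(Enum):
    COMMITTED = "committed"        # must be delivered in full
    ASPIRATIONAL = "aspirational"  # stretch goal; partial delivery is acceptable


@dataclass
class KeyResult:
    description: str
    start: float      # measured value when the period began
    target: float     # value we want to reach
    current: float    # latest measured value
    commitment: Commitment

    def progress(self) -> float:
        """Fraction of the way from start to target, clamped to [0, 1]."""
        span = self.target - self.start
        if span == 0:
            return 1.0
        return max(0.0, min(1.0, (self.current - self.start) / span))


@dataclass
class Objective:
    title: str
    key_results: list[KeyResult] = field(default_factory=list)

    def at_risk(self, aspirational_bar: float = 0.7) -> list[KeyResult]:
        """Committed KRs are held to 100%; aspirational ones to a lower,
        assumed bar (0.7 here is a placeholder, not a figure from the talk)."""
        flagged = []
        for kr in self.key_results:
            bar = 1.0 if kr.commitment is Commitment.COMMITTED else aspirational_bar
            if kr.progress() < bar:
                flagged.append(kr)
        return flagged


# Hypothetical example data.
objective = Objective(
    "Grow a happy customer base",
    [
        KeyResult("Net promoter score", 30, 50, 45, Commitment.ASPIRATIONAL),
        KeyResult("Open high-severity incidents", 10, 0, 2, Commitment.COMMITTED),
    ],
)
```

Every key result carries a number, so "improving things" cannot be expressed in this model at all, and holding only the committed ones to 100% removes some of the incentive to sand-bag the stretch goals.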
So let’s talk about what we actually measure, especially on developer teams, right? So we measure usage, so customer engagement, customer satisfaction, customer churn. How many people dropped off, right, how many people are using the feature? Then we measure time to build, self test, deploy, and stuff like that that pertains to C-I-C-D, and we measure live site health, so like time to communicate an incident, detect, mitigate and stuff like that, S-L-A per customer, because sometimes you can be overall performing really well, but your particular customers can be suffering and you want to look at S-L-A per customer, not just S-L-A in general.
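Why S-L-A per customer matters can be shown with a few lines of arithmetic. The sketch below is a minimal illustration with made-up request counts; the function names and the 99.9% target are assumptions, not figures from the talk.

```python
from collections import defaultdict


def overall_availability(requests):
    """requests: list of (customer_id, succeeded) pairs."""
    return sum(1 for _, ok in requests if ok) / len(requests)


def availability_by_customer(requests):
    """Success ratio computed separately for each customer."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for customer_id, ok in requests:
        totals[customer_id] += 1
        if ok:
            successes[customer_id] += 1
    return {c: successes[c] / totals[c] for c in totals}


def customers_below_target(requests, target=0.999):
    """Customers whose own availability misses the target."""
    per_customer = availability_by_customer(requests)
    return sorted(c for c, ratio in per_customer.items() if ratio < target)


# Hypothetical traffic: one large customer with no failures,
# one small customer for whom half of all requests fail.
requests = (
    [("large-co", True)] * 9990
    + [("small-co", True)] * 5
    + [("small-co", False)] * 5
)
```

With this data the aggregate number still meets the assumed target, because the large customer dominates the total, while `customers_below_target(requests)` singles out the small customer who is having a bad time.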
More importantly, this is my favorite thing. There are things we don’t watch. Some things are very bad indicators. So we don’t watch if you met your original estimate, right? It doesn’t matter and you again don’t want people to start sand-bagging you on this. This is why we don’t look at team burndown and team velocity and capacity, right? These numbers are not that important. And if you say velocity is very important, then people are going to start estimating things differently. So we look at things that we have completed, not at how long it took us to complete them, because we’re not oracles, we can’t predict the future.
We don’t know what feature is going to be harder or easier to deliver. We don’t look at lines of code. My favorite one, right, is lines of code. They don’t measure productivity. We also don’t look at number of bugs found, right? It’s a bad metric because then people start finding all sorts of bugs and it does nothing for your actual customer satisfaction. But you know, you have a lot of bug-hunting going on. So we are customer obsessed. So all these metrics, if you notice on the previous slide, actually come from talking to customers and looking at what customers really want. So basically the evolution of done at Microsoft went through these stages, and I think this happens in a lot of companies. I think this is relevant for a lot of people.
So basically we started with, works on my machine, right? I delivered the feature when it compiles and works on my machine. That’s not a very good definition of done. So we got into okay, it merged into development branch, right? So we have a development branch and it actually merged into there which means it works with some other code from some other people. A little bit better. Then it went, okay, it merged into a main branch. So that’s even better right, we are getting closer to trunk-based development and all the good things. Then it passes all the tests. So that’s great. Right? It’s kind of like, okay, we’re there, right? We’re done. Well, of course not, there’s also some other stages. So live in production, then basically we thought we arrived right? Live in production.
We delivered a feature, it is working. Nothing is breaking, great. But there’s one more stage to this which is actually more important, which is live in production and customers are happy with it. So basically it is only delivered when your customers are happy with it and you have to have ways to estimate if your customers are happy with it. Otherwise, you’re just kind of working in your own engineering bunker and you don’t really know what’s happening in the outside world. So we used to actually say, okay, we know what our customers want and we are going to build it all by ourselves. Now, we do something that’s called hypothesis driven development. So basically we go and we say, okay, so we think customers want X-Y-Z, right.
Then we go talk to customers. We actually do customer research. We understand how to develop a feature, how to design it, but then once we deliver it, we actually look at customers, how many customers are using it, how many customers are happy with it? How many customers are recommending it to their friends and stuff like that? So basically all of that is a definition of done and I think that makes Microsoft products a lot better than they used to be. So, um, we also gather customer feedback. So there’s quite a few ways, and sometimes there are too many ways. But we basically gather it from Stack Overflow and M-S-D-N forums. I think that’s been renamed from M-S-D-N now, I need to update the slide. But basically we are, um, gathering information from customers talking about our features.
We also gather feedback in the product. So a lot of what went into these dialogs is actually estimating how much to do so that it’s not distracting and customers don’t hate you for this. But you basically can report a bug even if you’re in a free tier. You can make a suggestion, right, so you can measure customer experience in your own products. So if I just discovered a new feature and I hated it, I can tell Microsoft about it, or if I love it I can tell Microsoft about it.
Then we have something that’s called Customer Champions. So basically we talk to our largest customers and we actually attach like an Engineering Champion to a customer. So that allows us to handle, you know, that situation that I talked about when most of your customers are happy, but some customer is not. Especially for the largest customers, a big enterprise may not be in the same situation that our free-tier customers are, so basically we want to have a person who talks to them, maybe on a monthly basis, and actually finds out what their experience is like and can champion for them inside the organization, say hey, you know what? This and this enterprise is struggling with X or they really want feature Y, so please, please, please product team, we need to work on this, right, so that allows us to keep in touch. And then also, okay, kind of switching gears here, I’m going to talk about something that also matters, which is how we build our teams.
So basically, it started with Microsoft having program management, development, and testing, right, and development and testing used to be different things. Right? We decided that it wasn’t representative. So we basically switched it to everyone is a software engineer, whether you write tests today or you write code today. You are a software engineer. And then basically we have program management, sort of an S-R-E discipline, and engineering. And then the major change was actually combining all these teams into feature teams. So instead of going horizontal, where I work on the data and you work on the U-I, and if the customer is not happy with the U-I, it’s your problem.
We now go vertical, right? So everyone who’s working on a feature is in this vertical team, and that, we find, increases the sense of ownership for everybody on the team working on the feature, right? And for actually delivering customer satisfaction. So every team has, you know, someone who’s responsible for deploying it, and someone is responsible for data, and someone’s responsible for A-P-I, and someone’s responsible for design, and stuff like that, and maybe you don’t get a, you know, full-time designer on your team. Maybe you got one-fifth of their time, but actually the designer is part of that feature and part of ownership of that feature to the customer, right? So again it increases the sense of ownership, and people, instead of blaming people in the other layers, are just looking at how happy customers are with the feature they actually delivered.
And then we do something, so I think I’ve only encountered that at Microsoft. I’ve never seen it anywhere else. Basically, we do something that’s called yellow-sticky-notes exercises. So every now and again, usually not more often than once a year, but it kind of depends on what’s going on. So every now and again we have this situation where product managers come in and they pitch their feature or their work-load to the team, and engineers can go and switch teams and choose which team they’re working on. So what we see is most of the engineers don’t switch, it’s less than 20% of people switching, but it encourages this sort of self-forming team, and if I’m not interested in this and I want to learn some other technology or I want to work on some other problem, then I can switch.
It also encourages people to kind of stay alert and immersed in their work. So this is also something that I’ve seen at Microsoft that I haven’t really seen in other places, which is people are encouraged to change jobs inside of Microsoft. You don’t have to hide it from your manager that you’re interviewing for another team because your manager will support you in taking a new direction, because Microsoft understands that people are happy when they’re challenged and so you don’t want to stay in the same job for, you know, way too long. So you basically want to keep moving. So, okay, and then let’s talk a little bit about how to collaborate on code. So open-source is this big thing that happened over the past, you know, I don’t know, few decades, and then basically open-source also influenced how we work inside our organization.
So one of the things that Microsoft is currently working with is InnerSource, so that’s when you allow other people to change and contribute to the code that you are developing. So basically this funnel is more for open-source, but it works the same way for InnerSource. Basically most of the people that consume your code will just consume your code, right? They will take your code and they will use it and you will never hear from them. Then a few people contribute time. So they will log a bug or they will improve your documentation or something like that. Fewer people would contribute code, so they will do a bug fix or develop a new test or do something like that. And then a very, very small number of people will actually own the project, right? So they will actually be part of your team. They will influence the direction even if they’re on the outside. Often you end up hiring them.
That is what happens. So basically, if you want people to contribute to your projects you need to do a few things. So you need to have a really good README about what the project is about. You need to have a really good CONTRIBUTING.md. So you describe how to get started, right? Because if people show up and they don’t know how to get started, they’re not going to contribute, right? So you have to be very, very specific about how to work with your project. And then it’s also really useful to maintain a list of good first issues. So if someone shows up and they want to pick up and contribute, they can actually pick up an issue that’s been designated as something simple that they can learn and contribute very quickly.
So if we look at internal Microsoft data, 90% of pull requests originate from the same team. But then we do have about 9% of pull requests originating from nearby teams and then 1% from very distant teams. So this is actually very cool because if you think about it, sometimes as an engineer, I am blocked by a different team, right? So I want a particular feature that another team is responsible for and it’s not a priority for them, so they are not getting to delivering that. Well, with the pull request and approval process, I can actually go and contribute that feature and they can test my code, make sure that it works, make sure that it’s up to their standard, and actually go with it, right? And that enables me to unblock myself by contributing to other people’s teams.
So basically to do that, we actually did change incentives inside Microsoft. So instead of being rewarded only on your personal contributions, we are now rewarding people on contributing to other people’s success and leveraging other people’s work, right? So when your bonus depends on competition you’re going to compete, but if your bonus depends on collaboration, you are more likely to collaborate. So the other thing, and again I’m starting a new topic a little bit, is iterating over pain. So this is a lesson for any dev-ops transformation ever, right? You start where it hurts the most and then you move slowly and you fix the most painful thing and then you pick up the next most painful thing and stuff like that.
So a good example that I like for this is where we were with the test automation. So, Microsoft actually started with tests that, so for the Azure DevOps product, the tests were running for 24 hours, actually a little bit more than 24 hours. So basically you couldn’t release the service every day because you couldn’t even run the tests every day, right? They wouldn’t complete in a day. So basically over time we moved to the situation where we actually have 85,000 tests running in under seven minutes. This is a real screenshot. So usually when people see that, they’re like, oh my God, how did you do that? Well, it took a long time. It took about three years to get to this state. So basically we didn’t say okay, we’re now going to work only on improving testing, right, because customers don’t usually understand this, right? We basically took a little bit of time out of every sprint to improve the tests.
We eliminated flaky tests. We eliminated actually most of the U-I tests because they were very unreliable and they took a long time. And basically we switched it in tiny bits over time. And so now we got to this, and this is actually just the tests that run on every pull request, right? So this is not even a full test suite, but it runs in under seven minutes. So basically one of the things that I highly recommend is using pull requests as a sort of gateway to production, right? So you can both review the code on the pull request and you can run lots of automated tests on the pull request. So basically before your code even makes it into the main branch, you know that you’ve verified the coding standards on that code, and the pull request is tremendously useful for that as a tool.
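The pull-request-as-gateway idea boils down to a simple rule: nothing merges without a review and a fully passing set of automated checks. The sketch below models that rule only; it is not the Azure DevOps or GitHub branch-policy API, and the class names, check names, and the single-approval default are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool


@dataclass(frozen=True)
class PullRequest:
    approvals: int
    checks: tuple[CheckResult, ...]


def can_merge(pr: PullRequest, required_approvals: int = 1):
    """Return (allowed, reasons). A pull request merges only if it has
    enough approvals and every automated check ran and passed."""
    reasons = []
    if pr.approvals < required_approvals:
        reasons.append(f"needs {required_approvals - pr.approvals} more approval(s)")
    if not pr.checks:
        reasons.append("no automated checks ran")
    for check in pr.checks:
        if not check.passed:
            reasons.append(f"check failed: {check.name}")
    return (not reasons, reasons)


# Hypothetical pull requests.
green = PullRequest(1, (CheckResult("unit-tests", True), CheckResult("lint", True)))
red = PullRequest(0, (CheckResult("unit-tests", False), CheckResult("lint", True)))
```

There is deliberately no override parameter: a red check always blocks the merge, which only works as a policy if the checks themselves are reliable.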
And then you can have, you know, the approach that only green builds can make it to production. So in this screenshot, everything is green; not everything is always green. But only the green builds actually make it to prod, right? There’s no such thing as oh, the tests were red, but we pushed them anyway because they’re flaky. Eliminate flaky tests. A flaky test is sometimes worse than no test, right? So we want to make sure that all of our tests are 100% reliable and they’re actually verifying the quality of our software. And then something that kind of goes hand in hand with all the deployment stuff is trunk-based development. So basically when I talk about how Microsoft develops on the main branch all the time, right, and everything goes into production all the time, people are like, oh my God, that’s impossible, right? I can’t do that. Like I need to have a development branch.
I need to, you know, have a staging branch, run my tests and stuff like that. So I have a question for you. Do you test in production? Usually people get a heart attack and stuff. But the answer to this is, if you’ve ever fixed a bug in production, then you are testing in production. Basically all the code that you’re pushing all the time is getting tested by your users, which is not ideal, right? You want your code being tested by testers, not by users.
So basically how do you develop three months’ worth of a feature and push to production all the time? Well, I’m not ready with my code, so how could I possibly do that? The answer to that is feature flags, and feature flags are tremendously useful. So all of the code is deployed all the time, right, but you actually have control. So in Microsoft products and GitHub products, we have control on both sides. So basically the user can opt in to your feature and also the team on the back-end can turn on the feature, right, so we can choose for instance to turn on the feature for a subset of users or stuff like that. Right? And so all the code is pushed all the time.
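The two-sided control described here (user opt-in plus a team-side switch for a subset of users) can be sketched as follows. This is a minimal illustration, not the flag system used by Azure DevOps or GitHub; the `FeatureFlag` class, its field names, and the hash-based percentage rollout are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    name: str
    enabled_by_team: bool = False    # back-end master switch; off means off for everyone
    rollout_percent: int = 0         # share of users the team has turned the feature on for
    user_opt_in_allowed: bool = False
    opted_in_users: set[str] = field(default_factory=set)

    def _bucket(self, user_id: str) -> int:
        """Map a user to a stable bucket in 0..99 so the same user always
        gets the same answer for this flag."""
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_enabled_for(self, user_id: str) -> bool:
        if not self.enabled_by_team:
            return False
        if self.user_opt_in_allowed and user_id in self.opted_in_users:
            return True
        return self._bucket(user_id) < self.rollout_percent


# Hypothetical flag: code is deployed, visible only to users who opted in.
new_editor = FeatureFlag("new-editor", enabled_by_team=True, user_opt_in_allowed=True)
new_editor.opted_in_users.add("alice")


def render_editor(user_id: str) -> str:
    # Both code paths ship to production; the flag picks one at run time.
    return "new editor" if new_editor.is_enabled_for(user_id) else "old editor"
```

Raising `rollout_percent` widens the audience without a new deployment, and setting `enabled_by_team` back to `False` turns the feature off for everyone, including users who opted in.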
This is essentially dark launch, right? We’re dark launching our code all the time, which is way safer than developing something for three months and then pushing it to production for the first time, right, before it’s getting used, right? And then the other thing that we do is ring-based deployment. So basically we start small and kind of roll through the data centers into which we deploy stuff, and so for us, we’re lucky and we have internal users. Like I said, 110,000 or so engineers at Microsoft working with our tools. So basically we have lots of our own testers.
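Ring-based deployment can be pictured as a loop that deploys to one ring at a time, waits, checks health, and only then moves outward. The sketch below is a simplified model under stated assumptions: the ring names, the `bake_hours` values, and the `is_healthy` callback are hypothetical, and a real rollout would actually wait and query monitoring rather than add up hours.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Ring:
    name: str
    bake_hours: int  # how long to observe this ring before moving outward


def roll_out(rings: list[Ring], is_healthy: Callable[[Ring], bool]):
    """Deploy ring by ring, starting small. Stop at the first unhealthy ring.

    Returns (deployed_ring_names, halted_at, hours_waited), where halted_at
    is None if the rollout reached every ring.
    """
    deployed: list[str] = []
    hours_waited = 0
    halted_at: Optional[str] = None
    for ring in rings:
        deployed.append(ring.name)
        if not is_healthy(ring):
            halted_at = ring.name
            break
        hours_waited += ring.bake_hours
    return deployed, halted_at, hours_waited


# Hypothetical rings, innermost first.
rings = [
    Ring("internal-users", 24),
    Ring("early-external", 24),
    Ring("everyone", 0),
]
```

Because the innermost ring is internal users, a bad build that fails the health check there never reaches external customers: `roll_out` returns with `halted_at` set to that ring and the outer rings untouched.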
So if something doesn’t work, we’re going to get a call from a team next door saying like hey, you broke a feature, right? Not everyone is this lucky, so if you’re not this lucky you need more elaborate testing and more elaborate Q-A procedures. But we do push first to internal users and we have a waiting period of 24 hours before that goes to the first external customers. Okay. So some more notes, and this is kind of an abbreviated version of some of these. So, secure the software supply chain. Security is really important and it’s really hard. We have many more developers than software security professionals. So we need to worry about the code we’re pushing all the time. So one thing we know from software reports is that the more code we push, the more security issues we actually introduce, right? So really we’re not learning from mistakes.
We’re actually pushing security vulnerabilities with every line of code. So what you want is to automate this as much as possible, because people don’t like to talk about security, a lot of people don’t understand security, right, and again, we can talk about security for another hour, but in general, you want to basically add security procedures into your pipeline as much as you can. Right? So you can have I-D-E security plug-ins. You can have threat modeling, which involves real people discussing features, right? You can have pre-commit hooks and peer review on pull requests and stuff like that. Right? Dependency management is really, really important because most of the hacks today are actually done by using known vulnerabilities to attack deployed software.
And then, you know, infra as code actually is very helpful, because if you can rotate your infrastructure, then you have no long-standing servers, and when you have no long-standing servers, it’s less likely that the hackers are going to be able to sort of camp in your infrastructure, which happens actually a lot. So rotating your software, rotating your infra, rotating your keys as much as possible, right? And then of course you get all the way to the deployment and you have pen testing and all sorts of security controls. And again, no one wants to be the next headline, right? I know sometimes security seems like a pain, but it is really important to invest in it.
And then we want to build for resiliency. So we want our apps to be resilient and highly available, right, so again, we can talk for hours about S-L-As and S-L-Os and all that stuff. But basically we want to have a live site service that is available all the time. We’re in a very different predicament than we used to be a few years ago. And we actually want to make sure our customers have a good experience with that. So yeah, the point of this slide is being transparent with your customers. We’ve learned that actually communicating stuff to our customers about the incidents works better than trying to pretend they never happened. Okay, so I’m kind of getting to time so I’m not going to dive super deep into all this.
But of course, the journey continues, so we’re investing in remote work now. So this is something that’s not new to GitHub, but very new to Microsoft. Microsoft used to be co-located for the most part. And so now we are trying to work on empowering knowledge sharing between our teams. And one thing that I want to recommend is a book called Making Work Visible. So this is something that I think is more important in a remote world than it ever was in person, because you can actually make sure that what people are working on is visible and so people understand the productivity and the work-load and the level of burnout on your teams.
And then one of the cool things that we’re working on is M-L ops. So basically M-L scientists are struggling with completely new and sometimes the same problems that developers were struggling with ten years ago. And so basically Microsoft is working on releasing tools for making that workload easier. And a journey of a 1,000 miles starts with a single sprint. Dev-ops isn’t magic, if Microsoft can do it you can too. So thank you so much. I’m @DivineOps on GitHub and on Twitter and I hope to see you soon.