DevSecOps and Secure Incident Response

Session Information

Let's talk about security in an organization. Most commonly, security sits at or after the last phase of the software development life cycle (SDLC) and can make or break the decision to release into production. Unfortunately, waiting for such decisive feedback until after something has been built frequently means making changes after the work has been marked 'complete', which is costly and inefficient. Instead, let's learn from how we created shorter development cycles: rather than making Big Decisions at the very end, make smaller, iterative decisions throughout the entire journey that are easier to implement or reverse. One way to do that is by implementing DevSecOps, which adjusts the workflows of development, operations, and security so that security decisions are made on smaller scales at every phase of the SDLC. As with development and operations, even with preparation there can still be incidents - in this case, security incidents - so I'll also be reviewing our 14-step secure incident response process, including the what and why of each step.

Presenters
  • Quintessence Anx
    Developer Advocate, PagerDuty
  • Hello, my name is Quintessence and I'm a developer advocate here at PagerDuty, and I'm going to be talking to you about DevSecOps and Secure Incident Response. Most of my slides are intentionally very text-light, but any slide with resources I'm directly linking to will have this handy little link icon, and I'll link to the entire resource page at the end of my presentation. Let's take a look at how software requests are handled now. It probably looks really familiar.

    You have a feature or a bug fix or something coming into a queue through some kind of submission process. It gets entered, it gets triaged for where and when it's going to be worked, and it enters a very familiar life cycle. We start the planning, we do some design, we start producing code, do some tests and so forth, but right towards the end of that life cycle, we vault it over the wall to security. That's what happens in the current process: we say, okay, we're done, so we're going to have security review it and see if any issues come up. And because security is at the end, usually issues do come up. What ends up happening is you get this volley procedure where you go through the entire loop again - planning, testing, et cetera - for anything they kick back.

    New things might be introduced in that loop, so security could kick the work back again - back and forth. This introduces a lot of frustration on both sides. On the Dev and/or Ops side, you usually have something that, at least in your mind, you've conceptualized as done by some definition of done. When you send it over to security, you're probably expecting them to just check a box and say, yes, complete, and then everybody goes out for coffee or whatever you do when you finish a massive piece of work.

    Unfortunately, because security hasn't seen the work before, you rarely get that instant 'done.' Instead, they send back something you internally felt was finished and say it's not done, and you're like, what do you mean it's not done? It's done. On their side, they feel like they could have told you sooner if they had been asked sooner, because depending on what they find, the problem might have been introduced as early as the design: a dependency that was chosen, an image that wasn't scanned, something like that. They could have provided that feedback earlier if they had been asked.

    So there's a lot of frustration here, and it probably sounds familiar: it's the same frustration that Dev and Ops had with each other some years ago, the one we set out to resolve with DevOps. There was a lot of friction between those two groups because they needed to work together, but their workflows weren't combined in a way that was helpful to them. Getting a little ahead of myself: if security needs to work with Dev and Ops in a similar way, what can we introduce to streamline these workflows? The answer, and what I'll be talking about today, is DevSecOps. What is DevSecOps? It stands for development, security, and operations, and it seeks to integrate security across the software development life cycle and streamline the workflow of these three groups.

    To be very specific, there are a few things DevSecOps is not. It is not replacing security with development or operations, expecting development or operations to become security specialists, or expecting security specialists to become development or operations. That's a mouthful, but succinctly, and as I indicated before, what we're really trying to do is do for development, security, and operations what DevOps did for development and operations. We do this through a couple of mechanisms, chiefly the secure software development life cycle via 'shifting left,' a phrase you've probably heard a few times before now.

    The idea with the entire process is to break down the barriers between these groups. Before DevOps, you had a Dev silo and an Ops silo, and that was not ideal, so we streamlined the workflow. Now we have Dev and Ops in a silo together and security outside in their own silo. When you look at the whole work process - in a very simplified diagram - you'll notice something a little different. When Dev and Ops unified their workflows, they really just came together in the middle: you still have a strongly Dev-heavy side and a strongly Ops-heavy side. Security, though, actually needs to hug all the phases across that life cycle.

    If you want to be a little more prescriptive, here's one example of a secure development life cycle. And just to be very, very clear, this is one example: yours may look different, and somebody else's might look different from yours. Everyone has different requirements, but I wanted to show something a little more specific than a generic diagram. This particular secure software development life cycle comes from the Six Pillars of DevSecOps, and you can see there are different activities to be done at every phase.

    You have activities in secure design and coding, you have testing, and you can do security-centric things in your CI/CD pipeline, as well as in your deployment phase, runtime, and monitoring. To look at a few of these: under secure architecture and design there's threat modeling, an activity I'll talk about in a minute, so I won't get ahead of myself here. There's also SAST and DAST - static and dynamic application security testing. Some of these can be automated; some really cannot be, due to their duration. You can also scan images and dependencies to make sure they're not introducing vulnerabilities you weren't aware of.
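
    To make that pipeline piece concrete, here is a minimal sketch of what a security gate stage could look like. The specific scanners (bandit for SAST on a Python codebase, pip-audit for dependencies, trivy for images) and the image name are illustrative assumptions - substitute whatever tools your security team has standardized on.

```python
"""Minimal sketch of a CI security gate (assumes a Python codebase and
that bandit, pip-audit, and trivy are installed on the build agent)."""
import subprocess
import sys

# One entry per scan: SAST (bandit), dependency audit (pip-audit),
# and a container image scan (trivy). Adjust to your own stack.
CHECKS = [
    ("sast", ["bandit", "-r", "src/", "-ll"]),   # medium+ severity findings
    ("deps", ["pip-audit"]),                      # known-vulnerable dependencies
    ("image", ["trivy", "image", "--exit-code", "1", "myapp:latest"]),  # hypothetical image
]

def main() -> int:
    failed = []
    for name, cmd in CHECKS:
        print(f"--- {name}: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:  # non-zero exit = findings
            failed.append(name)
    if failed:
        print("security gate failed:", ", ".join(failed))
        return 1
    print("security gate passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```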

    Then there's fuzzing, which tests your inputs. If you have an application with a form that requires, let's say, a date and you put in a binary, does it crash? Does it try to run it? Who knows, right? All of these are things security can do, and you'll want to work with them to see what needs to be done for whatever products or services you're building. And just really quickly, the reason we call this 'shifting left': in a diagram that reads left to right, shift left just means do it earlier. Moving leftwards means doing things earlier, all the way back at the beginning of the design phase.
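
    Purely as a toy illustration of that fuzzing idea - real fuzzers such as AFL or Hypothesis are far more sophisticated - here is what throwing junk at a date field might look like, with parse_event_date standing in for the form's validation:

```python
"""Toy fuzzer: feed random garbage to an input that expects a date and
check whether it fails safely (ValueError) or blows up unexpectedly."""
import random
import string
from datetime import date

def parse_event_date(raw: str) -> date:
    # Stand-in for "a form field that requires a date".
    return date.fromisoformat(raw)

random.seed(0)
crashes = 0
for _ in range(10_000):
    junk = "".join(random.choice(string.printable + "\x00\xff")
                   for _ in range(random.randint(0, 20)))
    try:
        parse_event_date(junk)
    except ValueError:
        pass                  # rejected cleanly: the behavior we want
    except Exception as exc:  # anything else is a finding to investigate
        crashes += 1
        print(f"unexpected {type(exc).__name__} on {junk!r}")

print(f"{crashes} unexpected failures out of 10,000 inputs")
```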

    Something that's very important, and I cannot stress this enough: do not try to do this yourself from scratch. Starting from zero will be a very painful experience, even if you have some pretty experienced security people on your team. There is a ton of information available online, and there are frameworks you can work from. You do not need to implement a framework line by line unless some requirement, industry or otherwise, obliges you to, but you can look at these frameworks and figure out what works for you, so that you're not starting from zero - you're starting from somewhere. Some common frameworks for this are BSIMM, DSOMM, and SAMM, and links for these are provided at the end.

    I'm not going to belabor this too much; I just wanted to be very clear that you don't have to start from scratch - you can work with your team and with others to get more information. Now let's talk about the how. When you're undertaking a cultural initiative, you're going to make changes to your organization and lean on that sweet, sweet cultural support, which means you're going to have to work with humans, which is always an interesting time. And when you're thinking about humans, you need to keep track of who you're working with: the people who do the implementation - the ones actually practicing the cultural values and/or making the changes, the people who write the code, for example - and the people who make the decisions and set broader strategy. One common analogy in this space is the blunt end and the sharp end, or the pointy end.

    'Stick them with the pointy end' - I really just wanted to quote "Game of Thrones" for you, is what that gets down to, okay. At the sharp end you have high risk and low power: these are the people writing the code or doing the builds or whatever their task is. They're doing the implementation, but they're not making the decisions in the broader sense. At the opposite end, you have the blunt end, which guides the spear. They're not doing the implementation.

    They probably don't even know what the code base looks like, but they do know what they want - this is usually the higher levels, the execs or the managers and so on. Less long-windedly, it's all about getting exec buy-in. If you need to make this kind of cultural change and start integrating security into your life cycle, you need to be talking to the various management and exec layers to get their approval, because they're the ones with yea-or-nay power. They're also the ones approving things as you plan and stage out the work: if they're expecting a certain return on that work, or a certain number of features being produced, or whatever they're measuring, they're going to see fewer of those while you're spending time on training and changing how the work flows.

    So what you want to communicate to them is: this is what we're intending to do, and this is why you'll see this short-term - I'm going to say loss, though it's not really a loss, just reduced velocity. Then you explain the longer-term benefits: regained velocity, but also a more secure product, increased trust, and so on. Moving on to the ICs - the individual contributors - I want to talk for a minute about not tricking staff. This is actually a policy we have here at PagerDuty. The most common example is phishing emails.

    Think about how internal phishing campaigns usually work: the company sends out an approved phishing email, and if you click it, you either get an email that says 'gotcha,' or you get signed up for training, or both. That's not necessarily a huge deal unless your organization makes it one, but what it does do is train people not to reach out to security when there are other problems - if they click on something, or notice a vulnerability on their machine - because you've trained the expectation that the issue won't be addressed directly; instead it gets pivoted into these other, punishment-oriented processes. What you want instead is to teach people how to recognize exploits rather than punishing them for falling for them. Consider how much phishing people are exposed to: over the course of the pandemic, since I'm picking on phishing emails right now, people have sent phishing emails about stimulus checks and COVID vaccines and all sorts of things that are really predatory, because phishing emails are predatory. When your security organization mirrors that behavior, that's the mindset the rest of the staff will associate with security, especially if they're not adjacent to security at all.

    What you can do instead is teach people how to recognize these exploits. That puts you in a trusted-advisor kind of role, and they will start to reach out to you as you teach them about other things. And speaking of teaching, you want to make sure you're doing trainings. A lot of us have probably attended mass-market security trainings of one flavor or another, and to be clear, they are still useful - something is better than nothing - but they're not very tailored. You might find yourself in one of these trainings getting drilled on one specific topic - setting up MFA, phishing emails, don't download attachments, whatever it is - heavy training that doesn't actually touch on the things where you personally, or your staff collectively, are weak. What really helps, if your security team is able and has enough staff to maintain it, is to look at where people are strong and where they're weak, and customize the training to be light-touch on the things people find easy and dive deeper into the things people find hard.

    Relevant to this, you'll notice there's a little link in the corner: we have a sanitized version of our internal security training up on our website under the Apache 2.0 license. In the spirit of not starting from scratch, you can clone it, re-brand it, update it, and make it fit your own organization - in fact, we strongly encourage you to do so.

    Something else I'd like to talk about on the security side is full-service ownership. Briefly, full-service ownership is when you own the life cycle of a service you're working on. What can help here is having security own a security-relevant service - a quick example: if you're using HashiCorp Vault, something like that, they can maintain that service in production. That gives them visibility into why Dev and/or Ops send requests the way they do, because they'll start to know what expectations are coming down on those teams and can begin to anticipate them. Something that can help development and/or operations get more security-conscious is games like Capture the Flag.

    The idea is that you might have a file in the root directory whose contents are the flag - the digital flag - and you need to access that file without doing something simple like just switching users; you need to actually do something like a privilege escalation, or whatever the exercise calls for, to capture the flag. The goal is to increase how security-aware you are, so you start to understand whether an exploit is easy or difficult to pull off, and you can keep that in mind when you're writing your code. Threat modeling, meanwhile, is something all groups can do together - in fact, you might want to include product as well, because they also need some security awareness, and some awareness of how the product is being developed.

    The idea is that this group sits together when there's, say, a feature or a major upgrade or a major fix happening, and you model out what risk it introduces and how. If you're going to switch from an external payment platform to an internal one, and you'll now be holding payment information, what does that introduce? How are you designing for it? This threat modeling exercise can really help, because there's massive cross-team collaboration here.
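
    One lightweight way to capture the output of that exercise - the structure below and the use of STRIDE categories are my illustration, not something prescribed here - is to record each identified threat as data the teams can track:

```python
"""Sketch: recording threat-modeling findings, categorized with STRIDE."""
from dataclasses import dataclass
from enum import Enum

class Stride(Enum):
    SPOOFING = "spoofing"
    TAMPERING = "tampering"
    REPUDIATION = "repudiation"
    INFO_DISCLOSURE = "information disclosure"
    DENIAL_OF_SERVICE = "denial of service"
    ELEVATION = "elevation of privilege"

@dataclass
class Threat:
    feature: str      # the change being modeled
    category: Stride
    description: str
    mitigation: str
    owner: str        # which team follows up: dev, ops, security, product

# The talk's example: bringing payment handling in-house.
threats = [
    Threat("in-house payments", Stride.INFO_DISCLOSURE,
           "card data now stored in our own database",
           "tokenize card numbers, encrypt at rest, restrict table access",
           "security"),
    Threat("in-house payments", Stride.TAMPERING,
           "payment amounts could be altered between internal services",
           "mutual TLS and signed requests between services",
           "dev"),
]

for t in threats:
    print(f"[{t.category.value}] {t.description} -> {t.mitigation} ({t.owner})")
```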

    Of course, once you go through all of these things, you'll never have another security incident again - just kidding. You will have security incidents; this is about minimizing them and reducing their severity rather than eliminating them, although elimination is also a goal. So I want to review our secure incident response process: the 14-step generalized process we use here at PagerDuty. The steps and their order may be a little different for you depending on the type of incident, but let's go through them one by one. The first thing you want to do is stop the attack in progress. For example, if someone has gotten behind a firewall, you want to cut off their access and stop whatever they're doing - stop the download, stop whatever they're reading. Then you want to cut off the attack vector. If there's a compromised token or credential, you want to rotate it as quickly as possible. An analogy: if someone broke into a house, your first step is to physically remove them, and the second step is to lock the doors and windows so they cannot get back in, at least not via the same route. That's what we're doing with these two steps: getting them out, then preventing re-entry.
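
    As a sketch of what 'cut off the attack vector' can mean for a leaked credential - the auth service, endpoints, and token ID below are entirely hypothetical; in practice you'd use whatever revocation mechanism your identity provider, cloud IAM, or secrets manager exposes:

```python
"""Hypothetical sketch: revoke a compromised token and kill its sessions."""
import requests  # assumes the requests package is available

AUTH_API = "https://auth.example.internal"  # hypothetical internal service
ADMIN_TOKEN = "responder-credential"        # placeholder responder credential

def revoke_token(token_id: str) -> None:
    headers = {"Authorization": f"Bearer {ADMIN_TOKEN}"}
    # 1. Kill the compromised credential so it stops working immediately.
    requests.post(f"{AUTH_API}/tokens/{token_id}/revoke",
                  headers=headers, timeout=5).raise_for_status()
    # 2. Terminate sessions minted with it, so access the attacker
    #    already established dies with the token.
    requests.post(f"{AUTH_API}/tokens/{token_id}/sessions/terminate",
                  headers=headers, timeout=5).raise_for_status()

revoke_token("tok-compromised-123")  # placeholder ID for the leaked token
```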

    At this point, you're going to assemble the response team. This is a little different from a non-security incident: if you recall, a more standard, non-security incident response process is to assemble first, sometimes even to determine who owns the incident or the service.

    That's not the case here, because unlike a traditional outage or latency issue, data could be being read, copied, or compromised in some way, and you really need to cut that off before you start assembling people. The caveat is that if it's not a quick fix, you assemble people in order to cut off access - so choose whatever is appropriate to the situation. Once you're there, you want to isolate any affected instances, servers, VMs, databases, or whatever is being touched at the time.

    Then you'll want to work out the timeline of the attack, because the attack as you see it today might be longer than it appears - longer than the five minutes, or however long, it's visibly been going on. If, for example, you discover that the cause is a vulnerability or a CVE that's been present in your environment, the attacker could have been using it for a long time - an hour, a month, or more. So you need to work through the evidence and see how long they definitely had access and/or possibly had access.
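
    A sketch of what bounding that window can look like in practice - the log path, format, and token identifier are assumptions; adapt the parsing to whatever your systems actually emit:

```python
"""Sketch: bound the attack window by finding the first and last time a
compromised token appears in an auth log."""
from datetime import datetime

COMPROMISED_TOKEN = "tok-compromised-123"  # placeholder identifier
LOG_PATH = "/var/log/auth/access.log"      # hypothetical log location

first_seen = last_seen = None
with open(LOG_PATH) as log:
    for line in log:
        if COMPROMISED_TOKEN not in line:
            continue
        # Assumed line format: "2021-06-01T12:34:56 <token> <action> ..."
        stamp = datetime.fromisoformat(line.split()[0])
        first_seen = first_seen or stamp
        last_seen = stamp

if first_seen:
    print(f"token active from {first_seen} to {last_seen}")
    print("treat anything readable with it in that window as possibly accessed")
else:
    print("token not found in this log; widen the search")
```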

    Next, if there is a data breach, you need to identify the compromised data and assess the risk to other systems. The idea behind these two steps is that if you know the attacker was copying data from one specific table, but they also had access to another table in the same space or schema, they could have copied, or at least read, that one too - so it's a risk to that other system and that other set of data. At this phase, you're also assessing the risk of re-attack: if a vulnerability was introduced into your system and it's not something you can patch or fix yourself, that changes how likely it is to be exploited again. You'll also want to apply any additional mitigations and additions to monitoring, which is super important if you were notified by a human rather than by your monitoring system. Say somebody just happened to notice a massive data copy but no monitors went off: you'll want to update those monitors with whatever thresholds you think are appropriate, check for log-ins that don't make sense, and so forth.
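
    To make that concrete, here's a sketch of the kind of monitor you might add after a human-detected incident: alerting when outbound data volume from a host exceeds a baseline. The baseline, the metrics query, and the paging call are all stand-ins for your own monitoring stack (the alert could go to, say, your paging service's events API):

```python
"""Sketch: page when egress from a host far exceeds its normal baseline."""
import random

BASELINE_BYTES_PER_MIN = 50_000_000  # assumed normal egress; tune to your data
ALERT_MULTIPLIER = 5                 # page at 5x baseline

def egress_bytes_last_minute(host: str) -> int:
    # Stand-in for a query against your metrics system; simulated
    # here so the sketch runs end to end.
    return random.randint(0, 400_000_000)

def page_security(summary: str) -> None:
    # Stand-in for an alerting call (e.g. an event to your paging service).
    print("PAGE:", summary)

def check_egress(host: str) -> None:
    observed = egress_bytes_last_minute(host)
    if observed > BASELINE_BYTES_PER_MIN * ALERT_MULTIPLIER:
        page_security(f"{host}: egress {observed} B/min exceeds "
                      f"{ALERT_MULTIPLIER}x baseline; possible bulk data copy")

check_egress("db-primary-1")  # hypothetical host name
```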

    You'll also want to do a forensic analysis of the compromised systems. Earlier, if you had a data breach, you would have put everything in read-only mode and locked it down so no further changes could happen. Now, with nothing changing underneath you, you do the forensics - you may need a third party for this - to see what actually happened: was anything just copied, was it changed, what went on here?
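
    One small, self-contained forensic step - sketched here under the assumptions that you have a read-only snapshot mounted and a known-good manifest of file hashes; real investigations typically involve specialized tooling or a third party:

```python
"""Sketch: compare files on an isolated snapshot against known-good
SHA-256 hashes to see what, if anything, was changed."""
import hashlib
import json
from pathlib import Path

MANIFEST = Path("known_good_hashes.json")  # hypothetical: {"rel/path": "sha256hex"}
EVIDENCE_ROOT = Path("/mnt/evidence")      # read-only mounted snapshot

def sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

known = json.loads(MANIFEST.read_text())
for rel_path, expected in known.items():
    target = EVIDENCE_ROOT / rel_path
    if not target.exists():
        print(f"MISSING  {rel_path}")
    elif sha256(target) != expected:
        print(f"MODIFIED {rel_path}")  # changed relative to the baseline
```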

    Towards the end, and only if necessary, is when you send out an internal communication - very intentionally towards the end of the incident. This is another difference from a non-security incident: you may recall that for a non-security incident we recommend sending updates every 30-ish minutes, because people are anxious to know what's happening. AWS and the like have whole Twitter accounts and status pages devoted to this, sending out communications like 'there's still an outage, we're still working on it, rebooting the cluster' - whatever is relevant. You don't want to do this for a security incident, for a couple of reasons. One, you don't really alleviate anyone's anxiety by saying 'someone's still copying the data, we don't know why' every 30 minutes. The other is that you don't give anyone anything to do: you can't tell them 'this data is compromised' because you don't necessarily know yet. If you communicate throughout the incident, it's entirely possible you reach the end of the resolution phase and there's nothing anyone needs to do.

    They might not need to be notified at all; there might not have been a data breach, et cetera, so the extra communication isn't ideal. The other thing to be aware of: until you're sure where the attack is coming from, it could be coming from inside - you never know. One of the people I spoke with while building this presentation said he likes to assume the worst case at the start, before everything is locked down, but once everything is cut off and ready to be analyzed, to assume best intent - because usually, especially if it's internal, people don't mean to trip whatever alarm they've tripped; they just weren't aware.

    In the more severe cases, especially if there's any sort of compromised personal data, you might need to involve law enforcement - you might have a requirement to regardless. If you do, this is the stage to do it, when you have all the information in one place. You also want to reach out to any external parties that may have been used as attack vectors - but only if you have the appropriate contact. Don't send it to their hello@ address; it'll look like spam (see the previous slide, right?). Make sure you know who you're contacting on their security team.

    If you don't, you might want to let law enforcement proxy that for you. For example, if a public computer at a university or a library was used to attack your systems, that institution needs to know, because now it's a security incident for them too. At this point - and again, only if necessary - is when you send the external comms, the customer communications. We've all gotten those emails: payment information compromised, data breach, you do not have to do anything, the data was anonymized, or whatever they need to communicate. It's really just what data was compromised, what data was not, and whether you need to do anything. And as a quick recap, here are the 14 steps - if you want to screen-cap a slide, this is a good one.

    And all the references with those little link icons - the training, the DevSecOps guide this talk is based on, the security incident response page, all the frameworks I mentioned, and so forth - are on this page. With that, I hope you're having an amazing conference, and I'll be around for questions. Have a great rest of your day.
