How AWS architects APIs for hyper scale

Session Information

With AWS Cloud Control API, AWS introduced standardized APIs with the goal of solving support for the latest AWS innovation through a resource-oriented model, typically available on the day of launch. In this session, we will introduce how we took Cloud Control API from concept to launch, the culture and technical changes instituted internally to launch Cloud Control API, and the importance of engaging with partners and customers early in defining the product, user experience, and development.

Presenters
  • Rahul Sharma
    Senior Product Manager, AWS

    Good morning, good afternoon, everyone. I’m Rahul Sharma, and I’m a Senior Product Manager at AWS. I’m really excited to be here with everyone at the Pulumi Cloud Engineering Summit, and I’m excited to present to you all a new AWS service that we recently announced, AWS Cloud Control API. As we outline in the abstract, I’ll introduce AWS Cloud Control API in today’s presentation, talk about how we conceptualized the product, architected it for scale, and launched in collaboration with partners like Pulumi. So without much ado, let’s get straight into it.

    I’ll start off the talk with a background on Cloud Control API’s genesis, then introduce what Cloud Control API is and how it benefits users, then walk you through the journey of building Cloud Control API from concept to rollout at scale, then present a demo on using Cloud Control API, and conclude the presentation with resources to get you started. So let’s dive straight into it. Before I begin the background story of Cloud Control API, I wanna take note of the sequence behind our product development. At AWS, we work backwards from our customers, hear their feedback, identify a solution, and then build our products. Cloud Control API followed the same sequence; it was no different.

    In the case of Cloud Control API, there were in fact two types of customers for whom we were building this. First, builders or developers, who are the end users that build application infrastructure and manage and monitor it, and the second sort of customers were AWS Partner Network, or APN, partners, such as Pulumi, who build on AWS and expose their solutions, be it infrastructure as code in the case of Pulumi, configuration management, or cloud security posture management, among others, to end users. We identified three opportunities to help these customer personas. So what were these opportunities? The first one corresponded to builders who use partner solutions. We heard from these builders, who are AWS customers that use partner solutions specifically to build and manage their cloud infrastructure, that they want to accelerate their pace of innovation and time to market for their applications.

    For example, there are situations where there’s a lag between a new AWS release and its support in APN Partner solutions. For instance, if, say, an Amazon MemoryDB resource is unsupported in a partner solution, then customers using that partner tool will need to wait for that support before they can start using the in-memory database service. The open question for us was, can we help these customers adopt new AWS features and services in the form of cloud resources closer to their launch? And that was the first opportunity that we had. The second corresponded to AWS Partners. As you’re aware, AWS continues to innovate on behalf of its customers to help them unlock new capabilities on Cloud. For example, we today support over 200 fully featured services, and in 2020 alone, we launched over 2700 significant new features.

    AWS partners want to stay in sync with our pace of innovation. And we learned that it can often take a few weeks to integrate with and expose each new AWS capability. Naturally, our question was, can we automate supporting new capabilities on behalf of partners through a one-time integration? Can we have some sort of a unified interface that allows partners to integrate once and benefit from getting support for the latest AWS innovation? So that was the second opportunity. The third opportunity, which is finally what we recognized, is that we have an opportunity to standardize the APIs that interact with all these features and services, or the latest AWS innovation we’re talking about. And you may wonder why.

    As applications become increasingly sophisticated, developers and builders tend to work across several AWS services, and in some cases third-party services as well, by using distinct service-specific APIs. While these APIs are descriptive and intuitive, some developers prefer a consistent set of APIs to manage cloud resources across various services. For example, to define an Amazon Kinesis stream, as you can see in this example, for our data streaming application, I would use a variety of APIs, such as the CreateStream API to define the stream name and shard count, the AddTagsToStream API to add tags to the Kinesis stream, or even the IncreaseStreamRetentionPeriod API to define the retention period. At a later time, for getting details of the stream resource, I would use DescribeStream. Similarly, to create a Lambda function I would use the CreateFunction API, and the GetFunction API to get the details. One of the most common use cases that we have heard, further, is around identifying and deleting legacy resources that were created by customers for the purpose of testing.

    And these resources were created outside the management of infrastructure as code solutions. What we heard from these customers is that they want a programmatic way to identify these resources that were created outside the management of infrastructure as code solutions, cross-reference them against existing resources, which are managed through these solutions, and then delete them in order to reduce the cost of maintaining these resources. This could be possible in a programmatic way through a consistent set of APIs, right? Such consistent APIs will help these users avoid authoring and maintaining custom code to discover and delete each type of resource. So the question that came to us was, can we expose a consistent API method to interact with hundreds of AWS services and beyond? So that’s what led to the birth of Cloud Control API. But to summarize, the three opportunities that lay in front of us were: first, can we support new AWS features and services in the form of resources closer to launch? Second, can we help APN partners automate their integration with the latest AWS capabilities? And finally, can we standardize APIs to interact with hundreds of AWS services and third-party services as well? These opportunities laid the foundation for building Cloud Control API.
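To make the discover, cross-reference, and delete use case concrete, here is a minimal sketch in Python. It assumes a client object shaped like the boto3 Cloud Control client (for example `boto3.client("cloudcontrol")`), with `list_resources` and `delete_resource` methods. The function names and the idea of passing in a set of managed identifiers are illustrative assumptions, not something shown in the talk.

```python
def find_unmanaged(client, type_name, managed_identifiers):
    """Return identifiers of `type_name` resources that are not in the managed set."""
    unmanaged = []
    kwargs = {"TypeName": type_name}
    while True:
        page = client.list_resources(**kwargs)
        for description in page.get("ResourceDescriptions", []):
            if description["Identifier"] not in managed_identifiers:
                unmanaged.append(description["Identifier"])
        token = page.get("NextToken")
        if not token:
            return unmanaged
        # Follow pagination until the service stops returning a token.
        kwargs["NextToken"] = token


def delete_unmanaged(client, type_name, managed_identifiers):
    """Request deletion of every unmanaged resource and return the request tokens."""
    tokens = []
    for identifier in find_unmanaged(client, type_name, managed_identifiers):
        response = client.delete_resource(TypeName=type_name, Identifier=identifier)
        # Deletes are asynchronous: the token lets the caller check status later.
        tokens.append(response["ProgressEvent"]["RequestToken"])
    return tokens
```

The same two functions work for any supported type name, which is the point of the use case: no per-service discovery or deletion code is needed. In practice the managed set would come from the infrastructure as code tool's own state.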

    So what is Cloud Control API? Cloud Control API is essentially a set of common APIs that is designed to make it easy for builders and developers to manage their cloud infrastructure consistently, and leverage the latest AWS capabilities faster, typically on the day of launch. It introduces consistent APIs to manage the end-to-end lifecycle of AWS resources, and third-party resources as well, which ranges from creating a resource, updating it, reading the state of the resource, deleting and listing, right? Now with these create, read, update, delete, and list resource APIs (and in this case, we actually call our read API the get resource API), you can use the same set of APIs to perform the end-to-end lifecycle management for, be it an Amazon Kinesis stream, a Lambda function, a CloudWatch log group, an ECS cluster, or even third-party resources such as Datadog monitors and MongoDB Atlas clusters, among others, right? And you may wonder, while these consistent APIs help address opportunity number three, which we highlighted on consistency, how does it really address the first two opportunity areas that we identified, right, which is accessing the latest AWS innovation faster, and providing a unified interface to integrate once and benefit from all the latest innovation? I’ll get straight to it right away. So first and foremost is, how does Cloud Control API enable faster access? Cloud Control API uses the CloudFormation registry to expose resources built by AWS service teams and third parties to integrating partners and their customers, right? For new AWS services and features, these are available, typically, closer to the day of launch. So any builder or developer using a partner tool that’s integrated with Cloud Control API can now benefit from faster access to the latest AWS innovation. And these builders can also use these APIs directly.

    Next, unified interface. With these consistent API verbs that are exposed, partners can now build a single API code base using these unified API verbs and common input parameters to integrate once and expose the latest AWS features as resources. You can imagine Cloud Control API as an adapter layer on top of all the underlying services that are supported, and all partners now need to do is integrate with Cloud Control API, and get the latest AWS resources as and when they are supported by Cloud Control API, typically closer to the day of launch. Partners now don’t have to integrate with each new AWS service or feature themselves, right. So that’s one of the other benefits of Cloud Control API: a one-time integration is all that’s needed to keep up with AWS’s pace of innovation.
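The adapter idea can be sketched as a single dispatch table. The snippet below is only an illustration, assuming a client with the boto3 Cloud Control method names (`create_resource`, `get_resource`, `update_resource`, `delete_resource`, `list_resources`). The `invoke` helper and the verb names used as dictionary keys are hypothetical.

```python
# Map the five lifecycle verbs onto the Cloud Control client methods.
VERB_TO_METHOD = {
    "create": "create_resource",
    "read": "get_resource",
    "update": "update_resource",
    "delete": "delete_resource",
    "list": "list_resources",
}


def invoke(client, verb, type_name, **params):
    """Call one lifecycle verb for any resource type through one code path."""
    try:
        method_name = VERB_TO_METHOD[verb]
    except KeyError:
        raise ValueError(f"unknown verb {verb!r}") from None
    # Only TypeName changes between services; the call shape stays the same.
    return getattr(client, method_name)(TypeName=type_name, **params)
```

A partner tool written this way would pick up a newly supported resource type by passing a new `type_name` string, without adding a service-specific code path.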

    And then finally, as I briefly touched upon, there is the consistent CRUD plus list interface that is designed to make it easy to manage cloud infrastructure consistently. Whether it be an ECS cluster, Kinesis stream, Lambda function, or hundreds of other AWS resources, or over a dozen third-party resources, you can use the same common APIs to manage them end-to-end. So now that you are aware of what Cloud Control API is, and how it addresses the opportunity areas we identified, you may naturally wonder, how was this built from concept to rollout at scale? In the next few minutes of my presentation, I’ll walk you through the steps we undertook for the end-to-end rollout. So let’s get straight into it. As you’re aware, each product launch begins with ideation. And so did ours.

    The customer and partner feedback was something we heard ourselves as AWS CloudFormation from our direct users. To address the feedback of faster coverage, we at CloudFormation undertook a journey starting in 2018, where we switched our internal coverage model from an older and more tightly coupled implementation to a self-service mechanism. What do I mean by that? It essentially means that we enabled individual service teams at AWS to build coverage in a decoupled way to ensure resource support is available faster. We further externalized this at re:Invent in 2019, with the launch of the CloudFormation registry and the CLI, with the registry being a place to discover and consume resource types and the CLI being an open source client that lets developers and internal teams build these extensions. This was the ideation, right, like the ideation started with the coverage opportunity that needed to be solved, right? Like how can we help our customers get faster resource support, faster access to the latest AWS innovation? Now while we laid the foundation through the registry, we realized that CloudFormation can go one step further in terms of helping the rest of the partners in the AWS Partner Network solve coverage in a scalable manner, right.

    So we realized we can solve this resource coverage problem for customers who are not just using CloudFormation, but also using partner tools such as infrastructure as code, cloud configuration management, and cloud security posture management, among others. But the question was, how do we do that, right? CloudFormation can be thought of as essentially three pillars, right, or three layers, rather. The first is the resource provider layer for AWS. Layer number two is the deployment and orchestration engine that offers various managed experiences on top. The third is the syntax for specifying the desired state of a resource via CloudFormation templates. Our concept here was to externalize the resource provider layer for hundreds of AWS resources across several services. And in doing so, we also wanted to address the opportunity of standardizing control plane API interactions for all these hundreds of AWS resources.

    So that’s where the concept began, right, let’s just externalize the resource provider layer and get the partners and builders to start building and accessing the latest AWS innovation faster. While we had the concept ready, we wanted to have early validation. To test our concept early enough, we gathered feedback from Pulumi, among other AWS partners, internal AWS teams, and our customers. Early feedback on the concept laid the roots for designing the product for scale. These discussions informed the API design and the concept.

    So once we had all these discussions and meetings for the validation phase, we moved on to designing this product for scale. How did we do this? We did this by beginning with modeling, like forecasting demand, right, like estimating adoption and customer usage at scale. Once we forecasted demand, it was critical for us to architect systems to support such demand forecasts, for example, designing the components that can support traffic from all the regions we intend to support based on our demand, right. And then finally, while we were designing these systems, it was critical for us to keep certain tenets in mind. Safety and security, along with scale and standardization, were some of the core tenets we had in mind while designing Cloud Control API and the system, right. Now, once the design was ready, what followed next was the product development piece.

    And to test the product development in its early days, we released the private beta for Cloud Control API, and gathered deeper feedback from Pulumi and other partners and even customers on areas such as coverage, the API interface, and whether there was a need for a console or not, right. Our acceptance criteria was to address the critical feedback ahead of rolling this out at scale and making the product generally available. We rolled out the product after prioritizing the feedback that we heard from all our external stakeholders, including Pulumi. And there were three areas, as I mentioned, for feedback: resource coverage, API design, as well as the console, whether this was needed or not. As far as coverage is concerned, we actually prioritized moving resources from the older and tightly coupled mechanism, which I talked to you about, to the self-service, registry-based model on the CloudFormation registry.

    And we ported some of the resources over to have them supported on Cloud Control API, such as AWS Lambda function and API Gateway stage, among others, right. Similarly, we incorporated feedback on the API interface as well. We designed the update resource API by incorporating feedback on implementing standard RFC 6902 JSON Patch operations, which I’m gonna show in the demo as well, the way the patch operations work. And of course, testing this out for scale, right. And then finally, as part of our launch announcement and throughout the journey, we continue to invest towards increasing AWS resource type support, right.
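For readers unfamiliar with RFC 6902, the following is a small local sketch of what `add`, `replace`, and `remove` operations do to a resource's properties. It is a deliberately limited illustration, not the service's implementation: it handles only object members (no array indices, no root path, and no `move`, `copy`, or `test`), and `apply_patch` is a hypothetical helper name.

```python
import copy


def _unescape(token):
    # JSON Pointer (RFC 6901) escapes: "~1" means "/", "~0" means "~".
    return token.replace("~1", "/").replace("~0", "~")


def apply_patch(document, patch):
    """Apply a subset of RFC 6902 (add, replace, remove) to nested dicts."""
    result = copy.deepcopy(document)  # leave the caller's document untouched
    for operation in patch:
        tokens = [_unescape(t) for t in operation["path"].split("/")[1:]]
        parent = result
        for token in tokens[:-1]:
            parent = parent[token]
        key = tokens[-1]
        if operation["op"] == "add":
            parent[key] = operation["value"]
        elif operation["op"] == "replace":
            # Unlike "add", "replace" requires the target member to exist.
            if key not in parent:
                raise KeyError(f"cannot replace missing member {key!r}")
            parent[key] = operation["value"]
        elif operation["op"] == "remove":
            del parent[key]
        else:
            raise ValueError(f"unsupported op {operation['op']!r}")
    return result
```

The difference between `add` and `replace` matters when writing patch documents: `replace` changes a property that already has a value, while `add` can introduce one that was never set.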

    And we launched this product on September 30. And we continue to invest towards increasing support for AWS resources on Cloud Control API, and we’ll have many more resource types in the coming months, including those from Amazon EC2, and Amazon S3 services, among others. Pulumi collaborated with AWS on integrating with Cloud Control API via the Pulumi AWS Native Provider, and you can now access the latest AWS features and services on the same day as it is supported on Cloud Control API. This includes hundreds of AWS resources, and even third party resources. And by building on AWS Cloud Control API, the AWS native provider built by Pulumi, and currently in preview, as most of you would know, exposes the unified resource model for AWS built by service teams.

    By leveraging the AWS Cloud Control API, the AWS native provider for Pulumi builds on the work done by service teams at AWS to define the resource model for their services. So I’m sure by now you must be excited to see what these APIs really are. What does consistency mean, right? So I’m gonna switch over my screen to the terminal to demo Cloud Control API and its consistency. I will do that by showcasing the create, read, update, delete, and list operations across three supported resources: Kinesis stream, CloudWatch log group, and ECS cluster. And I’m also going to showcase the case of identifying, or rather discovering, resources that are managed outside, or rather created outside, of Cloud Control API, and how you can actually manage them using these APIs altogether.

    Before I start my demo, I am going to first showcase the desired state files associated with the three resources I talked to you about: Amazon CloudWatch log group, Amazon Kinesis stream, as well as Amazon ECS cluster, right? These desired state files consist of the resource configuration, or all the properties that are associated with my resource, in a JSON file. So in this case, I’m gonna first showcase to you the kinds of properties that I’m defining for each of these resources, right, and then go to the terminal to showcase the consistency of create, read, update, delete, and list calls across these three resources. So let’s get straight into it. So as you see on the screen, you’re seeing a desired state file for a CloudWatch log group. As you’re aware, a CloudWatch log group is a group of log streams that share the same retention, monitoring, and access control settings.

    And a log stream is a sequence of log events that share the same source, right? In this specific configuration file, I have specified the log group name as my demo logs and also specified the retention days, which is the number of days, in this case 90, for which I want the log events to be retained in this specific log group. Similarly, let’s go to the desired state file associated with the Amazon Kinesis stream. And as you’re aware, an Amazon Kinesis stream captures and transports data records that are continuously emitted from different data sources or producers, and it has a shard count as well as a retention period, which is the length of time data records are accessible after they’re added to the stream. In this case, if you look at my configuration, I have specified the name of the Kinesis stream, the retention period hours, which in this case is 168 hours or seven days, and the shard count, which is three in our case. And then finally, I have also specified the configuration associated with an Amazon ECS cluster. I have specified the name of the cluster, the cluster settings, which specify container insights, which collects metrics at the cluster, task, and service levels on both Linux and Windows Server instances, and the tags associated with the ECS cluster.
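The three desired state files described here might look roughly like the following when built in Python and serialized to JSON. The property names follow the CloudFormation resource schemas for these types. The concrete names (`my-demo-logs`, `my-demo-stream`, `my-demo-cluster`) and the tag values are assumptions for illustration, since the transcript does not show the files themselves.

```python
import json

# Desired state per resource type; values mirror the numbers given in the talk.
DESIRED_STATES = {
    "AWS::Logs::LogGroup": {
        "LogGroupName": "my-demo-logs",  # assumed name
        "RetentionInDays": 90,
    },
    "AWS::Kinesis::Stream": {
        "Name": "my-demo-stream",  # assumed name
        "RetentionPeriodHours": 168,  # seven days
        "ShardCount": 3,
    },
    "AWS::ECS::Cluster": {
        "ClusterName": "my-demo-cluster",  # assumed name
        "ClusterSettings": [{"Name": "containerInsights", "Value": "enabled"}],
        "Tags": [{"Key": "environment", "Value": "demo"}],  # assumed tag
    },
}


def desired_state_json(type_name):
    """Serialize the desired state for one resource type as a JSON document."""
    return json.dumps(DESIRED_STATES[type_name], indent=2)
```

Each JSON document can be saved to a file or passed inline as the desired state when creating the resource.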

    So these are all the desired state files. And now I’m going to configure these resources using Cloud Control API, read the state of those resources, list them, update them by either replacing a certain property or adding new properties, and delete them. So let’s get straight into the demo part of it. So I’m gonna switch over to my terminal. This is just an active directory of mine where I’ve listed all the files that exist; all of these are desired state JSON files.

    I’m going to clear this. And let me get started with creating the CloudWatch log group. So to create the CloudWatch log group, I am going to specify Cloud Control’s create resource command. In this case, I’m gonna specify the resource type name, CloudWatch log group, and pass in the desired state file. In this case, because the state file exists in the active directory, I can use the path as mentioned here.

    But I can also specify the same using a JSON blob, which consists of the desired state. So once I hit Enter, you would see that a progress event is returned, which showcases that this operation is in progress and also returns a request token. I can quickly check the progress of this specific operation. So I’m gonna use an auxiliary API of Cloud Control called get resource request status, copy the request token here, and see whether this operation was successful or not. And as you see, this operation was successful.
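The create call followed by a status check can be sketched as a small polling helper. This assumes a client with the boto3 Cloud Control methods `create_resource` and `get_resource_request_status`, both of which return a `ProgressEvent` containing `OperationStatus` and `RequestToken`. The helper name, polling interval, and poll limit are illustrative choices.

```python
import json
import time

# Statuses after which the request will not change any further.
TERMINAL_STATUSES = {"SUCCESS", "FAILED", "CANCEL_COMPLETE"}


def create_and_wait(client, type_name, desired_state, poll_seconds=2.0, max_polls=60):
    """Create a resource, then poll the request token until a terminal status."""
    response = client.create_resource(
        TypeName=type_name,
        DesiredState=json.dumps(desired_state),  # the API takes a JSON string
    )
    event = response["ProgressEvent"]
    polls = 0
    while event["OperationStatus"] not in TERMINAL_STATUSES:
        if polls >= max_polls:
            raise TimeoutError(f"request {event['RequestToken']} still in progress")
        time.sleep(poll_seconds)
        status = client.get_resource_request_status(RequestToken=event["RequestToken"])
        event = status["ProgressEvent"]
        polls += 1
    return event
```

The caller should still inspect the returned `OperationStatus`, since a terminal status can be a failure as well as a success.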

    That means that the CloudWatch log group was successfully created. How do I check for that? I can list all the resources, or identify them, in my account and region. So I’m gonna use the list resources call to list all the CloudWatch log groups in my account in this region. I just specify the resource type name, and I’m gonna get an output consisting of the identifier, in this case, my demo logs as I had created, and the properties, which are the retention days that I specified as 90, the log group name, as well as the Amazon Resource Name or the ARN that gets output as part of this call. Now, if I want to update any property for this resource type, for example, if I want to change the 90 retention days to 180, all I now need to do is use the update resource API call, specify the type name and the identifier, and pass in the new, or rather the patch document, right, like what is the operation change.

    Once I specify the identifier, all I need to do is pass in the patch. Because it’s an asynchronous call, this will return a progress event. And you see that I’ve updated the retention days from 90 to 180. And I can show you what the patch operation looks like in my text editor. So in this case, you see my CloudWatch log group’s patch document, which consists of the operation as replace.

    And the path basically points to the property that needs to be updated, in this case, retention days, and the value to be 180, right. So going back to the terminal, let’s see what the status of this request is. Assuming this is successfully updated, I can just read the state of the resource. So I can use the read call, or the get resource call for Cloud Control, specify the type name and the identifier, which I fetched from the list call, and read the state of the resource. And you see the property has been updated, right, from retention days of 90, it’s moved to 180. You see all the other details which are configured and the Amazon Resource Name associated with this resource.
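The update and read steps can be sketched as follows, again assuming a client with the boto3 Cloud Control methods `update_resource` and `get_resource`. The patch document is passed as a JSON string, and the properties come back as a JSON string inside `ResourceDescription`. The helper names are hypothetical.

```python
import json


def update_property(client, type_name, identifier, path, value, op="replace"):
    """Send a one-operation RFC 6902 patch and return the progress event."""
    patch = [{"op": op, "path": path, "value": value}]
    response = client.update_resource(
        TypeName=type_name,
        Identifier=identifier,
        PatchDocument=json.dumps(patch),
    )
    return response["ProgressEvent"]  # asynchronous: check status separately


def read_properties(client, type_name, identifier):
    """Read the current state of a resource as a Python dict."""
    response = client.get_resource(TypeName=type_name, Identifier=identifier)
    return json.loads(response["ResourceDescription"]["Properties"])
```

For the log group in the demo, the call would be along the lines of `update_property(client, "AWS::Logs::LogGroup", identifier, "/RetentionInDays", 180)`, followed by `read_properties` once the request reports success.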

    You can also delete this by using the delete resource API call, specifying the CloudWatch log group type name and the identifier. Hit Enter, and you see again, a progress event is returned. So as you’re seeing here, the create, update, and delete operations are asynchronous calls. And read and list are synchronous; you get those responses immediately. Here, I can either use the get resource request status, which is the auxiliary API, or I can do a simple list call to see whether this resource exists anymore or not. And we should expect a null value, which is rightfully so because the resource is deleted.

    So there are no more resources available, right? Now, this is the flow from create, update, read, delete, and list, right, just for CloudWatch log group. To showcase the consistency, I’m gonna use the same set of APIs and the same set of input parameters to create and instantiate an Amazon Kinesis stream resource. So let’s get straight into it. So to do that, I am going to use Cloud Control API’s create resource call and pass in the resource type name and the desired state file. So this API call, as you saw, is very similar to the CloudWatch log group: create resource, same API verb, same type name; sorry, the type name is different, it’s Kinesis stream.

    But the desired state file is what we saw in the configuration, with the same input parameters. And it’s an asynchronous call; you see that the operation is in progress. To see whether this was successfully created or not, you can easily check by calling the get resource request status, or the auxiliary API. And let’s do that as we speak. We have the request token here returned from this call.

    Let’s pass that in. And let’s see. And you see, well, yes, the operation was successful. So bingo, our Kinesis stream is now ready and created. And now you can list this Kinesis stream in my account by calling the list resources API. Just specify the Kinesis stream type name, and you will get the result; you will get the identifier.

    In this case, the identifier is my demo stream. Now one thing that you will notice is that the properties are different. In this case, you only get the identifier, whereas for CloudWatch log group, we got the properties as well. And that’s simply because in the case of CloudWatch log group, there isn’t a list API call. So the permissions for the list go to the describe streams call, which returns all the response parameters, right? So you see this here.

    And now if I want to update, suppose I want to update my Kinesis stream and add tags to it, right, I’m gonna use the same update resource call, the same input parameters, the type name, and pass in the patch document. I will show you quickly what the patch document is in our case, but in the interim, I am going to invoke the call and let the progress event come in. And I’m gonna showcase to you what the patch document is. In this case, it’s not replacing any existing property, but it’s actually adding the property associated with tags for the Kinesis stream. I have specified the path as tags, the operation as add, and the value as the key-value pairs associated with the tag. I have specified the key as environment and the value as development. And let’s see the operation status on where we stand. We see the operation status is in progress and that it is adding these tags.
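A patch document that adds tags, as opposed to replacing an existing property, could be built and sanity-checked like this. The operation names come from RFC 6902. The helper names are hypothetical, and requiring every path to start with `/` is a simplification (the RFC also allows an empty path for the document root).

```python
import json

# The six operation names defined by RFC 6902.
RFC6902_OPS = {"add", "remove", "replace", "move", "copy", "test"}


def tag_patch(tags):
    """Build a patch that adds a Tags property from a {key: value} mapping."""
    value = [{"Key": key, "Value": val} for key, val in tags.items()]
    return [{"op": "add", "path": "/Tags", "value": value}]


def validate_patch(patch):
    """Check the basic shape of each operation and return the JSON document."""
    for index, operation in enumerate(patch):
        name = operation.get("op")
        if name not in RFC6902_OPS:
            raise ValueError(f"operation {index}: unknown op {name!r}")
        if not str(operation.get("path", "")).startswith("/"):
            raise ValueError(f"operation {index}: path must start with '/'")
        if name in {"add", "replace", "test"} and "value" not in operation:
            raise ValueError(f"operation {index}: {name!r} needs a value")
    return json.dumps(patch)
```

A quick check of this kind catches a common slip, writing the operation as `addition` rather than `add`, before the request is sent.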

    Let’s use the get resource, or the read call, to read the state of this Kinesis stream resource. You have the identifier passed, and now you see that the properties returned are the Amazon Resource Name, the retention period hours that are specified as 168, the tags, as well as the shard count, right. And now you can also delete this Kinesis stream using the delete resource API call. Same consistent input parameters: pass in the type name and identifier, and you get an asynchronous call. To check, you can use either the auxiliary API, which is the get resource request status, or you can also use the list call. For the purpose of this demo, I’m gonna use the list resources call to see whether this resource exists or not, and rightfully so, this resource is deleted.

    So now you don’t see any identifier associated. So that’s for Kinesis, right? As you noticed, the create, read, update, delete and list APIs remain the same, the input parameters remain the same. And that showcases the consistency of Cloud Control API. I’m gonna just showcase one other resource type. In this case, I’m gonna use the Amazon ECS cluster.

    And I’m gonna use the create resource call, specify the type name and the desired state file, and you see that it returns the familiar progress event that we talked about. And I can see whether this create request was successful or not by passing the request token to the get resource request status call. And you see that the operation is successful. That means this ECS cluster has been spun up.

    And to list this ECS cluster I, again, call the list resources API, specify the type name, and get the result. In this case, I get the identifier as well as all the properties associated with it, right. Like you see, these properties are null because I did not specify them in my desired state file; I just specified the cluster name, the tags, as well as the container insights cluster settings, and you also see the Amazon Resource Name, right. I can also read the state of the ECS cluster resource by calling the read resource or the get resource API. You specify the type name, same process, and you get the details out. And also you can delete your resource by passing in the Cloud Control delete resource API and the identifier associated with this ECS cluster. You get the familiar progress event, and you can quickly check either using the get resource request status, or list resources.

    This time, I’m gonna actually use the get resource request status and show you that experience as well. So I’m gonna take the request token and pass it here. And we see that the operation was successful, which means this delete is successful. So when I call list resources, I should get a null value, which is rightfully so, as you see the resource description is null. So with this, what I’m trying to showcase here is that these create, read, update, delete, and list calls remain consistent for all the supported AWS and third-party resources, which are hundreds of AWS resources that are supported today on Cloud Control API, as well as over a dozen third-party resources.

    All of them, their entire lifecycle, end-to-end, can be managed using Cloud Control API’s consistent set of APIs that we have exposed. So with this, I’m gonna conclude the demo. And I am going to showcase to you how you can get started. I’m really excited for everyone here who’s listening in to get started building on Cloud Control API, as well as start using the Pulumi AWS Native Provider, which is built on top of Cloud Control API. So with this, I’m gonna switch over to my presentation and share some of the resources to get you started building.

    So let’s do that. To learn more about the product, how it works, and the FAQ, I would point you to the product page of Cloud Control API, aws.amazon.com/cloudcontrolapi. At the same time for you to get started using the Pulumi AWS Native Provider, which is built on top of the Cloud Control API, to leverage the latest AWS innovation, I would point you to the great blog written on how to get yourself started using the Pulumi AWS Native Provider by following this link.

    We can’t wait to get you started on the Cloud Control API journey. Thank you all so much for your time and giving us the opportunity to present this product, talk to you about how we took this from concept to roll out at scale and also the Pulumi AWS Native Provider which is built on top of Cloud Control API. Thank you, everyone.
