1. Scaling ML Inference with API Gateway

    To scale Machine Learning (ML) inference, it is common to deploy a scalable, serverless endpoint that can handle a large number of inference requests. An API gateway serves as the entry point for these requests, directing them to the backend service that runs the inference. Pulumi lets you define this infrastructure as code, making it straightforward to deploy, manage, and scale.

    In this guide, I will walk you through setting up an AWS API Gateway that directs incoming inference requests to an AWS Lambda function hosting an ML model. AWS Lambda scales the number of concurrent inference executions automatically based on demand.

    Here's a breakdown of what we will do:

    1. Set up an AWS Lambda function that will run the ML inference code.
    2. Create an AWS API Gateway REST API to manage and route the requests to the Lambda function.
    3. Define the necessary permissions so that API Gateway can invoke the Lambda function.

    Detailed Explanation of Resources Used

    • AWS Lambda Function: AWS Lambda hosts our ML inference code. Each incoming request triggers an inference execution, and Lambda scales automatically to handle a large number of concurrent executions. A sketch of the handler code appears after this list.
    • AWS API Gateway (REST API): API Gateway serves as the front door that receives HTTP requests and routes them to backend services such as AWS Lambda. It is well suited to exposing a REST API for ML inference requests.
    • IAM Role and Policy: AWS Identity and Access Management (IAM) roles and policies grant the necessary permissions to the AWS services we use. In particular, API Gateway needs permission to invoke the Lambda function, and the Lambda function needs a role it can assume in order to write logs.
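
    To make the handler concrete, here is a minimal sketch of what the inference code inside lambda.zip might look like. It assumes the deployment package bundles a scikit-learn model saved as model.joblib and that the entry point is index.handler; adapt the loading and prediction logic to your own framework.

    # index.py -- hypothetical inference handler referenced as "index.handler" in the Pulumi program
    import json
    import joblib

    # Load the model once per container so warm invocations reuse it
    model = joblib.load("model.joblib")  # assumes model.joblib is bundled in lambda.zip

    def handler(event, context):
        # With an AWS_PROXY integration, the request body arrives as a JSON string
        payload = json.loads(event.get("body") or "{}")
        features = payload["features"]
        prediction = model.predict([features]).tolist()
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"prediction": prediction}),
        }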

    Let's start by writing the Pulumi program in Python:

    import pulumi
    import pulumi_aws as aws

    # Step 1: Define the AWS Lambda function
    # The function requires a ZIP file containing your ML model and inference code.
    # The handler is the function within your code that will process the incoming API requests.

    # Create an IAM role that AWS Lambda will assume
    lambda_role = aws.iam.Role("lambdaRole",
        assume_role_policy="""{
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {
                    "Service": "lambda.amazonaws.com"
                }
            }]
        }""")

    # Attach the AWS managed AWSLambdaBasicExecutionRole policy so that the Lambda function can write logs to CloudWatch
    policy_attachment = aws.iam.RolePolicyAttachment("lambdaRolePolicyAttachment",
        role=lambda_role.name,
        policy_arn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole")

    # Define the AWS Lambda Function
    ml_lambda = aws.lambda_.Function("mlInferenceFunction",
        code=pulumi.FileArchive("./lambda.zip"),  # Replace with the path to the ZIP file for your code
        handler="index.handler",                  # Replace with your handler function
        role=lambda_role.arn,
        runtime="python3.12")                     # Use a currently supported runtime for your ML inference code

    # Step 2: Create the API Gateway REST API
    rest_api = aws.apigateway.RestApi("mlInferenceApi",
        description="API for ML Inference")

    resource = aws.apigateway.Resource("mlInferenceResource",
        rest_api=rest_api.id,
        parent_id=rest_api.root_resource_id,
        path_part="inference")

    # Define the API Gateway method to connect the HTTP endpoint with the Lambda function
    api_method = aws.apigateway.Method("mlInferenceMethod",
        rest_api=rest_api.id,
        resource_id=resource.id,
        http_method="POST",
        authorization="NONE")

    # Define the integration between the API Gateway method and the Lambda function
    integration = aws.apigateway.Integration("mlInferenceIntegration",
        rest_api=rest_api.id,
        resource_id=resource.id,
        http_method=api_method.http_method,
        integration_http_method="POST",
        type="AWS_PROXY",
        uri=ml_lambda.invoke_arn)

    # Step 3: Define permissions for API Gateway to invoke Lambda
    permission = aws.lambda_.Permission("apiGatewayLambdaPermission",
        action="lambda:InvokeFunction",
        function=ml_lambda.name,
        principal="apigateway.amazonaws.com",
        source_arn=pulumi.Output.concat(rest_api.execution_arn, "/*/*"))

    # Step 4: Deploy the API Gateway
    # The deployment must wait for the method and integration to exist, hence the explicit dependency
    deployment = aws.apigateway.Deployment("mlInferenceDeployment",
        rest_api=rest_api.id,
        stage_name="prod",
        opts=pulumi.ResourceOptions(depends_on=[integration]))

    # Export the HTTPS endpoint of the ML Inference API
    # invoke_url already includes the scheme, host, and stage, so only the resource path is appended
    pulumi.export("api_endpoint", pulumi.Output.concat(deployment.invoke_url, "/inference"))

    In this Pulumi program, we start by creating an IAM role for our Lambda function, defining the function itself, and then setting up the API Gateway REST API with a POST method to receive inference requests. Once the method, resource, and integration are set up, we grant the necessary permissions for API Gateway to invoke the Lambda function. Finally, we deploy our REST API and output the endpoint URL.

    Make sure to replace the placeholder "./lambda.zip" with the path to your actual Lambda deployment package, and index.handler with your specific handler identifier (the file name and function name of your entry point).
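
    If your handler and model live in a local folder, one simple way to produce the deployment package is Python's standard shutil module; the folder name below is only an example, and heavier dependencies such as scikit-learn are usually better handled with a Lambda layer or a container image.

    # build_package.py -- a minimal packaging sketch (the ./lambda_src folder name is hypothetical)
    import shutil

    # Zips the contents of ./lambda_src into ./lambda.zip next to this script
    shutil.make_archive("lambda", "zip", root_dir="./lambda_src")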

    This setup gives you a scalable ML inference API that adapts to varying loads and charges only for the compute time used, making it cost-effective. With Pulumi's infrastructure-as-code approach, managing and replicating this infrastructure is reliable and straightforward.
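
    Once the stack is deployed, you can sanity-check the endpoint with a short client script. The URL below is a placeholder for the exported api_endpoint value (retrievable with pulumi stack output api_endpoint), and the payload shape assumes the example handler sketched earlier; adjust both to match your deployment.

    # invoke_api.py -- a quick smoke test against the deployed endpoint
    import json
    import urllib.request

    # Placeholder: substitute the value printed by `pulumi stack output api_endpoint`
    api_endpoint = "https://<rest-api-id>.execute-api.<region>.amazonaws.com/prod/inference"

    payload = json.dumps({"features": [5.1, 3.5, 1.4, 0.2]}).encode("utf-8")
    request = urllib.request.Request(
        api_endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        print(json.loads(response.read()))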