Real-Time Inference Endpoint for LLMs with API Gateway
To create a real-time inference endpoint for Large Language Models (LLMs) using Amazon API Gateway, we'll set up several AWS resources. The setup provides an HTTP endpoint that receives requests and forwards them to a backend service, which could be an AWS Lambda function or another type of integration capable of processing the request and returning the inference result.
The main resources we will use are:
- AWS API Gateway: To create an HTTP endpoint for clients to make inference requests.
- AWS Lambda Function: To handle the requests and perform the inference using LLMs.
- IAM Role and Policy: To grant necessary permissions for the Lambda function to interact with other AWS services, if needed.
- API Gateway Integration: To connect the API Gateway to the Lambda function.
Here's a brief overview of how these resources interact:
- Clients send inference requests to the API Gateway endpoint.
- API Gateway routes the requests to the attached Lambda function.
- The Lambda function performs the computation using the LLM and returns the results.
- API Gateway forwards the Lambda response back to the client.
Below is the Pulumi program in Python that sets up this infrastructure:
```python
import pulumi
import pulumi_aws as aws

# Create an IAM role which will be assumed by the Lambda function
lambda_role = aws.iam.Role("lambdaRole",
    assume_role_policy="""{
        "Version": "2012-10-17",
        "Statement": [{
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            }
        }]
    }""")

# Attach the AWS Lambda basic execution policy so the function can write logs to CloudWatch
lambda_policy_attachment = aws.iam.RolePolicyAttachment("lambdaPolicyAttachment",
    role=lambda_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole")

# Create the Lambda function, assuming the inference code is packaged in `llm_handler.zip`
# with a `handler` function defined in `llm_handler.py`
llm_lambda = aws.lambda_.Function("llmLambda",
    role=lambda_role.arn,
    handler="llm_handler.handler",
    runtime="python3.12",
    code=pulumi.FileArchive("./llm_handler.zip"))

# Create the REST API in API Gateway
api_gateway = aws.apigateway.RestApi("apiGateway",
    description="Endpoint for Real-Time LLM Inference")

# Create a resource under the API's root path, e.g., 'predict'
predict_resource = aws.apigateway.Resource("predictResource",
    rest_api=api_gateway.id,
    parent_id=api_gateway.root_resource_id,
    path_part="predict")

# Create a POST method for the 'predict' resource
predict_method = aws.apigateway.Method("predictMethod",
    rest_api=api_gateway.id,
    resource_id=predict_resource.id,
    http_method="POST",
    authorization="NONE")

# Integrate the Lambda function with the 'predict' method
predict_integration = aws.apigateway.Integration("predictIntegration",
    rest_api=api_gateway.id,
    resource_id=predict_resource.id,
    http_method=predict_method.http_method,
    integration_http_method="POST",
    type="AWS_PROXY",  # Proxy integration passes the request straight through to Lambda
    uri=llm_lambda.invoke_arn)

# Allow API Gateway to invoke the Lambda function
lambda_permission = aws.lambda_.Permission("apiGatewayPermission",
    action="lambda:InvokeFunction",
    function=llm_lambda.name,
    principal="apigateway.amazonaws.com",
    source_arn=api_gateway.execution_arn.apply(lambda arn: f"{arn}/*/*"))

# Deploy the API Gateway to the 'prod' stage to make it accessible over the internet
api_deployment = aws.apigateway.Deployment("apiDeployment",
    rest_api=api_gateway.id,
    # The stage name, e.g., 'prod', associated with this deployment
    stage_name="prod",
    opts=pulumi.ResourceOptions(depends_on=[predict_integration, lambda_permission]))

# Output the HTTP endpoint URL for the 'predict' resource
pulumi.export("endpoint_url",
    pulumi.Output.concat(api_deployment.invoke_url, "/", predict_resource.path_part))
```
In this program, we begin by setting up an IAM role for the Lambda function and attaching the basic execution policy to it. This gives the function permission to run and to write logs to Amazon CloudWatch.
Next, we define the Lambda function that will run the inference code. The handler and source of the function are packaged in the `llm_handler.zip` archive, which should contain your LLM inference code. You will need to write this handler function in `llm_handler.py` and zip it together with any dependencies before deploying with Pulumi (a minimal example handler is sketched below).

We then create a new REST API in API Gateway with a POST method on the path `/predict`, which will be our endpoint for LLM predictions. The `AWS_PROXY` integration type ensures that the request and response data are passed directly to the Lambda function, and a Lambda permission grants API Gateway the right to invoke it.

Finally, we create a deployment for the API Gateway to make it accessible over the internet and output the endpoint URL for clients to use.
To use this program:
- Ensure you have AWS credentials configured for Pulumi, either by setting environment variables or using the AWS CLI.
- Replace the `llm_handler.zip` with your actual Lambda function code.
- Install the required Pulumi AWS package by running `pip install pulumi_aws`.
- Run `pulumi up` to deploy the resources to AWS.
Once deployed, you can test the endpoint URL with HTTP clients such as `curl` or Postman by sending a POST request with the necessary input for the LLM inference. The Lambda function will process the request and return the real-time inference result.
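As an illustration, here is a small client snippet using only the Python standard library. The URL shown is a placeholder; use the value printed by `pulumi stack output endpoint_url`, and note that the `prompt` field matches the hypothetical handler sketched earlier.

```python
import json
import urllib.request

# Replace with the value from `pulumi stack output endpoint_url`
endpoint_url = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/predict"

# Build a POST request with a JSON body for the inference endpoint
payload = json.dumps({"prompt": "Summarize the benefits of serverless inference."}).encode("utf-8")
request = urllib.request.Request(
    endpoint_url,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Send the request and print the JSON response returned by the Lambda function
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read().decode("utf-8")))
```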