1. Low-Latency LLM Inference with Edge Locations in US-East-1


    To achieve low-latency large language model (LLM) inference with edge locations in the AWS US-East-1 region, you would typically set up an inference endpoint that can serve the model's predictions efficiently. AWS provides services such as AWS Lambda and Amazon SageMaker to host the inference logic, and Amazon CloudFront's edge locations can bring the entry point closer to end users. In practice, this usually involves an API Gateway endpoint that triggers a Lambda function, which in turn calls the SageMaker endpoint hosting the deployed model.

    Let's implement an infrastructure that supports this, using Pulumi with AWS services.

    Below is a Pulumi program written in Python that demonstrates how to deploy a SageMaker endpoint for model inference, a Lambda function for preprocessing and postprocessing, and an API Gateway to expose the Lambda function. The request flow would be: Client → API Gateway → Lambda (preprocessing) → SageMaker (inference) → Lambda (postprocessing) → Client.

    This program does the following:

    1. Creates an Amazon SageMaker model, which points to the pre-trained model data.
    2. Deploys the model to a SageMaker endpoint configuration with an instance type chosen for inference.
    3. Creates the SageMaker endpoint where inference requests can be sent.
    4. Sets up an AWS Lambda function that preprocesses incoming requests, invokes the endpoint, and postprocesses the responses. This Lambda function could additionally be replicated across edge locations with Lambda@Edge to decrease latency.
    5. Initializes an API Gateway to trigger the Lambda function.

    Pulumi Program for Low-Latency Inference

    import pulumi
    import pulumi_aws as aws

    # Create a SageMaker model by pointing to the pre-trained model data.
    sagemaker_model = aws.sagemaker.Model("llmModel",
        execution_role_arn="arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001",  # Replace with your SageMaker execution role ARN
        primary_container={
            "image": "174872318107.dkr.ecr.us-east-1.amazonaws.com/kmeans:1",  # Replace with the image of the LLM model
            "model_data_url": "s3://my-bucket/pretrained-llm-model-data",  # Path to your pretrained model data
        })

    # Deploy the model to a SageMaker endpoint configuration
    endpoint_config = aws.sagemaker.EndpointConfiguration("llmEndpointConfig",
        production_variants=[{
            "instance_type": "ml.m5.large",
            "model_name": sagemaker_model.name,
            "variant_name": "VariantOne",
            "initial_instance_count": 1,
        }])

    # Create the SageMaker endpoint where inference requests can be sent
    sagemaker_endpoint = aws.sagemaker.Endpoint("llmEndpoint",
        endpoint_config_name=endpoint_config.name)

    # Read the AWS Lambda handler code that will preprocess requests and invoke the SageMaker endpoint
    with open('inference_lambda_handler.py', 'r') as lambda_handler_file:
        lambda_handler_code = lambda_handler_file.read()

    # Create a Lambda function to process and forward requests to the SageMaker endpoint
    lambda_function = aws.lambda_.Function("inferenceLambdaFunction",
        code=pulumi.AssetArchive({
            # Lambda expects an archive, so wrap the handler source in one
            "inference_lambda_handler.py": pulumi.StringAsset(lambda_handler_code),
        }),
        role="arn:aws:iam::123456789012:role/lambda_execution_role",  # Replace with your Lambda execution role ARN
        handler="inference_lambda_handler.handler",
        runtime="python3.12",
        environment={
            "variables": {
                "SAGEMAKER_ENDPOINT_NAME": sagemaker_endpoint.name,
            },
        },
        timeout=10)  # Provide adequate timeout for processing and SageMaker communication

    # Use API Gateway to expose the Lambda function as a REST API endpoint
    api_gateway = aws.apigatewayv2.Api("inferenceApi",
        protocol_type="HTTP",
        route_key="POST /invoke-llm",
        target=lambda_function.invoke_arn)

    # Grant API Gateway permission to invoke the Lambda function
    aws.lambda_.Permission("apiGatewayInvokePermission",
        action="lambda:InvokeFunction",
        function=lambda_function.name,
        principal="apigateway.amazonaws.com",
        source_arn=api_gateway.execution_arn.apply(lambda arn: f"{arn}/*/*"))

    # Export the API Gateway endpoint URL
    pulumi.export('api_url', api_gateway.api_endpoint)

    In this program, replace placeholders such as the execution role ARNs, the container image, and the model data URL with values from your own AWS account and deployment.

    Before running this Pulumi program, you would first need to create an AWS Lambda execution role that grants your Lambda function the necessary permissions to invoke SageMaker endpoints and log to CloudWatch Logs. Likewise, you should set up the SageMaker execution role to have access to the specified S3 bucket and necessary policies to run SageMaker jobs.
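    If you would rather create these roles in the same Pulumi program instead of referencing pre-existing ARNs, a minimal sketch could look like the following. The resource names, the broad managed policies, and the wildcard SageMaker permission are assumptions for illustration; scope them down for production, and pass sagemaker_role.arn and lambda_role.arn to the resources above in place of the hard-coded ARNs.

    import json

    import pulumi_aws as aws

    # SageMaker execution role, trusted by the SageMaker service
    sagemaker_role = aws.iam.Role("sagemakerExecutionRole",
        assume_role_policy=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }],
        }))

    # Broad managed policy for brevity; restrict to your S3 bucket and ECR repo in production
    aws.iam.RolePolicyAttachment("sagemakerAccess",
        role=sagemaker_role.name,
        policy_arn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess")

    # Lambda execution role, trusted by the Lambda service
    lambda_role = aws.iam.Role("lambdaExecutionRole",
        assume_role_policy=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }],
        }))

    # Allow the function to write logs to CloudWatch Logs
    aws.iam.RolePolicyAttachment("lambdaBasicLogging",
        role=lambda_role.name,
        policy_arn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole")

    # Allow the function to invoke SageMaker endpoints
    aws.iam.RolePolicy("lambdaInvokeSageMaker",
        role=lambda_role.id,
        policy=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": "sagemaker:InvokeEndpoint",
                "Resource": "*",
            }],
        }))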

    Place the logic that handles the incoming request and invokes the SageMaker endpoint within an inference_lambda_handler.py file located alongside your Pulumi program.
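    As a starting point, a minimal inference_lambda_handler.py could look like the sketch below. The JSON request and response shapes, and the application/json content type, are assumptions that depend on the inference contract of your model container.

    # inference_lambda_handler.py - a hypothetical minimal handler sketch
    import json
    import os

    import boto3

    # Reuse the client across invocations to keep warm-start latency low
    sagemaker_runtime = boto3.client("sagemaker-runtime")


    def handler(event, context):
        # API Gateway (HTTP API) delivers the request body as a JSON string
        payload = json.loads(event.get("body") or "{}")

        # Preprocess: forward the prompt to the SageMaker endpoint as JSON
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName=os.environ["SAGEMAKER_ENDPOINT_NAME"],
            ContentType="application/json",
            Body=json.dumps(payload),
        )

        # Postprocess: return the model output to the caller
        result = response["Body"].read().decode("utf-8")
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": result,
        }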

    To execute this Pulumi program:

    1. Install Pulumi and configure your AWS credentials.
    2. Create a new directory, save the above program as __main__.py, and place your Lambda handler code in inference_lambda_handler.py.
    3. Run pulumi up in the directory to create the resources.
    4. Once the Pulumi process is complete, test the API Gateway endpoint by sending inference requests, for example with the small client shown below.
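    As an illustration, a quick smoke test could look like this. The URL is a placeholder for the exported api_url value, and the {"prompt": ...} payload is an assumption about what your handler accepts.

    import json
    import urllib.request

    # Replace with the value of `pulumi stack output api_url`
    api_url = "https://<api-id>.execute-api.us-east-1.amazonaws.com"

    # POST a JSON payload to the route created by the program above
    request = urllib.request.Request(
        f"{api_url}/invoke-llm",
        data=json.dumps({"prompt": "Hello, world"}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        print(response.read().decode("utf-8"))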