Decoupled ML Model Inference Requests with AWS SQS
To decouple machine learning (ML) model inference requests with AWS Simple Queue Service (SQS), the typical architecture involves placing an SQS queue between the client applications that generate the requests and the server-side components that process them. This allows for better scaling, reliability, and manageability of the ML inference workload.
In this setup:
- Client applications - These are the producers of inference requests. They create and send messages to an SQS queue.
- SQS Queue - Acts as a buffer and messaging system, decoupling the requests from the actual processing. This queue holds the messages until they are handled by the consumer.
- Consumers - These can be AWS Lambda functions, EC2 instances, or Amazon SageMaker endpoints that pick up messages from the queue and perform the ML inference.
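To make the producer side concrete, here is a minimal sketch of a client submitting an inference request with boto3. The queue URL, region, and `features` payload shape are hypothetical stand-ins; in practice you would read the URL from the Pulumi stack output created below.

```python
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # assumed region

# Hypothetical queue URL; substitute your stack's exported value.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/mlInferenceQueue"

def submit_inference_request(features: dict) -> str:
    """Send one inference request to the queue and return the SQS message ID."""
    response = sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"features": features}),  # assumed payload shape
    )
    return response["MessageId"]

if __name__ == "__main__":
    print(submit_inference_request({"sepal_length": 5.1, "sepal_width": 3.5}))
```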
Here is a Pulumi program in Python that sets up an AWS SQS queue, along with a dead-letter queue, that you could use for decoupling ML model inference requests:
```python
import json

import pulumi
import pulumi_aws as aws

# Create a dead-letter queue (DLQ) to capture messages that repeatedly fail processing
dead_letter_queue = aws.sqs.Queue("deadLetterQueue")

# Create an SQS queue for ML inference requests
ml_inference_queue = aws.sqs.Queue(
    "mlInferenceQueue",
    delay_seconds=0,
    max_message_size=262144,          # Maximum message size allowed by SQS (256 KB)
    message_retention_seconds=86400,  # Messages are retained for 1 day
    receive_wait_time_seconds=10,     # Long polling for message retrieval
    # Route messages that fail processing 4 times to the DLQ
    redrive_policy=dead_letter_queue.arn.apply(
        lambda arn: json.dumps({
            "deadLetterTargetArn": arn,
            "maxReceiveCount": 4,
        })
    ),
    tags={
        "Environment": "Production",
        "Purpose": "DecoupledMLInference",
    },
)

# The ARN and URL of the queue are often needed for integration with consumers,
# so we export them here. For SQS queues, the resource id is the queue URL.
pulumi.export("ml_inference_queue_arn", ml_inference_queue.arn)
pulumi.export("ml_inference_queue_url", ml_inference_queue.id)
```
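After deploying with `pulumi up`, you can read the exported values back with `pulumi stack output ml_inference_queue_url` (and likewise for the ARN) and hand them to your producer and consumer services.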
Explanation:
- We create an SQS queue named `mlInferenceQueue` using the `aws.sqs.Queue` class (AWS SQS Queue).
- `delay_seconds` is set to 0, meaning messages are available for consumption immediately after they are sent to the queue.
- `max_message_size` is set to the maximum size allowed by SQS (256 KB); tune this to the expected size of your inference requests.
- `message_retention_seconds` is set to 86400, so each message is retained in the queue for 1 day before being automatically deleted if not processed.
- `receive_wait_time_seconds` enables long polling (set to 10 seconds) to minimize the number of empty responses, reducing cost and improving the efficiency of message retrieval.
- `redrive_policy` integrates a dead-letter queue (DLQ) to handle message processing failures. The program creates the DLQ (`deadLetterQueue`) first and references its ARN in the policy; if a message fails to process more than 4 times (`maxReceiveCount`), it is moved to the DLQ.
- `tags` are metadata used to identify resources and their purpose or environment.
- We export the `arn` and `url` of the queue because these will be needed when setting up the consumer service that processes the messages.
With this setup, you can now have client applications send inference requests to the SQS queue, while consumer services pull these messages and execute ML inference based on the request data. Adding a DLQ ensures that messages that fail to be processed do not get lost but are captured for further analysis and handling.
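To show what a consumer might look like, here is a minimal polling-loop sketch using boto3. `run_model` is a hypothetical placeholder for your actual inference call (for example, invoking a SageMaker endpoint), and the queue URL is again assumed.

```python
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # assumed region
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/mlInferenceQueue"  # hypothetical

def run_model(features: dict) -> dict:
    # Placeholder for the real inference call (e.g., a SageMaker endpoint).
    return {"prediction": 0.0}

while True:
    # Long-poll for up to 10 messages at a time, matching the queue's
    # receive_wait_time_seconds of 10.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=10,
    )
    for msg in resp.get("Messages", []):
        result = run_model(json.loads(msg["Body"])["features"])
        print(result)
        # Delete only after successful processing; otherwise the message
        # becomes visible again and, after repeated failures, moves to the DLQ.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Deleting a message only after successful processing is what makes the redrive policy effective: failed messages reappear on the queue and, after four failed receives, land in the DLQ for inspection.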