Decoupled ML Model Inference Requests with AWS SQS
To decouple machine learning (ML) model inference requests with AWS Simple Queue Service (SQS), the typical architecture involves placing an SQS queue between the client applications that generate the requests and the server-side components that process them. This allows for better scaling, reliability, and manageability of the ML inference workload.
In this setup:
- Client applications - These are the producers of inference requests. They create and send messages to an SQS queue.
- SQS Queue - Acts as a buffer and messaging system, decoupling the requests from the actual processing. This queue holds the messages until they are handled by the consumer.
- Consumers - These can be AWS Lambda functions, EC2 instances, or Amazon SageMaker endpoints that pick up messages from the queue and perform the ML inference.
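To make the producer side concrete, here is a minimal sketch of a client submitting an inference request with boto3. The queue URL, region, and `features` payload shape are hypothetical stand-ins; in practice you would read the URL from the Pulumi stack output created below.

```python
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # assumed region

# Hypothetical queue URL; substitute your stack's exported value.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/mlInferenceQueue"

def submit_inference_request(features: dict) -> str:
    """Send one inference request to the queue and return the SQS message ID."""
    response = sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"features": features}),  # assumed payload shape
    )
    return response["MessageId"]

if __name__ == "__main__":
    print(submit_inference_request({"sepal_length": 5.1, "sepal_width": 3.5}))
```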
Here is a Pulumi program in Python that sets up an AWS SQS queue, along with a dead-letter queue, that you could use for decoupling ML model inference requests:
```python
import json

import pulumi
import pulumi_aws as aws

# Create a dead-letter queue (DLQ) to capture messages that repeatedly fail processing
dead_letter_queue = aws.sqs.Queue("deadLetterQueue")

# Create an SQS queue for ML inference requests
ml_inference_queue = aws.sqs.Queue(
    "mlInferenceQueue",
    delay_seconds=0,
    max_message_size=262144,          # Maximum message size allowed by SQS (256 KB)
    message_retention_seconds=86400,  # Messages are retained for 1 day
    receive_wait_time_seconds=10,     # Long polling for message retrieval
    # Route messages that fail processing 4 times to the DLQ
    redrive_policy=dead_letter_queue.arn.apply(
        lambda arn: json.dumps({
            "deadLetterTargetArn": arn,
            "maxReceiveCount": 4,
        })
    ),
    tags={
        "Environment": "Production",
        "Purpose": "DecoupledMLInference",
    },
)

# The ARN and URL of the queue are often needed for integration with consumers,
# so we export them here. For SQS queues, the resource id is the queue URL.
pulumi.export("ml_inference_queue_arn", ml_inference_queue.arn)
pulumi.export("ml_inference_queue_url", ml_inference_queue.id)
```
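After deploying with `pulumi up`, you can read the exported values back with `pulumi stack output ml_inference_queue_url` (and likewise for the ARN) and hand them to your producer and consumer services.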
Explanation:
- We create an SQS queue named `mlInferenceQueue` using the `aws.sqs.Queue` class (AWS SQS Queue).
- `delay_seconds` is set to 0, meaning messages are available for consumption immediately after they are sent to the queue.
- `max_message_size` is set to the maximum size allowed by SQS (256 KB); tune this to the expected size of your inference requests.
- `message_retention_seconds` is set to 86400, so each message is retained in the queue for 1 day before being automatically deleted if not processed.
- `receive_wait_time_seconds` enables long polling (set to 10 seconds) to minimize the number of empty responses, reducing cost and improving the efficiency of message retrieval.
- `redrive_policy` integrates a dead-letter queue (DLQ) to handle message processing failures. The program creates the DLQ (`deadLetterQueue`) first and references its ARN in the policy; if a message fails to process more than 4 times (`maxReceiveCount`), it is moved to the DLQ.
- `tags` are metadata used to identify resources and their purpose or environment.
- We export the `arn` and `url` of the queue because these will be needed when setting up the consumer service that processes the messages.
With this setup, you can now have client applications send inference requests to the SQS queue, while consumer services pull these messages and execute ML inference based on the request data. Adding a DLQ ensures that messages that fail to be processed do not get lost but are captured for further analysis and handling.
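To show what a consumer might look like, here is a minimal polling-loop sketch using boto3. `run_model` is a hypothetical placeholder for your actual inference call (for example, invoking a SageMaker endpoint), and the queue URL is again assumed.

```python
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # assumed region
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/mlInferenceQueue"  # hypothetical

def run_model(features: dict) -> dict:
    # Placeholder for the real inference call (e.g., a SageMaker endpoint).
    return {"prediction": 0.0}

while True:
    # Long-poll for up to 10 messages at a time, matching the queue's
    # receive_wait_time_seconds of 10.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=10,
    )
    for msg in resp.get("Messages", []):
        result = run_model(json.loads(msg["Body"])["features"])
        print(result)
        # Delete only after successful processing; otherwise the message
        # becomes visible again and, after repeated failures, moves to the DLQ.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Deleting a message only after successful processing is what makes the redrive policy effective: failed messages reappear on the queue and, after four failed receives, land in the DLQ for inspection.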