1. Decoupled ML Model Inference Requests with AWS SQS


    To decouple machine learning (ML) model inference requests using Amazon Simple Queue Service (SQS), the typical architecture places an SQS queue between the client applications that generate the requests and the server-side components that process them. This allows for better scaling, reliability, and manageability of the ML inference workload.

    In this setup:

    • Client applications - These are the producers of inference requests. They create and send messages to an SQS queue.
    • SQS Queue - Acts as a buffer and messaging system, decoupling the requests from the actual processing. This queue holds the messages until they are handled by the consumer.
    • Consumers - These can be AWS Lambda functions, EC2 instances, or other workers that pull messages from the queue and perform the ML inference, for example by invoking an Amazon SageMaker endpoint.
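    To make the flow concrete, here is a minimal local sketch of the producer/consumer pattern, using Python's standard queue.Queue as a stand-in for SQS. The produce_request/consume_request helpers and the mock inference are illustrative only, not part of the AWS setup:

```python
import json
import queue

# Stand-in for the SQS queue; in production this would be the deployed queue.
inference_queue: "queue.Queue[str]" = queue.Queue()

def produce_request(features):
    """Client side: serialize an inference request and enqueue it."""
    inference_queue.put(json.dumps({"features": features}))

def consume_request():
    """Consumer side: dequeue one message and run (mock) inference."""
    message = json.loads(inference_queue.get())
    # A real consumer would call the ML model here; summing the features
    # is a stand-in for the actual inference.
    return sum(message["features"])

produce_request([1.0, 2.0, 3.0])
result = consume_request()
print(result)  # 6.0
```

    The key point is that the producer and consumer never call each other directly; the queue is the only contract between them, which is exactly what SQS provides at scale.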

    Here is a Pulumi program in Python that sets up an AWS SQS queue that you could use for decoupling ML model inference requests:

    import json

    import pulumi
    import pulumi_aws as aws

    # Create a dead-letter queue (DLQ) to capture messages that repeatedly fail processing.
    dead_letter_queue = aws.sqs.Queue('deadLetterQueue')

    # Create an SQS queue for ML inference requests.
    ml_inference_queue = aws.sqs.Queue('mlInferenceQueue',
        delay_seconds=0,
        max_message_size=262144,          # Maximum message size is 256 KB
        message_retention_seconds=86400,  # Messages are retained for 1 day
        receive_wait_time_seconds=10,     # Long polling for message retrieval
        redrive_policy=dead_letter_queue.arn.apply(
            lambda arn: json.dumps({
                "deadLetterTargetArn": arn,
                "maxReceiveCount": 4,
            })
        ),
        tags={
            "Environment": "Production",
            "Purpose": "DecoupledMLInference",
        },
    )

    # The ARN and URL of the queue are often needed for integration with consumers,
    # so we export them here.
    pulumi.export('ml_inference_queue_arn', ml_inference_queue.arn)
    pulumi.export('ml_inference_queue_url', ml_inference_queue.url)


    • We're creating an SQS Queue named mlInferenceQueue using the aws.sqs.Queue class (AWS SQS Queue).

      • delay_seconds is set to 0, meaning messages are available for consumption immediately after they are sent to the queue.

      • max_message_size is set to the maximum size allowed by SQS (256 KB). This should be set as per the expected size of the inference request.

      • message_retention_seconds is set to 86400; each message will be retained in the queue for 1 day before it is automatically deleted if not processed.

      • receive_wait_time_seconds uses long polling (set to 10 seconds) to minimize the number of empty responses, thereby reducing cost and improving the efficiency of message retrieval.

      • redrive_policy attaches a dead-letter queue (DLQ) to handle message processing failures. The program creates the DLQ (deadLetterQueue) alongside the main queue; if a message fails to be processed more than maxReceiveCount times (4 here), SQS moves it to the DLQ.

      • tags are metadata used to identify resources and their purpose or environment.

    • We export the arn and url of the queue because these will be needed when setting up the consumer service that processes the messages.
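    For reference, the redrive policy SQS expects is just a JSON document with two fields, which is why the Pulumi program builds it with json.dumps. A standalone sketch (the ARN below is a made-up placeholder; at deploy time Pulumi supplies the real DLQ ARN):

```python
import json

# Placeholder ARN for illustration only; Pulumi resolves the real one at deploy time.
dlq_arn = "arn:aws:sqs:us-east-1:123456789012:deadLetterQueue"

redrive_policy = json.dumps({
    "deadLetterTargetArn": dlq_arn,
    "maxReceiveCount": 4,  # after 4 failed receives, SQS moves the message to the DLQ
})

print(redrive_policy)
```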

    With this setup, you can now have client applications send inference requests to the SQS queue, while consumer services pull these messages and execute ML inference based on the request data. Adding a DLQ ensures that messages that fail to be processed do not get lost but are captured for further analysis and handling.
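    As a sketch of the consumer side, the hypothetical poll_once helper below shows the receive/process/delete loop a worker would run against the queue. In production, sqs_client would be a boto3 SQS client and queue_url the exported queue URL; both names, and the infer callback, are assumptions for illustration:

```python
import json

def poll_once(sqs_client, queue_url, infer):
    """Receive up to 10 messages, run inference on each, and delete on success."""
    resp = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=10,  # long polling, matching the queue configuration
    )
    results = []
    for msg in resp.get("Messages", []):
        payload = json.loads(msg["Body"])
        results.append(infer(payload))
        # Deleting the message acknowledges successful processing; otherwise it
        # becomes visible again after the visibility timeout and, after enough
        # failed receives, is moved to the DLQ by the redrive policy.
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=msg["ReceiptHandle"],
        )
    return results
```

    Because the worker only deletes a message after inference succeeds, a crash mid-processing simply returns the message to the queue for another attempt, which is the failure-handling behavior the DLQ configuration relies on.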