1. Large Language Model Inference Request Queueing on Azure Service Bus


    To queue large language model inference requests, you typically need a messaging system that can absorb a potentially large volume of incoming requests so they can be processed asynchronously. Azure Service Bus is a robust messaging service well suited to this workload: it lets you create queues and topics whose messages are consumed by backend systems.

    In the given solution, we will create:

    1. A Service Bus Namespace, which is a container for all messaging components. Namespaces provide a unique scoping container, within which you can create queues, topics, and subscriptions.
    2. A Service Bus Queue, where the inference requests will be sent. This queue can be polled by a backend service that processes the requests and can scale out to handle load as needed.
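    The backend consumer in step 2 can be sketched with the azure-servicebus Python SDK. This is a minimal sketch, not a production worker: the queue name and connection string come from the Pulumi stack outputs, `decode_request` and the `handle` callback are illustrative names, and the payload is assumed to be JSON.

```python
import json

def decode_request(body: str) -> dict:
    # Assumes producers enqueue JSON-encoded inference requests.
    return json.loads(body)

def process_queue(connection_string: str, queue_name: str, handle) -> None:
    # Requires the azure-servicebus package: pip install azure-servicebus
    from azure.servicebus import ServiceBusClient

    with ServiceBusClient.from_connection_string(connection_string) as client:
        with client.get_queue_receiver(queue_name, max_wait_time=30) as receiver:
            for message in receiver:
                try:
                    handle(decode_request(str(message)))
                    receiver.complete_message(message)  # remove from the queue
                except Exception:
                    receiver.abandon_message(message)   # make it available for redelivery
```

    Completing a message only after `handle` succeeds, and abandoning it on failure, is what lets you scale such workers out safely: an instance that crashes mid-request simply lets the message be redelivered.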

    Here's a Pulumi program that sets up an Azure Service Bus Namespace and Queue in Python:

```python
import pulumi
import pulumi_azure_native as azure_native

# Define the resource group to contain the Service Bus resources
resource_group = azure_native.resources.ResourceGroup("myResourceGroup")

# Create an Azure Service Bus Namespace
service_bus_namespace = azure_native.servicebus.Namespace(
    "myServiceBusNamespace",
    resource_group_name=resource_group.name,
    location=resource_group.location,
    sku=azure_native.servicebus.SBSkuArgs(
        name=azure_native.servicebus.SkuName.STANDARD,
    ),
)

# Create a Service Bus Queue within the namespace
service_bus_queue = azure_native.servicebus.Queue(
    "myServiceBusQueue",
    resource_group_name=resource_group.name,
    namespace_name=service_bus_namespace.name,
)

# Look up the primary connection string of the namespace to use with applications
primary_connection_string = pulumi.Output.all(
    resource_group.name, service_bus_namespace.name
).apply(
    lambda args: azure_native.servicebus.list_namespace_keys(
        resource_group_name=args[0],
        namespace_name=args[1],
        authorization_rule_name="RootManageSharedAccessKey",
    )
).apply(lambda result: result.primary_connection_string)

# Export the queue name to use with applications
pulumi.export("serviceBusQueueName", service_bus_queue.name)

# Export the primary connection string to connect to the Service Bus Namespace
pulumi.export("primaryConnectionString", primary_connection_string)
```

    In this program:

    • We first create a Resource Group using azure_native.resources.ResourceGroup, which will contain all of our Azure resources.
    • We then set up a Service Bus Namespace using azure_native.servicebus.Namespace, specifying the STANDARD SKU, which offers features like duplicate detection, sessions, dead-lettering, etc.
    • Within this namespace, we provision a Queue with azure_native.servicebus.Queue. You can customize its parameters (like the message time-to-live, maximum size, etc.) based on the requirements of the model's inference workload.
    • Finally, we output the namespace's primary connection string and the queue's name, which are necessary to configure client applications to send inference requests to the queue.

    When this Pulumi program is deployed, it will create the necessary Azure infrastructure to queue large language model inference requests, ready for a backend process to consume and handle the workload.