1. Serverless AI Inference Endpoints via Azure Native

    To create serverless AI inference endpoints in Azure, you'll use Azure Machine Learning. The service provides resources for training models, deploying them as web services, and creating endpoints that efficiently run inference jobs (making predictions with a trained model).

    Here are several key components of such a serverless setup:

    1. Workspace: The foundational component of the Azure Machine Learning service; it provides a space where you can collaborate on, experiment with, manage, and track machine learning work.

    2. Online Endpoints: The HTTPS endpoints that expose your deployed ML models as services: they receive data, run predictions, and return results.

    3. Deployments: Each online endpoint hosts one or more deployments, which represent versions or variants of the service running the model (see the scoring-script sketch after this list).

    4. Models: The machine learning models that you have trained. These are usually registered in your Azure Machine Learning workspace.
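
    Each deployment serves requests through a scoring script that follows Azure ML's init()/run() contract: init() runs once when the container starts, and run() handles each scoring request. The sketch below is a minimal illustration; the model file name and the joblib loading logic are assumptions you would adapt to your own model.

    ```python
    # score.py -- minimal Azure ML scoring script (illustrative; adapt to your model)
    import json
    import os

    import joblib

    model = None

    def init():
        # Called once when the deployment's container starts.
        # AZUREML_MODEL_DIR points at the files of the registered model.
        global model
        model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model.pkl")  # assumed file name
        model = joblib.load(model_path)

    def run(raw_data):
        # Called for each scoring request with the raw request body.
        data = json.loads(raw_data)
        predictions = model.predict(data["inputs"])
        return {"predictions": predictions.tolist()}
    ```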

    In a Pulumi Python program, you define the desired state of your infrastructure using classes and methods provided by the Pulumi SDK for Azure. Pulumi will take care of provisioning all resources with their dependencies and setting up a serverless AI inference endpoint to run predictions as a service.

    Here is a basic Pulumi Python program that creates a serverless AI inference endpoint using the Azure Machine Learning service:

    ```python
    import pulumi
    import pulumi_azure_native as azure_native

    # Resource group that holds all resources for this solution
    resource_group = azure_native.resources.ResourceGroup('rg')

    # Azure Machine Learning workspace where models, code, and environments are managed
    aml_workspace = azure_native.machinelearningservices.Workspace(
        "amlWorkspace",
        resource_group_name=resource_group.name,
        location=resource_group.location,
        sku=azure_native.machinelearningservices.SkuArgs(
            name="Basic",  # the Enterprise edition has been retired; Basic is the standard SKU
        ),
        identity=azure_native.machinelearningservices.IdentityArgs(
            type="SystemAssigned",
        ))

    # Managed online endpoint that serves predictions over HTTPS
    serverless_endpoint = azure_native.machinelearningservices.OnlineEndpoint(
        "serverlessEndpoint",
        resource_group_name=resource_group.name,
        location=resource_group.location,
        workspace_name=aml_workspace.name,
        online_endpoint_properties=azure_native.machinelearningservices.OnlineEndpointArgs(
            auth_mode="Key",                  # callers authenticate with an endpoint key
            public_network_access="Enabled",  # endpoint is reachable over the internet
        ))

    # Online deployment that runs the model behind the endpoint.
    # Replace <model-id>, <code-id>, <scoring_file_name.py>, and <environment-id>
    # with the IDs of assets registered in your Azure ML workspace.
    online_deployment = azure_native.machinelearningservices.OnlineDeployment(
        "onlineDeployment",
        endpoint_name=serverless_endpoint.name,
        resource_group_name=resource_group.name,
        workspace_name=aml_workspace.name,
        location=resource_group.location,
        online_deployment_properties=azure_native.machinelearningservices.ManagedOnlineDeploymentArgs(
            endpoint_compute_type="Managed",  # fully managed (serverless) compute
            model="<model-id>",               # ARM ID of the registered model version
            code_configuration=azure_native.machinelearningservices.CodeConfigurationArgs(
                code_id="<code-id>",                      # ARM ID of the registered code asset
                scoring_script="<scoring_file_name.py>",  # entry script within the code asset
            ),
            environment_id="<environment-id>",  # ARM ID of the registered environment
        ))

    # Export key resource information
    pulumi.export('endpoint_scoring_uri',
                  serverless_endpoint.online_endpoint_properties.scoring_uri)
    pulumi.export('deployment_name', online_deployment.name)
    ```

    In the program above, you create a resource group (the container that holds related resources for an Azure solution) and an Azure Machine Learning workspace, the component in which all ML assets are managed: compute resources, models, deployments, and so on.

    The OnlineEndpoint resource defines your serverless endpoint. Here, public_network_access is set to "Enabled", which makes the endpoint reachable over the internet, and auth_mode is set to "Key", so callers must present one of the endpoint's keys with each request.
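
    Because the endpoint uses key authentication, you will typically also want to surface one of its keys. A minimal sketch, assuming your azure-native SDK version exposes the listOnlineEndpointKeys invoke (verify the function name against your installed SDK):

    ```python
    # Fetch the endpoint's auth keys (assumed invoke; check your SDK version)
    endpoint_keys = azure_native.machinelearningservices.list_online_endpoint_keys_output(
        endpoint_name=serverless_endpoint.name,
        resource_group_name=resource_group.name,
        workspace_name=aml_workspace.name,
    )

    # Wrap the key in Output.secret so it is encrypted in the Pulumi state
    pulumi.export('endpoint_primary_key', pulumi.Output.secret(endpoint_keys.primary_key))
    ```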

    The OnlineDeployment resource involves a more complex setup: the registered model ID (<model-id>), code asset ID (<code-id>), scoring script file name (<scoring_file_name.py>), and the environment ID (<environment-id>). You'll need to replace these placeholders with actual values from your Azure ML workspace, as sketched below.
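
    For reference, Azure ML v2 assets are addressed by ARM-style, versioned resource IDs. The shapes below are illustrative only; every bracketed segment is a placeholder for a value from your own subscription and workspace:

    ```python
    # Illustrative shapes of the asset IDs (all bracketed segments are placeholders)
    base = ("/subscriptions/<subscription-id>/resourceGroups/<rg-name>"
            "/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>")

    model_id = f"{base}/models/<model-name>/versions/<model-version>"
    code_id = f"{base}/codes/<code-name>/versions/<code-version>"
    environment_id = f"{base}/environments/<environment-name>/versions/<environment-version>"
    ```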

    Finally, you export the endpoint's scoring URI and the deployment name so that you can call the service once provisioning is complete.
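
    Once the stack is up, any HTTP client can call the endpoint. A minimal sketch with the requests library, assuming key authentication and a JSON payload shaped the way your scoring script expects (the URI, key, and payload here are placeholders):

    ```python
    # invoke_endpoint.py -- call the deployed endpoint (all values below are placeholders)
    import requests

    scoring_uri = "<endpoint_scoring_uri from the stack outputs>"
    endpoint_key = "<an endpoint key from your Azure ML workspace>"

    response = requests.post(
        scoring_uri,
        json={"inputs": [[1.0, 2.0, 3.0]]},  # payload shape depends on your scoring script
        headers={"Authorization": f"Bearer {endpoint_key}"},  # key auth uses a Bearer header
    )
    response.raise_for_status()
    print(response.json())
    ```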

    Please ensure that you replace the placeholder values with your actual model, code, and environment identifiers. These assets are typically created and registered in the Azure Machine Learning workspace, using the Azure portal or the Azure ML SDK/CLI, before you deploy through Pulumi.
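
    With the placeholders filled in, run pulumi preview to inspect the planned changes and pulumi up to provision the stack; the exported scoring URI appears in the stack outputs once the deployment finishes.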