Performance Monitoring Extensions on AI Workload VMs

Question

Pulumi · Accepted Answer

To set up performance monitoring extensions on AI workload VMs in Azure, you work mainly with Azure Virtual Machines and Azure Monitor. Pulumi suits this kind of setup because it lets you declare the VM, its extensions, and the monitoring resources together as infrastructure as code.

Here's how you would set this up in Python with Pulumi:

You instantiate a VM, or a set of VMs, to run your AI workloads. This is done with the VirtualMachine resource.
You then apply monitoring extensions to these VMs with the VirtualMachineExtension resource.
To capture and analyze the metrics from these extensions, you can store them in a Log Analytics workspace (Operational Insights), represented in Pulumi's Azure Native provider by the operationalinsights.Workspace resource, and visualize them with Azure's monitoring tools.

Let's walk through a Pulumi program in Python that sets this up:

import pulumi
import pulumi_azure_native as azure_native

# Step 1: Define your AI workload virtual machine(s)
# This can be an existing VM or a new one defined using Pulumi. We'll define a new one for illustration.
# Note: the following code assumes you have already set up the necessary networking infrastructure,
# like virtual network and subnet.

# Replace these placeholder values with your own desired configuration
vm_name = "ai-workload-vm"
vm_resource_group_name = "my-resource-group"
vm_location = "eastus"
vm_size = "Standard_DS1_v2"  # Example VM size
admin_username = "adminuser"
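# Read the password from encrypted Pulumi config instead of hard-coding it in source.
# The key name "adminPassword" is an arbitrary choice; set it with:
#   pulumi config set --secret adminPassword <value>
admin_password = pulumi.Config().require_secret("adminPassword")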

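# Create an Azure Virtual Machine for the AI workload
ai_vm = azure_native.compute.VirtualMachine(vm_name,
    resource_group_name=vm_resource_group_name,
    location=vm_location,
    # In azure-native the VM size is set through the hardware profile, not a top-level argument.
    hardware_profile=azure_native.compute.HardwareProfileArgs(
        vm_size=vm_size,
    ),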
    os_profile=azure_native.compute.OSProfileArgs(
        admin_username=admin_username,
        admin_password=admin_password,
        computer_name=vm_name,
    ),
    network_profile=azure_native.compute.NetworkProfileArgs(
        # Be sure to adjust the network configurations based on your actual resources
        network_interfaces=[
            azure_native.compute.NetworkInterfaceReferenceArgs(
                id="/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.Network/networkInterfaces/<network-interface-name>",
                primary=True,
            ),
        ],
    ),
    # Define the image to use based on your requirements. This is an example for an Ubuntu Server image.
    storage_profile=azure_native.compute.StorageProfileArgs(
        image_reference=azure_native.compute.ImageReferenceArgs(
            publisher="Canonical",
            offer="UbuntuServer",
            sku="18.04-LTS",
            version="latest",
        ),
        # Define your disk size, type, and other configurations here based on your requirements
        # ...
    ),
)

# Step 2: Apply performance monitoring extensions on the AI workload VM
monitor_extension = azure_native.compute.VirtualMachineExtension(f"{vm_name}-diag-ext",
    resource_group_name=vm_resource_group_name,
    vm_name=ai_vm.name,
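    # You can choose different monitoring extensions. This example uses IaaSDiagnostics, which is the
    # Windows diagnostics extension. For a Linux image such as the Ubuntu one above, the counterpart
    # is the LinuxDiagnostic type from the same publisher, which takes a JSON-based configuration.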
    publisher="Microsoft.Azure.Diagnostics",
    type="IaaSDiagnostics",
    type_handler_version="1.5",
    settings={
        "xmlCfg": "<PublicConfig><...></PublicConfig>",
        "storageAccount": "<storage-account-name>",
        # Modify the above configuration according to your diagnostic settings.
        # Note that xmlCfg is expected to be Base64-encoded XML rather than raw XML.
        # More info on IaaSDiagnostics settings: https://aka.ms/AzureDiagnosticsVmExtension
        # ...
    },
    # You may supply protected settings such as storage account keys in this field.
    # protected_settings={
    #     "storageAccountKey": "<storage-account-key>",
    #     "storageAccountEndPoint": "https://core.windows.net/",
    # },
)

# Step 3: Set up Operational Insights for performance metrics logging and analysis
# You would do this if you need to create a new workspace. If you already have one, you can skip this part.
log_analytics_workspace = azure_native.operationalinsights.Workspace(f"{vm_name}-logs",
    resource_group_name=vm_resource_group_name,
    location=vm_location,
    # Add your specific workspace configuration here
)

# Now you can use Azure Monitor and associated tools such as Azure Dashboards,
# Azure Log Analytics, or even third-party SIEM tools to set up alerts and
# dashboards and to perform further analysis on your AI VMs' performance metrics.

# Step 4: Export useful data such as VM IDs, Log Analytics Workspace ID etc.
pulumi.export("ai_vm_id", ai_vm.id)
pulumi.export("monitor_extension_id", monitor_extension.id)
pulumi.export("log_analytics_workspace_id", log_analytics_workspace.id)

In the code above:

We create a virtual machine intended for AI workloads using the azure_native.compute.VirtualMachine resource class.
We then apply the diagnostics extension to the VM using the azure_native.compute.VirtualMachineExtension resource. The extension collects logs and metrics about the operation of the VM and writes them to the storage account named in its settings.
We optionally create a Log Analytics workspace, using the azure_native.operationalinsights.Workspace resource, to ingest and query monitoring data. If you already have a workspace, you can skip this step. Note that the program creates the workspace but does not itself route the extension's data into it.
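
Because the appropriate diagnostics extension depends on the guest OS, one way to keep the publisher, type, and handler version together is a small lookup. The sketch below is an illustration only and not part of the program above: the names ExtensionSpec and diagnostics_extension_for are hypothetical, and the handler versions ("1.5" as used above for Windows, "4.0" for Linux) are assumptions that you should check against the versions currently published for your region.

```python
from typing import NamedTuple


class ExtensionSpec(NamedTuple):
    """The three identifying fields of a VM extension."""
    publisher: str
    type: str
    type_handler_version: str


# Hypothetical lookup table; the version strings are assumptions to verify.
_DIAGNOSTICS_EXTENSIONS = {
    "windows": ExtensionSpec("Microsoft.Azure.Diagnostics", "IaaSDiagnostics", "1.5"),
    "linux": ExtensionSpec("Microsoft.Azure.Diagnostics", "LinuxDiagnostic", "4.0"),
}


def diagnostics_extension_for(os_family: str) -> ExtensionSpec:
    """Return the diagnostics extension spec for 'windows' or 'linux'."""
    key = os_family.strip().lower()
    try:
        return _DIAGNOSTICS_EXTENSIONS[key]
    except KeyError:
        raise ValueError(f"unsupported OS family: {os_family!r}") from None


spec = diagnostics_extension_for("Linux")
print(spec.publisher, spec.type, spec.type_handler_version)
```

The fields of the returned spec can then be passed to VirtualMachineExtension as publisher, type, and type_handler_version, so the extension always matches the image you chose.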

Be sure to replace the placeholders (e.g., <...>) with real values from your Azure subscription, such as the subscription ID, network interface name, and storage account information.
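
To make the placeholder replacement less error-prone, you could build the network interface ID from its parts and check that no angle-bracket placeholder is left before deploying. This is a minimal sketch using only the standard library; the helper names and the example values (the all-zero subscription ID and the NIC name "ai-workload-nic") are hypothetical.

```python
import re
from typing import List

# Matches angle-bracket placeholders such as <subscription-id>.
_PLACEHOLDER = re.compile(r"<[^<>]+>")


def network_interface_id(subscription_id: str, resource_group: str, nic_name: str) -> str:
    """Build the resource ID of a network interface from its parts."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/networkInterfaces/{nic_name}"
    )


def find_placeholders(value: str) -> List[str]:
    """Return any <...> placeholders still present in a value."""
    return _PLACEHOLDER.findall(value)


# Hypothetical example values; substitute your own.
nic_id = network_interface_id(
    "00000000-0000-0000-0000-000000000000",
    "my-resource-group",
    "ai-workload-nic",
)
leftover = find_placeholders(nic_id)
if leftover:
    raise ValueError(f"unreplaced placeholders: {leftover}")
print(nic_id)
```

Apply the placeholder check only to IDs and names. It would also flag XML tags, so it is not suitable for the raw diagnostics XML.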

To use this code, you need the Pulumi CLI installed and configured with access to your Azure subscription. With Pulumi set up, put this program in the __main__.py file of a Pulumi Python project (the file Pulumi runs by default) and run pulumi up to provision the resources.

If you're using the IaaSDiagnostics type, consult the Azure documentation for the exact XML configuration schema of the diagnostics extension. The Pulumi documentation for the Azure Native provider covers the resource classes used here in more detail.
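
Since the xmlCfg setting is generally supplied as Base64-encoded XML, a small helper can validate and encode the configuration before it goes into the extension's settings. The sketch below uses only the standard library. The XML is an illustrative fragment with a single CPU counter, and its element and attribute names are assumptions to verify against the Azure Diagnostics schema documentation; the helper name encode_xml_cfg is hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

# Illustrative configuration only; verify names and values against the Azure schema.
WAD_CONFIG_XML = """\
<WadCfg>
  <DiagnosticMonitorConfiguration overallQuotaInMB="4096">
    <PerformanceCounters scheduledTransferPeriod="PT1M">
      <PerformanceCounterConfiguration
          counterSpecifier="\\Processor(_Total)\\% Processor Time"
          sampleRate="PT15S" unit="Percent" />
    </PerformanceCounters>
  </DiagnosticMonitorConfiguration>
</WadCfg>
"""


def encode_xml_cfg(xml_text: str) -> str:
    """Check that the text is well-formed XML, then Base64-encode it."""
    ET.fromstring(xml_text)  # raises xml.etree.ElementTree.ParseError if malformed
    return base64.b64encode(xml_text.encode("utf-8")).decode("ascii")


xml_cfg = encode_xml_cfg(WAD_CONFIG_XML)

# The storage account name stays a placeholder, as in the program above.
settings = {
    "xmlCfg": xml_cfg,
    "storageAccount": "<storage-account-name>",
}
```

The resulting settings dictionary has the same shape as the one passed to VirtualMachineExtension above. The well-formedness check catches only syntax errors, not schema violations, so the extension may still reject a configuration that parses cleanly.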