1. Isolation of Azure AI Training Environments with NSGs


    Isolating an Azure AI training environment means placing its compute resources, such as Virtual Machine Scale Sets (VMSS), in a dedicated virtual network (VNet). To enforce the isolation, apply Network Security Groups (NSGs), an Azure feature for configuring inbound and outbound security rules that regulate traffic to and from resources in a VNet.

    In the context of Pulumi and infrastructure as code (IaC), you can define these virtual networks, VMs or VMSS, and NSGs using Python to create a secure and isolated environment for AI training.

    The following Pulumi Python program demonstrates how to:

    1. Set up a virtual network.
    2. Create a subnet within that VNet.
    3. Define a network security group that only allows specific traffic.
    4. Associate that NSG with the subnet.
    5. Create a VMSS in that subnet.

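    This structure isolates the AI training environment: traffic can enter or leave the subnet only when an NSG rule allows it. That covers both the rules you define and the default rules Azure adds to every NSG. The defaults allow traffic within the VNet and outbound traffic to the internet, so add explicit deny rules if you need stricter isolation. The result is a smaller attack surface and less unrelated network traffic reaching the training nodes.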
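
    To make the rule semantics concrete, here is a minimal sketch of how inbound NSG rules are evaluated: rules are checked in ascending priority order and the first match decides. It is a simplified model rather than Azure's implementation. The Rule class and evaluate_inbound function are hypothetical names, the 203.0.113.0/24 source range is an assumed example, and service tags and port ranges are not modeled.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int   # lower number is evaluated first
    access: str     # "Allow" or "Deny"
    protocol: str   # "Tcp", "Udp", or "*"
    source: str     # CIDR block or "*"
    dest_port: str  # single port or "*"


def _matches(rule, protocol, source_ip, dest_port):
    """Return True if the packet attributes satisfy every field of the rule."""
    if rule.protocol not in ("*", protocol):
        return False
    if rule.dest_port not in ("*", str(dest_port)):
        return False
    if rule.source == "*":
        return True
    return ip_address(source_ip) in ip_network(rule.source)


def evaluate_inbound(rules, protocol, source_ip, dest_port):
    """Check rules in ascending priority order; the first match wins."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if _matches(rule, protocol, source_ip, dest_port):
            return rule.access, rule.name
    return "Deny", "no matching rule"


# One custom rule plus Azure's default catch-all deny rule (priority 65500).
RULES = [
    Rule("ALLOW_SSH", 100, "Allow", "Tcp", "203.0.113.0/24", "22"),
    Rule("DenyAllInBound", 65500, "Deny", "*", "*", "*"),
]

print(evaluate_inbound(RULES, "Tcp", "203.0.113.7", 22))
print(evaluate_inbound(RULES, "Tcp", "198.51.100.9", 22))
```

    In this model, SSH from inside the assumed range is allowed by ALLOW_SSH, while every other inbound packet falls through to the default deny rule.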

    Here's how you'll structure the Pulumi program:

    • Use the azure-native provider, which gives you access to Azure resources with Pythonic classes.
    • Create a Virtual Network using the VirtualNetwork class from the azure-native.network module.
    • Within the VNet, define a Subnet using the Subnet class from the azure-native.network module. This subnet is where your training VMs will reside.
    • Create a Network Security Group using the NetworkSecurityGroup class from the azure-native.network module and define the security rules that apply to the subnet.
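    • Associate the NSG with your subnet. In azure-native this is done through the network_security_group property of the Subnet resource; the separate SubnetNetworkSecurityGroupAssociation class belongs to the classic azure provider, not to azure-native.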
    • Generate a Virtual Machine Scale Set within the subnet using the VirtualMachineScaleSet class from the azure-native.compute module.
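
    Before choosing address ranges for the VNet and subnet, it can help to check that the subnet fits inside the VNet and has room for the scale set. The sketch below uses only Python's standard ipaddress module. The function names and the 10.0.0.0/16 and 10.0.1.0/24 ranges are assumptions for illustration, not values from the program.

```python
from ipaddress import ip_network

# Azure reserves five addresses in every subnet (the first four and the last).
AZURE_RESERVED_ADDRESSES = 5


def subnet_fits(vnet_cidr, subnet_cidr):
    """Return True if the subnet range lies entirely within the VNet range."""
    return ip_network(subnet_cidr).subnet_of(ip_network(vnet_cidr))


def usable_hosts(subnet_cidr):
    """Number of addresses left for VM instances after Azure's reservations."""
    return max(ip_network(subnet_cidr).num_addresses - AZURE_RESERVED_ADDRESSES, 0)


# Hypothetical ranges for illustration only.
VNET_CIDR = "10.0.0.0/16"
SUBNET_CIDR = "10.0.1.0/24"

print(subnet_fits(VNET_CIDR, SUBNET_CIDR))
print(usable_hosts(SUBNET_CIDR))
```

    With these assumed ranges, the /24 subnet sits inside the /16 VNet and leaves far more usable addresses than a three-instance scale set needs.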

    Let's start with the Pulumi program written in Python:

    import pulumi
    from pulumi_azure_native import network
    from pulumi_azure_native import compute
    from pulumi_azure_native import resources

    # Create an Azure Resource Group
    resource_group = resources.ResourceGroup('ai_training_rg')

    # Create an Azure Virtual Network where the training environment will reside
    vnet = network.VirtualNetwork(
        'training_vnet',
        resource_group_name=resource_group.name,
        address_space=network.AddressSpaceArgs(
            address_prefixes=[''],  # Supply the VNet address range in CIDR notation
        ),
        location=resource_group.location,
    )

    # Create a Network Security Group (NSG) for securing the training subnet
    nsg = network.NetworkSecurityGroup(
        'training_nsg',
        resource_group_name=resource_group.name,
        location=resource_group.location,
        security_rules=[
            network.SecurityRuleArgs(
                name='ALLOW_SSH',
                access=network.SecurityRuleAccess.ALLOW,
                direction=network.SecurityRuleDirection.INBOUND,
                protocol=network.SecurityRuleProtocol.TCP,
                priority=100,
                source_address_prefix='*',  # Consider restricting this to a trusted range
                source_port_range='*',
                destination_address_prefix='*',
                destination_port_range='22',  # To allow SSH access
            ),
            # Add additional rules as needed
        ],
    )

    # Create a subnet within the Virtual Network and associate the NSG with it.
    # azure-native has no separate association resource; the NSG is attached
    # through the subnet's network_security_group property.
    subnet = network.Subnet(
        'training_subnet',
        resource_group_name=resource_group.name,
        virtual_network_name=vnet.name,
        address_prefix='',  # Supply a subnet range that lies inside the VNet range
        network_security_group=network.NetworkSecurityGroupArgs(id=nsg.id),
    )

    # Create a Virtual Machine Scale Set in the subnet
    vmss = compute.VirtualMachineScaleSet(
        'training_vmss',
        resource_group_name=resource_group.name,
        location=resource_group.location,
        sku=compute.SkuArgs(
            name='Standard_DS1_v2',  # Choose the VM size as required for AI training
            tier='Standard',
            capacity=3,  # Number of VM instances
        ),
        virtual_machine_profile=compute.VirtualMachineScaleSetVMProfileArgs(
            network_profile=compute.VirtualMachineScaleSetNetworkProfileArgs(
                network_interface_configurations=[compute.VirtualMachineScaleSetNetworkConfigurationArgs(
                    name='vmss_nic',
                    primary=True,
                    ip_configurations=[compute.VirtualMachineScaleSetIPConfigurationArgs(
                        name='vmss_ip_config',
                        subnet=compute.ApiEntityReferenceArgs(
                            id=subnet.id
                        ),
                    )],
                )],
            ),
            os_profile=compute.VirtualMachineScaleSetOSProfileArgs(
                computer_name_prefix='ai-vm',
                admin_username='adminuser',
                # Omitting admin_password for security reasons; use SSH keys or other authentication methods
            ),
            # Omitting storage_profile and other optional settings for brevity
        ),
    )

    # Export the VMSS ID as an output
    pulumi.export('vmss_id', vmss.id)

    • This program was written against version 2.11.0, which may differ from the version current when you use it. Update the version as needed.
    • Remember to replace the placeholder values like 'adminuser' with the actual values you wish to use.
    • For admin passwords and other sensitive data, you should use secrets in Pulumi.
    • Each resource creation step is labeled with comments, and each resource is created with the necessary arguments, such as resource_group_name, location, and specific Azure resource properties.
    • NSG rules, like ALLOW_SSH, are defined within the NSG resource. Customize these rules to match your AI training environment's security requirements.
    • We associate the NSG with the subnet where the VMSS will reside.
    • In the VMSS, we provide a VM profile whose network profile places each instance's network interface in the created subnet.
    • At the end, we export the VMSS ID as an output, which could then be used to fetch information about the VMSS instances at a later point.
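
    The notes on credentials can be sketched as a fragment that replaces the os_profile argument in the program. It is not a standalone script: it only runs inside a Pulumi program. The config keys adminUsername, sshPublicKey, and adminPassword are hypothetical names chosen for illustration, and the authorized_keys path assumes a Linux image.

```python
import pulumi
from pulumi_azure_native import compute

config = pulumi.Config()

# Hypothetical config keys; set them with, for example:
#   pulumi config set sshPublicKey "ssh-rsa AAAA..."
admin_username = config.get('adminUsername') or 'adminuser'
ssh_public_key = config.require('sshPublicKey')

# OS profile that disables password login and installs the SSH public key.
os_profile = compute.VirtualMachineScaleSetOSProfileArgs(
    computer_name_prefix='ai-vm',
    admin_username=admin_username,
    linux_configuration=compute.LinuxConfigurationArgs(
        disable_password_authentication=True,
        ssh=compute.SshConfigurationArgs(
            public_keys=[compute.SshPublicKeyArgs(
                path=f'/home/{admin_username}/.ssh/authorized_keys',
                key_data=ssh_public_key,
            )],
        ),
    ),
)

# If a password is unavoidable, read it as a secret so Pulumi encrypts it
# in the stack state (set with: pulumi config set --secret adminPassword ...).
# admin_password = config.require_secret('adminPassword')
```

    Reading values from stack configuration keeps credentials out of the source file, and require_secret ensures a password is stored encrypted instead of in plain text.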

    This program will create an isolated environment suitable for AI training with controlled access defined by NSG rules. Using Pulumi's infrastructure as code approach, managing and provisioning Azure resources becomes more maintainable, versionable, and reproducible.